Since the Ouster lidar has an embedded IMU that is time-synchronized with the lidar scans, it should be beneficial to utilize it for lidar-based odometry and, possibly, reduce drift, avoid system degradation in feature-less environments, and improve results in fast-movement scenarios.
As a good starting point to review the current state of lidar-based odometry and its different approaches, you can check the "LiDAR Odometry Survey" [1].
In my experiment I extend the KissICP [2] implementation, since it's already tested with Ouster sensors and their SDK [3]. The main test dataset used for experiments is the Newer College Dataset 2021 extension [4], recorded with an OS-0 128-beam sensor, plus sample raw data recordings from the latest Ouster sensors [5].
The input to the system is Ouster lidar raw sensor data in the form of UDP packet payloads from a pcap/bag container.
The ES EKF performs state prediction and error-state tracking on every incoming IMU measurement. When a new scan is available, KissICP registers it, optionally using the ES EKF state as the initial pose guess, and the resulting pose is used for the ES EKF update step. The output of the system is the ES EKF state, which is the smoothed trajectory, see Fig. 1 below.
NOTE: Even though the IMU packets and lidar packets are in the Ouster wire data format and the pipeline works with live sensors too, one should make additional efforts to tune the setup for the specific system performance, which depends on the machine setup (compute, memory, network), sensor resolution (32, 64 or 128 beams), KissICP parameters (internal map size, downsampling ratios, etc.) and sensor configuration (lidar mode, lidar profile, etc.).
The main contribution of the ES EKF to the pipeline is to integrate the available IMU information and fuse it with the KissICP trajectory output, while also providing a better initial guess for the point cloud registration step.
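To make the data flow concrete, here is a minimal sketch of the fusion loop. The `ekf` and `kiss_icp` objects and their method names (`predict`, `update`, `nominal_pose`, `register`) are hypothetical stand-ins for illustration, not the actual `ptudes` API:

```python
from collections import namedtuple

ImuMeas = namedtuple("ImuMeas", "ts acc gyr")  # timestamp, accel (m/s^2), gyro (rad/s)
Scan = namedtuple("Scan", "ts points")         # timestamp, Nx3 point cloud

def run_pipeline(events, ekf, kiss_icp):
    """Feed time-ordered IMU/scan events into the ES EKF + KissICP fusion."""
    trajectory = []
    for ev in events:
        if isinstance(ev, ImuMeas):
            # predict step on every IMU measurement
            ekf.predict(ev.acc, ev.gyr, ev.ts)
        elif isinstance(ev, Scan):
            # current filter state serves as the registration initial guess
            pose_guess = ekf.nominal_pose()
            measured_pose = kiss_icp.register(ev.points, initial_guess=pose_guess)
            # registered pose is the measurement for the ES EKF update step
            ekf.update(measured_pose)
            trajectory.append((ev.ts, ekf.nominal_pose()))
    return trajectory
```

Since IMU packets arrive roughly an order of magnitude more often than scans, the predict step dominates and must stay cheap.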
The exposition of the ES EKF below is very brief and mainly provides the overall structure of what is inside, with references to other sources for fully detailed derivations, possible modifications to the update steps, etc. I heavily relied on the work "Quaternion kinematics for the error-state Kalman filter" by Joan Sola [6], where the error-state EKF derivations and formulations are cleanly presented for quaternions but can be easily adapted to exponential representations.
Other important sources: in-depth IMU mechanizations for low-cost sensors can be found in [7], Kalman filter modular design approaches in [8], multi-sensor data fusion in [9], and useful background theory and results on Lie groups for 3D in [10, 11, 12].
The nominal filter state is defined as follows:
\[\boldsymbol{x} = \{ \boldsymbol{t}, \boldsymbol{v}, \boldsymbol{R}, \boldsymbol{b_g}, \boldsymbol{b_a} \}\]where $\boldsymbol{t},\boldsymbol{v} \in \mathbb{R}^{3}$ are the position and velocity in global frame, $\boldsymbol{R} \in SO(3)$ (or alternatively $\boldsymbol{q} \in \mathbb{H}$ can be used) is a rotation in global frame (as a rotation matrix or quaternion) and $\boldsymbol{b_g}, \boldsymbol{b_a} \in \mathbb{R}^{3}$ are gyroscope and accelerometer biases to be estimated.
Nominal state discrete-time kinematics, i.e. predict step that happens on every new IMU measurement:
\[\begin{aligned} \boldsymbol{\tilde{t}_k} & = t_{k-1} + v_{k-1} \Delta{t} + \frac{1}{2}(R_{k-1}(a_k - b_{a,k-1}) + \boldsymbol{g})\Delta{t}^2 \\ \boldsymbol{\tilde{v}_k} & = v_{k-1} + (R_{k-1}(a_k - b_{a,k-1}) + \boldsymbol{g})\Delta{t} \\ \boldsymbol{\tilde{R}_k} & = R_{k-1} Exp((\omega_k - b_{g,k-1})\Delta{t}) \\ \boldsymbol{\tilde{b}_{g,k}} & = b_{g,k-1} \\ \boldsymbol{\tilde{b}_{a,k}} & = b_{a,k-1} \end{aligned}\]Let’s define the nominal state evolved during the IMU prediction step as $\boldsymbol{\tilde{x}_k}$.
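The nominal-state propagation above translates almost one-to-one into code. A minimal numpy sketch, assuming a rotation-matrix state and the gravity convention $\boldsymbol{g} = (0, 0, -9.81)$ in the global frame (function and variable names are mine):

```python
import numpy as np
from scipy.spatial.transform import Rotation

GRAVITY = np.array([0.0, 0.0, -9.81])  # assumed global-frame gravity convention

def predict_nominal(t, v, R, b_g, b_a, acc, gyr, dt):
    """One discrete-time nominal-state propagation on a single IMU sample.

    t, v: position/velocity (3,); R: rotation matrix (3,3);
    b_g, b_a: gyro/accel biases (3,); acc, gyr: raw IMU sample; dt: seconds.
    """
    # bias-corrected acceleration rotated into the global frame, plus gravity
    a_world = R @ (acc - b_a) + GRAVITY
    t_new = t + v * dt + 0.5 * a_world * dt**2
    v_new = v + a_world * dt
    # Exp((w - b_g) dt) via the rotation-vector exponential
    R_new = R @ Rotation.from_rotvec((gyr - b_g) * dt).as_matrix()
    # biases stay constant during prediction
    return t_new, v_new, R_new, b_g, b_a
```

For a stationary, bias-free sensor the accelerometer measures exactly the negated gravity, so position and velocity stay put.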
The error-state is defined as follows:
\[\boldsymbol{\delta{x}} = \{ \boldsymbol{\delta{t}}, \boldsymbol{\delta{v}}, \boldsymbol{\delta{\phi}}, \boldsymbol{\delta{b_g}}, \boldsymbol{\delta{b_a}} \}\]where $\boldsymbol{\delta{t}}, \boldsymbol{\delta{v}}, \boldsymbol{\delta{\phi}} \in \mathbb{R}^3$ are the position, velocity and rotation errors, and $\boldsymbol{\delta{b_g}}, \boldsymbol{\delta{b_a}} \in \mathbb{R}^3$ are the bias errors of the gyroscope and accelerometer.
The error-state evolves together with the nominal state in-between the update steps on IMU measurements; its discrete-time kinematics are:
\[\begin{aligned} \boldsymbol{\delta{t}_k} & = \delta{t}_{k-1} + \delta{v}_{k-1} \Delta{t} \\ \boldsymbol{\delta{v}_k} & = \delta{v}_{k-1} + (-R_{k-1}[\mathbf{a}_k - b_{a, k-1}]_{\times}\delta{\phi}_{k-1} - R_{k-1}\delta{b_{a,k-1}})\Delta{t} + \mathbf{v_i} \\ \boldsymbol{\delta{\phi}_k} & = Exp((\mathbf{\omega}_k - b_{g, k-1})\Delta{t})^{T}\delta{\phi}_{k-1} - \delta{b_{g, k-1}}\Delta{t} + \mathbf{\phi_i} \\ \boldsymbol{\delta{b_{g, k}}} & = \delta{b_{g, k-1}} + \mathbf{\omega_i} \\ \boldsymbol{\delta{b_{a, k}}} & = \delta{b_{a, k-1}} + \mathbf{a_i} \end{aligned}\]where $\mathbf{v_i}, \mathbf{\phi_i}, \mathbf{\omega_i}$ and $\mathbf{a_i}$ are random impulses modeled as white Gaussian processes. Integrating the corresponding covariances of the accelerometer/gyroscope measurement noise and bias random walks we obtain the noise covariance matrices:
\[\begin{aligned} \boldsymbol{V_i} & = \sigma_{a_n}^2 \Delta{t}^{2} \boldsymbol{I} \\ \boldsymbol{\Phi_i} & = \sigma_{\omega_n}^2 \Delta{t}^{2} \boldsymbol{I} \\ \boldsymbol{\Omega_i} & = \sigma_{\omega_w}^2 \Delta{t} \boldsymbol{I} \\ \boldsymbol{A_i} & = \sigma_{a_w}^2 \Delta{t} \boldsymbol{I} \end{aligned}\]Introducing the input signal vector from the IMU measurement and the random impulses vector as $\boldsymbol{u_k} = [a_k, \omega_k]^T$, $\boldsymbol{i} = [\boldsymbol{v_i}, \boldsymbol{\phi_i}, \boldsymbol{\omega_i}, \boldsymbol{a_i}]^T$, the error-state system can be written as:
\[\boldsymbol{\delta{x}_k} \leftarrow f(\boldsymbol{\tilde{x}_{k-1}}, \boldsymbol{\delta{x_{k-1}}}, \boldsymbol{u_k}, \boldsymbol{i}) = F_x(\boldsymbol{x_{k-1}}, \boldsymbol{u_k}) \delta{x} + F_i \boldsymbol{i}\]The ES EKF predict step equations are then (for the error-state mean $\boldsymbol{\delta{x}} \in \mathbb{R}^{15}$ and process covariance $\boldsymbol{P} \in \mathbb{R}^{15 \times 15}$ estimates):
\[\begin{aligned} \boldsymbol{\delta{x}_k} & = F_x(\boldsymbol{x_{k-1}}, \boldsymbol{u_k}) \delta{x} \\ \boldsymbol{P_k} & = F_x \boldsymbol{P_{k-1}} F_{x}^T + F_i \boldsymbol{Q_i} F_{i}^T \end{aligned}\]In the expressions above $F_x$ is the Jacobian of the error-state transition function at the current point $\boldsymbol{x}_{k-1}, \boldsymbol{u_k}$ and can be written as:
\[F_x = \frac{\partial{f}}{\partial{\delta{x}}}\bigg|_{x_{k-1}, u_k} = \begin{bmatrix} \boldsymbol{I} && \boldsymbol{I} \Delta{t} && 0 && 0 && 0 \\ 0 && \boldsymbol{I} && -R_{k-1}[\mathbf{a}_k - b_{a, k-1}]_{\times}\Delta{t} && 0 && - R_{k-1}\Delta{t} \\ 0 && 0 && Exp((\mathbf{\omega}_k - b_{g, k-1})\Delta{t})^{T} && - \boldsymbol{I}\Delta{t} && 0 \\ 0 && 0 && 0 && \boldsymbol{I} && 0 \\ 0 && 0 && 0 && 0 && \boldsymbol{I} \end{bmatrix}\]and $F_i$ is a Jacobian with respect to random impulses $\boldsymbol{i}$ which can be written as:
\[F_i = \frac{\partial{f}}{\partial{\boldsymbol{i}}}\bigg|_{x_{k-1}, u_k} = \begin{bmatrix} 0 && 0 && 0 && 0 \\ \boldsymbol{I} && 0 && 0 && 0 \\ 0 && \boldsymbol{I} && 0 && 0 \\ 0 && 0 && \boldsymbol{I} && 0 \\ 0 && 0 && 0 && \boldsymbol{I} \end{bmatrix} \,, \quad Q_i = \begin{bmatrix} \boldsymbol{V_i} && 0 && 0 && 0 \\ 0 && \boldsymbol{\Phi_i} && 0 && 0 \\ 0 && 0 && \boldsymbol{\Omega_i} && 0 \\ 0 && 0 && 0 && \boldsymbol{A_i} \end{bmatrix}\]KissICP registers every lidar frame, with an optional initial guess pose obtained from the ES EKF filter, and returns the new system pose $z_k = T_{kicp, k} = [ R_{kicp, k} \, \vert \, t_{kicp, k} ]$ which is then used in the update step of the filter.
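The predict-step Jacobians $F_x$, $F_i$ and the noise matrix $Q_i$ defined above can be assembled directly in numpy. A sketch, assuming the 15-dim error-state ordering $[\delta{t}, \delta{v}, \delta{\phi}, \delta{b_g}, \delta{b_a}]$ (function names are mine):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def skew(w):
    """Skew-symmetric matrix [w]_x such that [w]_x v = w x v."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def propagate_covariance(P, R, b_g, b_a, acc, gyr, dt,
                         sig_an, sig_wn, sig_ww, sig_aw):
    """P <- Fx P Fx^T + Fi Qi Fi^T for the error state [dt, dv, dphi, dbg, dba]."""
    I = np.eye(3)
    Fx = np.eye(15)
    Fx[0:3, 3:6] = I * dt                                  # dt <- dv
    Fx[3:6, 6:9] = -R @ skew(acc - b_a) * dt               # dv <- dphi
    Fx[3:6, 12:15] = -R * dt                               # dv <- dba
    Fx[6:9, 6:9] = Rotation.from_rotvec((gyr - b_g) * dt).as_matrix().T
    Fx[6:9, 9:12] = -I * dt                                # dphi <- dbg
    Fi = np.zeros((15, 12))
    Fi[3:15, 0:12] = np.eye(12)                            # impulses hit dv..dba rows
    Qi = np.zeros((12, 12))
    Qi[0:3, 0:3] = sig_an**2 * dt**2 * I                   # V_i (velocity impulse)
    Qi[3:6, 3:6] = sig_wn**2 * dt**2 * I                   # Phi_i (angle impulse)
    Qi[6:9, 6:9] = sig_ww**2 * dt * I                      # Omega_i (gyro bias walk)
    Qi[9:12, 9:12] = sig_aw**2 * dt * I                    # A_i (accel bias walk)
    return Fx @ P @ Fx.T + Fi @ Qi @ Fi.T
```

Note that the position rows receive no direct noise; position uncertainty grows only via the velocity coupling in $F_x$.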
The measurement model $h(\hat{x}_k)$, which uses the best true-state estimate at the time of update $\hat{x}_k = \tilde{x} \oplus \delta{x}_k$, is:
\[h(\hat{x}_k) = h(\tilde{x} \oplus \delta{x}_k) = [ \tilde{R}_k Exp(\delta{\phi}_k) \, \vert \, \tilde{t}_k + \delta{t}_k ]\]The Jacobian $H$ of the measurement model with respect to the error-state $\delta{x}$ is defined as:
\[H = \frac{\partial{h}}{\partial{\delta{x}}} \bigg|_x = \begin{bmatrix} \boldsymbol{I} && 0 && 0 && 0 && 0 \\ 0 && 0 && \boldsymbol{I} && 0 && 0 \end{bmatrix} \quad \in \mathbb{R}^{6 \times 15}\]The measurement residual thus is:
\[y_k = z_k \ominus h(\hat{x}) = \begin{bmatrix} t_{kicp,k} - \tilde{t}_k - \delta{t}_k \\ Log(Exp(\delta{\phi}_k)^T \tilde{R}_k^T R_{kicp, k}) \end{bmatrix} \, \in \mathbb{R}^{6}\]Now the complete set of equations for the ES EKF update step can be expressed as:
\[\begin{aligned} S & = H \boldsymbol{P}_{k-1} H^T + \boldsymbol{V} \\ K & = \boldsymbol{P}_{k-1} H^T S^{-1} \\ \delta{x}_k & \leftarrow K y_k \\ \boldsymbol{P}_k & \leftarrow (\boldsymbol{I} - K H) \boldsymbol{P}_{k-1} \end{aligned}\]where $\boldsymbol{V} \in \mathbb{R}^{6 \times 6}$ is the measurement noise covariance of the pose translation and rotation, which highly depends on the sensor used and its accuracy specs.
After we’ve got a KissICP pose estimate $T_{kicp,k}$ and updated the error-state mean $\delta{x}_k$ and covariance $P_k$, we need to incorporate the error-state into the nominal state, $\boldsymbol{x}_k = \tilde{x}_k \oplus \delta{x}_k$, or in expanded form:
\[\begin{aligned} \boldsymbol{t}_k & = \tilde{t}_k + \delta{t}_k \\ \boldsymbol{v}_k & = \tilde{v}_k + \delta{v}_k \\ \boldsymbol{R}_k & = \tilde{R}_k Exp(\delta{\phi}_k) \\ \boldsymbol{b}_{g,k} & = \tilde{b}_{g,k} + \delta{b}_{g,k} \\ \boldsymbol{b}_{a,k} & = \tilde{b}_{a,k} + \delta{b}_{a,k} \end{aligned}\]Then we reset the error-state $\delta{x} \leftarrow 0$ and are ready to receive new IMU measurements, execute the predict step, and so on.
Covariance $\boldsymbol{P}$ may be left unchanged on the reset step or corrected to decrease the long-term error and odometry drift; please refer to [6], eqns. 285-287, for details and derivations of such a covariance correction.
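The whole update-inject-reset cycle fits in a few lines of numpy. A minimal sketch (names are mine; the error-state is evaluated at $\delta{x} = 0$ right after a reset, and $P$ is left unchanged on reset, the simpler of the two options above):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def Log(R):
    """SO(3) logarithm: rotation matrix -> rotation vector."""
    return Rotation.from_matrix(R).as_rotvec()

def ekf_update(P, x_nom, T_icp, V):
    """ES EKF update from a KissICP pose, then inject the error state and reset.

    x_nom = (t, v, R, b_g, b_a) nominal state; T_icp = (R_icp, t_icp);
    V: 6x6 measurement noise covariance. Returns corrected state and covariance.
    """
    t, v, R, b_g, b_a = x_nom
    R_icp, t_icp = T_icp
    H = np.zeros((6, 15))
    H[0:3, 0:3] = np.eye(3)   # translation block
    H[3:6, 6:9] = np.eye(3)   # rotation block
    # residual y = z - h(x) evaluated at delta_x = 0 (just after reset)
    y = np.hstack([t_icp - t, Log(R.T @ R_icp)])
    S = H @ P @ H.T + V
    K = P @ H.T @ np.linalg.inv(S)
    dx = K @ y
    P_new = (np.eye(15) - K @ H) @ P
    # inject error state into the nominal state; dx <- 0 afterwards
    t_new = t + dx[0:3]
    v_new = v + dx[3:6]
    R_new = R @ Rotation.from_rotvec(dx[6:9]).as_matrix()
    return (t_new, v_new, R_new, b_g + dx[9:12], b_a + dx[12:15]), P_new
```

A measurement that exactly matches the nominal pose yields a zero residual and leaves the state untouched, which is a handy sanity check.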
The ES EKF implementation was done as part of the `ptudes ekf-bench` (Point eTudes) playground, with metrics visualization, ground truth comparison (with ATE calculations) and point cloud visualization (with flyby viz for qualitative reviews). The Newer College Dataset 2021 extension [4] was used to compare the results with ground truth.
The basic operation of the ES EKF was validated with a simulated IMU with/without noise, which can be compared to dead-reckoning INS in the absence of accelerometer/gyroscope noise and biases, and with ground truth poses used as the correction input to check the filter convergence (see the `ptudes ekf-bench sim` and `ptudes ekf-bench nc` commands correspondingly).
The ES EKF performance of tracking the IMU inputs with direct ground truth poses as the update (the `ptudes ekf-bench nc` command) on the collection1/quad-hard sequence of the Newer College 2021 Dataset is shown in Fig. 2 above; the Absolute Trajectory Errors (ATE) for position and rotation are 0.0007 m and 0.0004 deg correspondingly (see [13] for ATE calculation details).
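For reference, ATE numbers like these can be computed as the RMSE over per-pose errors. A sketch under my own convention (poses as 4x4 homogeneous matrices, already time-aligned in a common frame); [13] describes the exact method used in the benchmarks:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def ate(gt_poses, est_poses):
    """Translational/rotational ATE as RMSE over per-pose relative errors.

    gt_poses, est_poses: equal-length lists of 4x4 homogeneous pose matrices,
    assumed time-aligned and expressed in the same frame.
    Returns (translation RMSE in meters, rotation RMSE in degrees).
    """
    t_err, r_err = [], []
    for Tg, Te in zip(gt_poses, est_poses):
        dT = np.linalg.inv(Tg) @ Te                       # relative pose error
        t_err.append(np.linalg.norm(dT[:3, 3]))           # translation magnitude
        r_err.append(np.linalg.norm(
            Rotation.from_matrix(dT[:3, :3]).as_rotvec()))  # rotation angle (rad)
    return (np.sqrt(np.mean(np.square(t_err))),
            np.degrees(np.sqrt(np.mean(np.square(r_err)))))
```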
The full pipeline with Ouster lidar data and its IMU data streams was implemented in the `ptudes ekf-bench ouster` command, which accepts Ouster raw data recordings in pcap/bag formats. Additionally, ground truth can be provided in the Newer College Dataset format using the `--gt-file` param, which is used only to compare the results and the graphs of trajectories (using `--plot graphs`), as well as the adaptive threshold of the KissICP registration together with its corrected pose translation and rotation components (the lower the sigma value of the adaptive threshold, the more "confident" the KissICP registration is for the particular scan; see [2] for KissICP algorithm details).
Tests were performed using the sequences quad-easy/medium/hard and stairs from collection1, and park0, park1 and cloister0 from collection2, with/without using the ES EKF IMU prediction guess for KissICP registration. Additionally, tests with a manually reduced number of beams (32, 64 and 128) on the quad sequences were compared. In the majority of test cases the ES EKF IMU-predicted initial guess improved the resulting pose with reduced ATE errors; the most significant gains were observed in the 32-beam cases and on harder movements like the quad-hard or stairs sequences, see Table 1 for full results.
Sequence name | Lidar beams | KissICP range min - max (m) | ATE rot (deg) / trans (m) | ATE rot (deg) / trans (m) w/ IMU prediction |
---|---|---|---|---|
nc-quad-easy | 128 | 1 - 35 | 0.0256 / 0.1383 | 0.0282 / 0.1834 |
nc-quad-easy | 64 | 1 - 35 | 0.0318 / 0.0978 | 0.0791 / 0.8966 |
nc-quad-easy | 32 | 1 - 35 | 1.0932 / 7.6802 | 0.3722 / 6.1207 |
nc-quad-medium | 128 | 1 - 35 | 0.0715 / 0.3817 | 0.0434 / 0.1121 |
nc-quad-medium | 64 | 1 - 35 | 0.1884 / 1.5852 | 0.2190 / 1.8996 |
nc-quad-medium | 32 | 1 - 35 | 5.7766 / 36.7806 | 0.5252 / 6.1870 |
nc-quad-hard | 128 | 1 - 35 | 0.1072 / 0.3465 | 0.0954 / 0.3582 |
nc-quad-hard | 64 | 1 - 35 | 0.3947 / 5.5880 | 0.1045 / 0.3003 |
nc-quad-hard | 32 | 1 - 35 | 1.8676 / 22.8112 | 0.3869 / 6.9918 |
nc-stairs | 128 | 1 - 17 | 1.1885 / 0.3389 | 0.3467 / 0.1182 |
nc-park0 | 128 | 1 - 35 | 0.5302 / 24.3595 | 0.5379 / 24.5963 |
nc-park1 | 128 | 1 - 35 | 0.1151 / 8.3812 | 0.0593 / 2.4570 |
nc-cloister0 | 128 | 1 - 17 | 0.3031 / 6.0323 | 0.3014 / 5.7621 |
nc-cloister0 | 128 | 1 - 35 | 0.1598 / 2.9759 | 0.2001 / 3.8562 |
Table 1. Test results using the ES EKF pipeline with the IMU-predicted initial guess vs the linear KissICP initial guess (baseline).
The adaptive threshold value of KissICP is in most cases lower with the IMU prediction pose guess (and correspondingly the difference to the corrected pose after ICP iterations is also lower), see Fig 3.
The ES EKF + KissICP trajectory smoothing was developed as part of the `ptudes ekf-bench` (Point eTudes) playground and can be installed from PyPI (Python 3.7-3.11, Win/Mac/Linux supported):

```bash
pip install ptudes-lab
```
For example, to run on the cloister0 sequence bag from the Newer College Dataset and visualize the result:
```bash
# process raw Ouster packets from .bag, run KissICP + ES EKF smoothing and save traj poses
ptudes ekf-bench ouster ./newer-college/2021-ouster-os0-128-alphasense/collection2/2021-12-02-10-15-59_0-cloister.bag \
    --save-nc-gt-poses cloister0-traj.csv

# visualize the part of the bag in a flyby tool
ptudes flyby ./newer-college/2021-ouster-os0-128-alphasense/collection2/2021-12-02-10-15-59_0-cloister.bag \
    --nc-gt-poses cloister0-traj.csv \
    --start-scan 400 \
    --end-scan 800
```
With the viz result shown below:

`ptudes flyby` over the cloister0 sequence, scans 400-800

For simulated IMU runs, for ES EKF with the real IMU topic but ground truth poses as a correction, or for comparing various trajectories with calculated ATE scores, use the `ptudes ekf-bench sim`, `ptudes ekf-bench nc` and `ptudes ekf-bench cmp` commands correspondingly.
To get the overall statistics of the IMU/lidar scans from the pcap/bag you can use `ptudes stat`, which may hint at the optimal values of the `--kiss-min-range`/`--kiss-max-range` params:
```bash
ptudes stat ./newer-college/2021-ouster-os0-128-alphasense/collection2/2021-12-02-10-15-59_0-cloister.bag -t 0
```
With an output:
```
StreamStatsTracker[dt: 186.5307 s, imus: 18652, scans: 1866]:
  range_mean: 4.457 m,
  range_std: 4.139 m (s3 span: [0.234 - 16.874 m])
  range min max: 0.234 - 220.712 m
  acc_mean: [-0.18887646  0.41429864  9.72023424] m/s^2
  acc_std: [0.63840631 0.59747534 2.30738675]
  gyr_mean: [-0.01103721 -0.00591911  0.07017822] rad/s
  gyr_std: [0.09329166 0.14244756 0.34110766]
  Gravity vector estimation: [-0.01940998  0.0425756   0.99890469]
```
NOTE: the estimated gravity vector doesn’t account for movements and is valid only when the lidar was stationary during the recording.
Check out the previous blog post on (Point eTudes) for other nits.
ES EKF is a common building block in state-of-the-art and recent efforts in lidar-inertial odometry like FAST-LIO2 [15], LIO-EKF [14], SR-LIO and many others. However, the algorithms mentioned above are tightly-coupled LIO systems and usually have much better performance than loosely-coupled counterparts, so it’s worth considering them for further adoption and experimentation to get better lidar-inertial odometry.
The `ptudes` (Point eTudes) playground is a place to easily run and replicate experiments with lidar-based odometry, SLAM and mapping.
It’s heavily based on my latest work on the Ouster SDK, which was in areas related to (the list is far from exhaustive):

- the `kiss_icp_pipeline --deskew ouster.pcap` command.
- `ouster-cli * viz` commands (OpenGL/C++/Python).
- `ouster.sdk.pose_util` with linear SE(3) continuous-time trajectory interpolations and other point cloud ops like dewarping and extrinsics handling.
- `ouster.viz.scans_accum.ScansAccumulator` to make a registered point cloud from scans + poses and display it all in `PointViz`.

With the new `ptudes` (Point eTudes), two things are added to the above mentioned list:
- `ptudes flyby` - a flyby 3D visualizer over registered lidar scans with poses:

```bash
ptudes flyby ./OS-0-128_v3.0.1_1024x10.pcap --kitti-poses ./OS-0-128_v3.0.1_poses_kitti.txt
```

Ouster sample data of OS-0 128 beams with KISS-ICP poses

- `ptudes viz` - a visualizer of ROS BAGs with raw sensor data (i.e. Newer College Dataset):

```bash
ptudes viz ./2021-ouster-os0-128-alphasense/collection1/2021-07-01-10-37-38-quad-easy.bag \
    --meta ./2021-ouster-os0-128-alphasense/beam_intrinsics_os0-128.json
```

Newer College Dataset 2021, collection 1, quad-medium BAG
Instructions on how to get the `ptudes` CLI and run the examples can be found on the GitHub project page. I hope it will be valuable to other people who deal with Ouster lidar data and odometry/mapping algos.
While direct deep-learning methods somewhat work for 6DOF pose regression, they are not yet precise, and research papers increasingly use a combination of the following methods: Structure from Motion (SfM) techniques, geometric-based constraints, pose graph optimizations [15] and 3D maps for scene understanding, visual odometry, and SLAM tasks.
In this project, I explore the traditional SfM pipeline and build sparse 3D reconstruction from the Apolloscape ZPark sample dataset with simultaneous OpenGL visualization.
My primary goal was to learn from first principles and implement the pipeline myself. For SfM in production or research, you might look into the available libraries: COLMAP [3,4], MVE [5], openMVG [6], VisualSFM [7], PMVS2/CMVS [9] and Bundler [8].
In this text, I use words like image, camera, or view interchangeably to relate to the same concept of an image taken with a camera in a particular location (e.g., view), and it almost always means the same in my writing and code. Sometimes it’s an image when I am preparing a dataset; a camera when I calculate the distance or projective matrix; and a view when I am processing 3D maps and merging 3D points of different maps.
NOTE: Code with build instructions and a reconstructed 3D map example available in my GitHub repo.
In my previous article, I visualize and explore the dataset. Here is a typical record (one of 13 for the ZPark sample):
Here is an SfM 3D reconstruction obtained from the corner shown above on a video piece:
In total, there are 1,499 image pairs spread across 13 records in the Apolloscape ZPark sample dataset.
And below is a description of the behind-the-scenes SfM process.
The 3D reconstruction process consists of 6 major steps:
The described pipeline doesn’t include a couple of steps from traditional SfM, because we already know the camera poses and need to find only the world-map 3D points. These additional steps are:
The above parts can be easily added to the existing structure later when, and if, the need arises.
First, we need to have distinctive points in the image for which we can compare and establish relationships in order to estimate image transformations or, as in the current task of reconstruction, to estimate their location in the space using multiple images.
There are lots of known reliable feature detectors: SURF, SIFT, ORB, BRISK, and AKAZE. I tried a couple of them from OpenCV and decided to stick with AKAZE, which gave enough points in a reasonable amount of time, with fast descriptor matching.
For further in-depth information about how different feature detectors compare, I recommend “A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK” [10].
The next step is to find the keypoint correspondences for every image pair. One of the common methods is to find the two closest neighbors per point and compare the distances to them, aka Lowe’s ratio test [11]. If the two closest neighbor points are located at almost the same distance from the original point, they are not distinctive enough and we can skip the keypoint completely. Lowe’s paper concludes that a ratio of 0.7 is a good predictor; however, in this case, I selected a ratio of 0.5 because it seemed to work better.
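The ratio test is easy to sketch on top of a plain distance computation. (My pipeline uses binary AKAZE descriptors with Hamming distance via OpenCV's `BFMatcher`; the L2-distance version below just keeps the sketch dependency-free.)

```python
import numpy as np

def ratio_test_matches(d1, d2, ratio=0.5):
    """Lowe's ratio test over descriptor sets d1 (N, D) and d2 (M, D).

    Returns (i, j) index pairs where the best match in d2 is at least
    `ratio` times closer than the second best; otherwise the keypoint
    is considered not distinctive enough and skipped.
    """
    matches = []
    for i, d in enumerate(d1):
        dists = np.linalg.norm(d2 - d, axis=1)
        j1, j2 = np.argsort(dists)[:2]        # two closest neighbors
        if dists[j1] < ratio * dists[j2]:     # distinctiveness check
            matches.append((i, j1))
    return matches
```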
An additional test for keypoint correspondence is the epipolar constraint, which can be applied since we have the camera poses and can compute the fundamental matrix between two cameras. So I checked the distance between keypoints and the corresponding epipolar line, and then filtered out the keypoints with a distance larger than a threshold value (default is 10px).
The last step is to filter out image pairs that have a small number of matched keypoints remaining after Lowe’s ratio test and the epipolar constraint filtering. I set the value to seven matched points for small reconstructions (up to 200 images) and approximately 60 points for reconstructions with more than 1000 images.
Another important consideration was how to make image pairs for keypoint matching. The easiest way is to generate all pairs, but it takes too long to match the features for all image combinations. We can instead reduce the number of pairs because we know the camera locations, and thus include only pairs of cameras that are located nearby.
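Since camera centers are known, the proximity-based pair selection can be sketched with a k-d tree; `max_dist` here is a hypothetical threshold to tune per dataset:

```python
import numpy as np
from scipy.spatial import cKDTree

def nearby_image_pairs(cam_positions, max_dist=20.0):
    """Candidate image pairs whose camera centers lie within max_dist meters.

    cam_positions: (N, 3) camera centers. Avoids generating all O(n^2)
    combinations for feature matching.
    """
    tree = cKDTree(np.asarray(cam_positions))
    # query_pairs returns index pairs (i, j) with i < j within radius
    return sorted(tree.query_pairs(r=max_dist))
```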
After we extract keypoints from images and match image pairs, we create a connected components graph to quickly find images with the most common connections. Here is an example of building connected components for three images.
First, we have one image pair [1,2]:

Then we add image pair [1,3] and continue connecting keypoints into components:

And finally we add matches for the third image pair [2,3]:
Internally, connected components are implemented as a tree with balanced subtree depth, in order to support fast find and union operations. A fast check for connectedness is important in the merge-maps step, when we need to check whether points belong to the same component and therefore can be merged.
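This is the classical union-find (disjoint-set) structure. A sketch keyed by `(image_id, keypoint_id)` observations, with path compression and union by rank so both operations run in near-constant amortized time (my own Python rendition; the project implements this in C++):

```python
class DisjointSet:
    """Union-find over keypoint observations, e.g. (image_id, keypoint_id)."""

    def __init__(self):
        self.parent = {}
        self.rank = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.rank.setdefault(x, 0)
        while self.parent[x] != x:
            # path compression: point x at its grandparent while walking up
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # union by rank: attach the shallower tree under the deeper one
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
```

Matched keypoints from each image pair are `union`-ed; two observations of the same physical 3D point then share a representative under `find`.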
The best image pair is the one with the most matched keypoints, so we use it for the initial triangulation step. The more keypoints we have from the first image pair in the reconstruction, the greater the chance that we will be able to connect corresponding 3D points from different image pairs in subsequent steps.
It’s also important to have connected points because Bundle Adjustment Optimization will tie different point clouds together and minimize re-projection errors from the same 3D point on multiple images.
Linear triangulation is implemented using the classical DLT method described in Hartley/Zisserman (12.2, p. 312) [12], which finds a solution for the system of equations Ax=0 via SVD decomposition, taking the vector with the smallest singular value. For this purpose I used the OpenCV function `cv::triangulatePoints`, which is a pure SVD-based method.
For every 3D point, we are storing the list of both views and keypoints used to reconstruct this point.
Initial reconstruction step in 3D visualization, with filtered outliers and Bundle Adjustment Optimization that further minimizes the re-projection error of the surviving points.
We then check the most connected image from the list of unprocessed views against the views that were already used for reconstruction, and triangulate the corresponding pairs to obtain a local point cloud. We iterate this step until all views are used.
Then the local point clouds are merged into the global map along the points that belong to the same connected component, if the distance between the two connected points lies within a threshold of 3.0m (hyperparameter). If the distance between connected points is bigger than the threshold, we discard both points from the map. Distinctive points without a connected-point counterpart are copied to the global map without changes.
Below is an example of three subsequent steps after the initial reconstruction and its 3D visualization.
Thus we are increasing the resulting map with more and more 3D points while processing additional views.
After every step of merging the map and increasing the views list of 3D points, we can perform map optimization and jointly minimize the re-projection error for every point on every originating view.
Mathematically, the problem statement is to minimize the loss function:
\[\min_{\mathbf{\hat{X}_j}} \sum_{ij} d(P^i \mathbf{\hat{X}_j}, x_j^i)^2\]where \(d(a, b)\) is the geometric distance between two points; \(\mathbf{\hat{X}_j}\) is an estimated 3D point in a world space; \(P^i\) is a projection matrix for camera \(i\), \(x_j^i\) is 2D coordinates of a keypoint in image \(i\) that corresponds to the 3D point \(\mathbf{\hat{X}_j}\) and \(P^i \mathbf{\hat{X}_j}\) is a backprojection of point \(\mathbf{\hat{X}_j}\) to image \(i\).
I need to mention that this is a simpler formulation than usually encountered in full SLAM problems because we are not optimizing camera projection matrix \(P^i\) here (It’s known in the Apolloscape dataset). Furthermore, there is also no weighted matrix that accounts for variances in error contributions between different world points.
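Because only the 3D points are free parameters, this reduced problem can be sketched with a generic least-squares solver. (The project uses Ceres with Levenberg-Marquardt; `scipy.optimize.least_squares` plays the same role here, and all names are mine.)

```python
import numpy as np
from scipy.optimize import least_squares

def refine_points(X0, cameras, observations):
    """Minimize sum_ij d(P^i X_j, x_j^i)^2 over the 3D points only (poses known).

    X0: (J, 3) initial 3D points; cameras: list of known 3x4 P matrices;
    observations: list of (cam_idx, point_idx, u, v) keypoint measurements.
    """
    def residuals(flat):
        X = flat.reshape(-1, 3)
        res = []
        for i, j, u, v in observations:
            Xh = np.append(X[j], 1.0)           # homogeneous world point
            p = cameras[i] @ Xh                 # project into image i
            res.extend([p[0] / p[2] - u,        # reprojection error (pixels)
                        p[1] / p[2] - v])
        return np.array(res)

    sol = least_squares(residuals, np.asarray(X0, float).ravel())
    return sol.x.reshape(-1, 3)
```

Each observation contributes two residuals (one per pixel axis), which is where counts like "46 equations" for a handful of points come from.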
Below, we continue the reconstruction of our three-image example with the resulting merged map of ten 3D points, which correspond to 46 equations in the Bundle Adjustment Optimization problem.
Without camera pose computation, as in a full SLAM problem, we set only the 3D map points as parameters to the Ceres solver, which performs non-linear least squares optimization using the Levenberg-Marquardt method.
Ceres solver was designed to work with huge problems, so an optimization over 1.4M 3D points is not too large for the library to handle (though it is demanding for CPU computation on my MacBook Pro :)
In order to save computation time, I run Bundle Adjustment Optimization with the Ceres solver only after I merge 40k new 3D points into the global map. Such a sparse optimization approach works because the problem is constrained to just 3D map-point optimization with known camera poses, and thus is more or less localized in the parameter space. There are also no events like loop closures, as in SLAM problems, which might wreak havoc on the map without a proper optimization of the current graph reconstruction.
I visualize the 3D map, cameras, images, and back-projection points with OpenGL using GLFW, glad, glm, and OpenCV for keypoint visualization. The idea behind it is to have camera parameters for the OpenGL visualization identical to the camera parameters of the dataset; this provides more intuition in multi-view projective geometry and allows for the exploration of the scene reconstruction, camera projections, and locations in one environment.
Everything in the visualization is done with vertices, vertex array buffers, vertex/fragment/geometry shaders, and ambient and diffuse lighting. In addition, the functionality to load arbitrary 3D objects from common file formats (.obj, .fbx via the assimp library) is helpful for visual clues in some experiments.
As for 3D point visualization, we can add color information estimated from the keypoint vertices. Every point is represented as a square with four distinctive colors on the vertices as a texture. Vertex colors are estimated as the average colors extracted from the pixels corresponding to the vertices of every connected keypoint, adjusted to the rotation angle and the size of the keypoints. This method provides a good understanding of keypoints and the region from which they were extracted (building, tree, road, etc.). However, many improvements are possible, for example adding a center color point, or extracting colors from the correspondingly scaled version of the image to account for the keypoint octave and visualize squares of different sizes that reflect variations in keypoint size.
SfM is a classical pipeline that is still widely used in SLAM, Visual Odometry, and Localization approaches.
Having completed this project, I can now much better appreciate the challenges of common computer vision problems, their algorithmic and computational complexities, the difficulty of getting 3D space back from the 2D images, as well as the importance of visualization tools and intuition in understanding algorithms that form the basis of software in AR glasses, VR headsets, and self-driving robots.
While finishing this write up, I discovered the new Localization challenge for CVPR 2019 as part of the workshop “Long-Term Visual Localization under Changing Conditions”. You can learn more at visuallocalization.net [16]
Yep, seems like I’ve found a new interesting problem and open datasets to play with for my next side project :)
Baidu released its massive self-driving Apolloscape dataset in March and now has a couple of ongoing challenges for ECCV 2018 Workshop: Vision-based Navigation for Autonomous Driving conference:
I pick the self-localization problem and create the whole pipeline to localize a car based only on camera images. This article covers the very beginning of the journey and includes reading and visualizing the Apolloscape dataset for the localization task. I implement the PoseNet [2] architecture for monocular image pose prediction and visualize the results. I use Python and Pytorch for the task.
NOTE: If you want to jump straight to the code, here is the GitHub repo. It’s still ongoing work, where I intend to implement VidLoc [7], Pose Graph Optimization [3,8] and Structure from Motion [9] pipelines for the Apolloscape dataset in the context of the localization task.
NOTE 2 (update Mar 12, 2019): Today I’m releasing the second part of my journey, with a Structure from Motion (SfM) pipeline and OpenGL visualization for the Apolloscape ZPark sample, implemented in C++.
The Apolloscape dataset emerged from a Baidu effort in 2017 to collect enough data with modern sensors for self-driving car research. The dataset provides camera images, poses, dense LIDAR point clouds, 3D semantic maps, 3D lane markings, and 2D segmentation labels. Eventually, they plan to reach 200K images, captured on 20km of roads covering 5 sites from 3 cities. The initial dataset was released in March 2018 and is continuously updated for the ongoing ECCV challenges.
For the challenge purposes, Apolloscape provides separate archives for the self-localization task. You can download the small (4Gb) `self-localization-examples` ZPark archive from the Self-Localization Dataset page.
I use ZPark sample dataset for almost everything in this article from visualization to PoseNet training.
For Pytorch I need to have a `Dataset` object that prepares and feeds the data to the loader and then to the model. I want to have a robust dataset class that can:
I am not putting the full listing of the `Apolloscape` dataset here and concentrate solely on how to use it and what data we can get from it. For the full source code, please refer to the GitHub file `datasets/apolloscape.py`.
Here is how to create a dataset:
```python
import os

from datasets.apolloscape import Apolloscape
from torchvision import transforms

# Path to unpacked data folders
APOLLO_PATH = "./data/apolloscape"

# Resize transform that is applied on every image read
transform = transforms.Compose([transforms.Resize(250)])

apollo_dataset = Apolloscape(root=os.path.join(APOLLO_PATH), road="zpark-sample",
                             transform=transform, train=True, pose_format='quat',
                             stereo=True)
print(apollo_dataset)
```
Output:

```
Dataset: Apolloscape
    Road: zpark-sample
    Record: None
    Train: None
    Normalize Poses: False
    Stereo: True
    Length: 1499 of 1499
    Cameras: ['Camera_2', 'Camera_1']
    Records: ['Record001', 'Record002', 'Record003', 'Record004', 'Record006', 'Record007', 'Record008', 'Record009', 'Record010', 'Record011', 'Record012', 'Record013', 'Record014']
```
APOLLO_PATH is a folder with unpacked Apolloscape datasets, e.g. $APOLLO_PATH/road02_seg or $APOLLO_PATH/zpark. Download the data from the Apolloscape page and unpack it. Let's assume that we've also created a symlink ./data/apolloscape that points to the $APOLLO_PATH folder.
We can view the list of available records with the number of data samples in each:
# Show records with numbers of data points
recs_num = apollo_dataset.get_records_counts()
recs_num = sorted(recs_num.items(), key=lambda kv: kv[1], reverse=True)
print("Records:")
print("\n".join(["\t{} - {}".format(r[0], r[1]) for r in recs_num ]))
Output:
Records:
Record008 - 122
Record007 - 121
Record006 - 121
Record012 - 121
Record001 - 121
Record009 - 121
Record010 - 121
Record003 - 121
Record013 - 120
Record004 - 120
Record002 - 120
Record011 - 120
Record014 - 50
We can draw a route for one record with a sampled camera image:
from utils.common import draw_record
# Draw path of a record with a sampled datapoint
record = 'Record008'
draw_record(apollo_dataset, record)
plt.show()
Output:
Alternatively, we can see all records at once in one chart:
# Draw all records for current dataset
draw_record(apollo_dataset)
plt.show()
Output:
Another option is to see it in a video:
from utils.common import make_video
# Generate and save video for the record
outfile = "./output_data/videos/video_{}_{}.mp4".format(apollo_dataset.road, apollo_dataset.record)
make_video(apollo_dataset, outfile=outfile)
Output (cut gif version of the generated video):
For the PoseNet training we will use mono images with zero-mean normalized poses and camera images center-cropped to 250px:
# Resize and CenterCrop
transform = transforms.Compose([
    transforms.Resize(260),
    transforms.CenterCrop(250)
])

# Create train dataset with mono images, normalized poses, enabled cache_transform
train_dataset = Apolloscape(root=os.path.join(APOLLO_PATH), road="zpark-sample",
                            transform=transform, train=True, pose_format='quat',
                            normalize_poses=True, cache_transform=True,
                            stereo=False)

# Draw path of a single record (mono with normalized poses)
record = 'Record008'
draw_record(train_dataset, record)
plt.show()
Output:
The implemented Apolloscape Pytorch dataset also supports a cache_transform option which, when enabled, saves all transformed images as pickles to disk and retrieves them on subsequent epochs without redoing the convert and transform operations on every image read. The cache saves up to 50% of the time during training, though it doesn't work with image augmentation transforms like torchvision.transforms.ColorJitter.
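For illustration, here is a minimal sketch of how such a disk cache might work (the names `CACHE_DIR` and `cached_transform` are hypothetical, not the actual dataset internals): key the cache by image path, pickle the transformed image on first read, and load the pickle on subsequent epochs.

```python
import hashlib
import os
import pickle

CACHE_DIR = "./_cache"  # hypothetical cache location

def cached_transform(image_path, transform, loader):
    """Apply `transform` to the image at `image_path`, caching the result on disk."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.md5(image_path.encode("utf-8")).hexdigest()
    cache_file = os.path.join(CACHE_DIR, key + ".pickle")
    if os.path.exists(cache_file):
        # Subsequent epochs: skip the decode + transform work entirely
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    image = loader(image_path)   # e.g. a PIL image read
    image = transform(image)
    with open(cache_file, "wb") as f:
        pickle.dump(image, f)
    return image
```

This is also why augmentation transforms like ColorJitter break the scheme: the cache would freeze one random augmentation and replay it every epoch.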
Also, we can get the mean and the standard deviation that we need later to recover true pose translations:
poses_mean = train_dataset.poses_mean
poses_std = train_dataset.poses_std
print('Translation poses_mean = {} in meters'.format(poses_mean))
print('Translation poses_std = {} in meters'.format(poses_std))
Output:
Translation poses_mean = [ 449.95782055 -2251.24771214 40.17147932] in meters
Translation poses_std = [123.39589457 252.42350964 0.28021513] in meters
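These statistics drive a plain zero-mean/unit-variance transform of the translation component, so recovering metric poses is straightforward. A minimal sketch (assuming poses as NumPy arrays of `[x, y, z, q1, q2, q3, q4]`; the helper names are mine, not the dataset's API):

```python
import numpy as np

def normalize_translation(poses, mean, std):
    """Zero-mean, unit-variance normalization of the translation part only."""
    poses = poses.copy()
    poses[:, :3] = (poses[:, :3] - mean) / std
    return poses

def denormalize_translation(poses, mean, std):
    """Recover metric translations from normalized poses."""
    poses = poses.copy()
    poses[:, :3] = poses[:, :3] * std + mean
    return poses
```

The quaternion components are left untouched since they are already unit-scale.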
You can find all the mentioned examples in the Apolloscape_View_Records.ipynb notebook.
And now let's turn to something more useful and interesting: training the PoseNet deep convolutional network to regress poses from camera images.
In the general case, the online localization task formulation goes like this: find the current robot pose \(\mathbf{X}_t\) given its previous state \(\mathbf{X}_{t-1}\) and the current sensor observations \(\mathbf{Z}_t\):
\[\mathbf{X}_t = f( \mathbf{X}_{t-1}, \mathbf{Z}_{t} )\]The PoseNet deep convolutional neural network regresses the robot pose from a monocular image; thus we do not take into account the previous robot state \(\mathbf{X}_{t-1}\) at all:
\[\mathbf{X}_t = f( \mathbf{Z}_t )\]Our observation sensor is a camera that gives us a monocular image \(\mathbf{I}_t\). After removing subscript indexes for simplicity, we can define our task as finding a pose \(\mathbf{X}\) given a monocular image \(\mathbf{I}\).
\[\mathbf{X} = f( \mathbf{I} )\]where \(\mathbf{X}\) represents the full pose with translation \(\mathbf{x}\) and rotation \(\mathbf{q}\) components:
\[\mathbf{X} = [ \mathbf{x}, \mathbf{q} ] \\ \mathbf{x} = [x, y, z], \quad \mathbf{q} = [ q_1, q_2, q_3, q_4]\]Rotation is represented in quaternions because they do not suffer from wrap-around at \(2\pi\) radians the way Euler angles or axis-angle representations do, and they are more straightforward to deal with than 3x3 rotation matrices.
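One practical consequence of the quaternion choice: a network's raw 4-vector output is not automatically a unit quaternion, so it is common to normalize it before interpreting it as a rotation. A tiny illustrative sketch (my own, not part of the project code):

```python
import numpy as np

def to_unit_quaternion(q):
    """Project an arbitrary 4-vector onto the unit sphere so it is a valid rotation.

    Note that q and -q represent the same rotation (the quaternion double cover),
    which matters when comparing predicted and ground-truth quaternions.
    """
    q = np.asarray(q, dtype=np.float64)
    return q / np.linalg.norm(q)
```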
For more information about the selection of different pose representations for deep learning, refer to the excellent paper “Geometric loss functions for camera pose regression with deep learning” by Alex Kendall et al.
I build a DNN-based regressor for camera pose on ResNet and modify it by adding a global average pooling layer after the last convolutional layer and introducing a fully-connected layer with 2048 neurons. Finally, it concludes with a 6-DoF camera pose regressor for the translation \((x, y, z)\) and rotation \((q_1, q_2, q_3, q_4)\) vectors.
A Pytorch implementation of the PoseNet model using a mono image:
import torch
import torch.nn.functional as F

class PoseNet(torch.nn.Module):
    def __init__(self, feature_extractor, num_features=128, dropout=0.5,
                 track_running_stats=False, pretrained=False):
        super(PoseNet, self).__init__()
        self.dropout = dropout
        self.feature_extractor = feature_extractor
        self.feature_extractor.avgpool = torch.nn.AdaptiveAvgPool2d(1)
        fc_in_features = self.feature_extractor.fc.in_features
        self.feature_extractor.fc = torch.nn.Linear(fc_in_features, num_features)

        # Translation
        self.fc_xyz = torch.nn.Linear(num_features, 3)

        # Rotation in quaternions
        self.fc_quat = torch.nn.Linear(num_features, 4)

    def extract_features(self, x):
        x_features = self.feature_extractor(x)
        x_features = F.relu(x_features)
        if self.dropout > 0:
            x_features = F.dropout(x_features, p=self.dropout, training=self.training)
        return x_features

    def forward(self, x):
        x_features = self.extract_features(x)
        x_translations = self.fc_xyz(x_features)
        x_rotations = self.fc_quat(x_features)
        x_poses = torch.cat((x_translations, x_rotations), dim=1)
        return x_poses
For further experiments I've also implemented a stereo version (currently it simply processes two images in parallel without any additional constraints), an option to switch off stats tracking for BatchNorm layers, and Kaiming He normal weight initialization [4]. Full source code is here: models/posenet.py
As a loss function I use a weighted combination of losses for translation and orientation as described in the original PoseNet paper [2]:
\[\mathcal{L}_{\beta} (\mathbf{I}) = \mathcal{L}_{x}(\mathbf{I}) + \beta \mathcal{L}_{q}(\mathbf{I})\]The scaling factor \(\beta\) is introduced to balance the losses of two variables expressed in different units and at different scales. The weighting parameter depends on the task itself and should be selected as a model hyper-parameter.
The second option for a loss function is to use a learning approach to find an optimal weighting for translation and orientation:
\[\mathcal{L}_{\sigma}(\mathbf{I}) = \frac{\mathcal{L}_x(\mathbf{I})}{\hat{\sigma}_x^2} + \log \hat{\sigma}_x^2 + \frac{\mathcal{L}_q(\mathbf{I})}{\hat{\sigma}_q^2} + \log \hat{\sigma}_q^2\]The homoscedastic uncertainties \( \hat{\sigma}_x^2, \hat{\sigma}_q^2 \) are free scalar values that we learn through backpropagation with respect to the loss function. Their effect is to automatically decrease or increase the corresponding loss component. The terms \( \log \hat{\sigma}_x^2, \log \hat{\sigma}_q^2 \) are the corresponding regularizers that prevent these values from growing too large.
In practice, to prevent the potential division by zero, the authors [1] suggest learning \( \hat{s} := \log \hat{\sigma}^2 \), which is more numerically stable. So the final loss expression looks like:
\[\mathcal{L}_{\sigma}(\mathbf{I}) = \mathcal{L}_x(\mathbf{I}) \exp(-\hat{s}_{x}) + \hat{s}_{x} + \mathcal{L}_q(\mathbf{I}) \exp(-\hat{s}_q) + \hat{s}_q\]Initial values for \(\hat{s}_{x}\) and \(\hat{s}_q\) can be set via a best guess; I follow the suggestion from the paper and set them to \(\hat{s}_{x} = 0, \hat{s}_q = -3.0\).
For more details on where this came from and an intro to Bayesian Deep Learning (BDL), you can refer to an excellent post by Alex Kendall where he explains different types of uncertainties and their implications for multi-task models. Even more results can be found in the papers “Multi-task learning using uncertainty to weigh losses for scene geometry and semantics” [5] and “What uncertainties do we need in Bayesian deep learning for computer vision?” [6].
The Pytorch implementation for both versions of the loss function is the following:
class PoseNetCriterion(torch.nn.Module):
    def __init__(self, beta=512.0, learn_beta=False, sx=0.0, sq=-3.0):
        super(PoseNetCriterion, self).__init__()
        self.loss_fn = torch.nn.L1Loss()
        self.learn_beta = learn_beta
        if not learn_beta:
            self.beta = beta
        else:
            self.beta = 1.0
        self.sx = torch.nn.Parameter(torch.Tensor([sx]), requires_grad=learn_beta)
        self.sq = torch.nn.Parameter(torch.Tensor([sq]), requires_grad=learn_beta)

    def forward(self, x, y):
        # Translation loss (with its s_x regularizer, matching the formula above)
        loss = torch.exp(-self.sx) * self.loss_fn(x[:, :3], y[:, :3]) + self.sx
        # Rotation loss (with its s_q regularizer)
        loss += torch.exp(-self.sq) * self.beta * self.loss_fn(x[:, 3:], y[:, 3:]) + self.sq
        return loss
If the learn_beta param is False, it's the simple weighted-sum version of the loss; if learn_beta is True, it uses the sx and sq params with gradients enabled, which train together with the other network parameters using the same optimizer.
Now let's combine it all into the training loop. I use the torch.optim.Adam optimizer with learning rate 1e-5, ResNet34 pretrained on ImageNet as a feature extractor, and 2048 features on the last FC layer before the pose regressors.
import os

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models

from datasets.apolloscape import Apolloscape
from utils.common import save_checkpoint
from models.posenet import PoseNet, PoseNetCriterion

APOLLO_PATH = "./data/apolloscape"

# ImageNet normalization params because we are using a pre-trained
# feature extractor
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

# Resize data before using
transform = transforms.Compose([
    transforms.Resize(260),
    transforms.CenterCrop(250),
    transforms.ToTensor(),
    normalize
])

# Create datasets
train_dataset = Apolloscape(root=os.path.join(APOLLO_PATH), road="zpark-sample",
                            transform=transform, normalize_poses=True,
                            pose_format='quat', train=True,
                            cache_transform=True, stereo=False)
val_dataset = Apolloscape(root=os.path.join(APOLLO_PATH), road="zpark-sample",
                          transform=transform, normalize_poses=True,
                          pose_format='quat', train=False,
                          cache_transform=True, stereo=False)

# Dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=80, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=80, shuffle=True)

# Select primary device
if torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Create pretrained feature extractor
feature_extractor = models.resnet34(pretrained=True)

# Num features for the last layer before pose regressor
num_features = 2048

# Create model
model = PoseNet(feature_extractor, num_features=num_features, pretrained=True)
model = model.to(device)

# Criterion with learnable loss weighting
criterion = PoseNetCriterion(learn_beta=True)
criterion = criterion.to(device)

# Add all params for optimization
param_list = [{'params': model.parameters()}]
if criterion.learn_beta:
    # Add sx and sq from loss function to optimizer params
    param_list.append({'params': criterion.parameters()})

# Create optimizer
optimizer = optim.Adam(params=param_list, lr=1e-5, weight_decay=0.0005)

# Epochs to train
n_epochs = 2000

# Main training loop
val_freq = 200
for e in range(0, n_epochs):
    train(train_dataloader, model, criterion, optimizer, e, n_epochs, log_freq=0,
          poses_mean=train_dataset.poses_mean, poses_std=train_dataset.poses_std,
          stereo=False)
    if e % val_freq == 0:
        validate(val_dataloader, model, criterion, e, log_freq=0, stereo=False)

# Save checkpoint
save_checkpoint(model, optimizer, criterion, 'zpark_experiment', n_epochs)
A slightly simplified train function is below, with error calculation that is used solely for logging purposes:
import time

import numpy as np

def train(train_loader, model, criterion, optimizer, epoch, max_epoch,
          log_freq=1, print_sum=True, poses_mean=None, poses_std=None,
          stereo=True):
    # switch model to training
    model.train()

    losses = AverageMeter()
    epoch_time = time.time()

    gt_poses = np.empty((0, 7))
    pred_poses = np.empty((0, 7))

    end = time.time()
    for idx, (batch_images, batch_poses) in enumerate(train_loader):
        data_time = (time.time() - end)

        batch_images = batch_images.to(device)
        batch_poses = batch_poses.to(device)

        out = model(batch_images)
        loss = criterion(out, batch_poses)

        # Training step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        losses.update(loss.item(), len(batch_images) * batch_images[0].size(0) if stereo
                      else batch_images.size(0))

        # move data to cpu & numpy
        bp = batch_poses.detach().cpu().numpy()
        outp = out.detach().cpu().numpy()
        gt_poses = np.vstack((gt_poses, bp))
        pred_poses = np.vstack((pred_poses, outp))

        # Get final times
        batch_time = (time.time() - end)
        end = time.time()

        if log_freq != 0 and idx % log_freq == 0:
            print('Epoch: [{}/{}]\tBatch: [{}/{}]\t'
                  'Time: {batch_time:.3f}\t'
                  'Data Time: {data_time:.3f}\t'
                  'Loss: {losses.val:.3f}\t'
                  'Avg Loss: {losses.avg:.3f}\t'.format(
                      epoch, max_epoch - 1, idx, len(train_loader) - 1,
                      batch_time=batch_time, data_time=data_time, losses=losses))

    # un-normalize translation
    unnorm = (poses_mean is not None) and (poses_std is not None)
    if unnorm:
        gt_poses[:, :3] = gt_poses[:, :3] * poses_std + poses_mean
        pred_poses[:, :3] = pred_poses[:, :3] * poses_std + poses_mean

    # Translation error
    t_loss = np.asarray([np.linalg.norm(p - t) for p, t in
                         zip(pred_poses[:, :3], gt_poses[:, :3])])

    # Rotation error
    q_loss = np.asarray([quaternion_angular_error(p, t) for p, t in
                         zip(pred_poses[:, 3:], gt_poses[:, 3:])])

    if print_sum:
        print('Ep: [{}/{}]\tTrain Loss: {:.3f}\tTe: {:.3f}\tRe: {:.3f}\t'
              'Et: {:.2f}s\t{criterion_sx:.5f}:{criterion_sq:.5f}'.format(
                  epoch, max_epoch - 1, losses.avg, np.mean(t_loss), np.mean(q_loss),
                  (time.time() - epoch_time),
                  criterion_sx=criterion.sx.item(), criterion_sq=criterion.sq.item()))
The validate function is similar to train except for the model.eval()/model.train() modes, logging, and error calculations. Please refer to /utils/training.py on GitHub for full versions of the train and validate functions.
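The `quaternion_angular_error` helper used to report the rotation error above lives in the utils on GitHub; a common way to implement such a helper (a sketch, not necessarily the repo's exact version) is via the absolute dot product between unit quaternions:

```python
import numpy as np

def quaternion_angular_error(q1, q2):
    """Angular distance in degrees between two rotations given as quaternions.

    The abs() accounts for the double cover: q and -q encode the same rotation.
    """
    q1 = np.asarray(q1, dtype=np.float64)
    q2 = np.asarray(q2, dtype=np.float64)
    d = abs(np.dot(q1 / np.linalg.norm(q1), q2 / np.linalg.norm(q2)))
    d = min(1.0, max(-1.0, d))  # clamp for numerical safety
    return np.degrees(2 * np.arccos(d))
```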
The training converges after about 1-2k epochs. On my machine with a GTX 1080 Ti it takes about 22 seconds per epoch on the ZPark sample train dataset with 2242 images pre-processed and scaled to 250x250 pixels. Total training time: 6-12 hours.
After 2k epochs of training, the model managed to predict pose translation with a mean error of 40.6 meters and rotation with a mean error of 1.69 degrees.
After learning, the PoseNet criterion uncertainties converged to \(\hat{s}_{x} = 1.5606, \hat{s}_q = -3.8471\). Interestingly, these values are equivalent to \(\beta = 223.12\), which is close to the values that can be derived from PoseNet's original paper [2].
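The equivalence is easy to check: dividing the learned loss by \( \exp(-\hat{s}_x) \) puts it into the \(\beta\)-weighted form, so the effective weighting is \( \beta = \exp(\hat{s}_x - \hat{s}_q) \):

```python
import math

s_x, s_q = 1.5606, -3.8471   # learned homoscedastic log-variances
beta_equivalent = math.exp(s_x - s_q)
# beta_equivalent comes out around 223, matching the value quoted above
```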
You can replicate all results from this article using my GitHub repo of the project.
The established results are far from ones that can be used in autonomous navigation, where a system needs to know its location within an accuracy of 15cm. Such precision is vital for a car to act safely, correctly predict the behavior of others, and plan actions accordingly. In any case, it's a good baseline and a building block of the pipeline for working with the Apolloscape dataset that I can develop and improve further.
There are many things to try next:
And what's more important, all the above-mentioned methods need no additional information beyond what we already have in the ZPark sample road from the Apolloscape dataset.
Navigation is handled by the move_base ROS package, which in turn uses the underlying navigation stack with global and local planners to build a plan and eventually send commands to the controller that drives the robot in simulation in Gazebo. That was the last requirement for the final project. I tested it a bit more, prepared the archive with all the ROS packages containing custom nodes, configurations, scripts, and launch files, and sent it for review.
I've just finished a great journey through the Robotics Software Engineer Nanodegree program at Udacity as part of the first cohort, where we dove deep into how robots work and how to write software for them. There was a vast amount of material to learn and tinker with, from custom-built simulations and Python APIs to control robots, or tuning PID controllers, to deep reinforcement learning agents that run on a Jetson TX2 and control a simulated robotic arm.
During the program there were 9 projects, and here I briefly describe them with the results, personal reflections, and links to related materials so you can use them to learn more about the field.
Note: Each project description contains GitHub links to solution code and a project write-up that addresses the questions needed for submission. If you are a current Udacity student in the program, I would highly recommend you build your own solution first before looking inside.
The project is based on the NASA Sample Return Competition, which is a great introduction to the field and shows how it can be applied in real settings. NASA's competition involves building autonomous rovers with the capability to locate and retrieve rock samples and return those samples to a designated zone in a specified amount of time with limited mapping data.
In contrast to NASA's competition, we don't need to build a robot; our objective is to control a robot simulation through a Python API to build a simple map of the environment, locate rock samples, and retrieve them. The simulator is built on Unity and provides an easy-to-use environment to test perception and control algorithms.
To complete the objective one needs to build perception and decision pipelines. During the perception stage the camera image is cropped, transformed to a bird's-eye view, and, through a series of thresholds, walls, ground, and diamonds are located and marked on maps with different colors.
Free space detection pipeline
The decision step either moves forward into the free area while leaning a bit toward the wall (follow_wall mode), switches to forward_stop when a rock sample is detected, or to rotate_right if the rover is stuck and not moving. The simple perception algorithm can't detect holes or flying rocks that can trap our rover, so the current speed is continuously monitored and the switch to rotate_right mode happens when the robot is stuck.
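That decision logic can be sketched as a tiny state machine (the mode names follow the description above; the function, thresholds, and parameter names are my hypothetical simplifications, not the actual project code):

```python
def decide_mode(free_pixels, rock_pixels, speed,
                free_threshold=500, stuck_speed=0.1):
    """Pick a rover mode from simple perception counts and the current speed.

    free_pixels: navigable-terrain pixels in the warped camera view
    rock_pixels: pixels classified as a rock sample
    speed: current rover speed in m/s; near-zero while driving means stuck
    """
    if rock_pixels > 0:
        return "forward_stop"    # approach and stop to pick up the sample
    if speed < stuck_speed:
        return "rotate_right"    # stuck on a hole or rock: rotate until free
    if free_pixels > free_threshold:
        return "follow_wall"     # move forward, leaning toward the wall
    return "rotate_right"        # not enough free space ahead
```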
I've spent lots of time on things that weren't needed for the project, like trying to build an accurate map and guide the robot using an A* path planner, or tuning separate speed and yaw PID controllers for the control step. Eventually, however, I returned to the simplest solution, with a wall-follower algorithm and a decision step based on the number of pixels of free space and diamonds.
Here is the resulting rover in action. Not so smooth, and sometimes it acts roughly and jerkily.
There are many things one could try to improve upon the current implementation:
You can play with the starter project yourself by cloning it from the RoboND-Rover-Project GitHub; links to download the simulator are in the README file.
My project GitHub and writeup links.
Related links:
Here are the hardware basics of robotics: modeling the robot description, defining its Denavit-Hartenberg (DH) parameters, calculating the forward kinematics matrix and inverse kinematics solutions for the Kuka KR210 robotic arm, and testing it all in ROS with Gazebo.
Together with math and geometry you also need to catch up with ROS, Gazebo, RViz, MoveIt! and how to deploy everything on a local Linux machine. Though it was possible to run everything within a virtual machine with all software pre-installed using the Udacity-provided image, I installed everything manually on a dedicated PC with an NVidia 1080 Ti GPU. That was the right decision, because my 13-inch MacBook Pro just wasn't performant enough to run a VM simulation with Linux and Gazebo faster than 0.1-0.2x real-time.
I like math, and it was a great chance to brush up on trigonometry, so I did all the calculations by hand with an iPad and Pencil. Below is the solution for the DH table and forward kinematics that calculates the position of the end-effector given the angles of all joints.
Modified DH parameters for Kuka KR210 Robotics Arm.
Forward Kinematics transform for Kuka KR210 Robotics Arm.
The whole derivation of forward and inverse kinematics, with correction transforms and special cases of singularity, can be found in my project write-up.
One might ask: “Why do we need to bother with this math when ROS can provide all transformations through the tf package?” I think it's important to understand it all because:
I would also prefer to learn more about dynamics and mechanical parts, but it seems a vast topic and Udacity would probably need to launch a separate Robotics Hardware Nanodegree to cover it in a meaningful way.
Here are the links to my implementation of the project on GitHub and a write-up.
Related links & materials:
What I like about Udacity's course materials is that they expose you to real-world problems and ongoing challenges in the robotics field.
Thus, the third project is based on the Amazon Robotics Challenge, where the goal is to build a recognition pipeline for a set of known objects using RGB-D sensor data and send the correct pose to the robot to perform a pick-and-place operation.
That's still only a part of what participants face: the full task of the Amazon Robotics Challenge is to build an autonomous manipulator that can perform perception, motion planning, 3D mapping, kinematics and more to locate specific objects in a cluttered dynamic environment, retrieve them, and place them into a designated bin.
To train the model we need to collect point cloud data for each object in different orientations. The Gazebo environment is used for data collection by sequentially showing every object in random poses to the sensor and recording HOG and surface-normal feature vectors.
It was an excellent lesson for me that with a quite simple 3D point cloud pipeline going through voxel downsampling, passthrough thresholding, RANSAC plane segmentation, and Euclidean clustering, we can separate the individual points of the objects on the table and apply an SVM classifier to recognize them. Everything seems much more complex in 3D than in 2D, but this pipeline was clean, simple, and produced great results (a 96% accuracy rate for the SVM classifier).
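To illustrate the first stage: voxel-grid downsampling keeps one representative point (the centroid) per fixed-size 3D cell, thinning the raw cloud before the later steps. A minimal NumPy sketch of the idea (the actual project used the PCL bindings, not this code):

```python
import numpy as np

def voxel_downsample(points, leaf_size=0.01):
    """Replace all points inside each (leaf_size)^3 voxel with the voxel centroid."""
    points = np.asarray(points, dtype=np.float64)
    # Integer voxel coordinates for every point
    voxel_idx = np.floor(points / leaf_size).astype(np.int64)
    # Group points by voxel and average each group
    _, inverse = np.unique(voxel_idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    n_voxels = inverse.max() + 1
    sums = np.zeros((n_voxels, 3))
    counts = np.zeros(n_voxels)
    np.add.at(sums, inverse, points)
    np.add.at(counts, inverse, 1)
    return sums / counts[:, None]
```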
Confusion matrix for SVM Classifier
3D Perception pipeline: HOG & Surface normals features + SVM Classifier
Look at my implementation of the project on GitHub and a write-up link.
Related links & materials:
Well, that's nice. If you are following drone industry updates, you probably know a couple of cases where drone companies claimed to have built a “follow me” feature, promising that the drone would automatically fly beside you, take beautiful shots, and avoid obstacles. Then they didn't quite deliver on the promise, because the task is just hard.
In this project, we had a chance to build a part of this pipeline: person recognition through deep learning image segmentation, with the data fed back to the drone simulator to test the following feature.
Udacity’s simulator is built on Unity game engine, and it’s an excellent tool as a playground for testing perception and control algorithms.
Udacity’s Drone Simulator
Image segmentation was built on a Fully Convolutional Network (FCN) architecture that consists of a convolutional encoder and decoder. It's a first architecture that you can start from before testing other approaches like Mask R-CNN, for example.
Simple FCN architecture for image segmentation task
For the hero segmentation task with such an FCN I was able to get an average IoU score of 89.6%.
Hero segmentation: Original image, Ground Truth, Segmentation Result
Here are links to my implementation of the project on GitHub and a write-up.
Related links & materials:
In the first project of Term 2 we explored the NVidia DIGITS system for training deep learning models for image classification.
My choice was to build a model that can classify finger gestures so I can use it later for my home robot projects.
Data was collected manually using a camera on MacBook Pro.
Fingers Classification: Example dataset
Interestingly enough, LeNet was able to train on and recognize finger gestures with high accuracy, and the inference performance on a Tesla K80 GPU was 0.19 ms!
You can find details about the project and my implementation in the previous article “Fingers Classification”, and the source code of the supporting tools on GitHub.
Related links:
Here we go deeper into the robot description and explore a Monte-Carlo Localization algorithm implementation in ROS.
One part of the project is to build a robot model using URDF and test it in Gazebo; another is to configure the localization stack using the AMCL package.
Robot Model, with a laser scan and camera sensors
Current and future projects teach us to deal with the dozens of parameters that you need to tune to make your system work: robot model parameters (inertial, visual, collision), AMCL parameters (particle count, velocity limits, transform tolerance, odometry characteristics), Move Base parameters (inflation radius, robot footprint, local and global cost map settings, planner settings), and so on.
Sometimes it converges quickly and you have a working robot, but sometimes it doesn't, and you go in circles making lots of tweaks and tests, launching ROS, RViz, and Gazebo many times.
My custom robot executing localization and navigation to the goal
Here are links to my implementation of the project on GitHub and a write-up in PDF.
Related links:
It's a SLAM project where we explored different approaches to SLAM during the lessons: Full SLAM vs. Online SLAM, FastSLAM (EKF-based and Grid-based), and GraphSLAM.
Durrant-Whyte et al. (2006)
We used the mapping capabilities of rtabmap to map the environment using the previously built URDF robot model with a newly equipped depth camera and the necessary configurations.
We built the map of the provided environment and explored loop closure on the resulting map. Then we modeled our own environment in Gazebo and built a map of it with our configured robot using the rtabmap implementation.
3D map of a custom environment, built with rtabmap.
SLAM is a very complex problem, algorithmically and computationally, and it's worth exploring it deeper because you can find it everywhere: AR/MR/VR headsets, controllers, mobile phones, self-driving cars, and Mars rovers.
Here are links to my implementation of the project on GitHub and a write-up in PDF.
Related links:
Deep Reinforcement Learning is the next big thing in deep learning; you've probably heard about DeepMind, AlphaGo Zero, or the OpenAI agents that play Dota 2 tournaments against humans. This project is about applying deep reinforcement learning to robotic tasks.
The objectives of the project were to create a DQN agent and tune hyperparameters to achieve the two following tasks:
The project environment was built on ROS and a Pytorch LSTM network, with C++ integration into a Gazebo plugin.
Deep RL Arm Manipulator trained for Task #2
The main difficulty in the project was selecting a good reward function that would allow learning the expected behaviors. I spent more than 20 hours of GPU compute time in the Udacity Workspace just training and testing different reward params.
Links to my implementation of the project on GitHub and a write-up in PDF.
Relevant materials & links:
Finally, I am returning to the project that was mentioned at the beginning of the article.
It's an integration project to build the structure of a home service robot that can build a map by executing a wall-follower algorithm, navigate to a pick-up location, collect an object, navigate to a drop-off location, and leave the object there.
Home Service Robot completing Goal #1
The main challenges were to integrate everything in ROS, using and re-using packages from Turtlebot, while implementing three custom nodes:
Here are the links to my implementation of the project on GitHub.
Relevant materials & links:
That was a great journey through a vast amount of information and tools. I tried to break the ice of ROS and Gazebo myself 2 years ago but lost motivation without a task at hand and a well-defined structure.
I think the primary value you get from Udacity's Nanodegree is the organized and straightforward path for learning, project tasks and environments that you can start playing with, and a community of like-minded people who are passionate about robotics and go through the challenges together with you.
My future work in the field would be:
Happy learning and building autonomous robots.
PS: If you are in San Francisco - Bay Area, building robots or want to collaborate on some robotics/self-driving car projects I would be happy to chat with you.
Microsoft is very serious about 2 billion first-line workers and building solutions to tap into their workflows and processes.
Once engaged in the Microsoft stack, businesses can't stop but move forward, integrating more and more Microsoft technology into their workflows.
Microsoft has a strong consultancy wing that builds custom enterprise solutions and helps grow the ecosystem of service providers who, in turn, integrate more and more Microsoft/Azure tech into enterprises.
Microsoft helps integrators and agencies only when they bring substantial value to Microsoft itself: a big customer that uses Azure products, or a great use case with an enterprise customer that they can turn into a case study or video, etc. (though it's like this at every other company).
There wasn't much mobile news anymore; mobile is just a platform where users should have access to Microsoft Graph, Microsoft Teams, Office 365, and Azure Cognitive Services, and interact with new platforms like Mixed Reality.
I am an (almost) complete stranger in the Microsoft ecosystem (and still am), but I see that big enterprises are there, the ecosystem is there, and Microsoft's pace of innovation is on par with, and in some spaces even better than, Amazon, Facebook, Google, and Apple (consider the AI stuff, HoloLens, and Microsoft Graph).
In conclusion, Amazon AWS and Microsoft Azure (plus everything else in the MS stack) are great places to be for any consultancy and service business that wants to work with big enterprises, but MS looks like a hard environment to enter for everyone who doesn't have previous experience with MS products and switched long ago to the all-Mac/Linux paradigm. However, MS is closing these gaps with the introduction of Linux shell support, and even Notepad now supports Unix-style line endings :)
Everyone is migrating to the cloud from on-premise and privately owned data centers. It started occurring 3-5 years ago within enterprises and will continue further. The more data companies accumulate, the more migration effort will be needed in the future (just copying petabytes of data is not an easy endeavor).
More data leads to more demand for data analysis and AI infrastructure to retrieve and process it, and more opportunities for services based on Microsoft Graph and Spatial Analytics to provide value for customers.
Cortana will be on par with Google's Duplex in 2-3 years, but for enterprises: handling chores like meeting confirmations, reminders, and gathering information according to a brief or soliciting requirements.
It will help with meetings and the execution of mundane tasks in companies, from travel arrangements to customer research. Outlook once killed the assistant's job, and Cortana powered by Microsoft Graph will free time and boost productivity for accountants, lawyers, event managers, and HR specialists.
Real-time is hard, and MS continues to move inference and data processing to the edge. It’s playing the card of smart factories and enterprises. I think this is a huge market where technology can add substantial value to real-world problems, like making factory floors safer, more predictable, and more satisfying.
Containers and Kubernetes are everywhere: in web dev, AI/ML, IoT, etc. It’s like the new standard in software engineering.
MS is strong in AI APIs and algorithms, and they are not losing their pace. We can expect more intelligence in NLP and vision (especially for enterprise tasks: manufacturing, factory floors, drones, etc.).
~25 AI services are available and easy to train and use. They are now even available as functions in Excel, connected to the whole Azure infrastructure and its databases.
Microsoft’s research in AI and FPGA-based hardware is strong. You can easily encounter their papers, ranging from NLP to 3D pose tracking, with state-of-the-art results and datasets.
Sensors + Software + Hardware + Cloud: we can see this formula unrolling in Project Kinect for Azure and the Spatial Analytics audio device that they showcased on the expo floor.
Thousands of people are working on Mixed Reality in Microsoft. It’s a big bet for them and they are leading the innovation here with a) HoloLens and b) Project Kinect for Azure sensors.
It wasn’t directly confirmed, but all signs suggest that MS is working on HoloLens v2. We can expect more computational power, better tracking with the new Kinect sensor hardware, and integrated collaboration services like Microsoft Assist and Microsoft Teams. (speculation)
Mixed Reality is for collaboration and shared experiences.
While Amazon and Google are building connected homes, MS is building connected offices, factory floors, and warehouses.
Everything has always been about people and collaboration. Hence avatars, multi-player, body/face expression and eye tracking, etc.
All services and media are about people collaborating with other people: in MR, in chats, in training and remote assistance, and so on.
MS is pushing forward with the vision of a smart connected floor that continuously tracks everything and provides insights and analytics back.
Spatial Analytics: an audio array + Project Kinect for Azure (the best-in-class depth sensor) + a smart camera with on-the-edge processing. All this will be glued together with dedicated smart Azure services in 1-2 years.
Combine Cortana with Spatial Analytics “feelings” and we get a factory supervisor like the intelligent spaceship “Justice of Toren” from “Ancillary Justice”.
Vision is easy; execution is much, much harder. Switch gears and tap into the execution and delivery of real systems and solutions that work for the enterprises of the future.
Mobile robots need to sense the world around them and reason about it. Different sensors convey different modalities on which we build perception, control, localization, mapping, and behavioral planning. Visual perception is key to a robot perception system due to the high information density that images contain, though images have always been difficult to process both algorithmically and computationally. With the advances in deep learning and GPU computing chips, we can now build end-to-end systems for a specific task with a fraction of the effort needed a decade ago.
Gestures are a natural human interaction method that we can apply to mobile robot control. Different hand-gesture control systems for mobile robots have been proposed: image pre-processing with a feature extraction stage [1], a wearable inertial-sensor-based system [3], a system based on hidden Markov models [4], or neural-network EOG gesture recognition [5].
Here I am exploring an end-to-end system for gesture recognition, trained on a manually collected dataset using NVIDIA DIGITS.
Data was manually collected using a MacBook Pro built-in camera and a Python script grab_images.py. All script sources are available in my GitHub repo bexcite/fingers_classification.
python grab_images.py --size 28 --rate 10 --gray
It captures frames at a 10 Hz rate (--rate 10), resizes them to 28x28 pixels (--size 28), and saves them in grayscale (--gray).
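The fixed-rate capture loop can be sketched with the standard library alone. This is a hypothetical reconstruction, not the actual grab_images.py (which would also need a camera library such as OpenCV to grab, resize, and save the frames):

```python
import time

def run_at_rate(rate_hz, grab_frame, duration_s):
    """Call grab_frame() at roughly rate_hz for duration_s seconds.

    In the real script the callback would capture a camera frame,
    resize it to 28x28, convert it to grayscale, and save it to disk.
    """
    period = 1.0 / rate_hz
    deadline = time.monotonic() + duration_s
    frames = 0
    while time.monotonic() < deadline:
        grab_frame()
        frames += 1
        # Sleep one period; simple pacing that drifts slightly under load.
        time.sleep(period)
    return frames
```

At 10 Hz over a minute this yields roughly 600 labeled frames per gesture session.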
Different lighting conditions, background variations, gesture angles, and spatial rotations were used to cover a broad spectrum of possible uses.
Additional data augmentation (horizontal flipping) was used to further increase the dataset size and variation. (source)
python augment_images.py --input_dir <data> --output_dir <aug_data> --resize 28
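A minimal sketch of the two operations augment_images.py performs, assuming grayscale images represented as nested lists of pixel values (the actual script presumably works on image files via OpenCV or PIL; function names here are illustrative):

```python
def hflip(image):
    """Horizontal flip: reverse each row of pixels."""
    return [list(reversed(row)) for row in image]

def nn_resize(image, size):
    """Nearest-neighbor resize of a rectangular image to size x size."""
    h, w = len(image), len(image[0])
    return [[image[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]
```

Flipping is a safe augmentation here because a mirrored hand still shows the same number of fingers, so the class label is preserved.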
Below is an example of images collected before resizing.
Examples of processed images that were used during training (scaled up for a better view).
Fingers recognition for a mobile robot should be as fast as possible and shouldn’t take compute time away from the control and localization pipelines. I’ve tested two ready-made networks in DIGITS, LeNet and GoogLeNet; the fastest results with good accuracy were achieved on a standard LeNet convolutional network with gray images of 28x28 pixels.
The network converged after around 10 epochs, in a matter of seconds.
It was trained using SGD with a starting learning rate of 0.01 and a scheduled decrease.
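The scheduled decrease can be sketched as a step-decay policy. The step size and decay factor below are assumptions for illustration, not the exact solver options used in DIGITS:

```python
def step_decay(base_lr, epoch, step_epochs=4, gamma=0.5):
    """Step decay: multiply the base rate by gamma every step_epochs epochs."""
    return base_lr * gamma ** (epoch // step_epochs)
```

With base_lr=0.01 this gives 0.01 for epochs 0-3, 0.005 for epochs 4-7, 0.0025 for epochs 8-11, and so on.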
The fingers dataset trained on the LeNet network showed good accuracy for all classes and an inference speed of 0.19 ms on a Tesla K80 GPU.
Below is the result output from TensorRT for inference speed evaluation:
Confusion matrix for a test dataset:
It was unexpected that LeNet and 28x28 gray images could produce acceptable results for 6 possible finger-gesture combinations. Though if we look closer at the 28x28 training images, it becomes clear that the depicted gesture and the number of fingers are recognizable in almost all of them, so the convolutional layers were able to extract meaningful features too.
Next, I’m planning to load this network onto a Jetson TX2 and see how it performs in real time on smaller hardware, so eventually my robot could recognize hand-gesture commands while navigating around an apartment.
Ali, Muaammar Hadi Kuzman, M. Asyraf Azman, and Zool Hilmi Ismail. “Real-time hand gestures system for mobile robots control.” Procedia Engineering 41 (2012): 798-804. link >
Manigandan, Mr, and I. Manju Jackin. “Wireless vision based mobile robot control using hand gesture recognition through perceptual color space.” Advances in Computer Engineering (ACE), 2010 International Conference on. IEEE, 2010. link >
Stančić, Ivo, Josip Musić, and Tamara Grujić. “Gesture recognition system for real-time mobile robot control based on inertial sensors and motion strings.” Engineering Applications of Artificial Intelligence 66 (2017): 33-48. link >
Iba, Soshi, et al. “An architecture for gesture-based control of mobile robots.” Intelligent Robots and Systems, 1999. IROS’99. Proceedings. 1999 IEEE/RSJ International Conference on. Vol. 2. IEEE, 1999. link >
SASAKI, Minoru, and Yanagido Gifu. “Mobile Robot Control by Neural Networks EOG Gesture Recognition.” link >
Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. “Dynamic routing between capsules.” Advances in Neural Information Processing Systems. 2017. link >
LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324. link >
Low Earth Orbit (LEO) starts at around 160 km and extends up to 2,000 km. Below 160 km, atmospheric drag is so strong that objects quickly lose speed, and the orbit rapidly decays.
To stay in LEO long enough without decaying, an object needs to use some propulsion to boost its speed periodically. For example, the ISS reboosts itself a couple of times per year.
The delta-v requirement to reach Low Earth Orbit starts at approximately 9.4 km/s.
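A rough sanity check on that number: the circular orbital speed at 200 km altitude is about 7.8 km/s, and the remainder of the ~9.4 km/s budget covers gravity and drag losses during ascent. A quick calculation:

```python
import math

MU_EARTH = 3.986004418e14  # Earth's standard gravitational parameter, m^3/s^2
R_EARTH = 6_371_000.0      # mean Earth radius, m

def circular_orbit_speed(altitude_m):
    """Speed of a circular orbit at a given altitude: v = sqrt(mu / r)."""
    return math.sqrt(MU_EARTH / (R_EARTH + altitude_m))

print(circular_orbit_speed(200e3) / 1000)  # roughly 7.8 km/s
```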
Smartwatch User Interaction (2015) — How users interact with a smartwatch and what UI components they use
Pros and Cons of Using Embedded Android for a Non-Mobile Device (2015)
Why Your Startup Needs a Cross-functional Team to be Successful (2015)
Slack Native Advertising Bot (2015) — how ads might look like in our chats
3 Types of Software Architecture for Connected Devices (2015) — A Smart Light Bulb Case
What we’ve learned from 10 years of being in the software development business (2015)
We need more Virtual Reality content (2015) — It is the future of display technology.