Monocular Depth Estimation for Autonomous Driving

CVC has a new PhD on its record!

Akhil Gurram successfully defended his dissertation in Computer Science on March 10, 2022, and he is now a Doctor of Philosophy from the Universitat Autònoma de Barcelona.

Download thesis

What is the thesis about?

3D geometric information is essential for on-board perception in autonomous driving and driver assistance. Autonomous vehicles (AVs) are equipped with calibrated sensor suites. As part of these suites, we can find LiDARs, which are expensive active sensors in charge of providing the 3D geometric information. Depending on the operational conditions of the AV, calibrated stereo rigs may also be sufficient for obtaining 3D geometric information, these rigs being less expensive and easier to install than LiDARs.

However, ensuring proper maintenance and calibration of these types of sensors is not trivial. Accordingly, there is increasing interest in performing monocular depth estimation (MDE) to obtain 3D geometric information on-board. MDE is very appealing since it puts appearance and depth in direct pixelwise correspondence without further calibration. Moreover, a set of single cameras with MDE capabilities would still be a cheap solution for on-board perception, relatively easy to integrate and maintain in an AV.

The best MDE models are based on Convolutional Neural Networks (CNNs) trained in a supervised manner, i.e., assuming pixelwise ground truth (GT). Accordingly, the overall goal of this PhD is to study methods for improving CNN-based MDE accuracy under different training settings. More specifically, this PhD addresses the research questions described below. When we started this PhD, state-of-the-art methods for MDE were already based on CNNs. In fact, a promising line of work consisted in using image-based semantic supervision (i.e., pixel-level class labels) while training CNNs for MDE with LiDAR-based supervision (i.e., depth). It was common practice to assume that the same raw training data are complemented by both types of supervision, i.e., with depth and semantic labels. However, in practice, it was more common to find heterogeneous datasets with either only depth supervision or only semantic supervision. Therefore, our first work researched whether we could train CNNs for MDE by leveraging depth and semantic information from heterogeneous datasets. We show that this is indeed possible, and we surpassed the state-of-the-art results on MDE at the time we did this research. To achieve our results, we proposed a particular CNN architecture and a new training protocol.
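To make the heterogeneous-supervision idea concrete, here is a minimal PyTorch-style sketch, not the thesis's actual architecture or training protocol: a shared encoder feeds two task heads, and since each training batch comes from one dataset and therefore carries only one kind of ground truth, only the corresponding loss is applied. All names here (`MultiTaskNet`, `train_step`, the batch keys) are hypothetical.

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    """Toy two-head network: shared encoder, separate decoders for
    depth regression and semantic segmentation."""
    def __init__(self, num_classes=19):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Conv2d(64, 1, 3, padding=1)               # per-pixel depth
        self.semantic_head = nn.Conv2d(64, num_classes, 3, padding=1)  # per-pixel class logits

    def forward(self, x):
        feats = self.encoder(x)
        return self.depth_head(feats), self.semantic_head(feats)

def train_step(model, optimizer, batch):
    """One step on a batch drawn from either dataset. `batch` carries
    images plus exactly one kind of GT: a 'depth' map (LiDAR-supervised
    dataset) or 'labels' (semantically annotated dataset)."""
    optimizer.zero_grad()
    depth_pred, sem_logits = model(batch["image"])
    if "depth" in batch:  # depth-only supervision
        loss = nn.functional.l1_loss(depth_pred, batch["depth"])
    else:                 # semantic-only supervision
        loss = nn.functional.cross_entropy(sem_logits, batch["labels"])
    loss.backward()
    optimizer.step()
    return loss.item()
```

Alternating batches from the two datasets lets both heads shape the shared encoder, which is the mechanism by which semantic supervision can improve depth accuracy even when no image has both kinds of labels.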

After this research, it was clear that the upper-bound setting to train CNN-based MDE models consists in using LiDAR data as supervision. However, it would be cheaper and more scalable if we were able to train such models from monocular sequences. Obviously, this is far more challenging, but worth researching. Training MDE models on monocular sequences is possible by relying on structure-from-motion (SfM) principles to generate self-supervision. Nevertheless, camouflaged objects, visibility changes, static-camera intervals, textureless areas, and scale ambiguity diminish the usefulness of such self-supervision. To alleviate these problems, we perform MDE with virtual-world supervision and real-world SfM self-supervision. We call our proposal MonoDEVSNet. We compensate for the limitations of SfM self-supervision by leveraging virtual-world images with accurate semantic and depth supervision, while also addressing the virtual-to-real domain gap. MonoDEVSNet outperformed previous MDE CNNs trained on monocular and even stereo sequences. We have publicly released MonoDEVSNet at <https://github.com/HMRC-AEL/MonoDEVSNet>.
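As a rough illustration of this mixed objective, the sketch below, written under our own simplifying assumptions rather than taken from the released MonoDEVSNet code, combines SfM-style photometric self-supervision on real monocular frames with direct depth supervision on virtual-world images. The function names, loss weights, and batch keys are hypothetical, and the pose-estimation and domain-gap machinery is omitted.

```python
import torch.nn.functional as F

def photometric_loss(target_frame, reconstruction):
    """SfM self-supervision on real monocular sequences: a neighboring
    frame is warped into the target view using the predicted depth and
    estimated relative camera pose, and the reconstruction is compared
    to the target frame. A plain L1 term stands in here for the
    SSIM + L1 mixtures commonly used in practice."""
    return F.l1_loss(reconstruction, target_frame)

def virtual_depth_loss(pred_depth, gt_depth):
    """Direct supervision on virtual-world images, where dense and
    accurate depth ground truth comes for free from the simulator."""
    return F.l1_loss(pred_depth, gt_depth)

def combined_loss(real_batch, virtual_batch, w_sfm=1.0, w_virtual=1.0):
    """Hypothetical combined objective: self-supervision on real data
    plus supervision on virtual data. A domain-adaptation term that
    would align the two image domains is omitted for brevity."""
    l_sfm = photometric_loss(real_batch["target"],
                             real_batch["reconstruction"])
    l_virtual = virtual_depth_loss(virtual_batch["pred_depth"],
                                   virtual_batch["gt_depth"])
    return w_sfm * l_sfm + w_virtual * l_virtual
```

The virtual-world term supplies the dense, metric supervision that SfM alone cannot provide (resolving, for instance, the scale ambiguity), while the self-supervised term keeps the model anchored to real imagery.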

Finally, MDE is performed to produce 3D information to be used in downstream tasks related to on-board perception. We therefore also address the question of whether the standard metrics for MDE assessment are a good indicator for downstream MDE-based driving-related perception tasks. Using 3D object detection on point clouds as a proxy for on-board perception, we conclude that, indeed, the MDE evaluation metrics give rise to a ranking of methods that reflects relatively well the 3D object detection results we may expect.
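For reference, the "standard metrics" for MDE typically refer to the relative-error, RMSE, and threshold-accuracy measures popularized by Eigen et al.; a straightforward NumPy sketch of how they are computed:

```python
import numpy as np

def mde_metrics(pred, gt, eps=1e-6):
    """Standard MDE error/accuracy metrics, computed only over pixels
    with valid ground truth (e.g., where LiDAR returns exist)."""
    mask = gt > 0
    pred, gt = np.clip(pred[mask], eps, None), gt[mask]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "abs_rel":  np.mean(np.abs(gt - pred) / gt),
        "sq_rel":   np.mean((gt - pred) ** 2 / gt),
        "rmse":     np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "delta_1":  np.mean(thresh < 1.25),       # fraction within 25% of GT
        "delta_2":  np.mean(thresh < 1.25 ** 2),
        "delta_3":  np.mean(thresh < 1.25 ** 3),
    }
```

The finding that rankings under these per-pixel metrics correlate well with 3D object detection performance suggests they remain a reasonable model-selection signal for downstream perception.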