Improving mapping and localization using deep learning

Mohammad Altillawi successfully defended his dissertation in Computer Science on October 11, 202, and he is now a Doctor of Philosophy from the Universitat Autònoma de Barcelona.

What is the thesis about?

Localizing a camera from a single image in a previously visited area enables applications in many fields, such as robotics, autonomous driving, and augmented/virtual reality.

With recent advances in computational resources and the rapid emergence of accompanying open-source machine learning frameworks, deep learning has swiftly gained traction in both research communities and industry. As a result, recent works have opted for data-driven solutions to the problems at hand. In this dissertation, we propose ways to leverage deep learning to improve camera re-localization from a single image. After the introductory chapter, we identify and address the limitations of current approaches and provide data-driven solutions in the following chapters.

In chapter two, we propose a method that learns a geometric prior to identify reliable regions in a given scene. It selects reliable pixel features from the single input image and directly estimates their corresponding 3D coordinates for pose estimation. In essence, the proposed method does not treat the whole image as useful for localization. It avoids selecting keypoints (and thus the corresponding 3D points) from non-discriminative image regions such as the sky and trees, from dynamic objects such as cars and pedestrians, and from occlusions. By bypassing these outlier sources, the proposed method selects a controllable number of correspondences, improving localization efficiency and accuracy.
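
Below is a minimal sketch of this idea, assuming a hypothetical scene_net that, for a single RGB image, predicts a per-pixel reliability score and per-pixel 3D scene coordinates (as NumPy arrays); the selected 2D-3D correspondences are then passed to a standard robust PnP solver.

```python
import numpy as np
import cv2

def localize(image, scene_net, K, top_k=500):
    """Estimate the camera pose of a single image from selected 2D-3D matches."""
    # Hypothetical network output: per-pixel reliability (H, W) and
    # scene coordinates (H, W, 3), both returned as NumPy arrays.
    reliability, coords3d = scene_net(image)
    h, w = reliability.shape

    # Keep only the top-k most reliable pixels; the learned prior
    # down-weights sky, vegetation, dynamic objects and occlusions.
    flat_idx = np.argsort(reliability.ravel())[::-1][:top_k]
    ys, xs = np.unravel_index(flat_idx, (h, w))

    pts_2d = np.stack([xs, ys], axis=1).astype(np.float32)  # (k, 2) pixels
    pts_3d = coords3d[ys, xs].astype(np.float32)            # (k, 3) coords

    # Robust PnP on the selected correspondences (K is the 3x3 intrinsics).
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts_3d, pts_2d, K, distCoeffs=None,
        iterationsCount=1000, reprojectionError=3.0)
    return ok, rvec, tvec, inliers
```

Here top_k caps the number of correspondences, which is what makes the cost of pose estimation controllable regardless of image content.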

In chapter three, we propose to leverage a novel network of spatial and temporal geometric constraints, in the form of relative camera poses obtained from adjacent and distant cameras, to guide the training of a deep network for localization. In this context, the deep network acts as the map of the scene. By employing our constraints, we guide the network to encode them in its weights. With these encoded constraints, the localization pipeline achieves better accuracy at inference.
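
As an illustration, the following PyTorch sketch shows how such constraints can be attached to a pose-regression loss; the network outputs (translation t, unit quaternion q), the pair indices, and the loss weights are assumptions for illustration, not the exact formulation of the thesis.

```python
import torch
import torch.nn.functional as F

def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a.unbind(-1)
    bw, bx, by, bz = b.unbind(-1)
    return torch.stack([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw], dim=-1)

def qinv(q):
    """Inverse of a unit quaternion (its conjugate)."""
    return q * torch.tensor([1.0, -1.0, -1.0, -1.0], device=q.device)

def pose_loss(t_pred, q_pred, t_gt, q_gt):
    """L1 translation error plus L1 rotation (quaternion) error."""
    return F.l1_loss(t_pred, t_gt) + F.l1_loss(F.normalize(q_pred, dim=-1), q_gt)

def constrained_loss(t, q, t_gt, q_gt, pairs, w_rel=1.0):
    """Absolute pose loss plus relative-pose constraints over frame pairs
    `pairs` (a (P, 2) tensor of adjacent and distant frame indices)."""
    q = F.normalize(q, dim=-1)                     # keep predictions on S^3
    loss = pose_loss(t, q, t_gt, q_gt)
    i, j = pairs[:, 0], pairs[:, 1]
    # Relative pose implied by the predictions vs. relative pose from labels.
    q_rel, q_rel_gt = qmul(q[j], qinv(q[i])), qmul(q_gt[j], qinv(q_gt[i]))
    t_rel, t_rel_gt = t[j] - t[i], t_gt[j] - t_gt[i]
    return loss + w_rel * pose_loss(t_rel, q_rel, t_rel_gt, q_rel_gt)
```

The pairs can mix adjacent frames (temporal constraints) with distant co-visible frames (spatial constraints), pushing the network to respect the scene geometry rather than regress each absolute pose in isolation.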

In chapter four, we propose a novel pipeline that uses only the minimal available labels (i.e., camera poses) to let a data-driven pose estimation pipeline recover the geometry of the scene. The proposed framework uses a differentiable rigid alignment module to pass gradients into the deep network and adjust the learned geometry. It learns two 3D geometric representations (X, Y, Z coordinates) of the scene, one in the camera coordinate frame and the other in the global coordinate frame. Given a single image at inference, the proposed pipeline queries the deep network for the two geometric representations of the observed scene and aligns them to estimate the pose of the camera. As a result, the proposed method learns and incorporates geometric constraints for more accurate pose estimation.
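
The alignment step can be sketched as a differentiable Kabsch/Procrustes solver; the tensor shapes below (two (N, 3) point sets predicted for the same pixels) are assumptions, and the thesis formulation may differ in its weighting and robustness details.

```python
import torch

def rigid_align(p_cam, p_world):
    """Find R, t such that p_world ≈ R @ p_cam + t for two (N, 3) point
    sets. Built from SVD, so gradients flow back into both predictions."""
    mu_c, mu_w = p_cam.mean(dim=0), p_world.mean(dim=0)
    # Cross-covariance of the centred point sets.
    H = (p_cam - mu_c).t() @ (p_world - mu_w)
    U, S, Vt = torch.linalg.svd(H)
    # Keep R a proper rotation (det = +1) even under reflections.
    d = torch.sign(torch.det(Vt.t() @ U.t()))
    D = torch.diag(torch.stack([torch.ones_like(d), torch.ones_like(d), d]))
    R = Vt.t() @ D @ U.t()
    t = mu_w - R @ mu_c
    return R, t  # camera-to-world rotation and translation
```

Because the solver is written entirely in differentiable tensor operations, the pose error can be backpropagated through R and t into both predicted geometric representations.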

In chapter five, we explore the power of generative models for data generation for data-driven localization. We contribute a novel data-centric method to generate correspondences across different viewing and illumination conditions, enhancing the robustness of localization to long-term changes (daytime, weather, season). The proposed method represents the scene with a number of implicit representations (based on NeRFs), each corresponding to a different illumination condition. It then exploits the underlying geometry of these representations to generate accurate correspondences across the illumination variations. Using these correspondences enhances localization across long-term changes. In addition, we built an evaluation benchmark to assess the performance of feature extraction and description networks for localization across long-term illumination changes. Our work is a substantial stride toward robust long-term localization.
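
Schematically, once two representations share the same underlying geometry, correspondences can be obtained by unprojecting a view rendered under one condition with its depth and reprojecting into a view rendered under another. The function below is a simplified pinhole-camera illustration of that idea; all names and conventions are chosen here rather than taken from the thesis.

```python
import numpy as np

def cross_condition_matches(depth_a, K, T_a2w, T_w2b, h, w):
    """Map each pixel of view A (one illumination) to its pixel in view B
    (another illumination) through the common 3D geometry.
    depth_a: (h, w) depth rendered for view A; K: 3x3 intrinsics;
    T_a2w: 4x4 camera-to-world of A; T_w2b: 4x4 world-to-camera of B."""
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)

    # Unproject view-A pixels to 3D world points using the rendered depth.
    rays_cam = (np.linalg.inv(K) @ pix.T).T * depth_a.reshape(-1, 1)
    pts_w = (T_a2w[:3, :3] @ rays_cam.T).T + T_a2w[:3, 3]

    # Project the same world points into view B.
    pts_b = (T_w2b[:3, :3] @ pts_w.T).T + T_w2b[:3, 3]
    uv_b = (K @ pts_b.T).T[:, :2] / np.maximum(pts_b[:, 2:3], 1e-8)

    # Keep only points in front of camera B and inside its image bounds.
    valid = (pts_b[:, 2] > 0) & (uv_b[:, 0] >= 0) & (uv_b[:, 0] < w) \
            & (uv_b[:, 1] >= 0) & (uv_b[:, 1] < h)
    return pix[valid, :2], uv_b[valid]  # matched (x, y) pairs, A -> B
```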

In chapter six, we propose a point-based generative model to synthesize novel views. The proposed work identifies and solves a mismatch between geometry (point cloud) and appearance (images) that otherwise degrades renderings. The proposed method employs a connectivity graph between appearance and geometry. Instead of using the whole point cloud of a scene, it retrieves from the large point cloud only the points observed from the current camera perspective and uses them for rendering. This makes the rendering pipeline faster and more scalable. We also demonstrate the benefit of this connectivity graph for the recent 3D Gaussian splatting scene representation. Our proposal employs image reconstruction with generative adversarial training for enhanced rendering quality. Such a pipeline can be used to generate novel views to augment or create more labeled data for pose estimation.
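
The retrieval step can be illustrated with a small sketch: given a connectivity graph recording which reference view observes which 3D points (e.g., from an SfM reconstruction), only the points seen by the reference views nearest to the target camera are gathered for rendering. The graph layout and function names below are assumptions for illustration.

```python
import numpy as np

def retrieve_visible_points(target_center, ref_centers, graph, points, k=5):
    """graph[i] lists the indices of 3D points observed by reference view i;
    `points` is the full (N, 3) cloud, `ref_centers` the (M, 3) camera centres."""
    # Pick the k reference cameras nearest to the target viewpoint.
    dists = np.linalg.norm(ref_centers - target_center, axis=1)
    nearest = np.argsort(dists)[:k]

    # Union of the points those views observe = candidate visible set.
    visible_ids = np.unique(np.concatenate([graph[i] for i in nearest]))
    return points[visible_ids], visible_ids
```

The returned subset is what the point-based renderer (or the 3D Gaussian splats) consumes, so rendering cost scales with what the target camera can actually see rather than with the size of the full scene.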