Towards Handling 3D Shape, Terrain Elevation, and Visual Relocalization with Implicit Neural Representation
Shun Yao successfully defended his dissertation in Computer Science on November 7, 2024, and is now a Doctor of Philosophy from the Universitat Autònoma de Barcelona.
What is the thesis about?
Our real world exists in continuous physical space; computer vision applications, however, must quantify its physical properties. For example, we represent visual information as RGB intensities, terrain as elevation values, 3D shapes as surfaces, entities as occupied volumes, etc. With advances in machine learning, implicit neural representation (INR) models, which parameterize these physical properties with coordinate-based mapping functions, offer promising solutions that are more accurate, higher-fidelity, more expressive, faster to implement, and more memory-efficient. This dissertation focuses on developing INR models for 3D shape representation, terrain elevation representation, multi-scale digital elevation model (DEM) super-resolution, and multi-scene visual relocalization.
In the case of 3D shape representation, we explore hierarchical and topological structures for learning latent representations of 3D geometric data. Noting the limitations of existing graph convolution networks in resolution and structural complexity, we introduce INRs to improve representation granularity and flexibility. Instead of relying on explicit formats such as points, lines, and surfaces, the INR regresses the signed distance from any 3D point to the shape's surface; the 3D shape is then represented as an iso-surface extracted from the predicted signed distances. However, approximating an entire 3D shape with a single neural network would require long training times and numerous network parameters. To improve learning efficiency and reduce the number of trainable parameters, we propose an INR model that uses multiple latent codes to learn local geometries rather than the whole shape. Additionally, we introduce an auxiliary graph convolution network that assigns these latent codes to specific shape parts, and we propose a novel geometric loss function to facilitate mutual learning among the latent codes.
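As a rough illustration of the signed-distance formulation, the PyTorch sketch below shows how a local latent code and a 3D query point can be mapped to a signed distance, with the surface recovered as the zero iso-level of the predicted field. All layer sizes and names here are hypothetical assumptions for illustration, not the architecture from the thesis.

```python
import torch
import torch.nn as nn

class LocalSDFNet(nn.Module):
    """Illustrative INR: maps a local latent code plus a 3D query
    point to the signed distance from that point to the surface."""
    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted signed distance
        )

    def forward(self, latent, xyz):
        # latent: (B, latent_dim) code of a local shape part
        # xyz:    (B, 3) query coordinates in that part's frame
        return self.mlp(torch.cat([latent, xyz], dim=-1))

# The shape is the zero iso-surface of this field: points with a
# negative prediction lie inside, positive predictions lie outside.
```

Conditioning a small shared decoder on per-part latent codes, rather than one global code, is what allows fine local geometry to be captured without a single enormous network.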
For terrain elevation representation, we study the precision loss caused by the discretization of existing DEMs. Furthermore, different applications require specific discrete representations, necessitating format conversions that inevitably compromise the fidelity of the elevation data. To address these sources of inaccuracy, we develop a continuous DEM representation (CDEM), an INR model from which height values can be queried at any arbitrary position, aiming to preserve the continuity that topographic elevation has in the real world.
Next, we train an encoder-decoder network to learn the CDEM from discrete elevation data for multi-scale DEM super-resolution. To improve accuracy, the model predicts the elevation bias between the query position and its closest known position rather than the absolute height. To help the model capture high-frequency variations, we apply positional encoding to map query positions into a higher-dimensional space. Our experiments demonstrate that our model achieves more accurate elevation values and preserves more detailed terrain structures than competing methods.
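To make these two ingredients concrete, here is a minimal sketch of a NeRF-style positional encoding and, in the comments, of the bias-prediction idea. The function name, frequency count, and formula are illustrative assumptions, not the implementation from the thesis.

```python
import torch

def positional_encoding(coords, num_freqs=6):
    """Map 2D query positions to a higher-dimensional space so a
    coordinate-based decoder can fit high-frequency terrain detail."""
    out = [coords]
    for k in range(num_freqs):
        out.append(torch.sin(2 ** k * torch.pi * coords))
        out.append(torch.cos(2 ** k * torch.pi * coords))
    return torch.cat(out, dim=-1)  # (..., 2 + 4 * num_freqs)

# Bias prediction (schematic): instead of regressing absolute height,
# the decoder predicts an offset relative to the nearest known grid
# sample, e.g.
#   h(q) = h_grid(nearest(q)) + decoder(features, positional_encoding(q))
# so the network only has to model the residual terrain variation.
```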
For visual relocalization across multiple scenes, we focus on efficient learning that needs neither prepared scene geometry nor time-consuming pre-built scene representations. Recently, scene coordinate (SC) regression models have demonstrated that accurate visual relocalization can be achieved efficiently using only posed images and camera intrinsic parameters. However, extending SC regression models to multiple scenes typically requires retraining the model parameters or using pre-built reference landmarks, both of which are time-consuming. To avoid this overhead, we propose representing multiple scenes within one global reference coordinate frame and training an SC regression model (i.e., an INR model) on posed images from all scenes simultaneously. To reduce the impact of visual ambiguities, we introduce a scene embedding as a prior condition for the model's predictions. To enhance generalizability across scenes, we propose the scene-conditional regression-adjust (SCRA) module, which dynamically generates parameters to adapt flexibly to the scene embedding. Additionally, we introduce modulation and complement modules to enhance the model's applicability at both the image-sample and scene levels.
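Scene-conditional adaptation of this kind can be pictured as a small hypernetwork that turns the scene embedding into modulation parameters for a shared regressor. The FiLM-style sketch below is one plausible realization under that assumption, not the exact SCRA design; all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class SceneConditionalLayer(nn.Module):
    """Sketch of a scene-conditioned regression layer: a hypernetwork
    generates per-scene scale/shift parameters that modulate the
    shared scene-coordinate features (FiLM-style conditioning)."""
    def __init__(self, feat_dim=512, embed_dim=64):
        super().__init__()
        self.fc = nn.Linear(feat_dim, feat_dim)
        # generates (scale, shift) from the scene embedding
        self.hyper = nn.Linear(embed_dim, 2 * feat_dim)

    def forward(self, x, scene_embed):
        # x:           (B, feat_dim) shared image features
        # scene_embed: (B, embed_dim) learned per-scene code
        gamma, beta = self.hyper(scene_embed).chunk(2, dim=-1)
        return torch.relu(gamma * self.fc(x) + beta)
```

The appeal of generating parameters from the embedding, rather than training one regressor per scene, is that a single network can serve many scenes while adding a new scene only requires a new embedding.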
Keywords
Implicit neural representation, 3D shape representation, continuous DEM model, DEM super-resolution, visual relocalization, scene coordinate prediction, conditional adaptation, deep learning.