On the Interplay of 2D and 3D Representations for Robotic Perception


Author / Producer

Date

2025

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Artificial intelligence (AI) is increasingly shaping our daily lives by enabling novel applications in the digital domain. One of the avenues with the greatest potential for societal impact lies in bringing these technologies into the physical world, designing robots that can use AI to act autonomously in unconstrained environments and support humans in complex tasks. A crucial challenge for achieving this kind of general-purpose embodied intelligence concerns the way autonomous agents perceive the world. To carry out tasks involving reasoning and action, robots need to form powerful representations of the environment and of its objects, and to perceive the world as three-dimensional. However, current robotic pipelines often rely on coarse 3D representations that mostly encode geometry, and on perception modules that are trained before deployment and process input data simply as 2D signals. As a consequence of this design, 3D representations often fail to capture the semantic content required for complex tasks, while 2D perception modules inherently lack 3D awareness.

This thesis studies the interplay between 2D modules and 3D representations for robotic perception. Our hypothesis is that a tighter integration between these two components can provide mutual benefit and improve the performance of autonomous agents in perception tasks.

In the first part, we investigate techniques to enforce knowledge transfer between the 3D representation and the 2D perception module. We adopt neural fields learned through neural inverse rendering as our 3D representation and propose a framework to train them jointly with a perception network, exploiting the differentiability of both modules. We apply this framework to the continual adaptation of a semantic segmentation network as an agent visits multiple scenes in sequence, using the view consistency enforced by the 3D representation as a self-supervision signal for the 2D network. We show that our joint training approach effectively improves the performance of both components. We further propose to exploit the novel view synthesis ability of the 3D representation to generate data to be provided as input to the network. We demonstrate that this generative approach not only counteracts forgetting of previous knowledge as the network is adapted to new environments, but also enables the model to continue learning on previous scenes.

We also adopt our framework based on joint training and differentiable rendering from novel views in the context of learning object-level representations. We combine the neural 3D representation with a correspondence learning method and propose a framework to train a pose estimation network for novel objects using only real images as input. Our approach further demonstrates the effectiveness of neural fields as flexible object representations that can accurately model geometry and appearance and encode task-specific features. Our proposed pipeline allows new object representations to be incorporated easily and compactly, and we envision that these can form a long-term database supporting robots in higher-level tasks.
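To illustrate the joint-training idea of the first part, the sketch below shows one minimal way such a coupling could look in PyTorch: labels rendered from a semantic neural field provide view-consistent pseudo-labels for the 2D segmentation network, while the network's predictions in turn supervise the field. All module definitions, shapes, and the single-sample rendering are illustrative assumptions, not the thesis implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    NUM_CLASSES = 21

    class TinySemanticField(nn.Module):
        """Stand-in for a semantic neural field: maps 3D points to class logits."""
        def __init__(self):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, NUM_CLASSES))

        def render_semantics(self, rays_o, rays_d, t):
            # Single-sample "rendering" along each ray, purely for illustration;
            # a real field would alpha-composite many samples along the ray.
            pts = rays_o + t.unsqueeze(-1) * rays_d       # (N, 3)
            return self.mlp(pts)                          # (N, NUM_CLASSES) logits

    class TinySegNet(nn.Module):
        """Stand-in for the 2D segmentation network (per-pixel logits)."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, NUM_CLASSES, 1))

        def forward(self, img):
            return self.net(img)                          # (B, NUM_CLASSES, H, W)

    field, seg_net = TinySemanticField(), TinySegNet()
    opt = torch.optim.Adam(list(field.parameters()) + list(seg_net.parameters()), lr=1e-3)

    # Dummy batch: one RGB image plus one ray per pixel (random placeholders).
    H = W = 32
    img = torch.rand(1, 3, H, W)
    rays_o = torch.zeros(H * W, 3)
    rays_d = F.normalize(torch.rand(H * W, 3), dim=-1)
    t = torch.full((H * W,), 2.0)

    for step in range(5):
        seg_logits = seg_net(img).permute(0, 2, 3, 1).reshape(-1, NUM_CLASSES)
        field_logits = field.render_semantics(rays_o, rays_d, t)

        # 2D -> 3D: distill the network's predictions into the field.
        loss_field = F.cross_entropy(field_logits, seg_logits.argmax(dim=-1).detach())
        # 3D -> 2D: rendered, view-consistent labels act as pseudo-labels for the network.
        loss_seg = F.cross_entropy(seg_logits, field_logits.argmax(dim=-1).detach())

        opt.zero_grad()
        (loss_field + loss_seg).backward()
        opt.step()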
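The replay mechanism based on novel view synthesis can likewise be sketched as follows: views rendered from the neural fields of previously visited scenes are mixed into the current training batch to counteract forgetting. The SceneField interface, the random pose sampling, and the mixing ratio are hypothetical placeholders rather than the actual pipeline.

    import random
    import torch

    class SceneField:
        """Placeholder for a per-scene neural field able to synthesize novel views."""
        def __init__(self, scene_id):
            self.scene_id = scene_id

        def render(self, pose):
            # Would return a synthesized RGB view and its rendered semantic labels.
            img = torch.rand(3, 32, 32)
            labels = torch.randint(0, 21, (32, 32))
            return img, labels

    def sample_random_pose():
        return torch.eye(4)  # placeholder camera pose

    def build_batch(current_batch, past_scene_fields, replay_ratio=0.5):
        """Mix real samples from the current scene with views rendered from the
        fields of previously visited scenes, as a replay signal against forgetting."""
        batch = list(current_batch)
        n_replay = int(len(current_batch) * replay_ratio)
        for _ in range(n_replay):
            field = random.choice(past_scene_fields)
            batch.append(field.render(sample_random_pose()))
        random.shuffle(batch)
        return batch

    past_fields = [SceneField(i) for i in range(3)]
    current = [(torch.rand(3, 32, 32), torch.randint(0, 21, (32, 32))) for _ in range(8)]
    mixed = build_batch(current, past_fields)
    print(len(mixed))  # 12 samples: 8 from the current scene + 4 replayed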
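For the object-level part, the downstream pose recovery from 2D-3D correspondences can be illustrated with a standard PnP step; here the correspondences are synthesized from a known pose so that the example runs on its own, whereas in the pipeline described above they would be produced by the learned components. All keypoints, intrinsics, and poses are placeholders.

    import numpy as np
    import cv2

    # Hypothetical object-frame 3D keypoints (e.g., sampled on the object surface).
    obj_pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                        [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])
    K = np.array([[500.0, 0.0, 320.0],
                  [0.0, 500.0, 240.0],
                  [0.0, 0.0, 1.0]])
    dist = np.zeros(5)  # no lens distortion in this toy example

    # Ground-truth pose, used here only to synthesize the 2D observations.
    rvec_gt = np.array([0.1, -0.2, 0.05])
    tvec_gt = np.array([0.02, -0.01, 0.5])
    img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, dist)

    # In a correspondence-based pipeline, img_pts would instead be predicted
    # 2D locations matched to the 3D keypoints; PnP then recovers the pose.
    ok, rvec, tvec = cv2.solvePnP(obj_pts, img_pts, K, dist)
    print(ok, rvec.ravel(), tvec.ravel())  # should be close to rvec_gt, tvec_gt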
The second part of this thesis follows a different vision for incorporating 3D knowledge into perception modules, focusing on structural biases for geometry that can be built into 2D architectures. We center our study on the problem of surface normal integration, which consists in reconstructing a surface, in the form of a depth map, from a corresponding input surface normal map. We propose a novel mathematical formulation of the relation between surface normals and depth, and incorporate it into a framework that explicitly models depth discontinuities and, for the first time, handles normal maps captured by non-ideal (but still central) cameras. We subsequently address the limited scalability of state-of-the-art normal integration methods and propose to recast the problem as the estimation of the relative scales of continuous surface components. Our approach reduces the complexity of existing methods by one order of magnitude and scales to high-resolution input normal maps.

In conclusion, this thesis investigates the design, use, and interplay of effective representations and data-driven modules for 3D-aware robotic perception. We close with an outlook on future research directions and a discussion of potential developments in learning-based modules for 3D geometry.
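As a point of reference for the surface normal integration problem addressed in the second part, the sketch below implements a textbook-style baseline under an ideal pinhole camera and without discontinuity handling (thus not the formulation proposed in the thesis). For a pinhole camera with focal length f and principal point (cu, cv), the normals n = (n1, n2, n3) constrain the gradient of log-depth: d(log z)/du = -n1/ñ and d(log z)/dv = -n2/ñ, with ñ = n1(u - cu) + n2(v - cv) + f·n3, which leads to a sparse linear least-squares problem.

    import numpy as np
    from scipy.sparse import lil_matrix
    from scipy.sparse.linalg import lsqr

    def integrate_normals_perspective(normals, f, cu, cv):
        """normals: (H, W, 3) unit normal map in camera coordinates.
        Returns a depth map, defined up to a global scale factor."""
        H, W, _ = normals.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        n1, n2, n3 = normals[..., 0], normals[..., 1], normals[..., 2]
        denom = n1 * (u - cu) + n2 * (v - cv) + f * n3
        denom = np.where(np.abs(denom) < 1e-8, 1e-8, denom)
        gu = -n1 / denom   # target d(log z)/du implied by the normals
        gv = -n2 / denom   # target d(log z)/dv implied by the normals

        npx = H * W
        idx = lambda y, x: y * W + x
        A = lil_matrix((2 * npx, npx))
        b = np.zeros(2 * npx)
        r = 0
        for y in range(H):
            for x in range(W):
                if x + 1 < W:          # forward difference along u
                    A[r, idx(y, x + 1)] = 1.0
                    A[r, idx(y, x)] = -1.0
                    b[r] = gu[y, x]
                    r += 1
                if y + 1 < H:          # forward difference along v
                    A[r, idx(y + 1, x)] = 1.0
                    A[r, idx(y, x)] = -1.0
                    b[r] = gv[y, x]
                    r += 1
        log_z = lsqr(A[:r].tocsr(), b[:r])[0]
        log_z -= log_z.mean()          # fix the unconstrained global offset
        return np.exp(log_z).reshape(H, W)

    # Toy usage: a fronto-parallel plane (normals pointing straight at the camera)
    # integrates to a constant depth map.
    nmap = np.zeros((16, 16, 3))
    nmap[..., 2] = -1.0
    depth = integrate_normals_perspective(nmap, f=100.0, cu=8.0, cv=8.0)
    print(depth.std() < 1e-6)  # True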

Publication status

published

Editor

Contributors

Examiner : Siegwart, Roland
Examiner : Tombari, Federico
Examiner : Vedaldi, Andrea

Book title

Journal / series

Volume

Pages / Article No.

Publisher

ETH Zurich

Event

Edition / version

Methods

Software

Geographic location

Date collected

Date created

Subject

Robotics; Computer vision; Perception; 3D representation; Inductive bias; Surface normal; Pose estimation; Neural fields

Organisational unit

03737 - Siegwart, Roland Y. / Siegwart, Roland Y.

Notes

Funding

101017008 - Enhancing Healthcare with Assistive Robotic Mobile Manipulation (EC)

Related publications and datasets