A Farewell to Supervision: Towards Self-supervised Autonomous Systems



Author / Producer

Date

2023

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

In the past decade, computer vision has progressed by leaps and bounds. Deep learning based methods have crushed benchmark after benchmark in a paradigm shift that has replaced precision-engineered, hand-crafted approaches with neural networks that simply learn from millions of input-output examples. As each neural network is task- and application-specific, tackling a new task now largely reduces to creating a dataset that can teach a neural network to solve it. These datasets are often built up by hand, with examples annotated one by one through computer user interfaces, and the work is frequently outsourced to low-income countries. This creates additional challenges, as the workers might not be domain experts on the data being annotated and must, in turn, be taught. Because creating these datasets is expensive, the tasks and domains where we can deploy robots with advanced perception skills are severely limited: the costs are only bearable for applications that are very general and have massive markets. Creating robots for niche industrial use cases, or simply adapting existing robots to new domains, is infeasible. The costs of retraining and annotation also often recur whenever significant changes are made to the robot's hardware, as the data distribution changes with it.

In this thesis, we develop techniques to tackle this data problem for robot perception tasks. We approach it from multiple directions, both by making better use of unlabeled data and by making better use of the human teacher's time. In the first part of this thesis, we develop a method to quickly build up 3D object keypoint datasets that teach robots about semantic points on objects relevant to custom tasks. We design a pipeline that uses the proprioceptive sensing built into the robot, together with 3D geometry, to propagate examples from one annotated frame to the next. We then use these examples to bootstrap a keypoint detection system, which can be deployed in minutes instead of days. In the second part, we leverage neural implicit representations to extract dense segmentation masks from sparse user input, and use the representation to synthesize novel examples of the scene to better teach a downstream object detection system. In the third part of this thesis, we design an interactive 3D volumetric scene annotation system that makes better use of the expert user's time. We do this by leveraging self-supervised learning, i.e. techniques designed to learn from unlabeled data, to augment the data collected by the robot, thus raising the level of abstraction and yielding a smarter system. In the last part of the thesis, we attempt to use information learned from large-scale internet image-caption datasets, grounding it in real-world 3D scenes, as a way to learn without any direct human supervision at all. Finally, we sketch out a path forward for developing robust, continuously improving perception systems for robotic applications.
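The frame-to-frame propagation described in the first part can be illustrated with a short sketch: once a keypoint annotated in a single frame is lifted to 3D, and the camera pose of every other frame is known from the robot's proprioception (e.g. forward kinematics), the label can be re-projected into each frame via the standard pinhole model. This is an illustrative reconstruction of that idea, not the thesis's actual pipeline; the intrinsics, poses, and point below are made-up values.

```python
import numpy as np

def project(K, T_world_cam, p_world):
    """Project a 3D world point into pixel coordinates.

    K           -- 3x3 camera intrinsic matrix
    T_world_cam -- 4x4 pose of the camera in the world frame
    p_world     -- 3-vector, keypoint position in the world frame
    """
    T_cam_world = np.linalg.inv(T_world_cam)              # world -> camera
    p_cam = T_cam_world[:3, :3] @ p_world + T_cam_world[:3, 3]
    uvw = K @ p_cam                                       # homogeneous pixel
    return uvw[:2] / uvw[2]                               # (u, v)

# Toy intrinsics and a single hand-annotated keypoint lifted to 3D.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
p_world = np.array([0.5, 0.0, 1.0])

# A two-frame trajectory: identity pose, then the camera shifted 0.1 m in x.
T2 = np.eye(4)
T2[0, 3] = 0.1
trajectory = [np.eye(4), T2]

# One manual annotation becomes a label in every frame of the trajectory.
labels = [project(K, T, p_world) for T in trajectory]
```

The same loop, run over a real recorded trajectory with depth-based lifting, is what turns a handful of clicks into a full training set for the keypoint detector.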

Publication status

published

Contributors

Examiner: Siegwart, Roland
Examiner: Davison, Andrew
Examiner: Chung, Jen Jen

Publisher

ETH Zurich

Subject

Robotics; Legged Locomotion; Quadrupedal Robots; Motion Planning and Control; Trajectory Optimization; Deep Reinforcement Learning; Self-supervision; Computer Vision (CV)

Organisational unit

03737 - Siegwart, Roland Y. / Siegwart, Roland Y.
