Open access
Author
Date
2023
Type
- Doctoral Thesis
ETH Bibliography
yes
Abstract
In the past decade, computer vision has progressed by leaps and bounds. Deep-learning-based methods have crushed benchmark after benchmark in a paradigm shift that has replaced precision-engineered, hand-crafted approaches with neural networks that simply learn from millions of input-output examples. Because each neural network is task- and application-specific, the main problem in tackling a new task has become creating the dataset that can teach a network to solve it. These datasets are often built up by hand, with examples annotated one by one through computer user interfaces, and the work is frequently outsourced to low-income countries. This creates additional challenges, as the workers might not be domain experts on the data being annotated and must, in turn, be taught themselves.
This severely limits the tasks and domains in which we can deploy robots with advanced perception skills, as creating these datasets is expensive. The costs are bearable only for applications that are very general and have massive markets. Creating robots for niche industrial use cases, or simply adapting existing robots to new domains, is infeasible. The costs of annotating data and retraining also often have to be borne when significant changes are made to the robot's hardware, as the data distribution has changed.
In this thesis, we develop techniques to tackle this data problem for robot perception tasks. We approach it from multiple directions, both by making better use of unlabeled data and by making better use of the human teacher's time.
In the first part of this thesis, we develop a method by which we can quickly build up 3D object keypoint datasets to teach robots about semantic points on objects that are relevant for custom tasks. We design a pipeline to make use of proprioceptive sensing built into the robot and 3D geometry to propagate examples from one annotated frame to the next. We then use these examples to bootstrap a keypoint detection system, which can be deployed in minutes instead of days.
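A minimal sketch of this propagation step, assuming a pinhole camera model, per-keypoint depth, and camera poses recovered from the robot's proprioception (e.g. joint-encoder forward kinematics); the function names and interface below are illustrative, not the pipeline used in the thesis:

import numpy as np

def backproject(uv, depth, K):
    """Lift a pixel (u, v) with known depth into the camera frame."""
    u, v = uv
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    return np.array([x, y, depth])

def propagate_keypoint(uv, depth, K, T_world_cam_a, T_world_cam_b):
    """Project a keypoint annotated in frame A into frame B.

    T_world_cam_* are 4x4 homogeneous camera-to-world transforms, here
    assumed to come from proprioceptive sensing (forward kinematics).
    """
    p_cam_a = np.append(backproject(uv, depth, K), 1.0)    # homogeneous point in camera A
    p_world = T_world_cam_a @ p_cam_a                       # camera A -> world
    p_cam_b = np.linalg.inv(T_world_cam_b) @ p_world        # world -> camera B
    uv_b = K @ (p_cam_b[:3] / p_cam_b[2])                   # perspective projection
    return uv_b[:2]

Each propagated keypoint then serves as a training example for the bootstrapped keypoint detector, so a single annotated frame can label an entire recorded sequence.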
In the second part, we leverage neural implicit representations to extract dense segmentation masks from sparse user input, and we use the representation to synthesize novel examples of the scene that better teach a downstream object detection system.
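As a rough illustration of the idea (not the thesis pipeline), a coordinate-based MLP can be fit to a handful of clicked points and then queried densely at every back-projected pixel; the architecture, loss, and training loop below are assumptions chosen for brevity:

import torch
import torch.nn as nn

class ImplicitMask(nn.Module):
    """Tiny coordinate network mapping a 3D point to a foreground logit."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz):
        return self.net(xyz).squeeze(-1)

def fit(model, clicked_xyz, clicked_labels, steps=500):
    """Fit the implicit mask to sparse click annotations (N x 3 points, N binary labels)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = bce(model(clicked_xyz), clicked_labels.float())
        loss.backward()
        opt.step()
    return model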
In the third part of this thesis, we design an interactive 3D volumetric scene annotation system that makes better use of the expert user's time. We do this by leveraging self-supervised learning, techniques designed to learn from unlabeled data, to augment the data collected by the robot, thereby raising the level of abstraction and ending up with a smarter system.
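One simple way such self-supervised features can stretch the expert's input further is label propagation in a frozen embedding space: points the user labels act as seeds, and the remaining points inherit the label of their nearest seed in feature space rather than in raw input space. The nearest-neighbour rule and interface below are illustrative assumptions, not the system described in the thesis:

import numpy as np

def propagate_labels(embeddings, seed_idx, seed_labels):
    """Assign every point the label of its nearest annotated seed in feature space.

    embeddings:  (N, D) self-supervised features for all points/voxels
    seed_idx:    indices of the few points the expert annotated
    seed_labels: (S,) array of their class labels
    """
    seeds = embeddings[seed_idx]                                       # (S, D)
    # Pairwise distances between every point and each labelled seed.
    d = np.linalg.norm(embeddings[:, None, :] - seeds[None, :, :], axis=-1)
    return seed_labels[np.argmin(d, axis=1)]                           # (N,) dense labels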
In the last part of the thesis, we attempt to use information learned from large-scale internet image-caption datasets, grounding it in real-world 3D scenes, as a way to learn without any direct human supervision at all.
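Schematically, vision-language features (e.g. from a CLIP-style model trained on image-caption pairs) can be attached to 3D points by averaging them over the views in which each point is visible, after which a text embedding queries the scene by cosine similarity; the fusion scheme and interfaces below are illustrative assumptions, not the method developed in the thesis:

import numpy as np

def fuse_features_to_points(per_view_feats, per_view_visibility):
    """Average image-space features over the views in which each 3D point is visible.

    per_view_feats:      list of (N, D) arrays, features sampled at each point's projection
    per_view_visibility: list of (N,) boolean masks
    """
    feats = np.stack(per_view_feats)                 # (V, N, D)
    vis = np.stack(per_view_visibility)[..., None]   # (V, N, 1)
    return (feats * vis).sum(0) / np.clip(vis.sum(0), 1, None)

def query(point_feats, text_feat):
    """Score every 3D point against a text embedding with cosine similarity."""
    p = point_feats / np.linalg.norm(point_feats, axis=-1, keepdims=True)
    t = text_feat / np.linalg.norm(text_feat)
    return p @ t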
Finally, we sketch out a path forward for developing robust, continuously improving perception systems for robotic applications.
Permanent link
https://doi.org/10.3929/ethz-b-000650308
Publication status
published
External links
Search print copy at ETH Library
Publisher
ETH Zurich
Subject
Robotics; Legged Locomotion; Quadrupedal Robots; Motion Planning and Control; Trajectory Optimization; Deep Reinforcement Learning; Self supervision; Computer Vision (CV)
Organisational unit
03737 - Siegwart, Roland Y. / Siegwart, Roland Y.