
Open access
Author
Date
2019
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Scene understanding is one of the fastest-growing areas of computer vision research. This growth is driven mainly by deep learning techniques, which have boosted performance on popular benchmarks for well-studied tasks and have made it possible to approach tasks that were very difficult to solve with traditional techniques. This dissertation examines how traditional low-level features, such as boundaries and points, can help in tackling higher-level scene understanding tasks such as detection, segmentation, and 3D reconstruction.
First, we propose a hierarchical grouping algorithm that uses deep-learned boundaries and their orientations. We examine how grouping from predicted boundaries can help object detection and semantic segmentation when plugged into the corresponding pipelines.
Second, we use human-generated points for guided object segmentation. We show how to obtain segmentation masks from extreme points provided by humans, and how this technique speeds up the otherwise time-consuming process of annotating data for segmentation.
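Extreme points are the left-most, right-most, top-most, and bottom-most points of an object, and together they already pin down its bounding box. The sketch below is only an illustration of that geometry, not the method of the thesis: it derives the four extreme points from a binary mask (as one might, by assumption, simulate annotator clicks from ground-truth masks) and recovers the box from them. The function names `extreme_points` and `box_from_extreme_points` are hypothetical, and the segmentation network that would consume the points is not shown.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]  # (row, col)


def extreme_points(mask: List[List[int]]) -> Tuple[Point, Point, Point, Point]:
    """Return the (left, right, top, bottom) extreme points of a binary mask.

    Hypothetical helper. Ties are broken by taking the first pixel found in
    row-major order, because min/max return the first optimal item encountered.
    """
    # Collect all foreground pixel coordinates in row-major order.
    pixels = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pixels:
        raise ValueError("mask contains no foreground pixels")
    left = min(pixels, key=lambda p: p[1])    # smallest column
    right = max(pixels, key=lambda p: p[1])   # largest column
    top = min(pixels, key=lambda p: p[0])     # smallest row
    bottom = max(pixels, key=lambda p: p[0])  # largest row
    return left, right, top, bottom


def box_from_extreme_points(points: Sequence[Point]) -> Tuple[int, int, int, int]:
    """Return the tight box (row_min, col_min, row_max, col_max) spanned by the points."""
    rows = [p[0] for p in points]
    cols = [p[1] for p in points]
    return min(rows), min(cols), max(rows), max(cols)


if __name__ == "__main__":
    mask = [
        [0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 1, 1, 0],
        [0, 0, 0, 0, 0],
    ]
    pts = extreme_points(mask)
    print(pts)                           # ((2, 1), (2, 3), (1, 2), (3, 2))
    print(box_from_extreme_points(pts))  # (1, 1, 3, 3)
```

Four clicks therefore carry at least the information of a bounding box, plus points lying on the object boundary, which is part of why extreme points are a cheap yet informative form of human input.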
Third, we show how automatically detected keypoints help 3D reconstruction in a challenging environment: robot-assisted retinal surgery. The task is to provide visual guidance during surgery by using two stereo cameras mounted on the surgical microscope. We propose a method that performs calibration, 3D registration, and 3D reconstruction in a single pipeline by detecting specific robot keypoints and by obtaining 3D-to-2D correspondences simply by moving the robot.
Last, we examine the interplay of low-level and high-level tasks when they are trained jointly in a single neural network. We propose ways to overcome problems such as task interference and limited capacity, which arise from jointly training for many different, unrelated tasks. We propose a universal network that can perform all tasks, but only one task at a time.
All in all, we show how to predict low-level features and how they contribute to different pipelines: a) in combination with deep networks trained for scene understanding, b) as human-generated input, c) in combination with 3D reconstruction, and d) by training them jointly with higher-level tasks.
Permanent link
https://doi.org/10.3929/ethz-b-000382817
Publication status
published
Contributors
Examiner: Van Gool, Luc
Examiner: Tombari, Federico
Examiner: Kokkinos, Iasonas
Examiner: Pont-Tuset, Jordi
Publisher
ETH Zurich
Subject
Scene understanding; Segmentation; Multi-task learning; Boundary detection; 3D reconstruction
Organisational unit
03514 - Van Gool, Luc / Van Gool, Luc