
Open access
Author
Date
2020
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Automatically estimating human body pose from camera input enables many real-world applications, such as human activity recognition, autonomous driving, assistive robotics and sports analysis. This demanding task has therefore attracted great interest from the computer vision community for decades and has seen extraordinary progress in recent years. The success can be credited to two main factors: effective appearance modeling by deep neural networks and the availability of large-scale annotated datasets. However, current systems are not flawless, and many challenging issues remain, especially when people are in complex articulations or when several people stand close together and occlude one another. We argue that incorporating prior knowledge, such as the inherent structure of the human body, into the network design is equally essential. To this end, this thesis studies how to design efficient algorithms that jointly optimize the parameters of deep feature extractors and of the probabilistic inference models that encode these priors.
First, we address the problem of single-person 2D pose estimation. We present a deep structured model that explicitly incorporates skeletal priors into the network design to regularize predictions and enforce temporal consistency. Inference is performed by an embedded layer that passes messages along the edges of the loopy spatio-temporal graph. The entire architecture can be trained end-to-end.
Second, we study the challenging case in which an unknown number of people appear in the image. We explore connecting deep networks with graph decomposition in a jointly trainable framework by introducing an unconstrained binary cubic formulation. The cycle consistency constraints are formulated directly in the objective function. The resulting optimization problem can be viewed as a Conditional Random Field (CRF) in which the cycle constraints are represented as high-order potentials. The parameters of the CRF inference are optimized together with the front-end feature extractor.
The final part of the thesis concerns the estimation of 3D human pose. Combining the refinement capabilities of iterative gradient-based optimization with the robustness of neural networks, we propose a method to lift 2D pose to 3D in which a statistical 3D human shape model is leveraged to regularize the lifting.
Permanent link
https://doi.org/10.3929/ethz-b-000478108
Publication status
published
Publisher
ETH Zurich
Organisational unit
03979 - Hilliges, Otmar / Hilliges, Otmar