Structured Articulated Human Pose Estimation with Neural Networks

Song, Jie

doi:10.3929/ethz-b-000478108

Download

Full text (PDF, 30.07 MB)

Open access

Author

Song, Jie

Date

2020

Type

Doctoral Thesis

ETH Bibliography

yes

Rights / license

In Copyright - Non-Commercial Use Permitted

Abstract

Automatically estimating human body pose from camera input enables many real-world applications, such as human activity recognition, autonomous driving, assistive robotics and sports analysis. This demanding task has therefore attracted great interest from the computer vision community for decades, and it has seen extraordinary progress in recent years. This success can be credited to two main factors: effective appearance modeling by deep neural networks and the availability of large-scale annotated datasets. However, current systems are far from flawless: many challenging issues remain, especially when people adopt complex articulations or when several people stand close together and occlude one another. We argue that incorporating prior knowledge, such as the inherent structure of the human body, into the network design is equally essential. To this end, this thesis studies how to design efficient algorithms that jointly optimize the parameters of deep feature extractors and of the probabilistic inference models that encode these priors. First, we address single-person 2D pose estimation. We present a deep structured model that explicitly incorporates skeletal priors into the network design to regularize predictions and enforce temporal consistency. Inference is performed by an embedded layer that passes messages along the edges of the loopy spatio-temporal graph, and the entire architecture can be trained end-to-end. Second, we study the challenging case in which the image depicts an unknown number of people. We combine deep networks and graph decomposition in a jointly trainable framework by introducing an unconstrained binary cubic formulation. The cycle consistency constraints are formulated directly in the objective function. The resulting optimization problem can be viewed as a Conditional Random Field (CRF) in which the cycle constraints are represented as high-order potentials.
The parameters of the CRF inference are optimized together with those of the front-end feature extractor. The final part of the thesis concerns 3D human pose estimation. Combining the refinement capabilities of iterative gradient-based optimization with the robustness of neural networks, we propose a method that lifts 2D pose to 3D, using a statistical 3D human shape model to regularize the lifting.

Permanent link

https://doi.org/10.3929/ethz-b-000478108

Publication status

published

Contributors

Examiner: Hilliges, Otmar
Examiner: Schiele, Bernt
Examiner: Van Gool, Luc

Publisher

ETH Zurich

Organisational unit

03979 - Hilliges, Otmar / Hilliges, Otmar