Environment-aware 3D Human Motion Capture in Challenging Scenarios



Date

2024

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

In the era of autonomy, the creation of a 3D digital world that faithfully replicates our physical reality becomes increasingly critical. Central to this endeavor is the incorporation of realistic human behaviors, which requires a deep understanding of human actions and movements. Moreover, human behaviors are intricately rooted in their environments: our movements are shaped by our interactions with objects and by the spatial arrangement of our surroundings. It is therefore essential to model not only human motion itself but also how humans interact with the surrounding environment. Understanding human motion within diverse environments has significant applications across numerous fields, including augmented reality (AR), virtual reality (VR), assistive robotics, healthcare, biomechanics, filmmaking, and the gaming industry. The first critical step towards understanding human motion is to capture it accurately. In this thesis, we aim to develop robust methods for human pose and motion reconstruction using affordable monocular RGB(-D) cameras. Existing methods struggle with various challenges, including limited data, neglect of 3D scene context, the domain gap between third-person and egocentric perspectives, and pose ambiguity caused by noisy and partial observations, often resulting in noticeable reconstruction artifacts such as implausible human-scene interactions and unnatural motions. To narrow the performance gap between the monocular setup and high-quality marker-based motion capture, we propose to learn motion priors from existing large-scale motion capture datasets that model the intrinsic properties of human motion. We first introduce LEMO, an optimization framework that combines data-driven motion priors with physics-based constraints to reconstruct smooth, natural motions in 3D environments. Next, by extending LEMO to a sparse multi-view RGB-D setup, we present EgoBody, a novel egocentric human motion dataset.
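The prior-regularized optimization idea behind LEMO can be illustrated with a toy example. The sketch below fits a noisy 1D joint trajectory by gradient descent on a data term plus a motion-prior term that penalizes accelerations; the 1D setting, the acceleration penalty, and all weights are illustrative stand-ins, not LEMO's actual formulation.

```python
import numpy as np

def objective_grad(x, obs, w_data=1.0, w_prior=0.1):
    """Gradient of: w_data * ||x - obs||^2 + w_prior * ||second differences||^2."""
    # data term: stay close to the per-frame observations
    g = 2.0 * w_data * (x - obs)
    # motion-prior term: penalize discrete accelerations (second differences)
    acc = x[:-2] - 2.0 * x[1:-1] + x[2:]
    g_prior = np.zeros_like(x)
    g_prior[:-2] += acc
    g_prior[1:-1] += -2.0 * acc
    g_prior[2:] += acc
    return g + 2.0 * w_prior * g_prior

def fit(obs, steps=500, lr=0.05):
    """Gradient descent from the observations toward a smooth trajectory."""
    x = obs.copy()
    for _ in range(steps):
        x -= lr * objective_grad(x, obs)
    return x
```

Raising the (hypothetical) `w_prior` weight trades observation fidelity for smoothness, which is the same trade-off a learned motion prior mediates in the full optimization.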
Existing human motion datasets primarily focus on third-person views and lack comprehensive egocentric data. Consequently, methods trained on these datasets often struggle to generalize to the egocentric view, with its unique challenges such as motion blur, body truncations, and people entering or exiting the field of view. The EgoBody dataset fills this gap by capturing human movements during social interactions in complex 3D environments from the egocentric view, together with rich multi-modal data streams. We thoroughly evaluate existing methods, examine their limitations on egocentric data, and address these shortcomings with our high-quality annotations. Furthermore, we propose to address the pose ambiguity caused by occlusions, truncations, and missing observations in real-life images through probabilistic modeling. This ambiguity arises from the limited field of view of monocular cameras and persists as a challenge in both third-person and egocentric views. By framing the task as a conditional generation problem based on the available observations, models can learn expressive human pose and motion distributions. Additionally, we incorporate 3D scene context and temporal information to further mitigate pose ambiguity. Based on data from EgoBody, we develop a scene-conditioned probabilistic method, EgoHMR, to recover the human mesh from egocentric images. By leveraging the 3D scene constraint through both classifier-free and classifier-based guidance, EgoHMR largely resolves the pose ambiguity caused by severe body truncations, one of the main factors behind the domain gap between the third-person and egocentric views. To further enhance robustness in motion reconstruction, we propose RoHM, a diffusion-based approach that learns the distribution of human motion over time.
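The combination of classifier-free and classifier-based scene guidance mentioned above can be sketched within a single denoising step. Here `eps_cond` and `eps_uncond` stand for a diffusion model's noise predictions with and without the conditioning signal, and a floor-penetration penalty stands in for the 3D scene constraint; all names, weights, and the penalty itself are assumptions for illustration, not EgoHMR's actual formulation.

```python
import numpy as np

def scene_grad(joints, floor_z=0.0):
    """Gradient of sum(min(z - floor_z, 0)^2): nonzero only for joints below the floor."""
    g = np.zeros_like(joints)
    below = np.minimum(joints[..., 2] - floor_z, 0.0)
    g[..., 2] = 2.0 * below
    return g

def guided_eps(joints, eps_cond, eps_uncond, w_cfg=2.0, w_scene=0.5):
    """One guided noise prediction combining both guidance styles (toy weights)."""
    # classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one
    eps = eps_uncond + w_cfg * (eps_cond - eps_uncond)
    # classifier-based guidance: add the scene-penalty gradient so the
    # sampler is steered away from floor penetration
    return eps + w_scene * scene_grad(joints)
```

With `w_cfg = 1` and joints above the floor, the step reduces to the plain conditional prediction; the scene term only activates where the current sample violates the (assumed) floor constraint.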
We decouple the recovery of global and local motion by learning two separate models, conditioned on the available image evidence, and introduce a novel conditioning module to capture global-local correlations. Compared with LEMO, RoHM is more robust and enables more realistic and accurate motion reconstruction from noisy and incomplete data. Through these contributions, we significantly advance the state of the art in monocular human motion reconstruction, enhancing the realism of captured motions and broadening the applicability of this technology to diverse real-world scenarios. We hope this work advances 3D human behavior understanding and opens up new possibilities for future applications.
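The decoupling of global and local motion can be illustrated with a toy alternation: a stand-in "global" model smooths the root trajectory, and a stand-in "local" model refines per-frame poses using a trajectory-derived feature in place of the learned conditioning module. Nothing here reflects RoHM's actual architecture or diffusion formulation; it only sketches how the two estimates can be coupled.

```python
import numpy as np

def denoise_traj(traj_noisy):
    """Stand-in global model: moving-average smoothing of the root path."""
    k = np.ones(5) / 5.0
    return np.stack([np.convolve(traj_noisy[:, d], k, mode="same")
                     for d in range(traj_noisy.shape[1])], axis=1)

def denoise_pose(pose_noisy, traj):
    """Stand-in local model, conditioned on the current trajectory estimate."""
    # toy "conditioning module": root speed modulates how strongly each
    # frame's pose noise is shrunk
    speed = np.linalg.norm(np.gradient(traj, axis=0), axis=1, keepdims=True)
    gain = 1.0 / (1.0 + speed)
    return pose_noisy * (1.0 - 0.5 * gain)

def reconstruct(traj_noisy, pose_noisy, iters=3):
    """Alternate global and local refinement, local conditioned on global."""
    traj, pose = traj_noisy, pose_noisy
    for _ in range(iters):
        traj = denoise_traj(traj)
        pose = denoise_pose(pose, traj)
    return traj, pose
```

The design point the sketch mirrors is that the local model never sees the raw noisy trajectory, only the current refined global estimate, so errors in one stream do not propagate unchecked into the other.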

Publication status

published

Contributors

Examiner: Tang, Siyu
Examiner: Pollefeys, Marc
Examiner: Pons-Moll, Gerard
Examiner: Wu, Jiajun

Publisher

ETH Zurich

Subject

Computer Vision; 3D Computer Vision; Digital Humans; Human motion modeling; Human scene interaction

Organisational unit

09686 - Tang, Siyu
