Environment-aware 3D Human Motion Capture in Challenging Scenarios
OPEN ACCESS
Date
2024
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
In the era of autonomy, the creation of a 3D digital world that faithfully replicates our physical reality becomes increasingly critical. Central to this endeavor is the incorporation of realistic human behaviors, which requires a deep understanding of human actions and movements. Moreover, human behaviors are intricately rooted in environments - our movements are influenced by our interactions with various objects and the spatial arrangement of our surroundings. Therefore, it is essential not only to model human motion itself but also to model how humans interact with the surrounding environment. Understanding human motions within diverse environments has significant applications across numerous fields, including augmented reality (AR), virtual reality (VR), assistive robotics, healthcare, biomechanics, filmmaking, and the gaming industry.
The first critical step towards understanding human motion is to capture it accurately. In this thesis, we aim to develop robust methods for human pose and motion reconstruction using affordable monocular RGB(-D) cameras. Existing methods struggle with several challenges, including limited training data, neglect of 3D scene context, the domain gap between third-person and egocentric perspectives, and pose ambiguity caused by noisy and partial observations. These challenges often result in noticeable reconstruction artifacts, such as implausible human-scene interactions and unnatural motions.
To reduce the performance gap between the monocular setup and high-quality marker-based motion capture setups, we propose to learn motion priors from existing large-scale motion capture datasets to model the intrinsic properties of human motion. We first introduce LEMO, an optimization framework that utilizes data-driven motion priors and physics-based constraints to reconstruct smooth, natural motions in 3D environments.
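To make this concrete, below is a minimal sketch of how such an optimization might combine a data term with a learned motion prior and physics-inspired penalties. All interfaces and weights here are illustrative assumptions (the motion_prior energy, the contact_mask input, the loss weights), not LEMO's actual implementation.

    # Illustrative sketch only: the prior interface, contact mask, and loss
    # weights are assumptions for exposition, not the actual LEMO code.
    import torch

    def fit_motion(obs_joints, motion_prior, contact_mask, n_iters=300):
        """Refine a noisy 3D joint sequence obs_joints of shape (T, J, 3),
        given a learned motion prior and a foot-contact mask of shape (T, J)."""
        motion = obs_joints.clone().requires_grad_(True)
        opt = torch.optim.Adam([motion], lr=0.01)
        for _ in range(n_iters):
            opt.zero_grad()
            # Data term: stay close to the (noisy) observations.
            e_data = ((motion - obs_joints) ** 2).mean()
            # Learned prior: assumed to return a scalar energy that is low
            # for natural human motion.
            e_prior = motion_prior(motion)
            # Smoothness: penalize large frame-to-frame accelerations.
            accel = motion[2:] - 2 * motion[1:-1] + motion[:-2]
            e_smooth = (accel ** 2).mean()
            # Contact: joints flagged as in contact should not slide.
            vel = motion[1:] - motion[:-1]
            mask = contact_mask[1:].unsqueeze(-1).float()
            e_contact = ((vel * mask) ** 2).mean()
            # Hypothetical weights; in practice these require tuning.
            loss = e_data + 0.1 * e_prior + 1.0 * e_smooth + 1.0 * e_contact
            loss.backward()
            opt.step()
        return motion.detach()

The key design point is that the learned prior and the physics-inspired terms act as regularizers on top of the image-driven data term, letting the optimizer smooth out jitter and foot skating without drifting away from the observations.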
Next, by extending LEMO to a sparse multi-view RGB-D setup, we present EgoBody, a novel egocentric human motion dataset. Existing human motion datasets primarily focus on third-person views and lack comprehensive egocentric data. Consequently, methods trained on these datasets often struggle to generalize to the egocentric view due to its unique challenges, such as motion blur, body truncations, and people entering and exiting the field of view. EgoBody fills this gap by capturing human movements during social interactions in complex 3D environments from the egocentric view, together with rich multi-modal data streams. We thoroughly evaluate existing methods, examine their limitations on egocentric data, and address these shortcomings with our high-quality annotations.
Furthermore, we address the pose ambiguity caused by occlusions, truncations, and missing observations in real-life images through probabilistic modeling. This ambiguity arises from the limited field of view of monocular cameras and persists as a challenge in both third-person and egocentric views. By framing the task as a conditional generation problem based on the available observations, models can learn expressive human pose and motion distributions. Additionally, we incorporate 3D scene context and temporal information to further mitigate pose ambiguity. Based on data from EgoBody, we develop EgoHMR, a scene-conditioned probabilistic method that recovers the human mesh from egocentric images. By leveraging the 3D scene constraint through both classifier-free and classifier-based guidance, EgoHMR largely resolves the pose ambiguity caused by severe body truncations, one of the main factors behind the domain gap between the third-person and egocentric views. To further enhance robustness for motion reconstruction, we propose RoHM, a diffusion-based approach that learns the human motion distribution over time. We decouple the recovery of global and local motion by learning two separate models, conditioned on available image evidence, and introduce a novel conditioning module to bridge global-local correlations. Compared with LEMO, RoHM demonstrates more robust performance and enables more realistic and accurate motion reconstruction from noisy and incomplete data.
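To illustrate the classifier-based guidance idea, the sketch below adds the gradient of a scene-penetration cost to each reverse-diffusion step. Everything here is an assumption for exposition (the denoiser signature, a differentiable scene_sdf mapping the predicted pose tensor to per-point signed distances, the guidance weight); it is not the published EgoHMR or RoHM code.

    # Illustrative sketch: interfaces and weights are assumptions, not the
    # published EgoHMR/RoHM implementation.
    import torch

    @torch.no_grad()
    def guided_sample(denoiser, image_feat, scene_sdf, x, timesteps,
                      alphas_bar, guidance_w=30.0):
        """Reverse diffusion over a pose/motion tensor x, nudging each step
        with the gradient of a scene-penetration cost."""
        for t in reversed(timesteps):
            a_bar = alphas_bar[t]
            # Predict the clean sample, conditioned on image evidence.
            x0_hat = denoiser(x, t, image_feat)
            # Classifier-based guidance: penalize points inside the scene
            # (negative signed distance) and step along the gradient.
            with torch.enable_grad():
                x0_g = x0_hat.detach().requires_grad_(True)
                penetration = torch.relu(-scene_sdf(x0_g)).sum()
                grad = torch.autograd.grad(penetration, x0_g)[0]
            x0_hat = x0_hat - guidance_w * grad
            # Simplified deterministic (DDIM-style) step back to x_{t-1}.
            eps_hat = (x - a_bar.sqrt() * x0_hat) / (1 - a_bar).sqrt()
            a_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
            x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps_hat
        return x

Classifier-free conditioning, by contrast, bakes the scene context into the denoiser's inputs at training time rather than steering the sampler with an external gradient; the two can be combined, as described above.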
Through these contributions, we significantly advance the state of the art in monocular human motion reconstruction, enhancing the realism of captured motions and broadening the applicability of this technology to diverse real-world scenarios. We aim for this work to advance 3D human behavior understanding and open up new possibilities for future applications.
Publication status
published
Publisher
ETH Zurich
Subject
Computer Vision; 3D Computer Vision; Digital Humans; Human Motion Modeling; Human-Scene Interaction
Organisational unit
09686 - Tang, Siyu / Tang, Siyu