Semi- and Weakly-Supervised Methods for 3D Hand Pose Estimation from Monocular RGB
OPEN ACCESS
Date
2023
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Hands are vital in day-to-day tasks such as interacting with the physical world and communicating. Endowing computer systems with the capability to understand the actions of hands enables new applications in activity recognition and augmented/virtual reality. Due to this potential, hand understanding has garnered longstanding interest from the research community.
In more recent years, the focus has shifted to 3D hand pose estimation from monocular RGB. Its success can be attributed to two main factors: the development and design of powerful neural network architectures and access to large-scale datasets. More specifically, the field of hand pose estimation has benefited greatly from accurate architectures such as ResNet or HRNet. This has led to impressive results, with the field achieving accurate hand pose and mesh prediction, hand-object interaction modeling, and inter-hand interaction prediction. To achieve such performance, these networks require large-scale datasets, all the more so if this performance is desired across a wide range of settings, such as different skin colors or changing illumination levels. Yet acquiring fully labeled datasets at a scale that encompasses such diverse settings is costly and challenging, requiring expensive recording equipment and recording sessions with numerous participants. In this thesis, we argue that to achieve truly large-scale training we need to expand our sources of training data. To this end, we propose methods that are capable of leveraging weakly labeled or unlabeled datasets. Such data are typically not sourced from dedicated recording setups but instead come from in-the-wild settings, such as videos or images acquired from the internet. As such, they do not carry full annotations for the task at hand; instead, they have noisy or weak labels, or no labels at all.
In the first part of this thesis, we focus on unlabeled data. We start by introducing a cross-modal latent space, deriving an objective function from the variational lower bound of the Variational Autoencoder (VAE) framework. The resulting training regime naturally admits embedding multiple modalities into a single latent space representing hand poses, which enables a straightforward form of semi-supervision via reconstruction. The learned latent space can be used directly to predict 3D hand configurations. We thoroughly evaluate our method and demonstrate that semi-supervision improves hand pose estimation performance.
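The exact derivation is given in the thesis; as a rough sketch of such a cross-modal variational lower bound, assume an image x, a 3D pose y, a shared latent z with prior p(z), an encoder q_\phi(z|x), and decoders p_\theta(x|z) and p_\theta(y|z) (this notation is ours, not necessarily the thesis's):

\mathcal{L}(x, y) = \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z) + \log p_\theta(y \mid z) \right] - D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, p(z) \right)

For unlabeled images the pose term \log p_\theta(y \mid z) is simply dropped, so labeled and unlabeled data shape the same latent space through reconstruction.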
Next, we move on to the temporal domain to leverage unlabeled videos. Inspired by the motion modeling literature, we propose a motion model for hands and demonstrate that it improves the hand pose estimator via semi-supervised training. The motion model reasons on both a structural and a temporal level. Extensive evaluation identifies the components essential to our framework. We demonstrate its performance in two challenging settings, obtaining substantial improvements in mean joint error.
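The thesis specifies the actual architecture and objective; the following is only a minimal PyTorch-style sketch of the general idea, a motion model providing a training signal on unlabeled video. All names (MotionDiscriminator, semi_supervised_step), the GRU backbone, the sequence length, and the loss weight are illustrative assumptions rather than the thesis's implementation.

import torch
import torch.nn.functional as F

T, J = 16, 21  # sequence length and joint count (illustrative choices)

class MotionDiscriminator(torch.nn.Module):
    # Scores whether a sequence of 3D hand poses moves plausibly.
    def __init__(self, hidden=256):
        super().__init__()
        self.gru = torch.nn.GRU(J * 3, hidden, batch_first=True)
        self.head = torch.nn.Linear(hidden, 1)

    def forward(self, poses):               # poses: (B, T, J, 3)
        h, _ = self.gru(poses.flatten(2))   # (B, T, hidden)
        return self.head(h[:, -1])          # one realism score per sequence

def semi_supervised_step(estimator, motion_model, labeled, unlabeled_video):
    images, gt_poses = labeled
    supervised = F.mse_loss(estimator(images), gt_poses)
    # Unlabeled branch: pose sequences predicted on raw video should be
    # scored as plausible by the motion model.
    B = unlabeled_video.shape[0]            # video: (B, T, C, H, W)
    preds = estimator(unlabeled_video.flatten(0, 1)).view(B, T, J, 3)
    temporal = -motion_model(preds).mean()
    return supervised + 0.1 * temporal      # the weight 0.1 is an assumption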
Furthermore, we make use of ideas from the contrastive learning literature. We introduce an equivariant contrastive objective suited to the task of 3D hand pose estimation. Experiments demonstrate the impact of our proposed objective and show that a standard ResNet trained on additional unlabeled data achieves state-of-the-art performance without any task-specific specialized architecture.
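As a rough illustration of an equivariant contrastive objective, the PyTorch-style sketch below (our naming, restricted to 2D rotations for brevity) undoes each view's geometric transformation in latent space before computing a standard InfoNCE loss, so the embedding is encouraged to be equivariant, rather than invariant, to the transformation. It assumes the encoder output has even dimension so it can be reshaped into 2D points.

import math
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import rotate

def rotate_points(z, radians):
    # Apply a 2D rotation to embeddings reshaped as (B, K, 2) points.
    c, s = torch.cos(radians), torch.sin(radians)
    R = torch.stack([torch.stack([c, -s]), torch.stack([s, c])])
    return z @ R.T

def equivariant_contrastive_loss(encoder, images, temperature=0.1):
    B = images.shape[0]
    deg = torch.rand(2) * 360.0                # one random rotation per view
    v1 = rotate(images, deg[0].item())         # transform in image space
    v2 = rotate(images, deg[1].item())
    # Reshape each embedding into K 2D points so the transform can be undone.
    z1 = encoder(v1).view(B, -1, 2)
    z2 = encoder(v2).view(B, -1, 2)
    z1 = rotate_points(z1, -deg[0] * math.pi / 180).flatten(1)  # undo view 1
    z2 = rotate_points(z2, -deg[1] * math.pi / 180).flatten(1)  # undo view 2
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = (z1 @ z2.T) / temperature         # (B, B) cosine similarities
    labels = torch.arange(B)                   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)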
In the second part of the thesis, we investigate weakly-supervised data annotated with only 2D keypoint positions. More specifically, we look at the effect 2D supervision has on keypoint-based pose estimation methods. We show that the straightforward use of 2D-annotated data improves the accuracy of the predicted 2D keypoint locations but does little for depth estimation, resulting in predicted poses that are biomechanically implausible. To address this, we propose a novel set of losses that constrain the predictions of a network to lie within a biomechanically feasible range. Extensive experiments show that our proposed constraints reduce the depth ambiguity by a significant margin, allowing the network to leverage the additional 2D-annotated images more effectively.
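The thesis defines the concrete constraint set; as a minimal sketch of one range-type loss, the PyTorch-style snippet below penalizes predicted bone lengths only when they leave a feasible interval (joint angles can be constrained analogously). The skeleton topology and the hinge-style penalty are illustrative assumptions.

import torch
import torch.nn.functional as F

# Parent of each joint in a 21-joint hand skeleton, wrist (0) as root.
PARENT = torch.tensor([0, 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11,
                       0, 13, 14, 15, 0, 17, 18, 19])

def range_loss(value, lo, hi):
    # Zero inside [lo, hi]; linear (hinge) penalty outside it.
    return F.relu(lo - value) + F.relu(value - hi)

def bone_length_loss(pred, min_len, max_len):
    # pred: (B, 21, 3) predicted joints; min_len/max_len: (20,) per-bone limits.
    child = pred[:, 1:]                      # joints 1..20
    parent = pred[:, PARENT[1:]]             # their parents
    lengths = (child - parent).norm(dim=-1)  # (B, 20) bone lengths
    return range_loss(lengths, min_len, max_len).mean()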
Publication status
published
Contributors
Examiner: Hilliges, Otmar
Examiner: Gross, Markus
Examiner: Theobalt, Christian
Examiner: Molchanov, Pavlo
Publisher
ETH Zurich
Subject
3D hand pose estimation; Computer Vision and Pattern Recognition (cs.CV)
Organisational unit
03979 - Hilliges, Otmar (former)