Otmar Hilliges
Publications 1 - 10 of 136
- Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction
  Item type: Conference Paper
  13th International Conference on Learning Representations (ICLR 2025)
  Akbiyik, M. Eren; Savov, Nedko; Pani Paudel, Danda; et al. (2025)
  Understanding drivers' decision-making is crucial for road safety. Although predicting the ego-vehicle's path is valuable for driver-assistance systems, existing methods mainly focus on external factors like other vehicles' motions, often neglecting the driver's attention and intent. To address this gap, we infer the ego-trajectory by integrating the driver's gaze and the surrounding scene. We introduce RouteFormer, a novel multimodal ego-trajectory prediction network combining GPS data, environmental context, and the driver's field-of-view, comprising first-person video and gaze fixations. We also present the Path Complexity Index (PCI), a new metric for trajectory complexity that enables a more nuanced evaluation of challenging scenarios. To tackle data scarcity and enhance diversity, we introduce GEM, a comprehensive dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data. Extensive evaluations on GEM and DR(eye)VE demonstrate that RouteFormer significantly outperforms state-of-the-art methods, achieving notable improvements in prediction accuracy across diverse conditions. Ablation studies reveal that incorporating driver field-of-view data yields significantly better average displacement error, especially in challenging scenarios with high PCI scores, underscoring the importance of modeling driver attention. All data and code are available at meakbiyik.github.io/routeformer.
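As a rough picture of the fusion described above, here is a minimal PyTorch sketch: three per-modality encoders whose tokens are mixed by a small Transformer before a waypoint head. All module names, feature sizes, and the pooling choice are illustrative assumptions, not the RouteFormer implementation.

```python
# Illustrative sketch of multimodal ego-trajectory fusion (not the authors' code).
# Three hypothetical encoders produce per-timestep tokens that a Transformer
# encoder mixes; an MLP head regresses future 2D waypoints.
import torch
import torch.nn as nn

class TrajectoryFusionSketch(nn.Module):
    def __init__(self, d_model=128, horizon=20):
        super().__init__()
        self.gps_enc = nn.Linear(2, d_model)        # past (x, y) positions
        self.scene_enc = nn.Linear(512, d_model)    # assumed precomputed video features
        self.gaze_enc = nn.Linear(2, d_model)       # gaze fixations in image coords
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, horizon * 2) # future (x, y) waypoints

    def forward(self, gps, scene, gaze):
        # Each input: (batch, time, feat). Concatenating modalities along the
        # token axis lets attention mix information across modalities and time.
        tokens = torch.cat(
            [self.gps_enc(gps), self.scene_enc(scene), self.gaze_enc(gaze)], dim=1)
        fused = self.fusion(tokens).mean(dim=1)     # simple pooled summary
        return self.head(fused).view(gps.shape[0], -1, 2)

model = TrajectoryFusionSketch()
out = model(torch.randn(4, 10, 2), torch.randn(4, 10, 512), torch.randn(4, 10, 2))
print(out.shape)  # torch.Size([4, 20, 2])
```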
- Deformation capture via soft and stretchable sensor arrays
  Item type: Journal Article
  ACM Transactions on Graphics
  Glauser, Oliver; Panozzo, Daniele; Hilliges, Otmar; et al. (2019)
  We propose a hardware and software pipeline to fabricate flexible wearable sensors and use them to capture deformations without line-of-sight. Our first contribution is a low-cost fabrication pipeline to embed multiple aligned conductive layers with complex geometries into silicone compounds. Overlapping conductive areas from separate layers form local capacitors that measure dense area changes. Contrary to existing fabrication methods, the proposed technique only requires hardware that is readily available in modern fablabs. While area measurements alone are not enough to reconstruct the full 3D deformation of a surface, they become sufficient when paired with a data-driven prior. A novel semi-automatic tracking algorithm, based on an elastic surface geometry deformation, allows us to capture ground-truth data with an optical mocap system, even under heavy occlusions or partially unobservable markers. The resulting dataset is used to train a regressor based on deep neural networks, directly mapping the area readings to global positions of surface vertices. We demonstrate the flexibility and accuracy of the proposed hardware and software in a series of controlled experiments and design a prototype of wearable wrist, elbow, and biceps sensors, which do not require line-of-sight and can be worn below regular clothing.
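The data-driven mapping from area readings to surface vertices could look roughly like the following sketch; the sensor count, mesh resolution, and MLP shape are placeholder assumptions, not the paper's trained regressor.

```python
# Minimal sketch of a regressor mapping dense capacitive area readings to
# global 3D vertex positions, as described above. Sizes are made up.
import torch
import torch.nn as nn

N_SENSORS, N_VERTICES = 64, 500  # hypothetical sensor grid and surface mesh

regressor = nn.Sequential(
    nn.Linear(N_SENSORS, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_VERTICES * 3),  # (x, y, z) per vertex
)

readings = torch.randn(8, N_SENSORS)                 # a batch of capacitance frames
vertices = regressor(readings).view(8, N_VERTICES, 3)
print(vertices.shape)  # torch.Size([8, 500, 3])
```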
- SAGA: Stochastic Whole-Body Grasping with Contact
  Item type: Conference Paper
  Lecture Notes in Computer Science, Computer Vision – ECCV 2022
  Wu, Yan; Wang, Jiahao; Zhang, Yan; et al. (2022)
  The synthesis of human grasping has numerous applications, including AR/VR, video games, and robotics. While methods have been proposed to generate realistic hand-object interaction for object grasping and manipulation, these typically consider the interacting hand alone. Our goal is to synthesize whole-body grasping motions. Starting from an arbitrary initial pose, we aim to generate diverse and natural whole-body human motions to approach and grasp a target object in 3D space. This task is challenging as it requires modeling both whole-body dynamics and dexterous finger movements. To this end, we propose SAGA (StochAstic whole-body Grasping with contAct), a framework which consists of two key components: (a) static whole-body grasping pose generation, for which we propose a multi-task generative model to jointly learn static whole-body grasping poses and human-object contacts; and (b) grasping motion infilling, where, given an initial pose and the generated whole-body grasping pose as the start and end of the motion respectively, we design a novel contact-aware generative motion infilling module to generate a diverse set of grasp-oriented motions. We demonstrate the effectiveness of our method, a novel generative framework to synthesize realistic and expressive whole-body motions that approach and grasp randomly placed unseen objects. Code and models are available at https://jiahaoplus.github.io/SAGA/saga.html.
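The two-stage structure (pose generation, then motion infilling) can be outlined as below. Both stages are toy stand-ins: the real method uses learned generative models, while this sketch substitutes a fixed pose and linear interpolation purely to show the data flow.

```python
# Schematic of the two-stage design described above, with toy stand-ins.
import torch

def sample_grasp_pose(object_center: torch.Tensor) -> torch.Tensor:
    # Placeholder for stage (a): a learned grasping-pose generator goes here.
    pose = torch.zeros(63)                  # e.g. 21 joints x 3 (illustrative)
    pose[:3] = object_center                # root translation near the object
    return pose

def infill_motion(start: torch.Tensor, end: torch.Tensor, frames: int) -> torch.Tensor:
    # Placeholder for stage (b): the paper uses a contact-aware generative
    # infilling module; plain linear interpolation stands in for it here.
    t = torch.linspace(0, 1, frames).unsqueeze(1)
    return (1 - t) * start + t * end        # (frames, pose_dim)

start_pose = torch.zeros(63)
end_pose = sample_grasp_pose(torch.tensor([1.0, 0.0, 0.5]))
motion = infill_motion(start_pose, end_pose, frames=60)
print(motion.shape)  # torch.Size([60, 63])
```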
- HARP: Personalized Hand Reconstruction from a Monocular RGB Video
  Item type: Conference Paper
  2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  Karunratanakul, Korrawe; Prokudin, Sergey; Hilliges, Otmar; et al. (2023)
  We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting high-fidelity appearance and geometry. In contrast to the major trend of neural implicit representations, HARP models a hand with a mesh-based parametric hand model, a vertex displacement map, a normal map, and an albedo without any neural components. The explicit nature of our representation enables a truly scalable, robust, and efficient approach to hand avatar creation, as validated by our experiments. HARP is optimized via gradient descent from a short sequence captured by a hand-held mobile phone and can be directly used in AR/VR applications with real-time rendering capability. To enable this, we carefully design and implement a shadow-aware differentiable rendering scheme that is robust to the highly articulated poses and self-shadowing regularly present in hand motions, as well as to challenging lighting conditions. It also generalizes to unseen poses and novel viewpoints, producing photo-realistic renderings of hand animations. Furthermore, the learned HARP representation can be used to improve 3D hand pose estimation quality in challenging viewpoints. The key advantages of HARP are validated by in-depth analyses of appearance reconstruction, novel view and novel pose synthesis, and 3D hand pose refinement. It is an AR/VR-ready personalized hand representation that shows superior fidelity and scalability.
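The explicit, optimization-based recipe can be sketched as a plain gradient-descent loop. The renderer below is a stub standing in for HARP's shadow-aware differentiable renderer; the mesh size and parameter maps are illustrative.

```python
# Sketch of the explicit optimization idea: per-vertex displacements and an
# albedo map fitted by gradient descent against video frames.
import torch

template = torch.rand(778, 3)                       # e.g. a MANO-like hand mesh
displacement = torch.zeros(778, 3, requires_grad=True)
albedo = torch.rand(256, 256, 3, requires_grad=True)

def render_stub(verts, albedo):
    # Stand-in for a differentiable renderer: any differentiable function of
    # the parameters suffices to illustrate the optimization loop.
    return albedo * verts.mean()

target_frame = torch.rand(256, 256, 3)
opt = torch.optim.Adam([displacement, albedo], lr=1e-2)
for step in range(100):
    opt.zero_grad()
    image = render_stub(template + displacement, albedo)
    loss = (image - target_frame).abs().mean()      # photometric L1 loss
    loss.backward()
    opt.step()
```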
- HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics
  Item type: Conference Paper
  2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  Grigorev, Artur; Black, Michael J.; Hilliges, Otmar (2023)
  We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable efficient prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method, called HOOD, is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Furthermore, HOOD handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that HOOD outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods.
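A toy version of hierarchical message passing, for intuition: one pass on the fine garment graph, pooling to a coarse graph where long-range (stiff) modes travel cheaply, then broadcasting back. Graph sizes, features, and the shared MLP are arbitrary assumptions, not the HOOD architecture.

```python
# Toy two-level message passing on a garment graph.
import torch

def message_pass(x, edges, mlp):
    # edges: (2, E) with source and destination node indices.
    src, dst = edges
    messages = mlp(torch.cat([x[src], x[dst]], dim=-1))
    out = torch.zeros_like(x)
    out.index_add_(0, dst, messages)        # sum incoming messages per node
    return x + out

feat = 16
mlp = torch.nn.Sequential(torch.nn.Linear(2 * feat, feat), torch.nn.ReLU(),
                          torch.nn.Linear(feat, feat))

x_fine = torch.randn(100, feat)             # fine garment vertices
fine_edges = torch.randint(0, 100, (2, 300))
coarse_of = torch.randint(0, 10, (100,))    # each fine vertex's coarse node

# Fine-level pass, pool to the coarse level, pass there, broadcast back.
x_fine = message_pass(x_fine, fine_edges, mlp)
x_coarse = torch.zeros(10, feat).index_add_(0, coarse_of, x_fine)
coarse_edges = torch.randint(0, 10, (2, 30))
x_coarse = message_pass(x_coarse, coarse_edges, mlp)
x_fine = x_fine + x_coarse[coarse_of]        # long-range information returns
```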
- AvatarPose: Avatar-guided 3D Pose Estimation of Close Human Interaction from Sparse Multi-view Videos
  Item type: Conference Paper
  Lecture Notes in Computer Science, Computer Vision – ECCV 2024
  Lu, Feichi; Dong, Zijian; Song, Jie; et al. (2025)
  Despite progress in human motion capture, existing multi-view methods often face challenges in estimating the 3D pose and shape of multiple closely interacting people. This difficulty arises from reliance on accurate 2D joint estimations, which are hard to obtain due to occlusions and body contact when people are in close interaction. To address this, we propose a novel method leveraging the personalized implicit neural avatar of each individual as a prior, which significantly improves the robustness and precision of this challenging pose estimation task. Concretely, the avatars are efficiently reconstructed via layered volume rendering from sparse multi-view videos. The reconstructed avatar prior allows for the direct optimization of 3D poses based on color and silhouette rendering loss, bypassing the issues associated with noisy 2D detections. To handle interpenetration, we propose a collision loss on the overlapping shape regions of avatars to add penetration constraints. Moreover, both 3D poses and avatars are optimized in an alternating manner. Our experimental results demonstrate state-of-the-art performance on several public datasets.
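The avatar-guided optimization can be outlined as follows, with stub losses in place of the paper's color/silhouette rendering loss and volumetric collision term; only the alternating-optimization skeleton is the point here.

```python
# Skeleton of pose optimization against rendering and collision losses.
import torch

poses = [torch.zeros(72, requires_grad=True) for _ in range(2)]  # two people

def render_loss(pose):
    return (pose ** 2).mean()      # stand-in for the color + silhouette loss

def collision_loss(pose_a, pose_b):
    # Toy proximity penalty on root translations, standing in for the paper's
    # penalty on overlapping avatar shape regions.
    return torch.relu(0.3 - (pose_a[:3] - pose_b[:3]).norm())

opt = torch.optim.Adam(poses, lr=1e-2)
for it in range(50):
    opt.zero_grad()
    loss = sum(render_loss(p) for p in poses) + collision_loss(*poses)
    loss.backward()
    opt.step()
# In the full method, avatar refinement alternates with this pose step.
```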
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
  Item type: Conference Paper
  2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  Jiang, Zeren; Guo, Chen; Kaufmann, Manuel; et al. (2024)
  We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composed of individual human models and a background model. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach, which combines self-supervised 3D segmentation with a promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to alternately optimize human poses and shape/appearance. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. Our evaluation shows superiority over prior art on publicly available datasets and in-the-wild videos.
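For intuition about layered volume rendering, the sketch below composites density and color samples along a single ray; merging layers by summing their densities at shared sample points is one simple assumed convention, not necessarily the paper's.

```python
# Toy compositing of per-person layers plus background along one camera ray.
import torch

def composite(densities, colors, deltas):
    # Standard volume rendering: alpha from density, transmittance in depth order.
    alpha = 1 - torch.exp(-densities * deltas)                 # (num_samples,)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha]), 0)[:-1]
    return (trans * alpha).unsqueeze(-1) * colors              # per-sample RGB

n = 32                                           # samples along the ray
deltas = torch.full((n,), 0.1)
# Sum the per-layer density fields at shared sample points: a simple way to
# composite independently modeled people and background.
density = sum(torch.rand(n) for _ in range(3))   # two people + background
color = torch.rand(n, 3)
pixel = composite(density, color, deltas).sum(0) # final RGB for this ray
print(pixel)
```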
- STCN: Stochastic Temporal Convolutional Networks
  Item type: Working Paper
  arXiv
  Aksan, Emre; Hilliges, Otmar (2019)
  Convolutional architectures have recently been shown to be competitive on many sequence modelling tasks when compared to the de facto standard of recurrent neural networks (RNNs), while providing computational and modeling advantages due to inherent parallelism. However, there currently remains a performance gap to more expressive stochastic RNN variants, especially those with several layers of dependent random variables. In this work, we propose stochastic temporal convolutional networks (STCNs), a novel architecture that combines the computational advantages of temporal convolutional networks (TCNs) with the representational power and robustness of stochastic latent spaces. In particular, we propose a hierarchy of stochastic latent variables that captures temporal dependencies at different time-scales. The architecture is modular and flexible due to the decoupling of the deterministic and stochastic layers. We show that the proposed architecture achieves state-of-the-art log-likelihoods across several tasks. Finally, the model is capable of predicting high-quality synthetic samples over a long-range temporal horizon when modeling handwritten text.
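One stochastic layer on top of temporal convolutions might look like the sketch below: a dilated causal convolution produces features from which a Gaussian latent is sampled via the reparameterization trick. The full model stacks such layers into a multi-scale hierarchy; all sizes here are illustrative.

```python
# Minimal sketch of a single stochastic layer over TCN-style features.
import torch
import torch.nn as nn

class StochasticTCNLayer(nn.Module):
    def __init__(self, channels=32, z_dim=16, dilation=2):
        super().__init__()
        self.pad = 2 * dilation                     # left-pad for causality
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)
        self.to_mu = nn.Conv1d(channels, z_dim, 1)
        self.to_logvar = nn.Conv1d(channels, z_dim, 1)

    def forward(self, x):                           # x: (batch, channels, time)
        h = self.conv(nn.functional.pad(x, (self.pad, 0)))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar

layer = StochasticTCNLayer()
z, mu, logvar = layer(torch.randn(4, 32, 100))
print(z.shape)  # torch.Size([4, 16, 100])
```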
- Self-Supervised 3D Hand Pose Estimation from monocular RGB via Contrastive Learning
  Item type: Conference Paper
  2021 IEEE/CVF International Conference on Computer Vision (ICCV)
  Spurr, Adrian; Dahiya, Aneesh; Wang, Xi; et al. (2021)
  Encouraged by the success of contrastive learning on image classification tasks, we propose a new self-supervised method for the structured regression task of 3D hand pose estimation. Contrastive learning makes use of unlabeled data for the purpose of representation learning via a loss formulation that encourages the learned feature representations to be invariant under any image transformation. For 3D hand pose estimation, invariance to appearance transformations such as color jitter is likewise desirable. However, the task requires equivariance under affine transformations, such as rotation and translation. To address this issue, we propose an equivariant contrastive objective and demonstrate its effectiveness in the context of 3D hand pose estimation. We experimentally investigate the impact of invariant and equivariant contrastive objectives and show that learning equivariant features leads to better representations for the task of 3D hand pose estimation. Furthermore, we show that standard ResNets with sufficient depth, trained on additional unlabeled data, attain improvements of up to 14.5% in PA-EPE on FreiHAND and thus achieve state-of-the-art performance without any task-specific, specialized architectures. Code and models are available at https://ait.ethz.ch/projects/2021/PeCLR
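The equivariance idea can be demonstrated in a few lines: each branch's features are re-aligned by the inverse of its geometric transformation before measuring similarity, so positives agree only if the encoder is equivariant. The 2D-point embedding below is a deliberately tiny stand-in for the paper's latent-space transformation.

```python
# Undoing each view's geometric transform in embedding space before the
# contrastive comparison: a toy demonstration with 2D rotations.
import torch

def rot(theta):
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

embed_2d = torch.randn(8, 2)         # pretend embeddings, viewed as 2D points
theta_a, theta_b = torch.tensor(0.3), torch.tensor(-0.5)

# Each branch sees a rotated input; an equivariant encoder rotates its output
# accordingly. Applying the inverse rotation re-aligns the two branches.
feat_a = embed_2d @ rot(theta_a).T   # stand-in for encoder(rotate(img, a))
feat_b = embed_2d @ rot(theta_b).T
aligned_a = feat_a @ rot(-theta_a).T
aligned_b = feat_b @ rot(-theta_b).T

sim = torch.cosine_similarity(aligned_a, aligned_b, dim=-1)
print(sim.mean())                    # ~1: positives agree after re-alignment
```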
- Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis
  Item type: Conference Paper
  2023 IEEE/CVF International Conference on Computer Vision (ICCV)
  Bühler, Marcel Christoph; Sarkar, Kripasindhu; Shah, Tanmay; et al. (2023)
  NeRFs have enabled highly realistic synthesis of human faces, including complex appearance and reflectance effects of hair and skin. These methods typically require a large number of multi-view input images, making the process hardware-intensive and cumbersome and limiting applicability to unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-resolution novel views of subjects that are not part of the prior's training distribution. This prior model consists of an identity-conditioned NeRF, trained on a dataset of low-resolution multi-view images of diverse humans with known camera calibration. A simple sparse landmark-based 3D alignment of the training dataset allows our model to learn a smooth latent space of geometry and appearance despite a limited number of training identities. A high-quality volumetric representation of a novel subject can be obtained by model fitting to 2 or 3 camera views of arbitrary resolution. Importantly, our method requires as few as two casually captured views as input at inference time.
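Few-shot fitting of a frozen prior can be outlined as below: only a per-subject identity latent is optimized against the available views. The prior network and the "views" are random stubs; the point is the frozen-prior/free-latent split.

```python
# Fitting a per-subject latent code under a frozen pretrained prior.
import torch

prior = torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(),
                            torch.nn.Linear(128, 3))   # latent -> pixel stub
for p in prior.parameters():
    p.requires_grad_(False)                            # the prior stays frozen

latent = torch.zeros(64, requires_grad=True)           # per-subject identity code
views = [torch.rand(3) for _ in range(3)]              # stand-ins for 2-3 photos

opt = torch.optim.Adam([latent], lr=1e-2)
for step in range(200):
    opt.zero_grad()
    pred = prior(latent)
    loss = sum((pred - v).square().mean() for v in views)
    loss.backward()
    opt.step()
```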