Manuel Kaufmann
Last Name: Kaufmann
First Name: Manuel
ORCID:
Organisational unit: 02219 - ETH AI Center / ETH AI Center
Publications 1 - 10 of 18
- An Interpretable and Attention-based Method for Gaze Estimation Using Electroencephalography
  Item type: Conference Paper
  Lecture Notes in Computer Science ~ Medical Image Computing and Computer Assisted Intervention – MICCAI 2023
  Weng, Nina; Plomecka, Martyna; Kaufmann, Manuel; et al. (2023)
  Eye movements can reveal valuable insights into various aspects of human mental processes, physical well-being, and actions. Recently, several datasets have been made available that simultaneously record EEG activity and eye movements. This has triggered the development of various methods to predict gaze direction based on brain activity. However, most of these methods lack interpretability, which limits their technology acceptance. In this paper, we leverage a large dataset of simultaneously measured electroencephalography (EEG) and eye-tracking data to propose an interpretable model for gaze estimation from EEG data. More specifically, we present a novel attention-based deep learning framework for EEG signal analysis, which allows the network to focus on the most relevant information in the signal and discard problematic channels. Additionally, we provide a comprehensive evaluation of the presented framework, demonstrating its superiority over current methods in terms of accuracy and robustness. Finally, the study presents visualizations that explain the results of the analysis and highlights the potential of attention mechanisms for improving the efficiency and effectiveness of EEG data analysis in a variety of applications.
- MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild
  Item type: Conference Paper
  2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  Jiang, Zeren; Guo, Chen; Kaufmann, Manuel; et al. (2024)
  We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty. To tackle these challenges, we first define a layered neural representation for the entire scene, composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach, which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows its superiority over prior art on publicly available datasets and in-the-wild videos.
- WorldPose: A World Cup Dataset for Global 3D Human Pose Estimation
  Item type: Conference Paper
  Lecture Notes in Computer Science ~ Computer Vision – ECCV 2024
  Jiang, Tianjian; Billingham, Johsan; Müksch, Sebastian; et al. (2025)
  We present WorldPose, a novel dataset for advancing research in multi-person global pose estimation in the wild, featuring footage from the 2022 FIFA World Cup. While previous datasets have primarily focused on local poses, often limited to a single person or to constrained, indoor settings, the infrastructure deployed for this sporting event allows access to multiple fixed and moving cameras in different stadiums. We exploit the static multi-view setup of HD cameras to recover the 3D player poses and motions with unprecedented accuracy given capture areas of more than 1.75 acres (7k m²). We then leverage the captured players' motions and field markings to calibrate a moving broadcasting camera. The resulting dataset comprises 88 sequences with more than 2.5 million 3D poses and a total traveling distance of over 120 km. Subsequently, we conduct an in-depth analysis of the SOTA methods for global pose estimation. Our experiments demonstrate that WorldPose challenges existing multi-person techniques, supporting the potential for new research in this area and others, such as sports analysis. All pose annotations (in SMPL format), broadcasting camera parameters, and footage will be released for academic research purposes.
- ReLoo: Reconstructing Humans Dressed in Loose Garments from Monocular Video in the Wild
  Item type: Conference Paper
  Lecture Notes in Computer Science ~ Computer Vision – ECCV 2024
  Guo, Chen; Jiang, Tianjian; Kaufmann, Manuel; et al. (2025)
  While previous years have seen great progress in the 3D reconstruction of humans from monocular videos, few of the state-of-the-art methods are able to handle loose garments that exhibit large non-rigid surface deformations during articulation. This limits the application of such methods to humans that are dressed in standard pants or T-shirts. Our method, ReLoo, overcomes this limitation and reconstructs high-quality 3D models of humans dressed in loose garments from monocular in-the-wild videos. To tackle this problem, we first establish a layered neural human representation that decomposes clothed humans into a neural inner body and outer clothing. On top of the layered neural representation, we further introduce a non-hierarchical virtual bone deformation module for the clothing layer that can freely move, which allows the accurate recovery of non-rigidly deforming loose clothing. A global optimization jointly optimizes the shape, appearance, and deformations of the human body and clothing via multi-layer differentiable volume rendering. To evaluate ReLoo, we record subjects with dynamically deforming garments in a multi-view capture studio. This evaluation, both on existing and our novel dataset, demonstrates ReLoo's clear superiority over prior art on both indoor datasets and in-the-wild videos.
- Sensor-Based 3D Human Performance Capture
  Item type: Doctoral Thesis
  Kaufmann, Manuel (2024)
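Several of the entries above (MultiPly, ReLoo, HSR) rely on layered neural representations rendered via differentiable volume rendering, where densities and colors from multiple layers (e.g., body and clothing) are composited along each camera ray. As a rough, hypothetical sketch of that compositing step — standard volume-rendering quadrature over merged layers, not the authors' actual code:

```python
import numpy as np

def composite_layers(sigmas, colors, deltas):
    """Alpha-composite samples from multiple layers along one ray.

    sigmas: (L, N) per-layer densities at N samples along the ray
    colors: (L, N, 3) per-layer RGB values at those samples
    deltas: (N,) distances between consecutive samples
    """
    # Merge layers at each sample: densities add, colors are density-weighted.
    sigma = sigmas.sum(axis=0)                       # (N,)
    w = sigmas / np.clip(sigma, 1e-8, None)          # (L, N)
    color = (w[..., None] * colors).sum(axis=0)      # (N, 3)

    # Standard volume-rendering quadrature along the ray.
    alpha = 1.0 - np.exp(-sigma * deltas)            # (N,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans                          # (N,)
    return (weights[:, None] * color).sum(axis=0)    # (3,) rendered pixel
```

Because every operation is differentiable, gradients of a photometric loss on the rendered pixel can flow back into each layer's density and color fields, which is what allows the layered models to be learned from video alone.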
- Towards Egocentric Understanding of Surgery
  Item type: Conference Paper
  2025 International Conference on Intelligent Computing and Virtual & Augmented Reality Simulations (ICVARS)
  Yavuz, Ahmetcan; Gultekin, Cagatay; Wang, Xi; et al. (2025)
  Modern surgeries are complex and cognitively demanding, creating a need for advanced tools to assist medical staff, reduce cognitive load, and ultimately improve patient outcomes. Computational models with a holistic understanding of the surgical scene, interactions, and context hold the promise to support surgeons in this task, especially when including an egocentric perspective. With advances in learning-based machine perception from images, creating these models is within reach, provided that corresponding data can be acquired. In this study, we explore the creation and processing of egocentric surgical video data, collected using a head-worn recording device, i.e., Meta's Project Aria glasses. Along with addressing challenges in data processing, we investigate the performance of image annotation pipelines to establish high-quality labels. To showcase tasks such a dataset enables, we then evaluate state-of-the-art segmentation and 3D human hand and body pose estimation models. Our results highlight the complexities of working in a real clinical environment and provide insights for future improvements in the curation of egocentric datasets of surgical activity.
- HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos
  Item type: Conference Paper
  Lecture Notes in Computer Science ~ Computer Vision – ECCV 2024
  Xue, Lixin; Guo, Chen; Zheng, Chengwei; et al. (2025)
  An overarching goal for computer-aided perception systems is the holistic understanding of the human-centric 3D world, including faithful reconstructions of humans, scenes, and their global spatial relationships. While recent progress in monocular 3D reconstruction has been made for footage of either humans or scenes alone, the joint reconstruction of both humans and scenes, along with their global spatial information, remains an unsolved challenge. To address this, we introduce a novel and unified framework that simultaneously achieves temporally and spatially coherent 3D reconstruction of static scenes with dynamic humans from monocular RGB videos. Specifically, we parameterize temporally consistent canonical human models and static scene representations using two neural fields in a shared 3D space. Additionally, we develop a global optimization framework that considers physical constraints imposed by potential human-scene interpenetration and occlusion. Compared to separate reconstructions, our framework enables detailed and holistic geometry reconstructions of both humans and scenes. Furthermore, we introduce a synthetic dataset for quantitative evaluations. Extensive experiments and ablation studies on both real-world and synthetic videos demonstrate the efficacy of our framework in monocular human-scene reconstruction. Code and data are publicly available on our project page.
- Convolutional Autoencoders for Human Motion Infilling
  Item type: Conference Paper
  2020 International Conference on 3D Vision (3DV)
  Kaufmann, Manuel; Aksan, Emre; Song, Jie; et al. (2020)
  In this paper we propose a convolutional autoencoder to address the problem of motion infilling for 3D human motion data. Given a start and end sequence, motion infilling aims to complete the missing gap in between, such that the filled-in poses plausibly continue the start sequence and naturally transition into the end sequence. To this end, we propose a single, end-to-end trainable convolutional autoencoder. We show that a single model can be used to create natural transitions between different types of activities. Furthermore, our method is not only able to fill in entire missing frames, but it can also be used to complete gaps where partial poses are available (e.g. from end effectors), or to clean up other forms of noise (e.g. Gaussian). Also, the model can fill in an arbitrary number of gaps that potentially vary in length. In addition, no further post-processing on the model's outputs is necessary, such as smoothing or closing discontinuities at the end of the gap. At the heart of our approach lies the idea to cast motion infilling as an inpainting problem and to train a convolutional de-noising autoencoder on image-like representations of motion sequences. At training time, blocks of columns are removed from such images and we ask the model to fill in the gaps. We demonstrate the versatility of the approach via a number of complex motion sequences and report on thorough evaluations performed to better understand the capabilities and limitations of the proposed approach.
- ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation
  Item type: Conference Paper
  2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  Fan, Zicong; Taheri, Omid; Tzionas, Dimitrios; et al. (2023)
  Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC — a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines, ArcticNet and InterField, respectively, and evaluate them qualitatively and quantitatively on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de.
- Deep Inertial Poser: Learning to Reconstruct Human Pose from Sparse Inertial Measurements in Real Time
  Item type: Journal Article
  ACM Transactions on Graphics
  Huang, Yinghao; Kaufmann, Manuel; Aksan, Emre; et al. (2018)
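Deep Inertial Poser reconstructs full-body pose from only a handful of body-worn inertial measurement units in real time. A hypothetical sketch of the kind of per-frame input such a model might consume — six sensors, with readings normalized relative to a root (pelvis) sensor so features are invariant to global heading; the sensor count and normalization scheme here are illustrative assumptions, not the paper's exact preprocessing:

```python
import numpy as np

N_IMUS = 6  # e.g. head, wrists, lower legs, pelvis (root) in a sparse setup

def make_frame_input(rotations, accelerations, root=5):
    """Build one frame's input feature vector from sparse IMU readings.

    rotations:     (6, 3, 3) sensor orientations as rotation matrices
    accelerations: (6, 3)    linear accelerations
    Returns a (72,) vector: 6 flattened root-relative rotations (54 values)
    concatenated with 6 root-relative accelerations (18 values).
    """
    R_root_inv = rotations[root].T  # inverse of a rotation is its transpose
    # Express every sensor orientation in the root sensor's frame.
    R_norm = np.einsum('ij,njk->nik', R_root_inv, rotations)        # (6,3,3)
    # Subtract root acceleration and rotate into the root frame.
    a_norm = (R_root_inv @ (accelerations - accelerations[root]).T).T  # (6,3)
    return np.concatenate([R_norm.reshape(-1), a_norm.reshape(-1)])  # (72,)
```

A recurrent model would then map a stream of such frame vectors to full-body pose parameters; keeping the per-frame feature small is what makes real-time inference feasible.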