Faithful 3D Avatars for Everyone: Prior-Guided Face Reconstruction for High-Quality Novel View Synthesis from Casual Few-Shot Captures


Date

2025

Publication Type

Doctoral Thesis

ETH Bibliography

yes

Abstract

Photorealistic digital avatars have many applications in areas ranging from virtual reality and telepresence to digital content creation and human-computer interaction. Today, digital artists create such avatars by capturing a target person in a professional capture dome with hundreds of studio-grade cameras. Reducing the input requirements from hundreds of studio cameras to just a few casually captured smartphone photos has the potential to democratize high-quality face modeling and make avatars widely accessible in scenarios where this was previously impossible. This thesis studies and develops techniques to reconstruct faces and render high-quality novel views from casual smartphone captures. The problem is inherently under-constrained because the sparse inputs provide only limited and possibly ambiguous information about the underlying geometry. Inspired by how digital artists would use their prior knowledge, this thesis leverages data-driven priors to resolve these ambiguities.

The thesis contributes several insights to the field. First, it establishes that data-driven priors can effectively resolve the fundamental ambiguities of 3D face reconstruction for novel view synthesis from sparse inputs. Second, it demonstrates that these priors can be learned from both real and synthetic datasets, with the latter showing surprising generalization to in-the-wild captures from the real world. Third, it shows that volumetric representations, when combined with appropriate priors, can capture fine-grained facial details such as wrinkles and eyelashes even from minimal input. These insights were developed through three interconnected technical contributions.

The foundation is laid by VariTex, a generative model of neural face textures. The key insight is that integrating a 3D Morphable Face Model and training a generative model in its texture space enables extreme novel poses (beyond 30 degrees) and expressions, even when the training data are strongly biased toward frontal, smiling faces. VariTex can sample novel identities and render extreme head poses beyond the distribution of the training set. Given a single image of a target person, it reconstructs neural face textures and renders novel views and expressions. As a drawback, the neural face texture representation in VariTex is defined only on the face surface, so non-surface regions like hair lack consistency across views.

Hence, the subsequent model, Preface, replaces the neural-texture-based representation with a Neural Radiance Field (NeRF). A major challenge is that NeRFs require dozens of views for training, which are infeasible to obtain from casual captures. To solve this, Preface proposes a data-driven volumetric prior that reduces the input requirements to as few as two images. The prior takes the form of an identity-conditioned NeRF that learns the distribution of faces, including non-surface components like glasses and hair. A sparse landmark-based alignment creates a smooth latent space that generalizes to previously unseen subjects, requiring only two input views at inference time.
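To make the idea of an identity-conditioned volumetric prior concrete, the sketch below shows one minimal way such a model could be structured in PyTorch. It is an illustrative assumption, not the Preface implementation: a shared radiance-field MLP receives a per-subject latent code alongside the encoded point and view direction, so a new subject can, in principle, be fitted from very few views by optimizing or lightly finetuning that code. All names, layer sizes, and dimensions are hypothetical.

```python
# Minimal sketch (not the thesis implementation) of an identity-conditioned
# radiance field: a shared MLP maps a 3D point, a view direction, and a
# per-subject latent code to density and color. Fitting a new subject from
# very few views then amounts to optimizing (or lightly finetuning) the code.
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    """Standard NeRF-style sinusoidal encoding of coordinates."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)
    angles = x[..., None] * freqs                       # (..., dim, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(start_dim=-2)                    # (..., dim * 2 * num_freqs)

class IdentityConditionedNeRF(nn.Module):
    def __init__(self, id_dim=64, hidden=256, num_freqs=6):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs + id_dim             # encoded point + identity code
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        # Color additionally depends on the (encoded) viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3 * 2 * num_freqs, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, points, view_dirs, id_code):
        # points, view_dirs: (N, 3); id_code: (id_dim,) broadcast to all samples
        h = self.trunk(torch.cat(
            [positional_encoding(points, self.num_freqs),
             id_code.expand(points.shape[0], -1)], dim=-1))
        sigma = torch.relu(self.density_head(h))         # (N, 1) volume density
        rgb = self.color_head(torch.cat(
            [h, positional_encoding(view_dirs, self.num_freqs)], dim=-1))
        return sigma, rgb

# Example: query 1024 random samples for one subject's latent code.
model = IdentityConditionedNeRF()
sigma, rgb = model(torch.rand(1024, 3), torch.rand(1024, 3), torch.randn(64))
```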
As a limitation, Preface requires a large dataset of real multi-view captures. Such data are expensive to capture, may exhibit demographic biases, and are subject to strict privacy protection regulations, all of which hinder general applicability. Fortunately, the third contribution of this thesis, Cafca, shows that a prior can be trained on purely synthetic data. The key insight is that such a purely synthetic prior can bridge the gap to real-world inputs through only minimal finetuning. Cafca first generates a synthetic dataset by combining a geometric face model with assets such as hair, clothing, and glasses, and rendering synthetic faces with diverse expressions. It then extends the identity-conditioned prior from Preface with expressions. The resulting synthetic prior generalizes to out-of-distribution inputs such as in-the-wild smartphone captures and stylized faces.

Together, these contributions enable high-quality face reconstruction and novel view synthesis from casual captures, making photorealistic digital avatars accessible in everyday scenarios previously limited to professional studios.
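For context, the following sketch shows the standard volume-rendering quadrature by which any NeRF-style field, including the prior-regularized fields discussed above, turns density and color samples along a camera ray into a pixel color for a novel view. The toy field and all parameters below are stand-ins, not taken from the thesis.

```python
# Minimal sketch of the volume-rendering quadrature used to synthesize a
# novel view from a radiance field. The toy field is a stand-in for the
# learned, prior-regularized fields; nothing here is thesis code.
import torch

def render_rays(field, origins, dirs, near=0.1, far=2.0, num_samples=64):
    """Alpha-composite color along each ray using uniformly spaced samples."""
    t = torch.linspace(near, far, num_samples)                           # (S,)
    points = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]   # (R, S, 3)
    sigma, rgb = field(points)                                           # (R, S), (R, S, 3)
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-sigma * delta)                              # opacity per sample
    # Transmittance: probability that the ray reaches sample i unoccluded.
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha[:, :-1]], dim=1), dim=1)
    weights = alpha * trans                                              # (R, S)
    return (weights[..., None] * rgb).sum(dim=1)                         # (R, 3) pixel colors

# Stand-in field: a soft sphere of radius 0.5 centered at the origin.
def toy_field(points):
    dist = points.norm(dim=-1)
    sigma = torch.relu(10.0 * (0.5 - dist))
    rgb = torch.sigmoid(points)               # arbitrary smooth coloring
    return sigma, rgb

origins = torch.zeros(8, 3); origins[:, 2] = -1.5                        # camera on -z axis
dirs = torch.nn.functional.normalize(
    torch.cat([torch.rand(8, 2) * 0.2 - 0.1, torch.ones(8, 1)], dim=1), dim=1)
pixels = render_rays(toy_field, origins, dirs)
print(pixels.shape)  # torch.Size([8, 3])
```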

Publication status

published

Contributors

Examiner: Gross, Markus
Examiner: Beeler, Thabo
Examiner: De la Torre, Fernando
Examiner: Wetzstein, Gordon

Publisher

ETH Zurich

Subject

Novel view synthesis; Data-driven priors; Photorealistic digital avatars; Volumetric representations; Sparse input reconstruction

Organisational unit

03420 - Gross, Markus / Gross, Markus
