PhonemeNet: A Transformer Pipeline for Text-Driven Facial Animation


Date

2025

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

We present a fully text-driven framework for 3D facial animation that eliminates the need for audio input or explicit prosodic cues. Our architecture extracts rich phoneme embeddings from text using a pre-trained TTS encoder, aligns them with quantized motion embeddings via a transformer decoder, and decodes the result into mesh deformations through a pre-trained motion decoder. We explore our pipeline in two scenarios: (1) In the single-subject setting, we find that phoneme embeddings alone can yield accurate lip motion. (2) In a multi-subject setting, where speaker articulation varies widely, we introduce stochastic latent modulation to model residual variability conditioned on both phoneme context and speaker identity. We evaluate our approach quantitatively and qualitatively: We demonstrate accurate lip sync in the single-subject case, and compare against audio-driven baselines on a large multi-subject dataset. Our results show that PhonemeNet not only achieves competitive lip sync and motion quality, but also offers flexibility, modularity, and scalability as an alternative to audio-driven facial animation.
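The staged pipeline in the abstract (phoneme embeddings, alignment to a quantized motion codebook, mesh decoding) can be illustrated with a toy sketch. This is not the authors' implementation: all dimensions, the linear stand-ins for the TTS encoder, transformer, and motion decoder, and the function `animate` are hypothetical, chosen only to show how a phoneme sequence flows through vector quantization to per-frame vertex offsets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): phoneme-embedding width,
# motion-codebook size/width, and mesh vertex count.
D_PHON, N_CODES, D_CODE, N_VERTS = 64, 128, 32, 500

phoneme_table = rng.normal(size=(40, D_PHON))     # stand-in for the TTS encoder
align_proj = rng.normal(size=(D_PHON, D_CODE))    # stand-in for the transformer
codebook = rng.normal(size=(N_CODES, D_CODE))     # quantized motion embeddings
decoder = rng.normal(size=(D_CODE, N_VERTS * 3))  # stand-in for the motion decoder

def animate(phoneme_ids):
    """Map a phoneme-ID sequence to per-frame mesh vertex offsets."""
    emb = phoneme_table[phoneme_ids]            # (T, D_PHON) phoneme embeddings
    query = emb @ align_proj                    # (T, D_CODE) alignment step
    # Quantize: snap each frame to its nearest codebook entry (VQ lookup).
    dists = ((query[:, None, :] - codebook[None]) ** 2).sum(-1)
    codes = codebook[dists.argmin(axis=1)]      # (T, D_CODE)
    return (codes @ decoder).reshape(-1, N_VERTS, 3)

offsets = animate(np.array([3, 17, 5, 22]))
print(offsets.shape)  # one (N_VERTS, 3) offset field per phoneme frame
```

The key structural point is the vector-quantization bottleneck: the continuous alignment output is snapped to a discrete motion codebook before decoding, which is what makes the motion representation "quantized" in the abstract's sense.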

Publication status

published

Book title

MIG '25: Proceedings of the 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games

Pages / Article No.

14

Publisher

Association for Computing Machinery

Event

18th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG 2025)

Subject

Text-driven animation; Facial animation; Transformers

Organisational unit

03420 - Gross, Markus

Funding

216294 - Data-Driven Animation Synthesis for Stylized Characters (SNF)
