PhonemeNet: A Transformer Pipeline for Text-Driven Facial Animation
OPEN ACCESS
Author / Producer
Date
2025
Publication Type
Conference Paper
ETH Bibliography
yes
Abstract
We present a fully text-driven framework for 3D facial animation that eliminates the need for audio input or explicit prosodic cues. Our architecture extracts rich phoneme embeddings from text using a pre-trained TTS encoder, aligns them with quantized motion embeddings via a transformer decoder, and decodes the result into mesh deformations through a separate, pre-trained transformer decoder. We explore our pipeline in two scenarios: (1) in the single-subject setting, we find that phoneme embeddings alone can yield accurate lip motion; (2) in a multi-subject setting, where speaker articulation varies widely, we introduce stochastic latent modulation to model residual variability conditioned on both phoneme context and speaker identity. We evaluate our approach quantitatively and qualitatively: we demonstrate accurate lip sync in the single-subject case and compare against audio-driven baselines on a large multi-subject dataset. Our results show that PhonemeNet not only achieves competitive lip sync and motion quality but also offers flexibility, modularity, and scalability as an alternative to audio-driven facial animation.
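To make the described pipeline concrete, below is a minimal PyTorch sketch assembled from the abstract alone, not the authors' implementation. All names (PhonemeNetSketch, phoneme_proj, mesh_head, modulation), dimensions, and the Gaussian form assumed for the stochastic latent modulation are hypothetical; the pre-trained TTS encoder and pre-trained mesh decoder are stood in by simple linear layers.

import torch
import torch.nn as nn

class PhonemeNetSketch(nn.Module):
    # Hypothetical sketch of the described pipeline; names and sizes are
    # illustrative assumptions, not the paper's architecture.
    def __init__(self, phoneme_dim=256, model_dim=512, n_codes=1024,
                 n_vertices=5023, n_heads=8, n_layers=6):
        super().__init__()
        # Projection of phoneme embeddings taken from a frozen TTS encoder.
        self.phoneme_proj = nn.Linear(phoneme_dim, model_dim)
        # Codebook of quantized motion embeddings (e.g. from a VQ autoencoder).
        self.motion_codebook = nn.Embedding(n_codes, model_dim)
        # Transformer decoder that cross-attends motion tokens to phonemes.
        layer = nn.TransformerDecoderLayer(d_model=model_dim, nhead=n_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.to_logits = nn.Linear(model_dim, n_codes)
        # Stand-in for the separate pre-trained decoder that maps code
        # vectors to per-frame vertex offsets (n_vertices x 3).
        self.mesh_head = nn.Linear(model_dim, n_vertices * 3)
        # Stochastic latent modulation for the multi-subject setting: a
        # Gaussian conditioned on phoneme context and speaker identity
        # (one plausible reading of "stochastic latent modulation").
        self.modulation = nn.Linear(model_dim, 2 * model_dim)

    def forward(self, phoneme_emb, motion_tokens, speaker_emb=None):
        ctx = self.phoneme_proj(phoneme_emb)        # (B, T_text, D)
        tgt = self.motion_codebook(motion_tokens)   # (B, T_motion, D)
        if speaker_emb is not None:
            # Sample a residual latent and shift the motion-token stream.
            cond = ctx.mean(dim=1) + speaker_emb    # (B, D)
            mu, logvar = self.modulation(cond).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
            tgt = tgt + z.unsqueeze(1)
        h = self.decoder(tgt, ctx)                  # align motion to phonemes
        return self.to_logits(h), self.mesh_head(h)

# Example call with random stand-in inputs (2 clips, 40 phonemes, 100 frames):
model = PhonemeNetSketch()
logits, mesh = model(torch.randn(2, 40, 256),
                     torch.randint(0, 1024, (2, 100)),
                     speaker_emb=torch.randn(2, 512))

In the single-subject case the speaker_emb / modulation branch would simply be omitted, matching the abstract's finding that phoneme embeddings alone can yield accurate lip motion.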
Publication status
published
Book title
MIG '25: Proceedings of the 18th ACM SIGGRAPH Conference on Motion, Interaction, and Games
Pages / Article No.
14
Publisher
Association for Computing Machinery
Event
18th ACM SIGGRAPH Conference on Motion, Interaction, and Games (MIG 2025)
Subject
Text-driven animation; Facial animation; Transformers
Organisational unit
03420 - Gross, Markus
Funding
216294 - Data-Driven Animation Synthesis for Stylized Characters (SNF)