All’s Well That FID’s Well? Result Quality and Metric Scores in GAN Models for Lip-Synchronization Tasks
OPEN ACCESS
Date
2025-09-01
Publication Type
Journal Article
ETH Bibliography
yes
Abstract
This exploratory study investigates the usability of performance metrics for generative adversarial network (GAN)-based models for speech-driven facial animation. These models transfer speech information from an audio file to a still image to generate talking-head videos in a small-scale “everyday usage” setting. Two models, LipGAN and a custom implementation of a Wasserstein GAN with gradient penalty (L1WGAN-GP), are examined for their visual performance and their scores on commonly used metrics. Quantitative comparisons using FID, SSIM, and PSNR on the GRIDTest dataset show mixed results, and the metrics fail to capture the local artifacts crucial for lip synchronization, pointing to limitations in their applicability to video animation tasks. The study highlights the inadequacy of current quantitative measures and emphasizes the continued necessity of qualitative human assessment for evaluating talking-head video quality.
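The frame-level metrics named in the abstract can be sketched as follows. This is an illustrative NumPy implementation, not the evaluation code used in the study: `psnr` follows the standard definition, while `global_ssim` is a simplified single-window SSIM (standard SSIM averages the same statistic over local sliding windows; FID additionally requires a pretrained Inception network and is omitted here).

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    """Peak signal-to-noise ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(ref, gen, max_val=255.0):
    """Simplified SSIM computed over the whole frame as one window.
    The standard metric averages this statistic over local windows."""
    x = ref.astype(np.float64)
    y = gen.astype(np.float64)
    c1 = (0.01 * max_val) ** 2  # stabilizing constants from the SSIM paper
    c2 = (0.03 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
```

Both functions reduce a whole frame to a single global statistic, which illustrates the limitation the abstract points to: an artifact confined to the small mouth region changes these scores only marginally, even when it ruins perceived lip synchronization.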
Publication status
published
Volume
14 (17)
Pages / Article No.
3487
Publisher
MDPI
Subject
speech-driven facial animation; generative adversarial networks (GANs); lip synchronization; image-to-video synthesis; audio-driven talking-head generation; evaluation metrics
