All’s Well That FID’s Well? Result Quality and Metric Scores in GAN Models for Lip-Synchronization Tasks


Date

2025-09-01

Publication Type

Journal Article

ETH Bibliography

yes

Abstract

This exploratory study investigates the suitability of performance metrics for generative adversarial network (GAN)-based models for speech-driven facial animation. These models transfer speech information from an audio file to a still image to generate talking-head videos in a small-scale “everyday usage” setting. Two models, LipGAN and a custom implementation of a Wasserstein GAN with gradient penalty (L1WGAN-GP), are examined for their visual performance and their scores on commonly used metrics. Quantitative comparisons using FID, SSIM, and PSNR on the GRIDTest dataset show mixed results: the metrics fail to capture the local artifacts crucial for lip synchronization, pointing to limitations in their applicability to video animation tasks. The study highlights the inadequacy of current quantitative measures and emphasizes the continued necessity of human qualitative assessment for evaluating talking-head video quality.
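To make the metrics named in the abstract concrete, the sketch below illustrates two of them, PSNR and a simplified SSIM, as per-frame image comparisons. This is not code from the paper: it is a minimal illustration assuming 8-bit grayscale frames as NumPy arrays, and it computes SSIM over the whole image rather than with the sliding Gaussian window of the full definition. FID is omitted because it requires activations from a pretrained Inception network.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    # Peak signal-to-noise ratio in dB; higher means the test frame
    # is pixel-wise closer to the reference frame.
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim_global(ref, test, max_val=255.0):
    # Simplified single-window SSIM: mean, variance, and covariance are
    # taken over the whole frame instead of local Gaussian windows.
    # C1 and C2 are the standard stabilizing constants.
    x = ref.astype(np.float64)
    y = test.astype(np.float64)
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

Because both scores aggregate over the whole frame, a small but perceptually critical error around the mouth region barely moves either number — which is exactly the mismatch between metric scores and lip-sync quality that the study reports.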

Publication status

published

Volume

14 (17)

Pages / Article No.

3487

Publisher

MDPI

Subject

speech-driven facial animation; generative adversarial networks (GANs); lip synchronization; image-to-video synthesis; audio-driven talking-head generation; evaluation metrics
