Abstract
A speaker normalization scheme that uses explicit knowledge of acoustic phonetics is presented. The scheme warps the frequency axis linearly in critical band rate with respect to the fundamental frequency F0. It thus allows immediate adaptation to a new speaker, which is an advantage over commonly used schemes. Variants with different values of F0 and different parameters have been evaluated on several tasks of SpeechDat(II). The results show significant performance improvements on three tasks with monophone models; the most prominent result is a reduction in WER of 44.5% for an isolated digit task. However, the results achieved with tied triphone models are very modest. It is argued that the normalization scheme may still be correct but that the MFCC feature extraction erases its effect. Evidence is given for the need for a new feature extraction method that locates spectral peaks and ignores irrelevant portions of the spectrum.
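As an illustration of what an F0-dependent warp in critical band rate could look like, the following is a minimal sketch and not the report's actual method. It assumes Traunmüller's approximation for converting Hz to Bark, and it assumes the warp is a shift on the Bark axis proportional to the difference between the speaker's F0 and a reference F0. The names `warp_frequency`, `f0_ref_hz`, and `slope`, and their default values, are hypothetical.

```python
def hz_to_bark(f_hz: float) -> float:
    """Hz to critical band rate (Bark), Traunmueller's approximation (assumed here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z_bark: float) -> float:
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z_bark + 0.53) / (26.28 - z_bark)


def warp_frequency(f_hz: float, f0_hz: float,
                   f0_ref_hz: float = 150.0, slope: float = 1.0) -> float:
    """Shift a frequency on the Bark axis by an amount linear in the F0 offset.

    f0_ref_hz and slope are hypothetical parameters chosen for illustration;
    the abstract does not state the values or the exact form used.
    """
    shift = slope * (hz_to_bark(f0_hz) - hz_to_bark(f0_ref_hz))
    return bark_to_hz(hz_to_bark(f_hz) - shift)


if __name__ == "__main__":
    # A speaker with F0 above the reference has spectral content moved downward.
    for f in (500.0, 1500.0, 2500.0):
        print(f, round(warp_frequency(f, f0_hz=220.0), 1))
```

Because the shift depends only on a per-utterance F0 estimate, a warp of this kind needs no speaker-specific training data, which matches the abstract's point about immediate adaptation.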
Publication status
published
Journal / series
TIK Report
Publisher
ETH Zurich, Computer Engineering and Networks Laboratory
Subject
Speaker normalization; Frequency warping; Vocal tract length normalization; Human speech perception; Feature extraction
Organisational unit
03429 - Thiele, Lothar (emeritus)
ETH Bibliography
yes