Detecting emphasised spoken words by considering them prosodic outliers and taking advantage of HMM-based TTS framework
Abstract
This report investigates a fresh approach to detecting emphasised spoken words based on one-class classification, which avoids a major difficulty: collecting a large amount of well-annotated training data containing emphasis. The key idea is that rich context-dependent phone models are first trained on common, neutrally read speech data with the help of the HMM-based speech synthesis framework; emphasised words are then treated as prosodic outliers with respect to these “neutral” phone models and detected as such. Experiments were conducted on German speech data without any simplifying assumption (e.g. that there was only one emphasised word in each utterance). Under many conditions this universally applicable approach was found to outperform totally random guessing, even though the emphasised words constituted only a small portion (6.28%) of the test set. However, no single optimal set of configuration parameters applying to all test speakers was observed. For better performance, it is presumably necessary to collect a small amount of development data containing emphasis from a target speaker and then optimise the proposed detector in a “speaker-adaptive” manner.
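The outlier idea in the abstract can be illustrated with a minimal sketch. It is not the report's implementation: the rich context-dependent HMM phone models are replaced here by a single hypothetical Gaussian per phone (a predicted mean and variance for one prosodic value, such as log-F0), and the function names, feature values, and threshold are all assumptions made for illustration. A word is flagged when its average log-likelihood under the "neutral" predictions falls below a threshold.

```python
import math

def neutral_log_likelihood(observed, mean, var):
    """Log density of one observed prosodic value under a 1-D Gaussian
    standing in for a 'neutral' phone model (an assumption of this sketch)."""
    return -0.5 * (math.log(2 * math.pi * var) + (observed - mean) ** 2 / var)

def detect_emphasis(words, threshold):
    """Return the words whose average per-phone log-likelihood under the
    neutral predictions is below `threshold`, i.e. the prosodic outliers."""
    flagged = []
    for word, phones in words:
        scores = [neutral_log_likelihood(o, m, v) for (o, m, v) in phones]
        if sum(scores) / len(scores) < threshold:
            flagged.append(word)
    return flagged

# Hypothetical utterance: each phone is (observed, neutral mean, neutral variance).
words = [
    ("das",   [(5.0, 5.0, 0.04), (5.1, 5.0, 0.04)]),
    ("NICHT", [(5.6, 5.0, 0.04), (5.7, 5.1, 0.04)]),  # far from neutral prediction
    ("gut",   [(4.9, 5.0, 0.04)]),
]

print(detect_emphasis(words, threshold=-1.0))
```

Because only neutral data is needed to estimate the per-phone predictions, no emphasis-annotated training set is required; the threshold is the kind of configuration parameter that, according to the abstract, may need tuning per speaker.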
Publication status
published
Journal / series
TIK Report
Volume
Publisher
ETH Zurich, Computer Engineering and Networks Laboratory
Subject
Emphasis detection; Prosodic outlier; Rich context modelling; HMM-based speech synthesis
Organisational unit
03429 - Thiele, Lothar (emeritus) / Thiele, Lothar (emeritus)
Related publications and datasets
Is previous version of: http://hdl.handle.net/20.500.11850/116354
ETH Bibliography
yes