Parameter Uncertainty and Multi-sensor Attention Models for End-to-end Speech Recognition
OPEN ACCESS
Date
2019
Publication Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Over the past decades, the dominant approach to building automatic speech recognition (ASR) systems has been a complex combination of separately optimized pre-processing, acoustic model and language model components. The recently proposed end-to-end models for ASR are a significant simplification over conventional ASR systems: they transcribe input speech to output text with a single neural network that is optimized in a single training stage. While the single model and training stage are a welcome simplification of the ASR system, they are also largely incompatible with past research that went into optimizing the separate components of conventional ASR systems. Furthermore, the monolithic neural network in end-to-end models remains a black box with millions of parameters, and the contribution of specific parameters to model accuracy is poorly understood. As a consequence, the accuracy of conventional ASR systems is still higher, and end-to-end models require new strategies to improve. The objective of this thesis is to advance the state of the art in end-to-end models for ASR, with a focus on improving noise robustness and model interpretability. The contributions cover novel training strategies and neural network architectures; three main contributions can be identified. First, a curriculum learning strategy is presented that improves noise robustness over conventional training methods. Network training follows a signal-to-noise ratio (SNR) curriculum that starts at low SNR levels and gradually exposes the network to higher SNR levels as training proceeds. Second, a sensory attention mechanism is integrated into the end-to-end model, adding only a fraction of the total parameters. The attention mechanism allows the model to extract information from multiple input sensors and dynamically shift its attention towards less noisy sensors for improved accuracy. The attentional signal is highly interpretable, as it correlates with the sensor noise level. Third, the entire model architecture is changed by replacing the deterministic neural network parameters with probabilistic ones. All network parameters are sampled from probability distributions with a learned degree of uncertainty, and the uncertainty is interpreted as a proxy measure for parameter importance. This parameter-importance information is used for parameter pruning, which saves computation, and for domain adaptation, which increases noise robustness.
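The SNR curriculum in the first contribution can be illustrated with a minimal sketch: a schedule maps the training step to a target SNR (starting low and rising, as in the abstract), and clean speech is mixed with noise scaled to that SNR. The linear schedule, the 0–30 dB range, and all function names here are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def snr_schedule(step, total_steps, snr_start=0.0, snr_end=30.0):
    """Linear SNR curriculum (assumed shape): begin at a low SNR and
    move toward higher SNR levels as training proceeds."""
    frac = min(step / total_steps, 1.0)
    return snr_start + frac * (snr_end - snr_start)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio of the
    mixture equals the requested SNR in dB."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)
```

A training loop would call `snr_schedule` once per step and feed the network mixtures produced by `mix_at_snr` at the scheduled SNR.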
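The second contribution, the sensory attention mechanism, can be sketched as a small scoring layer whose softmax weights fuse per-sensor feature streams; because the scorer is a single linear map, it adds only a handful of parameters. The shapes, the frame-wise softmax over sensors, and the function names are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def sensory_attention(sensor_feats, w, b):
    """sensor_feats: (num_sensors, time, dim). A tiny linear layer scores
    each sensor per frame; a softmax across sensors yields attention
    weights, and the fused features are the weighted sum of sensors."""
    scores = sensor_feats @ w + b                 # (num_sensors, time)
    alpha = softmax(scores, axis=0)               # attention over sensors
    fused = np.sum(alpha[..., None] * sensor_feats, axis=0)  # (time, dim)
    return fused, alpha
```

The attention weights `alpha` are the interpretable signal the abstract refers to: a sensor that scores lower (e.g. because it is noisier, given a trained scorer) receives a smaller weight at each frame.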
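The third contribution, probabilistic parameters with uncertainty-driven pruning, can be sketched in the style of a mean-field Gaussian posterior: each weight has a mean and a learned uncertainty, weights are sampled by reparameterisation, and the ratio |mean|/stddev serves as the importance proxy for pruning. The softplus parameterisation, the importance criterion, and all names are assumptions for illustration, not the thesis's exact method.

```python
import numpy as np

rng = np.random.default_rng(0)

def softplus(x):
    """Map an unconstrained parameter rho to a positive stddev."""
    return np.log1p(np.exp(x))

def sample_weights(mu, rho):
    """Reparameterised sample: w = mu + softplus(rho) * eps, eps ~ N(0, 1)."""
    return mu + softplus(rho) * rng.standard_normal(mu.shape)

def prune_mask(mu, rho, keep_ratio=0.5):
    """Keep the weights with the highest |mu| / sigma; a weight whose
    uncertainty is large relative to its mean is read as unimportant."""
    importance = np.abs(mu) / softplus(rho)
    thresh = np.quantile(importance, 1.0 - keep_ratio)
    return importance >= thresh
```

Pruned weights (mask entries that are `False`) can be zeroed to save computation; the same importance measure could inform which parameters to adapt to a new noise domain.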
Publication status
published
Contributors
Examiner: Liu, Shih-Chii
Examiner: Hahnloser, Richard H.R.
Examiner: Mesgarani, Nima
Publisher
ETH Zurich
Organisational unit
03774 - Hahnloser, Richard H.R.