Parameter Uncertainty and Multi-sensor Attention Models for End-to-end Speech Recognition
- Doctoral Thesis
Rights / license: In Copyright - Non-Commercial Use Permitted
Over the past decades, the dominant approach to building automatic speech recognition (ASR) systems has been a complex combination of separately optimized pre-processing, acoustic model, and language model components. The recently proposed end-to-end models for ASR present a significant simplification over conventional ASR systems: they transcribe input speech to output text with a single neural network that is optimized in a single training stage. While the single model and training stage are a welcome simplification of the ASR system, they are also largely incompatible with past research that went into optimizing the separate components of conventional ASR systems. Furthermore, the monolithic neural network in end-to-end models remains a black box with millions of parameters, and the contribution of specific parameters to model accuracy is poorly understood. As a consequence, the accuracy of conventional ASR systems is still higher, and end-to-end models require new strategies to improve.

This thesis aims to advance the state of the art in end-to-end models for ASR, with a focus on improving noise robustness and model interpretability. The contributions cover novel training strategies and neural network architectures, and three main contributions can be identified. First, a curriculum learning strategy is presented that improves noise robustness over conventional training methods. The network training follows a signal-to-noise ratio (SNR) curriculum that starts training at low SNR levels and gradually exposes the network to higher SNR levels as training proceeds. Second, a sensory attention mechanism is integrated into the end-to-end model, adding only a fraction of the total parameters. The attention mechanism allows the model to extract information from multiple input sensors and dynamically shift its attention towards less noisy sensors for improved accuracy.
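The SNR curriculum described above can be sketched as a schedule that anneals the mixing SNR over training epochs. This is a minimal illustration, not the thesis implementation: the function names, the linear schedule, and the 0-30 dB range are assumptions for the example.

```python
import numpy as np

def snr_curriculum(epoch, num_epochs, snr_start=0.0, snr_end=30.0):
    """Linearly anneal the training SNR from low (noisy) to high (clean).

    The endpoints (0 dB to 30 dB) and the linear schedule are illustrative
    assumptions; the thesis specifies only that training starts at low SNR
    and moves to higher SNR.
    """
    frac = min(epoch / max(num_epochs - 1, 1), 1.0)
    return snr_start + frac * (snr_end - snr_start)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Each minibatch would then be mixed at the SNR given by the current epoch, so early training sees heavily corrupted speech and later training sees cleaner speech.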
The attentional signal is highly interpretable, as it correlates with the sensor noise level. Third, the entire model architecture is changed by replacing the deterministic neural network parameters with probabilistic parameters. All network parameters are sampled from probability distributions with a learned degree of uncertainty, and the uncertainty information is interpreted as a proxy measure for parameter importance. The parameter importance information is used in parameter pruning to save computation, and in domain adaptation to increase noise robustness.
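The sensor-level attention can be sketched as a small scoring network that weights each sensor's features and fuses them into a single representation; the normalized weights are the interpretable attentional signal that tracks sensor noise. This is a hedged sketch with assumed shapes and parameter names, not the thesis architecture.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sensor_attention(sensor_feats, w, v):
    """Fuse features from multiple sensors with learned attention weights.

    sensor_feats: (num_sensors, feat_dim) -- one feature vector per sensor
    w: (feat_dim, hidden), v: (hidden,)   -- assumed scoring parameters

    Returns the attention-weighted feature mix and the per-sensor weights.
    """
    scores = np.tanh(sensor_feats @ w) @ v   # one scalar score per sensor
    alpha = softmax(scores)                  # weights sum to 1 across sensors
    fused = alpha @ sensor_feats             # (feat_dim,)
    return fused, alpha
```

In use, `alpha` can be logged per frame: if one sensor becomes noisy, its weight should drop, which is what makes the signal interpretable.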
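The probabilistic-parameter idea can be illustrated with a layer whose weights are Gaussian, sampled via the reparameterization trick, and pruned by relative uncertainty. The class name, the |mu|/sigma importance measure, and the quantile threshold are illustrative assumptions; the thesis states only that learned parameter uncertainty serves as a proxy for importance.

```python
import numpy as np

rng = np.random.default_rng(0)

class ProbabilisticLinear:
    """Linear layer whose weights are Gaussian: w ~ N(mu, sigma^2)."""

    def __init__(self, n_in, n_out):
        self.mu = rng.normal(0.0, 0.1, (n_in, n_out))
        self.log_sigma = np.full((n_in, n_out), -3.0)

    def forward(self, x):
        # Reparameterization: sample a weight matrix each forward pass.
        sigma = np.exp(self.log_sigma)
        w = self.mu + sigma * rng.standard_normal(self.mu.shape)
        return x @ w

    def prune_mask(self, keep_frac=0.5):
        """Keep parameters with high |mu|/sigma, i.e. low relative uncertainty.

        Parameters whose learned uncertainty is large relative to their mean
        are treated as unimportant and masked out to save computation.
        """
        importance = np.abs(self.mu) / np.exp(self.log_sigma)
        thresh = np.quantile(importance, 1.0 - keep_frac)
        return importance >= thresh
```

The same importance ranking could, in principle, guide domain adaptation by allowing uncertain (unimportant) parameters to change more than certain ones.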
External links: Search print copy at ETH Library
Organisational unit: 03774 - Hahnloser, Richard H.R.