DPGMM-RNN Hybrid Model: Towards Universal Acoustic Modeling to ASR at Different Supervised Levels

Bin Wu


Because methods for unsupervised and supervised learning have developed independently, unsupervised phoneme discovery and supervised speech recognition are usually treated differently. Yet both tasks need acoustic modeling to find the patterns that form perceptual units such as phonemes and words; they differ only in their level of supervision. It is therefore reasonable to regard unsupervised phoneme discovery as unsupervised ASR, which finds units from speech without text. We propose universal acoustic modeling, instead of separate models, for supervised and unsupervised ASR, covering the whole process from acoustic waveform to speech units.

This study aims to construct universal acoustic modeling for speech recognition at different levels of supervision. Specifically, it proposes a hybrid model that combines the Dirichlet process Gaussian mixture model with a recurrent neural network (DPGMM-RNN). The proposed approach is used (1) to improve phoneme categorization by relieving the fragmentation problem and (2) to extract perceptual features that improve ASR performance.
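
As a rough illustration of the DPGMM half of the hybrid, the sketch below clusters synthetic frames with scikit-learn's `BayesianGaussianMixture` under a truncated Dirichlet process prior. The synthetic 13-dimensional frames, the helper name `dpgmm_frame_labels`, the truncation level, and the diagonal covariance are assumptions made for illustration and are not taken from this work. How the RNN consumes these outputs is not specified in the text above; one plausible use, again an assumption, is to treat the frame-level posteriors as training targets for a recurrent network.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture


def dpgmm_frame_labels(frames, max_components=10, seed=0):
    """Cluster frames with a truncated DP-GMM (hypothetical helper).

    Returns hard cluster labels and soft posteriors for every frame.
    """
    dpgmm = BayesianGaussianMixture(
        n_components=max_components,  # truncation level: upper bound on clusters
        weight_concentration_prior_type="dirichlet_process",
        covariance_type="diag",       # assumption: diagonal covariances
        max_iter=200,
        random_state=seed,
    )
    dpgmm.fit(frames)
    posteriors = dpgmm.predict_proba(frames)  # shape (n_frames, max_components)
    labels = posteriors.argmax(axis=1)        # most probable cluster per frame
    return labels, posteriors


# Synthetic stand-in for acoustic features: three well-separated groups
# of 13-dimensional frames (not real speech data).
rng = np.random.default_rng(0)
frames = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(100, 13)) for c in (-3.0, 0.0, 3.0)]
)

labels, posteriors = dpgmm_frame_labels(frames)
print("clusters actually used:", np.unique(labels).size)
```

The number of clusters is not fixed in advance: `max_components` only caps it, and the Dirichlet process prior lets unused components shrink toward zero weight.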