Iterative Estimation of Pronunciation Lexicons and Acoustic Models for Automatic Non-native Speech Recognition

辻岡 聡 (1451075)


Non-native speech differs significantly from native speech because non-native speakers are influenced by the pronunciation of their mother tongue or by a specific accent (e.g., dialect). The problems include (1) phone-level pronunciation mismatches, (2) word-level pronunciation mismatches, and (3) grammatical errors at the sentence level. Due to these factors, the performance of automatic speech recognition (ASR) on non-native speech often degrades significantly. In this thesis, we focus on the pronunciation problem at both the phone level and the word level.

Many pronunciation-learning methods exist, including knowledge-based and data-driven approaches. However, knowledge-based approaches usually require hand-crafted transformation rules, which are expensive and time-consuming to build. Data-driven approaches, on the other hand, can discover pronunciation variants from speech data, but they often suffer from over-fitting.

In this work, we propose an acoustic data-driven method of iterative pronunciation learning for non-native speech recognition, which automatically learns non-native pronunciations directly from speech in an iterative fashion. First, a grapheme-to-phoneme (G2P) tool predicts multiple candidate pronunciations for each word; then, the occurrence frequency of each pronunciation variant is estimated from the acoustic data of non-native speakers. Moreover, we investigate several conditions: (1) with knowledge of non-native pronunciation, (2) without knowledge of non-native pronunciation, and (3) without knowledge of non-native pronunciation but with data matched by English proficiency level. The variants without such knowledge use phoneme recognition results obtained from the non-native speech data. The proposed method prevents over-fitting and obtains suitable non-native pronunciations through iterative estimation. In addition, this iterative estimation can be readily combined with acoustic model adaptation.
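The iterative estimation described above can be sketched in simplified form as follows. This is a conceptual toy illustration, not the thesis implementation: the function names, the use of edit distance as a stand-in for acoustic alignment scores, and the count-based pruning threshold are all assumptions made for the sake of a self-contained example.

```python
from collections import Counter

def edit_distance(a, b):
    # Standard dynamic-programming edit distance between two phone
    # sequences; used here as a toy proxy for an acoustic alignment score.
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def estimate_lexicon(candidates, utterances, min_count=2, n_iters=3):
    """candidates: word -> list of candidate phone tuples (e.g., from a
    G2P tool). utterances: list of (word, observed phone tuple) pairs
    standing in for phoneme recognition results on non-native speech."""
    lexicon = dict(candidates)
    for _ in range(n_iters):
        counts = {w: Counter() for w in lexicon}
        for word, phones in utterances:
            # Pick the candidate pronunciation closest to the observed
            # phones, and count how often each variant is chosen.
            best = min(lexicon[word],
                       key=lambda p: edit_distance(p, phones))
            counts[word][best] += 1
        # Keep only variants observed frequently enough; discarding rare
        # variants is what suppresses over-fitting across iterations.
        lexicon = {w: [p for p, c in counts[w].items()
                       if c >= min_count] or lexicon[w]
                   for w in lexicon}
    return lexicon
```

For example, if the G2P candidates for "the" include both the canonical /dh ah/ and an accented /z ah/, and the non-native speech data mostly supports /z ah/, the rare variant is pruned while the frequent one is retained in the learned lexicon.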

We also investigate two types of combination: (1) combining the methods with and without knowledge of non-native pronunciation, and (2) combining the lexical adaptation method with acoustic model adaptation. Finally, we evaluate the approach on both mono-lingual and multi-lingual non-native speech recognition tasks.

We confirm that the proposed approach improves recognition accuracy in comparison to conventional ASR with standard pronunciation learning. The best performance is achieved by using the phoneme recognition results of each English proficiency level as training data for the G2P converter, in combination with the acoustic model adaptation method.