In this thesis, we propose modified speaker adaptive training (SAT) methods for building a canonical model for non-audible murmur (NAM) adaptation, so that a larger amount of normal speech data, transformed into NAM data, becomes available for training.
NAM is a type of speech that people produce with no intention of communicating with others, much like murmuring. With very sensitive microphones, however, this type of speech can also be used for communication. This presentation first gives a short description of the acoustic properties of NAM and then explains the basic framework of a speech recognition system. The only difference between a NAM recognizer and a normal speech recognizer is the acoustic model, so improving NAM recognition performance requires a more accurate acoustic model. Two methods for building a NAM acoustic model, both previously investigated by other researchers, are briefly described: a speaker-independent acoustic model and a canonical model built with the SAT technique. Even though SAT improves performance, there is still room for improvement.
The limitations of the above-mentioned methods for developing a NAM acoustic model motivated us to investigate alternatives. Drawing on speech transformation techniques from speech synthesis, we investigated the performance of an acoustic model updated with both transformed normal speech and NAM data. Constrained maximum likelihood linear regression (CMLLR) transforms are used to map normal speech features into the NAM space. This conversion increases the number of utterances available for NAM acoustic model training. Two methods were investigated and are presented here. The first is a SAT method that uses normal speech transformed with only speaker-dependent CMLLR transforms. The second uses a more complex transformation structure with a factorized form of the CMLLR transforms. The latter method was also combined with a maximum a posteriori (MAP) update in order to further increase accuracy.
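As a rough illustration of the feature mapping described above, a CMLLR transform acts on each feature vector as an affine map, x' = A x + b. The sketch below applies such a map in plain Python. The function names, the toy matrices, and the reading of a "factorized" transform as a cascade of two affine maps are assumptions made for illustration only; they are not the exact structure or the estimated transforms used in the thesis.

```python
def apply_affine(frame, A, b):
    """Apply x' = A x + b to one feature vector given as a plain list."""
    return [
        sum(a_ij * x_j for a_ij, x_j in zip(row, frame)) + b_i
        for row, b_i in zip(A, b)
    ]


def compose(A2, b2, A1, b1):
    """Return (A, b) such that A x + b == A2 (A1 x + b1) + b2.

    Hypothetical helper: shows how a cascade of two affine maps
    collapses into a single affine map.
    """
    n = len(A1)
    A = [
        [sum(A2[i][k] * A1[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
    b = [sum(A2[i][k] * b1[k] for k in range(n)) + b2[i] for i in range(n)]
    return A, b


# Toy 2-dimensional example (values are illustrative, not estimated from data).
frame = [1.0, 2.0]
A1, b1 = [[2.0, 0.0], [0.0, 1.0]], [1.0, -1.0]   # first (assumed) factor
A2, b2 = [[1.0, 1.0], [0.0, 1.0]], [0.0, 2.0]    # second (assumed) factor

step_by_step = apply_affine(apply_affine(frame, A1, b1), A2, b2)
A, b = compose(A2, b2, A1, b1)
single_pass = apply_affine(frame, A, b)
```

Applying the two factors one after the other and applying their composition give the same transformed vector, which is why a cascade of this kind can be treated as one affine map when the transformed normal speech is fed to acoustic model training.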
Finally, experimental results demonstrate that the proposed method yields significant improvements in NAM recognition accuracy compared to the conventional SAT method, since it can extract more information from normal speech data and apply it to the training of the NAM acoustic model. Moreover, the use of factorized transformations in the proposed method yields a slight improvement in NAM recognition performance. Future work will investigate regression tree generation in the SAT process.