Microphone Array Processing based on Blind Source Separation for Robust Distant Speech Recognition System

Fine Dwinita Aprilyanti ( 1361014 )


Distant speech recognition systems allow more flexibility in human-machine interaction by utilizing a single microphone or a microphone array to capture the user's utterance instead of a user-attached microphone. However, their performance is greatly affected by the presence of background noise and reverberation. Many array processing methods have been developed to improve the quality of the captured speech waveform in terms of human perception, under the assumption that this will also improve recognition performance. A speech recognition system, however, works as a statistical pattern classifier of features extracted from the speech waveform. Therefore, array front-end processing can only be expected to increase recognition performance if it maximizes the likelihood of the correct hypothesis.

In this talk, I present proposed array processing techniques based on blind source separation (BSS) that suppress background noise and late reverberation and are optimized to maximize the likelihood with respect to the acoustic model of the speech recognizer.

The first method utilizes frequency-domain blind signal extraction (BSE), an alternative to conventional BSS specifically designed for speech in the presence of diffuse noise, combined with two stages of multichannel Wiener filtering. This method is extended by integrating information from an image sensor to achieve optimum performance regardless of the interference level.

In the second method, we combine BSE with a multichannel generalized minimum mean-square error estimator of the short-time spectral amplitude (MMSE-STSA), which introduces less distortion to the output signal owing to its statistical model of the speech spectral amplitude and its decision-directed signal-to-noise ratio (SNR) estimation.

In the third method, we develop a source-adaptive BSS, in which the activation functions are parameterized according to the estimated statistical models of the target speech and the interference. The parameters of the activation function for the target speech are derived from the parameters of the optimized generalized MMSE-STSA postprocessing. Experimental evaluation shows that this method is more robust to different types of interference than conventional BSS and BSE methods.
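To make the decision-directed SNR estimation concrete, the following is a minimal sketch of the decision-directed a priori SNR estimator applied per time-frequency bin. For simplicity it outputs plain Wiener gains rather than the generalized MMSE-STSA gain used in the actual method; the function name, the single-channel setting, and the smoothing factor alpha=0.98 are illustrative assumptions, not the talk's implementation.

```python
import numpy as np

def decision_directed_gains(noisy_mag, noise_psd, alpha=0.98):
    """Decision-directed a priori SNR estimation with Wiener gains.

    noisy_mag : (frames, bins) STFT magnitudes of the noisy speech
    noise_psd : (bins,) estimated noise power spectral density
    alpha     : smoothing between the previous clean-speech estimate
                and the instantaneous SNR (illustrative value)
    """
    n_frames, n_bins = noisy_mag.shape
    gains = np.zeros((n_frames, n_bins))
    prev_clean_power = np.zeros(n_bins)
    for t in range(n_frames):
        # a posteriori SNR of the current frame
        gamma = noisy_mag[t] ** 2 / noise_psd
        # decision-directed a priori SNR: recursively smooth the
        # previous clean-speech power estimate with the current frame
        xi = (alpha * prev_clean_power / noise_psd
              + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0))
        g = xi / (1.0 + xi)  # Wiener gain (MMSE-STSA gain in the real method)
        gains[t] = g
        # feed the enhanced frame back for the next recursion
        prev_clean_power = (g * noisy_mag[t]) ** 2
    return gains
```

Because the previous frame's enhanced output feeds back into the current SNR estimate, the gain trajectory is smoothed over time, which reduces musical-noise artifacts compared with applying the instantaneous SNR directly.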
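The role of the activation (score) function in BSS can be illustrated with a simplified sketch: natural-gradient ICA on a real-valued instantaneous mixture, using a fixed tanh activation that matches super-Gaussian, speech-like (Laplacian) sources. The actual source-adaptive method operates in the frequency domain on complex spectra and adapts the activation parameters to the estimated source statistics; the mixing matrix, step size, and iteration count below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent super-Gaussian (Laplacian) sources, unit variance
n = 5000
s = rng.laplace(size=(2, n))
s /= s.std(axis=1, keepdims=True)

# Instantaneous two-channel mixture (stand-in for the microphone array)
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
x = A @ s

# Whiten the observations
x = x - x.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(x @ x.T / n)
z = (E / np.sqrt(d)) @ E.T @ x

# Natural-gradient ICA: W <- W + mu * (I - E[phi(y) y^T]) W,
# where phi is the activation function; tanh suits super-Gaussian sources
W = np.eye(2)
mu = 0.1
for _ in range(500):
    y = W @ z
    phi = np.tanh(y)
    W += mu * (np.eye(2) - phi @ y.T / n) @ W

y = W @ z  # separated source estimates (up to permutation and scale)
```

The update converges when the nonlinear decorrelation condition E[phi(y) y^T] = I holds, so the choice of phi encodes the assumed source distribution; parameterizing it per source, as in the third method, lets a single framework handle different target and interference statistics.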