A Method for Synthesizing Lip Images from Speech Using HMMs

山本 英里 (9551206)


Synthesized lip movement images can compensate for the lack of auditory information for hearing-impaired people and can also help give computer agents a human-like face. This thesis proposes a novel method that synthesizes lip images by mapping from input speech using hidden Markov models (HMMs). Several earlier studies of speech-to-lip synthesis apply artificial neural networks (ANNs). The fundamental idea of the ANN method is a simple frame-by-frame mapping from speech parameters to image parameters. In contrast, the proposed HMM method optimizes the mapping from speech to lip images over the whole utterance by determining the maximum-likelihood state alignment. The lip image parameters can be estimated with the standard EM algorithm used in speech recognition.
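The maximum-likelihood state alignment mentioned above can be illustrated with a minimal Viterbi sketch. It is not the thesis implementation: the function name `viterbi`, the two-state left-to-right model, and all scores are hypothetical values chosen for illustration, and the per-frame emission log-likelihoods are assumed to be computed elsewhere from the speech parameters.

```python
import math

NEG_INF = float("-inf")

def viterbi(log_emit, log_trans, log_init):
    """Return the maximum-likelihood state sequence for one utterance.

    log_emit[t][s]  : log-likelihood of speech frame t under state s (assumed given)
    log_trans[r][s] : log transition probability from state r to state s
    log_init[s]     : log probability of starting in state s
    """
    n_frames, n_states = len(log_emit), len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at frame t
    for t in range(1, n_frames):
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

# Hypothetical two-state left-to-right model and four speech frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [[-1.0, -5.0], [-1.0, -5.0], [-5.0, -1.0], [-5.0, -1.0]]
path = viterbi(log_emit, log_trans, log_init)
```

Unlike a frame-by-frame mapping, the state chosen for each frame here depends on the scores of the whole utterance through the transition terms.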

In the experiment, the image parameters are estimated approximately by taking the mean of the image parameters that correspond to the same HMM state along the Viterbi alignment. Lip images are then synthesized as the sequence of image parameters corresponding to the HMM states along the Viterbi alignment of the input speech. Speech and image data for 316 Japanese words are recorded synchronously at 125 Hz. Of the 316 words, 216 are used to estimate the image parameters and 100 are used for testing. The speech data are converted to 16th-order mel-frequency cepstral coefficients. For the image data, three parameters, the width and the height of the lip contour and the lip contact protrusion, are defined from the independent parameters that describe 3-D lip movement. A comparison experiment is conducted against the conventional method, which maps parameters frame by frame through vector quantization (VQ).
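The per-state averaging and the lookup-based synthesis described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the thesis code: the function names, the integer state IDs, and the numeric values are hypothetical, and each frame is assumed to carry a three-element vector (width, height, protrusion) already aligned to an HMM state.

```python
def estimate_state_means(alignments, image_params):
    """Average the image parameter vectors aligned to each HMM state.

    alignments   : list of state-ID sequences, one per training utterance
    image_params : list of per-frame parameter vectors, one list per utterance
    """
    sums, counts = {}, {}
    for states, frames in zip(alignments, image_params):
        for state, vec in zip(states, frames):
            if state not in sums:
                sums[state] = [0.0] * len(vec)
                counts[state] = 0
            for i, value in enumerate(vec):
                sums[state][i] += value
            counts[state] += 1
    return {s: [x / counts[s] for x in sums[s]] for s in sums}

def synthesize(test_alignment, state_means):
    """Emit the stored mean vector for each state along a test alignment."""
    return [state_means[s] for s in test_alignment]

# Hypothetical training utterance: three frames aligned to states 0, 0, 1.
train_alignments = [[0, 0, 1]]
train_params = [[[1.0, 2.0, 0.0], [3.0, 4.0, 0.0], [5.0, 6.0, 1.0]]]
means = estimate_state_means(train_alignments, train_params)

# Hypothetical Viterbi alignment of an unseen utterance.
lip_track = synthesize([1, 0], means)
```

Because each state maps to a single mean vector, the synthesized track is piecewise constant within a state; any smoothing between states would be an additional step not described in the text.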

The results are evaluated by the squared error distance between the original and the synthesized image parameters. For the test data, the error distance of the HMM method is 0.33 cm smaller than the 1.86 cm of the VQ method. Furthermore, a new HMM method is proposed to solve the problem that the synthesized image parameters of /h/, /Q/, and silence depend strongly on the succeeding phoneme. The improved method takes separate means of the image parameters for each group of succeeding phonemes, classified by viseme. This method achieves a remarkable reduction of 0.25 cm compared with the simple averaging method.
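The evaluation measure and the viseme-dependent averaging can be sketched roughly as below. Several points are assumptions rather than facts from the thesis: the error is taken here as the mean per-frame Euclidean distance (the exact definition is not given above), the viseme labels "A" and "U" are hypothetical, and for brevity the context grouping is applied to every state passed in rather than only to /h/, /Q/, and silence.

```python
import math

def mean_error_distance(original, synthesized):
    """Mean per-frame Euclidean distance between two parameter sequences."""
    total = 0.0
    for o, s in zip(original, synthesized):
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(o, s)))
    return total / len(original)

def estimate_context_means(states, next_visemes, frames):
    """Mean image parameters keyed by (state, viseme class of the next phoneme)."""
    grouped = {}
    for state, viseme, vec in zip(states, next_visemes, frames):
        grouped.setdefault((state, viseme), []).append(vec)
    return {key: [sum(col) / len(vecs) for col in zip(*vecs)]
            for key, vecs in grouped.items()}

# Hypothetical frames of /h/ followed by two different viseme classes.
context_means = estimate_context_means(
    ["h", "h", "h"],
    ["A", "A", "U"],
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
)

# Hypothetical two-frame comparison; the first frame is off by a 3-4-5 triangle.
error = mean_error_distance([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                            [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
```

Keying the means on the succeeding viseme class lets a state such as /h/ take a different lip shape depending on what follows it, which is the dependence the improved method is meant to capture.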

These results lead to the conclusion that the proposed HMM methods achieve more accurate lip image synthesis than the conventional method.