In the experiment, the image parameters for each HMM state are estimated by averaging the image parameters of all frames assigned to that state along the Viterbi alignment. Lip images are then synthesized as the sequence of image parameters corresponding to the HMM states in the Viterbi alignment of the input speech. Speech and image data for 316 Japanese words are recorded synchronously at 125 Hz; 216 of the 316 words are used to estimate the image parameters, and the remaining 100 words are used for testing. The speech data is converted to 16th-order mel-frequency cepstral coefficients. For the image data, three parameters, the width and height of the lip contour and the lip-contact protrusion, are defined from the independent parameters describing 3-D lip movement. A comparison experiment is conducted against the conventional method based on frame-by-frame parameter mapping through vector quantization (VQ).
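The state-conditioned averaging and synthesis steps can be summarized in a short sketch. This is a minimal illustration under assumed data structures, not the implementation used in the experiment: the Viterbi state sequences are assumed to be provided by an existing HMM-based forced aligner, the three-dimensional parameter vectors (width, height, protrusion) are assumed to be frame-synchronous with the alignment, and the function names (estimate_state_means, synthesize) are hypothetical.

import numpy as np
from collections import defaultdict

def estimate_state_means(alignments, image_params):
    """Average the image parameters (width, height, protrusion) of all
    frames assigned to the same HMM state across the training words.

    alignments   : list of per-word Viterbi state sequences,
                   each a sequence of state IDs, one per frame
    image_params : list of per-word arrays of shape (n_frames, 3),
                   synchronized with the state sequences
    """
    sums = defaultdict(lambda: np.zeros(3))
    counts = defaultdict(int)
    for states, params in zip(alignments, image_params):
        for s, p in zip(states, params):
            sums[s] += p
            counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

def synthesize(state_sequence, state_means):
    """Replace each state in the Viterbi alignment of the input speech
    with its mean image parameters, yielding one lip frame per frame."""
    return np.array([state_means[s] for s in state_sequence])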
The results are evaluated by the squared-error distance between the original and the synthesized image parameters. On the test data, the error distance of the HMM method is 0.33 cm, smaller than the 1.86 cm of the VQ method. Furthermore, to address the problem that the synthesized image parameters of /h/, /Q/, and silence depend strongly on the succeeding phoneme, a new HMM method is proposed. The improved method takes separate means of the image parameters for each group of succeeding phonemes, classified by viseme. This method achieves a further reduction of 0.25 cm compared with the simple averaging method.
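The viseme-conditioned averaging can be sketched in the same style. Grouping the states of /h/, /Q/, and silence by the viseme class of the succeeding phoneme follows the description above; the squared-error distance is not given a precise formula in the text, so the sketch assumes a root-mean-square Euclidean distance per frame. All names here (estimate_context_dependent_means, error_distance, context_states, next_visemes) are hypothetical.

import numpy as np
from collections import defaultdict

def estimate_context_dependent_means(alignments, image_params,
                                     next_visemes, context_states):
    """For states of the context-sensitive phonemes (/h/, /Q/, silence),
    keep a separate mean of the image parameters per succeeding viseme
    group; all other states keep a single context-independent mean.

    next_visemes   : per-word sequences giving, for each frame, the
                     viseme class of the succeeding phoneme
    context_states : set of state IDs belonging to /h/, /Q/, silence
    """
    sums = defaultdict(lambda: np.zeros(3))
    counts = defaultdict(int)
    for states, params, visemes in zip(alignments, image_params, next_visemes):
        for s, p, v in zip(states, params, visemes):
            # Key on (state, succeeding viseme) only where context matters.
            key = (s, v) if s in context_states else (s, None)
            sums[key] += p
            counts[key] += 1
    return {k: sums[k] / counts[k] for k in sums}

def error_distance(original, synthesized):
    """Assumed form of the squared-error distance: RMS Euclidean
    distance between original and synthesized parameter vectors (cm)."""
    return np.sqrt(np.mean(np.sum((original - synthesized) ** 2, axis=1)))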
These results demonstrate that the proposed HMM-based methods achieve more accurate lip image synthesis than the conventional VQ-based method.