NAIST-IS-MT0851026: Yusuke Kisaki

Transform-Based Approach to Voice Quality Controllable Speech Synthesis Based on Hidden Markov Model

Yusuke Kisaki (0851026)

Text-to-speech (TTS) synthesis is a technology that generates speech from arbitrary text. Among statistical parametric speech synthesis systems, the hidden Markov model (HMM)-based speech synthesis system (HTS) has recently attracted great interest because of its compact and flexible modeling of spectral and prosodic parameters. HTS is expected to serve as a flexible TTS system that can easily generate speech with various voice qualities or speaking styles by converting HMM parameters. As one attempt at building a voice-quality-controllable HTS, the multiple-regression hidden semi-Markov model (MR-HSMM) has been proposed, in which the model parameters are represented as a linear combination of regression parameters and a low-dimensional voice quality control vector. The MR-HSMM allows us to manually control the voice quality of synthetic speech by manipulating the voice quality control vector. The conventional MR-HSMM training method usually requires a large amount of speech data from each of various speakers. However, recording such large speech data sets is laborious. What is often available instead is an unbalanced data set consisting of a large amount of speech data from only one speaker and a small amount of speech data from various other speakers. It is therefore desirable to develop techniques for robustly training the MR-HSMM even when only such unbalanced training data sets are available.
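The linear-combination form of the MR-HSMM described above can be illustrated with a minimal sketch. It assumes that the mean vector of one state is computed as a regression matrix applied to the control vector extended with a constant bias element; the bias element, the function name `mr_hsmm_mean`, and all numerical values are hypothetical and chosen only for illustration.

```python
def mr_hsmm_mean(regression_matrix, control_vector):
    """Return the mean vector H * [1, v_1, ..., v_K] of a single state.

    regression_matrix: D rows, each with K + 1 regression parameters.
    control_vector:    K-dimensional voice quality control vector.
    The leading 1 is an assumed bias term, not taken from the abstract.
    """
    xi = [1.0] + list(control_vector)
    for row in regression_matrix:
        if len(row) != len(xi):
            raise ValueError("each row needs len(control_vector) + 1 entries")
    return [sum(h * x for h, x in zip(row, xi)) for row in regression_matrix]


# Hypothetical state: 3-dimensional mean, 2-dimensional control vector.
H = [
    [1.0, 0.5, -0.2],
    [0.0, 0.1, 0.3],
    [2.0, -0.4, 0.0],
]

neutral_mean = mr_hsmm_mean(H, [0.0, 0.0])   # only the bias column contributes
shifted_mean = mr_hsmm_mean(H, [1.0, -1.0])  # moving the control vector moves the mean
```

In this sketch, changing the control vector changes every state mean linearly, which is what would make manual control of voice quality possible.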

In this presentation, we propose a transform-based training method that effectively reduces the number of parameters to be trained, thereby improving robustness when the amount of training data is limited. The conventional model-based training method directly represents the mean vectors of HSMMs as a linear combination of regression parameters and voice quality control vectors. However, the number of regression parameters is very large, which may cause over-training and degrade the quality of synthetic speech. In the proposed transform-based training method for the MR-HSMM, it is instead the transformations for HSMM adaptation that are represented as a linear combination of regression parameters and voice quality control vectors. Finally, we demonstrate the effectiveness of the proposed method through the results of cross-validation experiments and listening tests.
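The difference between the two parameterizations can be sketched as follows. This is only an illustration under stated assumptions, not the method as implemented in the thesis: it assumes that the adaptation transform is an affine matrix of size D x (D + 1) shared by all states in a regression class, that it is built as a weighted sum of base matrices with weights [1, v_1, ..., v_K], and that it is applied to a base mean vector extended with a trailing 1. All function names and sizes are hypothetical.

```python
def model_based_param_count(num_states, dim, control_dim):
    # One D x (K + 1) regression matrix per state.
    return num_states * dim * (control_dim + 1)


def transform_based_param_count(num_classes, dim, control_dim):
    # (K + 1) base transforms of size D x (D + 1) per regression class.
    return num_classes * (control_dim + 1) * dim * (dim + 1)


def combined_transform(base_transforms, control_vector):
    """Return W(v) = W_0 + v_1 * W_1 + ... + v_K * W_K (assumed form)."""
    xi = [1.0] + list(control_vector)
    if len(xi) != len(base_transforms):
        raise ValueError("need len(control_vector) + 1 base transforms")
    rows, cols = len(base_transforms[0]), len(base_transforms[0][0])
    return [
        [sum(x * w[r][c] for x, w in zip(xi, base_transforms)) for c in range(cols)]
        for r in range(rows)
    ]


def adapt_mean(transform, mean):
    """Apply an affine transform to a mean vector extended with a trailing 1."""
    extended = list(mean) + [1.0]
    return [sum(w * m for w, m in zip(row, extended)) for row in transform]


# Hypothetical sizes: 2000 states, 40-dimensional means, 2 control dimensions,
# and 8 regression classes sharing transforms.
n_model = model_based_param_count(2000, 40, 2)
n_transform = transform_based_param_count(8, 40, 2)

# Tiny example: W_0 is the identity with zero bias, W_1 only shifts the bias.
W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W1 = [[0.0, 0.0, 0.5], [0.0, 0.0, -0.5]]
base_mean = [1.0, 2.0]
unchanged = adapt_mean(combined_transform([W0, W1], [0.0]), base_mean)
shifted = adapt_mean(combined_transform([W0, W1], [2.0]), base_mean)
```

Under these assumptions, the transform-based parameter count grows with the number of regression classes rather than with the number of states, so sharing transforms across many states is what keeps the number of trained parameters small.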