Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis

Shinnosuke Takamichi (1361007)


Speech is one of the most promising ways for people to communicate, and speech synthesis is a technique for generating speech waveforms with a computer. Its representative forms are Text-To-Speech (TTS) synthesis, which synthesizes speech from arbitrary text, and Voice Conversion (VC), which converts input speech into another speech carrying different speech information. Speech synthesis is actively studied to deploy speech-based applications that support human-to-human and human-to-computer communication, such as speech-to-speech translation systems and spoken dialogue systems.

Owing to recent developments in machine learning techniques and computational environments, statistical approaches have become the mainstream of speech synthesis. Although many state-of-the-art methods have been proposed, Hidden Markov Model (HMM)-based TTS and Gaussian Mixture Model (GMM)-based VC have gained popularity thanks to their solid mathematical foundations. However, the critical drawback of HMM-based TTS and GMM-based VC is significant degradation of synthetic speech quality: the synthetic speech often sounds muffled, and we can still distinguish it from natural speech. There are three main causes of this degradation: parameterization errors in the analysis/synthesis stage, insufficient modeling in the training stage, and the over-smoothing effect in the synthesis stage. This thesis mainly addresses the latter two.

One cause of the insufficiency in the modeling stage is that information about individual speech features is lost when they are averaged in statistical modeling. To address this problem, we integrate an idea from unit selection synthesis, another approach that directly uses speech waveform segments, into HMM-based TTS and GMM-based VC. Each speech feature or segment is first modeled by an individual acoustic model that is robust to unseen input features (called a rich context model); these models are then gathered and reformulated as a mixture model (called a Rich context GMM (R-GMM)). By using individual speech information, we achieve better speech quality than basic HMM-based TTS and GMM-based VC. Moreover, the proposed method can directly reuse the mathematical foundation of basic HMM-based TTS and GMM-based VC.
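To make the mixture reformulation concrete, the following Python sketch builds a mixture in which every training segment contributes its own component. The function names, the uniform mixture weights, and the single shared covariance are illustrative assumptions, not the thesis's exact formulation.

# Illustrative sketch: reformulating per-segment "rich context" models
# as a mixture (R-GMM). Uniform weights and a shared covariance are
# assumptions made for brevity.
import numpy as np

def build_rich_context_gmm(segment_features, shared_cov):
    """Each segment's feature vector becomes the mean of one mixture
    component; a shared covariance keeps each component robust to
    unseen inputs. Weights are assumed uniform here."""
    means = np.asarray(segment_features)            # (K, D): one mean per segment
    shared_cov = np.asarray(shared_cov)             # (D, D)
    K = means.shape[0]
    weights = np.full(K, 1.0 / K)                   # assumed uniform weights
    covs = np.broadcast_to(shared_cov, (K,) + shared_cov.shape)
    return weights, means, covs

def gmm_log_likelihood(x, weights, means, covs):
    """Log-likelihood of one feature vector under the mixture."""
    D = x.shape[0]
    log_probs = []
    for w, mu, cov in zip(weights, means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        quad = diff @ np.linalg.solve(cov, diff)
        log_probs.append(np.log(w) - 0.5 * (D * np.log(2 * np.pi) + logdet + quad))
    m = max(log_probs)                              # log-sum-exp for stability
    return m + np.log(sum(np.exp(lp - m) for lp in log_probs))

The key design point is that every segment keeps its own mean, so no individual information is lost to averaging, while the resulting object is still an ordinary mixture model to which standard GMM machinery applies.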

The over-smoothing effect is the main cause of quality degradation in the synthesis stage. One promising approach to alleviating it is to extract a specific feature that quantifies the over-smoothing effect and to generate speech parameters so that their corresponding features become closer to those of natural speech. Although the Global Variance (GV) is a well-known example of such a feature, the quality gap between natural and synthetic speech remains large. This thesis introduces a feature that correlates with the over-smoothing effect more sensitively than the GV: the Modulation Spectrum (MS). The MS of a speech parameter sequence is defined as the power spectrum of that sequence, and it can be regarded as a mathematical extension of the GV. This thesis also proposes an MS-based post-filter that modifies the MS of the generated speech parameters. Because this processing is performed separately from basic HMM-based TTS and GMM-based VC, the proposed post-filter not only improves synthetic speech quality but is also highly portable to various speech synthesis systems.
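The Python sketch below illustrates the GV and the MS of a one-dimensional parameter trajectory, together with a hypothetical log-domain post-filter that interpolates the generated MS toward natural statistics. The names natural_log_ms_mean and k, the FFT length, and the interpolation form are illustrative assumptions rather than the thesis's exact definitions.

# Illustrative sketch of the GV, the MS, and a hypothetical MS post-filter
# for a one-dimensional trajectory no longer than fft_len frames.
import numpy as np

def global_variance(traj):
    """GV: the variance of the parameter trajectory over the utterance."""
    return np.var(traj)

def modulation_spectrum(traj, fft_len=64):
    """MS: the power spectrum of the trajectory, i.e., how much energy
    each temporal modulation frequency carries."""
    return np.abs(np.fft.rfft(traj, n=fft_len)) ** 2

def ms_postfilter(traj, natural_log_ms_mean, k=0.85, fft_len=64):
    """Hypothetical post-filter: interpolate the log MS of a generated
    trajectory toward a natural-speech mean, keeping the phase."""
    spec = np.fft.rfft(traj, n=fft_len)
    log_ms = np.log(np.abs(spec) ** 2 + 1e-12)
    filtered = (1.0 - k) * log_ms + k * natural_log_ms_mean  # assumed weight k
    new_mag = np.exp(0.5 * filtered)                          # back to magnitude
    new_spec = new_mag * np.exp(1j * np.angle(spec))          # original phase kept
    return np.fft.irfft(new_spec, n=fft_len)[: len(traj)]

One way to see the "extension of the GV" claim: by Parseval's theorem, the variance of a sequence equals (up to normalization) the sum of its power spectrum over the nonzero modulation frequencies, so the GV collapses the whole MS into a single number while the MS keeps the per-frequency detail. Over-smoothed trajectories typically lose energy at higher modulation frequencies, which the MS exposes directly.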

We further propose algorithms that jointly optimize the basic acoustic models (HMMs and GMMs) and the statistical model of the MS. Whereas the proposed MS-based post-filter improves the MS criterion while degrading the basic criteria of the HMMs and GMMs, the proposed algorithms optimize these criteria jointly. We first integrate the MS criterion into speech parameter generation to directly alleviate the over-smoothing effect observed in the synthesis stage; the proposed objective function is iteratively updated to generate the synthetic speech parameters. As yet another approach, we integrate the MS criterion into the training stage to realize high-quality and computationally efficient speech synthesis: trajectory HMMs and GMMs are trained with an MS constraint by the proposed training algorithm. This lets us directly use the basic, computationally efficient parameter generation algorithm while the MS of the generated speech parameters is well compensated. We evaluate the proposed methods from various perspectives in HMM-based TTS and GMM-based VC, and we confirm the effectiveness of the proposed acoustic modeling and speech parameter generation.
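As a rough illustration of the generation-stage integration, the sketch below iteratively updates a one-dimensional trajectory c to increase a weighted sum of a Gaussian (HMM-side) log-likelihood and a Gaussian log-likelihood of the trajectory's log MS. The weight w, the diagonal precision, and especially the finite-difference gradient are simplifications for illustration; the thesis derives analytic iterative updates rather than using finite differences.

# Illustrative sketch: parameter generation under a joint HMM + MS objective
# for a single one-dimensional trajectory.
import numpy as np

def objective(c, mu, prec, nat_ms_mean, nat_ms_var, w=1.0, fft_len=64):
    # HMM-side term: Gaussian log-likelihood with diagonal precision (assumed).
    diff = c - mu
    hmm_term = -0.5 * diff @ (prec * diff)
    # MS-side term: Gaussian log-likelihood of the trajectory's log MS.
    log_ms = np.log(np.abs(np.fft.rfft(c, n=fft_len)) ** 2 + 1e-12)
    ms_term = -0.5 * np.sum((log_ms - nat_ms_mean) ** 2 / nat_ms_var)
    return hmm_term + w * ms_term

def generate(mu, prec, nat_ms_mean, nat_ms_var, steps=100, lr=1e-3, eps=1e-5):
    c = mu.copy()                          # start from the mean trajectory
    for _ in range(steps):
        grad = np.zeros_like(c)
        for t in range(len(c)):            # finite-difference gradient (sketch only)
            d = np.zeros_like(c)
            d[t] = eps
            grad[t] = (objective(c + d, mu, prec, nat_ms_mean, nat_ms_var)
                       - objective(c - d, mu, prec, nat_ms_mean, nat_ms_var)) / (2 * eps)
        c += lr * grad                     # gradient ascent on the joint objective
    return c

The trade-off the sketch makes visible is the one the thesis addresses: optimizing at generation time recovers the MS directly but needs iterations per utterance, whereas MS-constrained training keeps the fast closed-form generation step and compensates the MS in the models themselves.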