Statistical waveform modification for speaking and singing voice conversion

Kazuhiro Kobayashi (1461002)


We present statistical waveform modification for speaking and singing voice conversion.
The variety of voice characteristics, such as voice timbre and fundamental frequency (F0) patterns, that an individual speaker can produce is inherently restricted by his/her own physical constraints imposed by the speech production mechanism.
In this presentation, in order to realize voice timbre control beyond these physical constraints, we focus on two kinds of voice timbre control approaches for a user: 1) converting the user's voice timbre into that of a specific target speaker/singer, and 2) controlling the user's voice timbre using voice timbre expression words as perceptually understandable cues.
Voice conversion (VC) based on a Gaussian mixture model (GMM) is a promising technique for enabling a user to produce speech sounds beyond his/her own physical constraints.
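At its core, GMM-based VC models the joint distribution of source and target features and converts each source feature by its conditional expectation under that model. The following is a minimal one-dimensional sketch with hypothetical mixture parameters (not the actual system described here), just to illustrate the mapping:

```python
import math

# Hypothetical 1-D joint-GMM parameters for illustration only:
# each mixture m models joint source/target features (x, y).
weights = [0.6, 0.4]   # mixture weights
mu_x = [0.0, 2.0]      # source means
mu_y = [1.0, 3.0]      # target means
var_x = [1.0, 1.0]     # source variances (Sigma_xx)
cov_yx = [0.5, 0.8]    # cross covariances (Sigma_yx)

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def convert(x):
    """Minimum mean-square-error mapping E[y|x] under the joint GMM."""
    # Posterior responsibility of each mixture given the source feature x.
    post = [w * gauss(x, m, v) for w, m, v in zip(weights, mu_x, var_x)]
    total = sum(post)
    post = [p / total for p in post]
    # Conditional mean of y in each mixture, weighted by the posterior.
    return sum(p * (my + cyx / vx * (x - mx))
               for p, my, cyx, vx, mx in zip(post, mu_y, cov_yx, var_x, mu_x))

y = convert(0.0)  # a source feature near mixture 1 maps close to mu_y[0]
```

In practice the features are multi-dimensional spectral parameters (e.g. mel-cepstra) and the mixture parameters are trained on parallel source/target utterances; the conditional-mean mapping itself has the same shape as above.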
However, GMM-based VC has not been used in practice because the sound quality of the converted voice is still significantly degraded in the conversion process compared with that of a natural speech waveform.
One of the biggest factors causing this quality degradation is the waveform generation process based on vocoding.
In the vocoding process, the converted voice is generated from the transformed F0, the converted aperiodicity, and the converted mel-cepstrum.
In this process, various factors, such as F0 extraction errors, unvoiced/voiced decision errors, and spectral parameterization errors caused by liftering, usually degrade the sound quality of the converted voice.
These factors are difficult to eliminate even when high-quality vocoding frameworks are used.
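The dependence of the generated waveform on the estimated F0 and unvoiced/voiced decisions can be seen even in a toy excitation generator. The sketch below is a deliberate oversimplification (a sine for voiced frames, white noise for unvoiced ones; real vocoders such as WORLD or STRAIGHT are far more elaborate), with an assumed sampling rate and hop size, but it shows why an F0 error or a wrong U/V decision directly distorts the synthesized waveform:

```python
import math
import random

FS = 16000   # sampling rate in Hz (assumed)
HOP = 80     # samples per frame, i.e. a 5-ms frame hop

def excitation(f0_per_frame, seed=0):
    """Toy vocoder excitation: a phase-accumulated sine for voiced
    frames (F0 > 0) and zero-mean white noise for unvoiced frames
    (F0 == 0).  A U/V decision error swaps periodic and noisy
    segments; an F0 extraction error shifts the perceived pitch."""
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for f0 in f0_per_frame:
        for _ in range(HOP):
            if f0 > 0:
                # Voiced frame: advance the phase by one F0 period step.
                phase += 2 * math.pi * f0 / FS
                out.append(math.sin(phase))
            else:
                # Unvoiced frame: aperiodic noise excitation.
                out.append(rng.uniform(-1.0, 1.0))
    return out

# One voiced frame at 100 Hz followed by one unvoiced frame.
sig = excitation([100.0, 0.0])
```

In a full vocoder this excitation would then be filtered by the spectral envelope recovered from the converted mel-cepstrum; errors at any of these stages accumulate in the output waveform.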
To address this issue, in this presentation we propose a statistical waveform modification technique for both the VC and the voice timbre control approaches, applied to speaking and singing voices.
The experimental results demonstrate that our proposed methods make it possible to convert and control voice timbre with higher sound quality.