To determine the mapping between speech and articulatory parameters, one can use the so-called statistical feature mapping technique. Statistical feature mapping is a family of methods that exploits the interdependence within data to build a model parameterizing the underlying statistical relationship between the elements of the data. These methods allow a robust and flexible generation procedure even when the intended data has never been observed exactly; in such cases, rule-based techniques, which here employ mathematical production models to relate speech and articulators, cannot give reliable results. Several notable approaches to mapping between speech and articulatory movements have been studied, including codebook-based mapping, neural networks, hidden Markov models (HMMs), and Gaussian mixture models (GMMs). In this thesis, we focus on GMM-based statistical feature mapping, which captures the statistical correspondence between speech and articulatory movements using multiple mixture densities, offers a simple and convenient training and conversion procedure, and requires no language- or text-specific input.
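As a minimal sketch of the idea, GMM-based feature mapping can be illustrated by fitting a joint GMM on concatenated source/target feature vectors and converting a source vector through the conditional expectation (MMSE) of the target given the source. The function names and the toy one-dimensional features below are illustrative assumptions, not the actual thesis implementation, which operates on real acoustic and articulatory parameter sequences:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4, seed=0):
    """Fit a joint GMM on concatenated source (X) and target (Y) features."""
    Z = np.hstack([X, Y])
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="full", random_state=seed)
    gmm.fit(Z)
    return gmm

def mmse_map(gmm, x, dx):
    """MMSE conversion E[y | x] under the joint GMM.

    dx is the dimensionality of the source part of the joint vector.
    """
    K = len(gmm.weights_)
    resp = np.empty(K)
    cond_means = []
    for k in range(K):
        mx, my = gmm.means_[k, :dx], gmm.means_[k, dx:]
        Sxx = gmm.covariances_[k][:dx, :dx]
        Syx = gmm.covariances_[k][dx:, :dx]
        # Posterior responsibility of mixture k given the source vector x.
        resp[k] = gmm.weights_[k] * multivariate_normal.pdf(x, mean=mx, cov=Sxx)
        # Conditional mean of the target part given x under component k.
        cond_means.append(my + Syx @ np.linalg.solve(Sxx, x - mx))
    resp /= resp.sum()
    return sum(r * m for r, m in zip(resp, cond_means))
```

The same machinery applies in both directions, which is why one joint model per mapping (inversion or production) suffices: only the roles of the source and target blocks change.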
To fully exploit the relationship between speech and articulators, we propose an articulatory controllable speech modification system based on statistical feature mappings, in particular the Gaussian mixture model. We perform a sequential procedure combining two feature mappings: GMM-based acoustic-to-articulatory inversion mapping and GMM-based articulatory-to-acoustic production mapping. This process allows one to modify an input speech signal by manipulating the unobserved articulatory parameters. To provide fine control of the articulators, we propose a method for controlling articulatory movements that considers their inter-dimensional and inter-frame correlations. Furthermore, to generate high-quality modified speech, we propose direct waveform modification methods that avoid the vocoder-based excitation generation process by directly filtering the speech signal according to the spectrum differences between the modified speech and the original one. Experimental results demonstrate that the proposed system produces high-quality modified speech while allowing convenient manipulation of the articulatory configurations.
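The direct waveform modification step can be sketched as applying the per-frame spectral ratio between the modified and original spectra to the original signal's short-time spectrum, reusing the original phase so that no vocoder excitation has to be generated. This is a simplified STFT-domain illustration under assumed analysis settings (16 kHz sampling, 512-sample frames); the thesis method filters the waveform itself, and the function and parameter names here are hypothetical:

```python
import numpy as np
from scipy.signal import stft, istft

def spectrum_difference_filter(x, mod_gain, fs=16000, nperseg=512):
    """Filter a waveform by the spectral ratio |S_mod| / |S_orig|.

    mod_gain is that ratio (scalar or an array broadcastable to the
    STFT shape). The original phase is kept, so no excitation signal
    needs to be synthesized.
    """
    _, _, S = stft(x, fs=fs, nperseg=nperseg)
    # Scale the magnitude of each time-frequency bin; phase is untouched.
    S_filtered = S * mod_gain
    _, y = istft(S_filtered, fs=fs, nperseg=nperseg)
    return y[:len(x)]
```

With a unit gain the signal is reconstructed unchanged, which is the sense in which the modification reduces to filtering the original waveform rather than resynthesizing it from vocoder parameters.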