Articulatory Controllable Speech Modification using Statistical Feature Mappings

Patrick Lumban Tobing (1451207)


Speech is the most universal way for people to communicate with each other. During speech production, our speech organs, the so-called articulators, such as the tongue and lips, move in particular ways to modulate the excitation signal generated by the vocal folds and produce the desired speech sounds. Speech can therefore be parameterized by slowly varying articulatory parameters. These parameters are more physically and intuitively interpretable than conventional speech parameters, such as the vocal tract spectrum. To exploit this potential, one can develop a system that modifies a speech signal through manipulation of the more intuitive articulatory parameters. Such a system would be useful for diverse speech applications, such as speech recovery, language-learning tools, and assistive technology in speech therapy.

To determine the mapping between speech and articulatory parameters, one can utilize a so-called statistical feature mapping technique. Statistical feature mapping is a family of methods that exploits paired data to build a model parameterizing the underlying statistical relationship between the elements of the data. Statistical feature mapping allows a robust and flexible generation procedure even for data that has never been exactly observed, a situation in which rule-based techniques, which in this case employ mathematical production models to relate speech and articulators, cannot give reliable results. Several notable approaches to mapping between speech and articulatory movements have been studied, such as codebook-based mapping, neural networks, hidden Markov models (HMMs), and Gaussian mixture models (GMMs). In this thesis, we focus on GMM-based statistical feature mapping, which captures the statistical correspondence between speech and articulatory movements using multiple mixture densities, offers a convenient training and conversion procedure, and is independent of any language- or text-specific input.
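To make the mapping concrete, the following sketch (a minimal illustration in plain NumPy with hypothetical variable names, not the exact formulation used in this thesis) computes the classical minimum mean-square-error conversion under a joint GMM over paired source and target features: the converted frame is the posterior-weighted sum of the per-component conditional means mu_y^(m) + Sigma_yx^(m) (Sigma_xx^(m))^(-1) (x - mu_x^(m)).

import numpy as np

def gmm_mmse_map(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Map a source frame x to E[y | x] under a joint GMM over [x; y].

    weights : (M,)        mixture weights
    mu_x    : (M, Dx)     source means
    mu_y    : (M, Dy)     target means
    S_xx    : (M, Dx, Dx) source covariances
    S_yx    : (M, Dy, Dx) cross covariances (hypothetical layout)
    """
    M = len(weights)
    # Posterior probability P(m | x) of each mixture component given x;
    # the constant term of the Gaussian cancels in the normalization.
    log_post = np.empty(M)
    for m in range(M):
        diff = x - mu_x[m]
        _, logdet = np.linalg.slogdet(S_xx[m])
        maha = diff @ np.linalg.solve(S_xx[m], diff)
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + maha)
    log_post -= log_post.max()
    post = np.exp(log_post)
    post /= post.sum()
    # Posterior-weighted sum of the per-component conditional means.
    y = np.zeros(mu_y.shape[1])
    for m in range(M):
        cond = mu_y[m] + S_yx[m] @ np.linalg.solve(S_xx[m], x - mu_x[m])
        y += post[m] * cond
    return y

In the proposed system, a mapping of this kind would be trained in both directions, once for the acoustic-to-articulatory inversion and once for the articulatory-to-acoustic production mapping described below.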

To fully exploit the relationship between speech and articulators, we propose an articulatory controllable speech modification system using statistical feature mappings, in particular Gaussian mixture models. We perform a sequential mapping procedure combining two feature mappings: GMM-based acoustic-to-articulatory inversion mapping and GMM-based articulatory-to-acoustic production mapping. This process allows one to modify an input speech signal by manipulating the unobserved articulatory parameters. To provide fine control of the articulators, we propose a method for controlling the articulatory movements that considers their inter-dimensional and inter-frame correlations. Furthermore, to generate high-quality modified speech, we propose direct waveform modification methods that avoid the vocoder-based excitation generation process by directly filtering the speech signal according to the spectrum differences between the modified and original speech. Experimental results demonstrate that the proposed system produces high-quality modified speech through convenient manipulation of articulatory configurations.
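As one hedged illustration of the direct waveform modification idea (a simple STFT-ratio realization, not the exact filter design of this thesis), the sketch below scales each short-time spectrum of the input waveform by the ratio of the modified to the original spectral envelope. The original excitation and phase of the input are retained, so no vocoder-based excitation generation is needed. The envelope inputs and the function name are hypothetical.

import numpy as np
from scipy.signal import stft, istft

def differential_filter(wave, fs, orig_env, mod_env, nperseg=1024):
    """Filter a waveform by the spectrum difference between a modified
    and an original spectral envelope.

    orig_env, mod_env : (F, T) magnitude spectral envelopes aligned to
    the STFT frames of `wave` (hypothetical inputs; in practice these
    would come from the production mapping before and after the
    articulatory manipulation).
    """
    f, t, Z = stft(wave, fs=fs, nperseg=nperseg)
    # Spectrum difference as a per-bin, per-frame magnitude ratio; the
    # phase and excitation structure of the input waveform are kept.
    ratio = mod_env / np.maximum(orig_env, 1e-10)
    Z_mod = Z * ratio
    _, wave_mod = istft(Z_mod, fs=fs, nperseg=nperseg)
    return wave_mod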