Machine Speech Chain

Andros Tjandra



Abstract: Despite the close relationship between speech perception and production, research in automatic speech recognition (ASR) and text-to-speech synthesis (TTS) has progressed more or less independently, without exerting much mutual influence. In human communication, on the other hand, a closed-loop speech chain mechanism with auditory feedback from the speaker's mouth to her ear is crucial. We take a step further and develop a closed-loop machine speech chain model based on deep learning. The sequence-to-sequence model in a closed-loop architecture allows us to train our model on the concatenation of both labeled and unlabeled data. While ASR transcribes the unlabeled speech features, TTS attempts to reconstruct the original speech waveform based on the text from ASR. In the opposite direction, ASR also attempts to reconstruct the original text transcription given the synthesized speech. To the best of our knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors. Our experimental results show that the proposed approach significantly improves performance over separate systems trained only with labeled data.

In this thesis, I first present a study of end-to-end speech modeling in general, followed by its application to ASR and TTS. The basics of the machine speech chain are then described in detail in Chapter 3. Next, in Chapter 4, we integrate the speech chain with a speaker embedding model to achieve a multi-speaker speech chain and improve ASR and TTS performance in multi-speaker dataset settings. In Chapter 5, we address the issue that the output of ASR consists of discrete variables, and propose a way to fully backpropagate the loss from TTS to the ASR model using the straight-through estimator. In Chapter 6, we propose an alternative ASR training scheme based on reinforcement learning to resolve the discrepancy between the training and inference stages.
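The unsupervised part of the closed loop described above can be sketched as two reconstruction cycles. This is a minimal illustrative sketch, not the thesis implementation: `asr`, `tts`, and `loss` are hypothetical callables standing in for the sequence-to-sequence models and their training criteria.

```python
def unsupervised_cycle(speech_only, text_only, asr, tts, loss):
    """One unsupervised speech-chain step on unpaired data.

    speech_only: speech features without a transcription
    text_only:   text without recorded speech
    asr, tts:    model callables (stand-ins for the real networks)
    loss:        distance between a prediction and its target
    """
    # Speech -> text -> reconstructed speech: the reconstruction loss
    # trains TTS (and, via a straight-through estimator over the
    # discrete ASR output, can also reach the ASR parameters).
    y_hat = asr(speech_only)
    x_rec = tts(y_hat)
    tts_loss = loss(x_rec, speech_only)

    # Text -> synthesized speech -> reconstructed text: trains ASR
    # on speech synthesized from unpaired text.
    x_hat = tts(text_only)
    y_rec = asr(x_hat)
    asr_loss = loss(y_rec, text_only)

    return tts_loss + asr_loss
```

With perfectly inverse stand-in models (e.g. identities over a shared representation), both cycle losses vanish; during training, minimizing this sum pushes ASR and TTS toward being inverses of each other.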