Directly translating spoken utterances from a source language to a target language is a challenging task, as it requires a fundamental transformation of both linguistic and para-/non-linguistic features. Traditional speech-to-speech translation approaches concatenate automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech synthesis (TTS), passing only text between the components. The performance of such cascaded speech translation is worse than that of text MT because the translation result is affected by ASR errors. An end-to-end speech translation system can potentially recover from ASR errors and achieve higher performance than the cascaded text MT approach.
Recent state-of-the-art models for ASR, MT, and TTS have mainly been built using deep neural networks (DNNs), in particular encoder-decoder models with an attention mechanism. Several works have attempted to construct end-to-end direct speech translation using DNNs. However, the usefulness of these models has only been investigated on language pairs with similar syntax and word order (e.g., English-French or English-Spanish). For syntactically distant language pairs (e.g., English-Japanese), speech translation requires long-distance word reordering.
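For reference, the following is a minimal sketch of such an attentional encoder-decoder (written here in PyTorch with additive attention; the layer sizes and class names are illustrative assumptions, not the exact architectures of the works discussed in this thesis):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionalEncoderDecoder(nn.Module):
    """Minimal encoder-decoder with additive (Bahdanau-style) attention."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTMCell(dim + 2 * dim, dim)   # input: tgt emb + context
        self.attn_W = nn.Linear(2 * dim + dim, dim)
        self.attn_v = nn.Linear(dim, 1, bias=False)
        self.out = nn.Linear(dim, tgt_vocab)

    def attend(self, enc_out, dec_h):
        # Additive attention: score every encoder state against the decoder state.
        query = dec_h.unsqueeze(1).expand(-1, enc_out.size(1), -1)
        scores = self.attn_v(torch.tanh(self.attn_W(torch.cat([enc_out, query], -1))))
        weights = F.softmax(scores, dim=1)        # (B, T_src, 1)
        return (weights * enc_out).sum(dim=1)     # context vector (B, 2*dim)

    def forward(self, src, tgt):
        enc_out, _ = self.encoder(self.src_emb(src))        # (B, T_src, 2*dim)
        B, dim = src.size(0), self.out.in_features
        h = c = torch.zeros(B, dim, device=src.device)
        logits = []
        for t in range(tgt.size(1)):                        # teacher forcing
            context = self.attend(enc_out, h)
            step_in = torch.cat([self.tgt_emb(tgt[:, t]), context], -1)
            h, c = self.decoder(step_in, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                   # (B, T_tgt, tgt_vocab)
```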
This thesis addresses syntactically distant language pairs that suffer from long-distance reordering. We focus mainly on the English (subject-verb-object (SVO) word order) and Japanese (SOV word order) language pair. First, we propose a neural speech translation model that does not require significant changes to the cascaded ASR and MT structure. Specifically, we construct a neural network model that passes all ASR candidate scores to the MT component. The MT component can then consider the ASR hypotheses during translation and learn how to recover from ASR errors. We demonstrate how this acoustic information helps recover ASR errors and improves translation quality in our proposed model.
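One way such a soft interface between ASR and MT can be realized is to feed the MT encoder the expected word embedding under the ASR posterior distribution, rather than the embedding of the 1-best hypothesis alone. The sketch below illustrates this idea under that assumption (the class and argument names are hypothetical; the thesis's actual model may differ):

```python
import torch.nn as nn
import torch.nn.functional as F

class SoftASRInterface(nn.Module):
    """Illustrative soft interface: instead of embedding only the 1-best
    ASR hypothesis, embed the full per-step posterior over the vocabulary,
    so the MT encoder sees every candidate weighted by its score."""
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)

    def forward(self, asr_logits):
        # asr_logits: (B, T, V) raw scores from the ASR output layer.
        posteriors = F.softmax(asr_logits, dim=-1)
        # Expected embedding = posterior-weighted mixture of all word
        # embeddings, so low-confidence steps stay "soft" and the MT part
        # can still recover when the 1-best ASR word is wrong.
        return posteriors @ self.emb.weight        # (B, T, emb_dim)
```

The output sequence of expected embeddings can then replace the ordinary source-word embeddings at the input of an MT encoder such as the one sketched above.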
Next, we present the first attempt to build an end-to-end attentional speech translation system for syntactically distant language pairs that suffer from long-distance reordering. To guide the encoder-decoder attention model on this challenging problem, we construct end-to-end speech-to-text translation with transcoding and utilize curriculum learning (CL) strategies that gradually train the network toward the end-to-end speech translation task by adapting the decoder or encoder parts. We then focus on text-to-speech translation tasks and exploit speech information in the target text decoding process. Our proposed approach shows that the speech information helps target text generation, producing output closer to the reference sentences. Finally, we propose a complete end-to-end speech-to-speech translation system and compare its performance with state-of-the-art systems. Our experimental results show that the proposed approaches provide significant improvements over baseline end-to-end speech translation models.
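For concreteness, the skeleton below illustrates the staged curriculum idea of adapting the encoder and decoder on easier tasks before end-to-end training; the stage ordering, data splits, and helper names are assumptions for illustration, not the exact training recipe of the thesis:

```python
import torch

def train_stage(model, params, batches, loss_fn, lr=1e-3):
    """One curriculum stage: update only `params` while the rest of the
    model stays fixed, on whichever (input, target) pairs the stage uses."""
    opt = torch.optim.Adam(list(params), lr=lr)
    for inputs, targets in batches:
        opt.zero_grad()
        loss = loss_fn(model(inputs, targets), targets)  # teacher-forced logits
        loss.backward()
        opt.step()

# Hypothetical curriculum over an attentional ST model `st`
# (the batch iterables and loss `ce` are assumed to be defined elsewhere):
#   train_stage(st, st.encoder.parameters(), asr_batches, ce)  # speech -> transcript
#   train_stage(st, st.decoder.parameters(), mt_batches,  ce)  # text   -> translation
#   train_stage(st, st.parameters(),         st_batches,  ce)  # speech -> translation
```

The point of the schedule is that each early stage solves a simpler sub-problem (acoustic modeling, or long-distance reordering) before the network faces both at once in the final end-to-end stage.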