Neural Incremental Speech Recognition Towards Simultaneous Speech Translation

Sashi Novitasari (1811424)


A speech-to-speech translation (S2ST) system attempts to mimic human interpreters by translating speech from one language to another. As a pipeline, an S2ST system consists of three components: automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). Unlike a human interpreter, who can interpret simultaneously, a conventional S2ST system incurs high output latency. To enable automatic speech translation in real-time situations, a simultaneous S2ST system that operates with low delay is required. Among the S2ST components, ASR plays a critical role because it determines the performance and delay of a simultaneous S2ST system in the first place. Despite its remarkable performance, state-of-the-art attention-based neural ASR suffers from high recognition delay because its global attention mechanism must observe the entire input utterance before decoding, so it cannot be used for a simultaneous S2ST task. Several studies have recently proposed frameworks for incremental speech recognition (ISR) to achieve low-delay ASR; however, these frameworks require more complicated training mechanisms than standard neural ASR due to the nature of the incremental recognition task.
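To make the latency issue concrete, the following is a minimal sketch (in PyTorch; the function and variable names are illustrative, not from the thesis) contrasting global attention, which must wait for the full encoder output, with a blockwise variant that can attend as soon as a short input segment arrives.

```python
import torch
import torch.nn.functional as F

def global_attention(decoder_state, encoder_states):
    # encoder_states: (T, d) for the WHOLE utterance -- decoding can
    # only begin after all T frames are observed, so the delay grows
    # with the utterance length.
    scores = encoder_states @ decoder_state   # (T,) alignment scores
    weights = F.softmax(scores, dim=0)        # attend over all frames
    return weights @ encoder_states           # (d,) context vector

def blockwise_attention(decoder_state, encoder_block):
    # encoder_block: (B, d) for a short, fixed-size input segment.
    # Attention is restricted to the current block, so recognition can
    # start once the first block is available (low delay).
    scores = encoder_block @ decoder_state
    weights = F.softmax(scores, dim=0)
    return weights @ encoder_block
```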

Towards simultaneous S2ST, this thesis addresses the current ISR problem by investigating whether it is possible to perform ISR and handle input segmentation without introducing a complicated training mechanism. Using standard neural ASR as the base, the challenges are to (1) reduce recognition delay, (2) keep the system complexity unchanged, and (3) maintain recognition performance. We perform two main tasks to achieve this goal. First, we construct a neural ISR system using attention transfer from a standard neural ASR model with an identical structure. Here, the standard neural ASR acts as a teacher model that transfers its attention-based knowledge to the ISR model, the student. Our experiments show that the proposed ISR system, with a delay of 0.54 sec, achieves performance close to that of the teacher model, whose delay exceeds 6 sec. In the second task, we integrate the proposed ISR with an MT system and examine how the ISR affects translation performance. As our focus is on ISR, we explore various ISR-MT integration approaches by adapting the ISR output unit to achieve good translation performance. Our experiments on an English-French translation task show that an end-to-end ISR whose subword output representation matches the MT input side achieves the best speech recognition and translation performance among all integration approaches.
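As a rough illustration of the attention-transfer idea, the sketch below (PyTorch; the names and the choice of MSE as the transfer penalty are assumptions for illustration, not necessarily the exact formulation in the thesis) trains a student ISR model to match both the reference transcription and the teacher's attention distribution.

```python
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_logits, student_attn,
                            teacher_attn, targets, alpha=0.5):
    """Combine the usual cross-entropy ASR loss with a penalty that
    pulls the student's attention weights toward the teacher's.

    student_logits: (L, V)  per-token output logits
    student_attn:   (L, T)  student attention over input frames
    teacher_attn:   (L, T)  teacher attention (no gradient flows back)
    targets:        (L,)    reference token ids
    alpha:          weight of the transfer term (assumed value)
    """
    ce = F.cross_entropy(student_logits, targets)
    attn = F.mse_loss(student_attn, teacher_attn.detach())
    return ce + alpha * attn
```

Because teacher and student share an identical network structure, the teacher's attention matrix can be imitated directly, which is what lets the student learn low-delay recognition without any additional training machinery.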

Keywords: attention transfer, incremental speech recognition, simultaneous speech-to-speech translation