Speech Chain for Semi-supervised Learning of Japanese-English Code-switching ASR
Sahoko Nakayama
Code-switching (CS) speech, in which speakers alternate between two or more languages within the same utterance, often occurs in multilingual communities. This phenomenon poses challenges for automatic speech recognition (ASR), since a system must handle multilingual input with unpredictable switching positions in an utterance. Although code-switching text and code-switching speech can be found in social media, parallel speech-text pairs of code-switching data suitable for training ASR are difficult to collect. Nevertheless, most existing approaches, developed for bilingual code-switching, have focused mainly on supervised learning with CS data. In this study, we enable ASR to learn code-switching in a semi-supervised fashion by utilizing a speech chain framework based on deep learning. We construct sequence-to-sequence models for Japanese-English code-switching ASR and text-to-speech (TTS) that are jointly trained through a loop connection. We first train the ASR and TTS systems separately on the parallel speech-text of monolingual Japanese and English data (supervised learning). After that, we run the speech chain with only code-switching text or code-switching speech (unsupervised learning). We also use TTS to generate speech for data augmentation and conduct two kinds of experiments: one using speech synthesized by TTS and one using natural speech. Both sets of experimental results reveal that this closed-loop architecture allows ASR and TTS to learn from each other and improves performance even without any parallel code-switching data.
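The closed-loop training idea summarized above can be sketched in miniature as follows. This is an illustrative toy only: the function names, the character-code "speech" representation, and the update steps are placeholders standing in for the paper's sequence-to-sequence ASR and TTS models, not the actual implementation.

```python
# Toy sketch of the speech-chain closed loop (all names illustrative).
# A stand-in "TTS" maps each character to a fake acoustic frame,
# and a stand-in "ASR" inverts it; real models would be seq2seq networks.

def tts(text):
    """Stand-in text-to-speech: one fake frame (a code point) per character."""
    return [ord(c) for c in text]

def asr(speech):
    """Stand-in speech recognizer: inverts the fake TTS."""
    return "".join(chr(f) for f in speech)

def chain_step_from_text(text):
    """Unpaired CS text: TTS synthesizes speech, ASR trains on (speech, text).

    In the real system, a loss between the ASR hypothesis and the input
    text would drive a gradient update of the ASR model; here we simply
    return the synthesized pair.
    """
    synthetic_speech = tts(text)
    hypothesis = asr(synthetic_speech)
    return synthetic_speech, hypothesis

def chain_step_from_speech(speech):
    """Unpaired CS speech: ASR transcribes, TTS trains on (transcript, speech).

    Analogously, a reconstruction loss between `reconstructed` and the
    input speech would update the TTS model in the real system.
    """
    transcript = asr(speech)
    reconstructed = tts(transcript)
    return transcript, reconstructed

# Example: a code-switched utterance with only text available.
speech, hyp = chain_step_from_text("mixed 日本語 and English")
```

The point of the loop is that each direction supplies the missing half of a training pair for the other: text-only CS data yields (synthetic speech, text) pairs for ASR, and speech-only CS data yields (transcript, speech) pairs for TTS.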