Improving Low-Resource Machine Translation through Syntactic and Contextual Information

Akiva Miura (1661017)


The current mainstream frameworks of machine translation (MT) are statistical MT and neural MT, which are characterized by the ability to acquire translation rules automatically through machine learning techniques. It has been observed that models trained on larger parallel corpora achieve higher translation accuracy, and usually millions of sentence pairs are required to produce high-quality translations. Unfortunately, readily available parallel corpora are limited for most language pairs, particularly those that do not include English, which often have few sentence pairs or none at all.

In this thesis, we focus on improving MT quality through two approaches to coping with the scarcity of bilingual corpora: (1) pivot translation and (2) active learning for MT.

Pivot translation is a useful method for translating between languages with little or no parallel data by utilizing parallel data in an intermediate (pivot) language such as English. Although various methods using pivot languages have been proposed, ambiguity in pivot-language expressions often causes incorrect selection of translation rules and harms translation quality. Pivot-side disambiguation is therefore a key issue in pivot translation. In this part, we propose two new pivot translation methods, each addressing one type of ambiguity. The first method addresses semantic ambiguity: it lets MT models remember the pivot phrase, so that appropriate translation rules can be selected by considering pivot-side context with pivot language models. The second method addresses syntactic ambiguity: it introduces an explicitly syntax-aware matching condition to find correct correspondences between source-pivot and pivot-target translation rules, producing more reliable models.
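To make the ambiguity problem concrete, the following is a minimal sketch of phrase-table triangulation, the standard way of combining a source-pivot and a pivot-target table through shared pivot phrases. The toy phrases and probabilities are invented for illustration; real systems combine several feature scores, and the thesis methods refine this matching step rather than use it as-is.

```python
def triangulate(src_pivot, pivot_tgt):
    """Combine source->pivot and pivot->target phrase tables by
    marginalizing over shared pivot phrases:
        p(tgt | src) = sum over piv of p(piv | src) * p(tgt | piv)
    """
    src_tgt = {}
    for (src, piv), p_sp in src_pivot.items():
        for (piv2, tgt), p_pt in pivot_tgt.items():
            if piv == piv2:  # rules are joined only via the pivot string
                key = (src, tgt)
                src_tgt[key] = src_tgt.get(key, 0.0) + p_sp * p_pt
    return src_tgt

# Toy example: the English pivot word "bank" is ambiguous, so rules for
# unrelated senses get connected. German "Ufer" (river bank) wrongly
# receives probability mass for Spanish "banco" (financial bank).
src_pivot = {("Ufer", "bank"): 0.5, ("Bank", "bank"): 0.5}
pivot_tgt = {("bank", "banco"): 0.6, ("bank", "orilla"): 0.4}
table = triangulate(src_pivot, pivot_tgt)
```

Because matching is done purely on the pivot surface string, the incorrect pair ("Ufer", "banco") ends up in the combined table, which is exactly the kind of error that pivot-side contextual and syntactic disambiguation aims to prevent.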

Active learning is a framework for efficiently training statistical models by selecting informative examples from a pool of unlabeled data. Previous work has found this framework effective for MT, making it possible to train better translation models with less annotation effort, particularly when annotators translate short phrases instead of full sentences. However, previous methods for phrase-based active learning in MT fail to consider whether the selected units are coherent and easy for human translators to translate, and they also tend to select redundant phrases with similar content. In this part, we propose two new methods for selecting more syntactically coherent and less redundant segments in active learning for MT.
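As a point of reference for the selection step, the sketch below implements a simple frequency-based criterion of the kind used in prior phrase-based active learning: rank n-grams from an untranslated pool by frequency and pick the most frequent ones not yet covered by existing training data. The function and data are illustrative assumptions, not the thesis methods; note that nothing here checks syntactic coherence or redundancy, which is the gap the proposed methods address.

```python
from collections import Counter

def select_phrases(pool, known_phrases, max_len=4, budget=3):
    """Pick the `budget` most frequent n-grams (up to `max_len` words)
    from `pool` that are not already in `known_phrases`."""
    counts = Counter()
    for sent in pool:
        toks = sent.split()
        for n in range(1, max_len + 1):
            for i in range(len(toks) - n + 1):
                ngram = " ".join(toks[i:i + n])
                if ngram not in known_phrases:
                    counts[ngram] += 1
    return [phrase for phrase, _ in counts.most_common(budget)]

# Toy pool of untranslated sentences; "the" is already covered.
pool = ["the new model", "the new data"]
selected = select_phrases(pool, known_phrases={"the"}, budget=2)
```

In this toy run both "new" and "the new" are selected, even though one largely subsumes the other, illustrating how purely frequency-driven selection wastes annotation effort on overlapping segments.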

Our experiments in both parts demonstrate that MT quality can benefit significantly from syntactic and contextual information when training data is limited.