On Language Representation for Low-Resource Neural Machine Translation

Ander Martínez


In recent years, machine translation systems have taken a qualitative leap, thanks to the introduction of systems based on neural networks. Neural Machine Translation (NMT) systems have not only yielded unprecedented results, in certain cases comparable to those of professional translators, but also come with the promise of being general solutions. However, not all language pairs are equally well served. A pair for which only a small number of parallel sentences is available is considered low-resource. The most popular benchmark datasets in research are large parallel corpora of related languages, such as French-English or English-German.

This thesis explores the problems of machine translation for low-resource pairs. We present the key elements of a low-resource machine translation system and the critical decisions that have a major impact on its effectiveness. We propose a novel approach that combines subword-level segmentation with character-level information, in the form of character n-gram features, to construct subword representations for standard encoder-decoder models. We use a custom algorithm to select a small number of effective binary character n-gram features. We show the benefits and characteristics of the proposed approach through extensive experimentation. The thesis ends with several concrete conclusions drawn from the experiments.
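
To make the idea concrete, the sketch below illustrates one way a subword could be mapped to a binary vector of character n-gram features. The boundary markers, the n-gram orders, and the tiny feature inventory are illustrative assumptions only; the thesis's own feature-selection algorithm is not reproduced here.

```python
# A minimal sketch of binary character n-gram features for a subword.
# The boundary-marker convention and the feature inventory below are
# assumptions for illustration, not the thesis's actual selection.

def char_ngrams(subword: str, n_values=(1, 2, 3)):
    """Enumerate character n-grams of a subword, with boundary markers."""
    s = f"<{subword}>"  # '<' and '>' mark the subword boundaries (assumed convention)
    for n in n_values:
        for i in range(len(s) - n + 1):
            yield s[i:i + n]

def binary_feature_vector(subword: str, selected_ngrams: list[str]) -> list[int]:
    """Map a subword to a binary vector over a pre-selected n-gram inventory."""
    present = set(char_ngrams(subword))
    return [1 if g in present else 0 for g in selected_ngrams]

# Hypothetical tiny inventory; in practice a selection algorithm would
# pick a small set of effective n-grams from the training data.
features = ["<t", "th", "he", "e>", "ing", "ed>"]
print(binary_feature_vector("the", features))  # [1, 1, 1, 1, 0, 0]
```

Under this view, the resulting binary vector could serve as the input from which a subword embedding is composed, rather than looking the subword up in an atomic embedding table.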