Extracting Bilingual Multi-word Terms from Comparable Corpora

Liang Jun (1351208)


Statistical Machine Translation (SMT) systems often make mistakes when translating multi-word terms (MWTs). Building a bilingual MWT lexicon is an important step toward improving translation quality at the sentence level. This thesis proposes a novel method for producing an English-Japanese (and Japanese-English) bilingual lexicon from comparable corpora. First, we extract candidate MWTs from the Japanese documents and the English documents separately, using a predefined linguistic filter. Next, we score the extracted MWTs with six statistical filters and prune those whose scores fall below a threshold. Finally, we use a topic model to analyze how the MWTs are distributed over topics, and extract a bilingual lexicon by treating each MWT as a single word. Manual evaluation of the top 100 MWTs shows that a precision of about $70\%$ can be achieved on two different comparable corpora.
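
The first two steps (linguistic filtering, then score-based pruning) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the part-of-speech pattern (a run of adjectives and nouns ending in a noun), the tag names, the function names, and the use of raw frequency as a stand-in for the six statistical filters, which are not specified in this abstract. The input is assumed to be already tokenized and tagged; the topic-model step is not shown.

```python
from collections import Counter

# Hypothetical linguistic filter: tokens tagged ADJ or NOUN may appear inside
# a candidate, and a candidate must end in a NOUN. This is an illustrative
# pattern, not the filter defined in the thesis.
MODIFIER_TAGS = {"ADJ", "NOUN"}


def extract_candidates(tagged_sentence):
    """Return every contiguous span of 2+ ADJ/NOUN tokens that ends in a NOUN.

    tagged_sentence is a list of (word, tag) pairs.
    """
    candidates = []
    n = len(tagged_sentence)
    for start in range(n):
        if tagged_sentence[start][1] not in MODIFIER_TAGS:
            continue
        for end in range(start + 1, n):
            tag = tagged_sentence[end][1]
            if tag not in MODIFIER_TAGS:
                break  # the run of adjectives/nouns has ended
            if tag == "NOUN":
                words = [word for word, _ in tagged_sentence[start:end + 1]]
                candidates.append(" ".join(words))
    return candidates


def score_and_prune(tagged_corpus, threshold):
    """Score candidates by raw frequency (a stand-in score) and keep those
    whose score is at least the threshold."""
    counts = Counter()
    for sentence in tagged_corpus:
        counts.update(extract_candidates(sentence))
    return {term: score for term, score in counts.items() if score >= threshold}


corpus = [
    [("statistical", "ADJ"), ("machine", "NOUN"),
     ("translation", "NOUN"), ("works", "VERB")],
    [("machine", "NOUN"), ("translation", "NOUN"),
     ("is", "VERB"), ("hard", "ADJ")],
]
kept = score_and_prune(corpus, threshold=2)
print(kept)
```

In this toy input only "machine translation" occurs twice, so it is the only candidate that survives a threshold of 2. The same structure would apply to the Japanese side given a Japanese tokenizer and tagger, with a pattern suited to that language.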