Improving Formal Document Translation Using Sublanguage-Specific Sentence Structure

Masaru Fuji (1561018)


Advances in reordering techniques based on syntactic parsing, together with the growing volume of available parallel corpora, have brought significant improvements in the performance of statistical machine translation (SMT) across distant language pairs. However, formal documents such as patents, legal texts, and operation manuals still pose difficulties for SMT because of their extreme sentence lengths and characteristic sentence structures. These formal documents are often regarded as forming sublanguages, since the writers of each document type have developed characteristic ways of writing that help readers understand the documents.

This paper describes methods for incorporating features specific to the target sublanguage in order to recognize sentence structure correctly and thus improve translation quality. Correct recognition of sentence structure is particularly important when translating long sentences between distant language pairs, since not only the word order but also the sentence structure differs between the two languages. I conducted two sets of experiments on translating formal documents between distant language pairs: one dealing with sentences of very high regularity, and one dealing with sentences of moderate regularity.

The first experiment focuses on translation of the patent claim sublanguage, in which sentences are extremely long but highly regular in writing style. Because the writing style of this sublanguage is highly consistent, I chose to handcraft rules for detecting sentence components and performed translation experiments using the detected components. The proposed method resulted in a substantial improvement in translation quality for distant language pairs such as English-to-Japanese and Chinese-to-Japanese.
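
As a rough illustration of what a handcrafted rule for detecting sentence components might look like, the following minimal Python sketch splits an English patent claim into a preamble, a transitional phrase, and a body. The component names, the list of transitional phrases, and the function split_claim are assumptions made for this example only; they are not the rules used in the experiments.

```python
import re

# Hypothetical list of transitional phrases, chosen for illustration only.
TRANSITIONS = ("comprising", "consisting of", "including")

# One handcrafted rule: <preamble> [,] <transitional phrase> [:] <body>
_CLAIM_RULE = re.compile(
    r"^(?P<preamble>.+?)\s*,?\s*\b(?P<transition>"
    + "|".join(TRANSITIONS)
    + r")\b\s*:?\s*(?P<body>.+)$",
    re.IGNORECASE | re.DOTALL,
)


def split_claim(claim):
    """Return the detected components as a dict, or None if the rule does not apply."""
    match = _CLAIM_RULE.match(claim.strip())
    if match is None:
        return None
    return {name: match.group(name).strip()
            for name in ("preamble", "transition", "body")}


if __name__ == "__main__":
    print(split_claim(
        "An apparatus comprising: a sensor; and a processor coupled to the sensor."
    ))
```

Once the components are identified in this way, each one could in principle be handled separately during translation, which is the general idea behind using detected sentence components; how the components are actually used is specific to the proposed method and is not shown here.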

The second experiment proposes a method for capturing the structure of sentences that have moderate regularity in writing style and occur more frequently than patent claim sentences. Since the writing style of these documents shows some variation, I chose to recognize sentence structures automatically. I propose and construct a method, which I call global reordering, that automatically reorders sentence components when translating from the source language to the target language. Incorporating global reordering along with conventional reordering yielded a substantial improvement in translation quality, especially for Japanese-to-English translation.