Words are the most prominent unit in language. Humans began to understand language by combining characters into words, the smallest units of language that convey meaning. From infancy, humans naturally learn words by combining sequences of speech sounds. It is therefore very important to model words correctly on computers. To date, several different representations of words have been used in computers.
For example, the earliest work in natural language processing treated words as atomic units. While simple and effective, this approach has several drawbacks. In particular, it creates a data sparsity problem, because words that should be correlated with each other are treated as entirely different tokens. A remedy is the continuous representation, which treats each word as a vector in a continuous space and thus lets a computer better capture the meaning of the word itself.
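The contrast between the two representations can be illustrated with a minimal sketch. The three-word vocabulary and the dense vectors below are hand-picked, hypothetical values chosen only for illustration; they are not taken from the experiments in this thesis, where such vectors would be learned from data.

```python
import math

# Hypothetical toy vocabulary (an assumption for illustration only).
vocab = ["cat", "kitten", "car"]

def one_hot(word):
    """Atomic representation: one dimension per word in the vocabulary."""
    return [1.0 if w == word else 0.0 for w in vocab]

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-picked dense vectors (assumed values, not learned embeddings).
embedding = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.75, 0.20],
    "car":    [0.10, 0.20, 0.90],
}

# Atomic units: any two distinct words are orthogonal, so related words
# share no information.
atomic_sim = cosine(one_hot("cat"), one_hot("kitten"))

# Continuous space: related words can lie close together.
dense_related = cosine(embedding["cat"], embedding["kitten"])
dense_unrelated = cosine(embedding["cat"], embedding["car"])

print(atomic_sim, dense_related, dense_unrelated)
```

Under the atomic representation, "cat" and "kitten" are exactly as dissimilar as "cat" and "car", whereas in the continuous space the related pair scores higher than the unrelated pair.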
Building on this, this thesis studies better lexical representations for computers performing translation. A poor representation of word units can slow learning or, worse, prevent the system from generalizing the patterns of a language. This failure to generalize can cause a neural machine translation (NMT) system to fail to acquire or produce certain lexical units.
We first study how to help an NMT system produce rare lexical units more accurately by directly adding probability priors to the system. Next, we study how to model these most basic units in a continuous space so that the system can model the language better. Finally, we apply an unsupervised method to extract word units from a stream of unsegmented input, mimicking how infants acquire language.
The experiments as a whole yield mixed results. We successfully improve the system's performance in generating lexical units by using a lexicon. We also achieve performance better than the state of the art by mixing character and word representations in a continuous space. In the third experiment, however, the results are only partially positive. While we succeed in acquiring useful knowledge of a language, such as segmentation, our experiments do not yet show that this improves translation performance. Further studies are needed to stabilize the proposed method.