Using Large-Scale Unlabeled Data for Parsing the Internal Structure of Chinese Synthetic Words

Fei Cheng (1151131)


In some Asian languages, such as Chinese and Japanese, there are no spaces between words to mark word boundaries, so morphological analysis is the first step of any NLP pipeline. Because of differing segmentation standards and out-of-vocabulary (OOV) words, the lack of information about the internal structure of words has become a crucial problem for morphological analysis systems. We formulate the problem as parsing a given unsegmented Chinese synthetic word into a tree structure.
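As a concrete illustration, the internal structure of a synthetic word can be encoded as a nested bracketing over its characters. The word and its bracketing below are our own hypothetical example, not drawn from the thesis data:

```python
def brackets(tree):
    """Render a nested-tuple word structure as a bracketed string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(brackets(t) for t in tree) + ")"

# 电影院 "movie theater": 电影 "movie" modifies the head 院 "hall",
# giving the left-branching structure ((电 影) 院).
word_tree = (("电", "影"), "院")
print(brackets(word_tree))  # prints ((电 影) 院)
```

The same nested-tuple encoding accommodates both the constituency view (the brackets themselves) and, with a marked head in each bracket, a dependency view.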

Beyond morphological analysis, a Chinese synthetic word parser can also benefit other applications. When native Chinese speakers use search engines, they often insert spaces between the words of a query manually. Correctly parsing unsegmented Chinese synthetic words is therefore helpful not only for query expansion and substitution, but also for ease of use.

In this thesis, we first introduce the concept of Chinese synthetic words and describe their categorization from two aspects: syntactic relation and morphological structure. Next, we extract Chinese synthetic words from Wikipedia titles and manually select and annotate them as training data. We then propose two supervised learning methods, dependency parsing and constituency parsing, for parsing the internal structure of Chinese synthetic words. We further use an unsupervised learning method to extract features from a large-scale unlabeled corpus and incorporate these features into our parsing algorithms. Finally, we report experiments on parsing Chinese synthetic words, analyze the results, and compare the performance of the two parsing methods.
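One common way unlabeled data can supply parsing features is through an association score between adjacent characters. The abstract does not name the exact features used, so the sketch below uses pointwise mutual information (PMI) over a toy corpus purely as an assumed, illustrative choice:

```python
# Hedged sketch: estimating pointwise mutual information (PMI) between
# adjacent characters from raw unlabeled text. High PMI suggests the
# two characters cohere inside a word; low PMI suggests a boundary.
# The corpus and feature choice are illustrative assumptions, not the
# thesis's actual data or feature set.
import math
from collections import Counter

def pmi_features(corpus):
    """Return PMI(a, b) for every adjacent character pair in `corpus`."""
    chars = Counter()
    pairs = Counter()
    for line in corpus:
        chars.update(line)
        pairs.update(zip(line, line[1:]))
    n_c = sum(chars.values())
    n_p = sum(pairs.values())
    return {
        (a, b): math.log((c / n_p) / ((chars[a] / n_c) * (chars[b] / n_c)))
        for (a, b), c in pairs.items()
    }

corpus = ["电影", "电影", "电影院", "庭院"]  # tiny toy word list
feats = pmi_features(corpus)
# On this toy data the cohesive pair 电·影 scores higher than the
# boundary-straddling pair 影·院 inside 电影院.
```

Scores like these can then be attached to candidate split points as real-valued features for the supervised parsers.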