Studies on improving two fundamental steps for Chinese natural language processing: word segmentation and spelling check

程飛(Fei Cheng) (1561010)


In Chinese, a sentence is written as a sequence of characters without any explicit indicators of word boundaries. Chinese word segmentation is therefore generally regarded as the fundamental first step in the Chinese Natural Language Processing (NLP) pipeline. Chinese spelling check, meanwhile, is an automatic mechanism for detecting and correcting human errors in unsegmented Chinese documents, and can be seen as a process that precedes word segmentation. Both tasks play such crucial roles that they directly influence the performance of all downstream Chinese NLP tasks. However, both still face open issues. In Chinese word segmentation, the coexistence of several segmentation standards prevents existing annotated data from being combined and extended, and out-of-vocabulary words are formed productively. We believe that both issues can be addressed by analyzing the internal structure of words. For this purpose, we manually construct a human-annotated synthetic word dictionary. We then propose a character-based word structure parser enhanced by various features extracted from a dictionary and from large-scale unlabeled data. The word segmentation performance of our system is significantly improved by a fine-grained conversion of the segmentation level based on the internal information of words. Furthermore, we propose a simple strategy for transforming two different Chinese word segmentation corpora into a new, consistent segmentation level, which makes it easy to extend the size of the training data. The extended training data is verified to be highly consistent by 10-fold cross-validation. With the help of the larger training data and the internal structure of words, our word segmentation system achieves state-of-the-art performance on the test data of the two corpora.

As for Chinese spelling check, spelling errors are hard to detect without word boundary information and must be considered in context. Our two-stage spelling check system first enlarges the candidate lists by gathering the corrections generated by several single systems. The second stage, a ranking component, readily incorporates context features in order to select the best correction from each candidate list accurately.
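
The two-stage design can be illustrated with a minimal sketch. It is not the system described in this thesis: the function names, the toy candidate generators, and the bigram-count context feature are all assumptions introduced for illustration. The first stage merges the corrections proposed by several single systems into one candidate list per character position, and the second stage ranks each list with a context score, keeping the original character when no candidate scores higher.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a correction is (character position, replacement character).
Correction = Tuple[int, str]
Generator = Callable[[str], List[Correction]]


def gather_candidates(sentence: str, generators: List[Generator]) -> Dict[int, List[str]]:
    """Stage 1: merge the corrections of several single systems, grouped by position."""
    candidates: Dict[int, List[str]] = {}
    for generate in generators:
        for pos, char in generate(sentence):
            bucket = candidates.setdefault(pos, [])
            if char not in bucket:  # avoid duplicates proposed by several systems
                bucket.append(char)
    return candidates


def context_score(sentence: str, pos: int, char: str, bigram_counts: Counter) -> int:
    """Toy context feature: counts of the bigrams formed with the left and right neighbours."""
    left = sentence[pos - 1] + char if pos > 0 else ""
    right = char + sentence[pos + 1] if pos + 1 < len(sentence) else ""
    return bigram_counts[left] + bigram_counts[right]  # a Counter returns 0 for unseen keys


def rank_and_correct(sentence: str, candidates: Dict[int, List[str]], bigram_counts: Counter) -> str:
    """Stage 2: pick the best-scoring character at each position that has candidates."""
    chars = list(sentence)
    for pos, options in candidates.items():
        # The original character comes first, so it is kept when scores are tied.
        pool = [sentence[pos]] + [c for c in options if c != sentence[pos]]
        chars[pos] = max(pool, key=lambda c: context_score(sentence, pos, c, bigram_counts))
    return "".join(chars)


# Toy data (assumed): two single systems that both flag position 1 of the sentence.
def system_a(sentence: str) -> List[Correction]:
    return [(1, "們")]


def system_b(sentence: str) -> List[Correction]:
    return [(1, "們"), (1, "悶")]


toy_bigrams = Counter({"我們": 50, "們去": 20, "我門": 1, "門去": 2})
toy_sentence = "我門去學校"
toy_candidates = gather_candidates(toy_sentence, [system_a, system_b])
toy_corrected = rank_and_correct(toy_sentence, toy_candidates, toy_bigrams)
```

In this toy run the merged candidate list for position 1 is larger than what the first system proposes alone, and the ranking step selects 們 because its neighbouring bigrams are far more frequent in the assumed counts. A real ranking component would replace the bigram counts with a trained model over richer context features.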