Studies on improving two fundamental steps for Chinese natural language processing: word segmentation and spelling check

Fei Cheng (1361023)


In Chinese, a sentence is written as a sequence of characters without any indicators of word boundaries, unlike the spaces used in English. Chinese word segmentation is therefore generally regarded as the fundamental first step in the Chinese Natural Language Processing (NLP) pipeline. Chinese spelling check, in turn, is an automatic mechanism for detecting and correcting human errors in unsegmented Chinese documents, and can be seen as a preprocessing step before word segmentation. Both tasks directly influence the performance of all downstream Chinese NLP tasks, yet both still face open issues. In Chinese word segmentation, differing segmentation standards prevent existing annotated data from being extended, and out-of-vocabulary words are formed productively. We believe that both issues can be addressed by analyzing the internal structure of words. For this purpose, we manually construct an annotated synthetic word dictionary. We then propose a character-based word structure parser, boosted by statistical and cluster features extracted from a dictionary and large-scale unlabeled data. Finally, our word segmenter is enhanced by a fine-grained conversion of the segmentation level based on the internal information of words. As for Chinese spelling check, a spelling error is hard to detect without word boundary information and must be considered within its context. Our two-stage spelling check system first enlarges the candidate lists by gathering the corrections generated by several single systems. The second stage, a ranking component, readily incorporates context features in order to select the best correction from each candidate list accurately.
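
To make the two-stage idea concrete, the following is a minimal sketch and not the system described in this thesis. It assumes that each single system can be modeled as a function returning corrected sentences, and it uses a character-bigram count as a stand-in for the context features. All names (`make_substitution_corrector`, `gather_candidates`, `rank_candidates`) and all toy data are hypothetical.

```python
from typing import Callable, Dict, List

# A "single system" is modeled as a function: sentence -> proposed corrections.
Corrector = Callable[[str], List[str]]


def make_substitution_corrector(confusions: Dict[str, str]) -> Corrector:
    """Build a toy corrector that replaces one confusable character at a time."""
    def correct(sentence: str) -> List[str]:
        proposals = []
        for i, ch in enumerate(sentence):
            if ch in confusions:
                proposals.append(sentence[:i] + confusions[ch] + sentence[i + 1:])
        return proposals
    return correct


def gather_candidates(sentence: str, correctors: List[Corrector]) -> List[str]:
    """Stage 1: enlarge the candidate list by merging all systems' proposals."""
    candidates: List[str] = []
    for corrector in correctors:
        for proposal in corrector(sentence):
            if proposal not in candidates:
                candidates.append(proposal)
    # Keep the original sentence so that "no correction" remains an option.
    if sentence not in candidates:
        candidates.append(sentence)
    return candidates


def rank_candidates(candidates: List[str], bigram_counts: Dict[str, int]) -> str:
    """Stage 2: pick the candidate with the highest context score.

    The score is the sum of character-bigram counts, an assumed
    placeholder for richer context features.
    """
    def score(s: str) -> int:
        return sum(bigram_counts.get(s[i:i + 2], 0) for i in range(len(s) - 1))
    return max(candidates, key=score)


# Toy data (hypothetical): two single systems and a tiny bigram table.
system_a = make_substitution_corrector({"门": "们"})
system_b = make_substitution_corrector({"门": "闷"})
toy_bigrams = {"我们": 5, "们去": 3, "去学": 2, "学校": 6}

candidates = gather_candidates("我门去学校", [system_a, system_b])
best = rank_candidates(candidates, toy_bigrams)
```

In this toy setting the merged list contains the proposals of both systems plus the original sentence, and the ranking step selects among them using only the surrounding characters. A real ranker would replace the bigram sum with a trained model over context features.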