Studies on improving two fundamental steps for
Chinese natural language processing: word
segmentation and spelling check
Fei Cheng (1361023)
In Chinese, a sentence is written as a sequence of characters without any
indicators of word boundaries, unlike English, which separates words with spaces.
Chinese word segmentation is therefore generally regarded as the fundamental step
in the Chinese Natural Language Processing (NLP) pipeline. Chinese spelling check,
in turn, is an automatic mechanism for detecting and correcting human errors in
unsegmented Chinese documents, and can be seen as a step that precedes word
segmentation. Word segmentation and spelling check are crucial because they directly
influence the performance of all downstream Chinese NLP tasks. However, both
tasks still face unresolved issues.
In Chinese word segmentation, differing segmentation standards prevent existing
annotated data from being combined and extended, and new out-of-vocabulary words
are formed productively. We believe that both issues can be addressed by analyzing
the internal structure of words. For this purpose, we manually construct an annotated
synthetic word dictionary. We then propose a character-based word structure parser
boosted by statistical and cluster features extracted from a dictionary and
large-scale unlabeled data. Finally, our word segmenter is enhanced by a fine-grained
conversion of the segmentation level based on the internal information of words.
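As a rough illustration of how word-internal structure can support conversion
between segmentation levels, the sketch below represents a word as a nested tuple
of characters and cuts it at a chosen depth. The tree encoding, the function names
flatten and split_at_depth, and the example word are assumptions made for this
illustration only; they are not the annotation scheme or the parser described
in this work.

```python
from typing import List, Tuple, Union

# Hypothetical encoding: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def flatten(tree: Tree) -> str:
    """Concatenate all characters under a node into one surface string."""
    if isinstance(tree, str):
        return tree
    return "".join(flatten(child) for child in tree)


def split_at_depth(tree: Tree, depth: int) -> List[str]:
    """Return the units obtained by cutting the word structure at the given depth.

    depth 0 keeps the word whole; larger depths give finer segmentations.
    """
    if isinstance(tree, str) or depth == 0:
        return [flatten(tree)]
    units: List[str] = []
    for child in tree:
        units.extend(split_at_depth(child, depth - 1))
    return units


# Assumed structure for the word 中国人: [[中 国] 人]
word: Tree = (("中", "国"), "人")

coarse = split_at_depth(word, 0)  # one unit, as in a coarse-grained standard
finer = split_at_depth(word, 1)   # two units, as in a finer-grained standard
```

With a structure of this kind, the same annotated word can be mapped to the
granularity that a given segmentation standard expects, which is the intuition
behind the conversion step.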
In Chinese spelling check, spelling errors are hard to detect without word
boundary information and must be considered in context. Our two-stage spelling
check system first enlarges the candidate lists by gathering corrections generated by
several single systems. The second stage, a ranking component, readily incorporates
context features to select the best correction from each candidate list accurately.
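The two-stage design can be sketched as follows. This is a minimal toy example,
not the system described in this work: the two candidate generators, the
character-bigram table used as a stand-in for context features, the vote_weight
parameter, and all function names are assumptions introduced for illustration.

```python
from typing import Callable, Dict, List

# A hypothetical single system maps a sentence to a list of corrected candidates.
System = Callable[[str], List[str]]


def gather_candidates(sentence: str, systems: List[System]) -> Dict[str, int]:
    """Stage 1: pool candidates from all systems and count how many propose each."""
    votes: Dict[str, int] = {sentence: 0}  # keep the input as a "no change" option
    for system in systems:
        for candidate in set(system(sentence)):
            votes[candidate] = votes.get(candidate, 0) + 1
    return votes


def context_score(candidate: str, bigram_counts: Dict[str, int]) -> int:
    """Toy context feature: sum of counts of the candidate's character bigrams."""
    return sum(
        bigram_counts.get(candidate[i:i + 2], 0)
        for i in range(len(candidate) - 1)
    )


def best_correction(
    sentence: str,
    systems: List[System],
    bigram_counts: Dict[str, int],
    vote_weight: float = 1.0,
) -> str:
    """Stage 2: rank the pooled candidates by context score plus weighted votes."""
    votes = gather_candidates(sentence, systems)
    return max(
        votes,
        key=lambda c: context_score(c, bigram_counts) + vote_weight * votes[c],
    )


# Toy single systems and toy counts (all values are made up for the example).
def system_a(sentence: str) -> List[str]:
    return ["我很高兴"]


def system_b(sentence: str) -> List[str]:
    return ["我很高新", "我很高兴"]


toy_bigrams = {"我很": 4, "很高": 3, "高兴": 5, "高新": 1}
result = best_correction("我很高心", [system_a, system_b], toy_bigrams)
```

Pooling the outputs of several systems raises the chance that the right correction
is in the list at all, while the ranking stage is where context evidence decides
among the pooled candidates.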