NAIST-IS-MT0151207: Goh Chooi Ling

Chinese Unknown Word Identification by Combining Statistical Models

Goh Chooi Ling (0151207)

Since written Chinese does not use blank spaces to indicate word boundaries, segmenting Chinese text is an essential task in Chinese language processing. In this task, unknown words are particularly problematic. It is impossible to register all words in a dictionary, as new words can always be created by combining characters. We propose a unified solution to unknown word detection in Chinese texts that applies regardless of the type of unknown word, such as compound words, abbreviations, person names and organization names. First, an n-best morphological analysis is conducted in order to obtain an initial segmentation and part-of-speech tags. Next, the segmentation output from the morphological analysis is converted into character-based features. Finally, unknown words are detected by chunking sequences of characters. A preliminary study extracted named entities such as person names and organization names; the same procedure was then applied to general unknown words. The approach achieved high accuracy for person names and organization names (F-measures of 87.69 and 70.41, respectively) but only moderate results for general unknown words (F-measure of 61.00).
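The second and third steps above can be illustrated with a minimal sketch. It is not the implementation used in this work: the position tag set (B, I, E, S), the part-of-speech labels, the B/I/O chunk labels, and the function names `words_to_char_features` and `extract_chunks` are all assumptions made for illustration. The statistical chunker itself is not implemented here; the chunk labels in the usage example are written by hand to stand in for its output.

```python
from typing import List, Tuple


def words_to_char_features(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into one (character, pos-position) pair per character.

    The position tag marks where the character sits inside its word:
    S = single-character word, B = first, I = middle, E = last.
    (Assumed tag set; the abstract does not specify one.)
    """
    features = []
    for word, pos in words:
        last = len(word) - 1
        for i, ch in enumerate(word):
            if last == 0:
                position = "S"
            elif i == 0:
                position = "B"
            elif i == last:
                position = "E"
            else:
                position = "I"
            features.append((ch, f"{pos}-{position}"))
    return features


def extract_chunks(chars: List[str], labels: List[str]) -> List[str]:
    """Collect character sequences labelled B, I, I, ... as candidate unknown words.

    Any label other than B or I (for example O) closes the current chunk.
    """
    chunks = []
    current: List[str] = []
    for ch, label in zip(chars, labels):
        if label == "B":
            if current:
                chunks.append("".join(current))
            current = [ch]
        elif label == "I" and current:
            current.append(ch)
        else:
            if current:
                chunks.append("".join(current))
            current = []
    if current:
        chunks.append("".join(current))
    return chunks


# Hypothetical analyzer output: an unregistered person name split into single characters.
segmented = [("江", "NN"), ("泽", "NN"), ("民", "NN"), ("主席", "NN"), ("访问", "VV")]
features = words_to_char_features(segmented)
chars = [ch for ch, _ in features]

# Hand-written labels standing in for the output of a trained chunker.
labels = ["B", "I", "I", "O", "O", "O", "O"]
unknown_words = extract_chunks(chars, labels)
```

In this sketch the dictionary-based analysis breaks the name into three single-character words, and the chunking step is what joins them back into one candidate unknown word. Working at the character level is what lets a single procedure cover different types of unknown words.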

We also suggest solutions for guessing the part-of-speech tags of the detected general unknown words.