Chinese Unknown Word Identification by Combining Statistical Models
Goh Chooi Ling (0151207)
Since written Chinese does not use blank spaces to indicate word
boundaries, segmenting Chinese texts becomes an essential task for
Chinese language processing. In this task, unknown words are
particularly problematic. It is impossible to register all words in a
dictionary as new words can always be created by combining
characters. We propose a unified solution to unknown word detection in
Chinese texts that applies regardless of the type of unknown word, such
as compound words, abbreviations, person names, and organization names.
First, an n-best morphological analysis is conducted in order to
obtain an initial segmentation and part-of-speech tags. Next, the
segmentation output from the morphological analysis is converted into
character-based features. Finally, unknown words are detected by
chunking sequences of characters. A preliminary study extracted named
entities such as person names and organization names; the same
procedure was then applied to general unknown words. The approach
achieved high accuracy for person names and organization names
(F-measures of 87.69 and 70.41, respectively) but moderate results for
general unknown words (61.00).
We also suggest solutions for guessing the part of speech of the
detected general unknown words.
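
The conversion from word-level segmentation to character-based features, and the recovery of unknown words from chunk labels, can be illustrated with a minimal sketch. It is not the implementation used in this work: the position tags (B, I, E, S), the chunk labels (B, I, O), the part-of-speech tags, and the function names words_to_char_features and chunks_to_words are all assumptions made for illustration, and the statistical chunker that would predict the chunk labels is not shown.

```python
from typing import List, Tuple


def words_to_char_features(words: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Turn (word, pos) pairs into one (character, "pos-position") pair per character.

    Position tags are an assumed scheme: S = single-character word,
    B = first character, I = middle character, E = last character.
    """
    features = []
    for word, pos in words:
        last = len(word) - 1
        for i, ch in enumerate(word):
            if last == 0:
                position = "S"
            elif i == 0:
                position = "B"
            elif i == last:
                position = "E"
            else:
                position = "I"
            features.append((ch, f"{pos}-{position}"))
    return features


def chunks_to_words(chars: List[str], chunk_tags: List[str]) -> List[str]:
    """Collect characters labelled B (begin) / I (inside) into unknown words.

    Any other label (here "O") closes the current chunk.
    """
    words, current = [], ""
    for ch, tag in zip(chars, chunk_tags):
        if tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "I" and current:
            current += ch
        else:
            if current:
                words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


# Hypothetical analyzer output: an unregistered name is over-segmented
# into single characters, while dictionary words stay intact.
segmented = [("吴", "nr"), ("小", "a"), ("明", "a"), ("来到", "v"), ("北京", "ns")]
features = words_to_char_features(segmented)
chars = [ch for ch, _ in features]

# Chunk labels written by hand here; in practice a trained chunker predicts them.
chunk_tags = ["B", "I", "I", "O", "O", "O", "O"]
unknown_words = chunks_to_words(chars, chunk_tags)
```

In this sketch each character carries both the part-of-speech tag of the word it came from and its position inside that word, so a character-level chunker can see where the initial segmentation placed boundaries and still join over-segmented characters into a single unknown word.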