Handling Tokenization Ambiguities in English Part-of-Speech Tagging
Alexander Shinn (0751204)
Part-of-speech tagging precision is traditionally measured
on the same error-free, regularly structured English text,
and degrades significantly when taggers are applied to
other kinds of text. To address this, we construct a new
corpus of English text with varying levels of informality,
and investigate several of the problems traditional
taggers have with such text, specifically spelling errors,
ambiguous tokenization, and multi-word expression
ambiguities. We then propose a new parsing algorithm that
can represent all of these ambiguities, and compare it
with traditional taggers on this corpus.