Handling Tokenization Ambiguities in English Part-of-Speech Tagging

Alexander Shinn (0751204)

Part-of-speech tagging accuracy is traditionally measured on the same kind of error-free, regularly structured English text, and degrades significantly when taggers are applied to other kinds of text. To improve this situation, we construct a new corpus of English text with varying levels of informality, and investigate several problems traditional taggers have with such text, specifically spelling errors, ambiguous tokenization, and multi-word expression ambiguities. We then propose a new parsing algorithm that can represent all of these ambiguities, and compare it with traditional taggers on the corpus.