Construction and Analysis of Multiword Expression-Aware Dependency Corpus (NAIST-IS-DD1661004): Akihiko Kato

Multiword expressions (MWEs) consist of multiple words with syntactic or semantic non-compositionality. In downstream tasks that exploit syntactic dependency information to understand the meaning of texts, MWE-aware dependency trees, in which each MWE is a syntactic unit, are preferable to word-based dependency structures. An English dependency corpus is often obtained by automatic conversion from a treebank of phrase structure trees. However, most existing English treebanks do not guarantee that an MWE-span corresponds to a phrase structure subtree, so it is not straightforward to obtain MWE-aware dependency trees from these treebanks. To deal with this problem, I formalize procedures that modify phrase structure trees to ensure that each MWE-span corresponds to a subtree, and I develop a dependency corpus that is aware of functional MWEs and either adjective MWEs or named entities in the OntoNotes corpus.
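The condition that an MWE-span corresponds to a phrase structure subtree can be illustrated with a minimal sketch. The nested-tuple tree encoding, the function names, the node label "MWE", and the toy bracketing of "according to reports" are all assumptions made for illustration; they are not the thesis's actual data format or conversion procedure.

```python
# Hypothetical tree encoding: (label, child, child, ...); a leaf is a plain token string.

def constituent_spans(tree, start=0):
    """Return (end, spans), where spans holds the (start, end) token span of every subtree."""
    if isinstance(tree, str):
        # A leaf covers exactly one token.
        return start + 1, {(start, start + 1)}
    spans = set()
    pos = start
    for child in tree[1:]:
        pos, child_spans = constituent_spans(child, pos)
        spans |= child_spans
    spans.add((start, pos))
    return pos, spans

def span_is_subtree(tree, mwe_span):
    """True if some subtree covers exactly the tokens in mwe_span (end-exclusive)."""
    _, spans = constituent_spans(tree)
    return mwe_span in spans

# Toy example: "according to" occupies tokens 0..2 but is split across two constituents.
tree_before = ("PP", ("VBG", "according"),
                     ("PP", ("TO", "to"), ("NP", ("NNS", "reports"))))

# After a hypothetical modification, the MWE is grouped under its own node.
tree_after = ("PP", ("MWE", ("VBG", "according"), ("TO", "to")),
                    ("NP", ("NNS", "reports")))

before_ok = span_is_subtree(tree_before, (0, 2))
after_ok = span_is_subtree(tree_after, (0, 2))
```

In this sketch the span check fails on the first tree and succeeds on the second, which is the property a tree modification procedure would need to establish before dependency conversion.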

In downstream tasks, it is important to recognize not only continuous MWEs but also verbal MWEs (VMWEs), which can be discontinuous. Therefore, I annotate VMWEs in OntoNotes through crowdsourcing.

Finally, I address the task of predicting both continuous MWE-aware dependency trees and VMWEs. I deal with these two sub-tasks simultaneously because dependency information is expected to be effective for VMWE recognition. I perform experiments with the continuous MWE-aware dependency corpus and the VMWE annotations in OntoNotes. The experiments confirm the effectiveness of a model based on hierarchical multi-task learning of the following three tasks: continuous MWE recognition, prediction of a word-based dependency tree that encodes MWE-spans, and VMWE recognition.
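One way a word-based dependency tree could encode MWE-spans is sketched below. It assumes a head-initial encoding in which every non-initial token of a continuous MWE attaches to the MWE's first token with a dedicated label; the label name "mwe", the function name, and the example sentence with its heads and labels are hypothetical, and the encoding actually used in the thesis may differ.

```python
from collections import defaultdict

MWE_LABEL = "mwe"  # hypothetical label marking MWE-internal arcs

def decode_mwe_spans(heads, labels):
    """Recover continuous MWE spans (end-exclusive) from a word-based tree.

    heads[i] is the 0-based index of the head of token i (-1 for the root).
    Assumes every MWE-internal token attaches to the MWE's first token.
    """
    groups = defaultdict(list)
    for i, (head, label) in enumerate(zip(heads, labels)):
        if label == MWE_LABEL:
            groups[head].append(i)
    spans = []
    for first, rest in groups.items():
        members = sorted([first] + rest)
        spans.append((members[0], members[-1] + 1))
    return sorted(spans)

# Toy sentence with illustrative (not gold) heads and labels.
tokens = ["according", "to", "reports", "prices", "rose"]
heads = [4, 0, 0, 4, -1]
labels = ["prep", "mwe", "pobj", "nsubj", "root"]

mwe_spans = decode_mwe_spans(heads, labels)
mwe_strings = [" ".join(tokens[s:e]) for s, e in mwe_spans]
```

Under this assumed encoding, a parser that predicts an ordinary word-based tree also yields the MWE-spans, since they can be read off the labeled arcs and then collapsed into single nodes to obtain an MWE-aware tree.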