Unsupervised Seed Selection and Stop List Construction for Bootstrapping: a Graph-based Approach

Tetsuo Kiso (0951058)

Bootstrapping (self-training) is a semi-supervised technique popular in natural language processing. Given a small amount of labeled data called a seed set, bootstrapping first trains a model on the seed set and then iteratively uses the previously trained model to annotate unlabeled data, which in turn is used to retrain the model in the next iteration. Bootstrapping has been applied to word sense disambiguation (WSD), information extraction, and statistical parsing.
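The loop described above can be sketched as follows. This is only a minimal illustration, not the algorithm studied in this thesis: the nearest-centroid model, the margin-based confidence, and the names `train_centroids`, `predict`, `bootstrap`, `rounds`, and `per_round` are all assumptions introduced for the example.

```python
from collections import defaultdict


def train_centroids(labeled):
    """Toy model (assumed for illustration): one mean vector per label."""
    sums, counts = {}, defaultdict(int)
    for x, y in labeled:
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(model, x):
    """Return (label, margin), where margin is the gap in squared
    distance between the nearest and second-nearest centroid."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(c, x)), y) for y, c in model.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin


def bootstrap(seeds, unlabeled, rounds=5, per_round=2):
    """Self-training: train on the labeled set, label the most confident
    unlabeled items, add them to the labeled set, and repeat."""
    labeled = list(seeds)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        model = train_centroids(labeled)
        # Rank the unlabeled pool by the current model's confidence.
        ranked = sorted(pool, key=lambda x: predict(model, x)[1], reverse=True)
        chosen, pool = ranked[:per_round], ranked[per_round:]
        # The model's own predictions become training labels for the next round.
        labeled.extend((x, predict(model, x)[0]) for x in chosen)
    return labeled
```

Because the model's own predictions are fed back as training data, an early mislabeled item shifts the model and can pull in further unrelated items in later rounds.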

Bootstrapping algorithms, however, suffer from a problem known as semantic drift: as the iterations proceed, the algorithms tend to select instances unrelated to the seed instances. Previous work has tried to reduce semantic drift by providing the algorithm with a stop list of instances that are likely to trigger drift and excluding them from subsequent training. This list is usually created by human experts. Drift can also be reduced by carefully selecting seeds, but selecting good seeds again requires expert knowledge.

In this thesis, we present unsupervised graph-based methods to alleviate semantic drift in the state-of-the-art bootstrapping algorithm Espresso. Our idea is built around the concept of hubs in Kleinberg's HITS algorithm for evaluating the importance of web pages. We model the instance extraction process of the Espresso algorithm as a graph, and both select seeds and create a stop list based on the HITS ranking of instances in that graph. We demonstrate the efficacy of our approach on WSD tasks. Experimental results show that the proposed methods find seed sets that perform better than randomly chosen seed sets, and that the stop list created by our method reduces semantic drift and improves the accuracy of harvested instances.
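As a rough illustration of ranking instances by HITS, the sketch below runs the standard HITS power iteration on a small bipartite graph. It assumes the graph is given as an instance-by-pattern co-occurrence matrix; that representation, the toy data, and the names `hits_instance_scores` and `rank_instances` are hypothetical, and the graph construction actually used in this thesis may differ. How the resulting ranking is turned into a seed set or a stop list is not shown here.

```python
import math


def hits_instance_scores(cooc, iterations=50):
    """HITS-style power iteration on a bipartite instance/pattern graph.

    cooc[i][p] is the (assumed) weight of the edge between instance i
    and pattern p. Returns one L2-normalized score per instance.
    """
    n_inst, n_pat = len(cooc), len(cooc[0])
    inst = [1.0] * n_inst
    for _ in range(iterations):
        # Each pattern's score is the weighted sum of its instances' scores.
        pat = [sum(cooc[i][p] * inst[i] for i in range(n_inst)) for p in range(n_pat)]
        # Each instance's score is the weighted sum of its patterns' scores.
        inst = [sum(cooc[i][p] * pat[p] for p in range(n_pat)) for i in range(n_inst)]
        # Normalize so the scores do not grow without bound.
        norm = math.sqrt(sum(s * s for s in inst)) or 1.0
        inst = [s / norm for s in inst]
    return inst


def rank_instances(names, cooc):
    """Return (name, score) pairs sorted from highest to lowest score."""
    scores = hits_instance_scores(cooc)
    return sorted(zip(names, scores), key=lambda t: t[1], reverse=True)
```

In a toy graph where one instance co-occurs with every pattern and the others with only one pattern each, the widely connected instance receives the highest score; such hub-like instances are the ones the ranking singles out.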