Abstract: |
Kon ya no shiro bakama, a famous Japanese proverb, is commonly rendered in
English as "The shoemaker's children go barefoot." Companies such as Yahoo
and Google strive to understand user information requests, yet mostly they
still walk barefoot. In this talk, I will discuss the challenges industry
faces in transferring the semantics technology being developed in the
NLP community, specifically focusing on harvesting entities from the Web. We
present Ensemble Semantics (ES), a general framework for modeling
information extraction algorithms that combine multiple sources of
information and extractors. We show large gains in entity extraction by
combining state-of-the-art distributional and pattern-based extractors with
a large set of features from a 600-million-document webcrawl, one year of
query logs, and a snapshot of Wikipedia. We explore the hypothesis that
although distributional and pattern-based algorithms are complementary, they
do not exhaust the semantic space; other sources of evidence can be
leveraged to better combine them. A detailed analysis of feature
correlations and interactions shows that query log and webcrawl features
yield the highest gains, but easily accessible Wikipedia features also
improve over current state-of-the-art systems. We then take a closer look at
Yahoo!'s distributional set expansion extractor and study the impact of
editor-chosen seeds on extraction performance. We show that, in general, few
seeds are needed to saturate a distributional model, and that expansion is
very sensitive to seed composition, resulting in tremendous variance in
expansion performance. We study this sensitivity further, showing that
untrained editors are terrible at choosing the right seeds, and we propose
algorithms for helping editors choose better seeds.
|
About the speaker: |
Patrick Pantel is a Senior Scientist at Yahoo! Labs and a Research Assistant
Professor in the Natural Language Group at the USC Information Sciences
Institute, where he conducts research in large-scale natural language
processing, text mining, and knowledge acquisition. In 2003, he received a
Ph.D. in Computing Science from the University of Alberta in Edmonton,
Canada.
|