Seminar I Lecture

Date: Tuesday, January 19, 2010 (Heisei 22), 4th period (15:10 -- 16:40)
Venue: L2

Speaker: Patrick Pantel (Information Sciences Institute, University of Southern California/Yahoo! Labs)
Title: Industrial Semantics
Abstract: Kon ya no shiro bakama, a famous Japanese proverb, commonly translates in English as "The shoemaker's children go barefoot." Companies such as Yahoo and Google strive to understand user information requests, yet mostly they still walk barefoot. In this talk, I will discuss challenges we face in industry for transferring the semantics technology being developed in the NLP community, specifically focusing on harvesting entities from the Web. We present Ensemble Semantics (ES), a general framework for modeling information extraction algorithms that combine multiple sources of information and extractors. We show large gains in entity extraction by combining state-of-the-art distributional and pattern-based extractors with a large set of features from a 600 million document webcrawl, one year of query logs, and a snapshot of Wikipedia. We explore the hypothesis that although distributional and pattern-based algorithms are complementary, they do not exhaust the semantic space; other sources of evidence can be leveraged to better combine them. A detailed analysis of feature correlations and interactions shows that query log and webcrawl features yield the highest gains, but easily accessible Wikipedia features also improve over current state-of-the-art systems. We further take a deep dive into Yahoo!'s distributional set expansion extractor and study the impact of editor-chosen seeds on extraction performance. We show that in general few seeds are needed to saturate a distributional model and that seed compositionality is very sensitive, resulting in tremendous variance in expansion performance. We further study the latter and show that untrained editors are terrible at choosing the right seeds, and we propose algorithms for helping editors choose better seeds.
Speaker bio: Patrick Pantel is a Senior Scientist at Yahoo! Labs and a Research Assistant Professor in the Natural Language Group at the USC Information Sciences Institute, where he conducts research in large-scale natural language processing, text mining, and knowledge acquisition. In 2003, he received a Ph.D. in Computing Science from the University of Alberta in Edmonton, Canada.
Faculty host and chair: Yuji Matsumoto (松本 裕治)
