The Web is a gold mine of data with diverse categories and
characteristics, including data that was hardly available in the
past, such as large amounts of text in different languages and data
from social media. Such data helps accelerate research progress and,
at the same time, brings new challenges. In this talk, I
introduce our recent research efforts exploiting such novel data on
the Web. First, we exploit Twitter data for classifying spiking
queries into their topical categories. Spiking queries show sudden
spikes in search engine query volume, which reflect users' intense
interest in them. Therefore, accurate classification of spiking
queries is important for search engines. Next, I introduce our effort
to extract
Japanese-English parallel sentence pairs from the Web. We took three
approaches to mining such data and carefully developed a data-cleaning
framework to extract only the high-quality portion. Then I briefly
introduce our approach to Japanese-English statistical machine
translation, specifically a reordering method that bridges the gap in
word order between Japanese and English, which is important for
improving translation quality.