Latent Variable Models for Bag-of-Words Data Based on Kernel Embeddings of Distributions

Yuya Yoshikawa (1361013)


In machine learning and related fields such as natural language processing, kernel methods are widely studied as a means of non-linear prediction. In this talk, we consider the case where input data are represented as multi-sets of features, i.e., bag-of-words (BoW) data. Many papers have reported that kernel methods are superior to linear models in terms of prediction accuracy. However, inner-product-based kernel functions such as the Gaussian and polynomial kernels have a drawback: they cannot reflect correlations between related features in the kernel calculation, because each feature is treated as an independent dimension.
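The drawback can be seen in a minimal sketch. Below, a toy three-word vocabulary (the words and counts are hypothetical, chosen only for illustration) shows that a Gaussian kernel on raw BoW vectors assigns a document using "car" the same similarity to a document using the synonym "automobile" as to an unrelated document, since the two synonyms occupy orthogonal dimensions.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel computed directly on BoW count vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# Hypothetical 3-word vocabulary: ["car", "automobile", "speed"]
doc_a = np.array([2.0, 0.0, 1.0])  # uses "car"
doc_b = np.array([0.0, 2.0, 1.0])  # uses the synonym "automobile"
doc_c = np.array([0.0, 0.0, 3.0])  # unrelated content

# "car" and "automobile" lie on orthogonal axes, so the kernel cannot
# see that doc_a and doc_b are semantically close: both similarities
# below come out identical (exp(-8) with gamma=1).
print(gaussian_kernel(doc_a, doc_b))
print(gaussian_kernel(doc_a, doc_c))
```

Any kernel that depends on the BoW vectors only through their inner products inherits this blindness to feature correlation.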

In this talk, to overcome this weakness, we propose a general framework of kernel methods for BoW data, which consists of two parts: (1) defining a class of kernel functions with latent variables for BoW data, which we call latent distribution kernels (LDKs), and (2) developing models with LDKs and their optimization methods. To demonstrate the effectiveness of the framework, we apply it to three machine learning problems: classification, regression, and cross-domain matching. We show that the methods derived from the framework outperform existing linear and non-linear methods in terms of prediction accuracy.
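As a rough illustration of the idea behind kernel embeddings of distributions, the sketch below treats each document as an empirical distribution over latent word vectors and compares documents through the inner product of their kernel mean embeddings. The latent vectors here are fixed by hand purely for illustration; in the proposed framework they would be latent variables learned from data, and this sketch is not the thesis's actual formulation.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian kernel between two latent word vectors."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def mean_embedding_kernel(doc_a, doc_b, latent, gamma=1.0):
    """
    Inner product of kernel mean embeddings of two documents.
    doc_a, doc_b: lists of word indices (with multiplicity, as in BoW).
    latent: (vocab_size, dim) array of latent word vectors -- assumed
            given here; learned jointly with the model in the framework.
    """
    total = 0.0
    for i in doc_a:
        for j in doc_b:
            total += rbf(latent[i], latent[j], gamma)
    return total / (len(doc_a) * len(doc_b))

# Hypothetical latent vectors: "car" and "automobile" are placed close
# together, so documents using either word become similar under the
# kernel even though their BoW vectors share almost no dimensions.
latent = np.array([[1.0, 0.0],    # car
                   [0.9, 0.1],    # automobile
                   [-1.0, 0.5]])  # speed
print(mean_embedding_kernel([0, 2], [1, 2], latent))
```

Because related words can share nearby latent vectors, correlations between features enter the kernel value directly, which is what the inner-product kernels above fail to capture.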