Distant Representation for Natural Language Processing

Takuo Hamaguchi (NAIST-IS-DD1461008)


To understand and manipulate natural language computationally, we need to study word representations themselves, which lay the foundations of natural language processing. Among word representations, the recent development of continuous representations is a great step forward because it allows us to capture semantics effectively. Continuous representations share the following procedure: map each word to a continuous-valued vector called an embedding vector, and then treat the embedding vector as the word. For example, skip-gram trains embedding vectors to predict surrounding contexts, inspired by the distributional hypothesis, often stated as "a word is characterized by the company it keeps". Among such word representations, we use the term direct representation for those whose vectors aim to represent the corresponding words directly.
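The core mechanism of direct representation can be sketched as a simple lookup table; the vectors below are hypothetical toy values rather than trained embeddings:

```python
# Toy sketch of direct representation: each vocabulary word is mapped to
# exactly one embedding vector, and that vector stands in for the word in
# all downstream computation. (Values are illustrative, not trained.)
embeddings = {
    "king":  [0.8, 0.1, 0.3],
    "queen": [0.7, 0.2, 0.4],
    "bank":  [0.5, 0.5, 0.0],  # one vector even though "bank" is polysemous
}

def represent(word):
    """Direct representation: look the word up in the embedding table."""
    return embeddings[word]

print(represent("king"))  # the word is identified with its single vector
```

In a trained model such as skip-gram, the table values would be learned by optimizing the prediction of surrounding context words, but the lookup step itself is the same.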

Direct representation goes against the spirit of the distributional hypothesis because it adds the assumption that "a word is characterized by its embedding vector". Owing to this violation, direct representations face various problems as word representations. For example, the out-of-vocabulary (OOV) problem occurs when we cannot represent a word because it lacks a corresponding embedding vector. Previous work trains multiple embedding vectors per word to represent polysemous senses individually. However, this approach, which models polysemy explicitly, runs into a hard problem in linguistics, namely what the meanings of words are, including determining the number of senses per word. These problems show the limitations of direct representation.
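The OOV problem follows directly from the lookup-table design; a minimal illustration with a hypothetical two-word vocabulary:

```python
# Toy embedding table (hypothetical, not trained) illustrating the OOV
# problem of direct representation: a word without a stored vector simply
# cannot be represented at all.
embeddings = {"cat": [0.9, 0.1], "dog": [0.8, 0.2]}

def represent(word):
    """Fails for any word outside the training vocabulary."""
    return embeddings[word]

try:
    represent("axolotl")  # never assigned an embedding vector
except KeyError:
    print("OOV: no embedding vector for 'axolotl'")
```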

To overcome these problems, we propose a novel word representation called distant representation, which applies weak supervision to the distributional hypothesis. Weak supervision (also called distant supervision) is an annotation scheme that associates a target object with weakly related objects. When the related objects are contexts, distant representation collects the contexts containing a target word and then represents the target word by a vector set generated from those contexts. We verify the effectiveness of distant representation in two settings, without texts and with texts, on knowledge base completion and entity classification respectively.
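The context-collection step of distant representation can be sketched as follows; the corpus and window size are hypothetical, and in the full method each collected context would be encoded into a vector so that the resulting vector set, not a single vector, represents the target word:

```python
# Minimal sketch of distant representation's first step: gather the
# surrounding words for every occurrence of a target word. The vector set
# would then be generated from these contexts. (Corpus is illustrative.)
corpus = [
    "the river bank was muddy".split(),
    "she went to the bank for a loan".split(),
]

def context_windows(target, sentences, window=2):
    """Collect a context (left and right neighbors) per occurrence of target."""
    contexts = []
    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                left = sent[max(0, i - window):i]
                right = sent[i + 1:i + 1 + window]
                contexts.append(left + right)
    return contexts

print(context_windows("bank", corpus))
```

Because the representation is built from whatever contexts the word occurs in, a previously unseen word can still be represented once contexts for it are collected, and polysemous uses are kept apart naturally as distinct contexts rather than modeled by a fixed number of sense vectors.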