With the rapid progress of the Internet and social media, it has become possible to collect and analyze large amounts of unstructured text data. As the Internet of Things (IoT) advances, every human activity will also be accumulated as unstructured text: for example, records of watched TV programs, browsed electronic books, and viewed tweets. Neural networks are expected to extract value from such data, which is not big but small within each domain. Two problems must be solved, however: words are sparse within each domain, and the features that neural networks extract automatically are hard to interpret.
In 2013, word2vec, a shallow neural network that learns the meaning of each word as a vector of about 200 dimensions, showed that the inner product of and the difference between learned word vectors can represent word analogies and the direction of word meaning. In 2014, the paragraph vector, which extends word vectors to documents, was proposed and achieved state-of-the-art results at the time of publication on various natural language processing tasks. Such word and document vectors are called distributed representations, word embeddings, or document embeddings. The problem with distributed representation learning is that, because all dimensions are initialized randomly, it is difficult to understand which dimension corresponds to which meaning.
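As a concrete illustration of this analogy property, the sketch below performs the classic vector arithmetic over toy three-dimensional vectors; the vectors and vocabulary are made up for the example, whereas real word2vec embeddings are learned from a corpus.

```python
import numpy as np

# Toy illustration of word-vector arithmetic (these vectors are made up;
# real word2vec embeddings have ~200 learned dimensions).
vectors = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.1, 0.2, 0.9]),
    "queen": np.array([0.0, 0.7, 1.0]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Analogy: king - man + woman should land near queen.
query = vectors["king"] - vectors["man"] + vectors["woman"]
best = max(
    (w for w in vectors if w not in ("king", "man", "woman")),
    key=lambda w: cosine(query, vectors[w]),
)
print(best)  # queen
```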
In contrast to the distributed representations obtained by neural networks, we introduce a word semantic vector dictionary, constructed by human experts, that uses context information as features. First, we propose a method that encodes the knowledge in encyclopedia text using the word semantic vector dictionary. We also realize associative image retrieval, which addresses the problem of relying solely on the keywords attached to images.
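To make the contrast with randomly initialized embeddings concrete, the following sketch shows one plausible shape for such a dictionary: each word maps to a sparse vector whose dimensions are human-interpretable context features. The feature names, entries, and values here are hypothetical illustrations, not the actual dictionary contents.

```python
# Hypothetical word semantic vector dictionary: every dimension has a
# known, human-assigned meaning, unlike a randomly initialized embedding.
FEATURES = ["food", "drink", "sports", "politics"]

dictionary = {
    "sushi":    {"food": 1.0},
    "coffee":   {"drink": 1.0, "food": 0.3},
    "baseball": {"sports": 1.0},
}

def to_vector(word):
    """Return the word's semantic vector in the fixed feature order."""
    entry = dictionary.get(word, {})
    return [entry.get(f, 0.0) for f in FEATURES]

print(to_vector("coffee"))  # [0.3, 1.0, 0.0, 0.0]
```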
Second, we propose a method that integrates feature word expansion based on our dictionary with shallow neural networks to address word sparsity on Twitter, and we apply it to extracting reputation information from tweets. The integration showed that sentiment analysis accuracy improves by learning context information even when words are sparse.
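A minimal sketch of the expansion idea follows, under the assumption that the dictionary can supply related context words for each token; the RELATED table and the expand() helper are hypothetical and stand in for the paper's actual expansion procedure.

```python
# Hypothetical mapping from a word to related context words.
RELATED = {
    "ramen": ["food", "noodle", "restaurant"],
    "tasty": ["delicious", "positive"],
}

def expand(tokens, max_extra=5):
    """Append dictionary-related words so a short, sparse tweet
    still activates meaningful input features."""
    expanded = list(tokens)
    for t in tokens:
        expanded.extend(RELATED.get(t, [])[:max_extra])
    return expanded

print(expand(["ramen", "tasty"]))
# ['ramen', 'tasty', 'food', 'noodle', 'restaurant', 'delicious', 'positive']
```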
Finally, we propose a method that gives each hidden node of a shallow neural network a specific meaning by introducing our dictionary into the initial weights. We assessed readability in a user test: 52.4% of the five most heavily weighted hidden nodes were judged to be related to the input tweets. To examine the extensibility of the method, we constructed a diverse sentiment analysis benchmark and improved the word semantic vector dictionary for use as distributed representations. We also conducted a word similarity task on a Wikipedia corpus to test the domain independence of the method. Both the objective and the subjective evaluations support the conclusion that each hidden node maintains a specific meaning. Thus, our method succeeds in improving readability.
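To make the initialization idea concrete, the following sketch seeds each hidden node's incoming weights with an interpretable concept vector, so the most active node can be read off by name; the vocabulary, concept vectors, and single-layer setup are illustrative assumptions, not the exact architecture or dictionary used here.

```python
import numpy as np

# Sketch: row i of the input-to-hidden weight matrix is initialized from
# the semantic vector of dictionary concept i, so hidden node i starts out
# corresponding to that concept rather than to random noise.
vocab = ["ramen", "tasty", "election", "vote"]   # illustrative vocabulary
concepts = {
    "food":     np.array([1.0, 0.8, 0.0, 0.0]),  # weights over vocab
    "politics": np.array([0.0, 0.0, 1.0, 0.9]),
}

W = np.zeros((len(concepts), len(vocab)))
hidden_labels = []
for i, (name, vec) in enumerate(concepts.items()):
    W[i] = vec                   # hidden node i <- concept vector
    hidden_labels.append(name)   # keep the label for later interpretation

x = np.array([1.0, 1.0, 0.0, 0.0])   # bag-of-words for "ramen tasty"
h = np.maximum(W @ x, 0.0)           # ReLU hidden activations
print(hidden_labels[int(np.argmax(h))])  # food -> the node's meaning is readable
```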