Recurrent Neural Networks for Natural Language and Biological Sequences

Masashi Tsubaki (1461006)


Machine learning has recently become one of the most important techniques across research areas such as computer vision, natural language processing (NLP), and bioinformatics. My goal in this thesis is to develop machine learning methods that (i) capture the properties of data, in particular sequential data such as natural language and proteins, and (ii) solve higher-level problems in NLP and bioinformatics.

Deep neural networks have recently achieved excellent performance on difficult problems such as speech recognition and machine translation. While many deep architectures have been proposed for a wide range of problems, in this thesis I use recurrent neural networks (RNNs). An RNN is well suited to problems whose inputs are sequences of arbitrary length, such as natural language sentences and proteins.
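To make this concrete, the following is a minimal Python/NumPy sketch of a plain (Elman) RNN reading a variable-length sequence; the function name rnn_forward, the weight shapes, and the toy dimensions are illustrative choices, not the implementation used in this thesis.

    import numpy as np

    def rnn_forward(x_seq, W_xh, W_hh, b_h):
        # Run a plain (Elman) RNN over a sequence of input vectors.
        # x_seq may have any length; one hidden-state update per element.
        h = np.zeros(W_hh.shape[0])              # initial hidden state
        for x in x_seq:                          # works for any sequence length
            h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        return h                                 # summary of the whole sequence

    # Toy usage: a 5-step sequence of 10-dimensional inputs, 8 hidden units.
    rng = np.random.default_rng(0)
    W_xh, W_hh, b_h = rng.normal(size=(8, 10)), rng.normal(size=(8, 8)), np.zeros(8)
    h_last = rnn_forward([rng.normal(size=10) for _ in range(5)], W_xh, W_hh, b_h)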

In this thesis, I focus on problems in NLP and bioinformatics, in particular semantic composition and protein structure prediction, and I solve them with RNN-based architectures tailored to each problem. I first use RNNs with long short-term memory (LSTM) units, which can capture long-term dependencies in a sequence and store that information over long spans. I show that LSTMs provide effective and general sequential representations for both natural language and proteins. Then, on top of the LSTM, I combine other machine learning techniques, such as kernel methods, convolutional neural networks, and attention mechanisms, to capture the specific properties of natural language and proteins. In my experiments, I demonstrate the quantitative and qualitative effectiveness of my methods compared with various existing methods.
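As a rough illustration of one such combination, the sketch below places a learned attention mechanism on top of an LSTM encoder in PyTorch; the class name LSTMWithAttention and all dimensions are hypothetical and do not correspond to the exact models developed in the thesis.

    import torch
    import torch.nn as nn

    class LSTMWithAttention(nn.Module):
        # An LSTM encoder whose per-step hidden states are pooled
        # by a learned attention mechanism into one sequence vector.
        def __init__(self, input_dim, hidden_dim):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
            self.attn = nn.Linear(hidden_dim, 1)           # scores each time step

        def forward(self, x):                              # x: (batch, len, input_dim)
            h, _ = self.lstm(x)                            # h: (batch, len, hidden_dim)
            weights = torch.softmax(self.attn(h), dim=1)   # normalize over time steps
            return (weights * h).sum(dim=1)                # attention-weighted summary

    # Toy usage: a batch of 2 sequences of length 7 with 10-dimensional inputs.
    model = LSTMWithAttention(input_dim=10, hidden_dim=16)
    summary = model(torch.randn(2, 7, 10))                 # shape: (2, 16)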