Beyond the Real: Alternative Methods for Discovering Translation Enhancer via Machine Learning

Hiroaki Tanaka (1651069)


Protein production in plants is an active research topic because plants offer several advantages for drug discovery over other hosts such as bacteria, yeasts, and animals. However, the amount of protein expressed in plants is lower than in these other hosts. Many researchers have tackled this problem, and some have succeeded in increasing protein production by replacing the original 5'UTR with an artificial one; the 5'UTR is a region of mRNA (messenger RNA) that affects translation from mRNA to protein. Although these sequences are important, discovering translation enhancers, that is, 5'UTR sequences that increase the amount of translated protein, is difficult because the necessary experiments require significant cost, time, and effort. To solve this problem, we aim to predict the amount of protein translated from a given mRNA without performing experiments. In this paper, we propose a method called R-STEINER (generate nucleotide sequence Randomly and Select a TrEmendous 5'-untranslated region which INcrEase the amount of tRanslated proteins of a certain gene). R-STEINER builds a model that predicts the amount of translated protein from an mRNA sequence (the first such prediction model in the world), generates 5'UTR sequences, and uses the prediction model to discover the 5'UTRs that will increase the amount of translated protein. With this method, translation enhancers can be obtained without real synthesis experiments, which reduces the cost, time, and effort of experimentation. In our study, we built the prediction model using Oryza sativa (rice) and synthesized the 5'UTRs generated by R-STEINER. The study confirmed that the model can predict the amount of translated protein with a correlation coefficient of 0.89.
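
The generate-and-select loop described above can be illustrated with a minimal sketch. It is not the implementation used in this work: `toy_predictor` is a hypothetical placeholder for the trained prediction model, and the function names, sequence length, and candidate counts are assumptions chosen only for illustration.

```python
import random

NUCLEOTIDES = "ACGU"  # mRNA alphabet


def random_utr(length, rng):
    """Generate one random nucleotide sequence of the given length."""
    return "".join(rng.choice(NUCLEOTIDES) for _ in range(length))


def toy_predictor(seq):
    """Hypothetical stand-in for the trained model (NOT the real predictor).

    It simply scores a sequence by its fraction of 'A' so the sketch can run.
    """
    return seq.count("A") / len(seq)


def generate_and_select(predict, n_candidates, length, top_k, seed=0):
    """Generate random candidates, score each one, and keep the top_k."""
    rng = random.Random(seed)
    # dict.fromkeys removes duplicates while keeping generation order.
    candidates = list(
        dict.fromkeys(random_utr(length, rng) for _ in range(n_candidates))
    )
    ranked = sorted(candidates, key=predict, reverse=True)
    return ranked[:top_k]


best = generate_and_select(toy_predictor, n_candidates=200, length=20, top_k=5)
for seq in best:
    print(seq, round(toy_predictor(seq), 2))
```

In practice the placeholder would be replaced by the trained regression model, and only the top-ranked candidates would go on to synthesis.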

Some companies own patents on 5'UTR sequences and need a technique for creating new 5'UTR sequences, based on their patented sequences, that increase the amount of translated protein. In this paper, we address this need. Doing so requires a technique for obtaining input vectors that satisfy a condition on the objective variable, namely that it increases or decreases. We propose a method named TRANS-AM (TRANSforming-feAture Method), which exploits a property of regression trees to discover an input vector satisfying such a condition in regression problems that use random forest regression. A regression tree splits the input space into subspaces, and some of these subspaces correspond to values of the objective variable that satisfy the condition, that is, values higher than a given threshold. By transforming the original input vector into new input vectors that each belong to one of these subspaces, we can discover the new input vector that satisfies the condition on the objective variable at minimum distance from the original input vector. We evaluated the proposed method through numerical simulations and confirmed that it worked well on datasets generated through several processes. In the final part of this paper, we suggest an application of TRANS-AM to 5'UTR generation.
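
The subspace idea can be sketched for a single regression tree as follows. This is a simplified illustration under stated assumptions, not the TRANS-AM implementation: the tree `TREE` is hand-written rather than trained, only one tree is used instead of a random forest, distance is Euclidean, and a small margin `eps` keeps the new point strictly inside a subspace whose lower boundary is open.

```python
import math

# Hand-written toy regression tree (hypothetical, not trained on any data).
# Internal node: go left when x[feature] <= threshold, otherwise go right.
TREE = {
    "feature": 0, "threshold": 0.5,
    "left": {"value": 1.0},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"value": 3.0},
        "right": {"value": 8.0},
    },
}


def leaf_boxes(node, n_features, box=None):
    """Yield (box, value) per leaf; box[i] = (low, high) means low < x[i] <= high."""
    if box is None:
        box = [(-math.inf, math.inf)] * n_features
    if "value" in node:
        yield box, node["value"]
        return
    f, t = node["feature"], node["threshold"]
    low, high = box[f]
    left, right = list(box), list(box)
    left[f] = (low, min(high, t))
    right[f] = (max(low, t), high)
    yield from leaf_boxes(node["left"], n_features, left)
    yield from leaf_boxes(node["right"], n_features, right)


def predict(node, x):
    """Follow the splits from the root down to a leaf and return its value."""
    while "value" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]


def project(x, box, eps=1e-6):
    """Closest point to x inside the box, kept eps above each open lower edge."""
    return [min(max(xi, low + eps), high) for xi, (low, high) in zip(x, box)]


def nearest_satisfying(tree, x, threshold):
    """Return (distance, new_x) for the nearest leaf whose value exceeds threshold."""
    best = None
    for box, value in leaf_boxes(tree, len(x)):
        if value <= threshold:
            continue  # this subspace does not satisfy the condition
        candidate = project(x, box)
        dist = math.dist(x, candidate)
        if best is None or dist < best[0]:
            best = (dist, candidate)
    return best  # None when no subspace satisfies the condition


x = [0.2, 1.0]
result = nearest_satisfying(TREE, x, threshold=5.0)
print(result, predict(TREE, result[1]))
```

Because each leaf is an axis-aligned box, the nearest point in a leaf is found by clamping each coordinate to the box, so the search reduces to one projection per qualifying leaf.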