NAIST-IS-MT1751011: Yoshitaka Inoue

Chemical compounds annotation using machine learning for high resolution tandem spectra

Yoshitaka Inoue

When identifying a compound from its mass spectrum, it is usual to perform a library search, which identifies the compound by comparing the measured spectrum with the spectra in a library. The problem is the number of spectra in the library: although 40,000 kinds of compounds are currently known, fewer than 15,000 entries link a spectrum to a compound. To address this problem, various machine learning methods for linking spectra to compounds have been proposed in recent years, but they are still under development.
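As an illustration of the library search described above, the following is a minimal sketch that ranks library entries by cosine similarity to a query spectrum. The abstract does not state which similarity measure or binning is used in practice; the cosine score, the 1.0 m/z bin width, the function names, and the toy peak lists are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra stored as {bin: intensity}."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def library_search(query_peaks, library, bin_width=1.0):
    """Return (score, name) pairs sorted from best to worst match."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), name)
        for name, peaks in library.items()
    ]
    return sorted(scored, reverse=True)


# Hypothetical toy library: compound name -> list of (m/z, intensity) peaks.
library = {
    "compound_A": [(50.2, 10.0), (77.1, 100.0), (105.0, 40.0)],
    "compound_B": [(43.0, 100.0), (58.1, 30.0), (71.3, 5.0)],
}
query = [(50.3, 12.0), (77.0, 95.0), (105.1, 35.0)]

ranking = library_search(query, library)
best_score, best_name = ranking[0]
```

A search of this kind can only return compounds whose spectra are already in the library, which is the limitation the paragraph above points out.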
In this research, we propose a method that classifies and annotates compound types using machine learning on mass spectral information. The data set was obtained from MoNA (http://mona.fiehnlab.ucdavis.edu/), the comprehensive mass spectrum database of the Fiehn lab at the UC Davis Genome Center. As training data, we combined MS/MS spectra acquired at three dissociation energy levels, HCD 35, 45, and 65. We constructed a matrix with the measured m/z values as columns, the samples as rows, and the intensities as values, and then selected features from it to use as input. Each sample was given a class label, and classes containing 10 or more compounds were used for training. To see how the results changed with the selected classes, we divided the dataset into seven levels for each ion mode (positive and negative) and trained on each level. Features were selected with a chi-square test. For the learning algorithm, we first compared two gradient boosting methods (XGBoost and LightGBM) and random forest, and then ran the experiments with LightGBM, which showed high average accuracy.
As a result, accuracy declined as the number of label types increased, in both negative and positive modes. However, as more labels were added, the number of compounds classified with 100% accuracy also increased. Next, we gathered the highly accurate classes for each ion mode and measured their accuracy, obtaining 96% in positive mode and 84% in negative mode. When we added these classes to the experimental classes for each ion mode, we obtained 85% in positive mode and 70% in negative mode. In positive mode, this figure comes from combining the classes that had given 96% with the 4-class set that had previously given 85%. The set of classes could thus be extended while maintaining some degree of accuracy.
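The matrix construction and chi-square feature selection described above can be sketched roughly as follows. This is not the thesis implementation: the 1.0 m/z bin width, the function names, the toy spectra, and the class labels "X" and "Y" are assumptions, and the chi-square statistic shown is the common formulation for non-negative features (observed versus expected intensity totals per class), which may differ from the exact test used in the study.

```python
from collections import defaultdict


def build_matrix(spectra, bin_width=1.0):
    """Rows are samples, columns are m/z bins, values are summed intensities."""
    bins = sorted({int(mz // bin_width) for peaks in spectra for mz, _ in peaks})
    column = {b: j for j, b in enumerate(bins)}
    matrix = [[0.0] * len(bins) for _ in spectra]
    for i, peaks in enumerate(spectra):
        for mz, intensity in peaks:
            matrix[i][column[int(mz // bin_width)]] += intensity
    return matrix, bins


def chi2_scores(matrix, labels):
    """Chi-square score of each column against the class labels."""
    n_samples = len(matrix)
    classes = sorted(set(labels))
    class_prob = {c: labels.count(c) / n_samples for c in classes}
    scores = []
    for j in range(len(matrix[0])):
        total = sum(row[j] for row in matrix)
        observed = defaultdict(float)
        for row, label in zip(matrix, labels):
            observed[label] += row[j]
        score = 0.0
        for c in classes:
            expected = class_prob[c] * total
            if expected > 0:
                score += (observed[c] - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(matrix, labels, k):
    """Keep the k columns with the highest chi-square scores."""
    scores = chi2_scores(matrix, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in matrix], keep


# Hypothetical toy data: four spectra in two classes.
spectra = [
    [(77.0, 100.0), (105.0, 20.0)],
    [(77.1, 90.0), (105.2, 25.0)],
    [(43.0, 80.0), (105.1, 22.0)],
    [(43.2, 95.0), (105.0, 18.0)],
]
labels = ["X", "X", "Y", "Y"]

matrix, bins = build_matrix(spectra)
scores = chi2_scores(matrix, labels)
reduced, keep = select_top_k(matrix, labels, k=2)
selected_bins = [bins[j] for j in keep]
```

In this toy case the bin shared by both classes scores low and is dropped, while the two class-specific bins are kept. The reduced matrix would then be passed to a classifier such as LightGBM.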
In addition, since the purpose of this research is annotation, we want to increase the number of classes as much as possible. We therefore added one class that had been less accurate in the previous experiment, and were still able to maintain an accuracy of 85%.
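Gathering highly accurate classes, as described above, requires per-class accuracy. The sketch below shows one way to compute it and to pick the classes above a cutoff. The 0.9 threshold, the function names, and the toy labels are assumptions for illustration; the abstract does not state the criterion actually used.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly predicted samples for each true class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {c: correct[c] / total[c] for c in total}


def high_accuracy_classes(y_true, y_pred, threshold=0.9):
    """Classes whose per-class accuracy is at least the threshold (assumed cutoff)."""
    accuracy = per_class_accuracy(y_true, y_pred)
    return sorted(c for c, a in accuracy.items() if a >= threshold)


# Hypothetical true and predicted labels for eight samples in three classes.
y_true = ["a", "a", "a", "b", "b", "b", "c", "c"]
y_pred = ["a", "a", "a", "b", "c", "b", "c", "a"]

accuracy = per_class_accuracy(y_true, y_pred)
selected = high_accuracy_classes(y_true, y_pred)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Retraining on the selected classes and then adding classes back one at a time makes it possible to track how overall accuracy changes as the class set grows.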