Various pesticides have been developed and used worldwide to increase the yields of vegetables, fruits, and other agricultural crops. The maximum residue level (MRL) of pesticides in food is set to 10 ppb (parts per billion) globally. In practice, residues can exceed the MRL for several reasons, such as "post-harvest treatment" (heavy application of pesticides to vegetables for export to prolong protection) and "drift" (accidental spread of pesticides to neighboring fields by strong wind). Many food analysis laboratories therefore routinely analyze residual pesticides in foods by LC-MS and GC-MS to ensure food safety.
This thesis addresses two problems in the analysis of residual pesticides in foods. The first problem is the "matrix effect", in which chemical interference from other compounds in the food (the sample matrix) alters the recovery rate of a pesticide. Food contains many compounds, such as metabolites and fiber, that must be removed before analysis. However, sample preparation does not remove the matrix completely, i.e., trace levels of matrix components remain in the sample. These components interfere chemically with the pesticides and change the measured amount of pesticide. The recovery rate therefore normally has to be validated by actually injecting the sample into the LC-MS or GC-MS. In Chapter 2, a data science approach to this problem is proposed that applies the quantitative structure-property relationship (QSPR) to a pesticide analysis report covering seven crops and 248 pesticides. Molecular descriptors that represent the chemical properties of each pesticide, such as molecular shape and polarizability, were obtained and used as the explanatory variables in regression analysis to predict the recovery rate with 89 machine learning methods. In addition to prediction performance, the time required to build each prediction model was evaluated, because some machine learning methods required more than two hours for model building. Chapter 3 discusses the selection of molecular descriptors by removing highly correlated descriptors, which avoids the multicollinearity that affects some machine learning methods, and compares the resulting prediction performance and model building time.
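The idea of removing highly correlated molecular descriptors can be illustrated with a minimal sketch. It is an assumption-laden stand-in, not the procedure used in the thesis: the greedy rule, the cutoff of 0.9, the function names `pearson` and `drop_correlated`, and the descriptor names and values are all hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / math.sqrt(var_x * var_y)

def drop_correlated(descriptors, cutoff=0.9):
    """Greedy filter: keep a descriptor only if its absolute correlation
    with every descriptor kept so far is at most `cutoff` (assumed value)."""
    kept = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical descriptor table: one list per descriptor, one value per pesticide.
descriptors = {
    "mol_weight": [200.0, 250.0, 300.0, 350.0],
    "heavy_atoms": [14.0, 17.5, 21.0, 24.5],  # proportional to mol_weight
    "logP": [1.2, 3.4, 0.5, 2.8],
}
print(drop_correlated(descriptors))
```

In this toy table, `heavy_atoms` is dropped because it is perfectly correlated with `mol_weight`, while `logP` is kept. The result of a greedy filter depends on the order in which descriptors are visited, which is one reason a production workflow might prefer an established routine over this sketch.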
The second problem is the "amenability" of pesticides to LC-MS or GC-MS. Both techniques are used for pesticide analysis, but apart from a few rough guidelines there is no quantitative method for deciding which should be used. When a new pesticide needs to be analyzed, experienced chemists can roughly predict its amenability from past experience and chemical knowledge, but such skill and knowledge are not easily transferred to the younger generation. As experienced chemists approach retirement in an aging society, food analysis laboratories risk losing this ability to predict pesticide amenability. In Chapter 4, a QSPR approach was developed to classify pesticide amenability using two validated analysis reports. Molecular descriptors of the pesticides were obtained and used as the explanatory variables for classification, and 119 classification methods in the caret package were investigated. The prediction performance and the time required to build each prediction model were evaluated. Chapter 5 applies molecular descriptor selection to the classification and compares the resulting classification performance. We demonstrate that our QSPR-based methods successfully predict the recovery rate and amenability of residual pesticides in foods. Using existing validated reports of residual pesticides in foods, we recommend machine learning methods that combine short model building time with high prediction accuracy.
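The two evaluation criteria for amenability classification, prediction performance and model building time, can be sketched as follows. This is only an illustration under stated assumptions: the nearest-centroid classifier is a simple stand-in and is not claimed to be one of the methods evaluated in the thesis, the function names are hypothetical, and the descriptor values and "LC"/"GC" labels are invented toy data.

```python
import time

def fit_centroids(rows, labels):
    """Build a nearest-centroid model: the mean descriptor vector per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    return min(
        model,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, model[lab])),
    )

def evaluate(train_x, train_y, test_x, test_y):
    """Return (test-set accuracy, model building time in seconds)."""
    start = time.perf_counter()
    model = fit_centroids(train_x, train_y)
    build_seconds = time.perf_counter() - start
    hits = sum(predict(model, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y), build_seconds

# Hypothetical, pre-scaled descriptor vectors with amenability labels.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = ["LC", "LC", "GC", "GC"]
test_x = [[0.7, 0.7], [0.3, 0.2]]
test_y = ["LC", "GC"]

accuracy, build_seconds = evaluate(train_x, train_y, test_x, test_y)
print(f"accuracy={accuracy:.2f}, build time={build_seconds:.6f} s")
```

Timing only the fitting step keeps the model building cost separate from the prediction cost, so methods can be compared on both criteria independently.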