Volatile organic compounds (VOCs) are low-molecular-weight molecules that exhibit high vapor pressure under ambient conditions. VOCs are produced naturally by living organisms and play important roles in chemical ecology and human health. In this presentation, we describe the development of a novel database of VOCs emitted by microorganisms, fungi, plants, and humans, which records the relations among emitting species, VOCs, and their biological activities. We have deposited the VOC data in the KNApSAcK Metabolite Ecology Database, which is currently available online. The accumulated data are divided into two types: (1) microorganism species-VOC relations and (2) emitting species-VOC-biological activity relations.
We first explain the clustering analyses applied to the first data type: hierarchical clustering and graph clustering with the DPClus algorithm, both used to extract clusters of microorganisms based on VOC similarity. Both clustering results indicated that VOC-based classification of microorganisms is consistent with their classification based on pathogenicity.
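As an illustration of clustering microorganisms by VOC similarity, the following minimal sketch groups species by the overlap of their emitted VOC sets. It makes several assumptions that are not stated above: the distance is the Jaccard distance between VOC sets, the linkage is average linkage, and the species names and VOC profiles are hypothetical toy data. It does not reproduce DPClus or the settings actually used.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a ∩ b| / |a ∪ b| for two sets of VOC names."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    profiles maps a species name to the set of VOCs it emits.
    Cluster distance is the mean pairwise Jaccard distance (average linkage).
    """
    clusters = [[name] for name in sorted(profiles)]

    def dist(c1, c2):
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(jaccard_distance(profiles[x], profiles[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(clusters)


# Hypothetical toy profiles (not taken from the database).
profiles = {
    "species_A": {"ethanol", "acetone", "indole"},
    "species_B": {"ethanol", "acetone", "isoprene"},
    "species_C": {"dimethyl disulfide", "2-pentanone"},
    "species_D": {"dimethyl disulfide", "2-pentanone", "indole"},
}

clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

With these toy profiles, species sharing most of their VOCs end up in the same cluster. Whether such clusters align with pathogenicity would have to be checked against external annotations.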
For the second data type, we performed heatmap clustering of all VOCs, using the Tanimoto coefficient as the similarity index between chemical structures. We further assessed the statistical significance of the clusters using hypergeometric p-values to understand the relationships between the chemical structures of VOCs and their biological activities. Additionally, we extended our analysis with supervised machine learning methods such as Deep Neural Network (DNN), Gradient Boosting Machine (GBM), Random Forest (RF), and Generalized Linear Model (GLM) as classification models for predicting the biological activities of VOCs from their chemical structures. The best classification model was obtained by training the GBM algorithm on PubChem fingerprints, which suggests that GBM is well suited to predicting the biological activities of VOCs.
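The two statistics named above can be sketched in a few lines. The Tanimoto coefficient between two binary fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. The hypergeometric p-value is computed here as an upper-tail (enrichment) probability, which is an assumption, since the exact test setup is not given above. The function names, fingerprint bits, and counts are hypothetical.

```python
from math import comb


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = fp_a | fp_b
    if not union:
        return 0.0
    return len(fp_a & fp_b) / len(union)


def hypergeom_pvalue(n_total, n_active, n_cluster, n_hits):
    """Upper-tail hypergeometric p-value P(X >= n_hits).

    n_total   : number of VOCs in the whole data set
    n_active  : number of VOCs annotated with the activity of interest
    n_cluster : number of VOCs in the cluster being tested
    n_hits    : number of VOCs in the cluster that carry the activity
    """
    upper = min(n_cluster, n_active)
    favourable = sum(
        comb(n_active, i) * comb(n_total - n_active, n_cluster - i)
        for i in range(n_hits, upper + 1)
    )
    return favourable / comb(n_total, n_cluster)


# Hypothetical fingerprints: indices of bits that are set.
fp_a = {1, 2, 3, 5}
fp_b = {2, 3, 5, 8}
similarity = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 distinct bits

# Hypothetical counts: 10 VOCs, 4 with the activity, a cluster of 3 holding 2 of them.
p_value = hypergeom_pvalue(n_total=10, n_active=4, n_cluster=3, n_hits=2)

print(similarity, p_value)
```

A small p-value would indicate that the activity is over-represented in the cluster relative to chance. When many clusters and activities are tested, a multiple-testing correction would normally be applied as well.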