Data-Intensive Science of Plant Classification based on Metabolite-Content Similarity

Liu Kang (1561027)


Metabolite-content (MC) refers to all small molecules that are products or intermediates of metabolism within an organism. In this study, we treat the metabolite features of plants as a new taxonomic marker and classify plants by their MC-similarity using the KNApSAcK Core DB. To reduce the effect of missing plant-metabolite relation data, we propose two approaches:

(1) Classification of Plants based on Chemical Structure Similarity of Metabolite-Content. We calculated the structural similarity of all metabolite pairs using Tanimoto coefficients (TCs), and determined the MC-similarity of plants based on the background population of TCs.

(2) Clustering Plants based on Structural-Similarity Network of Metabolites. We applied a network-based approach to abstract groups of structurally similar metabolites as features, and measured the phylogenetic distance between plants with a binary method.

The results indicate that the MC-similarity of plants is associated with pathway similarity and bioactivity similarity, and that it can be regarded as a taxonomic marker that takes into account both general phylogenetic relations and the relations between plants based on bioactive features.
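As an illustration of the Tanimoto coefficient underlying approach (1), the following minimal sketch computes the TC between two metabolites and a simple plant-level score. It assumes each metabolite is represented as a set of "on" fingerprint bit indices. The function `fraction_similar_pairs` and its default threshold of 0.85 are hypothetical and used only for illustration. They are not the scoring based on the background population of TCs used in this study.

```python
from itertools import product


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def fraction_similar_pairs(plant_a, plant_b, threshold=0.85):
    """Hypothetical plant-level score: the fraction of cross-plant metabolite
    pairs whose TC is at least `threshold` (an assumed cutoff)."""
    pairs = list(product(plant_a, plant_b))
    if not pairs:
        return 0.0
    similar = sum(1 for m, n in pairs if tanimoto(m, n) >= threshold)
    return similar / len(pairs)


# Toy fingerprints (invented bit indices, not real KNApSAcK data).
plant_x = [frozenset({1, 2, 3}), frozenset({5, 6})]
plant_y = [frozenset({1, 2, 3}), frozenset({7})]

tc = tanimoto(frozenset({1, 2, 3}), frozenset({2, 3, 4}))
score = fraction_similar_pairs(plant_x, plant_y)
print(tc, score)
```

Because the score depends on structural similarity rather than on exact metabolite identity, two plants can still score as similar when their recorded metabolites differ but are structurally close, which is the motivation for using TCs when plant-metabolite relation data are incomplete.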

We also extended our findings by using a phylogenetic statistical method to investigate the predictive power of MC-similarity in the exploration of edible and medicinal plants for bioprospecting. We reconstructed phylogenetic trees for the same set of plants based on MC-similarity and on sequence similarity. The results show that plants with medicinal or edible uses are more significantly clustered in the MC-based phylogenetic trees than in the sequence-based phylogenetic trees.