Developing pharmaceuticals is time-consuming and extremely costly. One reason for this burden is the low probability of success in non-clinical and clinical studies, which forces repeated trial and error during development. To reduce such failures, more efficient methods of drug screening are needed. Computer-aided drug design (CADD) is considered useful for drug screening because it can dramatically reduce the number of laboratory experiments. Structure-based drug design (SBDD), represented by docking simulation, is one effective method because the energetic stability of chemical compounds bound to a target protein can be simulated computationally. However, SBDD requires three-dimensional structural information on the target protein, which is generally difficult to obtain at high resolution. As an alternative, ligand-based drug discovery (LBDD), in which numerical representations of chemical compounds are used to predict ligand activity, plays an important role. In particular, the enrichment of public databases and advances in machine learning techniques have accelerated the application of machine learning to LBDD.
When applying machine learning to LBDD, there are three problems to consider. First, an efficient way to represent chemical compounds numerically is needed. Second, the lack of negative data must be managed: little information is available showing that a given compound does not bind to a target protein. Finally, candidate compounds must not only have high binding affinity for the target protein but also avoid undesirable binding to proteins that trigger adverse side effects. To address these problems, I propose an approach for exploring candidate ligands that bind specifically to the target protein. The approach consists of four steps: 1) constructing a model that distinguishes ligands of the target protein of interest from ligands of a protein that causes adverse side effects, using a graph convolutional neural network (GCNN); 2) extracting feature vectors after the convolution and pooling processes and mapping their principal components onto a two-dimensional space; 3) specifying the chemical space of each ligand group; and 4) investigating the distribution on the map of the compounds to be explored, after obtaining their feature vectors and principal components with the same classifier and decomposer. I evaluated the effectiveness of the approach using beta-site amyloid precursor protein-cleaving enzyme 1 (BACE1) as the target protein and cathepsin D as the protein triggering adverse effects. In addition, I used the approach to explore candidate BACE1 ligands in the KNApSAcK Core Database (http://www.knapsackfamily.com/knapsack_core/top.php).
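Steps 2 to 4 of the approach can be illustrated with a minimal sketch. It makes several assumptions that are not taken from this work: random vectors stand in for the feature vectors a trained GCNN would produce after pooling, principal component analysis is computed with a plain NumPy singular value decomposition, and the "chemical space" of each ligand group is approximated by a circle around the group centroid. The function names (`fit_pca`, `project`, `group_region`, `in_region`) and the 0.95 quantile are hypothetical choices made only for illustration.

```python
import numpy as np


def fit_pca(features, n_components=2):
    """Fit PCA by SVD; the returned (mean, components) pair is the 'decomposer'."""
    mean = features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(features, mean, components):
    """Map feature vectors onto the principal components fitted earlier."""
    return (features - mean) @ components.T


def group_region(points, quantile=0.95):
    """Assumed region of a ligand group: centroid plus a radius covering
    the given fraction of the group's own members."""
    center = points.mean(axis=0)
    radius = np.quantile(np.linalg.norm(points - center, axis=1), quantile)
    return center, radius


def in_region(points, region):
    """Boolean mask of the points that fall inside a circular region."""
    center, radius = region
    return np.linalg.norm(points - center, axis=1) <= radius


# Synthetic stand-ins for pooled GCNN feature vectors (assumption: 16 dimensions).
rng = np.random.default_rng(0)
dim = 16
shift = np.zeros(dim)
shift[0] = 6.0
target_feats = rng.normal(size=(100, dim)) + shift      # target-protein ligands
offtarget_feats = rng.normal(size=(100, dim)) - shift   # off-target ligands
candidates = rng.normal(size=(20, dim)) + shift         # compounds to explore

# Step 2: fit the decomposer on the known ligands and map them in two dimensions.
mean, comps = fit_pca(np.vstack([target_feats, offtarget_feats]))
target_xy = project(target_feats, mean, comps)
off_xy = project(offtarget_feats, mean, comps)

# Step 3: specify the region occupied by each ligand group.
target_region = group_region(target_xy)
off_region = group_region(off_xy)

# Step 4: project the candidates with the SAME decomposer and locate them on the map.
cand_xy = project(candidates, mean, comps)
selective = in_region(cand_xy, target_region) & ~in_region(cand_xy, off_region)
print(f"{selective.sum()} of {len(candidates)} candidates fall only in the target region")
```

The point the sketch is meant to convey is that the candidates are never used to fit the decomposer: they are projected with the mean and components learned from the known ligands, so their positions are directly comparable with the two ligand groups on the same map.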