Connecting Lung Adenocarcinoma Gene Expression to Image Features Learned from Classification Schemes

Antonio Victor Andrew Asuncion (1561204)


Advances in convolutional neural network (CNN) architectures have demonstrated that morphological features extracted from histological images can be effective for classifying subtypes of various diseases, especially cancer. Separately, gene expression and mutation data have been a rich and ubiquitous resource for studies of cancer prognosis, survival, and related outcomes. In this study, we present a method for classifying transcriptome subtypes of lung adenocarcinoma from slices of pathological images, using features from convolutional autoencoders pretrained on smaller images. We then establish a correlation between the features extracted from histological images and their corresponding gene expression data as a stepping stone toward whole slide image analysis.

We implemented variants of autoencoders as building blocks for the pretrained convolutional layers of neural networks. Building on these, we propose a sparse deep autoencoder and apply it to images of size 2048x2048. We used this model for feature extraction from pathological images of lung adenocarcinoma, which comprises three transcriptome subtypes. The sparse autoencoder network achieves a classification accuracy of 98.9\%.
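As a rough illustration of what a sparsity-regularized autoencoder objective can look like, the following minimal sketch combines a reconstruction error with an L1 penalty on the hidden code. The L1 penalty, the single dense layer, the toy input size, and all names (`sparse_autoencoder_loss`, `sparsity_weight`) are assumptions made for illustration; the text above does not specify the penalty or the layer configuration used in this study, which is convolutional and deep.

```python
import numpy as np

def sparse_autoencoder_loss(x, w_enc, b_enc, w_dec, b_dec, sparsity_weight):
    """Mean squared reconstruction error plus an L1 penalty on the hidden code.

    Assumption: L1 sparsity on a single ReLU layer, chosen only for illustration.
    """
    hidden = np.maximum(0.0, x @ w_enc + b_enc)   # ReLU encoder
    reconstruction = hidden @ w_dec + b_dec       # linear decoder
    mse = np.mean((reconstruction - x) ** 2)
    l1 = np.mean(np.abs(hidden))                  # encourages few active units
    return mse + sparsity_weight * l1, hidden

# Toy data: 8 flattened patches of 64 values each (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.random((8, 64))
w_enc = rng.normal(scale=0.1, size=(64, 16))
b_enc = np.zeros(16)
w_dec = rng.normal(scale=0.1, size=(16, 64))
b_dec = np.zeros(64)

loss_plain, hidden = sparse_autoencoder_loss(x, w_enc, b_enc, w_dec, b_dec, 0.0)
loss_sparse, _ = sparse_autoencoder_loss(x, w_enc, b_enc, w_dec, b_dec, 0.1)
```

With the weights held fixed, the two losses differ only by the weighted L1 term, which is the quantity a training procedure would trade off against reconstruction quality.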

The second part of this study uses features extracted from tiles as a starting point for whole slide image analysis. We use 512x512 tiles as input to Google's Inception architecture, which outputs a 2048-dimensional vector for each tile. We applied dimension reduction to these vectors and correlated them with RNA-Seq data. We then fitted a linear model to further associate the features extracted from the fully-connected layer with the gene expression values, and used it for inference and further analysis.
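The pipeline described above (dimension reduction, correlation with expression values, then a linear model) can be sketched as follows. This is only an illustrative outline under stated assumptions: PCA is used as the dimension reduction step, Pearson correlation as the association measure, and ordinary least squares as the linear model, none of which are confirmed by the text. The data are synthetic stand-ins, and all function and variable names are hypothetical.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def pearson_matrix(a, b):
    """Pearson correlation of every column of `a` with every column of `b`."""
    a_std = (a - a.mean(axis=0)) / a.std(axis=0)
    b_std = (b - b.mean(axis=0)) / b.std(axis=0)
    return a_std.T @ b_std / a.shape[0]

def fit_linear_model(x, y):
    """Ordinary least squares with an intercept; returns (coef, intercept)."""
    design = np.hstack([x, np.ones((x.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, y, rcond=None)
    return weights[:-1], weights[-1]

# Synthetic stand-ins: 60 samples of 2048-dimensional image features and
# 5 gene expression values per sample (sizes other than 2048 are assumptions).
rng = np.random.default_rng(0)
n_samples, n_features, n_genes, n_components = 60, 2048, 5, 10
features = rng.normal(size=(n_samples, n_features))
reduced = pca_reduce(features, n_components)
true_weights = rng.normal(size=(n_components, n_genes))
noise = 0.1 * rng.normal(size=(n_samples, n_genes))
expression = reduced @ true_weights + noise

corr = pearson_matrix(reduced, expression)       # components x genes
coef, intercept = fit_linear_model(reduced, expression)
predicted = reduced @ coef + intercept           # inferred expression values
```

Here `corr` holds one correlation per (component, gene) pair, and `predicted` shows how the fitted model would be used to infer expression values from image features alone.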

The results show that a larger input image, covering a sufficiently large area of the tissue, is required to recognize transcriptome subtypes. This study shows the potential of autoencoders as a feature extraction paradigm and paves the way for a whole slide image analysis tool that predicts molecular subtypes of tumors from pathological features. The analysis following from this model's results serves as a good starting point for outcome and prognostic predictions.