Systematization of Genome and Transcriptome Data Based on BLSOM

Bai Yu (1361202)


Kohonen’s Self-Organizing Map (SOM) is a powerful tool for clustering and visualizing high-dimensional, complex data on a two-dimensional map. A variant known as batch-learning SOM (BLSOM) produces results that are unaffected by the order of the input data. In this thesis, we applied BLSOM to genome sequence analysis and RNA-seq data analysis. First, we characterized vertebrate genomes using BLSOM. We analyzed pentanucleotide compositions in 100-kb segments from a wide range of vertebrate genomes and then, as a separate study, the compositions in the human and mouse genomes in order to investigate a method for detecting differences between closely related genomes. BLSOM successfully recognized the species-specific combination of oligonucleotide frequencies in each genome, which is called a “genome signature,” as well as regions specifically enriched in transcription-factor binding sequences. We further used BLSOM to analyze RNA-seq data from wild-type and bleached-type Euglena gracilis. Our analyses characterized the regulation changes of transcription factors under aerobic and anaerobic conditions and linked transcription factors with metabolic pathways. Transcription factors and other genes placed in the same area of the map follow the same patterns of regulation change. Collectively, the results of our analyses can provide a valuable resource for Euglena gracilis research, in particular novel insights into the responses of Euglena gracilis to anaerobic stress, and offer candidate genes or markers that can guide future efforts to develop anaerobic-tolerant Euglena gracilis cultivars.
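
The two technical ideas named above, oligonucleotide composition per genome segment and an order-independent batch update, can be illustrated with a minimal sketch. This is a generic batch SOM written under stated assumptions, not the BLSOM implementation used in the thesis: the function names (`kmer_frequencies`, `segment_vectors`, `batch_som_epoch`), the Gaussian neighborhood, the non-overlapping windows, and the skipping of k-mers that contain ambiguous bases are all illustrative choices that the abstract does not specify.

```python
from itertools import product

import numpy as np


def kmer_frequencies(seq, k=5):
    """Normalized k-mer frequencies (vector of length 4**k) for one segment."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = np.zeros(len(index))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])  # None for k-mers with N or other symbols
        if j is not None:
            counts[j] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts


def segment_vectors(genome, window=100_000, k=5):
    """Split a sequence into non-overlapping windows (assumed layout) and
    return one frequency vector per full-length window."""
    starts = range(0, len(genome) - window + 1, window)
    return np.array([kmer_frequencies(genome[s:s + window], k) for s in starts])


def batch_som_epoch(data, weights, grid, sigma):
    """One batch update: every input is assigned to its best-matching node
    first, then all node vectors are recomputed together. Because the update
    is a sum over all inputs, it does not depend on the input order."""
    # Squared distance from each input (rows) to each node (columns).
    dist = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    bmu = dist.argmin(axis=1)
    # Gaussian neighborhood (an assumed kernel) between each node and each
    # input's best-matching node, measured on the 2-D grid.
    grid_dist = ((grid[:, None, :] - grid[bmu][None, :, :]) ** 2).sum(axis=2)
    h = np.exp(-grid_dist / (2.0 * sigma ** 2))  # shape: (nodes, inputs)
    # New node vectors are neighborhood-weighted means of the inputs.
    return (h @ data) / h.sum(axis=1, keepdims=True)
```

In use, `segment_vectors` would turn each genome into a matrix of composition vectors, and `batch_som_epoch` would be called repeatedly with a shrinking `sigma`. Shuffling the rows of `data` leaves the result of one epoch unchanged up to floating-point rounding, which is the order-independence property the abstract attributes to batch learning.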