Extracting Relations between Promoter Sequences and Their Strengths from Microarray Data

木立 尚孝 (0251034)

The relations between promoter sequences and their strengths were extensively studied in the 1980s. Although those studies uncovered strong sequence-strength correlations, their elaborate experimental methods were too costly to apply to a large number of promoters.

On the other hand, the recent growth of microarray data allows us to compare the expression levels of thousands of genes with their DNA sequences.

These observations motivated us to use microarray data to investigate the relations between a large number of promoter sequences and their activities.

In this thesis, we studied the relations between promoter sequences and their strengths using E. coli microarray data. We modeled those relations using a simple weight matrix, which was optimized by a novel support vector regression method.
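As a rough illustration of what a weight-matrix model of promoter strength looks like, the sketch below scores the -35 and -10 hexamers with a linear sum of per-position, per-base weights. It is not the thesis's method: the function names are hypothetical, the toy weights simply reward matches to the well-known E. coli consensus hexamers TTGACA and TATAAT, and both the spacer-length term and the support vector regression that would fit the weights are omitted.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a flat 0/1 list, four entries per position."""
    return [1.0 if base == b else 0.0 for base in seq.upper() for b in BASES]


def consensus_weights(consensus, match=1.0):
    """Toy weights (assumption): `match` for the consensus base, 0 otherwise."""
    return [match * x for x in one_hot(consensus)]


def promoter_score(weights, bias, minus35, minus10):
    """Linear weight-matrix score: bias plus the weight picked by each base."""
    features = one_hot(minus35) + one_hot(minus10)
    if len(features) != len(weights):
        raise ValueError("sequence length does not match the weight matrix")
    return bias + sum(w * x for w, x in zip(weights, features))


# Toy weight matrix built from the consensus -35 and -10 hexamers.
weights = consensus_weights("TTGACA" + "TATAAT")
full_match = promoter_score(weights, 0.0, "TTGACA", "TATAAT")
one_mismatch = promoter_score(weights, 0.0, "TTGACA", "TATAAG")
```

With these toy weights every mismatch costs the same amount. A fitted weight matrix would instead assign a different weight to each base at each position, which is what allows the per-position and per-base effects to differ.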

The results confirmed the conclusions of previous studies: promoter sequences close to the consensus sequence have strong activities, and the -35 region has a smaller effect on promoter strength than the -10 region.

Moreover, we found several novel results. The relative importance of each base position in a promoter sequence varies considerably more than expected from simple base frequency counting methods. The first three positions in the -35 region and the first, second, fifth and sixth positions in the -10 region strongly affect promoter strength, while the fourth and sixth positions in the -35 region and the second position in the -10 region have almost no effect on the strength despite high base conservation at these positions. Spacer lengths within 15-19 base pairs have a smaller effect on promoter strength than the base sequence contributions. The reduction in strength when a base at a promoter position differs from the consensus base also varies greatly among the substituted bases. For example, guanine at the sixth position of the -10 region has a much worse effect on promoter strength than adenine or cytosine, even though these three bases are observed at similar frequencies at that position.

We have also identified possibly regulated genes by looking for outliers, that is, genes whose observed expression strengths differ significantly from the predicted promoter strengths.
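One simple way to picture this outlier search is to compare each gene's observed expression with its predicted promoter strength and flag the genes whose residual lies far from the typical residual. The sketch below is only an assumption about how such a screen could be set up: the function name, the gene labels, and the two-standard-deviation threshold are all hypothetical and are not taken from the thesis.

```python
from statistics import mean, pstdev


def find_outliers(observed, predicted, threshold=2.0):
    """Return genes whose residual (observed - predicted) deviates from the
    mean residual by more than `threshold` standard deviations."""
    residuals = {g: observed[g] - predicted[g] for g in observed if g in predicted}
    values = list(residuals.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all residuals identical, so nothing stands out
    return sorted(g for g, r in residuals.items() if abs(r - mu) > threshold * sigma)


# Hypothetical data: nine genes match their prediction, one is far above it.
predicted = {f"gene{i}": 1.0 for i in range(10)}
observed = dict(predicted)
observed["gene7"] = 5.0
flagged = find_outliers(observed, predicted)
```

Genes flagged in this way are candidates for regulation by factors other than the promoter sequence itself, since the sequence-based prediction alone fails to explain their expression.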

These results could not be obtained with previous methods such as base frequency counting and base mutation experiments.

Our method is applicable to other procaryotes for which both promoter sequences and microarray data are available.