Discovery of New Regulatory Motifs of Purine Biosynthetic Genes in Escherichia coli and Bacillus subtilis
Haifeng Zhao, Indiana University School of Informatics

Outline of Presentation
Project Goals
Introduction
PlatCom
Discovery of DNA Regulatory Motifs
Results
Discussion

Project Goals
1. Develop a platform for the comparative study of predicted proteins and genomic sequences.
2. Analyze the transcription regulatory motifs of the de novo purine biosynthetic pathway of Escherichia coli and Bacillus subtilis.

Problems in biology involve huge amounts of data and are typically computationally hard. One approach to these problems is to develop computational environments in which multiple computational tools and databases are integrated into exploratory frameworks where user interaction can guide the search.

Good results in computer-assisted functional annotation of nucleotide sequences have frequently been obtained by combining statistical analysis of DNA with comparative analysis of the protein sequences encoded by the respective genes. This approach assumes that groups of genes subject to a specific mode of regulation (regulons) are at least partially conserved in evolution. Under this approach, the assignment of a gene to a particular regulon is reinforced if not only the gene itself but also its orthologs in other genomes have candidate regulatory sites in the appropriate regions. I use this approach to analyze the PurR regulons of the de novo purine biosynthetic genes of E. coli and Bacillus subtilis.

Recognition of transcription regulation sites (operators) is a hard problem in computational molecular biology. My project suggests an approach to this problem based on simultaneous analysis of several related genomes. It appears that as long as the gene coding for a transcription regulator is conserved in the compared bacterial genomes, the regulation of the respective group of genes (the regulon) tends to be maintained. Thus a gene can be predicted to belong to a particular regulon when not only the gene itself but also its orthologs in other genomes have candidate operators in their regulatory regions. This gives greater sensitivity in operator identification, because even relatively weak signals are likely to be functionally relevant when they are conserved.
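The reinforcement rule described above can be illustrated with a minimal Python sketch. The data layout (a dictionary of gene identifiers mapped to whether a candidate site was found, and a dictionary of orthologs) and the `min_genomes` threshold are assumptions made for illustration; they are not part of PlatCom.

```python
def conserved_regulon_members(candidate_sites, orthologs, min_genomes=2):
    """Return genes whose candidate site is supported by orthologs.

    candidate_sites: dict gene_id -> True if a candidate operator was
                     found upstream of that gene (hypothetical layout).
    orthologs:       dict gene_id -> list of ortholog gene_ids in other genomes.
    min_genomes:     how many genomes (the gene's own included) must show
                     a candidate site before the gene is accepted.
    """
    members = []
    for gene, has_site in candidate_sites.items():
        if not has_site:
            continue
        # Count the gene itself plus every ortholog that also has a site.
        support = 1 + sum(
            1 for o in orthologs.get(gene, []) if candidate_sites.get(o, False)
        )
        if support >= min_genomes:
            members.append(gene)
    return members
```

With this rule a weak site upstream of one gene is kept only when at least one ortholog also carries a candidate site, which is the sense in which conservation raises sensitivity.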

Purine de novo synthesis
Purine nucleotides play an important role in many biochemical processes; ATP is the main source of energy. The de novo synthesis proceeds via a 14-step pathway that branches after IMP. In E. coli there are 14 reactions involved in the de novo synthesis of AMP and GMP, and these are controlled by 12 known genes.

GenBank Data
Figure: GenBank data directories, divided into completely sequenced genomes and incompletely sequenced genomes. Directories shown: A_thaliana, Bacteria, C_elegans, Plasmodium_falciparum, P_falciparum, S_cerevisiae, D_melanogaster, Anopheles_gambiae, H_sapiens, R_norvegicus, MITOCHONDRIA, M_musculus, Escherichia_coli, Bacillus_subtilis, …

PlatCom: A Platform for Computational Comparative Genomics
1. Building databases of all pairwise comparisons.
2. A toolkit for multiple genome comparisons.

The goal of the PlatCom project is to develop a platform for the comparative study of predicted proteins and genomic sequences, on which biologists can perform comparative analysis of multiple genomes. To achieve this goal, the platform should let biologists choose an arbitrary set of genomes and perform an analysis on them. The most immediate technical problem is that comparing whole genomes (protein or genomic sequences) takes a huge amount of time, so in practice it is not possible to perform analyses of multiple genomes from scratch.

The platform is therefore designed in two steps. First, all pairwise comparisons of completely sequenced genomes are computed with widely used pairwise comparison tools, such as BLAST and FASTA, and stored in a database. Second, we will design and implement a suite of computational tools that perform sequence analyses from the precomputed database, and integrate them into the platform.

PlatCom: GenBank Data and Comparison Files
Diagram: GenBank data are compared with BlastZ (a gapped BLAST algorithm designed for aligning two long genomic sequences) and with FASTA; the comparison files are stored as *.fna.cmp, *.faa.cmp, and *.est.cmp.

Three kinds of comparison files exist on biokdd: fna, faa, and est comparison files. In GenBank, fna and faa files are FASTA nucleic acid files and FASTA amino acid files, respectively. Expressed Sequence Tags (ESTs) are short, single-pass sequence reads from mRNA (cDNA). BlastZ takes fna files as input and FASTA takes faa files; the output comparison files go to the fna.cmp and faa.cmp folders. BlastZ is an independent implementation of the Gapped BLAST algorithm, specifically designed for aligning two long genomic sequences. Piptool is software that uses the output of BlastZ to display the detailed information needed to interpret alignments (5). BlastZ and Piptool, along with GeneScan and Sim4, are all installed on biokdd.

PlatCom: Browser, NCBI FTP Server, IBM Super Computer Server
Diagram: the PlatCom browser, the NCBI FTP server, and the IBM super computer server.

To achieve this goal, the platform should let biologists choose an arbitrary set of genomes and perform an analysis on them.

Dynamically Update the Databases
PlatCom: Dynamically Update the Databases
Update genome data
Add new genome data
Automatically detect missing data

GenBank periodically updates its data, and our system needs to keep up with all the changes. One of my programs takes genome names as input, deletes the old GenBank data and comparison data, downloads the new data from GenBank, and generates new comparison data. I also wrote programs that detect missing comparison files and update our system automatically. The heavy computation can be done locally or sent to the IBM-SP.
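As an illustration of how missing comparison files might be detected, here is a minimal Python sketch. The file-naming scheme (`<genomeA>__<genomeB>.faa.cmp`) and the function names are hypothetical; the slides do not describe how PlatCom names or stores its comparison files.

```python
from itertools import combinations
from pathlib import Path


def expected_pairs(genomes):
    """All unordered genome pairs that should have a comparison file."""
    return list(combinations(sorted(genomes), 2))


def missing_comparisons(genomes, cmp_dir, suffix=".faa.cmp"):
    """Return the genome pairs whose comparison file is absent.

    Assumes one file per pair named '<A>__<B><suffix>' with A < B
    alphabetically (an illustrative convention, not PlatCom's).
    """
    cmp_dir = Path(cmp_dir)
    missing = []
    for a, b in expected_pairs(genomes):
        if not (cmp_dir / f"{a}__{b}{suffix}").exists():
            missing.append((a, b))
    return missing
```

The number of pairs grows as n(n-1)/2 with the number of genomes n, which is why the comparisons are precomputed and only the missing ones are regenerated.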

Discovery of DNA Regulatory Motifs
Schematic of the motif discovery process: genome sequences → predict coregulated set of genes → use motif-finding algorithm on upstream regions → DNA regulatory motifs.

Potentially coregulated groups of genes are obtained from comparative analysis of genome sequences, or from other compilations of biological data, such as databases of metabolic and functional pathways. The upstream regions of the genes that are believed to be coregulated are then aligned with a motif-finding algorithm to see whether they share a significant upstream DNA motif, which could correspond to the binding site of a regulatory transcription factor. The biological significance of this motif can then be tested experimentally, and the trans-acting factor that binds to it can be identified experimentally.

Identify De Novo Purine (PurR) Biosynthetic Genes of E. coli
The purine repressor (PurR) is a DNA-binding protein involved in the regulation of transcription. Its name indicates its function: it represses the synthesis of purines. In E. coli there are 14 reactions involved in the de novo synthesis of AMP and GMP, and these are controlled by 12 known genes.

Identify Orthologs of Bacteria in COG Database
COG families found (547 genes in total): COG0015, COG0026, COG0034, COG0041, COG0046, COG0047, COG0138, COG0150, COG0151, COG0152, COG0299, COG0516, COG0517, COG0518, COG0519, …

I used the names of the de novo purine biosynthetic genes of E. coli as queries to search the COG database. I found 15 COG families, each of which contains at least one de novo purine biosynthetic gene of E. coli.

"COG" stands for Clusters of Orthologous Groups of proteins. COGs were identified using an all-against-all sequence comparison of the proteins encoded in completely sequenced genomes. For a protein from a given genome, this comparison reveals the one protein from each of the other genomes to which it is most similar. Orthologs are proteins from different species that evolved by vertical descent (speciation) and typically retain the same function as the original.

Only bacteria that are completely sequenced and can be found in GenBank are counted. The total number of genes in these COG families is 547. For each gene I recorded its name, the bacterium it belongs to, and its start and end positions.

Identify Upstream Regulatory Regions
Figure: an operon of genes C, B, and A, with the operon head marked.

If a gene lies within an operon, its promoter and regulatory region may lie several genes upstream, and it is difficult to predict the first gene of an operon. George Church recorded the entire sequence of all intergenic segments longer than 10 bp between the gene of interest and the operon head (9). However, only a few of the genes recorded from the COG database lie close enough together (less than 100 bp apart) to be treated as an operon. For this reason, I extract only the upstream region of the "operon head" of each gene cluster, which means that 514 instead of 547 genes are used for further study. I used two upstream cutoff distances, 100 bp and 300 bp: 300 bp to ensure inclusion of the correct upstream region, and 100 bp to reduce inclusion of extraneous regions.
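The clustering of nearby genes and the choice of an operon head can be sketched in Python as follows. This is a simplified illustration under stated assumptions: genes are grouped when they are on the same strand and separated by fewer than `max_gap` bp, genes on the opposite strand lying in between are ignored, and the `Gene` record and function name are hypothetical.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: str  # "+" or "-"


def operon_heads(genes, max_gap=100):
    """Group same-strand genes closer than max_gap bp; return each group's head.

    The head is the leftmost gene on the "+" strand and the rightmost
    gene on the "-" strand, i.e. the first gene to be transcribed.
    """
    heads = []
    for strand in "+-":
        same = sorted((g for g in genes if g.strand == strand),
                      key=lambda g: g.start)
        cluster = []
        for g in same:
            # Start a new cluster when the intergenic gap is too large.
            if cluster and g.start - cluster[-1].end - 1 >= max_gap:
                heads.append(cluster[0] if strand == "+" else cluster[-1])
                cluster = []
            cluster.append(g)
        if cluster:
            heads.append(cluster[0] if strand == "+" else cluster[-1])
    return [g.name for g in heads]
```

Only the upstream region of each returned head would then be extracted, so every cluster contributes a single sequence to the motif search.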

Convert Gene Names of COG Database to Gene Names of GenBank Database
I found that the gene names in the COG database are not consistent with the gene names in the GenBank database: more than 50% of them differ. Because my program works only on the platform, which uses GenBank data, the names have to be converted. Once I have the genome names and the GenBank gene names, I can start to extract upstream regions from the GenBank database.

Extract upstream regions
Diagram: GenBank *.gbk files → parser → databases of upstream regions.

I wrote a parser, implemented in Perl, to extract upstream regions. The parser takes the genome name, the gbk file name (gbk stands for the GenBank flat file format), the gene name, and the sequence length as parameters, and generates FASTA-format output containing the sequence of the upstream region, the strand, and the starting point of the sequence. The sequence length parameter is set to 100 and 300 for this project. I will make this parser available in the hope that it helps other bioinformatics projects that need to extract the upstream regions of genes in GenBank.
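The core of upstream extraction can be sketched in Python. This is not the Perl parser described above: it skips GenBank parsing entirely and assumes the genome sequence and the gene coordinates are already in memory, treats the genome as linear, and uses hypothetical function names.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, length):
    """Return up to `length` bases upstream of a gene, read 5'->3'.

    genome:     full sequence of the replicon as a string.
    start, end: 1-based inclusive gene coordinates, start < end.
    strand:     "+" or "-".
    The region is truncated at the sequence ends (no circular wrap-around).
    """
    if strand == "+":
        lo = max(0, start - 1 - length)
        return genome[lo:start - 1]
    if strand == "-":
        hi = min(len(genome), end + length)
        return reverse_complement(genome[end:hi])
    raise ValueError(f"unknown strand: {strand!r}")


def fasta_record(gene, strand, length, seq):
    """Format one upstream region as a FASTA record."""
    return f">{gene} strand={strand} upstream={length}\n{seq}\n"
```

For a gene on the minus strand the upstream region lies at higher coordinates, so it is taken from after the gene's end and reverse-complemented so that every record reads toward the start codon.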

Motif-Finding Algorithms
Gibbs Sampler Algorithm
AlignACE (based on Gibbs Sampler)
MEME
MACAW

AlignACE is a Gibbs sampling algorithm for identifying motifs that are over-represented in a set of DNA sequences. The MAP score is the parameter AlignACE uses to determine the statistical significance of the alignments it samples, given the composition of the input sequence (13). MAP scores are normalized so that an alignment of zero sites is assigned a score of zero. The MAP score is higher for similar motifs with greater numbers of aligned sites and for more tightly conserved motifs. It is lower for an identical alignment of sites derived from a larger set of input sequences, for motifs with more dispersed information content, and for motifs enriched in nucleotides that are more prevalent in the genome.

For E. coli and other bacteria, regulatory motifs tend to lie close to the start codon, so the 300 bp and 100 bp upstream region databases are used for motif finding. AlignACE found 85 motifs in the 100 bp upstream region database and 100 motifs in the 300 bp upstream region database. In the AlignACE output file, motifs are listed in order of descending MAP score. Each motif is reported with four columns: the site sequence, the number of the sequence in which the site was found, the position of the site in that sequence, and the strand.

The core of MEME is Expectation Maximization (EM), an unsupervised learning algorithm guaranteed to converge to a local maximum. That is, any motif found by MEME will be "better" (according to MEME's statistical criteria) than any other motif that differs infinitesimally from it. MEME can a) favor motifs that appear exactly once in each sequence in the training set (the one-per model); b) favor motifs that appear zero or one time in each sequence in the training set (the zero-or-one-per model); or c) give no preference to the number of occurrences (the zero-or-more-per model) (14). I selected model c).

The E-value of a motif reported by MEME is an approximation of the E-value of its log likelihood ratio. It estimates the number of motifs (with the same width and number of occurrences) that would have an equal or higher log likelihood ratio if the training set sequences had been generated randomly according to the (0-order portion of the) background model.
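To make the log likelihood ratio concrete, here is a minimal Python sketch that computes it for a set of aligned sites against a 0-order background. It only illustrates the statistic itself; MEME's conversion of this value into an E-value is more involved and is not reproduced here. The function name and the input layout are assumptions.

```python
import math
from collections import Counter


def log_likelihood_ratio(sites, background):
    """Log likelihood ratio (in bits) of aligned sites vs. a 0-order background.

    sites:      list of equal-length DNA strings (the aligned motif occurrences).
    background: dict base -> background probability, e.g. {"A": 0.25, ...}.
    For each column, every observed base contributes
    count * log2(column_frequency / background_frequency).
    """
    width = len(sites[0])
    n = len(sites)
    llr = 0.0
    for col in range(width):
        counts = Counter(s[col].upper() for s in sites)
        for base, count in counts.items():
            llr += count * math.log2((count / n) / background[base])
    return llr
```

For four identical sites `ACGT` against a uniform background, each column contributes 4 × log2(1 / 0.25) = 8 bits, for a total of 32 bits; a column in which all four bases occur equally often contributes nothing.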

Run AlignACE and MEME
Diagram: the 100 bp and 300 bp upstream databases are run through AlignACE and MEME to produce motifs; the motifs are then searched with ScanACE and MAST against Escherichia coli and Bacillus subtilis to produce sites.

The first step in characterizing a new DNA motif is to look at the distribution of its sites throughout the rest of the genome. ScanACE finds the best matching sites for a motif in a target sequence. It uses the same scoring mechanism that AlignACE uses in the sampling phase. The number of sites returned is controlled by the –s option. I used –s=1000 and –s=2000 to search for the top 1000 and top 2000 sites in the regulatory regions of the de novo purine biosynthetic genes of Escherichia coli and Bacillus subtilis.

MAST is a tool for searching biological sequence databases for sequences that contain one or more of a group of known motifs. MAST takes as input a file generated by MEME, which contains the descriptions of one or more motifs, and searches a sequence database that you select for sequences that match the motifs. MAST works by calculating match scores for each sequence in the database against each of the motifs in the MEME results. For each sequence, the match scores are converted into various types of p-values, and these are used to determine the overall match of the sequence to the group of motifs and the probable order and spacing of the motif occurrences in the sequence.
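The idea of ranking candidate sites and keeping the top hits can be illustrated with a generic position weight matrix scan in Python. This is a stand-in technique, not ScanACE's or MAST's scoring: it uses a plain log-odds matrix with a pseudocount, scans the forward strand only, and all names and parameters are assumptions.

```python
import heapq
import math


def build_log_odds(sites, background, pseudocount=1.0):
    """Build a log-odds matrix (list of dicts, one per column) from aligned sites."""
    width = len(sites[0])
    n = len(sites)
    matrix = []
    for col in range(width):
        column = {}
        for base in "ACGT":
            count = sum(1 for s in sites if s[col].upper() == base)
            # The pseudocount is spread over bases according to the background.
            freq = (count + pseudocount * background[base]) / (n + pseudocount)
            column[base] = math.log2(freq / background[base])
        matrix.append(column)
    return matrix


def top_sites(sequence, matrix, n_hits):
    """Return the n_hits best windows as (score, position, window), best first."""
    width = len(matrix)
    seq = sequence.upper()
    scored = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(b not in "ACGT" for b in window):
            continue  # skip windows containing ambiguous bases
        score = sum(matrix[j][b] for j, b in enumerate(window))
        scored.append((score, i, window))
    return heapq.nlargest(n_hits, scored)
```

Here `n_hits` plays the role that the number-of-sites option plays in the description above: raising it returns more, lower-scoring candidates.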

DPInteract (E. coli)

Escherichia coli is well studied, and its complete genomic sequence offers a special opportunity to exploit systematically the variety of regulatory data available in the literature in order to make a comprehensive set of regulatory predictions for the whole genome. DPInteract was created by K. Robison and G. M. Church of Harvard Medical School. It collects 55 E. coli DNA-binding proteins with known binding sites, compiles DNA-binding site matrices, and reports searches over the complete E. coli K12 genome. The following are the binding sites for PurR in DPInteract:

>guaBA 48->74
ggtagatgcaatcggttacgctctgt
>purB -205->-179
TGCCGACGCAATCGGTTACCTTGATG
>purC 148->174
atgatacgcaaacgtgtgcgtctgca
>purEK 66->92
GAGCAAGGAAAACGGTTGCGTGGCTG
>purF
tccctacgcaaacgttttctttttct
>purH 102->128
GCGTTGCGCAAACGTTTTCGTTACAA
>purL 71->97
tttccacgcaaacggtttcgtcagcg
>purMN 59->85
cagtctcgcaaacgtttgctttccct
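As a worked example, the eight PurR sites listed above can be turned into per-column base counts and a simple majority consensus in Python. The function names are hypothetical, and the sketch assumes the sites are already aligned, which holds here because all eight are 26 bp long.

```python
from collections import Counter

# The eight PurR binding sites listed for DPInteract above.
PURR_SITES = {
    "guaBA": "ggtagatgcaatcggttacgctctgt",
    "purB":  "TGCCGACGCAATCGGTTACCTTGATG",
    "purC":  "atgatacgcaaacgtgtgcgtctgca",
    "purEK": "GAGCAAGGAAAACGGTTGCGTGGCTG",
    "purF":  "tccctacgcaaacgttttctttttct",
    "purH":  "GCGTTGCGCAAACGTTTTCGTTACAA",
    "purL":  "tttccacgcaaacggtttcgtcagcg",
    "purMN": "cagtctcgcaaacgtttgctttccct",
}


def column_counts(sites):
    """Per-column base counts for equal-length sites (case-insensitive)."""
    seqs = [s.upper() for s in sites]
    width = len(seqs[0])
    if any(len(s) != width for s in seqs):
        raise ValueError("sites must all have the same length")
    return [Counter(s[i] for s in seqs) for i in range(width)]


def consensus(sites):
    """Most frequent base in each column (ties go to the base seen first)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(sites))
```

Calling `column_counts(PURR_SITES.values())` shows that the central columns are strongly conserved: positions 10 and 11 are A, position 13 is C, and position 14 is G in all eight sites (1-based), while the flanking columns vary.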

pur Operon of Bacillus subtilis
The Bacillus subtilis purEKBCQLFMNHD operon, called the pur operon, encodes 10 enzymes required for de novo purine synthesis. DNase I footprinting of the pur operon covered the upstream region from -179 to -30. The common DNA recognition element for the binding of PurR to the pur operon is not known. IU alumnus James Watson won the Nobel Prize in 1962.

Results: AlignACE (Escherichia Coli)
                                               100 bp database   300 bp database
Number of motifs                               85                100
Known DPInteract sites found (top 1000 hits)   25%               14%
Known DPInteract sites found (top 2000 hits)   50%               29%

For AlignACE, I found that setting –s higher identifies more binding sites. I also found that the 100 bp database identifies more binding sites than the 300 bp database (within both the top 1000 and the top 2000 hits), although some known binding sites lie beyond the 100 bp upstream region.

Results: AlignACE (Bacillus Subtilis)
Gene name   Starting point   Strand   Sequence           MAP score
purE        -176             +        cgcagaagcgaacgac   High MAP score

Results: MEME (Escherichia Coli)
                                                100 bp database   300 bp database
Number of motifs                                30                10
Known DPInteract sites found (E-value < 10)     50%               38%

Results: MEME (Bacillus Subtilis)
Gene name   Starting point   Strand   Sequence          E-value
purE        -72              +        TGTCTTTCTCGAACT   0.11

Results: Locations of Mapped Genes of Escherichia coli and Bacillus subtilis
I localized up to 50% of the experimentally characterized sites of Escherichia coli described in DPInteract.
AlignACE: the sites of purEK, purB, purMN, and guaBA were unambiguously identified by ScanACE.
MEME: the sites of purHD, purEK, guaBA, and purL were unambiguously identified by MAST.

Discussion
PlatCom: A Platform for Comparative Study of Multiple Genomes
Multiple genomes: Escherichia coli, Bacillus subtilis, …
Multiple tools: AlignACE, MEME, …
Performance: AlignACE, one day; MEME, one week.
Motifs found: AlignACE, 85 and 100 motifs; MEME, 30 and 10 motifs.
Many significant new motifs are found.

MEME takes much longer than AlignACE to produce results (4 – 5 times longer). However, its results appear more reliable: it identifies a similar or larger share of the known sites while reporting far fewer motifs. So it seems that MEME could achieve better performance than AlignACE.

Computer analysis has been used to predict bacterial transcription signals for more than a decade, and the results have served as a basis for further experimental work. The platform project offers biologists a great opportunity to perform comparative analysis of multiple genomes, including transcription signal detection. In this project, I detected up to 50% of the known regulatory motifs of the de novo purine biosynthetic genes of Escherichia coli and found a good candidate motif for the pur operon of Bacillus subtilis.

It is important that predictions of regulatory sites be confirmed by experimental data. Unfortunately, other genomes are not as well studied as Escherichia coli. This project is an attempt to move from regulatory prediction in one completely sequenced genome to regulatory prediction in multiple genomes. Improvement can be expected as more experimental data are collected.

Acknowledgement
Sun Kim, Advisor
Zhiping Wang, Classmate

First, I want to thank my supervisor Sun Kim, who supported me with all his knowledge and friendship throughout my studies. I would also like to thank my classmate Zhiping Wang, who gave me great advice.
