1 In Silico Identification of Promoters in Prokaryotic Genomes Manju Bansal Molecular Biophysics Unit Indian Institute of Science Bangalore

Presentation transcript:

1 In Silico Identification of Promoters in Prokaryotic Genomes Manju Bansal Molecular Biophysics Unit Indian Institute of Science Bangalore Indo-Russia Workshop Novosibirsk Oct 2008

2 How does RNA polymerase know where to start transcription? It recognizes sequence motifs that match the consensus sequences in the -10 and -35 regions, but these motifs show large variability, and similar sequences also occur in non-promoter regions.

3 Some typical promoter sequence motifs. Few sequence motifs match the consensus sequence exactly; large variability is seen, and similar sequences are seen in non-promoter regions. The -35 and -10 hexamers are separated by a 17 bp spacer and lie upstream of the TSS.
Promoter | -35 hexamer | -10 hexamer
Consensus | TTGACA | TATAAT
araBAD | CTGACG | TACTGT
araC | TGGACT | GACACT
galP1 | GTCACA | TATGGT

4 Because the sequence motifs are only 6-10 bp long and are degenerate, the probability of finding similar sequences in regions other than promoters is quite high. E. coli genome size: 4,639,221 bp. E. coli DNA has ~1400 annotated promoter sites in the EcoCyc database but ~4500 annotated genes. Number of '-10 consensus' hexamer sequences expected in E. coli: 1058 with an exact match (no changes from the consensus), 35,762 with 1 mismatch, and 326,746 with 2 mismatches (e.g. consensus TATAAT vs TATGGT). In other words, E. coli should have a '-10 like' sequence every 4400 nt (exact match), every 130th nt (1 mismatch) or every 14th nt (2 mismatches).
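The spacing figures on this slide follow from dividing the genome length by the number of hits. The sketch below is illustrative only: the function names are hypothetical, the scan covers a single strand, and the hit counts are simply the ones quoted on the slide rather than recomputed from the genome.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_matches(seq: str, motif: str, max_mismatches: int) -> int:
    """Count windows of seq that differ from motif in at most max_mismatches positions."""
    k = len(motif)
    return sum(
        hamming(seq[i:i + k], motif) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def mean_spacing(genome_length: int, n_hits: int) -> float:
    """Average distance (nt) between hits if they were spread evenly."""
    return genome_length / n_hits


GENOME_LENGTH = 4_639_221  # E. coli genome size quoted on the slide
SLIDE_COUNTS = {0: 1058, 1: 35_762, 2: 326_746}  # allowed mismatches -> hits

for mismatches, hits in SLIDE_COUNTS.items():
    print(mismatches, round(mean_spacing(GENOME_LENGTH, hits)))
```

The slide rounds the exact-match spacing to 4400 nt; the one- and two-mismatch spacings come out at about 130 nt and 14 nt.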

5 Does this indicate that there are other signals which help in positioning RNA polymerase? Hence the analysis of structural properties of a DNA sequence, to locate signals that are: relevant to transcription from a functional/mechanistic/structural point of view; unique to promoter sequences, so that they can differentiate promoters from non-promoters; predictable from a given sequence. For example: 1) DNA STABILITY (ability of DNA to open up), 2) DNA CURVATURE (intrinsically curved DNA structure), 3) DNA BENDABILITY (ability of DNA to bend).

6 Why Stability? An important step in transcription is the formation of an open complex, which involves strand separation of the DNA duplex upstream of the transcription start site (TSS). This separation takes place without the help of any external energy. Hence evaluating the stabilities of promoter sequences may give some clues.

7 Stability of base-paired dinucleotides, based on Tm (melting temperature) data for a collection of 108 oligonucleotide duplexes. SantaLucia J (1998) Proc. Natl. Acad. Sci. USA 95(4).
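A stability profile of the kind shown on the following slides can be sketched by summing dinucleotide free energies over a sliding window. This is a minimal illustration, not the authors' code. The ten values below are recalled from the unified nearest-neighbour set of SantaLucia (1998) and should be checked against the paper; the 15-nt window is taken from the CRP slide later in the talk; initiation and end corrections are ignored.

```python
# Nearest-neighbour free energies (kcal/mol at 37 degrees C) for the ten
# unique dinucleotide steps. ASSUMPTION: values recalled from SantaLucia
# (1998); verify against the original table before relying on them.
UNIQUE_STEPS = {
    "AA": -1.00, "AT": -0.88, "TA": -0.58, "CA": -1.45, "GT": -1.44,
    "CT": -1.28, "GA": -1.30, "CG": -2.17, "GC": -2.24, "GG": -1.84,
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


# A step and its reverse complement describe the same base-pair stack,
# so the ten unique values expand to all sixteen dinucleotides.
STEP_ENERGY = dict(UNIQUE_STEPS)
for step, dg in UNIQUE_STEPS.items():
    STEP_ENERGY[reverse_complement(step)] = dg


def stability_profile(seq: str, window: int = 15) -> list:
    """Free energy of each window, summed over its (window - 1) dinucleotide steps."""
    seq = seq.upper()
    steps = [STEP_ENERGY[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n = window - 1
    return [sum(steps[i:i + n]) for i in range(len(steps) - n + 1)]
```

Less negative values mean a less stable duplex, so an A+T-rich window sits higher in the profile than a G+C-rich one.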

8 A representative free energy profile for 1000nt long E. coli promoter sequence

9 Kanhere and Bansal, Nucl. Acids Res. (2005) 33. Vertebrates: 252; Plants: 74; E. coli: 227; B. subtilis: 89.

10 Curved DNA sequences are present in the upstream region
Gene name | Organism | Distance of the bent site from TSS | Reference
virF | Shigella flexneri | -137 | Prosseda et al. (2004)
per-fdx | Clostridium perfringens | -43 | Kaji et al. (2003)
Streptokinase | Streptococcus equisimilis H46A | -98 | Malke et al. (2000)
aprE | Bacillus subtilis | -103 | Jan et al. (2000)
nifLA | Klebsiella pneumoniae | -95 | Cheema et al. (1999)
GyrA | Streptococcus pneumoniae | -23 | Balas et al. (1998)
appY | Escherichia coli | -350 (from start codon) | Atlung et al. (1996)
rrnB P1 | Escherichia coli | -110 | Gaal et al. (1994)
ompF | Escherichia coli K | to -71 | Huang et al. (1994)

11 Figure panels: Roll at junction; Roll at every step

12 Dinucleotide parameters Bansal M (1996) Biological Structure and Dynamics, Proceedings of the Ninth Conversation (Vol. I) pp

13 A representative intrinsic curvature profile for 1000nt long E. coli promoter sequence

14 Kanhere and Bansal, Nucleic Acids Research (2005) 33,

15 DNA bendability Protein DNA

16 Kanhere and Bansal, Nucl. Acids Res. (2005) 33,

17 Distribution of different signals in 272 E. coli promoters. Venn diagram values as extracted (region labels not recoverable): 17%, 19%, 2%, 3%, 4%, 24%. 10% of sequences show no signals; 90% show at least one signal.

18 Hence: the upstream and downstream regions, with respect to the TSS, show considerable differences in their properties. The upstream region is less stable, more rigid and more curved than the downstream region, in both prokaryotic and eukaryotic genomes. The stability signal is much more common than the other two signals. Some of the promoters which do not show any of the three signals are internal, secondary or weak promoters.

19 Can incorporating these features help in improving promoter prediction tools? Since the low stability signature was found to be the most common in promoters, it was examined first. E. coli promoter data was studied in detail, with B. subtilis and M. tuberculosis as further examples.

20 Average stability profile for 429 E. coli promoters (from EcoCyc Database V 9.1), located at least 500 nt apart

21 Nucleotide composition (in %) for three bacterial systems. The difference between Mtb and the others is clearly seen.
Columns: A, T, G, C and A+T for each of E. coli, B. subtilis and M. tuberculosis.
Rows: whole genome; upstream region (-200 to -100); downstream region (100 to 200); promoter region (-80 to +20). (Table values were not preserved in the transcript.)
The composition was calculated for promoter sequences of 101 nt length (ranges from -200 to -100, 100 to 200 and -80 to +20 with respect to the TSS). 582 promoter sequences from E. coli, 305 promoter sequences from B. subtilis and 42 promoter sequences from M. tuberculosis were obtained when the TSSs are 200 nt apart.
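The composition calculation described on this slide can be sketched as follows. The helper names are hypothetical, and the coordinate convention is an assumption: the TSS is treated as position 0 and both ends of a range are inclusive, which is what makes -80 to +20 come out at 101 nt. Ranges that run past the start of the sequence are not handled.

```python
from collections import Counter


def region(seq: str, tss_index: int, start: int, end: int) -> str:
    """Slice positions start..end (inclusive) relative to the TSS.

    ASSUMPTION: tss_index is the 0-based index of the TSS, counted as
    position 0, so -80..+20 spans 101 nt.
    """
    return seq[tss_index + start: tss_index + end + 1]


def composition_percent(seq: str) -> dict:
    """Percentage of A, T, G, C and A+T in a non-empty sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    pct = {base: 100.0 * counts[base] / len(seq) for base in "ATGC"}
    pct["A+T"] = pct["A"] + pct["T"]
    return pct


# Usage on a made-up 1001-nt sequence with the TSS at index 500.
example = "ATGC" * 250 + "A"
promoter = region(example, 500, -80, 20)
print(len(promoter), composition_percent(promoter)["A+T"])
```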

22 Average stability profile for promoter sequences that are 500 nt apart: A) 429 E. coli promoters (from EcoCyc Database V 9.1); B) 239 B. subtilis promoters (from DBTBS Database); C) 40 M. tuberculosis promoters (from MtbRegList Database). One sharp peak corresponding to high A+T content is seen.

23

24 Sensitivity and precision for promoter prediction on experimentally verified bacterial TSSs that are 500 nt apart.
Columns: E. coli, B. subtilis, M. tuberculosis.
Cutoff values applied (kcal/mol): D-cutoff = 1.0 (E. coli), 1.5 (B. subtilis), 1.0 (M. tuberculosis); the E-cutoff values were not preserved in the transcript.
Rows: total no. of promoter sequences of 1001 nt length considered for analysis; no. of true positives; no. of false positives; no. of false negatives (after cycle I: 58716; after cycle II: 6120; digits run together in the transcript). Other table values were not preserved.
Calculated Sensitivity = TP/(TP+FN). Calculated Precision = TP/(TP+FP).
False negatives after the first cycle are taken for the second cycle of promoter prediction, with an E1 window size of 50 nt. False negatives remaining after the second cycle are considered for the sensitivity calculation. True positives and false positives are added up after the first and second prediction cycles.
Definition of TP, FP: V Rangannan and M Bansal, J. Biosci. 32, (2007).
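The two-cycle bookkeeping described on this slide can be sketched as below. The function names and all counts are hypothetical and serve only to show how the totals are combined: true and false positives are summed over both cycles, while only the false negatives left after the second cycle enter the sensitivity.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def combine_cycles(cycle1: dict, cycle2: dict) -> dict:
    """Add TP and FP over both cycles; keep only the FN left after cycle II."""
    return {
        "tp": cycle1["tp"] + cycle2["tp"],
        "fp": cycle1["fp"] + cycle2["fp"],
        "fn": cycle2["fn"],
    }


# Hypothetical counts: cycle II is run only on the 20 false negatives
# of cycle I and recovers 15 of them.
cycle1 = {"tp": 80, "fp": 10, "fn": 20}
cycle2 = {"tp": 15, "fp": 5, "fn": 5}
total = combine_cycles(cycle1, cycle2)
print(sensitivity(total["tp"], total["fn"]), precision(total["tp"], total["fp"]))
```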

25 Number of nucleotides between each TSS (#729) and the TLS of the first gene that follows it: min dist = 0, max dist = 708. Average stability profile for all 4461 E. coli gene sequences of 1001 nt length (-500 to +500 w.r.t. TLS), aligned w.r.t. their TLS.

26 E. coli: average stability profile for 1089 protein promoter sequences and 59 RNA promoter sequences. E. coli: average stability profile for 34 tRNA promoter sequences and 13 other RNA promoter sequences.

27 Whole genome annotation for promoter regions in E. coli and B. subtilis.
Columns, for each of E. coli and B. subtilis: forward strand of the genome (protein coding genes, RNA genes); reverse strand of the genome (protein coding genes, RNA genes); total.
Rows: no. of TSSs (a); no. of genes; no. of predictions; TP calculated w.r.t. gene TLS (b); FP calculated w.r.t. gene TLS (b); TP calculated w.r.t. TSS (c). (Table values were not preserved in the transcript.)
a: 3 TSSs of E. coli and 1 TSS of B. subtilis regulate protein as well as RNA genes.
b: True and false positives are identified against the genes in the forward and reverse strands.
c: A true positive is calculated with respect to the annotated TSS (located in the -150 to +50 nt region w.r.t. TSS).
63% and 68% accuracy (precision) is achieved for E. coli and B. subtilis respectively w.r.t. TLS.
75% and 59% reliability is achieved for E. coli and B. subtilis respectively w.r.t. annotated TSS (against 37% in the case of SIDD for 927 TSS in E. coli).

28 Whole genome annotation of promoter regions over the M. tuberculosis genome.
Columns: forward strand of the genome (protein coding genes, RNA coding genes); reverse strand of the genome (protein coding genes, RNA coding genes); total.
Rows: no. of genes; no. of predictions; TP calculated w.r.t. gene TLS (41%); FP calculated w.r.t. gene. (Other table values were not preserved in the transcript.)

29 All false positives need not be REAL false positives. In prokaryotic genomes, the intergenic region is very small (~12%). Experimental evidence shows that for some genes the regulating transcription start site lies within the coding region of an upstream neighboring gene. For example, the E. coli rpoS gene has its TSS (rpoSp) within the coding region of the nlpD gene, 567 nt away from its own TLS. Lange R, Fischer D and Hengge-Aronis R, J Bacteriol. (1995) 177(16).

30 Distribution of coding and intergenic regions in the bacterial genomes. Histograms show the distribution of predicted promoter regions in different genomic regions of the E. coli, B. subtilis and M. tuberculosis genomes. Color coding for intergenic and coding regions is shown at top right.

31 Predicted promoter region distribution in the E. coli genome (over ALL 1145 EcoCyc annotated, 1001 nt long promoter sequences).

32 Comparison of our method of promoter prediction with NNPP, w.r.t TLSS at position 0

33 Average energy profile for E.coli genomic fragment 9000bp to 15300bp

34 Average energy profile for E.coli genomic fragment bp to bp (DIV intergenic region)

35 Average energy profile for E.coli genomic fragment bp to bp (CON intergenic region)

36 Conclusions. The relative stability of DNA in neighboring regions can help in annotating promoter regions in whole genomes. The method is quite general and is shown to work for genomes with varying AT/GC content. The stability criterion performs better than other commonly used methods based on sequence motif search, as well as the superhelix induced destabilization in DNA (SIDD) method.

37 Number of promoter sequences grouped according to their %GC content in the three bacterial systems.
Columns: %GC interval; no. of sequences analyzed (*) in E. coli, B. subtilis (B. sub) and M. tuberculosis (Mtb).
Rows: %GC intervals starting at 30, followed by a Total row. (Interval bounds and counts were not preserved in the transcript.)
* TSSs which are 500 nt apart are considered in E. coli, B. subtilis and M. tuberculosis.
GC categorization is based on the %GC over 1001 nt long promoter sequences (ranging from -500 to +500 w.r.t. TSS).

38 Average free energy distribution over promoter sequences with diverse GC composition: (A) -500 to +500 region with respect to TSS; (B) -80 to +20 region with respect to TSS. The average free energies over promoter regions with similar GC composition are approximately the same, with E. coli and B. subtilis nearly overlapping for the %GC intervals 35-40%, 40-45% and 45-50% in the case of 1001 nt long promoter regions.

39 Thresholds of free energy values used to predict promoters in genomic DNA with varying GC content. E is the average free energy over the -80 to +20 region of known promoters, and D is the difference between E and the average free energy over random sequences generated from downstream (+100 to +500 region) genomic sequence (REav).
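One way to read these two thresholds is as a joint test on a candidate window. The sketch below is an assumption about how E and D might be applied, not the published procedure: the function name is hypothetical, the direction of both comparisons (less negative free energy means less stable, hence more promoter-like) is inferred from the earlier slides, and the cutoff and energy values in the usage lines are made up.

```python
from statistics import mean


def is_promoter_like(window_energies, reference_energies, e_cutoff, d_cutoff):
    """Apply both thresholds to one candidate region.

    window_energies: free energies (kcal/mol) over the candidate region.
    reference_energies: free energies over the downstream or shuffled
    reference, whose mean plays the role of REav.
    ASSUMPTION: a region qualifies when its mean energy E exceeds e_cutoff
    and D = E - REav exceeds d_cutoff.
    """
    e = mean(window_energies)
    d = e - mean(reference_energies)
    return e > e_cutoff and d > d_cutoff


# Hypothetical values: the candidate is 3.0 kcal/mol less stable than
# its reference and lies above the energy cutoff.
print(is_promoter_like([-17.0, -17.5], [-20.0, -20.5], e_cutoff=-18.0, d_cutoff=1.0))
```

Requiring both conditions means a region must be unstable in absolute terms and also unstable relative to its own surroundings, which is what lets one scheme cover genomes with different GC content.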

41

42 Stability characteristics of a TF binding site (e.g. CRP). A region of high stability corresponds to a binding site for CRP in E. coli. The high-stability trough extends for ~22 nucleotides (window size = 15 nt), which is the same as the footprint size of the protein reported in the literature.

43 E. coli CRP binding site consensus sequence for 209 sites

44 CRP: Average stability profile

45 CRP: Average stability profile for manipulated sequences. Sequence layout: NNNNNNNNNNNNNTGTGANNNNNNACACANNNNNNNNNNNNN (5' flanking region, TGTGA, 6-nt linker, ACACA, 3' flanking region).
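The TGTGA / 6-nt linker / ACACA layout can be located in a sequence with a simple pattern search. This is only an illustration with a hypothetical function name: it finds exact matches to the two half-sites, whereas real CRP sites are degenerate and would need a mismatch-tolerant or weight-matrix search.

```python
import re

# Lookahead so that overlapping candidate sites are all reported.
CRP_PATTERN = re.compile(r"(?=(TGTGA[ACGT]{6}ACACA))")


def find_crp_like_sites(seq: str) -> list:
    """Return (0-based start, matched site) for each exact TGTGA-N6-ACACA match."""
    return [(m.start(1), m.group(1)) for m in CRP_PATTERN.finditer(seq.upper())]


# Usage on a made-up sequence with one site starting at index 2.
print(find_crp_like_sites("CCTGTGATCTAGAACACAGG"))
```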

46 CRP: Average bendability profile TGTGANNNNNNACACA

47 Thank You. Acknowledgements: Dr Dhananjay Bhattacharyya, Dr Aditi Kanhere, Ms Vetriselvi R, Mr Vikas Sarma, Mr Nishad Matange. Financial Support: Dept of Biotechnology, India.

48

49 Coding and intergenic region distribution in the E. coli and B. subtilis genomes. Histograms show the distribution of predicted promoter regions in different intergenic regions of the E. coli and B. subtilis genomes (as per the color coding in the legend).

50 NarL: Binding site Consensus sequence

51 NarL: Average stability profile

52 NarL: Average bendability profile

Definition of thresholds of free energy values used to predict promoters in bacterial genome sequences. G specifies the average free energy over the entire genome. E is the average free energy over known promoter regions. All energy values are in kcal/mol and the standard deviation values are also indicated. E-cutoff and D-cutoff are the thresholds used to predict promoter regions.
Columns: E. coli, B. subtilis, M. tuberculosis. (Numeric values were not preserved in the transcript.)
Average free energy G calculated over the whole genome sequence: Mean G; Standard Deviation (σ); G Eav (Mean+3σ).
Average free energy E calculated over the upstream region of the TSS: upstream region considered with respect to TSS: -80 to +20 (101 nt length) for E. coli, -80 to +20 (101 nt length) for B. subtilis, -40 to +20 (61 nt length) for M. tuberculosis; Mean E; Standard Deviation (σ); E-cutoff (Mean+3σ).
D-cutoff (E-cutoff - G Eav).

54

55 Variation in base composition and average free energy (AFE) in different regions of bacterial genomes. Promoter sequences of 491, 283 and 40 TSSs which are 500 nt apart are considered from E. coli, B. subtilis and M. tuberculosis respectively. Sequences are aligned with respect to the TSS. Standard deviation from the respective mean is given in brackets. Entries marked n.p. were not preserved in the transcript.
Region extracted w.r.t. TSS (length) | E. coli AFE | E. coli G+C | B. subtilis AFE | B. subtilis G+C | M. tuberculosis AFE | M. tuberculosis G+C
Upstream region -500 to -100 (401 nt) | n.p. (1.0) | 0.49 (0.06) | n.p. (0.8) | 0.43 (0.05) | n.p. (0.7) | 0.65 (0.03)
-500 to -100 (401 nt) shuffled sequence | n.p. (1.0) | | n.p. (0.8) | | n.p. (0.6) |
Downstream region 100 to 500 (401 nt) | n.p. (0.7) | 0.49 (0.04) | n.p. (0.7) | 0.44 (0.04) | n.p. (0.5) | 0.66 (0.03)
100 to 500 (401 nt) shuffled sequence | n.p. (0.7) | | n.p. (0.7) | | n.p. (0.5) |
Promoter region -80 to +20 (101 nt) | n.p. (1.3) | 0.42 (0.08) | n.p. (1.0) | 0.33 (0.06) | n.p. (1.0) | 0.61 (0.05)
-80 to +20 (101 nt) shuffled sequence | n.p. (1.2) | | n.p. (0.9) | | n.p. (0.9) |
Longer region -500 to +500 (1001 nt) | n.p. (0.7) | 0.49 (0.04) | n.p. (0.5) | 0.42 (0.03) | n.p. (0.4) | 0.65 (0.02)
-500 to +500 (1001 nt) shuffled sequence | n.p. (0.6) | | n.p. (0.5) | | n.p. (0.33) |
Whole genome | -20.1 (2.4) | n.p. | n.p. (2.3) | n.p. | n.p. (2.1) | 0.66

Average stability profile for promoter sequences from three different organisms: 491 E. coli promoters from EcoCyc Database (version number lost in extraction); B. subtilis promoters from DBTBS Database; 40 M. tuberculosis promoters from MtbRegList Database.