1 In Silico Identification of Promoters in Prokaryotic Genomes Manju Bansal Molecular Biophysics Unit Indian Institute of Science Bangalore


1 1 In Silico Identification of Promoters in Prokaryotic Genomes Manju Bansal Molecular Biophysics Unit Indian Institute of Science Bangalore mb@mbu.iisc.ernet.in Indo-Russia Workshop Novosibirsk 12-14 Oct 2008

2 2 How does RNA polymerase know where to start transcription? It recognizes sequence motifs that match the consensus sequences in the -10 and -35 regions, but these motifs show large variability, and similar sequences are also seen in non-promoter regions.

3 3 Some typical promoter sequence motifs
Few sequence motifs exactly match the consensus sequence; large variability is seen, and similar sequences are seen in non-promoter regions.
Promoter    -35 motif   -10 motif
Consensus   TTGACA      TATAAT
araBAD      CTGACG      TACTGT
araC        TGGACT      GACACT
galP1       GTCACA      TATGGT
Figure labels: 17 bp spacer between the -35 and -10 motifs; TSS downstream of the -10 motif.

4 4 Because:
The sequence motifs are only 6-10 bp long and are degenerate, so the probability of finding similar sequences in regions other than promoters is quite high.
E. coli genome size: 4,639,221 bp
E. coli DNA has ~1400 annotated promoter sites in the EcoCyc database but ~4500 annotated genes.
Number of '-10 consensus' hexamer sequences expected in E. coli: 1058 (exact match, i.e. no mismatches from the consensus), 35,762 (with 1 mismatch), 326,746 (with 2 mismatches), e.g. consensus TATAAT vs TATGGT.
Equivalently, E. coli should have a '-10 like' sequence at every 4400 nt (exact match), every 130th nt (with 1 mismatch) or every 14th nt (with 2 mismatches).
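The argument on this slide can be illustrated with a short sketch that counts '-10 like' hexamers in a sequence and estimates how many would be expected by chance. The function names are hypothetical, and the chance estimate assumes a single strand with equal base frequencies; the slide does not say how its own figures were derived, so this simple model is not expected to reproduce them exactly.

```python
from math import comb


def words_within(motif_len: int, max_mismatch: int) -> int:
    """Number of distinct DNA words differing from a motif at <= max_mismatch positions."""
    return sum(comb(motif_len, k) * 3 ** k for k in range(max_mismatch + 1))


def expected_hits_uniform(genome_len: int, motif_len: int, max_mismatch: int) -> float:
    """Expected hits on one strand, ASSUMING equal base frequencies (0.25 each)."""
    positions = genome_len - motif_len + 1
    return positions * words_within(motif_len, max_mismatch) / 4 ** motif_len


def count_hits(seq: str, motif: str, max_mismatch: int) -> int:
    """Count windows of seq that differ from motif at <= max_mismatch positions."""
    seq, motif = seq.upper(), motif.upper()
    hits = 0
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        mismatches = sum(a != b for a, b in zip(window, motif))
        if mismatches <= max_mismatch:
            hits += 1
    return hits


if __name__ == "__main__":
    genome_len = 4_639_221  # E. coli genome size quoted on the slide
    for k in range(3):
        print(k, round(expected_hits_uniform(genome_len, 6, k)))
```

Whatever the exact model, the count grows steeply with each allowed mismatch, which is the point of the slide: a degenerate hexamer alone cannot pinpoint a promoter.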

5 5 Does this indicate that there are other signals which help in positioning RNA polymerase?
Hence analysis of structural properties of a DNA sequence to locate signals that are:
Relevant to transcription from a functional/mechanistic/structural point of view.
Unique to the promoter sequences, so that they can be used to differentiate between promoters and non-promoters.
Predictable from a given sequence.
For example:
1) DNA STABILITY (Ability of DNA to open up)
2) DNA CURVATURE (Intrinsically curved DNA structure)
3) DNA BENDABILITY (Ability of DNA to bend)

6 6 Why Stability?
An important step in transcription is the formation of an open complex, which involves strand separation of the DNA duplex upstream of the transcription start site (TSS). This separation takes place without the help of any external energy. Hence evaluating the stabilities of promoter sequences may give some clues.

7 7 Stability of base-paired dinucleotides: parameters from SantaLucia J (1998) Proc. Natl. Acad. Sci. USA 95(4):1460-1465, based on Tm (melting temperature) data for a collection of 108 oligonucleotide duplexes.

8 8 A representative free energy profile for 1000nt long E. coli promoter sequence
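A free energy profile of this kind can be sketched as a sliding-window sum of nearest-neighbor dinucleotide free energies. The sketch below is an illustration under stated assumptions, not the authors' code: the ΔG values are the unified nearest-neighbor parameters as commonly quoted from SantaLucia (1998) and should be checked against the paper, the 15 nt window is taken from the window size mentioned on the CRP slide, and the function name is hypothetical.

```python
# Nearest-neighbor free energies (kcal/mol at 37 C), ASSUMED from the unified
# parameter set of SantaLucia (1998); verify against the paper before use.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def free_energy_profile(seq: str, window: int = 15) -> list[float]:
    """Sum dinucleotide free energies over each window of `window` nucleotides.

    A window of 15 nt contains 14 dinucleotide steps. Less negative values
    mean lower stability. Characters other than A, C, G, T raise KeyError.
    """
    seq = seq.upper()
    steps = [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]
    n_steps = window - 1
    return [sum(steps[i:i + n_steps]) for i in range(len(steps) - n_steps + 1)]


if __name__ == "__main__":
    at_rich = "TATAATTATAATTATAAT"
    gc_rich = "GCGCCGGCGCCGGCGCCG"
    print(free_energy_profile(at_rich)[0], free_energy_profile(gc_rich)[0])
```

With these parameters an A+T-rich window comes out markedly less stable (less negative) than a G+C-rich one, which is the signal the following slides look for upstream of the TSS.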

9 9 Kanhere and Bansal, Nucleic Acids Res. (2005) 33, 3165-3175
Figure labels: Vertebrates: 252; Plants: 74; E. coli: 227; B. subtilis: 89

10 10 Curved DNA sequences are present in the upstream region
Gene name       Organism                          Distance of the bent site from TSS   Reference
virF            Shigella flexneri                 -137                                 Prosseda et al. (2004)
per-fdx         Clostridium perfringens           -43                                  Kaji et al. (2003)
Streptokinase   Streptococcus equisimilis H46A    -98                                  Malke et al. (2000)
aprE            Bacillus subtilis                 -103                                 Jan et al. (2000)
nifLA           Klebsiella pneumoniae             -95                                  Cheema et al. (1999)
GyrA            Streptococcus pneumoniae          -23                                  Balas et al. (1998)
appY            Escherichia coli                  -350 (from start codon)              Atlung et al. (1996)
rrnB P1         Escherichia coli                  -110                                 Gaal et al. (1994)
ompF            Escherichia coli K-12             -101 to -71                          Huang et al. (1994)

11 11 Roll at junction Roll at every step

12 12 Dinucleotide parameters Bansal M (1996) Biological Structure and Dynamics, Proceedings of the Ninth Conversation (Vol. I) pp 121-134

13 13 A representative intrinsic curvature profile for 1000nt long E. coli promoter sequence

14 14 Kanhere and Bansal, Nucleic Acids Research (2005) 33, 3165-3175

15 15 DNA bendability Protein DNA

16 16 Kanhere and Bansal, Nucleic Acids Res. (2005) 33, 3165-3175

17 17 Distribution of different signals in 272 E. coli promoters
Diagram labels as extracted: 17%, 19%, 2%, 3%, 4%, 24%
10% of the sequences show no signal; 90% show at least one signal.

18 18 Hence:
The upstream and downstream regions, with respect to the TSS, show considerable differences in their properties.
The upstream region is less stable, more rigid and more curved than the downstream region, in prokaryotic and eukaryotic genomes.
The stability signal is much more common than the other two signals.
Some of the promoters which do not show any of the three signals are internal, secondary or weak promoters.

19 19 Can incorporating these features help in improving promoter prediction tools? Since the low stability signature was found to be the most common in promoters, it was examined first. E. coli promoter data were studied in detail, with B. subtilis and M. tuberculosis as further examples.

20 20 Average stability profile for 429 E. coli promoters (from EcoCyc Database V 9.1), located at least 500 nt apart

21 21 Nucleotide composition (as fractions) for three bacterial systems. The difference between Mtb and the others is clearly seen.
                                    E. coli                         B. subtilis                     M. tuberculosis
Region                              A     T     G     C     A+T     A     T     G     C     A+T     A     T     G     C     A+T
Whole genome                        0.25  0.24  0.25  0.25  0.49    0.28  0.28  0.22  0.22  0.56    0.17  0.17  0.33  0.33  0.34
Upstream region (-200 to -100)      0.27  0.27  0.23  0.23  0.54    0.31  0.28  0.21  0.20  0.59    0.19  0.17  0.33  0.32  0.35
Downstream region (100 to 200)      0.25  0.25  0.26  0.24  0.50    0.31  0.26  0.23  0.19  0.57    0.17  0.18  0.34  0.31  0.35
Promoter region (-80 to +20)        0.28  0.29  0.21  0.21  0.58    0.34  0.32  0.18  0.15  0.66    0.20  0.20  0.31  0.29  0.40
The composition was calculated over 101 nt long segments of the promoter sequences (-200 to -100, 100 to 200 and -80 to +20 with respect to the TSS). Requiring the TSSs to be 200 nt apart gave 582 promoter sequences from E. coli, 305 from B. subtilis and 42 from M. tuberculosis.
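The composition figures on this slide are per-base fractions over a fixed segment positioned relative to the TSS. A minimal sketch of that calculation follows; the function names are hypothetical, and the coordinate convention (the TSS counted as position 0, so that -80 to +20 spans 101 nt) is inferred from the segment lengths quoted on the slide.

```python
from collections import Counter


def region(seq: str, tss_index: int, start: int, end: int) -> str:
    """Return the segment from `start` to `end` (inclusive) relative to the TSS.

    ASSUMPTION: the TSS itself is position 0, so -80..+20 gives 101 nt.
    """
    lo, hi = tss_index + start, tss_index + end + 1
    if lo < 0 or hi > len(seq):
        raise ValueError("region extends beyond the sequence")
    return seq[lo:hi]


def base_composition(seq: str) -> dict[str, float]:
    """Fractions of A, T, G, C (ignoring other symbols) plus the A+T fraction."""
    counts = Counter(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T characters in sequence")
    comp = {b: counts[b] / total for b in "ATGC"}
    comp["A+T"] = comp["A"] + comp["T"]
    return comp


if __name__ == "__main__":
    # Toy 1001 nt sequence with the TSS at index 500 (not real genomic data).
    toy = "A" * 500 + "G" + "C" * 500
    print(base_composition(region(toy, 500, -80, 20)))
```

Averaging such per-sequence fractions over all promoters of an organism would give one row of the table.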

22 22 Average stability profile for promoter sequences that are 500 nt apart
A) 429 E. coli promoters (from EcoCyc Database V 9.1)
B) 239 B. subtilis promoters (from DBTBS Database)
C) 40 M. tuberculosis promoters (from MtbRegList Database)
One sharp peak, corresponding to high A+T content, is seen.


24 24 Sensitivity and precision for promoter prediction of experimentally verified bacterial TSSs that are 500 nt apart.
                                                      E. coli    B. subtilis    M. tuberculosis
E-cutoff applied (kcal/mol)                           -18.7      -17.2          -21.0
D-cutoff applied (kcal/mol)                           1.0        1.5            1.0
Total no. of 1001 nt promoter sequences analysed      429        239            40
No. of True Positives                                 366        184            27
No. of False Positives                                272        131            28
No. of False Negatives after I cycle                  58         71             6
No. of False Negatives after II cycle                 6          12             0
Calculated Sensitivity = TP/(TP+FN)                   0.98       0.94           1
Calculated Precision = TP/(TP+FP)                     0.57       0.58           0.49
False negatives after the first cycle are taken for the second cycle of promoter prediction, with an E1 window size of 50 nt. False negatives remaining after the second cycle are considered for the sensitivity calculation. True positives and false positives are added up after the first and second cycle predictions. Definition of TP, FP: V Rangannan and M Bansal, J. Biosci. 32, 851-862 (2007).
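The sensitivity and precision values in this table follow directly from the TP, FP and final FN counts. As a check, the short sketch below recomputes them using the formulas given on the slide; the function and variable names are illustrative only.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Sensitivity = TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp: int, fp: int) -> float:
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# (TP, FP, FN after second cycle) as listed on the slide.
counts = {
    "E. coli": (366, 272, 6),
    "B. subtilis": (184, 131, 12),
    "M. tuberculosis": (27, 28, 0),
}

if __name__ == "__main__":
    for organism, (tp, fp, fn) in counts.items():
        print(organism, round(sensitivity(tp, fn), 2), round(precision(tp, fp), 2))
```

Note that the false negatives used here are those remaining after the second prediction cycle, as the slide specifies.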

25 25 Number of nucleotides between each TSS (#729) and TLS (considering the occurrence of the first gene). Min dist = 0, Max dist = 708
Average stability profile for all 4461 E. coli gene sequences of 1001 nt length (-500 to +500 w.r.t. TLS), aligned w.r.t. their TLS

26 26 E. coli: Average stability profile for 1089 protein promoter sequences and 59 RNA promoter sequences
E. coli: Average stability profile for 34 tRNA promoter sequences and 13 other RNA promoter sequences

27 27 Whole genome annotation for promoter regions in E. coli and B. subtilis
E. coli                           Forward strand      Reverse strand      Total
                                  Protein   RNA       Protein   RNA
No. of TSSs                       507       34        582       25        1145 (a)
No. of genes                      2089      109       2185      73        4456
No. of predictions                4369 (strand)       4354 (strand)       8723
TP w.r.t. gene TLS (b)            1329      75        1596      48        3048 (68%)
FP w.r.t. gene TLS (b)            1004      9         795       4         1812
TP w.r.t. TSS (c)                 394       23        429       18        864 (75%)
B. subtilis                       Forward strand      Reverse strand      Total
                                  Protein   RNA       Protein   RNA
No. of TSSs                       305       5         302       2         613 (a)
No. of genes                      1942      85        2164      34        4225
No. of predictions                2692 (strand)       2720 (strand)       5412
TP w.r.t. gene TLS (b)            866       30        1142      9         2038 (48%)
FP w.r.t. gene TLS (b)            522       0         425       0         947
TP w.r.t. TSS (c)                 167       4         189       2         362 (59%)
(a) 3 TSSs of E. coli and 1 TSS of B. subtilis regulate protein as well as RNA genes.
(b) True and false positives are identified against the genes in the forward and reverse strands.
(c) A true positive is calculated with respect to the annotated TSS (prediction located in the -150 to +50 nt region w.r.t. TSS).
63% and 68% accuracy (precision) is achieved for E. coli and B. subtilis respectively w.r.t. TLS.
75% and 59% reliability is achieved for E. coli and B. subtilis respectively w.r.t. annotated TSS (against 37% in the case of SIDD for 927 TSSs in E. coli).

28 28 Whole genome annotation of promoter regions over the M. tuberculosis genome
M. tuberculosis                   Forward strand      Reverse strand      Total
                                  Protein   RNA       Protein   RNA
No. of genes                      2010      27        1989      23        4049
No. of predictions                3153 (strand)       3163 (strand)       6316
TP w.r.t. gene TLS                692       14        938       6         1650 (41%)
FP w.r.t. gene                    1032      6         802       3         1843

29 29 All false positives need not be REAL false positives
In prokaryotic genomes, the intergenic region is very small (~12%). Experimental evidence shows that for some genes the regulating transcription start site lies within the coding region of an upstream neighboring gene. For example, the E. coli rpoS gene has its transcribing TSS (rpoSp) within the coding region of the nlpD gene, 567 nt away from its own TLS. Lange R, Fischer D and Hengge-Aronis R., J Bacteriol. (1995); 177(16):4676-80

30 30 Distribution of coding and intergenic regions in the bacterial genomes
Histograms showing the distribution of predicted promoter regions in different genomic regions in the E. coli, B. subtilis and M. tuberculosis genomes. Color coding for intergenic and coding regions is shown on the top right.

31 31 Predicted promoter region distribution in E. coli genome (over ALL 1145 Ecocyc annotated, 1001 nt long promoter sequences).

32 32 Comparison of our method of promoter prediction with NNPP, w.r.t TLSS at position 0

33 33 Average energy profile for E. coli genomic fragment 9000 bp to 15300 bp

34 34 Average energy profile for E. coli genomic fragment 3483400 bp to 3487000 bp (DIV intergenic region)

35 35 Average energy profile for E. coli genomic fragment 2863000 bp to 2867600 bp (CON intergenic region)

36 36 Conclusions
Relative stability of DNA in neighboring regions can help in annotating promoter regions in whole genomes.
The method is quite general and is shown to work for genomes with varying AT/GC content.
The stability criterion performs better than other commonly used methods based on sequence motif search, as well as the superhelix induced destabilization in DNA (SIDD) method.

37 37 No. of promoter sequences grouped according to their %GC content in the three bacterial systems
%GC        No. of sequences analyzed *
           E. coli    B. sub    Mtb
30-35      -          6         -
35-40      16         61        -
40-45      47         168       -
45-50      183        47        -
50-55      193        -         -
55-60      18         -         -
60-65      -          -         25
65-70      -          -         15
Total      457        282       40
* TSSs which are 500 nt apart are considered in E. coli, B. subtilis and M. tuberculosis.
GC categorization is based on the %GC over 1001 nt long promoter sequences (ranging from -500 to +500 w.r.t. TSS).

38 38 Average free energy distribution over promoter sequences with diverse GC composition
(A) -500 to +500 region with respect to TSS
(B) -80 to +20 region with respect to TSS
The average free energies over promoter regions with similar GC composition are approximately the same, with E. coli and B. subtilis nearly overlapping for the %GC intervals 35-40%, 40-45% and 45-50% in the case of 1001 nt long promoter regions.

39 39 Thresholds of free energy values used to predict promoters in genomic DNA with varying GC content
E is the average free energy over the -80 to +20 region of known promoters, and D is the difference between E and REav, the average free energy over random sequences generated from the downstream (+100 to +500 region) genomic sequence.
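One plausible way to apply the two thresholds is to slide a window along a per-position free energy profile and flag windows whose average energy E and difference D = E - REav both reach their cutoffs. The sketch below is only an illustration of that idea: the function name, the 100-position window, the use of ">=" and the treatment of REav as a single supplied number are all assumptions, and the published procedure (Rangannan and Bansal, 2007) should be consulted for the actual method.

```python
from statistics import mean


def scan_for_promoter_like_windows(
    profile: list[float],
    re_av: float,
    e_cutoff: float,
    d_cutoff: float,
    window: int = 100,  # hypothetical window length, in profile positions
) -> list[int]:
    """Return start indices of windows that pass both cutoffs.

    profile  : per-position free energies (kcal/mol), e.g. from a sliding
               dinucleotide calculation; less negative = less stable.
    re_av    : reference average free energy (REav), supplied by the caller.
    A window passes if E >= e_cutoff and (E - re_av) >= d_cutoff (ASSUMED rule).
    """
    hits = []
    for start in range(len(profile) - window + 1):
        e = mean(profile[start:start + window])
        if e >= e_cutoff and (e - re_av) >= d_cutoff:
            hits.append(start)
    return hits


if __name__ == "__main__":
    # Toy profile: a 100-position unstable stretch flanked by stable DNA.
    toy = [-20.0] * 100 + [-18.0] * 100 + [-20.0] * 100
    hits = scan_for_promoter_like_windows(toy, re_av=-20.0, e_cutoff=-18.7, d_cutoff=1.0)
    print(hits[0], hits[-1])
```

Using two conditions rather than one means a window must be both unstable in absolute terms and unstable relative to its genomic background, which is what lets the same scheme be applied to genomes with different GC content.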


42 42 Stability characteristics of TF binding site (e.g. CRP)
The region of high stability corresponds to a binding site for CRP in E. coli. The high-stability trough extends for ~22 nucleotides (window size = 15 nt), which is the same as the footprint size of the protein reported in the literature.

43 43 E. coli CRP binding site consensus sequence for 209 sites

44 44 CRP: Average stability profile

45 45 CRP: Average stability profile for manipulated sequences
NNNNNNNNNNNNNTGTGANNNNNNACACANNNNNNNNNNNNN
Labels: 5’ flanking region, 6-nt linker, 3’ flanking region

46 46 CRP: Average bendability profile TGTGANNNNNNACACA

47 47 Thank You Acknowledgements: Dr Dhananjay Bhattacharyya Dr Aditi Kanhere Ms Vetriselvi R Mr Vikas Sarma Mr Nishad Matange Financial Support: Dept of Biotechnology, India


49 49 Coding and intergenic region distribution in the E. coli and B. subtilis genomes. Histograms show the distribution of predicted promoter regions in different intergenic regions in the E. coli and B. subtilis genomes (as per the color coding in the legend).

50 50 NarL: Binding site Consensus sequence

51 51 NarL: Average stability profile

52 52 NarL: Average bendability profile

53 Definition of thresholds of free energy values used to predict promoters in bacterial genome sequences. G specifies the average free energy over the entire genome. E is the average free energy over known promoter regions. All energy values are in kcal/mol and the standard deviation values are also indicated. E-cutoff and D-cutoff are the thresholds used to predict promoter regions.
                                          E. coli                B. subtilis            M. tuberculosis
Average free energy G calculated over whole genome sequence
Mean G                                    -20.10                 -18.88                 -22.49
Standard Deviation (σ)                    0.13                   0.06                   0.15
G Eav (Mean+3σ)                           -19.70                 -18.72                 -22.04
Average free energy E calculated over upstream region of TSS
Upstream region considered w.r.t. TSS     -80 to +20 (101 nt)    -80 to +20 (101 nt)    -40 to +20 (61 nt)
Mean E                                    -18.70                 -17.20                 -21.02
Standard Deviation (σ)                    0                      0                      0
E-cutoff (Mean+3σ)                        -18.70                 -17.20                 -21
D-cutoff (E-cutoff - G Eav)               1.0                    1.5                    1.0
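The threshold definitions in this table can be checked with a few lines of arithmetic. The sketch below applies the formulas as stated (G Eav = mean G + 3σ, E-cutoff = mean E + 3σ, D-cutoff = E-cutoff - G Eav) to the tabulated means and standard deviations; the function name is hypothetical. Recomputing from the rounded inputs gives G Eav values that differ from the slide in the second decimal place for E. coli and B. subtilis (-19.71 and -18.70 versus -19.70 and -18.72), but the D-cutoffs agree after rounding to one decimal.

```python
def derive_cutoffs(mean_g: float, sd_g: float, mean_e: float, sd_e: float):
    """Return (G_Eav, E_cutoff, D_cutoff) using the formulas stated on the slide."""
    g_eav = mean_g + 3 * sd_g        # G Eav = mean G + 3 sigma
    e_cutoff = mean_e + 3 * sd_e     # E-cutoff = mean E + 3 sigma
    d_cutoff = e_cutoff - g_eav      # D-cutoff = E-cutoff - G Eav
    return g_eav, e_cutoff, d_cutoff


# (mean G, sigma G, mean E, sigma E) in kcal/mol, copied from the table.
table = {
    "E. coli": (-20.10, 0.13, -18.70, 0.0),
    "B. subtilis": (-18.88, 0.06, -17.20, 0.0),
    "M. tuberculosis": (-22.49, 0.15, -21.02, 0.0),
}

if __name__ == "__main__":
    for organism, values in table.items():
        g_eav, e_cutoff, d_cutoff = derive_cutoffs(*values)
        print(organism, round(g_eav, 2), round(e_cutoff, 2), round(d_cutoff, 1))
```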


55 55 Variation in base composition and average free energy (AFE) in different regions of bacterial genomes. Promoter sequences of 491, 283 and 40 TSSs which are 500 nt apart are considered from E. coli, B. subtilis and M. tuberculosis respectively. Sequences are aligned with respect to the TSS. Standard deviation from the respective mean is given in brackets.
Region w.r.t. TSS (length)                   E. coli                     B. subtilis                 M. tuberculosis
                                             AFE           G+C           AFE           G+C           AFE           G+C
Upstream region -500 to -100 (401 nt)        -19.9 (1.0)   0.49 (0.06)   -18.8 (0.8)   0.43 (0.05)   -22.4 (0.7)   0.65 (0.03)
  shuffled sequence                          -19.6 (1.0)                 -19.6 (0.8)                 -22.1 (0.6)
Downstream region 100 to 500 (401 nt)        -20.1 (0.7)   0.49 (0.04)   -19.0 (0.7)   0.44 (0.04)   -22.5 (0.5)   0.66 (0.03)
  shuffled sequence                          -19.9 (0.7)                 -18.7 (0.7)                 -22.3 (0.5)
Promoter region -80 to +20 (101 nt)          -18.6 (1.3)   0.42 (0.08)   -17.1 (1.0)   0.33 (0.06)   -21.4 (1.0)   0.61 (0.05)
  shuffled sequence                          -18.5 (1.2)                 -17.0 (0.9)                 -21.4 (0.9)
Longer region -500 to +500 (1001 nt)         -19.8 (0.7)   0.49 (0.04)   -18.6 (0.5)   0.42 (0.03)   -22.3 (0.4)   0.65 (0.02)
  shuffled sequence                          -19.5 (0.6)                 -18.4 (0.5)                 -22.1 (0.33)
Whole genome                                 -20.1 (2.4)   0.51          -18.9 (2.3)   0.44          -22.5 (2.1)   0.66

56 Average stability profile for promoter sequences from three different organisms
491 E. coli promoters from EcoCyc Database version 11.0
239 B. subtilis promoters from DBTBS Database
40 M. tuberculosis promoters from MtbRegList Database

