10/29/07 BCB 444/544 F07 ISU Dobbs #28 - Promoter Prediction
BCB 444/544 Lecture 28: Gene Prediction (finish it), Promoter Prediction (#28_Oct29)


3 Required Reading (before lecture)
  Mon Oct 29 - Lecture 28: Promoter & Regulatory Element Prediction, Chp 9 - pp 113 - 126
  Wed Oct 31 - Lecture 29: Phylogenetics Basics, Chp 10 - pp 127 - 141
  Thurs Nov 1 - Lab 9: Gene & Regulatory Element Prediction
  Fri Nov 2 - Lecture 30: Phylogenetic Tree Construction Methods & Programs, Chp 11 - pp 142 - 169

4 Assignments & Announcements
  Mon Oct 29 - HW#5 will be posted today
  HW#5 = Hands-on exercises with phylogenetics and tree-building software
  Due: Mon Nov 5 (not Fri Nov 1 as previously posted)

5 BCB 544 "Team" Projects
  Last week of classes will be devoted to Projects
  Written reports due: Mon Dec 3 (no class that day)
  Oral presentations (20-30') will be: Wed-Fri Dec 5,6,7
    1 or 2 teams will present during each class period
  → See Guidelines for Projects posted online

6 BCB 544 Only: New Homework Assignment
  544 Extra#2
  Due: √ PART 1 - ASAP
       PART 2 - meeting prior to 5 PM Fri Nov 2
  Part 1 - Brief outline of Project, email to Drena & Michael; after response/approval, then:
  Part 2 - More detailed outline of project
    Read a few papers and summarize status of problem
    Schedule meeting with Drena & Michael to discuss ideas

7 Seminars this Week
  BCB List of URLs for Seminars related to Bioinformatics: http://www.bcb.iastate.edu/seminars/index.html
  Nov 1 Thurs - BBMB Seminar 4:10 in 1414 MBB
    Todd Yeates, UCLA: TBA - something cool about structure and evolution?
  Nov 2 Fri - BCB Faculty Seminar 2:10 in 102 ScI
    Bob Jernigan, BBMB, ISU: Control of Protein Motions by Structure

8 Chp 8 - Gene Prediction
  SECTION III GENE AND PROMOTER PREDICTION
  Xiong: Chp 8 Gene Prediction
    Categories of Gene Prediction Programs
    Gene Prediction in Prokaryotes
    Gene Prediction in Eukaryotes

9 Computational Gene Prediction: Approaches
  Ab initio methods
    Search by signal: find DNA sequences involved in gene expression
    Search by content: test statistical properties distinguishing coding from non-coding DNA
  Similarity-based methods
    Database search: exploit similarity to proteins, ESTs, cDNAs
    Comparative genomics: exploit aligned genomes
      Do other organisms have similar sequence?
  Hybrid methods - best

10 Computational Gene Prediction: Algorithms
  1. Neural Networks (NNs) (more on these later…)
     e.g., GRAIL
  2. Linear discriminant analysis (LDA) (see text)
     e.g., FGENES, MZEF
  3. Markov Models (MMs) & Hidden Markov Models (HMMs)
     e.g., GeneSeqer - uses MMs
           GENSCAN - uses 5th order HMMs (see text)
           HMMgene - uses conditional maximum likelihood (see text)

11 Signals Search
  Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA.
  Detected instances provide evidence for genes

12 Content Search
  Observation: Encoding a protein affects statistical properties of DNA sequence:
    Nucleotide/amino acid distribution
    GC content (CpG islands, exon/intron)
    Uneven usage of synonymous codons (codon bias)
    Hexamer frequency - most discriminative of these for identifying coding potential
  Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions
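
As a rough illustration of two of the content statistics named on this slide, the sketch below computes GC content and overlapping hexamer frequencies in Python. The function names and the toy sequence are hypothetical; real gene finders compare such statistics against tables trained on known coding and non-coding DNA, which this sketch does not include.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def kmer_frequencies(seq, k=6):
    """Relative frequency of every overlapping k-mer (k=6 gives hexamers)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


# Hypothetical toy sequence, for illustration only.
toy = "ATGGCGGCGATGGCGTAA"
print(round(gc_content(toy), 3))
print(kmer_frequencies(toy, k=6))
```

A coding-potential score would then compare hexamer frequencies observed in a window with frequencies from coding and non-coding training sets.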

13 Human Codon Usage [figure]

14 Predicting Genes based on Codon Usage Differences
  Algorithm:
    Process sliding window
    Use codon frequencies to compute probability of coding versus non-coding
    Plot log-likelihood ratio
  [Figure: Coding Profile of β-globin gene, with exons marked]
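
The sliding-window algorithm on this slide can be sketched as follows. This is a minimal version under stated assumptions: a single reading frame (frame 0), a window length that is a multiple of 3, and two codon frequency tables supplied by the caller. The frequency values in the usage lines are invented for illustration and are not real human codon usage.

```python
import math


def window_scores(seq, coding_freq, noncoding_freq, window=30, step=3, floor=1e-6):
    """Log-likelihood ratio (coding vs non-coding) for each window, frame 0.

    coding_freq / noncoding_freq map codons to relative frequencies;
    codons missing from a table get the small `floor` value instead.
    """
    seq = seq.upper()
    scores = []
    for start in range(0, len(seq) - window + 1, step):
        llr = 0.0
        # Step through the complete codons inside this window.
        for i in range(start, start + window - 2, 3):
            codon = seq[i:i + 3]
            p_coding = coding_freq.get(codon, floor)
            p_noncoding = noncoding_freq.get(codon, floor)
            llr += math.log(p_coding / p_noncoding)
        scores.append((start, llr))
    return scores


# Hypothetical frequency tables (NOT real codon usage data).
coding = {"GCC": 0.04, "AAA": 0.01}
noncoding = {"GCC": 0.01, "AAA": 0.04}
for start, llr in window_scores("GCC" * 10 + "AAA" * 10, coding, noncoding, step=30):
    print(start, round(llr, 2))
```

Windows with a positive score look more like the coding table, and plotting the score against position gives a profile of the kind shown for the β-globin gene.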

15 Similarity-Based Methods: Database Search
  In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
  Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
  Problems:
    Will not find "new" or RNA genes (non-coding genes)
    Limits of similarity are hard to define
    Small exons might be overlooked
  [Figure: double-stranded DNA]
    ATTGCGTAGGGCGCT
    TAACGCATCCCGCGA
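
The "translate DNA into all 6 reading frames" step can be sketched in a few lines of Python. The sketch assumes the standard genetic code and uses hypothetical helper names; programs such as BLASTX do this internally, and nothing here reflects their actual implementation. The example input is the top strand shown on the slide.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse complement of a DNA string containing A, C, G, T."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frames("ATTGCGTAGGGCGCT").items():
    print(name, protein)
```

Each of the six translations would then be compared against a protein database; "*" marks a stop codon.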

16 Similarity-Based Methods: Comparative Genomics
  Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
  Advantages: May find uncharacterized or RNA genes
  Problems:
    Finding suitable evolutionary distance
    Finding limits of high similarity (functional regions)
  [Figure: human-mouse alignment]
    human GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
    mouse C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
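
To make "high similarity in alignment" concrete, the sketch below scores the human-mouse alignment from this slide by overall identity and by identity in a sliding window. The function names and the window length of 10 are illustrative choices, not part of any published method.

```python
def alignment_identity(a, b, gap="-"):
    """Return (matches, columns, fraction) for two equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    if not a:
        return 0, 0, 0.0
    matches = sum(x == y and x != gap for x, y in zip(a, b))
    return matches, len(a), matches / len(a)


def windowed_identity(a, b, window=10):
    """Identity fraction for every window of `window` alignment columns."""
    return [alignment_identity(a[i:i + window], b[i:i + window])[2]
            for i in range(len(a) - window + 1)]


human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"

matches, columns, fraction = alignment_identity(human, mouse)
print(matches, columns, round(fraction, 3))
print(max(windowed_identity(human, mouse)))
```

Runs of high-identity windows are the candidate functional regions; deciding where such a run starts and stops is the "limits of high similarity" problem the slide mentions.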

17 Human-Mouse Homology
  Comparison of 1196 orthologous genes
  Sequence identity between genes in human vs mouse:
    Exons:    84.6%
    Protein:  85.4%
    Introns:  35%
    5' UTRs:  67%
    3' UTRs:  69%

18 Thanks to Volker Brendel, ISU for the following Figs & Slides
  Slightly modified from: BBSI Genome Informatics Module
  http://www.bioinformatics.iastate.edu/BBSI/course_desc_2005.html#moduleB
  V Brendel vbrendel@iastate.edu
  Brendel et al (2004) Bioinformatics 20: 1157

19 Spliced Alignment Algorithm
  GeneSeqer - Brendel et al. - ISU http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
  Perform pairwise alignment with large gaps in one sequence (due to introns)
    Align genomic DNA with cDNA, ESTs, protein sequences
  Score semi-conserved sequences at splice junctions
    Using Bayesian probability model & 1st order MM
  Score coding constraints in translated exons
    Using Bayesian model
  [Figure: intron bounded by splice sites, GT at the donor site and AG at the acceptor site]
  Brendel et al (2004) Bioinformatics 20: 1157 http://bioinformatics.oxfordjournals.org/cgi/content/abstract/20/7/1157
  Brendel 2005

20 Splice Site Detection
  Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
  Information Content I_i: [formula shown as an image on the slide]
  Extent of Splice Signal Window: [formula shown as an image on the slide]
    i: ith position in sequence
    Ī: avg information content over all positions >20 nt from splice site
    σ_Ī: avg sample standard deviation of Ī
  Brendel 2005
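
The slide's own formula for I_i did not survive in the transcript. As a stand-in, the sketch below uses the standard Shannon-based definition for DNA, I_i = 2 + Σ_B f_iB log2 f_iB, where f_iB is the frequency of base B at position i. That choice is an assumption and may differ in detail from the formula on the slide; the function name and the toy donor sites are hypothetical.

```python
import math
from collections import Counter


def information_content(aligned_sites):
    """Per-position information content in bits for equal-length DNA sites.

    Assumes the standard definition I_i = 2 - H_i, where H_i is the Shannon
    entropy of the base frequencies observed at position i.
    """
    n = len(aligned_sites)
    length = len(aligned_sites[0])
    result = []
    for i in range(length):
        counts = Counter(site[i].upper() for site in aligned_sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        result.append(2.0 - entropy)
    return result


# Hypothetical aligned donor-site fragments; the invariant GT scores 2 bits.
sites = ["AGGTAA", "CGGTGA", "AGGTAT", "TGGTCG"]
print([round(x, 2) for x in information_content(sites)])
```

Positions whose information content stands clearly above the background average Ī would be counted as part of the splice signal window.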

21 Information Content vs Position
  [Figure panels: Human T2_GT, Human T2_AG]
  Which sequences are exons & which are introns? How can you tell?
  Brendel 2005

22 Markov Model for Spliced Alignment
  [Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition labels as extracted: P_G, P_A(n), 1-P_A(n), P_ G, (1-P_ G)P_D(n+1), (1-P_ G)(1-P_D(n+1)); one symbol before G was lost in extraction]
  Brendel 2005

23 Evaluation of Splice Site Prediction
  TP = positive instance correctly predicted as positive
  FP = negative instance incorrectly predicted as positive
  TN = negative instance correctly predicted as negative
  FN = positive instance incorrectly predicted as negative
  Fig 5.11 Baxevanis & Ouellette 2005

24 Evaluation of Predictions
                       Actual True    Actual False
    Predicted True         TP             FP          PP = TP + FP
    Predicted False        FN             TN          PN = FN + TN
                       AP = TP + FN   AN = FP + TN
  Sensitivity (= Coverage = Recall), Specificity, Normalized specificity, Misclassification rates: [formulas shown as images on the slide]
  [Figure: Predicted Positives split into True Positives and False Positives]
  Do not memorize this!

25 Evaluation of Predictions - in English
                       Actual True    Actual False
    Predicted True         TP             FP          PP = TP + FP
    Predicted False        FN             TN          PN = FN + TN
                       AP = TP + FN   AN = FP + TN
  Sensitivity = TP / AP = TP / (TP + FN) = Coverage = Recall
    In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
  Specificity = TP / PP = TP / (TP + FP)
    In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
  IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
  IMPORTANT: Sensitivity alone does not tell us much about performance because a 100% sensitivity can be achieved trivially by labeling all test cases positive!
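
The two measures defined in English above translate directly into code. The sketch below follows this slide's definitions (so "specificity" here is what medical usage calls positive predictive value); the function names and the counts in the example are hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted positives that are real: TP / (TP + FP).

    This is the slide's definition, i.e. positive predictive value.
    """
    return tp / (tp + fp)


# Hypothetical test set: 10 real splice sites among 100 candidates.
# Labeling every candidate positive finds all 10 but adds 90 false positives.
print(sensitivity(tp=10, fn=0))   # 1.0
print(specificity(tp=10, fp=90))  # 0.1
```

The example reproduces the warning on the slide: perfect sensitivity is trivial to reach, so it has to be reported together with specificity.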

26 Best Measures for Comparison?
  ROC curves (Receiver Operating Characteristic (?!!)) http://en.wikipedia.org/wiki/Roc_curve
    In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. The ROC can also be represented equivalently by plotting fraction of true positives (TPR = true positive rate) vs fraction of false positives (FPR = false positive rate).
    Note: "specificity" in this definition is the medical one, TN / (TN + FP), not the TP / (TP + FP) used in the evaluation slides of this lecture.
  Correlation Coefficient
    Matthews correlation coefficient (MCC)
    MCC =  1 for a perfect prediction
           0 for a completely random assignment
          -1 for a "perfectly incorrect" prediction
  Do not memorize this!
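
The slide gives only the limiting values of the MCC. A minimal sketch using the standard formula, MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), is shown below. Returning 0 when the denominator is zero is a common convention that is assumed here, not something stated on the slide.

```python
import math


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


print(matthews_cc(tp=5, tn=5, fp=0, fn=0))  # perfect prediction
print(matthews_cc(tp=5, tn=5, fp=5, fn=5))  # no better than chance
print(matthews_cc(tp=0, tn=0, fp=5, fn=5))  # every call is wrong
```

Unlike sensitivity or specificity alone, the MCC uses all four counts, which is why it is a fairer single number for comparing programs.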

27 GeneSeqer: Input [figure] http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
  Brendel 2005

28 GeneSeqer: Output [figure]
  Brendel 2005

29 GeneSeqer: Gene Evidence Summary [figure]
  Brendel 2005

30 Gene Prediction - Problems & Status?
  Common errors?
    False positive intergenic regions: 2 annotated genes actually correspond to a single gene
    False negative intergenic region: One annotated gene structure actually contains 2 genes
    False negative gene prediction: Missing gene (no annotation)
    Other:
      Partially incorrect gene annotation
      Missing annotation of alternative transcripts
  Current status?
    For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
      Limitation? Training data: predictions are organism specific
    Combined ab initio/homology based predictions: Improved accuracy
      Limitation? Availability of identifiable sequence homologs in databases

31 Recommended Gene Prediction Software
  Ab initio
    GENSCAN: http://genes.mit.edu/GENSCAN.html
    GeneMark.hmm: http://exon.gatech.edu/GeneMark/
    others: GRAIL, FGENES, MZEF, HMMgene
  Similarity-based
    BLAST, GenomeScan, EST2Genome, Twinscan
  Combined:
    GeneSeqer, http://deepc2.psi.iastate.edu/cgi-bin/gs.cgi
    ROSETTA
  ⇒ Consensus: because results depend on organisms & specific task, always use more than one program!
    Two servers that report consensus predictions:
      GeneComber
      DIGIT

32 Other Gene Prediction Resources: at ISU
  http://www.bioinformatics.iastate.edu/bioinformatics2go/

33 Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
  Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
  Chapter 4 Finding Genes
    4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
    4.2 Using MZEF To Find Internal Coding Exons
    4.3 Using GENEID to Identify Genes
    4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
    4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
    4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
    4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
    4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
    4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
    4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences
  Lists of Gene Prediction Software
    http://www.bioinformaticsonline.org/links/ch_09_t_1.html
    http://cmgm.stanford.edu/classes/genefind/

34 Chp 9 - Promoter & Regulatory Element Prediction
  SECTION III GENE AND PROMOTER PREDICTION
  Xiong: Chp 9 Promoter & Regulatory Element Prediction
    Promoter & Regulatory Elements in Prokaryotes
    Promoter & Regulatory Elements in Eukaryotes
    Prediction Algorithms

35 Eukaryotes vs Prokaryotes: Genomes
  Eukaryotic genomes
    Are packaged in chromatin & sequestered in a nucleus
    Are larger and have multiple linear chromosomes
    Contain mostly non-protein coding DNA (98-99%)
  Prokaryotic genomes
    DNA is associated with a nucleoid, but no nucleus
    Much smaller, usually single, circular chromosome
    Contain mostly protein encoding DNA

36 Eukaryotes vs Prokaryotes: Gene Structure [figure]

37 Eukaryotes vs Prokaryotes: Genes
  Eukaryotic genes
    Are larger and more complex than in prokaryotes
    Contain introns that are "spliced" out to generate mature mRNAs*
    Often undergo alternative splicing, giving rise to multiple RNAs*
    Are transcribed by 3 different RNA polymerases (instead of 1, as in prokaryotes)
  * In biology, statements such as this include an implicit "usually" or "often"

38 Eukaryotes vs Prokaryotes: Levels of Gene Regulation
  Primary level of control?
    Prokaryotes: Transcription initiation
    Eukaryotes: Transcription is also very important, but expression is regulated at multiple levels, many of which are post-transcriptional:
      RNA processing, transport, stability
      Translation initiation
      Protein processing, transport, stability
      Post-translational modification (PTM)
      Subcellular localization
  Recent important discoveries: small regulatory RNAs (miRNA, siRNA) are abundant and play very important roles in controlling gene expression in eukaryotes, often at post-transcriptional levels

39 Eukaryotes vs Prokaryotes: Regulatory Elements
  Prokaryotes:
    Promoters & operators (for operons) - cis-acting DNA signals
    Activators & repressors - trans-acting proteins (we won't discuss these…)
  Eukaryotes:
    Promoters & enhancers (for single genes) - cis-acting
    Transcription factors - trans-acting
  Important difference? What the RNA polymerase actually binds

40 Prokaryotic Promoters
  RNA polymerase complex recognizes promoter sequences located very close to and on 5' side ("upstream") of transcription initiation site
  Prokaryotic RNA polymerase complex binds directly to promoter, by virtue of its sigma subunit - no requirement for "transcription factors" binding first
  Prokaryotic promoter sequences are highly conserved:
    -10 region
    -35 region
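
Because the -35 and -10 regions are highly conserved, a naive scan for the two boxes already illustrates the idea. The slide does not list the consensus sequences; the sketch below assumes the textbook E. coli sigma-70 values (TTGACA at -35, TATAAT at -10, 15-19 bp apart). The function names and the mismatch threshold are hypothetical, and real tools such as BPROM use weighted models rather than exact consensus matching.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacer=(15, 19), max_mismatch=1):
    """Return (pos35, pos10, total_mismatches) for candidate promoters.

    Consensus boxes and spacer range are assumed sigma-70 textbook values.
    Positions are 0-based starts of the -35 and -10 boxes.
    """
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(box35) + 1):
        m35 = mismatches(seq[i:i + len(box35)], box35)
        if m35 > max_mismatch:
            continue
        for gap in range(spacer[0], spacer[1] + 1):
            j = i + len(box35) + gap
            candidate = seq[j:j + len(box10)]
            if len(candidate) < len(box10):
                break  # ran off the end of the sequence
            m10 = mismatches(candidate, box10)
            if m10 <= max_mismatch:
                hits.append((i, j, m35 + m10))
    return hits


# Hypothetical test sequence: both boxes separated by a 17 bp spacer.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(find_promoters(toy))
```

Loosening `max_mismatch` quickly increases the number of hits, which previews the false-positive problem discussed later for pattern-driven methods.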

41 Eukaryotic Promoters
  Eukaryotic RNA polymerase complexes do not bind directly to promoter sequences
  Transcription factors must bind first and serve as landmarks recognized by RNA polymerase complexes
  Eukaryotic promoter sequences are less highly conserved, but many promoters (for RNA polymerase II) contain:
    -30 region "TATA" box
    -100 region "CCAAT" box

42 Eukaryotic Promoters vs Enhancers
  Both promoters & enhancers are binding sites for transcription factors (TFs)
  Promoters
    essential for initiation of transcription
    located "relatively" close to start site (usually <200 bp upstream, but can be located within gene, rather than upstream!)
  Enhancers
    needed for regulated transcription (differential expression in specific cell types, developmental stages, in response to environment, etc.)
    can be very far from start site (sometimes > 100 kb)

43 Eukaryotic genes are transcribed by 3 different RNA polymerases
  (Location of promoter regions, TFBSs & TFs differ, too)
  [Figure panels: mRNA; rRNA; tRNA, 5S RNA]
  Brown Fig 9.18, BIOS Scientific Publishers Ltd, 1999

44 Prokaryotic Genes & Operons
  Genes with related functions are often clustered within operons (e.g., lac operon)
  Operons = genes with related functions that are transcribed and regulated as a single unit; one promoter controls expression of several proteins
  mRNAs produced from operons are "polycistronic" - a single mRNA encodes several proteins; i.e., there are multiple ORFs, each with its own AUG (START) & STOP codons, linked within one mRNA molecule

45 Promoter of lac operon in E. coli (Transcribed by prokaryotic RNA polymerase) [figure]
  Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999

46 Eukaryotic genes
  Genes with related functions are occasionally, but not usually, clustered; instead, they share common regulatory regions (promoters, enhancers, etc.)
  Chromatin structure must also be "active" for transcription to occur

47 Eukaryotic genes have large & complex regulatory regions
  Cis-acting regulatory elements include: Promoters, enhancers, silencers
  Trans-acting regulatory factors include: Transcription factors (TFs), chromatin remodeling complexes, small RNAs
  Brown Fig 9.17, BIOS Scientific Publishers Ltd, 1999

48 Eukaryotic Promoters: DNA sequences required for initiation, usually <200 bp from start site
  Eukaryotic RNA polymerases bind by recognizing a complex of TFs bound at promoter
  First, TFs must bind short motifs (TFBSs) within promoters; then RNA polymerase can bind and initiate transcription of RNA
  [Figure labels: ~250 bp; Pre-mRNA]

49 Eukaryotic promoters & enhancer regions often contain many different TFBS motifs [figure]
  Fig 9.13 Mount 2004

50 Simplified View of Promoters in Eukaryotes [figure]
  Fig 5.12 Baxevanis & Ouellette 2005

51 Eukaryotic Activators vs Repressors
  Regions far from the promoter can act as "enhancers" or "repressors" of transcription by serving as binding sites for activator or repressor proteins (TFs)
  Activator proteins (TFs) bind to enhancers & interact with RNAP to stimulate transcription
  Repressors block the action of activators
  [Figure labels: enhancer and repressor sites 100 - 50,000 bp from the promoter; enhancer proteins interact with RNAP to drive transcription of the gene; repressor prevents binding of activator]

52 Eukaryotic Transcription Factors (TFs)
  Transcription factors = proteins that interact with the RNA polymerase complex to activate or repress transcription
  TFs often contain both:
    a trans-activating domain
    a DNA binding domain or motif (here motif = amino acid sequence in protein)
  TFs recognize and bind specific short DNA sequence motifs called "transcription factor binding sites" (TFBSs) (here motif = nucleotide sequence in DNA)
  Databases for TFs & TFBSs include:
    TRANSFAC, http://www.generegulation.com/cgibin/pub/databases/transfac
    JASPAR

53 Zinc Finger Proteins - Transcription Factors
  Common in eukaryotic proteins
    ~ 1% of mammalian genes encode zinc-finger proteins (ZFPs)
    In C. elegans, there are > 500!
  Can be used as highly specific DNA binding modules
  Potentially valuable tools for directed genome modification (esp. in plants) & human gene therapy - one clinical trial will begin soon!
  Did you go to Dave Segal's seminar? Your TAs Pete & Jeff work on designing better ZFPs!
  Brown Fig 9.12, BIOS Scientific Publishers Ltd, 1999

54 Promoter Prediction Algorithms & Software
  Xiong -

55 Eukaryotes vs Prokaryotes: Promoter Prediction
  Promoter prediction is much easier in prokaryotes
  Why?
    Highly conserved
    Simpler gene structures
    More sequenced genomes! (for comparative approaches)
  Methods?
    Previously: mostly HMM-based
    Now: similarity-based comparative methods, because so many genomes available
  Xiong textbook:
    1) "Manual method" = rules of Wang et al (see text)
    2) BPROM - uses linear discriminant function

57 Predicting Promoters in Eukaryotes
  → Closely related to gene prediction!
  Obtain genomic sequence
  Use sequence-similarity based comparison (BLAST, MSA) to find related genes
    But: "regulatory" regions are much less well-conserved than coding regions
  Locate ORFs
  Identify Transcription Start Site (TSS) (if possible!)
  Use Promoter Prediction Programs
  Analyze motifs, etc. in DNA sequence (TRANSFAC, JASPAR)

58 Predicting promoters: Steps & Strategies
  Identify TSS - if possible?
    One of biggest problems is determining exact TSS! Not very many full-length cDNAs!
    Good starting point? (human & vertebrate genes)
      Use FirstEF, found within UCSC Genome Browser, or submit to FirstEF web server
  Fig 5.10 Baxevanis & Ouellette 2005

59 Automated Promoter Prediction Strategies
  1) Pattern-driven algorithms (ab initio)
  2) Sequence-driven algorithms (homology based)
  3) Combined "evidence-based"
  BEST RESULTS? Combined, sequential

60 1) Pattern-driven Algorithms
  Success depends on availability of collections of annotated transcription factor binding sites (TFBSs)
  Tend to produce very large numbers of false positives (FPs)
  Why?
    Binding sites for specific TFs are often variable
    Binding sites are short (typically 6-10 bp)
    Interactions between TFs (& other proteins) influence both affinity & specificity of TF binding
    One binding site often recognized by multiple TFs
    Biology is complex: gene activation is often specific to organism/cell/stage/environmental condition; promoter and enhancer elements must mediate this
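
A back-of-the-envelope calculation shows why short binding sites alone guarantee many false positives. The sketch below estimates how many exact matches to a fixed k-bp motif are expected by chance, assuming (as a simplification) a random sequence with equal base frequencies and independent positions. The function name is hypothetical.

```python
def expected_random_hits(motif_length, sequence_length, both_strands=True):
    """Expected number of chance exact matches to one fixed motif.

    Assumes uniform, independent bases (probability 0.25 each), which real
    genomes only approximate.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    per_strand = positions * 0.25 ** motif_length
    return per_strand * (2 if both_strands else 1)


# A 6 bp site versus a 10 bp site in 1 Mb of sequence, both strands scanned.
for k in (6, 10):
    print(k, round(expected_random_hits(k, 1_000_000), 1))
```

Under these assumptions a 6 bp site is expected to occur hundreds of times per megabase by chance, and allowing mismatches (variable sites) makes the number larger still.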

61 Ways to Reduce FPs in ab initio Prediction
  Take sequence context/biology into account
    Eukaryotes: clusters of TFBSs are common
    Prokaryotes: knowledge of σ (sigma) factors helps
  Probability of "real" binding site higher if annotated transcription start site (TSS) is nearby
    But: What about enhancers? (no TSS nearby!) & only a small fraction of TSSs have been experimentally determined
  Do the wet lab experiments!
    But: Promoter-bashing can be tedious…

62 2) Sequence-driven Algorithms
  Assumption: Common functionality can be deduced from sequence conservation (Homology)
  Alignments of co-regulated genes should highlight elements involved in regulation
  Careful: How to determine co-regulation?
    1. Orthologous genes from different species
    2. Genes experimentally shown to be co-regulated (using microarrays??)
  Comparative promoter prediction:
    1. Phylogenetic footprinting
    2. Expression Profiling

63 Phylogenetic Footprinting
  Based on increasing availability of whole genome DNA sequences from many different species
  Selection of organisms for comparison is important
    not too close, not too far: good = human vs mouse
  To reduce FPs, must extract non-coding sequences and then align them; prediction depends on good alignment
    use MSA algorithms (e.g., CLUSTAL)
    more sensitive methods:
      Gibbs sampling
      Expectation Maximization (EM) methods
  Examples of programs: Consite, rVISTA, PromH(W), Bayes aligner, Footprinter

64 Expression Profiling
  Based on increasing availability of whole genome mRNA expression data, esp. microarray data
    High-throughput simultaneous monitoring of expression levels of thousands of genes
  Assumptions: (sometimes valid, sometimes NOT)
    1. Co-expression implies co-regulation
    2. Co-regulated genes share common regulatory elements
  Drawbacks:
    1. Signals are short & weak! Requires Gibbs sampling or EM: e.g., MEME, AlignACE, Melina
    2. Prediction depends on determining which genes are co-expressed - usually by clustering - which can be error prone
  Examples of programs:
    INCLUSive - combined microarray analysis & motif detection
    PhyloCon - combined phylo footprinting & expression profiling

65 Problems with Sequence-driven Algorithms
  Need sets of co-regulated genes
  For comparative (phylogenetic) methods
    Must choose appropriate species
    Different genomes evolve at different rates
    Classical alignment methods have trouble with translocations or inversions that change order of functional elements
    If background conservation of entire region is high, comparison is useless
  Not enough data (but Prokaryotes >>> Eukaryotes)
  Complexity: many regulatory elements are not conserved across species!

66 TRANSFAC Matrix Entry: for TATA box
  Fields:
    Accession & ID
    Brief description
    TFs associated with this entry
    Weight matrix
    Number of sites used to build
    Other info
  Fig 5.13 Baxevanis & Ouellette 2005
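
A weight matrix such as the one in a TRANSFAC entry is typically used by sliding it along a sequence and scoring each window. The sketch below shows one common way to do this (log-odds scores against a uniform background, with a pseudocount). The count matrix is a made-up 4-column example, not the TRANSFAC TATA-box matrix, and the function names, pseudocount and background are assumptions.

```python
import math


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    counts: list of dicts, one per motif position, mapping base -> count.
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return matrix


def scan(seq, matrix):
    """Score every window of the sequence; windows with non-ACGT letters are skipped."""
    seq = seq.upper()
    width = len(matrix)
    scores = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if set(window) <= set("ACGT"):
            scores.append((i, sum(matrix[j][base] for j, base in enumerate(window))))
    return scores


# Hypothetical counts from 10 aligned sites that all read TATA.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
pwm = log_odds_matrix(toy_counts)
best = max(scan("GGTATAGG", pwm), key=lambda hit: hit[1])
print(best[0], round(best[1], 2))
```

In practice a score threshold decides which windows are reported; setting it is a trade-off between missed sites and false positives.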

67 Global Alignment of Human & Mouse Obese Gene Promoters (200 bp upstream from TSS) [figure]
  Fig 5.14 Baxevanis & Ouellette 2005

68 Annotated Lists of Promoter Databases & Promoter Prediction Software
  URLs from Mount textbook: Table 9.12
    http://www.bioinformaticsonline.org/links/ch_09_t_2.html
  Table in Wasserman & Sandelin Nat Rev Genet article
    http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.htm
  URLs from Baxevanis & Ouellette textbook:
    http://www.wiley.com/legacy/products/subject/life/bioinformatics/ch05.htm#links
  More lists:
    http://www.softberry.com/berry.phtml?topic=index&group=programs&subgroup=promoter
    http://bioinformatics.ubc.ca/resources/links_directory/?subcategory_id=104
    http://www3.oup.co.uk/nar/database/subcat/1/4/

69 Check out Optional Review & Try Associated Tutorial:
  Wasserman WW & Sandelin A (2004) Applied bioinformatics for identification of regulatory elements. Nat Rev Genet 5:276-287
    http://proxy.lib.iastate.edu:2103/nrg/journal/v5/n4/full/nrg1315_fs.html
  Check this out: http://www.phylofoot.org/NRG_testcases/
  Bottom line: this is a very "hot" area - new software for computational prediction of gene regulatory elements published every day!

