10/22/07 BCB 444/544 F07 ISU Dobbs #26 - Gene Prediction
BCB 444/544 Lecture 26: Gene Prediction (#26_Oct22)
Required Reading (before lecture)
  Mon Oct 22 - Lecture 26: Gene Prediction (Chp 8 - pp)
  Wed Oct 24 - Lecture 27: Regulatory Element Prediction (Chp 9 - pp) (will not be covered on Exam 2)
  Thurs Oct 25 - Review Session & Project Planning
  Fri Oct 26 - EXAM 2
Assignments & Announcements
  Sun Oct 21 - Study Guide for Exam 2 was posted
  Mon Oct 22 - HW#4 Due (no "correct" answer to post)
  Thu Oct 25 - Lab = Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
  Fri Oct 26 - Exam 2 - Will cover:
    Lectures (thru Mon Sept 17)
    Labs 5-8
    HW# 3 & 4
    All assigned reading: Chps 6 (beginning with HMMs), 7-8; Eddy: What is an HMM; Ginalski: Practical Lessons…
BCB 544 "Team" Projects
  544 Extra HW#2 is next step in Team Projects
    Write ~ 1 page outline
    Schedule meeting with Michael & Drena to discuss topic
    Read a few papers
    Write a more detailed plan
  You may work alone if you prefer
  Last week of classes will be devoted to Projects
    Written reports due: Mon Dec 3 (no class that day)
    Oral presentations (15-20') will be: Wed-Fri Dec 5,6,7
    1 or 2 teams will present during each class period
  See Guidelines for Projects posted online
BCB 544 Only: New Homework Assignment
  544 Extra#2 (posted online Thurs?) No - sorry! sent by on Sat…
  Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
  Part 1 - Brief outline of Project, to Drena & Michael; after response/approval, then:
  Part 2 - More detailed outline of project
    Read a few papers and summarize status of problem
    Schedule meeting with Drena & Michael to discuss ideas
Seminars this Week
  BCB List of URLs for Seminars related to Bioinformatics:
  Oct 25 Thur - BBMB Seminar, 4:10 in 1414 MBB
    Dave Segal, UC Davis: Zinc Finger Protein Design
  Oct 19 Fri - BCB Faculty Seminar, 2:10 in 102 ScI
    Guang Song, ComS, ISU: Probing functional mechanisms by structure-based modeling and simulations
Chp 16 - RNA Structure Prediction
  SECTION V STRUCTURAL BIOINFORMATICS
  Xiong: Chp 16 RNA Structure Prediction (Terribilini)
    RNA Function
    Types of RNA Structures
    RNA Secondary Structure Prediction Methods
    Ab Initio Approach
    Comparative Approach
    Performance Evaluation
Covalent & non-covalent bonds in RNA (Fig 6.2, Baxevanis & Ouellette 2005)
  Primary: Covalent bonds
  Secondary/Tertiary: Non-covalent bonds
    H-bonds (base-pairing)
    Base stacking
  (This is a new slide)
RNA Pseudoknots & Tetraloops
  Often have important regulatory or catalytic functions
  Figures: Pseudoknot, Tetraloop (image sources: huang/QD/mckay_hr.gif, Review/Annual-Reports/1995/images/rna.gif)
  (This is a new slide)
Base Pairing in RNA
  G-C, A-U, G-U ("wobble") & many variants
  See: IMB Image Library of Biological Molecules
  (This slide has been changed)
RNA Secondary Structure Prediction Methods
  Two (three, recently) main types of methods:
  1. Ab initio - based on calculating most energetically favorable secondary structure(s)
       Energy minimization (thermodynamics)
  2. Comparative approach - based on comparisons of multiple evolutionarily-related RNA sequences
       Sequence comparison (co-variation)
  3. Combined computational & experimental
       Use experimental constraints when available
  (This slide has been changed)
RNA Secondary structure prediction - 3
  3) Combined experimental & computational
  Experiments: Map single-stranded vs double-stranded regions in folded RNA
  How?
    Enzymes: S1 nuclease, T1 RNase
    Chemicals: kethoxal, DMS, hydroxyl radical (OH)
  Software: Mfold, Sfold, RNAstructure, RNAfold, RNAalifold
  [Figure: kethoxal and DMS modification sites on the folded RNA, each marked mild or strong]
  (This is a new slide)
Ab Initio Prediction: Clarifications
  Free energy is calculated based on parameters determined in the wet lab
  Correction: Use known energy associated with each type of nearest-neighbor pair (base-stacking), not each base-pair
  Base-pair formation is not independent: multiple base-pairs adjacent to each other are more favorable than individual base-pairs (cooperative) because of base-stacking interactions
  Bulges and loops adjacent to base-pairs have a free energy penalty
  (This slide has been changed)
Energy minimization: What are the rules? (C Staben 2005)
  Single A=U basepair: ΔG = -1.2 kcal/mole
  Two stacked basepairs, A=U on U=A: ΔG = -1.6 kcal/mole
  What gives here?
  (This is a new slide)
Energy minimization calculations: Base-stacking is critical - Tinoco et al. (C Staben 2005)
  (This is a new slide)
Ab Initio Energy Calculation (Fig 6.3, Baxevanis & Ouellette 2005)
  Search for all possible base-pairing patterns
  Calculate total energy of each structure based on all stabilizing and destabilizing forces
  Total free energy for a specific RNA conformation = sum of incremental energy terms for:
    helical stacking (sequence dependent)
    loop initiation
    unpaired stacking
  (favorable "increments" are < 0)
  (This slide has been changed)
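The sum of incremental energy terms can be sketched in a few lines of Python. This is only an illustration: the stacking values and the loop penalty below are made-up placeholders, not measured nearest-neighbor parameters, and the names `STACK`, `LOOP_PENALTY`, and `helix_energy` are hypothetical.

```python
# Illustrative stacking energies (kcal/mol), keyed by two adjacent base pairs.
# These numbers are placeholders for demonstration, NOT measured parameters.
STACK = {
    ("AU", "AU"): -0.9,
    ("AU", "UA"): -1.1,
    ("GC", "CG"): -3.4,
    ("CG", "GC"): -2.4,
    ("GC", "AU"): -2.2,
    ("AU", "GC"): -2.1,
}
LOOP_PENALTY = 4.0  # hypothetical loop-initiation penalty (unfavorable, > 0)


def helix_energy(pairs, loop_penalty=LOOP_PENALTY):
    """Add one stacking term per pair of adjacent base pairs, plus a loop penalty."""
    total = loop_penalty
    for lower, upper in zip(pairs, pairs[1:]):
        total += STACK[(lower, upper)]
    return total


stem = ["GC", "AU", "AU", "UA"]  # four base pairs give three stacking terms
energy = helix_energy(stem)
print(f"total free energy: {energy:.1f} kcal/mol")
```

Note that a single isolated base pair contributes no stacking term at all in this scheme, which is why adjacent base pairs are more favorable than the same pairs scattered through the structure.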
Dynamic Programming
  Finding optimal secondary structure is difficult - lots of possibilities
  Compare RNA sequence with itself
  Apply scoring scheme based on energy parameters for base stacking, cooperativity, and penalties for destabilizing forces (loops, bulges)
  Find path that represents most energetically favorable secondary structure
  (This slide has been changed)
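To make the "compare the sequence with itself" idea concrete, here is a minimal sketch of a Nussinov-style recursion. It is a deliberate simplification: it maximizes the number of base pairs instead of minimizing free energy, so it ignores stacking and loop penalties that programs such as Mfold score. The function names and the `min_loop` parameter are assumptions made for this example.

```python
def can_pair(a, b):
    """Watson-Crick pairs plus the G-U wobble pair."""
    return {a, b} in ({"G", "C"}, {"A", "U"}, {"G", "U"})


def max_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs, with at least min_loop unpaired
    bases enclosed by any pair (Nussinov-style dynamic programming)."""
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] = best pair count for the subsequence seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]  # case 1: base i stays unpaired
            for k in range(i + min_loop + 1, j + 1):  # case 2: i pairs with k
                if can_pair(seq[i], seq[k]):
                    inside = dp[i + 1][k - 1]
                    outside = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + inside + outside)
            dp[i][j] = best
    return dp[0][n - 1]


print(max_pairs("GGGAAAUCCC"))
```

Replacing the "+1 per pair" score with stacking energies and loop penalties turns the same table-filling scheme into an energy minimization.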
3 - Popular Programs that Use Combined Computational & Experimental Approaches
  Mfold, Sfold, RNAstructure, RNAfold, RNAalifold
Comparison of Predictions for Single RNA using Different Methods (JH Lee 2007)
  [Figure: predicted structures showing stem-loops SL X, SL Y, SL Z, with free energies (kcal/mol) from Mfold, RNAstructure, RNAfold, and Sfold]
Comparison of Mfold Predictions: -/+ Constraints (JH Lee 2007)
  [Figure: Mfold vs Mfold plus constraints, with free energies in kcal/mol]
Performance Evaluation
  Ab initio methods? correlation coefficient = 20-60%
  Comparative approaches? correlation coefficient = %
    Programs that require user to supply MSA are more accurate
  Comparative programs are consistently more accurate than ab initio
  Base-pairs predicted by comparative sequence analysis for large & small subunit rRNAs are 97% accurate when compared with high resolution crystal structures! - Gutell, Pace
  BEST APPROACH? Methods that combine computational prediction (ab initio & comparative) with experimental constraints (from chemical/enzymatic modification studies)
  (This slide has been changed)
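One way such accuracy figures can be computed is sketched below. Predicted and reference structures are compared as sets of base pairs; sensitivity and positive predictive value (PPV) are reported, and their geometric mean is used as an approximation to the Matthews correlation coefficient. Whether the figures on this slide were computed exactly this way is an assumption, and the base-pair sets are invented for the example.

```python
from math import sqrt


def evaluate_pairs(predicted, reference):
    """Compare two sets of (i, j) base pairs.

    Returns (sensitivity, ppv, approximate correlation coefficient)."""
    predicted, reference = set(predicted), set(reference)
    tp = len(predicted & reference)   # pairs predicted and present in reference
    fp = len(predicted - reference)   # pairs predicted but not in reference
    fn = len(reference - predicted)   # reference pairs that were missed
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, ppv, sqrt(sensitivity * ppv)


# Hypothetical example data: four reference pairs, three predicted pairs.
reference = {(0, 9), (1, 8), (2, 7), (3, 6)}
predicted = {(0, 9), (1, 8), (2, 6)}
sens, ppv, cc = evaluate_pairs(predicted, reference)
print(f"sensitivity={sens:.2f} PPV={ppv:.2f} CC~{cc:.2f}")
```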
Chp 8 - Gene Prediction
  SECTION III GENE AND PROMOTER PREDICTION
  Xiong: Chp 8 Gene Prediction
    Categories of Gene Prediction Programs
    Gene Prediction in Prokaryotes
    Gene Prediction in Eukaryotes
What is a Gene?
  A segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
  Genes can encode:
    mRNA (for protein)
    other types of RNA (tRNA, rRNA, miRNA, etc.)
  Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation
Gene Finding
  Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
    ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
  Steps:
  1. Search against protein / EST database
  2. Apply gene prediction programs (many programs available)
  3. Analyze regulatory regions
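As a toy version of the problem, the sketch below scans the forward strand of the example sequence on this slide for open reading frames (ATG through the first in-frame stop). It is the simplest possible "search by signal" and is not how production gene finders work; `find_orfs` and `min_codons` are names invented for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each forward-strand ORF.

    An ORF runs from an ATG to the first in-frame stop codon (inclusive);
    end is an exclusive index, as in Python slicing."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


seq = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
orfs = find_orfs(seq)
for start, end, orf in orfs:
    print(start, end, orf)
```

The second ATG in the sequence opens a reading frame that never reaches a stop codon before the sequence ends, so it is not reported.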
Gene Prediction in Prokaryotes vs Eukaryotes
  Prokaryotes
    Small genomes (on the order of 10^6 bp)
    About 90% of genome is coding
    Simple gene structure
    Prediction success ~99%
    Gene layout: Promoter - Start codon (ATG) - Open reading frame (ORF) - Stop codon (TAA)
  Eukaryotes
    Large genomes (10^7 bp and up)
    Often less than 2% coding
    Complicated gene structure (splicing, long introns)
    Prediction success %
    Gene layout: Promoter - 5' UTR - Start codon (ATG) - Exons and introns, joined at splice sites - Stop codon (TAA) - 3' UTR
DNA "Signals" Used by Gene Finding Algorithms
  1. Exploit the regular gene structure
       ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
  2. Recognize "coding bias"
       CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
  3. Recognize splice sites
       Intron - cAGt - Exon - gGTgag - Intron
  4. Model the duration of regions
       Introns tend to be much longer than exons, in mammals
       Exons are biased to have a given minimum length
  5. Use cross-species comparison
       Gene structure is conserved in mammals
       Exons are more similar (~85%) than introns
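A short sketch shows why the splice-site consensus alone is not enough. It lists every GT...AG span of a minimum length as a candidate intron; even in a tiny made-up sequence more than one candidate appears, which is why gene finders combine this signal with coding bias, length models, and comparison. The sequence, `candidate_introns`, and `min_len` are all hypothetical.

```python
import re


def candidate_introns(dna, min_len=20):
    """List (start, end) spans that begin with GT and end with AG.

    end is exclusive, so dna[start:end] is the candidate intron."""
    dna = dna.upper()
    donors = [m.start() for m in re.finditer("(?=GT)", dna)]
    acceptors = [m.start() + 2 for m in re.finditer("(?=AG)", dna)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]


# Toy gene: exon, then an intron starting GTAAGT and ending CAG, then exon.
toy = "ATGGCC" + "GTAAGT" + "T" * 10 + "CAG" + "GCTTAA"
candidates = candidate_introns(toy, min_len=10)
for start, end in candidates:
    print(start, end, toy[start:end])
```

Both candidates share the same acceptor but use different GT dinucleotides as the donor; only the first matches the intron that was built into the toy sequence.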
Computational Gene Finding Approaches
  Ab initio methods
    Search by signal: find DNA sequences involved in gene expression
    Search by content: test statistical properties distinguishing coding from non-coding DNA
  Similarity-based methods
    Database search: exploit similarity to proteins, ESTs, and cDNAs
    Comparative genomics: exploit aligned genomes
      Do other organisms have similar sequence?
  Hybrid methods - best
Examples of Gene Prediction Software
  Ab initio: Genscan, GeneMark.hmm, Genie, GeneID…
  Similarity-based: BLAST, Procrustes…
  Hybrids: GeneSeqer, GenomeScan, GenieEST, Twinscan, SGP, ROSETTA, CEM, TBLASTX, SLAM
  BEST?
    Ab initio - Genscan (according to some assessments)
    Hybrid - GeneSeqer
    But depends on organism & specific task
  Lists of Gene Prediction Software
Synthesis & Processing of Eukaryotic mRNA
  Gene in DNA (exon 1, intron, exon 2, intron, exon 3)
    Transcription
  Primary transcript (RNA), 5' to 3'
    Splicing (remove introns)
    Capping & polyadenylation
  Mature mRNA (5' 7-methyl-G cap, 3' poly(A) tail AAAAA)
    Export to cytoplasm
What are cDNAs & ESTs?
  cDNA libraries are important for determining gene structure & studying regulation of gene expression
    Isolate RNA (always from a specific organism, region, and time point)
    Convert RNA to complementary DNA (with reverse transcriptase)
    Clone into cDNA vector
    Sequence the cDNA inserts
  Short cDNAs are called ESTs or Expressed Sequence Tags
  ESTs are strong evidence for genes
  Full-length cDNAs can be difficult to obtain
UniGene: Unique genes via ESTs
  Find UniGene at NCBI:
  UniGene clusters contain many ESTs
  UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information re: level & tissue distribution of expression
Gene Prediction
  Overview of steps & strategies
    What sequence signals can be used?
    What other types of information can be used?
  Algorithms
    HMMs, Bayesian models, neural nets
  Gene prediction software
    3 major types
    many, many programs!
Overview of Gene Prediction Strategies
  What sequence signals can be used?
    Transcription: TF binding sites, promoter, initiation site, terminator, GC islands, etc.
    Processing signals: splice donor/acceptors, polyA signal
    Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage
  What other types of information can be used?
    Homology (sequence comparison, BLAST)
    cDNAs & ESTs (experimental data, pairwise alignment)
Gene prediction: Eukaryotes vs prokaryotes
  Gene prediction is easier in microbial genomes. Why?
    Smaller genomes
    Simpler gene structures
    Many more sequenced genomes! (for comparative approaches)
  Many microbial genomes have been fully sequenced & whole-genome "gene structure" and "gene function" annotations are available, e.g.:
    GeneMark.hmm
    TIGR Comprehensive Microbial Resource (CMR)
    NCBI Microbial Genomes
Predicting Genes - Basic steps:
  Obtain genomic sequence
  BLAST it!
    Perform database similarity search (with EST & cDNA databases, if available)
    Translate in all 6 reading frames (i.e., "6-frame translation")
    Compare with protein sequence databases
  Use Gene Prediction software to locate genes
  Analyze regulatory sequences
  Refine gene prediction
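The "6-frame translation" step can be sketched as follows, using the standard genetic code and the short example sequence from the Gene Finding slide. The helper names (`revcomp`, `translate`, `six_frames`) are invented for this example; tools such as BlastX perform this translation internally.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(dna):
    """Reverse complement of a DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become X, stops become *."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Translate both strands at offsets 0, 1, 2 (frames +1..+3 and -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", revcomp(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT")
for name, protein in frames.items():
    print(name, protein)
```

Each of the six translations can then be compared with protein sequence databases, which is the step that follows on the slide.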
Predicting Genes - Details:
  1. First, mask to "remove" repetitive elements (Alu elements, etc.)
  2. Perform database search on translated DNA (BlastX, TFasta)
  3. Use several programs to predict genes (GENSCAN, GeneMark.hmm, GeneSeqer)
  4. Search for functional motifs in translated ORFs (Blocks, Motifs, etc.) & in neighboring DNA sequences
  5. Repeat
Spliced Alignment Algorithm (Brendel 2005)
  GeneSeqer - Brendel et al. - ISU
  Perform pairwise alignment with large gaps in one sequence (due to introns)
    Align genomic DNA with cDNA, ESTs, protein sequences
  Score semi-conserved sequences at splice junctions
    Using a Bayesian model or Markov model (MM)
  Score coding constraints in translated exons
    Using a Bayesian model or MM
  Splice sites: the intron begins with GT (donor) and ends with AG (acceptor)
  Brendel et al (2004) Bioinformatics 20: 1157
Brendel - Spliced Alignment II: Compare with protein probes (Brendel 2005)
  [Figure: protein aligned to genomic DNA between the start codon and stop codon]
Splice Site Detection (Brendel 2005)
  Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
  Information content I_i and extent of splice signal window are defined using:
    i: ith position in sequence
    Ī: avg information content over all positions >20 nt from splice site
    σ_Ī: avg sample standard deviation of Ī
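The slide's own formula did not survive extraction. A common definition of per-position information content for DNA is I_i = 2 + sum over bases B of f_iB * log2(f_iB), in bits, where f_iB is the frequency of base B at position i. The sketch below assumes that definition; the aligned "donor sites" are invented for illustration and are not real data.

```python
from math import log2


def information_content(sites):
    """Per-column information content (bits) for equal-length aligned DNA sites.

    Uses I_i = 2 + sum_B f_iB * log2(f_iB); a fully conserved column scores 2,
    a column with all four bases equally frequent scores 0."""
    n = len(sites)
    profile = []
    for column in zip(*sites):
        info = 2.0
        for base in "ACGT":
            f = column.count(base) / n
            if f > 0:  # treat 0 * log2(0) as 0
                info += f * log2(f)
        profile.append(info)
    return profile


# Made-up aligned sequences around a GT donor site (GT at columns 2-3).
donor_sites = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA", "TGGTATGC"]
profile = information_content(donor_sites)
for position, bits in enumerate(profile):
    print(position, round(bits, 3))
```

Positions whose information content stands well above the background average Ī would then be counted as part of the splice signal window.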
Information content vs position (Brendel 2005)
  [Figure: information content plotted against position for Human T2_GT and Human T2_AG sites]
  Which sequences are exons & which are introns? How can you tell?
  Brendel et al (2004) Bioinformatics 20: 1157
Markov Model for Spliced Alignment (Brendel 2005)
  [Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1; transition labels: P_G, P_A(n), 1-P_A(n), (1-P_G)P_D(n+1), (1-P_G)(1-P_D(n+1))]
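The arrow placement in the figure is not preserved here, so the following is only one plausible reading of the labels, chosen because it makes the outgoing probabilities of each state sum to 1. It assumes that from an exon state the model takes a P_G transition, continues the exon, or enters an intron at a donor site, and that from an intron state it either continues the intron or returns to an exon at an acceptor site. The function name, dictionary keys, and numeric values are hypothetical.

```python
def transitions(p_g, p_donor_next, p_acceptor):
    """Outgoing transition probabilities under the assumed reading of the figure.

    p_g          -- P_G as labeled in the figure (its exact meaning is not given here)
    p_donor_next -- P_D(n+1), probability that position n+1 is a donor site
    p_acceptor   -- P_A(n), probability that position n is an acceptor site
    """
    from_exon = {
        "p_g_transition": p_g,
        "exon": (1 - p_g) * (1 - p_donor_next),   # e_n -> e_n+1
        "intron": (1 - p_g) * p_donor_next,       # e_n -> i_n+1
    }
    from_intron = {
        "intron": 1 - p_acceptor,                 # i_n -> i_n+1
        "exon": p_acceptor,                       # i_n -> e_n+1
    }
    return from_exon, from_intron


# Hypothetical values, for illustration only.
from_exon, from_intron = transitions(p_g=0.03, p_donor_next=0.2, p_acceptor=0.1)
print(from_exon, sum(from_exon.values()))
print(from_intron, sum(from_intron.values()))
```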