10/24/07 BCB 444/544 F07 ISU Dobbs #27 - Gene Prediction II: BCB 444/544 Lecture 27, Gene Prediction II (#27_Oct24)

Presentation transcript:

Slide 1: BCB 444/544 Lecture 27, Gene Prediction II (#27_Oct24)

Slide 2: Required Reading (before lecture)
  Mon Oct 22 - Lecture 26: Gene Prediction (Chp 8)
  Wed Oct 24 - Lecture 27 (will not be covered on Exam 2): Promoter & Regulatory Element Prediction (Chp 9)
  Thurs Oct 25 - Review Session & Project Planning
  Fri Oct 26 - EXAM 2

Slide 3: Assignments & Announcements
  Mon Oct 22 - Study Guide for Exam 2 was posted, finally…
  Mon Oct 22 - HW#4 due (no "correct" answer to post)
  Thu Oct 25 - no Lab => Optional Review Session for Exam; 544 Project Planning/Consult with DD & MT
  Fri Oct 26 - Exam 2 will cover:
    Lectures (thru Mon Sept 17)
    Labs 5-8
    HW# 3 & 4
    All assigned reading: Chps 6 (beginning with HMMs), 7-8; Eddy: What is an HMM; Ginalski: Practical Lessons…

Slide 4: BCB 544 "Team" Projects
  544 Extra HW#2 is the next step in Team Projects:
    Write ~1 page outline
    Schedule meeting with Michael & Drena to discuss topic
    Read a few papers
    Write a more detailed plan
  You may work alone if you prefer
  Last week of classes will be devoted to Projects
  Written reports due: Mon Dec 3 (no class that day)
  Oral presentations (15-20') will be Wed-Fri Dec 5, 6, 7; 1 or 2 teams will present during each class period
  See Guidelines for Projects posted online

Slide 5: BCB 544 Only: New Homework Assignment
  544 Extra#2 (posted online Thurs?) No - sorry! Sent on Sat…
  Due: PART 1 - ASAP; PART 2 - Fri Nov 2 by 5 PM
  Part 1 - Brief outline of Project, to Drena & Michael; after response/approval, then:
  Part 2 - More detailed outline of project:
    Read a few papers and summarize status of problem
    Schedule meeting with Drena & Michael to discuss ideas

Slide 6: Seminars this Week
  BCB List of URLs for Seminars related to Bioinformatics
  Oct 25 Thur - BBMB Seminar, 4:10 in 1414 MBB: Dave Segal (UC Davis), Zinc Finger Protein Design
  Oct 19 Fri - BCB Faculty Seminar, 2:10 in 102 ScI: Guang Song (ComS, ISU), Probing functional mechanisms by structure-based modeling and simulations

Slide 7: Chp 8 - Gene Prediction
  SECTION III: GENE AND PROMOTER PREDICTION
  Xiong: Chp 8 Gene Prediction
    Categories of Gene Prediction Programs
    Gene Prediction in Prokaryotes
    Gene Prediction in Eukaryotes

Slide 8: What is a Gene?
  A gene is a segment of DNA, some of which is "structural," i.e., transcribed to give a functional RNA product, & some of which is "regulatory"
  Genes can encode:
    mRNA (for protein)
    other types of RNA (tRNA, rRNA, miRNA, etc.)
  Genes differ in eukaryotes vs prokaryotes (& archaea), in both structure & regulation

Slide 9: Synthesis & Processing of Eukaryotic mRNA
  [Figure: a gene in DNA (exon 1, intron, exon 2, intron, exon 3) is transcribed into the primary transcript (RNA). Steps shown: transcription; splicing (remove introns); capping (7-methylguanosine at the 5' end) & polyadenylation (AAAAA at the 3' end), giving the mature mRNA; export to cytoplasm.]

Slide 10: What are cDNAs & ESTs?
  cDNA libraries are important for determining gene structure & studying regulation of gene expression
  To build a cDNA library:
    Isolate RNA (always from a specific organism, region, and time point)
    Convert RNA to complementary DNA (with reverse transcriptase)
    Clone into cDNA vector
    Sequence the cDNA inserts
  Short cDNAs are called ESTs, or Expressed Sequence Tags
  ESTs are strong evidence for genes
  Full-length cDNAs can be difficult to obtain
  [Figure: cloning vector with cDNA insert]

Slide 11: UniGene: Unique genes via ESTs
  Find UniGene at NCBI
  UniGene clusters contain many ESTs
  UniGene data come from many cDNA libraries. When you look up a gene in UniGene, you can obtain information about the level & tissue distribution of its expression

Slide 12: Gene Prediction in Prokaryotes vs Eukaryotes
  Prokaryotes:
    Small genomes (on the order of 10^6 bp)
    About 90% of genome is coding
    Simple gene structure
    Prediction success ~99%
    Gene layout: promoter, then an open reading frame (ORF) running from the start codon (ATG) to the stop codon (e.g., TAA)
  Eukaryotes:
    Large genomes (10^7 bp and up)
    Often less than 2% coding
    Complicated gene structure (splicing, long introns)
    Prediction success: lower than in prokaryotes (percentage not preserved)
    Gene layout: promoter, 5' UTR, exons separated by introns (with splice sites) between the start codon (ATG) and the stop codon (e.g., TAA), 3' UTR

Slide 13: Prediction is Easier in Microbial Genomes
  Why?
    Smaller genomes
    Simpler gene structures
    Many more sequenced genomes! (for comparative approaches)
  Many microbial genomes have been fully sequenced, & whole-genome "gene structure" and "gene function" annotations are available
    e.g., GeneMark.hmm, Glimmer
    TIGR Comprehensive Microbial Resource (CMR)
    NCBI Microbial Genomes

Slide 14: Gene Prediction - The Problem
  Problem: Given a new genomic DNA sequence, identify coding regions and their predicted RNA and protein sequences
  Example: ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT
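As a rough illustration of the problem statement, the sketch below scans the forward strand of the slide's example sequence for open reading frames (ATG through the first in-frame stop codon). It is a toy prokaryote-style scan, not the method of any program named in this lecture; the function name `find_orfs` and the `min_codons` parameter are assumptions made for this example, and the reverse strand is deliberately ignored.

```python
# Toy ORF scan on the forward strand only (illustrative assumption:
# an ORF is ATG ... first in-frame stop codon, in one of three frames).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end, orf_sequence) tuples; end is exclusive."""
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    example = "ATTACCATGGGGCAGGGTCAGATATAATGCCCTCATTTT"
    for start, end, orf in find_orfs(example):
        print(start, end, orf)
```

On the example sequence this finds a single forward-strand ORF, ATGGGGCAGGGTCAGATATAA. Real eukaryotic genes cannot be found this way, because introns interrupt the reading frame.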

Slide 15: Computational Gene Prediction: Approaches
  Ab initio methods
    Search by signal: find DNA sequences involved in gene expression
    Search by content: test statistical properties distinguishing coding from non-coding DNA
  Similarity-based methods
    Database search: exploit similarity to proteins, ESTs, cDNAs
    Comparative genomics: exploit aligned genomes (do other organisms have similar sequence?)
  Hybrid methods - best

Slide 16: Computational Gene Prediction: Algorithms
  1. Neural Networks (NNs) (more on these later…)
     e.g., GRAIL
  2. Linear discriminant analysis (LDA) (see text)
     e.g., FGENES, MZEF
  3. Markov Models (MMs) & Hidden Markov Models (HMMs)
     e.g., GeneSeqer - uses MMs
     GENSCAN - uses 5th order HMMs (see text)
     HMMgene - uses conditional maximum likelihood (see text)

Slide 17: Gene Prediction Strategies
  What sequence signals can be used?
    Transcription: TF binding sites, promoter, initiation site, terminator, GC (CpG) islands, etc.
    Processing signals: splice donor/acceptors, polyA signal
    Translation: start (AUG = Met) & stop (UGA, UAA, UAG) codons, ORFs, codon usage
  What other types of information can be used?
    Homology (sequence comparison, BLAST)
    cDNAs & ESTs (experimental data, pairwise alignment)

Slide 18: Signals Search
  Approach: Build models (PSSMs, profiles, HMMs, …) and search against DNA. Detected instances provide evidence for genes
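The signal-search idea can be sketched with a small position-specific scoring matrix (PSSM). The training sites below are hypothetical donor-like 8-mers invented for this example, and the uniform background (0.25 per base) and pseudocount of 1 are assumptions; `build_pssm` and `scan` are illustrative names, not functions from any tool mentioned in the lecture.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM (one dict per column) from equal-length aligned sites."""
    pssm = []
    for col in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pssm


def scan(seq, pssm):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(pssm)
    return [
        (i, sum(pssm[j][seq[i + j]] for j in range(width)))
        for i in range(len(seq) - width + 1)
    ]


if __name__ == "__main__":
    # Hypothetical training sites (not real data).
    sites = ["AGGTAAGT", "AGGTGAGT", "CGGTAAGT", "AGGTAAGA"]
    pssm = build_pssm(sites)
    hits = scan("CCCAGGTAAGTCCC", pssm)
    best_pos, best_score = max(hits, key=lambda hit: hit[1])
    print(best_pos, round(best_score, 2))
```

In practice a score threshold decides which windows count as detected instances, and that threshold sets the trade-off between missed sites and false hits.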

Slide 19: DNA Signals Used in Gene Prediction
  1. Exploit the regular gene structure
     ATG - Exon1 - Intron1 - Exon2 - … - ExonN - STOP
  2. Recognize "coding bias"
     CAG-CGA-GAC-TAT-TTA-GAT-AAC-ACA-CAT-GAA-…
  3. Recognize splice sites
     Intron - cAGt - Exon - gGTgag - Intron
  4. Model the duration of regions
     Introns tend to be much longer than exons, in mammals
     Exons are biased to have a given minimum length
  5. Use cross-species comparison
     Gene structure is conserved in mammals
     Exons are more similar (~85%) than introns

Slide 20: Content Search
  Observation: Encoding a protein affects statistical properties of the DNA sequence:
    Nucleotide composition
    Hexamer frequency
    GC content (CpG islands, exon/intron)
    Uneven usage of synonymous codons (codon bias)
  Method: Evaluate these differences (coding statistics) to differentiate between coding and non-coding regions

Slide 21: Human Codon Usage

Slide 22: Predicting Genes based on Codon Usage Differences
  Algorithm:
    Process a sliding window along the sequence
    Use codon frequencies to compute the probability of coding versus non-coding
    Plot the log-likelihood ratio
  [Figure: coding profile of the β-globin gene, with exons marked]
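The sliding-window algorithm can be sketched as follows. This is a minimal illustration under stated assumptions: codon frequencies are estimated from whatever in-frame coding sequences are supplied (a made-up training sequence here, not a real codon usage table), the non-coding model is taken to be uniform over the 64 codons, and the names `codon_frequencies` and `coding_profile` are hypothetical.

```python
import math
from collections import Counter


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    bases = "ACGT"
    counts = Counter({a + b + c: pseudocount for a in bases for b in bases for c in bases})
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def coding_profile(seq, coding_freq, window=30, step=3, frame=0):
    """Log-likelihood ratio (coding vs uniform non-coding) per window.

    The window length should be a multiple of 3 so that each window
    holds whole codons in the chosen frame.
    """
    noncoding = 1.0 / 64  # assumption: every codon equally likely
    profile = []
    for start in range(frame, len(seq) - window + 1, step):
        score = 0.0
        for i in range(start, start + window, 3):
            codon = seq[i:i + 3]
            score += math.log(coding_freq.get(codon, noncoding) / noncoding)
        profile.append((start, score))
    return profile


if __name__ == "__main__":
    freq = codon_frequencies(["GCC" * 20])  # toy training data
    for start, score in coding_profile("TTA" * 4 + "GCC" * 4, freq, window=12, step=12):
        print(start, round(score, 2))
```

Windows with positive scores look more like the training coding sequences than like the uniform background; plotting the scores against position gives a coding profile of the kind shown on the slide.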

Slide 23: Similarity-Based Methods: Database Search
  In different genomes: Translate DNA into all 6 reading frames and search against proteins (TBLASTX, BLASTX, etc.)
  Within same genome: Search with EST/cDNA database (EST2genome, BLAT, etc.)
  Problems:
    Will not find "new" or RNA genes (non-coding genes)
    Limits of similarity are hard to define
    Small exons might be overlooked
  Example (both strands):
    ATTGCGTAGGGCGCT
    TAACGCATCCCGCGA
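A six-frame translation of the slide's double-stranded example can be sketched as below. The codon table is the standard genetic code written in the usual TCAG order; the function names are illustrative, and this is not how BLASTX or TBLASTX are implemented internally, only the translation step they rely on.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate whole codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in sorted(six_frame_translation("ATTGCGTAGGGCGCT").items()):
        print(name, protein)
```

Frame +1 of the example contains a stop codon (TAG, shown as `*`), which illustrates why all six frames have to be searched: the true reading frame and strand are not known in advance.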

Slide 24: Similarity-Based Methods: Comparative Genomics
  Idea: Functional regions are more conserved than non-functional ones; high similarity in alignment indicates gene
  Advantages: May find uncharacterized or RNA genes
  Problems:
    Finding suitable evolutionary distance
    Finding limits of high similarity (functional regions)
  Example alignment:
    human  GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA
    mouse  C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-
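One simple way to quantify "high similarity in alignment" is percent identity over the aligned (ungapped) columns. The sketch below applies that to the human/mouse rows from the slide; counting identity over ungapped columns only is a choice made for this example (other conventions divide by the full alignment length), and `alignment_identity` is a hypothetical helper name.

```python
def alignment_identity(row_a, row_b, gap="-"):
    """Return (matches, ungapped_columns, total_columns) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = 0
    ungapped = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        ungapped += 1
        if a == b:
            matches += 1
    return matches, ungapped, len(row_a)


if __name__ == "__main__":
    human = "GGTTTT--ATGAGTAAAGTAGACACTCCAGTAACGCGGTGAGTAC----ATTAA"
    mouse = "C-TCAGGAATGAGCAAAGTCGAC---CCAGTAACGCGGTAAGTACATTAACGA-"
    matches, ungapped, total = alignment_identity(human, mouse)
    print(f"{matches} matches in {ungapped} ungapped columns "
          f"({100 * matches / ungapped:.1f}% identity), {total} columns total")
```

Computing the same statistic in a window that slides along a whole-genome alignment is one way to locate candidate conserved regions, subject to the problems the slide lists (choice of evolutionary distance and of a similarity cutoff).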

Slide 25: Human-Mouse Homology
  Comparison of 1196 orthologous genes
  Sequence identity between genes in human vs mouse:
    Exons: 84.6%
    Protein: 85.4%
    Introns: 35%
    5' UTRs: 67%
    3' UTRs: 69%

Slide 26: Gene Prediction Flowchart
  [Figure: Fig 5.15, Baxevanis & Ouellette 2005]

Slide 27: Predicting Genes - Basic steps:
  Obtain genomic sequence
  BLAST it! Perform database similarity search (with EST & cDNA databases, if available)
  Translate in all 6 reading frames (i.e., "6-frame translation")
  Compare with protein sequence databases
  Use gene prediction software to locate genes
  Compare results obtained using different programs
  Analyze regulatory sequences, too
  Refine gene prediction

Slide 28: Predicting Genes - a few Details:
  1. First, mask to "remove" repetitive elements (ALUs, etc.)
  2. Perform database search on translated DNA (BlastX, TFasta)
  3. Use several programs to predict genes & find ORFs (GENSCAN, GeneSeqer, GeneMark.hmm, GRAIL)
  4. Search for functional motifs in translated ORFs & in neighboring DNA sequences (InterPro, Transfac)
  5. Repeat

Slide 29: Thanks to Volker Brendel, ISU, for the following Figs & Slides
  Slightly modified from: BSSI Genome Informatics Module 05.html#moduleB, V Brendel
  Brendel et al (2004) Bioinformatics 20: 1157

Slide 30: GeneSeqer
  [Figure: pipeline. Inputs: genomic sequence; EST or protein database (suffix array/suffix tree). Stages shown: fast search, spliced alignment, output, assembly.]
  Brendel 2005

Slide 31: Spliced Alignment Algorithm (GeneSeqer - Brendel et al. - ISU)
  Perform pairwise alignment with large gaps in one sequence (due to introns)
    Align genomic DNA with cDNA, ESTs, protein sequences
  Score semi-conserved sequences at splice junctions
    Using Bayesian probability model & 1st order MM
  Score coding constraints in translated exons
    Using Bayesian model
  [Figure: intron bounded by splice sites, GT at the donor site and AG at the acceptor site]
  Brendel et al (2004) Bioinformatics 20: 1157
  Brendel 2005

Slide 32: Signals: Pre-mRNA Splicing
  [Figure: genomic DNA (start codon, stop codon) is transcribed into pre-mRNA (Cap- … -Poly(A)), in which exons are separated by introns bounded by the donor site (GT) and acceptor site (AG); splicing yields mRNA (Cap- … -Poly(A)); translation yields protein.]
  Brendel 2005

Slide 33: Brendel - Spliced Alignment I: Compare with cDNA or EST probes
  [Figure: genomic DNA (start codon, stop codon) aligned with mRNA (Cap-, 5'-UTR, start codon, stop codon, 3'-UTR, -Poly(A))]
  Brendel 2005

Slide 34: Brendel - Spliced Alignment II: Compare with protein probes
  [Figure: genomic DNA (start codon, stop codon) aligned with protein]
  Brendel 2005

Slide 35: Splice Site Detection
  Do DNA sequences surrounding splice "consensus" sequences contribute to splicing signal? YES
  Information content I_i, where i is the ith position in the sequence
  Extent of splice signal window, defined in terms of:
    Ī: average information content over all positions >20 nt from the splice site
    σ_Ī: average sample standard deviation of Ī
  Brendel 2005
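The slide's formula for I_i did not survive in this transcript. The sketch below assumes the standard Shannon-based definition for DNA, I_i = 2 + Σ_B f_iB · log2(f_iB), summed over the four bases B with f_iB the frequency of base B at position i. That choice is an assumption of this example, as are the toy input sites and the function name `information_content`; the windowing rule based on Ī and σ_Ī is not reproduced because its exact form is not given here.

```python
import math


def information_content(sites):
    """Per-position information content (bits) of equal-length aligned sites.

    Assumes I_i = 2 + sum_B f_iB * log2(f_iB); terms with f_iB = 0 contribute 0.
    """
    n_sites = len(sites)
    profile = []
    for i in range(len(sites[0])):
        info = 2.0
        for base in "ACGT":
            freq = sum(1 for site in sites if site[i] == base) / n_sites
            if freq > 0:
                info += freq * math.log2(freq)
        profile.append(info)
    return profile


if __name__ == "__main__":
    # Toy aligned sites (hypothetical): an invariant G followed by a fully
    # variable position, then an invariant T.
    toy_sites = ["GAT", "GCT", "GGT", "GTT"]
    print(information_content(toy_sites))
```

An invariant position carries the maximum 2 bits and a position where all four bases are equally frequent carries 0 bits, so positions flanking the GT/AG consensus that stay well above the background level are the ones contributing to the splice signal.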

Slide 36: Information Content vs Position
  [Figure: information content plots for Human T2_GT and Human T2_AG]
  Which sequences are exons & which are introns? How can you tell?
  Brendel 2005

Slide 37: Donor (GT) & Acceptor (AG) Sites Used for Model Training
  [Table: number of true splice sites per phase, by site type (GT, AG) and species; the counts are not preserved. Species listed: Zea mays, Arabidopsis thaliana, Aspergillus, S. pombe, C. elegans, Drosophila, Gallus gallus, Rattus norvegicus, Mus musculus, Homo sapiens.]
  Brendel 2005

Slide 38: Markov Model for Spliced Alignment
  [Figure: state diagram with exon states e_n, e_n+1 and intron states i_n, i_n+1. Transition labels: P_G; P_A(n); 1-P_A(n); P_ΔG; (1-P_ΔG)·P_D(n+1); (1-P_ΔG)·(1-P_D(n+1)).]
  Brendel 2005

Slide 39: Evaluation of Predictions (Do not memorize this!)
                     Actual True     Actual False
  Predicted True     TP              FP              PP = TP + FP
  Predicted False    FN              TN              PN = FN + TN
                     AP = TP + FN    AN = FP + TN
  PP = predicted positives; TP = true positives; FP = false positives
  Measures named on this slide: sensitivity (= coverage = recall), specificity, normalized specificity, misclassification rates

Slide 40: Evaluation of Predictions - in English
                     Actual True     Actual False
  Predicted True     TP              FP              PP = TP + FP
  Predicted False    FN              TN              PN = FN + TN
                     AP = TP + FN    AN = FP + TN
  Sensitivity (= Coverage = Recall): Sn = TP / AP = TP / (TP + FN)
    In English? Sensitivity is the fraction of all positive instances having a true positive prediction.
  Specificity: Sp = TP / PP = TP / (TP + FP)
    In English? Specificity is the fraction of all predicted positives that are, in fact, true positives.
  IMPORTANT: in medical jargon, Specificity is sometimes defined differently (what we define here as "Specificity" is sometimes referred to as "Positive predictive value")
  IMPORTANT: Sensitivity alone does not tell us much about performance, because 100% sensitivity can be trivially achieved by labeling all test cases positive!
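The two definitions, and the warning about sensitivity alone, can be checked with a few lines of arithmetic. The counts below are made up for illustration, and the function names are assumptions of this sketch; note that `specificity_ppv` follows this lecture's definition (positive predictive value), not the medical one.

```python
def sensitivity(tp, fn):
    """Fraction of actual positives that were predicted positive: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity_ppv(tp, fp):
    """Fraction of predicted positives that are truly positive: TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical predictor on 100 true sites and 900 false sites.
    print(sensitivity(tp=80, fn=20), specificity_ppv(tp=80, fp=20))

    # Trivial predictor that labels every test case positive:
    # sensitivity is perfect, but specificity collapses.
    print(sensitivity(tp=100, fn=0), specificity_ppv(tp=100, fp=900))
```

This is why the two measures are reported together: the trivial predictor reaches Sn = 1.0 while only 10% of its predictions are correct.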

Slide 41: Best Measures for Comparison? (Do not memorize this!)
  ROC curves (Receiver Operating Characteristic ?!!)
    In signal detection theory, a receiver operating characteristic (ROC), or ROC curve, is a plot of sensitivity vs (1 - specificity) for a binary classifier system as its discrimination threshold is varied. Equivalently, it plots the fraction of true positives (TPR = true positive rate) vs the fraction of false positives (FPR = false positive rate). In this definition "specificity" is used in the medical sense, TN / (TN + FP), not in the positive-predictive-value sense used in this lecture.
  Correlation coefficient: Matthews correlation coefficient (MCC)
    MCC = 1 for a perfect prediction
    MCC = 0 for a completely random assignment
    MCC = -1 for a "perfectly incorrect" prediction
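The MCC formula itself is not preserved in this transcript. The sketch below uses the standard definition, MCC = (TP·TN - FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and checks the three reference values listed on the slide. Returning 0 when the denominator is zero is a common convention adopted here as an assumption, and the counts are made up.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used here when a row or column sum is zero
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    print(mcc(tp=50, fp=0, fn=0, tn=50))     # perfect prediction
    print(mcc(tp=25, fp=25, fn=25, tn=25))   # no better than random
    print(mcc(tp=0, fp=50, fn=50, tn=0))     # "perfectly incorrect"
```

Unlike sensitivity or specificity alone, MCC uses all four cells of the confusion matrix, so the trivial "label everything positive" predictor does not score well.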

Slide 42: GeneSeqer Performance?
  [Figure: four plots (Human GT site, Human AG site, A. thaliana GT site, A. thaliana AG site), with axes labeled Sn and σ]
  Plots such as these (& ROCs) are much better than using a "single number" to compare different methods
  Such plots illustrate the trade-off: Sn vs Sp
  Note: the above are not ROC curves (plots of Sn vs 1-Sp)
  Brendel 2005

Slide 43: GeneSeqer Results on Different Genomes
  [Table: columns are Species, Model, Site, Test Site Set (True, False), Bayes Factor, Sn (%), σ (%), Sp (%). Rows: A. thaliana (model 7C; GT, AG), C. elegans (model 7C; GT, AG), Drosophila (model 2C; GT, AG), Homo sapiens (model 2C; GT, AG). The numeric values are not preserved.]
  Brendel 2005

Slide 44: Performance of GeneSeqer vs Others?
  Comparison with ab initio gene prediction: vs GENSCAN, an HMM-based ab initio method
  "Winner" depends on:
    Availability of ESTs
    Level of similarity to protein homologs
  Brendel 2005

Slide 45: GeneSeqer vs GENSCAN (Exon prediction)
  [Figure: x-axis is target protein alignment score]
  GENSCAN - Burge, MIT
  Brendel 2005

Slide 46: GeneSeqer vs GENSCAN (Intron prediction)
  GENSCAN - Burge, MIT
  Brendel 2005

Slide 47: GeneSeqer: Input
  Brendel 2005

Slide 48: GeneSeqer: Output
  Brendel 2005

Slide 49: GeneSeqer: Gene Evidence Summary
  Brendel 2005

Slide 50: Gene Prediction - Problems & Status?
  Common errors?
    False positive intergenic region: 2 annotated genes actually correspond to a single gene
    False negative intergenic region: one annotated gene structure actually contains 2 genes
    False negative gene prediction: missing gene (no annotation)
    Other: partially incorrect gene annotation; missing annotation of alternative transcripts
  Current status?
    For ab initio prediction in eukaryotes: HMMs have better overall performance for detecting intron/exon boundaries
      Limitation? Training data: predictions are organism specific
    Combined ab initio/homology-based predictions: improved accuracy
      Limitation? Availability of identifiable sequence homologs in databases

Slide 51: Recommended Gene Prediction Software
  Ab initio: GENSCAN, GeneMark.hmm; others: GRAIL, FGENES, MZEF, HMMgene
  Similarity-based: BLAST, GenomeScan, EST2Genome, Twinscan
  Combined: GeneSeqer, ROSETTA
  Consensus: because results depend on the organism & the specific task, always use more than one program!
    Two servers that report consensus predictions: GeneComber, DIGIT

Slide 52: Other Gene Prediction Resources: at ISU

Slide 53: Other Gene Prediction Resources: GaTech, MIT, Stanford, etc.
  Lists of Gene Prediction Software
  Current Protocols in Bioinformatics (BCB/ISU owns a copy - currently in my lab!)
  Chapter 4: Finding Genes
    4.1 An Overview of Gene Identification: Approaches, Strategies, and Considerations
    4.2 Using MZEF To Find Internal Coding Exons
    4.3 Using GENEID to Identify Genes
    4.4 Using GlimmerM to Find Genes in Eukaryotic Genomes
    4.5 Prokaryotic Gene Prediction Using GeneMark and GeneMark.hmm
    4.6 Eukaryotic Gene Prediction Using GeneMark.hmm
    4.7 Application of FirstEF to Find Promoters and First Exons in the Human Genome
    4.8 Using TWINSCAN to Predict Gene Structures in Genomic DNA Sequences
    4.9 GrailEXP and Genome Analysis Pipeline for Genome Annotation
    4.10 Using RepeatMasker to Identify Repetitive Elements in Genomic Sequences