Gene prediction

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca tgcggctatgctaagctcatgcgg

Gene Prediction: Computational Challenge aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatg ctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgc ggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcgg ctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccg atgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctg cggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcat gcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagct gggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgca tgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggcta tgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcg gctatgctaagctcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatga caatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggc tatgctaatgcatgcggctatgctaagctcggctatgctaatgaatggtcttgggatttaccttggaatgcta agctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaa tgcatgcggctatgctaagctgggaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcg gctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgca tgcggctatgctaagctcatgcgg Gene!

Newspaper written in unknown language –Certain pages contain encoded message, say 99 letters on page 7, 30 on page 12 and 63 on page 15. How do you recognize the message? You could probably distinguish between the ads and the story (ads contain the “$” sign often) Statistics-based approach to Gene Prediction tries to make similar distinctions between exons and introns. Gene Prediction Analogy

Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols could you distinguish between a story and the stock report in a foreign newspaper? Statistical Approach: Metaphor in Unknown Language

Statistical: coding segments (exons) have typical sequences on either end and use different subwords than non-coding segments (introns). Similarity-based: many human genes are similar to genes in mice, chicken, or even bacteria. Therefore, already known mouse, chicken, and bacterial genes may help to find human genes. Two Approaches to Gene Prediction

If you could compare the day’s news in English, side-by-side to the same news in a foreign language, some similarities may become apparent Similarity-Based Approach: Metaphor in Different Languages

Annotation of Genomic Sequence Given the sequence of an organism’s genome, we would like to be able to identify: –Genes –Exon boundaries & splice sites –Beginning and end of translation –Alternative splicings –Regulatory elements (e.g. promoters) The only certain way to do this is experimentally, but it is time consuming and expensive. Computational methods can achieve reasonable accuracy quickly, and help direct experimental approaches. primary goals secondary goals

Prokaryotic Gene Structure [Figure: genomic DNA with promoter, CDS, and terminator, transcribed into mRNA] –Most bacterial promoters contain the Pribnow box, at about -10, which has the consensus sequence 5'-TATAAT-3'. –The terminator is a signal at the end of the coding sequence that terminates the transcription of RNA. –The coding sequence is composed of nucleotide triplets. Each triplet codes for an amino acid. Amino acids are the building blocks of proteins.

Pieces of a (Eukaryotic) Gene (on the genome) [Figure: 5’ to 3’ layout of a gene showing exons (CDS & UTR) and introns, the polyadenylation site, the promoter (~10^3 bp), enhancers, and other regulatory sequences]

What is Computational Gene Finding? Given an uncharacterized DNA sequence, find out: –Which region codes for a protein? –Which DNA strand is used to encode the gene? –Which reading frame is used in that strand? –Where does the gene start and end? –Where are the exon-intron boundaries in eukaryotes? –(optionally) Where are the regulatory sequences for that gene?

Prokaryotic vs. Eukaryotic Gene Finding Prokaryotes: small genomes (0.5–10 × 10^6 bp), high coding density (>90%), no introns –Gene identification relatively easy, with success rate ~99% Problems: overlapping ORFs, short genes, finding TSS and promoters Eukaryotes: large genomes (10^7 bp and up), low coding density (<50%), intron/exon structure –Gene identification a complex problem, gene-level accuracy ~50% Problems: many

What is it about genes that we can measure (and model)? Most of our knowledge is biased towards protein-coding characteristics –ORF (Open Reading Frame): a sequence defined by an in-frame AUG and stop codon, which in turn defines a putative amino acid sequence. –Codon Usage: most frequently measured by CAI (Codon Adaptation Index) Other phenomena –Nucleotide frequencies and correlations: value and structure –Functional sites: splice sites, promoters, UTRs, polyadenylation sites

General Things to Remember about (Protein-coding) Gene Prediction Software –It is, in general, organism-specific –It works best on genes that are reasonably similar to something seen previously –It finds protein-coding regions far better than non-coding regions –In the absence of external (direct) information, alternative forms will not be identified –It is imperfect! (It’s biology, after all…)

Gene Finding: Different Approaches Similarity-based methods (extrinsic) - use similarity to annotated sequences : –proteins –cDNAs –ESTs Comparative genomics - Aligning genomic sequences from different species Ab initio gene-finding (intrinsic) Integrated approaches

Similarity-based methods Based on sequence conservation due to functional constraints Use local alignment tools (Smith-Waterman algo, BLAST, FASTA) to search protein, cDNA, and EST databases Will not identify genes that code for proteins not already in databases (can identify ~50% new genes) Limits of the regions of similarity not well defined

Comparative Genomics Based on the assumption that coding sequences are more conserved than non-coding sequences Two approaches: –intra-genomic (gene families) –inter-genomic (cross-species) Alignment of homologous regions Difficult to define the limits of higher similarity Difficult to find the optimal evolutionary distance (patterns of conservation differ between loci)

Summary for Extrinsic Approaches Strengths: Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions Weaknesses: Limited to pre-existing biological data Errors in databases Difficult to find limits of similarity

Ab initio Gene Finding Input: A DNA string over the alphabet {A,C,G,T} Output: An annotation of the string showing for every nucleotide whether it is coding or non-coding AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG Gene finder Using only sequence information Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored) Integrates coding statistics with signal detection

A eukaryotic gene This is the human p53 tumor suppressor gene on chromosome 17. Genscan is one of the most popular gene prediction algorithms. This particular gene lies on the reverse strand. 3’ untranslated region Final exon Initial exon Introns Internal exons

Observations Given (walk, shop, clean) –What is the probability of this sequence of observations? (is he really still at home, or did he skip the country) –What was the most likely sequence of rainy/sunny days?
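
The two questions on this slide map onto the forward algorithm (probability of the observations) and the Viterbi algorithm (most likely hidden state sequence). The following is a minimal sketch, assuming a two-state weather model; the slide gives only the observations, so every probability in the code is a hypothetical value chosen for illustration.

```python
# Hypothetical two-state weather HMM. Every probability below is an
# illustrative assumption, not a value taken from the slides.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
EMIT = {
    "Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def forward(observations):
    """Total probability of the observations, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Return (probability, path) of the single most probable state path."""
    best = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_best = {}
        for s in STATES:
            # Pick the predecessor that maximizes the path probability into s.
            prob, path = max(
                ((best[p][0] * TRANS[p][s], best[p][1]) for p in STATES),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * EMIT[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])
```

In gene finding the same two computations apply with hidden states such as exon and intron in place of the weather, and nucleotides in place of the activities.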

Signals vs contents In gene finding, a small pattern within the genomic DNA is referred to as a signal, whereas a region of genomic DNA is a content. Examples of signals: splice sites, starts and ends of transcription or translation, branch points, transcription factor binding sites Examples of contents: exons, introns, UTRs, promoter regions

The CpG island problem Methylation in human genome –“CG” -> “TG” happens in most places except “start regions” of genes and within genes – CpG islands = 100-1,000 bases before a gene starts Question –Given a long sequence, how would we find the CpG islands in it?
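
One simple way to approach the question is a sliding-window scan that compares the observed number of CG dinucleotides with the number expected from the C and G counts. This is a sketch of that windowed heuristic, not the HMM formulation the slides lead toward; the function name and the default thresholds (200 bp window, GC fraction of at least 0.5, observed/expected ratio of at least 0.6) are assumptions for illustration.

```python
def cpg_windows(seq, window=200, step=1, min_gc=0.5, min_ratio=0.6):
    """Return (start, end) for every window that looks CpG-rich.

    Thresholds are illustrative assumptions, not values from the slides.
    """
    seq = seq.upper()
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        c = chunk.count("C")
        g = chunk.count("G")
        cpg = chunk.count("CG")          # observed CG dinucleotides
        gc_fraction = (c + g) / window
        expected = c * g / window        # expected CG count if C and G were independent
        ratio = cpg / expected if expected else 0.0
        if gc_fraction >= min_gc and ratio >= min_ratio:
            hits.append((start, start + window))
    return hits
```

Overlapping hits would normally be merged into a single island; that step is left out to keep the sketch short.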

Promoters Promoters are DNA segments upstream of transcripts that initiate transcription Promoter attracts RNA Polymerase to the transcription start site 5’ Promoter 3’

Splice signals (mice): GT, AG

Splice site detection [Table: percentage of each nucleotide at each position around the donor site, 5’ to 3’]

Real splice sites Real splice sites show some conservation at positions beyond the first two. We can add additional arrows to model these states. weblogo.berkeley.edu

Ribosomal Binding Site

Prior knowledge The translated region must have a length that is a multiple of 3. Some codons are more common than others. Exons are usually shorter than introns. The translated region begins with a start signal and ends with a stop codon. 5’ splice sites (exon to intron) are usually GT; 3’ splice sites (intron to exon) are usually AG. The distribution of nucleotides and dinucleotides is usually different in introns and exons.

Gene Prediction and Motifs Upstream regions of genes often contain motifs that can be used for gene prediction [Figure: upstream region showing the Pribnow box (TATACT) at -10, the transcription start site, the ribosomal binding site (TTCCAAGGAGG), the ATG start codon, and the STOP codon]

Positional dependence In this data, every time a “G” appears in position 1, an “A” appears in position 3. Conversely, an “A” in position 1 always occurs with a “T” in position 3. ACTGACTTGCACACTTACTAGCATACTAACTTACTGACTTGCACACTTACTAGCATACTAACTT

Example of (Positional) Weight Matrix Computed by measuring the frequency of every element at every position of the site (weight) The score for any putative site is the sum of the matrix values (frequencies converted to log-probabilities) for that sequence (log-likelihood score) Disadvantages: –cut-off value required –assumes independence between adjacent bases Example sites: TACGAT TATAAT GATACT TATGAT TATGTT (matrix rows: A, C, G, T)
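
As a sketch of how such a matrix can be built and used, the code below counts bases per position for the five example sites and converts the counts to log-odds scores. The pseudocount of 1 and the uniform background of 0.25 are assumptions added for illustration; the slide does not specify them.

```python
import math

SITES = ["TACGAT", "TATAAT", "GATACT", "TATGAT", "TATGTT"]
ALPHABET = "ACGT"


def count_matrix(sites):
    """counts[base][i] = number of sites with `base` at position i."""
    length = len(sites[0])
    counts = {base: [0] * length for base in ALPHABET}
    for site in sites:
        for i, base in enumerate(site):
            counts[base][i] += 1
    return counts


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Log2 odds of each base per position against a uniform background.

    The pseudocount and background are illustrative assumptions.
    """
    counts = count_matrix(sites)
    total = len(sites) + pseudocount * len(ALPHABET)
    return {
        base: [math.log2((n + pseudocount) / total / background) for n in row]
        for base, row in counts.items()
    }


def score(site, matrix):
    """Sum of per-position entries; positions are treated as independent."""
    return sum(matrix[base][i] for i, base in enumerate(site))
```

For the example sites, `count_matrix(SITES)["A"]` is `[0, 5, 0, 2, 3, 0]`, so position 2 is always A. A candidate site would be accepted when `score` exceeds a chosen cut-off, which is the first disadvantage listed above.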

Conditional probability What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position? GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing an “A” at position 2, given that we observed a “C” at the previous position? Answer: total number of CA’s divided by total number of C’s in position 1. 3/11 = 27% Probability of observing CA = 3/18 = 17%. GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position? GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG

Conditional probability What is the probability of observing a “G” at position 3, given that we observed a “C” at the previous position? Answer: 9/12 = 75%. GCG CAG CCG GCG CCG CCG GCG CCT CCG GGG CGG GCG AGG CAG CCT CAT CCT GCG
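
The two answers can be checked by counting directly over the 18 triplets. This is a small sketch; the helper name `conditional` is hypothetical and positions are 0-based in the code (so the slides' position 2 is index 1).

```python
TRIPLETS = (
    "GCG CAG CCG GCG CCG CCG GCG CCT CCG "
    "GGG CGG GCG AGG CAG CCT CAT CCT GCG"
).split()


def conditional(seqs, pos, base, given_base):
    """Counts behind P(base at pos | given_base at pos - 1).

    Returns (numerator, denominator) so the fraction stays visible.
    """
    with_prev = [s for s in seqs if s[pos - 1] == given_base]
    matches = [s for s in with_prev if s[pos] == base]
    return len(matches), len(with_prev)


# P(A at position 2 | C at position 1) -> 3 of 11
a_given_c = conditional(TRIPLETS, 1, "A", "C")
# P(G at position 3 | C at position 2) -> 9 of 12
g_given_c = conditional(TRIPLETS, 2, "G", "C")
```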

Promoter Structure in Prokaryotes (E. coli) Transcription starts at offset 0. Pribnow Box (-10) Gilbert Box (-30) Ribosomal Binding Site (+10)

Detect potential coding regions by looking at ORFs –A genome of length n is comprised of (n/3) codons –Stop codons break genome into segments between consecutive Stop codons –The subsegments of these that start from the Start codon (ATG) are ORFs ORFs in different frames may overlap Genomic Sequence Open reading frame ATGTGA Open Reading Frames (ORFs)
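
The ORF definition above can be sketched as a scan of all six reading frames (three per strand). This is a minimal illustration, assuming ATG as the only start codon and the three standard stop codons; the function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq):
    """Yield (frame, start, end) for ATG...stop ORFs in the three frames.

    `end` is exclusive and includes the stop codon. Only the first ATG
    after the previous stop is reported for each ORF.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i
            elif codon in STOPS and start is not None:
                yield frame, start, i + 3
                start = None


def find_orfs(seq):
    """ORFs on both strands as (strand, frame, start, end).

    Coordinates on the '-' strand refer to the reverse-complemented string.
    """
    seq = seq.upper()
    found = [("+", *orf) for orf in orfs_in_strand(seq)]
    found += [("-", *orf) for orf in orfs_in_strand(reverse_complement(seq))]
    return found
```

Because each frame is scanned independently, ORFs found in different frames or on opposite strands can overlap, as the slide notes.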

A long open reading frame may be a gene. At random, we should expect one stop codon every (64/3) ~= 21 codons. However, genes are usually much longer than this. A basic approach is to scan for ORFs whose length exceeds a certain threshold. This is naïve because some genes (e.g. some neural and immune system genes) are relatively short Long vs. Short ORFs
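
The arithmetic behind the length test can be made explicit. The sketch below assumes all 64 codons are equally likely in random sequence, which is the same simplification the slide uses; real genomes have skewed base composition, so the numbers are only indicative.

```python
def expected_codons_between_stops():
    """3 of the 64 codons are stops, so one stop is expected every 64/3 codons."""
    return 64 / 3


def prob_no_stop(n_codons):
    """Chance that n random codons (uniform over 64) contain no stop codon."""
    return (61 / 64) ** n_codons
```

Under this assumption a stop-free run of 100 codons arises by chance with probability below 1%, which is why long ORFs are treated as evidence of a gene.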

Testing ORFs: Codon Usage Create a 64-element hash table and count the frequencies of codons in an ORF Amino acids typically have more than one codon, but in nature certain codons are used more often Uneven use of the codons may characterize a real gene This compensates for the pitfalls of the ORF length test
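
The 64-element table can be sketched with a `Counter`. The function name is hypothetical, and the ORF is assumed to be given in frame 0.

```python
from collections import Counter
from itertools import product

# All 64 codons, so that unused codons appear in the table with frequency 0.
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(orf):
    """Relative frequency of each of the 64 codons, reading `orf` in frame 0."""
    orf = orf.upper()
    codons = [orf[i:i + 3] for i in range(0, len(orf) - 2, 3)]
    counts = Counter(codons)
    total = len(codons) or 1  # avoid division by zero on very short input
    return {codon: counts[codon] / total for codon in ALL_CODONS}
```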

Open Reading Frames in Bacteria Without introns, look for a long open reading frame (start codon ATG, …, stop codon TAA, TAG, TGA) Short genes are missed (<300 nucleotides) Shadow genes (overlapping open reading frames on opposite DNA strands) are hard to detect Some genes use UUG, AUA, UUA or CUG as the start codon In some genes TGA encodes selenocysteine and is not a stop codon

Coding Statistics Unequal usage of codons in the coding regions is a universal feature of the genomes –uneven usage of amino acids in existing proteins –uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs) We can use this feature to differentiate between coding and non-coding regions of the genome Coding statistics - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein

Coding Statistics Many different ones –codon usage –hexamer usage –GC content –compositional bias between codon positions –nucleotide periodicity –…

Codon Usage in Human Genome

Codon Usage in Mouse Genome [Table with columns AA, codon, /1000, frac; rows shown: Ser (TCG, TCA, TCT, TCC, AGT, AGC), Pro (CCG, CCA, CCT, CCC), Leu (CTG, CTA, CTT, CTC), Ala (GCG, GCA, GCT, GCC), Gln (CAG, CAA)]

Codon Usage and Likelihood Ratio An ORF is more “believable” than another if it has more “likely” codons Do sliding window calculations to find ORFs that have the “likely” codon usage Allows for higher precision in identifying true ORFs; much better than merely testing for length. However, average vertebrate exon length is 130 nucleotides, which is often too small to produce reliable peaks in the likelihood ratio Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
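
A sliding-window likelihood ratio can be sketched as follows. The two frequency tables are inputs: `coding_freq` would come from known genes and `background_freq` from non-coding sequence. The function names, the default window of 96 bases, and the floor value for unseen codons are assumptions for illustration.

```python
import math


def log_likelihood_ratio(seq, coding_freq, background_freq, floor=1e-6):
    """Sum over in-frame codons of log(P_coding / P_background).

    Codons missing from a table get the small `floor` probability
    (an assumption that keeps the logarithm defined).
    """
    total = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        p = coding_freq.get(codon, floor)
        q = background_freq.get(codon, floor)
        total += math.log(p / q)
    return total


def sliding_scores(seq, coding_freq, background_freq, window=96, step=3):
    """(start, score) per window; `window` should be a multiple of 3."""
    seq = seq.upper()
    return [
        (start, log_likelihood_ratio(seq[start:start + window],
                                     coding_freq, background_freq))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Positive scores mark windows whose codons look more like coding sequence than background. With a step of 3 only one reading frame is scored, so the scan would be repeated from offsets 1 and 2 to cover the other frames.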

Splicing Signals Try to recognize the location of splicing signals at exon-intron junctions. This has yielded a weakly conserved donor splice site and acceptor splice site. Profiles for these sites are still weak, which lends the problem to Hidden Markov Model (HMM) approaches that capture the statistical dependencies between sites

Donor and Acceptor Sites: GT and AG dinucleotides The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AG dinucleotides. Detecting these sites is difficult, because GT and AG appear very often [Figure: exon 1, intron beginning with GT (donor site) and ending with AG (acceptor site), exon 2]

A more realistic (and complex) HMM model for Gene Prediction (Genie)

Assessing performance: Sensitivity & Specificity Testing of predictions is performed on sequences where the gene structure is known Sensitivity is the fraction of known genes (or bases or exons) correctly predicted –“Am I finding the things that I’m supposed to find” Specificity is the fraction of predicted genes (or bases or exons) that correspond to true genes –“What fraction of my predictions are true?” In general, increasing one decreases the other

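Measures of Prediction Accuracy, Part 1 Nucleotide level accuracy: Sensitivity $Sn = \frac{TP}{TP + FN}$, Specificity $Sp = \frac{TP}{TP + FP}$ [Figure: REALITY and PREDICTION tracks aligned base by base, with regions labeled TP, FP, TN, FN] Exon level accuracy: Sensitivity = number of correct exons / number of actual exons, Specificity = number of correct exons / number of predicted exons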
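
At the nucleotide level these measures can be computed by comparing two per-base annotations. The sketch below follows the definitions given earlier in the slides (sensitivity as the fraction of true coding bases that are predicted, specificity as the fraction of predicted coding bases that are true); the string encoding with 'C' for coding and 'N' for non-coding is an assumption for illustration.

```python
def nucleotide_accuracy(actual, predicted):
    """Sensitivity and specificity from two per-base annotation strings.

    Both strings use 'C' for a coding base and 'N' for a non-coding base
    (a hypothetical encoding chosen for this example).
    """
    if len(actual) != len(predicted):
        raise ValueError("annotations must have the same length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == "C" and p == "C" for a, p in pairs)  # coding, predicted coding
    fn = sum(a == "C" and p == "N" for a, p in pairs)  # coding, missed
    fp = sum(a == "N" and p == "C" for a, p in pairs)  # non-coding, predicted coding
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity
```

Note that specificity in this gene-finding sense is what other fields call precision.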

Measures of Prediction Accuracy, Part 2 Exon level accuracy REALITY PREDICTION WRONG EXON CORRECT EXON MISSING EXON

Graphic View of Specificity and Sensitivity

Quantifying the tradeoff: Correlation Coefficient
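
The slide gives only the title, so the exact formula shown is not recoverable. A common single-number summary in gene-finder evaluation is the correlation coefficient over the four nucleotide-level counts (the Matthews correlation coefficient); the sketch below assumes that is the measure intended.

```python
import math


def correlation_coefficient(tp, tn, fp, fn):
    """Correlation between prediction and reality, from -1 to 1.

    Assumes the Matthews form:
    (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN)).
    Returns 0.0 when the denominator is zero (undefined case).
    """
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike sensitivity or specificity alone, this value cannot be pushed up by predicting everything, or nothing, as coding.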

Examples of Gene Finders FGENES – linear discriminant function (DF) for content and signal sensors and dynamic programming (DP) for finding the optimal combination of exons; GeneMark – HMMs enhanced with ribosomal binding site recognition; Genie – neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM; Genscan – weight matrices (WM), weight arrays (WA) and decision trees as signal sensors, HMMs for content sensors, overall HMM; HMMgene – HMM trained using conditional maximum likelihood; Morgan – decision trees for exon classification, also Markov Models; MZEF – quadratic DF, predicts only internal exons

Ab initio Gene Finding is Difficult Genes are separated by large intergenic regions Genes are not continuous, but split into a number of (small) coding exons, separated by (larger) non-coding introns –in humans coding sequences comprise only a few percent of the genome and an average of 5% of each gene Sequence signals that are essential for elucidating a gene structure are degenerate and highly unspecific Alternative splicing Repeat elements (>50% in humans) – some contain coding regions

Problems with Ab initio Gene Finding No biological evidence In long genomic sequences many false positive predictions Prediction accuracy high, but not sufficient

Integrated Approaches for Gene Finding Programs that integrate results of similarity searches with ab initio techniques (GenomeScan, FGENESH+, Procrustes) Programs that use synteny between organisms (ROSETTA, SLAM) Integration of programs predicting different elements of a gene (EuGène) Combining predictions from several gene finding programs (combination of experts)

Combining Programs’ Predictions The set of methods used and the way they are integrated differ between individual programs. Different programs often predict different elements of an actual gene, so they could complement each other, yielding a better prediction

Related Work This approach was suggested by several authors Burset and Guigó (1996) –Investigated correlation between 9 gene-finding programs –99% of exons predicted by all programs were correct –1% of exons completely missed by all programs Murakami and Takagi (1998) –Five methods for combining the predictions of 4 gene-finding programs –Nucleotide level accuracy measures improved by 3-5% in comparison with the best single program

AND and OR Methods exon 1 exon 2 union intersection
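
The union (OR) and intersection (AND) of two programs' exon predictions can be sketched at the base level. The representation of exons as half-open `(start, end)` intervals and the function names are assumptions for illustration.

```python
def to_positions(exons):
    """Expand half-open (start, end) intervals into a set of base positions."""
    return {pos for start, end in exons for pos in range(start, end)}


def to_intervals(positions):
    """Collapse a set of positions back into sorted half-open intervals."""
    intervals = []
    for pos in sorted(positions):
        if intervals and intervals[-1][1] == pos:
            intervals[-1][1] = pos + 1   # extend the current run
        else:
            intervals.append([pos, pos + 1])
    return [tuple(iv) for iv in intervals]


def or_method(exons_a, exons_b):
    """Bases predicted as exon by either program (union)."""
    return to_intervals(to_positions(exons_a) | to_positions(exons_b))


def and_method(exons_a, exons_b):
    """Bases predicted as exon by both programs (intersection)."""
    return to_intervals(to_positions(exons_a) & to_positions(exons_b))
```

The OR method tends to raise sensitivity at the cost of specificity, and the AND method does the reverse. Expanding to per-base sets is simple but memory-hungry; an interval sweep would be preferable for genome-scale input.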

Combining Genscan and HMMgene High prediction accuracy as well as reliability of their exon probability made them the best candidates for our study Genscan predicted 77% of exons correctly, HMMgene 75%, both 87% [Figure: overlap of Genscan and HMMgene predictions]