Eukaryotic Gene Finding with GlimmerHMM Mihaela Pertea Assistant Research Scientist CBCB.


Outline
* Brief overview of the eukaryotic gene finding problem
* GlimmerHMM architecture: signal sensors, coding statistics, GHMMs
* Training GlimmerHMM
* GlimmerHMM results

Eukaryotic Gene Finding Goals
Given an uncharacterized DNA sequence, find out:
–Which regions code for proteins?
–Which DNA strand is used to encode each gene?
–Where does each gene start and end?
–Where are the exon-intron boundaries in eukaryotes?
Overall accuracy is usually below 50%.

The Problem
Given a string $S$ over the alphabet {A,C,G,T}, find the “optimal” parse of $S$ (with respect to some coding score function): $S = s_1, s_2, \ldots, s_n$. Here, each $s_i$ represents a coding or a non-coding subsequence of $S$.

Gene Finding: Different Approaches
* Similarity-based methods. These use similarity to annotated sequences such as proteins, cDNAs, or ESTs (e.g. Procrustes, GeneWise).
* Ab initio gene finding. These methods don’t use external evidence to predict sequence structure (e.g. GlimmerHMM, GeneZilla, Genscan, SNAP).
* Comparative (homology-based) gene finders. These align genomic sequences from different species and use the alignments to guide the gene predictions (e.g. TWAIN, SLAM, TWINSCAN, SGP-2).
* Integrated approaches. These combine multiple forms of evidence, such as the predictions of other gene finders (e.g. Jigsaw, EuGène, Gaze).

Why ab-initio gene prediction? Ab initio gene finders can predict novel genes not clearly homologous to any previously known gene.

Eukaryotic Gene Finding with Parse Graphs
1. Build a parse graph. A parse graph represents all (or all high-scoring) open reading frames. Each vertex is a signal and each edge is a feature such as an exon or intron. Coding statistics and signal sensors are integrated in a mathematical gene model using machine learning techniques: HMMs/GHMMs, decision trees, neural networks, etc.
2. Find the highest-scoring path through the parse graph, usually using dynamic programming to implicitly enumerate and score all possible parses and choose the maximal-scoring one.
Whereas most gene finders give only the highest-scoring gene model, GlimmerHMM’s parse graph can be used to explore sub-optimal gene models. When GlimmerHMM’s prediction is not exactly correct, the true gene model is often one of the top few sub-optimal parses.
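
The two steps above can be sketched as a longest-path search over a small directed acyclic graph. This is a minimal illustration, not GlimmerHMM's implementation: the function name `best_parse`, the vertex names, and all edge scores are hypothetical, and the vertices are assumed to be listed in left-to-right order.

```python
from collections import defaultdict

def best_parse(vertices, edges):
    """Highest-scoring path from the first to the last vertex of a parse graph.

    vertices: signal names in left-to-right (topological) order.
    edges: (source, target, feature, score) tuples, one per candidate feature.
    """
    incoming = defaultdict(list)
    for src, dst, feature, score in edges:
        incoming[dst].append((src, feature, score))

    best = {vertices[0]: 0.0}  # best score of any path reaching each vertex
    back = {}                  # vertex -> (predecessor, feature) on that best path
    for v in vertices[1:]:
        for src, feature, score in incoming[v]:
            if src in best and best[src] + score > best.get(v, float("-inf")):
                best[v] = best[src] + score
                back[v] = (src, feature)

    end = vertices[-1]
    if end not in best:
        raise ValueError("no complete parse reaches the final vertex")
    path, v = [], end
    while v in back:  # follow back-pointers to recover the features
        v, feature = back[v]
        path.append(feature)
    return best[end], path[::-1]

# Hypothetical graph: one single-exon parse and one two-exon parse.
vertices = ["left", "ATG", "GT", "AG", "TAA", "right"]
edges = [
    ("left", "ATG", "intergenic", 0.0),
    ("ATG", "GT", "initial exon", 2.5),
    ("GT", "AG", "intron", 1.0),
    ("AG", "TAA", "terminal exon", 3.0),
    ("ATG", "TAA", "single exon", 4.0),
    ("TAA", "right", "intergenic", 0.0),
]
score, parse = best_parse(vertices, edges)
```

Keeping the `best` table for every vertex, rather than only the final answer, is what makes it possible to read off sub-optimal parses afterwards.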

Signal Sensors
Signals: short sequence patterns in the genomic DNA that are recognized by the cellular machinery.

Efficient Decoding via Signal Sensors
[Figure: sensors 1 through n detect putative signals (ATGs, GTs, AGs, ...) during a left-to-right pass over the sequence and insert them into type-specific signal queues; trellis links connect a newly detected signal to elements of a queue such as the “ATG” queue.]

The Notion of “Eclipsing”
[Figure: the sequence ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT with an in-frame stop codon marked.]
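
To make the marked stop codon concrete, here is a small sketch that locates the first stop codon in frame with a chosen start position in the slide's sequence. The helper name `first_in_frame_stop` is hypothetical, and the code only finds the stop; it does not model how a gene finder uses it.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def first_in_frame_stop(seq, start):
    """Index of the first stop codon in frame with `start`, or None if there is none."""
    for i in range(start + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None

seq = "ATGGATGCTACTTGACGTACTTAACTTACCGATCTCT"
stop_after_first_atg = first_in_frame_stop(seq, 0)   # ATG at index 0
stop_after_second_atg = first_in_frame_stop(seq, 4)  # ATG at index 4, different frame
```

Reading from the ATG at index 0 gives the codons ATG GAT GCT ACT TGA, so the open reading frame ends after four codons, whereas the ATG at index 4 sits in another frame and meets no stop before the sequence ends.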

Identifying Signals in DNA with a Signal Sensor
We slide a fixed-length model or “window” along the DNA and evaluate score(signal) at each point. When the score is greater than some threshold (determined empirically to result in a desired sensitivity), we remember this position as the potential site of a signal.
The most common signal sensor is the weight matrix. Column distributions shown in the example matrix: A = 100%; A = 31%, T = 28%, C = 21%, G = 20%; T = 100%; G = 100%; A = 18%, T = 32%, C = 24%, G = 26%; A = 19%, T = 20%, C = 29%, G = 32%; A = 24%, T = 18%, C = 26%, G = 32%.
[Figure: the window positioned over the sequence …ACTGATGCGCGATTAGAGTCATGGCGATGCATCTAGCTAGCTATATCGCGTAGCTAGCTAGCTGATCTACTATCGTAGC…]
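
The sliding-window procedure can be sketched as follows. The four-column matrix is an assumption made for illustration: it places one flanking column (using the 31/28/21/20% distribution from the slide) in front of the fixed A, T, G columns, and the column order in the original figure may differ. The function name `scan` and the test sequence are hypothetical.

```python
# Assumed layout: one flanking column followed by the fixed A, T, G columns.
WEIGHT_MATRIX = [
    {"A": 0.31, "T": 0.28, "C": 0.21, "G": 0.20},
    {"A": 1.0, "T": 0.0, "C": 0.0, "G": 0.0},
    {"A": 0.0, "T": 1.0, "C": 0.0, "G": 0.0},
    {"A": 0.0, "T": 0.0, "C": 0.0, "G": 1.0},
]

def scan(seq, matrix, threshold):
    """Slide the matrix along seq; return (position, score) where score > threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = 1.0
        for column, base in zip(matrix, seq[i:i + width]):
            score *= column[base]  # probability of this base at this window position
        if score > threshold:
            hits.append((i, score))
    return hits

hits = scan("CCATGGATGA", WEIGHT_MATRIX, threshold=0.0)
strict_hits = scan("CCATGGATGA", WEIGHT_MATRIX, threshold=0.205)
```

Raising the threshold from 0.0 to 0.205 drops the weaker of the two candidate sites, which is the sensitivity trade-off the slide refers to.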

Signal Sensors in GlimmerHMM
Given a signal $X$ of fixed length $\lambda$, estimate the distributions:
* $p^{+}(X)$ = the probability that $X$ is a signal
* $p^{-}(X)$ = the probability that $X$ is not a signal
The score of the signal is then computed from these two distributions.
Example sequences:
…GGCTAGTCATGCCAAACGCGG…
…AAACCTAGTATGCCCACGTTGT…
…ACCCAGTCCCATGACCACACACAACC…
…ACCCTGTGATGGGGTTTTAGAAGGACTC…

Start and stop codon scoring
Score all potential start/stop codons within a window of length 19. The probability of generating the sequence is given by a WAM (weight array matrix) model, i.e. an inhomogeneous Markov model.
[Figure: CATCCACCATGGAGAACCACCATGG, illustrating the Kozak consensus.]
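
For reference, a first-order inhomogeneous Markov model over a 19-base window is conventionally written as below. This is the standard textbook form, given here as an assumption about what the slide showed; the symbols $x_i$ (base at window position $i$) and $p^{(i)}$ (position-specific distribution) are notation introduced for this sketch.

```latex
P(X) = p^{(1)}(x_1) \prod_{i=2}^{19} p^{(i)}(x_i \mid x_{i-1})
```

Each position has its own conditional distribution, which is what distinguishes this model from a plain weight matrix, where positions are treated as independent.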

Splice site prediction
The splice site score is a combination of:
* first- or second-order inhomogeneous Markov models on windows around the acceptor and donor sites
* MDD decision trees
* longer Markov models to capture the difference between coding and noncoding on opposite sides of the site (optional)
* maximal splice site score within 60 bp (optional)
[Figure: windows of 16 bp and 24 bp around the splice sites.]

Coding-noncoding Boundaries
A key observation regarding splice sites and start and stop codons is that all of these signals delimit the boundaries between coding and noncoding regions within genes (although the situation becomes more complex in the case of alternative splicing). One might therefore consider weighting a signal score by some function of the scores produced by the coding and noncoding content sensors applied to the regions immediately 5′ and 3′ of the putative signal.

Local Optimality Criterion
When identifying putative signals in DNA, we may choose to completely ignore low-scoring candidates in the vicinity of higher-scoring candidates. The purpose of the local optimality criterion is to apply such a weighting in cases where two putative signals are very close together, with the chosen weight being 0 for the lower-scoring signal and 1 for the higher-scoring one.
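
One way to read this criterion is as a filter over candidate signals of the same type. The sketch below is an interpretation, not GlimmerHMM's code: the name `locally_optimal`, the `min_distance` parameter, and the example positions and scores are all hypothetical.

```python
def locally_optimal(signals, min_distance):
    """Keep a (position, score) signal unless a higher-scoring one lies within min_distance."""
    kept = []
    for pos, score in signals:
        dominated = any(
            other_pos != pos
            and abs(other_pos - pos) <= min_distance
            and other_score > score
            for other_pos, other_score in signals
        )
        if not dominated:  # weight 1: keep; dominated signals get weight 0
            kept.append((pos, score))
    return kept

candidates = [(100, 2.1), (103, 3.4), (150, 1.2), (152, 0.9)]
kept = locally_optimal(candidates, min_distance=5)
```

Signals that tie in score are both kept under this reading; a real implementation would need an explicit tie-breaking rule.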

Maximal Dependence Decomposition (MDD)
Rather than using one weight array matrix for all splice sites, MDD differentiates between splice sites in the training set based on the bases around the AG/GT consensus. Each leaf has a different WAM trained from a different subset of splice sites. The tree is induced empirically for each genome.
[Figure: Arabidopsis thaliana MDD trees.]

MDD splitting criterion
MDD uses the $\chi^2$ measure between the variable $K_i$, representing the consensus at position $i$ in the sequence, and the variable $N_j$, which indicates the nucleotide at position $j$:
$\chi^2_{i,j} = \sum_{x,y} \frac{(O_{x,y} - E_{x,y})^2}{E_{x,y}}$
where $O_{x,y}$ is the observed count of the event that $K_i = x$ and $N_j = y$, and $E_{x,y}$ is the value of this count expected under the null hypothesis that $K_i$ and $N_j$ are independent. Split if the $\chi^2$ value is significant at the cutoff P = 0.001 with 3 degrees of freedom.
Example: GAATGGA GAATGAA TATTGGA GAGTGGC GCATGCT AGATGGG CACTGGA GAATGTA
[Table: observed (O) and expected (E) counts of $N_{+5}$ (T, G, C, A) for $K_{-2}$ = A, $K_{-2}$ = [CGT], and all sequences, giving $\chi^2_{-2,5} = 2.9$.]
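
The statistic can be checked on the eight example sequences. The sketch below assumes that $K_{-2}$ refers to the third base of each sequence (consensus A) and $N_{+5}$ to the seventh base; that mapping is an assumption, chosen because it gives 2.88, which rounds to the 2.9 on the slide. The function name `chi_square` is hypothetical.

```python
from collections import Counter

def chi_square(seqs, i, consensus, j):
    """Chi-square between K_i (base i matches the consensus?) and N_j (base at j)."""
    observed = Counter((s[i] == consensus, s[j]) for s in seqs)
    row_totals = Counter(s[i] == consensus for s in seqs)
    col_totals = Counter(s[j] for s in seqs)
    n = len(seqs)
    total = 0.0
    for match in row_totals:
        for base in col_totals:
            expected = row_totals[match] * col_totals[base] / n  # independence assumption
            total += (observed[(match, base)] - expected) ** 2 / expected
    return total

seqs = ["GAATGGA", "GAATGAA", "TATTGGA", "GAGTGGC",
        "GCATGCT", "AGATGGG", "CACTGGA", "GAATGTA"]
chi2 = chi_square(seqs, i=2, consensus="A", j=6)  # 0-based indices
```

With two rows (match or no match) and four columns (one per nucleotide), the table has (2 - 1)(4 - 1) = 3 degrees of freedom, in line with the "3df" on the slide.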

Splice Site Scoring
Donor/acceptor sites at location $k$:
$DS(k) = S_{comb}(k,16) + (S_{cod}(k-80) - S_{nc}(k-80)) + (S_{nc}(k+2) - S_{cod}(k+2))$
$AS(k) = S_{comb}(k,24) + (S_{nc}(k-80) - S_{cod}(k-80)) + (S_{cod}(k+2) - S_{nc}(k+2))$
where
* $S_{comb}(k,i)$ = score computed by the Markov model/MDD method using a window of $i$ bases
* $S_{cod/nc}(j)$ = score of the coding/noncoding Markov model for the 80 bp window starting at $j$
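
The two formulas translate directly into code once the three scoring functions are available. In this sketch the scorers are passed in as callables; the toy scorers at the bottom (`toy_comb`, `toy_cod`, `toy_nc`) are hypothetical stand-ins that pretend everything left of position 500 is coding.

```python
def donor_score(k, s_comb, s_cod, s_nc):
    """DS(k): reward coding upstream and noncoding downstream of a donor site."""
    return (s_comb(k, 16)
            + (s_cod(k - 80) - s_nc(k - 80))
            + (s_nc(k + 2) - s_cod(k + 2)))

def acceptor_score(k, s_comb, s_cod, s_nc):
    """AS(k): reward noncoding upstream and coding downstream of an acceptor site."""
    return (s_comb(k, 24)
            + (s_nc(k - 80) - s_cod(k - 80))
            + (s_cod(k + 2) - s_nc(k + 2)))

# Hypothetical scorers: coding-like sequence before position 500, noncoding after.
def toy_comb(k, window):
    return 1.0

def toy_cod(j):
    return 2.0 if j < 500 else 0.5

def toy_nc(j):
    return 0.5 if j < 500 else 2.0

ds = donor_score(500, toy_comb, toy_cod, toy_nc)
acc = acceptor_score(500, toy_comb, toy_cod, toy_nc)
```

At an exon-to-intron boundary the donor score is boosted and the acceptor score is penalized, which is the asymmetry the sign pattern in the formulas encodes.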

Trade-off between False-Positive Rates and False-Negative Rates
[Figure: false negatives (%) on train and test data plotted against false positives (%).]
Arabidopsis thaliana data:
Data set              FN         FP
Acceptor train file   (7.00%)    8921 (2.16%)
Donor train file      (7.01%)    7163 (2.05%)
Acceptor test file    (10.06%)   1060 (2.67%)
Donor test file       (10.16%)   497 (1.47%)

Coding Statistics
* Unequal usage of codons in the coding regions is a universal feature of genomes.
* We can use this feature to differentiate between coding and noncoding regions of the genome.
* Coding statistic: a function that, for a given DNA sequence, computes a likelihood that the sequence codes for a protein.
* Many different ones exist (codon usage, hexamer usage, GC content, Markov chains, IMM, ICM).

3-periodic ICMs
A three-periodic ICM uses three ICMs in succession to evaluate the different codon positions, which have different statistics.
[Figure: the sequence ATC GAT CGA TCA GCT TAT CGC ATC scored by ICM 0, ICM 1 and ICM 2 in turn, giving P[C|M_0], P[G|M_1], P[A|M_2].]
The three ICMs correspond to the three phases. Every base is evaluated in every phase, and the score for a given stretch of (putative) coding DNA is obtained by multiplying the phase-specific probabilities in a mod 3 fashion.
GlimmerHMM uses 3-periodic ICMs for coding and homogeneous (non-periodic) ICMs for noncoding DNA.
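
The mod 3 bookkeeping can be illustrated with three simple stand-in models. Real ICMs condition each base on a context of preceding bases; the context-free distributions below are a deliberate simplification, and their values, like the name `periodic_log_score`, are hypothetical.

```python
import math

# Stand-ins for ICM 0, ICM 1 and ICM 2: one base distribution per codon position.
PHASE_MODELS = [
    {"A": 0.4, "C": 0.1, "G": 0.3, "T": 0.2},
    {"A": 0.3, "C": 0.2, "G": 0.1, "T": 0.4},
    {"A": 0.1, "C": 0.2, "G": 0.5, "T": 0.2},
]

def periodic_log_score(seq, models, phase=0):
    """Log of the product of phase-specific probabilities; base i uses model (phase + i) mod 3."""
    return sum(math.log(models[(phase + i) % 3][base]) for i, base in enumerate(seq))

score_phase0 = periodic_log_score("ATG", PHASE_MODELS, phase=0)
score_phase1 = periodic_log_score("ATG", PHASE_MODELS, phase=1)
```

Summing logarithms is equivalent to multiplying the probabilities and avoids numerical underflow on long exons.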

The Advantages of Periodicity and Interpolation

HMMs and Gene Structure
* Nucleotides {A,C,G,T} are the observables.
* Different states generate nucleotides at different frequencies.
A simple HMM for unspliced genes:
[Figure: intergenic, start (ATG), coding and stop (TAA) states generating AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG.]
The sequence of states is an annotation of the generated string: each nucleotide is generated in an intergenic, start/stop, or coding state.

Recall: “Pure” HMMs
An HMM is a stochastic machine $M = (Q, \alpha, P_t, P_e)$ consisting of the following:
* a finite set of states, $Q = \{q_0, q_1, \ldots, q_m\}$
* a finite alphabet $\alpha = \{s_0, s_1, \ldots, s_n\}$
* a transition distribution $P_t : Q \times Q \to [0,1]$, i.e., $P_t(q_j \mid q_i)$
* an emission distribution $P_e : Q \times \alpha \to [0,1]$, i.e., $P_e(s_j \mid q_i)$
An Example
$M_1 = (\{q_0, q_1, q_2\}, \{Y, R\}, P_t, P_e)$
$P_t = \{(q_0,q_1,1), (q_1,q_1,0.8), (q_1,q_2,0.15), (q_1,q_0,0.05), (q_2,q_2,0.7), (q_2,q_1,0.3)\}$
$P_e = \{(q_1,Y,1), (q_1,R,0), (q_2,Y,0), (q_2,R,1)\}$
[Figure: state diagram of $M_1$ with the transition percentages above; $q_1$ emits Y = 100%, R = 0% and $q_2$ emits R = 100%, Y = 0%.]
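
The example machine $M_1$ is small enough to evaluate by hand or in a few lines. The sketch below computes the joint probability of a state path and the symbols it emits, under the assumption that $q_0$ acts as a silent start and end state; the function name `joint_probability` is hypothetical.

```python
P_T = {("q0", "q1"): 1.0, ("q1", "q1"): 0.8, ("q1", "q2"): 0.15,
       ("q1", "q0"): 0.05, ("q2", "q2"): 0.7, ("q2", "q1"): 0.3}
P_E = {("q1", "Y"): 1.0, ("q1", "R"): 0.0, ("q2", "Y"): 0.0, ("q2", "R"): 1.0}

def joint_probability(path, symbols):
    """P(path, symbols) for M1: leave q0, visit `path` emitting `symbols`, return to q0."""
    prob, prev = 1.0, "q0"
    for state, symbol in zip(path, symbols):
        # Missing table entries are transitions or emissions with probability 0.
        prob *= P_T.get((prev, state), 0.0) * P_E.get((state, symbol), 0.0)
        prev = state
    return prob * P_T.get((prev, "q0"), 0.0)

p = joint_probability(["q1", "q1", "q2", "q2", "q1"], "YYRRY")
p_bad = joint_probability(["q1", "q2"], "YR")  # q2 cannot return to q0
```

Because each state in $M_1$ emits exactly one symbol with certainty, the emitted string determines the state path, so this joint probability is also the probability of the string.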

HMMs & Geometric Feature Lengths
[Figure: exon length distribution compared with a geometric distribution.]
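
The geometric shape follows from the self-transition of a single HMM state. If a state returns to itself with probability $p$ and leaves with probability $1 - p$ (the symbol $p$ is notation introduced here), the number of symbols $d$ it emits in one visit is distributed as:

```latex
P(d) = p^{\,d-1}(1 - p), \qquad d = 1, 2, \ldots, \qquad \mathbb{E}[d] = \frac{1}{1 - p}
```

For state $q_1$ of the example machine $M_1$, $p = 0.8$, so the expected run length is $1 / 0.2 = 5$ symbols, and shorter runs are always more probable than longer ones.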

Length Distributions in Human
Feature lengths were computed for human chromosome 22 with RefSeq annotation (as of July 2005).

Generalized Hidden Markov Models
Advantages:
* Submodel abstraction
* Architectural simplicity
* State duration modeling
Disadvantages:
* Decoding complexity

Generalized HMMs
A GHMM is a stochastic machine $M = (Q, \alpha, P_t, P_e, P_d)$ consisting of the following:
* a finite set of states, $Q = \{q_0, q_1, \ldots, q_m\}$
* a finite alphabet $\alpha = \{s_0, s_1, \ldots, s_n\}$
* a transition distribution $P_t : Q \times Q \to [0,1]$, i.e., $P_t(q_j \mid q_i)$
* an emission distribution $P_e : Q \times \alpha^{*} \times \mathbb{N} \to [0,1]$, i.e., $P_e(s_j \mid q_i, d_j)$
* a duration distribution $P_d : Q \times \mathbb{N} \to [0,1]$, i.e., $P_d(d_j \mid q_i)$
Key Differences
* each state now emits an entire subsequence rather than just one symbol
* feature lengths are now explicitly modeled, rather than implicitly geometric
* emission probabilities can now be modeled by any arbitrary probabilistic model
* there tend to be far fewer states => simplicity & ease of modification
Ref: Kulp D, Haussler D, Reese M, Eeckman F (1996) A generalized hidden Markov model for the recognition of human genes in DNA. ISMB '96.

Recall: Decoding with an HMM
[Equation: the decoding objective, with its emission and transition probability terms labeled.]

Decoding with a GHMM
[Equation: the decoding objective, with its emission, transition and duration probability terms labeled.]
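
For reference, the two decoding objectives are conventionally written as below. This is the standard form, offered as an assumption about what the slides showed. Here $\phi$ denotes the parse (the state sequence), $x_i$ the $i$-th symbol of a sequence of length $L$, and $S_i$ the subsequence of length $d_i$ emitted by the $i$-th of $n$ states; these symbols are notation chosen for this sketch.

```latex
\text{HMM:} \quad
\phi^{*} = \arg\max_{\phi} \prod_{i=1}^{L} P_e(x_i \mid q_i)\, P_t(q_i \mid q_{i-1})

\text{GHMM:} \quad
\phi^{*} = \arg\max_{\phi} \prod_{i=1}^{n} P_e(S_i \mid q_i, d_i)\, P_t(q_i \mid q_{i-1})\, P_d(d_i \mid q_i)
```

The only structural change is the extra duration factor and the fact that each emission term now covers a whole subsequence, which is also why GHMM decoding is more expensive.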

Gene Prediction with a GHMM
Given a sequence $S$, we would like to determine the parse $\phi$ of that sequence which segments the DNA into the most likely exon/intron structure.
[Figure: the sequence S = AGCTAGCAGTCGATCATGGCATTATCGGCCGTAGTACGTAGCAGTAGCTAGTAGCAGTCGATAGTAGCATTATCGGCCGTAGCTACGTAGCGTAGCTC with the predicted parse $\phi$ marking exon 1, exon 2 and exon 3.]
The parse $\phi$ consists of the coordinates of the predicted exons, and corresponds to the precise sequence of states during the operation of the GHMM (and their durations, which equal the number of symbols each state emits). This is the same as in an HMM except that in the HMM each state emits bases with fixed probability, whereas in the GHMM each state emits an entire feature such as an exon or intron.

GHMMs Summary
* GHMMs generalize HMMs by allowing each state to emit a subsequence rather than just a single symbol.
* Whereas HMMs model all feature lengths using a geometric distribution, coding features can be modeled using an arbitrary length distribution in a GHMM.
* Emission models within a GHMM can be any arbitrary probabilistic model (“submodel abstraction”), such as a neural network or decision tree.
* GHMMs tend to have many fewer states => simplicity & modularity.

GlimmerHMM architecture
[Figure: state diagram with an Intergenic state and, for the forward (+) and backward (-) strands, the states Init Exon, Exon0, Exon1, Exon2, Term Exon and Exon Sngl plus the introns I0, I1 and I2. Labels: “Phase-specific introns”, “Four exon types”.]
* Uses a GHMM to model gene structure (explicit length modeling)
* WAM and MDD for splice sites
* ICMs for exons, introns and intergenic regions
* Different model parameters for regions with different GC content
* Can emit a graph of high-scoring ORFs

Training the Gene Finder
$\theta = (P_t, P_e, P_d)$

Training for GHMMs
* estimate via labeled training data
* construct a histogram of observed feature lengths
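
Building a duration distribution from a histogram of observed lengths might look like the sketch below. The additive pseudocount is one simple smoothing choice, assumed here for illustration; the function name `length_distribution`, its parameters, and the sample lengths are hypothetical.

```python
from collections import Counter

def length_distribution(lengths, max_length, pseudocount=1.0):
    """Estimate P_d(d) for d = 1..max_length from observed feature lengths."""
    if any(d < 1 or d > max_length for d in lengths):
        raise ValueError("observed length outside 1..max_length")
    counts = Counter(lengths)
    # Every possible length receives the pseudocount, so unseen lengths keep
    # a small nonzero probability and the distribution still sums to 1.
    total = len(lengths) + pseudocount * max_length
    return {d: (counts[d] + pseudocount) / total for d in range(1, max_length + 1)}

dist = length_distribution([3, 3, 4, 6], max_length=6)
```

With only four observations the raw histogram would assign probability zero to lengths 1, 2 and 5, which would forbid any parse containing a feature of those lengths.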

The Need to Train Organism-Specific Gene Finders

Gene Finding in the Dark: Dealing with Small Sample Sizes
–parameter mismatching: train on a close relative
–use a comparative GF trained on a close relative
–use BLAST to find conserved genes & curate them, use as training set
–augment training set with genes from related organisms, use weighting
–manufacture artificial training data
  * long ORFs
–be sensitive to sample sizes during training by reducing the number of parameters (to reduce overtraining)
  * fewer states (1 vs. 4 exon states, intron=intergenic)
  * lower-order models
–pseudocounts
–smoothing (esp. for length distributions)

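SLOP = Separate Local Optimization of Parameters
[Figure: a set G of 1000 genes is split into train (800) and test (200); donors, acceptors, starts, stops, exons, introns and intergenic regions from the training set feed train-model, which produces the model files; evaluation on the test set gives the reported accuracy.]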

GRAPE = GRadient Ascent Parameter Estimation
[Figure: a set T of 1000 genes is split into train (800) and test (200); MLE produces model files and control parameters, gradient ascent uses evaluation accuracy on the test set (“peeking”) to produce the final model files, and a final evaluation on 1000 unseen genes gives the reported accuracy.]

Evaluation of Gene Finding Programs
Nucleotide level accuracy
[Figure: REALITY and PREDICTION tracks aligned, with each region labeled TN, FP, FN or TP; sensitivity and specificity are computed from these counts.]
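
A nucleotide-level evaluation can be sketched as below. It assumes the convention common in gene finding, where sensitivity is TP / (TP + FN) and "specificity" is TP / (TP + FP); the accuracy is taken as the mean of the two, which matches the Nuc Acc column reported for the human data. The function name and the toy annotation strings are hypothetical.

```python
def nucleotide_accuracy(real, predicted):
    """Sn, Sp and their mean from two equal-length strings ('C' coding, 'N' noncoding)."""
    if len(real) != len(predicted):
        raise ValueError("annotations must have the same length")
    pairs = list(zip(real, predicted))
    tp = sum(r == "C" and p == "C" for r, p in pairs)  # coding, predicted coding
    fn = sum(r == "C" and p == "N" for r, p in pairs)  # coding, missed
    fp = sum(r == "N" and p == "C" for r, p in pairs)  # noncoding, predicted coding
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return sn, sp, (sn + sp) / 2

sn, sp, acc = nucleotide_accuracy("NNCCCCNNNN", "NNCCCNNNNN")
```

True negatives do not enter either ratio, which keeps the measures meaningful even though most of a genome is noncoding.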

More Measures of Prediction Accuracy
Exon level accuracy
[Figure: REALITY and PREDICTION tracks aligned, illustrating a WRONG EXON, a CORRECT EXON and a MISSING EXON.]

GlimmerHMM on human data
             Nuc Sens  Nuc Spec  Nuc Acc  Exon Sens  Exon Spec  Exon Acc  Exact Genes
GlimmerHMM   86%       72%       79%      72%        62%        67%       17%
Genscan      86%       68%       77%      69%        60%        65%       13%
GlimmerHMM’s performance compared to Genscan on 963 human RefSeq genes selected randomly from all 24 chromosomes, non-overlapping with the training set. The test set contains 1000 bp of untranslated sequence on either side (5' or 3') of the coding portion of each gene.

GlimmerHMM on other species
                          Nucleotide Level   Exon Level    Correctly Predicted   Size of
                          Sn      Sp         Sn     Sp     Genes                 test set
Arabidopsis thaliana      97%     99%        84%    89%    60%                   809 genes
Cryptococcus neoformans   96%     99%        86%    88%    53%                   350 genes
Coccidioides posadasii    99%                84%    86%    60%                   503 genes
Oryza sativa              95%     98%        77%    80%    37%                   1323 genes
GlimmerHMM is also trained on: Aspergillus fumigatus, Entamoeba histolytica, Toxoplasma gondii, Brugia malayi, Trichomonas vaginalis, and many others.

GlimmerHMM is a high-performance ab initio gene finder
Arabidopsis thaliana test results
[Table: nucleotide-, exon- and gene-level Sn, Sp and Acc for GlimmerHMM, SNAP and Genscan.]
All three programs were tested on a test data set of 809 genes, which did not overlap with the training data set of GlimmerHMM. All genes were confirmed by full-length Arabidopsis cDNAs and carefully inspected to remove homologues.