Gene finding with GeneMark.HMM (Lukashin & Borodovsky, 1997) CS 466 Saurabh Sinha

Gene finding in bacteria
A large number of bacterial genomes had been sequenced (10 at the time of the paper, 1997)
Previous work: the "GeneMark" program identified a gene as an ORF that looks more like genes than non-genes
– Uses Markov chains of coding and non-coding sequence
The 5' (start) boundary was not well predicted
– Resolution of the start point: ~100 nucleotides

GeneMark.hmm
Builds on GeneMark, but uses an HMM for better prediction of start and stop
Given a DNA sequence S = {b1, b2, …, bL}
Find the "functional sequence" A = {a1, …, aL}, where each ai = 0 if non-coding, 1 if coding on the forward strand, 2 if coding on the reverse strand
This sounds like the Fair Bet Casino problem (a sequence of coin types, "fair" or "biased")
Find Pr(A | S) and report the A that maximizes it

The functional sequence A carries information about where the sequence switches from coding to non-coding (the end of a gene) and vice versa
Model the sequence with an HMM that has different states for "coding" and "non-coding"
The maximum-likelihood A is the optimal path through the HMM, given the sequence
The Viterbi algorithm solves this problem

Hidden Markov Model
In some states, choose (i) a length of sequence to emit and (ii) the sequence to emit
This is different from the Fair Bet Casino problem: there, each state emitted exactly one observation (H or T)

Hidden Markov Model
"Typical" and "Atypical" gene states (one of each for the forward and reverse strands)
These states emit coding sequence (between, and excluding, the start and stop codons) with different codon-usage patterns
Clustering of E. coli genes showed that:
– the majority of genes belong to one cluster ("Typical")
– many genes, believed to have been horizontally transferred into the genome, belong to another cluster ("Atypical")

Hidden State Trajectory A
This is similar to the "functional" sequence defined earlier
– except that there is one entry per state, not one per nucleotide
A sequence of M hidden states ai, each having duration di:
– A = {(a1, d1), (a2, d2), …, (aM, dM)}
– ∑ di = L
Find the A* that maximizes Pr(A | S)

Formulation
Find the trajectory (path) A that has the highest probability of occurring jointly with the sequence S
Maximizing Pr(A, S) is the same as maximizing Pr(A | S). Why?
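The "why" can be worked out from the definition of conditional probability: Pr(A, S) = Pr(A | S) Pr(S), and Pr(S) does not depend on the choice of A, so dividing by it does not change the maximizer:

```latex
A^* = \arg\max_A \Pr(A \mid S)
    = \arg\max_A \frac{\Pr(A, S)}{\Pr(S)}
    = \arg\max_A \Pr(A, S)
```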

Solution
The maximization problem is solved by the Viterbi algorithm (seen in a previous lecture)

Solution
A* = argmax_A Pr(A, S), maximizing over all possible trajectories A

Solution
Define (for dynamic programming): the joint probability of a partial trajectory of m states (with the last state being a_m) and a partial sequence of length l
Each term in the recurrence is a product of a transition probability, a duration probability, and a probability of the emitted sequence

Solution
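The dynamic program above can be sketched as a Viterbi recurrence over (state, duration) pairs, where each step multiplies a transition, duration, and emission probability. This is a minimal illustration of the technique, not the paper's implementation: the two-state model, its probabilities, and the uniform duration distribution below are all invented for the example.

```python
import math

def viterbi_durations(seq, states, trans, dur_prob, emit_logprob, max_dur):
    """Semi-Markov Viterbi: each hidden state emits a whole segment.
    best[l][a]: log-prob of the best trajectory whose segments cover
    seq[0:l] and whose last segment is in state a."""
    L = len(seq)
    best = [{a: -math.inf for a in states} for _ in range(L + 1)]
    back = [{a: None for a in states} for _ in range(L + 1)]
    for l in range(1, L + 1):
        for a in states:
            for d in range(1, min(max_dur, l) + 1):
                start = l - d
                seg = (math.log(dur_prob(a, d))
                       + emit_logprob(a, seq[start:l]))
                if start == 0:
                    # first segment: no incoming transition
                    if seg > best[l][a]:
                        best[l][a], back[l][a] = seg, (None, d)
                else:
                    for b in states:
                        s = best[start][b] + math.log(trans[b][a]) + seg
                        if s > best[l][a]:
                            best[l][a], back[l][a] = s, (b, d)
    # traceback: recover the (state, duration) trajectory
    a = max(states, key=lambda x: best[L][x])
    l, path = L, []
    while l > 0:
        b, d = back[l][a]
        path.append((a, d))
        l, a = l - d, b
    return list(reversed(path))

# toy two-state model: X prefers 'A', Y prefers 'B' (all values invented)
emit_p = {'X': {'A': 0.9, 'B': 0.1}, 'Y': {'A': 0.1, 'B': 0.9}}
toy_path = viterbi_durations(
    "AAABBB", ['X', 'Y'],
    trans={'X': {'X': 0.5, 'Y': 0.5}, 'Y': {'X': 0.5, 'Y': 0.5}},
    dur_prob=lambda a, d: 1 / 3,          # uniform over durations 1..3
    emit_logprob=lambda a, seg: sum(math.log(emit_p[a][c]) for c in seg),
    max_dur=3)
```

Switching between states costs a transition and a fresh duration probability, so the algorithm prefers few long segments over many short ones when the emissions allow it.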

Parameters of the HMM
Transition probability distributions, emission probability distributions
These were fixed a priori
– What was the other possibility? Learning the parameters from data
Emission probabilities of the coding-sequence states were obtained from previous statistical studies: "What does a coding sequence look like in general?"
Emission probabilities of non-coding sequence were obtained similarly

Parameters of the HMM
The probability that a state a has duration d (i.e., that the length of its emission is d) is learned from the frequency distribution of lengths of known coding sequences

Parameters of the HMM … and non-coding sequences
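Estimating such a duration distribution can be sketched as building a normalized histogram of observed lengths; the lengths below are invented for illustration, not the E. coli training data.

```python
from collections import Counter

def empirical_duration(lengths):
    """Pr(duration = d) as the normalized frequency of length d
    among known examples (coding or non-coding regions)."""
    counts = Counter(lengths)
    total = sum(counts.values())
    return {d: c / total for d, c in counts.items()}

# toy lengths (in nucleotides) of "known" coding regions
coding_dur = empirical_duration([300, 300, 600, 900])
```

In practice the raw histogram would need smoothing, since most individual lengths never occur in a finite training set.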

Parameters of the HMM
Emission probabilities of the start codon were fixed from previous studies:
– Pr(ATG) = 0.905, Pr(GTG) = 0.090, Pr(TTG) = 0.005
Transition probabilities: non-coding to Typical / Atypical coding state = 0.85 / 0.15

Post-processing
As modeled by the HMM, two genes cannot overlap. In reality, genes may overlap!
(Slide figure: two overlapping genes, G1 and G2)

Post-processing
As modeled by the HMM, two genes cannot overlap. In reality, genes may overlap!
(Slide figure: genes G1 and G2; the HMM will predict the second gene to begin where the first ends)
What about the start codon for that second gene?

Post-processing
As modeled by the HMM, two genes cannot overlap. In reality, genes may overlap!
(Slide figure: genes G1 and G2; look for an RBS in the overlap region)
Take each candidate start codon and find an RBS at positions -19 to -4 bp upstream of it

Ribosome binding site (RBS)

How to search for an RBS?
Take 325 genes from E. coli (a bacterium) with known RBSs
Align them using sequence alignment
Use the resulting alignment as a PWM to scan for RBSs
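Scanning with a PWM can be sketched as follows; the 4-column matrix and its scores are invented for illustration, not the matrix built from the 325 aligned E. coli sites.

```python
def pwm_score(pwm, window):
    """Score a window: one column of the PWM per position."""
    return sum(pwm[i][c] for i, c in enumerate(window))

def best_rbs(pwm, region):
    """Slide the PWM across the region; return the best offset and score."""
    w = len(pwm)
    score, pos = max((pwm_score(pwm, region[i:i + w]), i)
                     for i in range(len(region) - w + 1))
    return pos, score

# toy 4-column PWM favoring the Shine-Dalgarno-like core "AGGA"
# (log-odds-style scores invented for the example)
pwm = [{'A': 1.0, 'C': -1.0, 'G': -1.0, 'T': -1.0},
       {'A': -1.0, 'C': -1.0, 'G': 1.0, 'T': -1.0},
       {'A': -1.0, 'C': -1.0, 'G': 1.0, 'T': -1.0},
       {'A': 1.0, 'C': -1.0, 'G': -1.0, 'T': -1.0}]

pos, score = best_rbs(pwm, "TTTAGGATTT")
```

Restricting the scan to the -19 to -4 window upstream of a candidate start codon, as in the post-processing step, just means passing that subsequence as `region`.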

Gene prediction in different species
The coding- and non-coding-state emission probabilities need to be trained on data from each species in order to predict genes in that species

Gene prediction accuracy
Data set #1: all annotated E. coli genes
Data set #2: non-overlapping genes
Data set #3: genes with known RBSs
Data set #4: genes with known start positions

Results
VA: Viterbi algorithm; PP: with post-processing
(Results table shown on slide)

Results
Gene overlap is an important factor: performance goes up from 58% to 71% when overlapping genes are excluded from the data set
Post-processing helps a lot: 58% to 75% for data set #1
Missing genes ("false negatives"): < 5%
"Wrong" gene predictions ("false positives"): ~8%
– Are they really false positives, or are they unannotated genes?

Results
Comparison with other programs (table shown on slide)

Results
Robustness to parameter settings: an alternative set of transition probability values was used
Little change in performance (a ~20% change in parameter values leads to a < 5% change in performance)

Higher Order Markov models
Sequence emissions were modeled by a second-order Markov chain:
– Pr(Xi | Xi-1, Xi-2, …, X1) = Pr(Xi | Xi-1, Xi-2)
The authors examined the effect of changing the Markov order (0, 1, 3, 4, 5)
Even a zeroth-order Markov chain does fairly well
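Training and scoring a k-th order Markov chain of emissions can be sketched as follows (k = 2 gives the second-order chain above); the training sequence is a toy example, not real coding sequence.

```python
import math
from collections import defaultdict

def train_markov(seqs, k):
    """Estimate Pr(next base | previous k bases) by counting contexts."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return {ctx: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for ctx, nxt in counts.items()}

def loglik(probs, s, k):
    """Log-likelihood of s under the chain (the first k bases are unscored)."""
    return sum(math.log(probs[s[i - k:i]][s[i]]) for i in range(k, len(s)))

model = train_markov(["ACGACGACG"], 2)   # second-order chain (k = 2)
```

A zeroth-order chain (k = 0) reduces to independent base frequencies, which is why it can still separate coding from non-coding reasonably well when base composition differs.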

Higher Order Markov models