Deepak Verghese, CS 6890. Gene Finding with a Hidden Markov Model of Genomic Structure and Evolution, by Jakob Skou Pedersen and Jotun Hein.


Existing comparative gene-finding methods
- GPHMM (generalized pair HMM)
- Conserved exon method
- Two-step method: GLASS and ROSETTA
- TWINSCAN, which extends GENSCAN
- etc.

- They do not exploit all the information in the evolutionary pattern.
- They are not easily extended to multiple genome sequences.

Evolutionary Hidden Markov Model (EHMM)
A probabilistic model of both genome structure and evolution, composed of:
1. A hidden Markov model (HMM)
2. A phylogenetic tree

- Can handle any number of sequences in an alignment.
- Can have the properties of higher-order HMMs.
- Can handle variability in the sequences along the alignment.
- State-of-the-art evolutionary models can be incorporated later.
- Evolutionary events between different genomes are not treated independently.

SCOPE
- The aim is not to compete with existing gene-finding methods on performance, but to illustrate the power of this approach.
- The method relies on a pre-produced alignment.

MARKOV CHAINS
- A set of states.
- The transitions from one state to all other states, including itself, are governed by a probability distribution.
- First-order Markov chain: the probabilities depend solely on the current state.
- n-th order Markov chain: the probabilities depend on the n previous states.
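The first-order case can be sketched in a few lines of Python. The two states ("R" for purine, "Y" for pyrimidine) and all probabilities are illustrative assumptions, not values from the slides.

```python
import math

# Hypothetical two-state chain; the numbers are made up for illustration.
INITIAL = {"R": 0.5, "Y": 0.5}
TRANSITIONS = {
    "R": {"R": 0.9, "Y": 0.1},   # each row is a distribution over next states
    "Y": {"R": 0.5, "Y": 0.5},
}

def chain_log_probability(states, initial=INITIAL, transitions=TRANSITIONS):
    """Log-probability of a state sequence under a first-order Markov chain."""
    if not states:
        return 0.0
    logp = math.log(initial[states[0]])
    # First order: each step depends only on the immediately preceding state.
    for prev, curr in zip(states, states[1:]):
        logp += math.log(transitions[prev][curr])
    return logp
```

An n-th order chain could reuse the same function by treating each tuple of n consecutive states as a single compound state.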

HIDDEN MARKOV MODEL
Five components:
- A set of states
- A matrix of transition probabilities (A)
- An alphabet of symbols (C)
- A set of emission distributions (e)
- An initial state distribution (B)

Example of a hidden Markov model
[Figure: HMM built from the example sequences ACA ATG, TCA ACT ATC, ACA C-- AGC, AGA ATC, ACC G-- ATC]
Why the name "hidden"? There is no 1:1 correspondence between states and symbols.

Components of the probabilistic model
- State k
- Emitted symbols (observables) C
- Emission distribution e
- Initial state distribution B
- Transition probabilities A

Path Π
Different paths are possible for the same sequence.
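A minimal sketch of how the components A, e and B combine along one path Π. The two states, the nucleotide alphabet and every probability in `toy` are hypothetical values chosen for illustration; only the component names come from the slides.

```python
from dataclasses import dataclass

@dataclass
class HMM:
    states: list   # set of states
    A: dict        # A[k][l]: transition probability from state k to state l
    e: dict        # e[k][c]: probability that state k emits symbol c
    B: dict        # B[k]: initial state distribution

def joint_probability(hmm, path, symbols):
    """P(symbols, path): product of initial, transition and emission terms."""
    if len(path) != len(symbols):
        raise ValueError("path and symbols must have the same length")
    if not path:
        return 1.0
    p = hmm.B[path[0]] * hmm.e[path[0]][symbols[0]]
    for i in range(1, len(path)):
        p *= hmm.A[path[i - 1]][path[i]] * hmm.e[path[i]][symbols[i]]
    return p

# Hypothetical toy model (assumption, not from the slides).
toy = HMM(
    states=["exon", "intron"],
    A={"exon": {"exon": 0.8, "intron": 0.2},
       "intron": {"exon": 0.3, "intron": 0.7}},
    e={"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
       "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}},
    B={"exon": 0.5, "intron": 0.5},
)
```

Calling `joint_probability(toy, ["exon", "exon"], "AC")` and `joint_probability(toy, ["intron", "intron"], "AC")` gives two different non-zero values for the same sequence, which is why the path is called hidden.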

In the EHMM, the emission distribution e is specified by:
- the evolutionary model E_k
- the phylogenetic tree T

PHYLOGENETIC TREES

Motivation: the problem of explaining the evolutionary history of today's species.
In phylogenetic trees:
- Leaves represent present-day species.
- Interior nodes represent hypothesized ancestors.
- Character states of interior nodes are missing data.
- The lengths of the branches of a tree represent evolutionary distance.

Evolution is often modelled by continuous-time Markov chains.
- Here, evolution along the branches of the phylogenetic tree is modelled by E_k.
- For a branch of length t, the transition probability matrix is P_k(t) = exp(t Q_k), where Q_k is the rate matrix of E_k.
- Increasing the number of sequences increases the amount of evolutionary information.
- An alignment column corresponds to the state of evolution at the leaves of the phylogenetic tree.
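As a sketch of the relation P(t) = exp(tQ), the following pure-Python code approximates the matrix exponential with a truncated Taylor series plus scaling and squaring. The Jukes-Cantor rate matrix is an assumed stand-in for Q_k, chosen only because its transition probabilities have a known closed form to check against; the slides do not say which rate matrices are used.

```python
import math

def mat_mul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=20, squarings=10):
    """Approximate exp(M): Taylor series of M / 2**squarings, then repeated squaring."""
    n = len(M)
    scale = 2 ** squarings
    S = [[M[i][j] / scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]          # holds S**k / k!
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jukes_cantor_rate_matrix(alpha):
    """Assumed example Q: every substitution occurs at rate alpha."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)] for i in range(4)]

def transition_matrix(Q, t):
    """P(t) = exp(t * Q) for a branch of length t."""
    return mat_exp([[t * q for q in row] for row in Q])
```

Each row of the resulting matrix is a probability distribution over the character at the end of the branch, given the character at its start.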

[Figure: phylogenetic trees of the entries of the 3 alignment columns]
The probability of generating an alignment column in state k equals the probability of observing the given character pattern on the leaves of T, given E_k.
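One standard way to compute the probability of a character pattern on the leaves of a tree is Felsenstein's pruning algorithm; the slides do not name the algorithm, so its use here is an assumption. The sketch also assumes a Jukes-Cantor nucleotide model with a uniform equilibrium distribution in place of E_k, and a hypothetical tree encoding: a leaf is a species name, an interior node is a list of (child, branch_length) pairs. Gaps are treated as missing data.

```python
import math

NUCLEOTIDES = "ACGT"

def jc_prob(a, b, t, alpha=1.0):
    """Jukes-Cantor probability of a -> b along a branch of length t (assumed model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def conditional_likelihoods(node, column):
    """Return {x: P(leaf pattern below node | node carries character x)}."""
    if isinstance(node, str):                      # leaf: observed character
        observed = column[node]
        if observed == "-":                        # gap treated as missing data
            return {x: 1.0 for x in NUCLEOTIDES}
        return {x: 1.0 if x == observed else 0.0 for x in NUCLEOTIDES}
    result = {x: 1.0 for x in NUCLEOTIDES}
    for child, t in node:                          # interior node: combine children
        below = conditional_likelihoods(child, column)
        for x in NUCLEOTIDES:
            result[x] *= sum(jc_prob(x, y, t) * below[y] for y in NUCLEOTIDES)
    return result

def column_probability(tree, column):
    """Probability of one alignment column, summing over the unknown root character."""
    root = conditional_likelihoods(tree, column)
    return sum(0.25 * root[x] for x in NUCLEOTIDES)   # uniform equilibrium under JC

# Hypothetical two-species tree with illustrative branch lengths.
EXAMPLE_TREE = [("human", 0.1), ("mouse", 0.1)]
```

For example, `column_probability(EXAMPLE_TREE, {"human": "A", "mouse": "A"})` gives the emission probability of that column under this single assumed model; in the EHMM each state k would supply its own E_k.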

- A codon-based evolutionary model is used to calculate the emission probability of the columns of A.
- A nucleotide-based evolutionary model is used to calculate the emission probability of column B.
- The emission probability of C is obtained from the equilibrium distribution of the relevant evolutionary model.

Parameter Estimation
- The parameters of the HMM are estimated by a combination of Baum-Welch and Powell.
- The evolutionary model E is divided into E_equ and E_evo.

- The initial state distribution B can be estimated by Baum-Welch, but it is generally set to 0 for all states except the intergenic state.
- The expectation step of Baum-Welch estimates:
  - the number of nucleotides emitted from each state
  - the expected number of state transitions
  - the expected number of times a state is used
- Powell's method, another optimization method, estimates E_evo and the phylogenetic tree T.
- The Baum-Welch method is used to estimate E_equ and A.
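The expected counts of the expectation step can be sketched with the forward-backward recursions for a plain HMM. This is a simplified illustration under assumptions: symbols are single characters rather than alignment columns, no scaling or log-space arithmetic is used (so it is only suitable for short inputs), and all names and numbers are hypothetical.

```python
def expected_counts(symbols, states, A, e, B):
    """Return (usage, transitions): expected state usage and transition counts."""
    n = len(symbols)
    if n == 0:
        raise ValueError("symbols must be non-empty")
    # Forward pass: fwd[i][k] = P(symbols[0..i], state at position i is k).
    fwd = [{k: B[k] * e[k][symbols[0]] for k in states}]
    for c in symbols[1:]:
        prev = fwd[-1]
        fwd.append({l: e[l][c] * sum(prev[k] * A[k][l] for k in states)
                    for l in states})
    # Backward pass: bwd[i][k] = P(symbols[i+1..] | state at position i is k).
    bwd = [None] * n
    bwd[n - 1] = {k: 1.0 for k in states}
    for i in range(n - 2, -1, -1):
        nxt, c = bwd[i + 1], symbols[i + 1]
        bwd[i] = {k: sum(A[k][l] * e[l][c] * nxt[l] for l in states)
                  for k in states}
    likelihood = sum(fwd[-1].values())
    usage = {k: sum(fwd[i][k] * bwd[i][k] for i in range(n)) / likelihood
             for k in states}
    transitions = {
        k: {l: sum(fwd[i][k] * A[k][l] * e[l][symbols[i + 1]] * bwd[i + 1][l]
                   for i in range(n - 1)) / likelihood
            for l in states}
        for k in states
    }
    return usage, transitions

# Hypothetical toy parameters (assumption, not from the slides).
STATES = ["intergenic", "exon"]
A = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
     "exon": {"intergenic": 0.2, "exon": 0.8}}
E = {"intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
     "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}
B = {"intergenic": 1.0, "exon": 0.0}
```

The expected usages sum to the sequence length and the expected transition counts sum to one less; the maximization step would renormalize these counts into new estimates.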

Likelihood
- The likelihood of an alignment x given a parameterization θ of the EHMM is P(x | θ) = Σ_Π P(x, Π | θ), where the sum runs over all possible paths Π.
- Dynamic programming computes this sum in time linear in the length of the alignment.
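The dynamic-programming computation is usually done with the forward algorithm. The sketch below assumes a plain HMM over single characters and hypothetical toy parameters; in the EHMM the symbols would be alignment columns. A brute-force sum over every path is included only to check the result on short inputs.

```python
from itertools import product

def forward_likelihood(symbols, states, A, e, B):
    """P(symbols) summed over all paths, in time linear in len(symbols)."""
    if not symbols:
        return 1.0
    f = {k: B[k] * e[k][symbols[0]] for k in states}
    for c in symbols[1:]:
        f = {l: e[l][c] * sum(f[k] * A[k][l] for k in states) for l in states}
    return sum(f.values())

def brute_force_likelihood(symbols, states, A, e, B):
    """Same quantity by enumerating every path; exponential, for checking only."""
    total = 0.0
    for path in product(states, repeat=len(symbols)):
        p = B[path[0]] * e[path[0]][symbols[0]]
        for i in range(1, len(symbols)):
            p *= A[path[i - 1]][path[i]] * e[path[i]][symbols[i]]
        total += p
    return total

# Hypothetical toy parameters (assumption, not from the slides).
STATES = ["exon", "intron"]
A = {"exon": {"exon": 0.8, "intron": 0.2},
     "intron": {"exon": 0.3, "intron": 0.7}}
E = {"exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
     "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
B = {"exon": 0.5, "intron": 0.5}
```

For long alignments a practical implementation would work in log space or rescale at each position to avoid numerical underflow.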

- The EHMM is fully probabilistic and can be used both to simulate data and to find genes.
- The eukaryotic genome model can be used to generate alignments.
- The reduced model produces only inner exons.
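Because the model is generative, data can be simulated by sampling a start state from B, then alternating emissions and transitions. This sketch assumes a plain HMM that emits single nucleotides rather than whole alignment columns, and all parameters are hypothetical.

```python
import random

def sample_from(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]

def simulate(length, A, e, B, seed=0):
    """Sample a state path and the symbols it emits."""
    rng = random.Random(seed)          # seeded for reproducible runs
    path, symbols = [], []
    state = sample_from(B, rng)
    for _ in range(length):
        path.append(state)
        symbols.append(sample_from(e[state], rng))
        state = sample_from(A[state], rng)
    return path, symbols

# Hypothetical toy parameters; the path always starts in the intergenic state.
A = {"intergenic": {"intergenic": 0.9, "exon": 0.1},
     "exon": {"intergenic": 0.2, "exon": 0.8}}
E = {"intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
     "exon": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}}
B = {"intergenic": 1.0, "exon": 0.0}
```

Simulated data of this kind is useful for checking that parameter estimation recovers known parameters.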

Results
- The benefits of modelling evolution with an EHMM were assessed using a data set of orthologous mouse/human gene pairs.
- The benefit will depend on the divergence between the sequences compared.
- The key parameter for modelling the difference between exons and introns is the dN/dS ratio.

Moreover, the evolutionary model shows a distinct difference between the intergenic/intron state and the codon state.

Evaluations were performed on both single and aligned sequences.

Graphical Representation

- The simple model used here is not yet comparable to state-of-the-art methods.
- Any number of aligned sequences can be handled.

Extensions of the model
- GENSCAN can be extended into an EHMM.
- Splice site finders
- Models of ribosome binding sites and promoter regions
- Non-geometric length distributions of exons
- A pseudo-higher-order EHMM can be constructed.
- The pair-HMM idea can be extended to multiple sequences.

Disadvantages of the present model
- The existing framework does not model gaps but treats them as missing data.
- The optimal data for an EHMM is a multiple alignment of full-length genomes.
- The challenge in constructing the alignment is to reduce the noise-to-signal ratio.
BUT ………..