Markov models and applications
Sushmita Roy
BMI/CS 576
Oct 7th, 2014

Readings
Chapter 3
– Section 3.1
– Section 3.2

Key concepts
Markov chains
– Computing the probability of a sequence
– Estimating parameters of a Markov model
Hidden Markov models
– States
– Emission and transition probabilities
– Parameter estimation
– Forward and backward algorithms
– Viterbi algorithm

Markov models
So far our models have assumed that there is no dependency between consecutive positions in a sequence. This assumption is rarely true in practice. Markov models allow us to model the dependencies inherent in sequential data such as DNA or protein sequences.

Markov models have wide applications
Genome annotation
– Given a genome sequence, find functional units of the genome: genes, CpG islands, promoters, …
Sequence classification
– A hidden Markov model can represent a family of proteins
Sequence alignment

Markov chain
A Markov chain is the simplest type of Markov model. It is described by
– A collection of states, each state representing a symbol observed in the sequence
– At each state a symbol is emitted
– At each state there is some probability of transitioning to another state

Detecting CpG islands
CpG islands are isolated regions of the genome with high concentrations of CG dinucleotides
– Range from a few hundred to a few thousand bps
– Found, for example, in regions upstream of genes or in gene promoters
Given a short sequence of DNA, how can we decide whether it is a CpG island? What kind of model should we use to represent a CpG island?

Markov chain model for DNA sequence
We know dinucleotides are important. We would like our model to capture the dependency between consecutive positions.

Markov chain model notation
$X_t$ is a random variable denoting the state at time $t$.
$a_{ij} = P(X_t = j \mid X_{t-1} = i)$ denotes the transition probability from state $i$ to state $j$.

A Markov chain model for DNA sequence
[state diagram: a begin state plus one state per nucleotide (A, C, G, T); arrows between states are transitions, labeled with transition probabilities]

Markov chain models
A Markov chain can also have an end state, which allows the model to represent
– A probability distribution over sequences of all possible lengths
– Preferences for ending sequences with certain symbols
[state diagram as before, now with both begin and end states]

Computing the probability of a sequence from a first-order Markov model
Let $X$ be a sequence of random variables $X_1 \ldots X_L$ representing a biological sequence. From the chain rule of probability:
$P(X) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2, X_1) \cdots P(X_L \mid X_{L-1}, \ldots, X_1)$

Computing the probability of a sequence from a first-order Markov model
Key property of a (1st-order) Markov chain: the probability of each $X_t$ depends only on the value of $X_{t-1}$:
$P(X_t \mid X_{t-1}, \ldots, X_1) = P(X_t \mid X_{t-1})$
so the probability of the whole sequence factorizes as
$P(X) = P(X_1) \prod_{t=2}^{L} P(X_t \mid X_{t-1})$
where $P(X_1)$ can be viewed either as the initial probabilities or as a transition probability from the "begin" state.

Example of computing the probability from a Markov chain
[state diagram with begin and end states over A, C, G, T]
What is P(CGGT)?
$P(CGGT) = P(C)\, P(G \mid C)\, P(G \mid G)\, P(T \mid G)\, P(\text{end} \mid T)$
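
To make the computation concrete, here is a minimal Python sketch (not from the slides); the begin, trans, and end dictionaries are uniform placeholders, not trained values.

```python
# Minimal sketch: scoring a sequence with a first-order Markov chain.
# The probabilities below are uniform placeholders, not trained values.
begin = {x: 0.25 for x in "acgt"}                       # P(x1): transitions out of "begin"
trans = {x: {y: 0.25 for y in "acgt"} for x in "acgt"}  # P(X_t = y | X_{t-1} = x)
end = {x: 0.1 for x in "acgt"}                          # P(end | last symbol)

def sequence_prob(seq, begin, trans, end=None):
    """P(seq) = P(x1) * prod_t P(x_t | x_{t-1}) [* P(end | x_L)]."""
    p = begin[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= trans[prev][cur]
    if end is not None:
        p *= end[seq[-1]]
    return p

print(sequence_prob("cggt", begin, trans, end))  # mirrors P(C)P(G|C)P(G|G)P(T|G)P(end|T)
```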

Learning Markov model parameters
Model parameters
– Initial probabilities $P(X_1)$
– Transition probabilities $P(X_t \mid X_{t-1})$
Estimated via maximum likelihood
– That is, if $D$ is the sequence data and $\theta$ are the parameters, $\theta$ is estimated to maximize the likelihood: $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta)$

Maximum likelihood estimation
Suppose we want to estimate the parameters $P(a)$, $P(c)$, $P(g)$, $P(t)$. Then the maximum likelihood estimates are
$P(a) = \frac{n_a}{n_a + n_c + n_g + n_t}$ (and similarly for $c$, $g$, $t$)
where $n_a$ is the number of occurrences of $a$ in the data.

Maximum likelihood estimation
Suppose we're given the sequences
accgcgctta
gcttagtgac
tagccgttac
We obtain the following maximum likelihood estimates:
$P(a) = 6/30 = 0.2$, $P(c) = 9/30 = 0.3$, $P(g) = 7/30 \approx 0.23$, $P(t) = 8/30 \approx 0.27$
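
These counts can be reproduced with a few lines of Python; this sketch just tallies characters and divides by the total.

```python
from collections import Counter

seqs = ["accgcgctta", "gcttagtgac", "tagccgttac"]
counts = Counter("".join(seqs))                       # n_a = 6, n_c = 9, n_g = 7, n_t = 8
total = sum(counts.values())                          # 30
mle = {base: counts[base] / total for base in "acgt"}
print(mle)  # {'a': 0.2, 'c': 0.3, 'g': 0.2333..., 't': 0.2666...}
```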

Maximum likelihood estimation
Suppose instead we saw the following sequences
gccgcgcttg
gcttggtggc
tggccgttgc
Then the maximum likelihood estimates are
$P(a) = 0/30 = 0$, $P(c) = 9/30 = 0.3$, $P(g) = 13/30 \approx 0.43$, $P(t) = 8/30 \approx 0.27$
Do we really want to set $P(a)$ to 0?

Laplace estimates of parameters
Instead of estimating parameters strictly from the data, we could start with some prior belief for each. For example, we could use Laplace estimates
$P(i) = \frac{n_i + 1}{\sum_j (n_j + 1)}$
where $n_i$ represents the number of occurrences of character $i$ and the added 1 is a pseudocount. Using Laplace estimates with the sequences
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
$P(a) = \frac{0+1}{34}$, $P(c) = \frac{9+1}{34}$, $P(g) = \frac{13+1}{34}$, $P(t) = \frac{8+1}{34}$
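
A sketch of the same computation with pseudocounts; the function name and its defaults are illustrative, not from the slides.

```python
from collections import Counter

def laplace_estimates(seqs, alphabet="acgt", pseudocount=1):
    """P(i) = (n_i + pseudocount) / sum_j (n_j + pseudocount)."""
    counts = Counter("".join(seqs))
    denom = sum(counts[b] + pseudocount for b in alphabet)
    return {b: (counts[b] + pseudocount) / denom for b in alphabet}

print(laplace_estimates(["gccgcgcttg", "gcttggtggc", "tggccgttgc"]))
# 'a' now gets (0 + 1) / 34 instead of 0
```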

Estimation for 1st-order probabilities
To estimate a 1st-order parameter, such as $P(c \mid g)$, we count the number of times that $c$ follows the history $g$ in our given sequences. Using Laplace estimates with the sequences
GCCGCGCTTG
GCTTGGTGGC
TGGCCGTTGC
$P(a \mid g) = \frac{0+1}{12+4}$, $P(c \mid g) = \frac{7+1}{12+4}$, $P(g \mid g) = \frac{3+1}{12+4}$, $P(t \mid g) = \frac{2+1}{12+4}$
(12 is the number of transitions out of $g$ in the data; the 4 pseudocounts are one per possible successor.)
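
For the 1st-order case the same idea applies to pair counts. A sketch, assuming the dict-of-dicts layout used earlier; the names are illustrative.

```python
from collections import defaultdict

def first_order_laplace(seqs, alphabet="acgt", pseudocount=1):
    """P(y | x) = (n_xy + k) / (n_x + k * |alphabet|) from pair counts."""
    pair_counts = defaultdict(int)
    for s in seqs:
        for x, y in zip(s, s[1:]):        # count each dinucleotide x -> y
            pair_counts[x, y] += 1
    probs = {}
    for x in alphabet:
        denom = sum(pair_counts[x, y] for y in alphabet) + pseudocount * len(alphabet)
        for y in alphabet:
            probs[x, y] = (pair_counts[x, y] + pseudocount) / denom
    return probs

probs = first_order_laplace(["gccgcgcttg", "gcttggtggc", "tggccgttgc"])
print(probs["g", "c"])  # (7 + 1) / (12 + 4) = 0.5
```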

Build a Markov chain model for CpG islands
Learn two Markov models
– One from sequences that look like CpG islands (positive)
– One from sequences that don't look like CpG islands (negative)
– Parameters estimated from ~60,000 nucleotides
Transition probabilities (the estimates reported in Durbin et al., Ch. 3):

CpG+ |  A     C     G     T
  A  | .180  .274  .426  .120
  C  | .171  .368  .274  .188
  G  | .161  .339  .375  .125
  T  | .079  .355  .384  .182

CpG- |  A     C     G     T
  A  | .300  .205  .285  .210
  C  | .322  .298  .078  .302
  G  | .248  .246  .298  .208
  T  | .177  .239  .292  .292

G is much more likely to follow A in the CpG-positive model (0.426 vs. 0.285).

Using the Markov chains to classify a new sequence
Let $y$ denote a new sequence. To use the Markov models to classify $y$, we compute the log-odds score
$S(y) = \log \frac{P(y \mid \text{CpG}+)}{P(y \mid \text{CpG}-)} = \sum_{t=2}^{L} \log \frac{a^{+}_{y_{t-1} y_t}}{a^{-}_{y_{t-1} y_t}}$
The larger the value of $S(y)$, the more likely it is that $y$ is a CpG island.
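
A sketch of the scoring rule; the two transition matrices here are toy placeholders (only the c-row is skewed to mimic CpG bias), not the trained CpG parameters.

```python
import math

# Toy matrices: each row sums to 1; only c->g / c->a are skewed.
trans_pos = {x: {y: 0.25 for y in "acgt"} for x in "acgt"}
trans_pos["c"] = {"a": 0.1, "c": 0.25, "g": 0.4, "t": 0.25}   # CpG+ favors c->g
trans_neg = {x: {y: 0.25 for y in "acgt"} for x in "acgt"}
trans_neg["c"] = {"a": 0.4, "c": 0.25, "g": 0.1, "t": 0.25}   # CpG- avoids c->g

def log_odds_score(seq, trans_pos, trans_neg):
    """S(y) = sum_t log( a+_{y_{t-1} y_t} / a-_{y_{t-1} y_t} )."""
    return sum(math.log(trans_pos[x][y] / trans_neg[x][y])
               for x, y in zip(seq, seq[1:]))

print(log_odds_score("cgcga", trans_pos, trans_neg))  # > 0: CpG-like under the toy model
```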

Distribution of scores for CpG and non-CpG examples
[histogram of $S(y)$ for the two classes]

Applying the CpG Markov chain models to classify new examples
Is CGCGA a CpG island?
– Score =
Is CCTGG a CpG island?
– Score = 0.3746

Extensions to Markov chains
– Hidden Markov models (next lectures)
– Higher-order Markov models
– Inhomogeneous Markov models

Order of a Markov model
The order describes how much history we want to keep.
– First order (most common): $P(X_t \mid X_{t-1})$
– Second order: $P(X_t \mid X_{t-1}, X_{t-2})$
– $n$th order: $P(X_t \mid X_{t-1}, \ldots, X_{t-n})$

Selecting the order of a Markov chain model
Higher-order models remember more "history". Additional history can have predictive value.
Example: predict the next word in this sentence fragment
"… the __" (duck, end, grain, tide, wall, …?)

Higher-order Markov chains
An $n$th-order Markov chain over some alphabet $A$ is equivalent to a first-order Markov chain over the alphabet $A^n$ of $n$-tuples.
For example, a 2nd-order Markov model for DNA can be treated as a 1st-order Markov model over the alphabet
AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT

A first-order Markov chain for a second-order Markov chain with the {A, B} alphabet
[state diagram with the four pair states AA, AB, BA, BB]
The sequence ABBAB is translated into the sequence of pairs AB, BB, BA, AB.
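
The translation into pair states is one line of Python; a sketch:

```python
def to_tuples(seq, n=2):
    """State sequence of the equivalent first-order chain over n-tuples."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

print(to_tuples("ABBAB"))  # ['AB', 'BB', 'BA', 'AB']
```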

A fifth-order Markov chain
[state diagram: states are 5-mers such as gctac, aaaaa, ctaca, ctacc, ctacg, ctact; a begin state enters gctac with probability P(gctac), and transitions such as P(a | gctac) and P(c | gctac) lead to the states ctaca and ctacc]

Order of a Markov model
The number of parameters grows exponentially with the order. Consider an alphabet of $A$ characters:
– For 1st order we need to estimate $A^2$ parameters
– For 2nd order we need to estimate $A^3$ parameters
– For $n$th order we need to estimate $A^{n+1}$ parameters
Higher-order models need more data to estimate their parameters, as the quick calculation below illustrates.
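
This small loop makes the growth concrete; alphabet sizes 4 and 20 stand in for DNA and protein.

```python
# Parameters for an n-th order chain: one next-symbol distribution per
# length-n history, i.e., A**n histories * A symbols = A**(n+1) values.
for A in (4, 20):          # DNA, protein alphabets
    for n in (1, 2, 5):
        print(f"|A| = {A}, order {n}: {A ** (n + 1)} parameters")
# e.g., a 5th-order DNA model already has 4**6 = 4096 parameters
```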

Inhomogeneous Markov chains
So far the Markov chain has used the same transition probabilities for the entire sequence. Sometimes it is useful to switch between Markov chains depending on where we are in the sequence. For example, recall the genetic code, which specifies what triplets of bases (codons) code for an amino acid:
– Each position in the codon has different statistics
– This can be modeled with three Markov chains

An inhomogeneous Markov chain to model codons
[state diagram: a begin state followed by three copies of the {a, t, c, g} states, one for each codon position (pos 1, pos 2, pos 3), with transitions cycling 1 → 2 → 3 → 1]

Inhomogeneous Markov chain
Let 1, 2, and 3 denote the three Markov chains, so our transition probabilities look like $a^1_{ij}$, $a^2_{ij}$, $a^3_{ij}$.
Let $x_1$ be in codon position 3. The probability of $x_2 x_3 x_4 x_5 x_6 \ldots$ would be
$a^1_{x_1 x_2}\, a^2_{x_2 x_3}\, a^3_{x_3 x_4}\, a^1_{x_4 x_5}\, a^2_{x_5 x_6} \cdots$
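
A sketch of this product, assuming each chain uses the dict-of-dicts transition format from the earlier examples; the uniform chains are placeholders.

```python
def inhomogeneous_prob(seq, chains, first_pos=3):
    """P(x2 x3 ... | x1) with codon-position-specific transition matrices.
    chains[k] is used when the *target* symbol sits in codon position k;
    first_pos is the codon position of seq[0].  With first_pos = 3 this
    computes a1[x1][x2] * a2[x2][x3] * a3[x3][x4] * a1[x4][x5] * ..."""
    p, pos = 1.0, first_pos
    for x, y in zip(seq, seq[1:]):
        pos = pos % 3 + 1              # codon position cycles 1 -> 2 -> 3 -> 1
        p *= chains[pos][x][y]
    return p

uniform = {x: {y: 0.25 for y in "acgt"} for x in "acgt"}  # placeholder chains
chains = {1: uniform, 2: uniform, 3: uniform}
print(inhomogeneous_prob("atgcga", chains))  # 0.25 ** 5
```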

Summary
Markov models allow us to model dependencies in sequential data. Markov models are described by
– States, each state corresponding to a letter in our observed alphabet
– Transition probabilities between states
Parameter estimation can be done by counting the number of transitions between consecutive states (first order)
– A Laplace correction is often applied
Markov chains are often used to classify sequences, for example as CpG islands or not.
Extensions of Markov models
– Higher-order Markov models
– Inhomogeneous Markov models

Errata
Slide 14 corrected for the end position:
$P(C)\, P(G \mid C)\, P(G \mid G)\, P(T \mid G)\, P(\text{End} \mid T)$