Hidden Markov Model in Biological Sequence Analysis – Part 2

CpG is a pair of nucleotides C and G appearing successively, in this order, along one DNA strand. CpG islands are short subsequences in which the CpG pair is more frequent than elsewhere. These CpG islands are known to appear in biologically more significant parts of the genome.

Two problems involving CpG islands: 1. Given a short genome sequence, decide whether it comes from a CpG island or not. 2. Given a long DNA sequence, locate all the CpG islands in it.

It is convenient to show a Markov chain graphically as a collection of 'states', each of which corresponds to a particular residue, with arrows between the states.

Formal definition: A Markov chain is a triplet (Q, {p(x_1 = s)}, A), where:
- Q is a finite set of states, each corresponding to a symbol in the alphabet Σ;
- {p(x_1 = s)} are the initial state probabilities;
- A is the matrix of state transition probabilities, denoted a_st for each s, t ∈ Q, where a_st = P(x_i = t | x_{i-1} = s).

For any probabilistic model of sequences we can write the probability of the sequence as P(x) = P(x_L, x_{L-1}, ..., x_1). By applying P(X, Y) = P(X | Y) P(Y) many times: P(x) = P(x_L | x_{L-1}, ..., x_1) P(x_{L-1} | x_{L-2}, ..., x_1) ... P(x_1). For a Markov chain this reduces to P(x) = P(x_1) ∏_{i=2..L} a_{x_{i-1} x_i}.
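As a small illustration of this product form, the sketch below computes P(x) for a first-order Markov chain over the DNA alphabet. The function name and the uniform placeholder probabilities are ours, not the trained parameters discussed later.

```python
# Minimal sketch (not from the slides): probability of a sequence under a
# first-order Markov chain, P(x) = P(x_1) * prod_{i>=2} a_{x_{i-1} x_i}.
def markov_chain_probability(seq, initial, transition):
    p = initial[seq[0]]
    for prev, curr in zip(seq, seq[1:]):
        p *= transition[prev][curr]
    return p

# Placeholder uniform parameters, for illustration only.
initial = {b: 0.25 for b in "ACGT"}
transition = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
print(markov_chain_probability("CGCG", initial, transition))  # 0.25 * 0.25**3
```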

Begin and end states (here denoted B and E) can be added to a Markov chain, so that P(x_1 = s) = a_{Bs} and P(end | x_L = t) = a_{tE}.

INPUT: A short DNA sequence X = (x_1, ..., x_L) ∈ Σ* (where Σ = {A, C, G, T}). QUESTION: Decide whether X is a CpG island. We can use two Markov chain models to solve this problem: one for CpG islands (the '+' model) and one for non-CpG islands (the '-' model).

The transition probabilities in each model are derived from a collection of human gene sequences containing 48 putative CpG islands. Let c^+_st be the number of times letter t follows letter s inside the CpG islands; then a^+_st, the transition probability from s to t inside a CpG island, is estimated as a^+_st = c^+_st / Σ_{t'} c^+_{st'} (and similarly for the '-' model).

In the resulting transition tables, each row sums to one, and the tables are asymmetric.

We can therefore compute a log-likelihood ratio score for a sequence X: Score(X) = log [P(X | model +) / P(X | model −)] = Σ_{i=2..L} log (a^+_{x_{i-1} x_i} / a^-_{x_{i-1} x_i}). The higher this score, the more likely it is that X is a CpG island.
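A possible sketch of this score as code, assuming `plus` and `minus` are the trained '+' and '-' transition tables (filled here with placeholder values):

```python
from math import log

def cpg_score(seq, plus, minus):
    # Score(X) = sum_i log( a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i} )
    return sum(log(plus[p][c] / minus[p][c]) for p, c in zip(seq, seq[1:]))

# Placeholder tables; real values would come from counts on annotated data.
plus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
minus = {s: {t: 0.25 for t in "ACGT"} for s in "ACGT"}
print(cpg_score("CGCG", plus, minus))  # 0.0 with identical placeholder tables
```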

(Figure: score distributions, with CpG island sequences shown in dark grey and non-CpG sequences in light grey.)

INPUT: A long DNA sequence X = (x_1, ..., x_L) ∈ Σ* (where Σ = {A, C, G, T}). QUESTION: Locate the CpG islands along X.

A naive approach: extract a sliding window X_k = (x_{k+1}, ..., x_{k+l}) (where l ≪ L and 1 ≤ k ≤ L − l) from the sequence, and calculate Score(X_k) for each of the resulting subsequences. Subsequences that receive positive scores are potential CpG islands. The main disadvantage: we have no information about the lengths of the islands. A sketch of this window scan follows below.
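A rough sketch of the sliding-window idea (the helper name and window handling are ours; `score` can be any per-window scoring function, such as the log-odds sketch above):

```python
def naive_cpg_windows(seq, window, score):
    # Score every window of fixed length; positive scores mark candidate islands.
    hits = []
    for k in range(len(seq) - window + 1):
        s = score(seq[k:k + window])
        if s > 0:
            hits.append((k, s))
    return hits
```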

A better solution: combine the two Markov chains of the previous problem into a unified model, with a small probability of switching from one chain to the other at each transition point.

In the new model, A+, C+, G+ and T+ emit A, C, G and T respectively in CpG island regions, and A-, C-, G- and T- emit the corresponding symbols in non-island regions:

State:           A+  C+  G+  T+  A-  C-  G-  T-
Emitted symbol:  A   C   G   T   A   C   G   T

A Hidden Markov Model (HMM) is a triplet M = (Σ, Q, Θ), where: Σ is an alphabet of symbols; Q is a finite set of states, capable of emitting symbols from the alphabet Σ; Θ is a set of probabilities, comprised of the state transition probabilities and the emission probabilities.

State transition probabilities: the state sequence π = (π_1, ..., π_L) is called a path, and the probability of a state depends only on the previous state: a_kl = P(π_i = l | π_{i-1} = k). Emission probabilities: the symbol sequence is X = (x_1, ..., x_L) ∈ Σ*, and the emission probability e_k(b) is the probability that symbol b is seen when in state k: e_k(b) = P(x_i = b | π_i = k).

The probability that the sequence X was generated by the model M given the path π is therefore: P(X, π) = a_{π_0 π_1} ∏_{i=1..L} e_{π_i}(x_i) a_{π_i π_{i+1}}, where for convenience we denote π_0 = begin and π_{L+1} = end.
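As a sketch, this joint probability can be computed directly from the product above; the begin/end state labels and argument names below are our own choices, and the transition table `a` is assumed to contain entries for the begin and end states.

```python
def joint_probability(x, path, a, e, begin="B", end="E"):
    # P(X, pi) = a_{begin, pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i, pi_{i+1}},
    # with pi_{L+1} taken to be the end state.
    p = a[begin][path[0]]
    for i, symbol in enumerate(x):
        p *= e[path[i]][symbol]
        nxt = path[i + 1] if i + 1 < len(path) else end
        p *= a[path[i]][nxt]
    return p
```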

Example : modeling a dishonest casino dealer

INPUT: A hidden Markov model M = (Σ, Q, Θ) and a sequence X ∈ Σ*, for which the generating path π = (π_1, ..., π_L) is unknown. QUESTION: Find the most probable generating path π* for X. In general there may be many state sequences that could give rise to any particular sequence of symbols.

In our example, (C+, G+, C+, G+), (C-, G-, C-, G-) and (C+, G-, C+, G-) would all generate the symbol sequence CGCG, but they do so with very different probabilities. We are looking for a path π* such that P(X, π*) is maximized: π* = argmax_π {P(X, π)}.

The Viterbi algorithm is based on dynamic programming: the most probable path π* can be found recursively. Suppose the probability v_k(i) of the most probable path ending in state k with observation x_i is known for all states k. Then these probabilities can be calculated for observation x_{i+1} as: v_l(i+1) = e_l(x_{i+1}) max_k (v_k(i) a_{kl}).

The algorithm consists of initialization, the recursion above, termination and a traceback step. Time complexity: O(L|Q|^2). Space complexity: O(L|Q|).
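A compact Viterbi sketch within these complexity bounds; it keeps probabilities as plain products (a real implementation would work in log space, as noted later), and the parameter names (`a0` for the initial transitions, `a`, `e`) are our own.

```python
def viterbi(x, states, a, e, a0):
    # v_l(i+1) = e_l(x_{i+1}) * max_k ( v_k(i) * a_{kl} ), plus a traceback.
    v = [{k: a0[k] * e[k][x[0]] for k in states}]   # initialization
    ptr = []
    for i in range(1, len(x)):
        col, back = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[-1][k] * a[k][l])
            back[l] = best
            col[l] = e[l][x[i]] * v[-1][best] * a[best][l]
        v.append(col)
        ptr.append(back)
    last = max(states, key=lambda k: v[-1][k])      # termination
    path = [last]
    for back in reversed(ptr):                      # traceback
        path.append(back[path[-1]])
    return list(reversed(path))
```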

Example 1: CpG islands. (Table: the values of v for the sequence CGCG.)

Example 2 : The casino

INPUT: A hidden Markov model M = (Σ, Q, Θ) and a sequence X ∈ Σ*, for which the generating path π = (π_1, ..., π_L) is unknown. QUESTION: What is the probability that observation x_i came from state k, given the observed sequence, i.e. P(π_i = k | X)? This is the posterior probability of state k at time i when the emitted sequence X is known.

To answer this question we need two algorithms, both of which assume that X is known: the forward algorithm computes f_k(i) = P(x_1 ... x_i, π_i = k), the probability of emitting the prefix (x_1, ..., x_i) and ending in state k; the backward algorithm computes b_k(i) = P(x_{i+1} ... x_L | π_i = k), the probability of the suffix (x_{i+1}, ..., x_L) given that π_i = k.

In order to find P(π_i = k | X) we need to know P(X). In Markov chains, the probability of a sequence was calculated by the equation P(x) = P(x_1) ∏_{i=2..L} a_{x_{i-1} x_i}. What is the probability P(x) for an HMM?

In an HMM, many different state paths can generate the same sequence x, and the probability of x is the sum of the probabilities over all possible paths: P(x) = Σ_π P(x, π). The number of possible paths π increases exponentially with the length of the sequence, so enumerating all paths is not practical.

One approach is to use the probability of the Viterbi path π* as an approximation to P(x). In fact, the full probability can itself be calculated by dynamic programming (like Viterbi), replacing the maximization steps with sums.

The forward algorithm: initialization f_begin(0) = 1; recursion f_l(i) = e_l(x_i) Σ_k f_k(i−1) a_{kl}; termination P(x) = Σ_k f_k(L) a_{k,end}.

The backward algorithm calculates the probability of the suffix (x_{i+1}, ..., x_L): b_k(i) = P(x_{i+1} ... x_L | π_i = k). The recursion starts at the end of the sequence: b_k(L) = a_{k,end}, and b_k(i) = Σ_l a_{kl} e_l(x_{i+1}) b_l(i+1).

Back to the posterior problem: now we can calculate P(π_i = k | X) = f_k(i) b_k(i) / P(X), where P(X) is the result of the forward (or backward) algorithm.
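The following sketch puts the forward recursion, the backward recursion and the posterior formula together. It omits an explicit end state (so termination is a plain sum over the last column), and all names are ours.

```python
def forward(x, states, a, e, a0):
    # f_k(i) = P(x_1..x_i, pi_i = k): same recursion as Viterbi with max -> sum.
    f = [{k: a0[k] * e[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[-1][k] * a[k][l] for k in states)
                  for l in states})
    px = sum(f[-1][k] for k in states)   # P(x); no explicit end state here
    return f, px

def backward(x, states, a, e):
    # b_k(i) = P(x_{i+1}..x_L | pi_i = k); the recursion starts at the end.
    b = [{k: 1.0 for k in states}]
    for i in range(len(x) - 2, -1, -1):
        b.insert(0, {k: sum(a[k][l] * e[l][x[i + 1]] * b[0][l] for l in states)
                     for k in states})
    return b

def posterior(x, states, a, e, a0):
    # P(pi_i = k | X) = f_k(i) * b_k(i) / P(X)
    f, px = forward(x, states, a, e, a0)
    b = backward(x, states, a, e)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]
```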

Some comments on the Viterbi, forward and backward algorithms: their time complexity is O(L|Q|^2) and their space complexity is O(L|Q|); in practice one works in log space to avoid underflow errors when implementing them on a computer.

(Figure, the casino example: the posterior probability of the fair die throughout the sequence; the loaded die was used in the blue areas. x-axis: number of the roll; y-axis: probability of the fair die.)

Uses for posterior decoding, two alternative forms: 1. When many different paths have almost the same probability as the most probable one, we may want to consider other possible paths as well, and take π**_i = argmax_k {P(π_i = k | X)}. Note that π** may not be a legitimate path if some transitions are not permitted.

2. When we're not interested in the state sequence itself, but in some other property derived from it. For example, let g(k) = 1 if k ∈ {A+, C+, G+, T+} and 0 otherwise; then G(i | x) = Σ_k P(π_i = k | x) g(k) is the posterior probability that base i is in a CpG island.

The most difficult problem faced when using HMMs is specifying the model: designing the structure (which states there are and the connections between them), and assigning the transition and emission probabilities a_kl and e_k(b).

We are given training sequences X^1, ..., X^n ∈ Σ* of lengths L_1, ..., L_n respectively, which were all generated from the HMM M = (Σ, Q, Θ); however, the values of the probabilities in Θ are unknown. We want to construct an HMM that best characterizes X^1, ..., X^n, i.e. to assign values to Θ that maximize the probability of X^1, ..., X^n.

Since the sequences were generated independently, the probability of X^1, ..., X^n given Θ is P(X^1, ..., X^n | Θ) = ∏_{j=1..n} P(X^j | Θ). Using the logarithmic score, our goal is to find Θ* such that Θ* = argmax_Θ Σ_{j=1..n} log P(X^j | Θ).

The sequences X^1, ..., X^n are usually called the training sequences. We shall examine two cases for parameter estimation: estimation when the state sequence is known, and estimation when the state sequence is unknown.

Estimation when the state sequence is known. When all the paths are known, we can count the number of times each particular transition or emission is used in the set of training sequences: A_kl is the number of transitions from state k to state l in all the state sequences, and E_k(b) is the number of times an emission of the symbol b occurred in state k in all the state sequences. The maximum-likelihood estimates are then a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b').

Laplace's correction: in order to avoid zero probabilities we add pseudocounts and use a_kl = (A_kl + r_kl) / Σ_{l'} (A_{kl'} + r_{kl'}) and e_k(b) = (E_k(b) + r_k(b)) / Σ_{b'} (E_k(b') + r_k(b')), where the pseudocounts r_kl and r_k(b) are usually equal to 1.
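A small sketch of estimation from labelled paths with Laplace pseudocounts (function and argument names are ours; the single `pseudo` value plays the role of r_kl and r_k(b)):

```python
def estimate_from_labelled_paths(sequences, paths, states, alphabet, pseudo=1.0):
    # Count transitions A_kl and emissions E_k(b), add pseudocounts, normalize rows.
    A = {k: {l: pseudo for l in states} for k in states}
    E = {k: {b: pseudo for b in alphabet} for k in states}
    for x, pi in zip(sequences, paths):
        for i, k in enumerate(pi):
            E[k][x[i]] += 1
            if i + 1 < len(pi):
                A[k][pi[i + 1]] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e
```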

Estimation when the state sequence is unknown. When the state sequences are not known, the problem of finding the optimal set of parameters Θ* is known to be NP-complete. The Baum-Welch algorithm is a heuristic for finding a solution; it is a special case of the EM technique (Expectation-Maximization).

Initialization: assign arbitrary values to Θ. Expectation: calculate A_kl and E_k(b) as the expected number of times each transition or emission is used, given the training sequences. The probability that a_kl is used at position i in sequence x is P(π_i = k, π_{i+1} = l | x, Θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x).

Summing over all positions and training sequences gives A_kl = Σ_j (1 / P(x^j)) Σ_i f^j_k(i) a_kl e_l(x^j_{i+1}) b^j_l(i+1). Similarly, we can find the expected number of times that letter b appears in state k: E_k(b) = Σ_j (1 / P(x^j)) Σ_{i : x^j_i = b} f^j_k(i) b^j_k(i).

Maximization: update the values of a_kl and e_k(b) according to the equations a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b'). This process is iterated until the improvement of Score(X^1, ..., X^n | Θ) is less than a given parameter ε.
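One EM update for a single training sequence might be sketched as below. It assumes the forward values `f`, backward values `b` and `px = P(x)` have already been computed (for instance by the forward/backward sketch earlier); the rest follows the expected-count and normalization equations of the preceding slides, without pseudocounts, so a state that is never visited would need pseudocounts to avoid division by zero.

```python
def baum_welch_update(x, states, alphabet, a, e, f, b, px):
    # E-step: expected transition counts A_kl and emission counts E_k(b).
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {s: 0.0 for s in alphabet} for k in states}
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k] / px
            if i + 1 < len(x):
                for l in states:
                    A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    # M-step: re-normalize the expected counts into probabilities.
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {s: E[k][s] / sum(E[k].values()) for s in alphabet} for k in states}
    return new_a, new_e
```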

Baum-Welch converges to a local maximum of the target function Score(X^1, ..., X^n | Θ). The main problem is that several local maxima may exist. Possible remedies: 1. run the algorithm several times, each time with different initial values for Θ; 2. start with Θ values that are meaningful.

References
- Hidden Markov Models, Ron Shamir, lecture notes: algmb/98/scribe/html/lec06/node1.html
- Biological Sequence Analysis, Durbin et al., Chapter 3.