COMP3456 – adapted from textbook slides www.bioalgorithms.info Hidden Markov Models.

COMP3456 – adapted from textbook slides Hidden Markov Models

COMP3456 – adapted from textbook slides Outline
 CG-islands
 The "Fair Bet Casino"
 Hidden Markov Models
 Decoding Algorithm
 Forward-Backward Algorithm
 Profile HMMs
 HMM Parameter Estimation
  Viterbi training
  Baum-Welch algorithm

COMP3456 – adapted from textbook slides CG-Islands Given 4 nucleotides: the probability of occurrence of each is ≈ 1/4, so the probability of occurrence of a given dinucleotide is ≈ 1/16. However, the frequencies of dinucleotides in DNA sequences vary widely. In particular, CG is typically under-represented (the frequency of CG is typically < 1/16).

COMP3456 – adapted from textbook slides part 1: Hidden Markov Models

COMP3456 – adapted from textbook slides Why CG-Islands? CG is the least frequent dinucleotide because the C in CG is easily methylated and then has a tendency to mutate into T. However, methylation is suppressed around genes in a genome, so CG appears at relatively high frequency within these CG islands. Finding the CG islands in a genome is therefore an important problem, as it gives us a clue to where genes are.

COMP3456 – adapted from textbook slides CG Islands and the "Fair Bet Casino" The CG islands problem can be modelled on a problem called "The Fair Bet Casino". The game is to flip coins, with only two possible outcomes: Head or Tail. The Fair coin gives Heads and Tails with the same probability ½. The Biased coin gives Heads with probability ¾.

COMP3456 – adapted from textbook slides The "Fair Bet Casino" (cont'd) Thus, we define the probabilities: P(H|F) = P(T|F) = ½; P(H|B) = ¾, P(T|B) = ¼. The crooked dealer switches between the Fair and Biased coins with probability 10% at each toss.

COMP3456 – adapted from textbook slides The Fair Bet Casino Problem Input: A sequence x = x_1 x_2 x_3 … x_n of coin tosses made by two possible coins (F or B). Output: A sequence π = π_1 π_2 π_3 … π_n, with each π_i being either F or B, indicating that x_i is the result of tossing the Fair or Biased coin respectively.

COMP3456 – adapted from textbook slides but there's a problem… Fair Bet Casino Problem: any observed outcome of coin tosses could have been generated by any sequence of states! We need a way to score different state sequences against one another. This leads to the Decoding Problem.

COMP3456 – adapted from textbook slides P(x|fair coin) vs. P(x|biased coin) Suppose first that the dealer never changes coins. Some definitions: P(x|fair coin): probability of the dealer using the F coin and generating the outcome x. P(x|biased coin): probability of the dealer using the B coin and generating the outcome x.

COMP3456 – adapted from textbook slides P(x|fair coin) vs. P(x|biased coin) P(x|biased coin) = Π_{i=1..n} p(x_i|biased coin) = (3/4)^k (1/4)^(n−k), where k is the number of Heads in x.

COMP3456 – adapted from textbook slides P(x|fair coin) vs. P(x|biased coin) P(x|fair coin) = P(x_1…x_n|fair coin) = Π_{i=1..n} p(x_i|fair coin) = (1/2)^n. P(x|biased coin) = P(x_1…x_n|biased coin) = Π_{i=1..n} p(x_i|biased coin) = (3/4)^k (1/4)^(n−k) = 3^k/4^n, where k is the number of Heads in x.

COMP3456 – adapted from textbook slides P(x|fair coin) vs. P(x|biased coin) So what can we find out? P(x|fair coin) = P(x|biased coin) when (1/2)^n = 3^k/4^n, i.e. 2^n = 3^k, i.e. n = k log_2 3, which happens when k = n / log_2 3 (k ≈ 0.67n).
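
As a quick numerical check of this crossover, here is a small sketch (plain Python; the function names and the example values of n and k are illustrative, not from the slides) that computes the two log-likelihoods and k = n / log_2 3:

```python
import math

def loglik_fair(n: int) -> float:
    """log2 P(x | fair coin) for any sequence of n tosses: (1/2)^n."""
    return n * math.log2(0.5)

def loglik_biased(n: int, k: int) -> float:
    """log2 P(x | biased coin) for n tosses containing k Heads: (3/4)^k (1/4)^(n-k)."""
    return k * math.log2(0.75) + (n - k) * math.log2(0.25)

n = 100
print("crossover at k =", n / math.log2(3))      # about 63.09 Heads out of 100
for k in (60, 63, 64, 70):
    print(k, round(loglik_fair(n), 2), round(loglik_biased(n, k), 2))
```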

COMP3456 – adapted from textbook slides Log-odds Ratio We define the log-odds ratio as follows: log_2(P(x|fair coin) / P(x|biased coin)) = Σ_{i=1..n} log_2(p^+(x_i) / p^−(x_i)) = n − k log_2 3. This gives us a threshold at which the evidence favours one model over the other.

COMP3456 – adapted from textbook slides Computing Log-odds Ratio in Sliding Windows Consider a sliding window over the outcome sequence x_1 x_2 … x_n and compute the log-odds ratio for each short window. (Figure: log-odds value plotted per window against the threshold 0; above 0 the fair coin was most likely used, below 0 the biased coin.) Disadvantages: the length of a CG-island is not known in advance, and different windows may classify the same position differently.
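
A minimal sketch of the sliding-window computation, assuming the 0/1 encoding of Tails/Heads used later in these slides (the window length and example sequence are made up for illustration):

```python
import math

def window_log_odds(x: str, w: int) -> list:
    """log2(P(window|fair) / P(window|biased)) for every length-w window of a
    0/1 toss string x; positive values favour the fair coin, negative the biased one."""
    scores = []
    for start in range(len(x) - w + 1):
        k = x[start:start + w].count("1")      # number of Heads in the window
        scores.append(w - k * math.log2(3))    # n - k*log2(3), with n = w
    return scores

print(window_log_odds("0010111110110100", 8))
```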

COMP3456 – adapted from textbook slides Hidden Markov Model (HMM) Can be viewed as an abstract machine with k hidden states that emits symbols from an alphabet Σ. Each hidden state has its own probability distribution, and the machine switches between states according to this probability distribution. While in a certain state, the machine makes two decisions: 1. What state should I move to next? 2. What symbol - from the alphabet Σ - should I emit?

COMP3456 – adapted from textbook slides Why "Hidden"? Observers can see the emitted symbols of an HMM but have no way of knowing which state the HMM is currently in. Thus, the goal is to infer the most likely sequence of hidden states of an HMM from a given sequence of emitted symbols.

COMP3456 – adapted from textbook slides HMM Parameters Σ: set of emission characters. Examples: Σ = {H, T} for coin tossing; Σ = {1, 2, 3, 4, 5, 6} for dice tossing. Q: set of hidden states, each emitting symbols from Σ. Example: Q = {F, B} for coin tossing.

COMP3456 – adapted from textbook slides HMM Parameters (cont'd) A = (a_kl): a |Q| × |Q| matrix of the probabilities of changing from state k to state l. For the Fair Bet Casino: a_FF = 0.9, a_FB = 0.1, a_BF = 0.1, a_BB = 0.9. E = (e_k(b)): a |Q| × |Σ| matrix of the probabilities of emitting symbol b while in state k. For the Fair Bet Casino: e_F(0) = ½, e_F(1) = ½, e_B(0) = ¼, e_B(1) = ¾.

COMP3456 – adapted from textbook slides HMM for Fair Bet Casino The Fair Bet Casino in HMM terms: Σ = {0, 1}: 0 for Tails and 1 for Heads. Q = {F, B}: F for the Fair and B for the Biased coin.
Transition probabilities A:
           Fair          Biased
 Fair      a_FF = 0.9    a_FB = 0.1
 Biased    a_BF = 0.1    a_BB = 0.9
Emission probabilities E:
           Tails (0)     Heads (1)
 Fair      e_F(0) = ½    e_F(1) = ½
 Biased    e_B(0) = ¼    e_B(1) = ¾
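
For concreteness, the sketch below encodes these parameters as plain Python dictionaries. The container layout, the string labels and the uniform initial distribution START are illustrative assumptions rather than part of the slides; the later code sketches reuse these names.

```python
# Illustrative encoding of the Fair Bet Casino HMM (names are not from the slides).
STATES = ["F", "B"]                    # Q: Fair, Biased
ALPHABET = ["0", "1"]                  # Sigma: 0 = Tails, 1 = Heads

A = {                                  # transition probabilities a_kl
    ("F", "F"): 0.9, ("F", "B"): 0.1,
    ("B", "F"): 0.1, ("B", "B"): 0.9,
}
E = {                                  # emission probabilities e_k(b)
    ("F", "0"): 0.5,  ("F", "1"): 0.5,
    ("B", "0"): 0.25, ("B", "1"): 0.75,
}
START = {"F": 0.5, "B": 0.5}           # assumed uniform initial distribution
```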

COMP3456 – adapted from textbook slides HMM for Fair Bet Casino (cont’d) HMM model for the Fair Bet Casino Problem

COMP3456 – adapted from textbook slides Hidden Paths A path π = π_1 … π_n in the HMM is defined as a sequence of states. Consider the path π = FFFBBBBBFFF and a sequence x emitted along it:
 π:               F    F    F    B    B    B    B    B    F    F    F
 P(x_i|π_i):      ½    ½    ½    ¾    ¾    ¾    ¼    ¾    ½    ½    ½
 P(π_{i-1}→π_i):  ½   9/10 9/10 1/10 9/10 9/10 9/10 9/10 1/10 9/10 9/10
Here P(π_{i-1}→π_i) is the transition probability from state π_{i-1} to state π_i, and P(x_i|π_i) is the probability that x_i was emitted from state π_i.

COMP3456 – adapted from textbook slides P(x|π) Calculation P(x|π): the probability that sequence x was generated by the path π:

COMP3456 – adapted from textbook slides P(x|π) Calculation P(x|π): the probability that sequence x was generated by the path π: P(x|π) = P(π_0 → π_1) · Π_{i=1..n} P(x_i|π_i) · P(π_i → π_{i+1}) = a_{π_0, π_1} · Π_{i=1..n} e_{π_i}(x_i) · a_{π_i, π_{i+1}}

COMP3456 – adapted from textbook slides P(x|π) Calculation P(x|π): the probability that sequence x was generated by the path π: P(x|π) = P(π_0 → π_1) · Π_{i=1..n} P(x_i|π_i) · P(π_i → π_{i+1}) = a_{π_0, π_1} · Π_{i=1..n} e_{π_i}(x_i) · a_{π_i, π_{i+1}} = Π_{i=0..n−1} e_{π_{i+1}}(x_{i+1}) · a_{π_i, π_{i+1}}, if we count from i = 0 instead of i = 1.
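
A minimal sketch of this product for the casino model, reusing the hypothetical A, E and START tables defined above. The example toss sequence is arbitrary, and the initial transition a_{π_0,π_1}, which the slides treat abstractly via a begin state, is taken here from the assumed START distribution.

```python
def path_probability(x: str, path: str) -> float:
    """P(x, pi) = a_{pi_0,pi_1} * prod_i e_{pi_i}(x_i) * a_{pi_i,pi_{i+1}},
    with the initial transition taken from the assumed START distribution."""
    p = START[path[0]] * E[(path[0], x[0])]
    for i in range(1, len(x)):
        p *= A[(path[i - 1], path[i])] * E[(path[i], x[i])]
    return p

# arbitrary example: an 11-toss sequence scored against the path FFFBBBBBFFF
print(path_probability("01011101001", "FFFBBBBBFFF"))
```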

COMP3456 – adapted from textbook slides Decoding Problem Goal: Find an optimal hidden path of states given the observations. Input: A sequence of observations x = x_1…x_n generated by an HMM M(Σ, Q, A, E). Output: A path that maximizes P(x|π) over all possible paths π.

COMP3456 – adapted from textbook slides Building Manhattan for the Decoding Problem Andrew Viterbi used the Manhattan grid model to solve the Decoding Problem. Every choice of π = π_1 … π_n corresponds to a path in the graph. The only valid direction in the graph is eastward. This graph has |Q|^2 (n−1) edges (remember that Q is the set of states and n is the length of the sequence).

COMP3456 – adapted from textbook slides Edit Graph for Decoding Problem

COMP3456 – adapted from textbook slides Decoding Problem vs. Alignment Problem Valid directions in the alignment problem. Valid directions in the decoding problem.

COMP3456 – adapted from textbook slides Decoding Problem as Finding a Longest Path in a DAG The Decoding Problem is reduced to finding a longest path in the directed acyclic graph (DAG) above. Note: here the length of a path is defined as the product of its edges' weights, not the sum.

COMP3456 – adapted from textbook slides Decoding Problem (cont'd) Every path in the graph has probability P(x|π). The Viterbi algorithm finds the path that maximizes P(x|π) among all possible paths. The Viterbi algorithm runs in O(n|Q|^2) time.

COMP3456 – adapted from textbook slides Decoding Problem: weights of edges The weight w of the edge from (k, i) to (l, i+1) is given by: ???

COMP3456 – adapted from textbook slides Decoding Problem: weights of edges The weight w of the edge from (k, i) to (l, i+1) is given by: ?? Recall P(x|π) = Π_{i=0..n−1} e_{π_{i+1}}(x_{i+1}) · a_{π_i, π_{i+1}}

COMP3456 – adapted from textbook slides Decoding Problem: weights of edges The weight w of the edge from (k, i) to (l, i+1) is given by: ? It is the i-th term of the product above: i-th term = ?

COMP3456 – adapted from textbook slides Decoding Problem: weights of edges The i-th term = e_{π_{i+1}}(x_{i+1}) · a_{π_i, π_{i+1}} = e_l(x_{i+1}) · a_kl for π_i = k, π_{i+1} = l. So the weight of the edge from (k, i) to (l, i+1) is w = e_l(x_{i+1}) · a_kl.

COMP3456 – adapted from textbook slides Decoding Problem and Dynamic Programming s_{l,i+1} = max_{k∈Q} { s_{k,i} · (weight of the edge between (k,i) and (l,i+1)) } = max_{k∈Q} { s_{k,i} · a_kl · e_l(x_{i+1}) } = e_l(x_{i+1}) · max_{k∈Q} { s_{k,i} · a_kl }

COMP3456 – adapted from textbook slides Decoding Problem (cont'd) Initialization: s_{begin,0} = 1; s_{k,0} = 0 for k ≠ begin. Let π* be the optimal path. Then P(x|π*) = max_{k∈Q} { s_{k,n} · a_{k,end} }

COMP3456 – adapted from textbook slides Viterbi Algorithm The value of the product can become extremely small, which leads to underflow. To avoid underflow, use log values instead: s_{l,i+1} = log(e_l(x_{i+1})) + max_{k∈Q} { s_{k,i} + log(a_kl) }
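
A log-space Viterbi sketch over the same hypothetical tables. The begin/end states of the slides are folded into the assumed START distribution and a free end, which is a simplification of this sketch.

```python
import math

def viterbi(x: str, states, A, E, START):
    """Log-space Viterbi decoding: returns the most probable state path and its
    log probability, using the hypothetical A, E, START tables defined earlier."""
    n = len(x)
    score = {k: math.log(START[k]) + math.log(E[(k, x[0])]) for k in states}
    back = [dict() for _ in range(n)]           # back[i][l] = best predecessor of l at i
    for i in range(1, n):
        new_score = {}
        for l in states:
            best_k = max(states, key=lambda k: score[k] + math.log(A[(k, l)]))
            new_score[l] = (score[best_k] + math.log(A[(best_k, l)])
                            + math.log(E[(l, x[i])]))
            back[i][l] = best_k
        score = new_score
    last = max(states, key=lambda k: score[k])  # traceback from the best final state
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return "".join(reversed(path)), score[last]

print(viterbi("01011101001", STATES, A, E, START))
```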

COMP3456 – adapted from textbook slides Forward-Backward Problem Given: a sequence of coin tosses generated by an HMM. Goal: find the probability that the dealer was using a biased coin at a particular time.

COMP3456 – adapted from textbook slides Forward Algorithm Define f_{k,i} (the forward probability) as the probability of emitting the prefix x_1…x_i and reaching the state π_i = k. The recurrence for the forward algorithm: f_{k,i} = e_k(x_i) · Σ_{l∈Q} f_{l,i−1} · a_lk
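
A direct transcription of this recurrence, again over the hypothetical A, E, START tables; P(x) is taken as the sum over the final column, i.e. there is no explicit end state in this sketch.

```python
def forward(x: str, states, A, E, START):
    """Forward table f[i][k] = P(x_1..x_i, pi_i = k) and the total probability P(x)
    (taken as the sum over the last column; no explicit end state)."""
    n = len(x)
    f = [dict() for _ in range(n)]
    for k in states:
        f[0][k] = START[k] * E[(k, x[0])]
    for i in range(1, n):
        for k in states:
            f[i][k] = E[(k, x[i])] * sum(f[i - 1][l] * A[(l, k)] for l in states)
    return f, sum(f[n - 1][k] for k in states)
```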

COMP3456 – adapted from textbook slides Backward Algorithm However, the forward probability is not the only factor affecting P(π_i = k|x). The sequence of transitions and emissions that the HMM undergoes between positions i+1 and n also affects P(π_i = k|x): everything up to position i is captured by the forward values, everything after it by the backward values introduced next.

COMP3456 – adapted from textbook slides Backward Algorithm (cont'd) Define the backward probability b_{k,i} as the probability of being in state π_i = k and emitting the suffix x_{i+1}…x_n. The recurrence for the backward algorithm: b_{k,i} = Σ_{l∈Q} e_l(x_{i+1}) · a_kl · b_{l,i+1}
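
The corresponding backward sketch, initialised with b_{k,n} = 1 (again with no explicit end state, matching the forward sketch above):

```python
def backward(x: str, states, A, E):
    """Backward table b[i][k] = P(x_{i+1}..x_n | pi_i = k)."""
    n = len(x)
    b = [dict() for _ in range(n)]
    for k in states:
        b[n - 1][k] = 1.0                      # empty suffix after the last position
    for i in range(n - 2, -1, -1):
        for k in states:
            b[i][k] = sum(A[(k, l)] * E[(l, x[i + 1])] * b[i + 1][l]
                          for l in states)
    return b
```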

COMP3456 – adapted from textbook slides Backward-Forward Algorithm The probability that the dealer used a biased coin at any moment i: P(π_i = k | x) = P(x, π_i = k) / P(x) = f_{k,i} · b_{k,i} / P(x), where P(x) is the sum of P(x, π_i = k) over all k.
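
Combining the two tables gives this posterior; a sketch using the forward() and backward() functions above, with the same arbitrary example sequence:

```python
def posterior(x: str, states, A, E, START):
    """P(pi_i = k | x) = f_{k,i} * b_{k,i} / P(x) for every position i."""
    f, px = forward(x, states, A, E, START)
    b = backward(x, states, A, E)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

# probability that the dealer was using the biased coin at each toss
print([round(p["B"], 3) for p in posterior("01011101001", STATES, A, E, START)])
```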

COMP3456 – adapted from textbook slides Finding Distant Members of a Protein Family A distant cousin of functionally related sequences in a protein family may have weak pairwise similarities with each member of the family and thus fail a significance test. However, it may have weak similarities with many members of the family at once. The goal is therefore to align a sequence to all members of the family simultaneously. A family of related proteins can be represented by their multiple alignment and the corresponding profile.

COMP3456 – adapted from textbook slides Profile Representation of Protein Families Aligned DNA sequences can be represented by a 4·n profile matrix reflecting the frequencies of nucleotides at every aligned position. Similarly, a protein family can be represented by a 20·n profile representing the frequencies of amino acids.

COMP3456 – adapted from textbook slides Profiles and HMMs HMMs can also be used for aligning a sequence against a profile representing a protein family. A 20·n profile P corresponds to n sequentially linked match states M_1,…,M_n in the profile HMM of P.

COMP3456 – adapted from textbook slides Multiple Alignments and Protein Family Classification A multiple alignment of a protein family usually shows variation in conservation along the length of the protein. Example: after aligning many globin proteins, biologists recognized that the helix regions in globins are more conserved than other regions.

COMP3456 – adapted from textbook slides What are Profile HMMs? A profile HMM is a probabilistic representation of a multiple alignment. A given multiple alignment (of a protein family) is used to build a profile HMM. This model may then be used to find and score less obvious potential matches of new protein sequences.

COMP3456 – adapted from textbook slides Profile HMM A profile HMM

COMP3456 – adapted from textbook slides Building a profile HMM
 A multiple alignment is used to construct the HMM model.
 Assign each column to a Match state in the HMM; add Insertion and Deletion states.
 Estimate the emission probabilities according to the amino acid counts in each column. Different positions in the protein will have different emission probabilities.
 Estimate the transition probabilities between the Match, Deletion and Insertion states.
 The HMM model is then trained to derive the optimal parameters.

COMP3456 – adapted from textbook slides States of Profile HMM Match states M_1 … M_n (plus begin/end states). Insertion states I_0, I_1, …, I_n. Deletion states D_1 … D_n.

COMP3456 – adapted from textbook slides Transition Probabilities in Profile HMM log(a_MI) + log(a_IM) = gap initiation penalty; log(a_II) = gap extension penalty.

COMP3456 – adapted from textbook slides Emission Probabilities in Profile HMM Probability of emitting a symbol a at an insertion state I_j: e_{I_j}(a) = p(a), where p(a) is the frequency of occurrence of the symbol a in all the sequences (as we have nothing else to go on).

COMP3456 – adapted from textbook slides Profile HMM Alignment Define v^M_j(i) as the logarithmic likelihood score of the best path for matching x_1..x_i to the profile HMM, ending with x_i emitted by the state M_j. v^I_j(i) and v^D_j(i) are defined similarly, ending with an insertion or deletion in the sequence x.

COMP3456 – adapted from textbook slides Profile HMM Alignment: Dynamic Programming
 v^M_j(i) = log(e_{M_j}(x_i)/p(x_i)) + max { v^M_{j−1}(i−1) + log(a_{M_{j−1},M_j}), v^I_{j−1}(i−1) + log(a_{I_{j−1},M_j}), v^D_{j−1}(i−1) + log(a_{D_{j−1},M_j}) }
 v^I_j(i) = log(e_{I_j}(x_i)/p(x_i)) + max { v^M_j(i−1) + log(a_{M_j,I_j}), v^I_j(i−1) + log(a_{I_j,I_j}), v^D_j(i−1) + log(a_{D_j,I_j}) }

COMP3456 – adapted from textbook slides Paths in Edit Graph and Profile HMM A path through an edit graph and the corresponding path through a profile HMM

COMP3456 – adapted from textbook slides Making a Collection of HMMs for Protein Families
 Use BLAST to separate a protein database into families of related proteins.
 Construct a multiple alignment for each protein family.
 Construct a profile HMM and optimize the parameters of the model (transition and emission probabilities).
 Align the target sequence against each HMM to find the best fit between the target sequence and an HMM.

COMP3456 – adapted from textbook slides Application of Profile HMMs to Modelling Globin Proteins Globins represent a large collection of protein sequences. 400 globin sequences were randomly selected from all globins and used to construct a multiple alignment. The multiple alignment was used to assign an initial HMM. This model was then trained repeatedly, with model lengths chosen randomly between 145 and 170, to obtain an HMM with optimized probabilities.

COMP3456 – adapted from textbook slides How Good is the Globin HMM? The 625 remaining globin sequences in the database were aligned to the constructed HMM, resulting in a multiple alignment. This multiple alignment agrees extremely well with the structurally derived alignment. 25,044 proteins were then randomly chosen from the database and compared against the globin HMM. This experiment resulted in an excellent separation between globin and non-globin families.

COMP3456 – adapted from textbook slides PFAM Pfam describes protein domains. Each protein domain family in Pfam has:
 - a seed alignment: a manually verified multiple alignment of a representative set of sequences;
 - an HMM built from the seed alignment for further database searches;
 - a full alignment generated automatically from the HMM.
The distinction between seed and full alignments facilitates Pfam updates: seed alignments are stable resources, while HMM profiles and full alignments can be updated with newly found amino acid sequences.

COMP3456 – adapted from textbook slides PFAM Uses Pfam HMMs span entire domains, including both well-conserved motifs and less-conserved regions with insertions and deletions. Modelling complete domains facilitates better sequence annotation and leads to more sensitive detection.

COMP3456 – adapted from textbook slides HMM Parameter Estimation So far, we have assumed that the transition and emission probabilities are known. However, in most HMM applications the probabilities are not known, and estimating them is hard.

COMP3456 – adapted from textbook slides part 2: estimation of HMM parameters

COMP3456 – adapted from textbook slides HMM Parameter Estimation Problem Given: an HMM with states and alphabet (emission characters), and independent training sequences x^1, …, x^m. Find: HMM parameters Θ (that is, a_kl, e_k(b)) that maximize P(x^1, …, x^m | Θ), the joint probability of the training sequences.

COMP3456 – adapted from textbook slides Maximize the likelihood P(x^1, …, x^m | Θ), viewed as a function of Θ, is called the likelihood of the model. The training sequences are assumed independent, therefore P(x^1, …, x^m | Θ) = Π_i P(x^i | Θ). The parameter estimation problem seeks the Θ* that realizes the maximum of this likelihood: Θ* = argmax_Θ Π_i P(x^i | Θ). In practice the log likelihood is computed to avoid underflow errors.

COMP3456 – adapted from textbook slides Two situations Known paths for the training sequences: e.g. CpG islands marked on the training sequences, or one evening the casino dealer allows us to see when he changes dice. Unknown paths: e.g. CpG islands are not marked, or we do not see when the casino dealer changes dice.

COMP3456 – adapted from textbook slides Known paths Let A_kl = the number of times each transition k → l is taken in the training sequences, and E_k(b) = the number of times b is emitted from state k in the training sequences. Compute a_kl and e_k(b) as maximum likelihood estimators: a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b').
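
A sketch of these counts and maximum likelihood estimators for labelled training data. The function and argument names are illustrative, and the optional pseudo argument anticipates the pseudocounts discussed on the next slide.

```python
from collections import defaultdict

def estimate_from_labelled(sequences, paths, states, alphabet, pseudo=0.0):
    """Counts A_kl and E_k(b) from sequences with known state paths and returns the
    maximum likelihood estimates a_kl = A_kl / sum_l' A_kl' and
    e_k(b) = E_k(b) / sum_b' E_k(b').  Setting pseudo > 0 adds pseudocounts and
    avoids zero denominators for states that never occur in the training data."""
    A_count = defaultdict(lambda: pseudo)
    E_count = defaultdict(lambda: pseudo)
    for x, pi in zip(sequences, paths):
        for i in range(len(x)):
            E_count[(pi[i], x[i])] += 1
            if i + 1 < len(x):
                A_count[(pi[i], pi[i + 1])] += 1
    a = {(k, l): A_count[(k, l)] / sum(A_count[(k, m)] for m in states)
         for k in states for l in states}
    e = {(k, b): E_count[(k, b)] / sum(E_count[(k, c)] for c in alphabet)
         for k in states for b in alphabet}
    return a, e
```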

COMP3456 – adapted from textbook slides Pseudocounts Some state k may not appear in any of the training sequences. This means A_kl = 0 for every state l, and a_kl cannot be computed with the given equation. To avoid this overfitting we use predetermined pseudocounts r_kl and r_k(b): A_kl = (number of transitions k → l) + r_kl, E_k(b) = (number of emissions of b from k) + r_k(b). The pseudocounts reflect our prior biases about the probability values.

COMP3456 – adapted from textbook slides Unknown paths: Viterbi training Idea: use Viterbi decoding to compute the most probable path for each training sequence x. Start with some guess for the initial parameters and compute π*, the most probable path for x using the initial parameters. Iterate until there is no change in π*: 1. Determine A_kl and E_k(b) as before. 2. Compute new parameters a_kl and e_k(b) using the same formulae as before. 3. Compute the new π* for x and the current parameters.
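
A compact sketch of this loop for a single training sequence, reusing the viterbi() and estimate_from_labelled() sketches above. The initial-state distribution is kept fixed here, which is a simplification of the sketch rather than part of the algorithm on the slide.

```python
def viterbi_training(x, states, alphabet, A, E, START, max_iter=100):
    """Viterbi training sketch for one training sequence: alternate between decoding
    the most probable path pi* under the current parameters and re-estimating the
    parameters from that path, stopping when pi* no longer changes."""
    prev_path = None
    for _ in range(max_iter):
        path, _ = viterbi(x, states, A, E, START)
        if path == prev_path:                  # no change in pi*: stop
            break
        prev_path = path
        A, E = estimate_from_labelled([x], [path], states, alphabet, pseudo=1.0)
    return A, E, prev_path
```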

COMP3456 – adapted from textbook slides Viterbi training analysis The algorithm converges precisely (in a finite number of iterations): there are finitely many possible paths, and the new parameters are uniquely determined by the current π*. There may be several paths for x with the same probability, so we must compare the new π* with all previous paths having the highest probability. Viterbi training does not maximize the likelihood Π_x P(x | Θ) but only the contribution to the likelihood of the most probable path, Π_x P(x | Θ, π*). In general it performs less well than Baum-Welch.

COMP3456 – adapted from textbook slides Unknown paths: Baum-Welch The general idea: 1. Guess initial values for the parameters ("art" and experience, not science). 2. Estimate new (better) values for the parameters. (How?) 3. Repeat until stopping criteria are met. (What criteria?)

COMP3456 – adapted from textbook slides Better values for parameters We would need the A_kl and E_k(b) values, but we cannot count transitions (the path is unknown) and we do not want to use a most probable path. Instead, for all states k, l, each symbol b and each training sequence x, we compute A_kl and E_k(b) as expected values, given the current parameters.

COMP3456 – adapted from textbook slides Notation For any sequence of characters x emitted along some unknown path π, denote by "π_i = k" the assumption that the state at position i (in which x_i is emitted) is k.

COMP3456 – adapted from textbook slides Probabilistic setting for A_kl Given x^1, …, x^m, consider a discrete probability space with elementary events ε_{k,l} = "k → l is taken in x^1, …, x^m". For each x in {x^1,…,x^m} and each position i in x, let Y_{x,i} be the indicator random variable that equals 1 if the transition k → l is taken at position i of x, and 0 otherwise. Define Y = Σ_x Σ_i Y_{x,i}, the random variable that counts the number of times the event ε_{k,l} happens in x^1,…,x^m.

COMP3456 – adapted from textbook slides The meaning of A_kl Let A_kl be the expectation of Y: E(Y) = Σ_x Σ_i E(Y_{x,i}) = Σ_x Σ_i P(Y_{x,i} = 1) = Σ_x Σ_i P(π_i = k, π_{i+1} = l | x). So we need to compute P(π_i = k, π_{i+1} = l | x).

COMP3456 – adapted from textbook slides Probabilistic setting for E_k(b) Given x^1, …, x^m, consider a discrete probability space with elementary events ε_{k,b} = "b is emitted in state k in x^1, …, x^m". For each x in {x^1,…,x^m} and each position i in x, let Y_{x,i} be the indicator random variable that equals 1 if x_i = b is emitted from state k, and 0 otherwise. Define Y = Σ_x Σ_i Y_{x,i}, the random variable that counts the number of times the event ε_{k,b} happens in x^1,…,x^m.

COMP3456 – adapted from textbook slides The meaning of E_k(b) Let E_k(b) be the expectation of Y: E(Y) = Σ_x Σ_i E(Y_{x,i}) = Σ_x Σ_i P(Y_{x,i} = 1) = Σ_x Σ_{i: x_i = b} P(π_i = k | x). So we need to compute P(π_i = k | x).

COMP3456 – adapted from textbook slides Computing new parameters Consider a training sequence x = x_1…x_n and concentrate on positions i and i+1. Use the forward-backward values: f_{k,i} = P(x_1 … x_i, π_i = k) and b_{k,i} = P(x_{i+1} … x_n | π_i = k).

COMP3456 – adapted from textbook slides Compute A_kl (1) The probability that the transition k → l is taken at position i of x is P(π_i = k, π_{i+1} = l | x_1…x_n) = P(x, π_i = k, π_{i+1} = l) / P(x). P(x) can be computed using either the forward or the backward values. We'll show that P(x, π_i = k, π_{i+1} = l) = b_{l,i+1} · e_l(x_{i+1}) · a_kl · f_{k,i}. The expected number of times k → l is used in the training sequences is therefore A_kl = Σ_x Σ_i (b_{l,i+1} · e_l(x_{i+1}) · a_kl · f_{k,i}) / P(x).

COMP3456 – adapted from textbook slides Compute A_kl (2)
 P(x, π_i = k, π_{i+1} = l)
  = P(x_1…x_i, π_i = k, π_{i+1} = l, x_{i+1}…x_n)
  = P(π_{i+1} = l, x_{i+1}…x_n | x_1…x_i, π_i = k) · P(x_1…x_i, π_i = k)
  = P(π_{i+1} = l, x_{i+1}…x_n | π_i = k) · f_{k,i}
  = P(x_{i+1}…x_n | π_i = k, π_{i+1} = l) · P(π_{i+1} = l | π_i = k) · f_{k,i}
  = P(x_{i+1}…x_n | π_{i+1} = l) · a_kl · f_{k,i}
  = P(x_{i+2}…x_n | x_{i+1}, π_{i+1} = l) · P(x_{i+1} | π_{i+1} = l) · a_kl · f_{k,i}
  = P(x_{i+2}…x_n | π_{i+1} = l) · e_l(x_{i+1}) · a_kl · f_{k,i}
  = b_{l,i+1} · e_l(x_{i+1}) · a_kl · f_{k,i}

COMP3456 – adapted from textbook slides Compute E_k(b) The probability that x_i of x is emitted in state k is P(π_i = k | x_1…x_n) = P(π_i = k, x_1…x_n) / P(x), and P(π_i = k, x_1…x_n) = P(x_1…x_i, π_i = k, x_{i+1}…x_n) = P(x_{i+1}…x_n | x_1…x_i, π_i = k) · P(x_1…x_i, π_i = k) = P(x_{i+1}…x_n | π_i = k) · f_{k,i} = b_{k,i} · f_{k,i}. The expected number of times b is emitted in state k is therefore E_k(b) = Σ_x Σ_{i: x_i = b} (f_{k,i} · b_{k,i}) / P(x).

COMP3456 – adapted from textbook slides Finally, new parameters a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b'). We can add pseudocounts as before.

COMP3456 – adapted from textbook slides Stopping criteria We cannot actually reach the maximum (we are optimizing a continuous function), so we need stopping criteria: compute the log likelihood of the model for the current Θ, compare it with the previous log likelihood, and stop if the difference is small; or stop after a fixed number of iterations.

COMP3456 – adapted from textbook slides The Baum-Welch algorithm Initialization: pick a best-guess for the model parameters (or arbitrary values). Iteration: 1. Run the forward algorithm for each x. 2. Run the backward algorithm for each x. 3. Calculate A_kl and E_k(b). 4. Calculate the new a_kl and e_k(b). 5. Calculate the new log-likelihood. Repeat until the log-likelihood no longer changes much.
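
A sketch of one such iteration, combining the forward/backward sketches above with the expected-count formulas from the previous slides. Initial probabilities are held fixed and there is no explicit end state; these are simplifying assumptions of the sketch, not part of the algorithm as presented in the slides.

```python
import math

def baum_welch_iteration(seqs, states, alphabet, A, E, START, pseudo=0.0):
    """One Baum-Welch iteration: accumulate the expected counts A_kl and E_k(b)
    using the forward/backward sketches above, then renormalise them into new
    parameters.  Returns the new A, E and the log-likelihood under the old ones."""
    A_exp = {(k, l): pseudo for k in states for l in states}
    E_exp = {(k, b): pseudo for k in states for b in alphabet}
    loglik = 0.0
    for x in seqs:
        f, px = forward(x, states, A, E, START)
        b = backward(x, states, A, E)
        loglik += math.log(px)
        for i in range(len(x)):
            for k in states:
                E_exp[(k, x[i])] += f[i][k] * b[i][k] / px        # P(pi_i = k | x)
                if i + 1 < len(x):
                    for l in states:                              # P(pi_i=k, pi_{i+1}=l | x)
                        A_exp[(k, l)] += (f[i][k] * A[(k, l)] * E[(l, x[i + 1])]
                                          * b[i + 1][l] / px)
    new_A = {(k, l): A_exp[(k, l)] / sum(A_exp[(k, m)] for m in states)
             for k in states for l in states}
    new_E = {(k, c): E_exp[(k, c)] / sum(E_exp[(k, d)] for d in alphabet)
             for k in states for c in alphabet}
    return new_A, new_E, loglik
```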

COMP3456 – adapted from textbook slides Baum-Welch analysis The log-likelihood is increased by each iteration. Baum-Welch is a particular case of the EM (expectation maximization) algorithm. Convergence is to a local maximum, and the choice of initial parameters determines which local maximum the algorithm converges to.

COMP3456 – adapted from textbook slides Log-likelihood is increased by iterations The relative entropy of two distributions P, Q is H(P||Q) = Σ_i P(x_i) log(P(x_i)/Q(x_i)). Property: H(P||Q) ≥ 0, and H(P||Q) = 0 iff P(x_i) = Q(x_i) for all i. The proof of this property is based on the function f(x) = x log x, which is convex and satisfies f(x) = 0 iff x = 1 (for x > 0).

COMP3456 – adapted from textbook slides Proof cont'd The log likelihood is log P(x | Θ) = log Σ_π P(x, π | Θ), and P(x, π | Θ) = P(π | x, Θ) P(x | Θ). Assume Θ^t are the current parameters; choose Θ^{t+1} such that log P(x | Θ^{t+1}) is greater than log P(x | Θ^t). Now log P(x | Θ) = log P(x, π | Θ) − log P(π | x, Θ), so log P(x | Θ) = Σ_π P(π | x, Θ^t) log P(x, π | Θ) − Σ_π P(π | x, Θ^t) log P(π | x, Θ), because Σ_π P(π | x, Θ^t) = 1.

COMP3456 – adapted from textbook slides Proof cont'd Notation: Q(Θ | Θ^t) = Σ_π P(π | x, Θ^t) log P(x, π | Θ). We show that the Θ^{t+1} that maximizes log P(x | Θ) may be chosen to be some Θ that maximizes Q(Θ | Θ^t): log P(x | Θ) − log P(x | Θ^t) = Q(Θ | Θ^t) − Q(Θ^t | Θ^t) + Σ_π P(π | x, Θ^t) log(P(π | x, Θ^t) / P(π | x, Θ)), and the last sum is non-negative (it is a relative entropy).

COMP3456 – adapted from textbook slides Proof cont'd Conclusion: log P(x | Θ) − log P(x | Θ^t) ≥ Q(Θ | Θ^t) − Q(Θ^t | Θ^t), with equality only when Θ = Θ^t or when P(π | x, Θ^t) = P(π | x, Θ) for some Θ ≠ Θ^t.

COMP3456 – adapted from textbook slides Proof cont'd For an HMM, P(x, π | Θ) = a_{0,π_1} Π_{i=1..|x|} e_{π_i}(x_i) a_{π_i,π_{i+1}}. Let A_kl(π) = the number of times k → l appears in this product, and E_k(b, π) = the number of times an emission of b from k appears in this product. The product is a function of Θ, but A_kl(π) and E_k(b, π) do not depend on Θ.

COMP3456 – adapted from textbook slides Proof cont'd Write the product using all the factors: e_k(b) raised to the power E_k(b, π) and a_kl raised to the power A_kl(π). Then replace the product in Q(Θ | Θ^t): Q(Θ | Θ^t) = Σ_π P(π | x, Θ^t) (Σ_{k=1..M} Σ_b E_k(b, π) log e_k(b) + Σ_{k=0..M} Σ_{l=1..M} A_kl(π) log a_kl).

COMP3456 – adapted from textbook slides Proof cont'd Recall the A_kl and E_k(b) computed by the Baum-Welch algorithm at every iteration, and consider those computed at iteration t (based on Θ^t). Then A_kl = Σ_π P(π | x, Θ^t) A_kl(π) and E_k(b) = Σ_π P(π | x, Θ^t) E_k(b, π), i.e. they are the expectations of A_kl(π) and E_k(b, π) over P(π | x, Θ^t).

COMP3456 – adapted from textbook slides Proof cont'd Then Q(Θ | Θ^t) = Σ_{k=1..M} Σ_b E_k(b) log e_k(b) + Σ_{k=0..M} Σ_{l=1..M} A_kl log a_kl (changing the order of summation). Note that Θ consists of {a_kl} and {e_k(b)}. The algorithm computes Θ^{t+1} to consist of a_kl = A_kl / Σ_{l'} A_{kl'} and e_k(b) = E_k(b) / Σ_{b'} E_k(b'). One shows that this Θ^{t+1} maximizes Q(Θ | Θ^t) (compute the differences for the A part and for the E part).

COMP3456 – adapted from textbook slides Speech Recognition Create an HMM of the words in a language: each word is a hidden state in Q, and each of the basic sounds in the language is a symbol in Σ. Input: speech, used as the input sequence. Goal: find the most probable sequence of states.

COMP3456 – adapted from textbook slides Speech Recognition: Building the Model Analyze some large source of English sentences, such as a database of newspaper articles, to form the probability matrices: A_{0i}, the chance that word i begins a sentence, and A_{ij}, the chance that word j follows word i.

COMP3456 – adapted from textbook slides Building the Model (cont'd) Analyze English speakers to determine what sounds are emitted with what words: E_k(b) is the chance that sound b is spoken in word k. This allows for alternate pronunciations of words.

COMP3456 – adapted from textbook slides Speech Recognition: Using the Model Use the same dynamic programming algorithm as before: weave the spoken sounds through the model the same way we wove the rolls of the die through the casino model. π then represents the most likely sequence of words.

COMP3456 – adapted from textbook slides Using the Model (cont'd) How well does it work? Common words such as 'the', 'a' and 'of' make prediction less accurate, since so many words can normally follow them.

COMP3456 – adapted from textbook slides Improving Speech Recognition Initially we were using a 'bigram', a graph connecting every two words. Expand that to a 'trigram': each state represents two words spoken in succession, and each edge joins such a state (A B) to another state representing (B C). This requires on the order of n² vertices and n³ edges, where n is the number of words in the language. Much better, but still limited context.

COMP3456 – adapted from textbook slides References Slides for the CS 262 course at Stanford, given by Serafim Batzoglou.