6.047/6.878 Computational Biology: Genomes, Networks, Evolution

Presentation transcript:

6.047/6.878 Computational Biology: Genomes, Networks, Evolution
Modeling Biological Sequences and Hidden Markov Models (Part II: The Algorithms)
Lecture 7, Sept 25, 2008

Challenges in Computational Biology (overview figure): genome assembly, sequence alignment, database search, gene expression analysis and cluster discovery, gene finding, regulatory motif discovery and Gibbs sampling, evolutionary theory, protein network analysis, regulatory network inference, emerging network properties, comparative genomics.

Course Outline (lecture / date / topic / recitation / homework)
Lec 1 (Thu Sep 04): Intro: Biology, Algorithms, Machine Learning; Recitation 1: Probability, Statistics, Biology; PS1 (due 9/15)
Lec 2 (Tue Sep 09): Global/local alignment, dynamic programming
Lec 3 (Thu Sep 11): String search / BLAST / database search; Recitation 2: Affine gap alignment / hashing with combs
Lec 4 (Tue Sep 16): Clustering basics / gene expression / sequence clustering; PS2 (due 9/22)
Lec 5 (Thu Sep 18): Classification / feature selection / ROC / SVM; Recitation 3: Microarrays
Lec 6 (Tue Sep 23): HMMs 1: evaluation / parsing
Lec 7 (Thu Sep 25): HMMs 2: posterior decoding / learning; Recitation 4: Posterior decoding review, Baum-Welch learning; PS3 (due 10/06)
Lec 8 (Tue Sep 30): Generalized HMMs and gene prediction
Lec 9 (Thu Oct 02): Regulatory motifs / Gibbs sampling / EM; Recitation 5: Entropy, information, background models
Lec 10 (Tue Oct 07): Gene evolution: phylogenetic algorithms, NJ, ML, parsimony
Lec 11 (Thu Oct 09): Molecular evolution / coalescence / selection / MKS / Ka/Ks; Recitation 6: Gene trees, species trees, reconciliation; PS4 (due 10/20)
Lec 12 (Tue Oct 14): Population genomics: fundamentals
Lec 13 (Thu Oct 16): Population genomics: association studies; Recitation 7: Population genomics
Lec 14 (Tue Oct 21): MIDTERM (review session Mon Oct 20 at 7pm)
Lec 15 (Thu Oct 23): Genome assembly / Euler graphs; Recitation 8: Brainstorming for final projects (MIT); final project
Lec 16 (Tue Oct 28): Comparative genomics 1: biological signal discovery / evolutionary signatures; project Phase I (due 11/06)
Lec 17 (Thu Oct 30): Comparative genomics 2: phylogenomics, gene and genome duplication
Lec 18 (Tue Nov 04): Conditional random fields / gene finding / feature finding
Lec 19 (Thu Nov 06): Regulatory networks / Bayesian networks
Tue Nov 11: Veterans Day holiday (no class)
Lec 20 (Thu Nov 13): Inferring biological networks / graph isomorphism / network motifs; project Phase II (due 11/24)
Lec 21 (Tue Nov 18): Metabolic modeling 1: dynamic systems modeling
Lec 22 (Thu Nov 20): Metabolic modeling 2: flux balance analysis & metabolic control analysis
Lec 23 (Tue Nov 25): Systems biology (Uri Alon)
Thu Nov 27: Thanksgiving break (no class)
Lec 24 (Tue Dec 02): Module networks (Aviv Regev)
Lec 25 (Thu Dec 04): Synthetic biology (Tom Knight); project Phase III (due 12/08)
Lec 26 (Tue Dec 09): Final presentations, part I (11am)
Lec 27: Final presentations, part II (5pm)

Modeling biological sequences
Regions along a genome: intergenic, CpG island, promoter, first exon, intron, other exons, intron.
Example sequence: GGTTACAGGATTATGGGTTACAGGTAACCGTTGTACTCACCGGGTTACAGGATTATGGGTTACAGGTAACCGGTACTCACCGGGTTACAGGATTATGGTAACGGTACTCACCGGGTTACAGGATTGTTACAGG
We want three abilities:
Ability to generate DNA sequences of a certain type: not an exact alignment to a previously known gene, but preserving the 'properties' of the type rather than an identical sequence.
Ability to recognize DNA sequences of a certain type: which (hidden) state is most likely to have generated the observations; find the set of states and transitions that generated a long sequence.
Ability to learn the distinguishing characteristics of each type: train our generative models on large datasets and learn to classify unlabelled data.

Markov Chains & Hidden Markov Models
(Figure: a Markov chain over the CpG-island states A+, C+, G+, T+ with transition probabilities a_AT, a_AC, a_GT, a_GC, etc.; each state emits its own nucleotide, e.g. the A+ state emits A with probability 1 and C, G, T with probability 0.)
Markov chain: Q (states), p (initial state probabilities), A (transition probabilities).
HMM: Q (states), V (observations), p (initial state probabilities), A (transition probabilities), E (emission probabilities).

HMM nomenclature
x is the (observed) sequence, π is the (hidden) path, a_st are transition probabilities, e_s(x_i) are emission probabilities.
Joint probability of a sequence and a path (initial transition, then one emission and one transition per step):
P(x, π) = a_{0,π1} * Π_i [ e_{πi}(x_i) * a_{πi,πi+1} ]
Find the path π* that maximizes the total joint probability: π* = argmax_π P(x, π).

HMM for the dishonest casino
Two states: FAIR and LOADED.
Transitions P(π_i = l | π_{i-1} = k): a_FF = 0.95, a_FL = 0.05, a_LF = 0.05, a_LL = 0.95.
Emissions: e_F(x) = P(x_i = x | π_i = F) = 1/6 for x = 1..6 (fair die); e_L(x) = P(x_i = x | π_i = L) = 1/10 for x = 1..5 and e_L(6) = 1/2 (loaded die).
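To make the later algorithm sketches concrete, here is one way the casino model above might be written down in Python. The dictionary names (INIT, TRANS, EMIT, STATES) and the uniform start probabilities are my own choices for illustration, not part of the slides.

# A minimal encoding of the dishonest-casino HMM (illustrative sketch).
# States F (fair) and L (loaded); symbols are the die faces 1..6.
INIT  = {'F': 0.5, 'L': 0.5}                      # a_0k: assumed uniform start (not given on the slide)
TRANS = {('F', 'F'): 0.95, ('F', 'L'): 0.05,      # a_FF, a_FL
         ('L', 'F'): 0.05, ('L', 'L'): 0.95}      # a_LF, a_LL
EMIT  = {'F': {x: 1/6 for x in range(1, 7)},      # fair die: e_F(x) = 1/6
         'L': {x: 1/10 for x in range(1, 6)}}     # loaded die: e_L(1..5) = 1/10
EMIT['L'][6] = 1/2                                # e_L(6) = 1/2
STATES = ['F', 'L']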

HMM for CpG islands
Build a single model that combines both Markov chains:
'+' states A+, C+, G+, T+ emit the symbols A, C, G, T inside CpG islands; '-' states A-, C-, G-, T- emit A, C, G, T outside islands. Each state emits its own nucleotide with probability 1, so the emission probabilities are distinct for the '+' and '-' states.
Infer the most likely set of states giving rise to the observed emissions: 'paint' the sequence with + and - states.
We would like to build a single model for the entire sequence that incorporates both Markov chains, with two states (e.g. A+ and A-) for each nucleotide symbol. Within each group the transition probabilities are the same as in the separate Markov models, but there are small probabilities of switching between the components; a switch from + to - is more likely than from - to +. There is no 1-to-1 correspondence between states and symbols: each state might emit different symbols, so there is no way to say directly which state emitted symbol x_i.
Question: why do we need so many states? In the dishonest casino we only had 2 states (fair / loaded); why 8 states here (4 CpG+ / 4 CpG-)? Because the states encode a 'memory' of the previous nucleotide, which lets the model count nucleotide transitions.

The main questions on HMMs: SCORING, PARSING, LEARNING
1. Scoring = joint probability of a sequence and a path, given the model.
   GIVEN an HMM M, a path π, and a sequence x, FIND P[x, π | M].
   'Running the model': simply multiply emission and transition probabilities. Application: 'all fair' vs. 'all loaded' comparisons.
2. Decoding = parsing a sequence into the optimal series of hidden states.
   GIVEN an HMM M and a sequence x, FIND the sequence π* of states that maximizes P[x, π | M].
   Viterbi algorithm: dynamic programming, max score over all paths, trace back pointers to find the path.
3. Model evaluation = total probability of a sequence, summed across all paths.
   GIVEN an HMM M and a sequence x, FIND the total probability P[x | M] summed across all paths.
   Forward algorithm: sum scores over all paths (same result as backward).
4. State likelihood = total probability that emission x_i came from state k, across all paths.
   GIVEN an HMM M and a sequence x, FIND the total probability P[π_i = k | x, M].
   Posterior decoding: run the forward and backward algorithms to and from state π_i = k.
5. Supervised learning = optimize the parameters of a model given labeled training data.
   GIVEN an HMM M with unspecified transition/emission probabilities and a labeled sequence x, FIND parameters θ = (e_i, a_ij) that maximize P[x | θ].
   Simply count the frequency of each emission and transition observed in the training data.
6. Unsupervised learning = optimize the parameters of a model given unlabeled training data.
   GIVEN an HMM M with unspecified transition/emission probabilities and an unlabeled sequence x, FIND parameters θ that maximize P[x | θ].
   Viterbi training: guess parameters, find the optimal Viterbi path (#2), update parameters (#5), iterate.
   Baum-Welch training: guess parameters, sum over all emissions/transitions (#4), update (#5), iterate.

1. Scoring: multiply emissions and transitions.

1. Scoring: probability of a given path π and observations x
Example: sequence x = C G C G with path π = C+, G-, C-, G+ (between the start and end states):
P(π, x) = a_{0,C+} * e_{C+}(C) * a_{C+,G-} * e_{G-}(G) * a_{G-,C-} * e_{C-}(C) * a_{C-,G+} * e_{G+}(G) * a_{G+,0}
        = (a_{0,C+} * 1) * (a_{C+,G-} * 1) * (a_{G-,C-} * 1) * (a_{C-,G+} * 1) * a_{G+,0}
since in this model each state emits its own nucleotide with probability 1.
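A minimal sketch of the scoring computation, applied here to the two-state casino model defined earlier rather than the 8-state CpG model on the slide; the function name joint_prob and the example rolls are illustrative, and the end transition a_{πN,0} is omitted because the toy model has no explicit end state.

def joint_prob(x, path, init, trans, emit):
    """P(x, pi) = a_{0,pi_1} * prod_i [ e_{pi_i}(x_i) * a_{pi_{i-1}, pi_i} ] (no end state)."""
    p = init[path[0]] * emit[path[0]][x[0]]
    for i in range(1, len(x)):
        p *= trans[(path[i - 1], path[i])] * emit[path[i]][x[i]]
    return p

# e.g. ten made-up rolls, scored under "all fair" vs. "all loaded" paths
rolls = [1, 6, 3, 6, 6, 2, 6, 5, 6, 6]
print(joint_prob(rolls, ['F'] * 10, INIT, TRANS, EMIT))   # "all fair" score
print(joint_prob(rolls, ['L'] * 10, INIT, TRANS, EMIT))   # "all loaded" score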

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

2. Decoding: How can we find the most likely path? Viterbi algorithm

Finding the most likely state path
(Trellis figure: one column per observed symbol C, G, C, G; one row per hidden state A+, C+, G+, T+, A-, C-, G-, T-, plus start and end states.)
Given the observed emissions, what was the path?

Finding the most likely path
Find the path π* that maximizes the total joint probability P[x, π], where (initial transition, then one emission and one transition per step):
P(x, π) = a_{0,π1} * Π_i [ e_{πi}(x_i) * a_{πi,πi+1} ]

Calculate maximum P(x,) recursively … … ajk k Vk(i) hidden states Vj(i-1) j ek … xi-1 xi observations Assume we know Vj for the previous time step (i-1) Calculate Vk(i) = ek(xi) * maxj ( Vj(i-1)  ajk ) max ending in state j at step i Transition from state j current max this emission all possible previous states j

The Viterbi Algorithm
Input: x = x_1 … x_N
Initialization: V_0(0) = 1; V_k(0) = 0 for all k > 0
Iteration: V_k(i) = e_k(x_i) * max_j [ a_jk * V_j(i-1) ]
Termination: P(x, π*) = max_k V_k(N)
Traceback: follow the max pointers back through the dynamic-programming matrix V_k(i) (states 1..K by positions x_1 … x_N).
In practice: use log scores for the computation.
Running time and space: time O(K^2 N), space O(KN).
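A possible Viterbi implementation in log space, following the recurrence above; it assumes the INIT/TRANS/EMIT/STATES dictionaries from the casino sketch earlier, and the function name is mine.

import math

def viterbi(x, states, init, trans, emit):
    """Most likely state path: V_k(i) = log e_k(x_i) + max_j [ V_j(i-1) + log a_jk ]."""
    V = [{k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]                                   # back-pointers for the traceback
    for i in range(1, len(x)):
        V.append({}); ptr.append({})
        for k in states:
            best_j = max(states, key=lambda j: V[i - 1][j] + math.log(trans[(j, k)]))
            V[i][k] = math.log(emit[k][x[i]]) + V[i - 1][best_j] + math.log(trans[(best_j, k)])
            ptr[i][k] = best_j
    last = max(states, key=lambda k: V[-1][k])   # best final state
    path = [last]
    for i in range(len(x) - 1, 0, -1):           # follow the pointers back
        path.append(ptr[i][path[-1]])
    return path[::-1], max(V[-1].values())

# e.g. path, logp = viterbi([1, 2, 6, 6, 3, 6, 6, 6, 4, 6], STATES, INIT, TRANS, EMIT)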

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

3. Model evaluation: total P(x | M), summed over all paths. The Forward algorithm.

Simple: given the model, generate some sequence x
Given an HMM, we can generate a sequence of length n as follows:
Start at state π_1 according to probability a_{0,π1}.
Emit letter x_1 according to probability e_{π1}(x_1).
Go to state π_2 according to probability a_{π1,π2}.
… and so on, until emitting x_n.
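A small sketch of 'running the model' as a sampler, assuming the casino dictionaries defined earlier; the helper name sample_sequence is mine.

import random

def sample_sequence(n, states, init, trans, emit):
    """Generate a (path, sequence) pair of length n by running the model forward."""
    def pick(dist):                              # draw one key with probability proportional to its value
        return random.choices(list(dist), weights=list(dist.values()))[0]
    path = [pick(init)]
    for _ in range(n - 1):
        path.append(pick({k: trans[(path[-1], k)] for k in states}))
    seq = [pick(emit[k]) for k in path]          # one emission per visited state
    return path, seq

# e.g. path, rolls = sample_sequence(50, STATES, INIT, TRANS, EMIT)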

Complex: given x, was it generated by the model?
Given a sequence x, what is the probability that x was generated by the model (using any path)?
P(x) = Σ_π P(x, π)
Challenge: there is an exponential number of paths.
(Cheap) alternative: calculate the probability over the maximum (Viterbi) path π*.
(Real) solution: calculate the sum iteratively using dynamic programming.

The Forward Algorithm: derivation
Define the forward probability:
f_l(i) = P(x_1 … x_i, π_i = l)
       = Σ_{π1…πi-1} P(x_1 … x_{i-1}, π_1, …, π_{i-2}, π_{i-1}, π_i = l) e_l(x_i)
       = Σ_k Σ_{π1…πi-2} P(x_1 … x_{i-1}, π_1, …, π_{i-2}, π_{i-1} = k) a_kl e_l(x_i)
       = Σ_k f_k(i-1) a_kl e_l(x_i)
       = e_l(x_i) Σ_k f_k(i-1) a_kl

Calculate the total probability Σ_π P(x, π) recursively
Assume we know f_j for the previous time step (i-1). Then calculate
f_k(i) = e_k(x_i) * Σ_j ( f_j(i-1) * a_jk )
i.e. the current sum for state k at step i is this emission, times the sum over every possible previous state j of the sum ending in state j at step i-1 times the transition from state j to k.

The Forward Algorithm
Input: x = x_1 … x_N
Initialization: f_0(0) = 1; f_k(0) = 0 for all k > 0
Iteration: f_k(i) = e_k(x_i) * Σ_j a_jk f_j(i-1)
Termination: P(x) = Σ_k f_k(N)
The dynamic-programming matrix f_k(i) has states 1..K by positions x_1 … x_N.
In practice: sums are awkward in log space, so use the approximation log(p + q) = log p + log(1 + exp(log q - log p)), or rescale the probabilities at each position.
Running time and space: time O(K^2 N), space O(KN).
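One way the forward recursion might look in log space, using the log-sum-exp identity mentioned above; the names are illustrative, and since the toy model has no explicit end state the termination is simply the sum of f_k(N).

import math

def logsumexp(vals):
    """log(sum_i exp(v_i)), computed stably by factoring out the maximum."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def forward(x, states, init, trans, emit):
    """f_k(i) = e_k(x_i) * sum_j f_j(i-1) a_jk, kept in log space; returns (f, log P(x))."""
    f = [{k: math.log(init[k]) + math.log(emit[k][x[0]]) for k in states}]
    for i in range(1, len(x)):
        f.append({k: math.log(emit[k][x[i]]) +
                     logsumexp([f[i - 1][j] + math.log(trans[(j, k)]) for j in states])
                  for k in states})
    return f, logsumexp(list(f[-1].values()))

# e.g. f, log_px = forward(rolls, STATES, INIT, TRANS, EMIT)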

Summary
Generative model: hidden states, observed sequence.
'Running' the model: generate a random sequence.
Observing a sequence:
  What is the most likely path generating it? The Viterbi algorithm.
  What is the total probability of generating it? Sum probabilities over all paths: the Forward algorithm.
Next: classification. What is the probability that "CGGTACG" came from CpG+?

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

4. State likelihood: find the likelihood that an emission x_i was generated by a given state.

Calculate P(π_7 = CpG+ | x_7 = G)
With no knowledge (no characters): P(π_i = k) is the prior, the most likely state a priori, i.e. the fraction of time the Markov chain spends in each state.
With very little knowledge (just that character): P(π_i = k | x_i = G) is the prior combined with the most likely emission, i.e. emission probabilities adjusted for the time spent in each state.
With knowledge of the entire sequence (all characters): P(π_i = k | x = AGCGCG…GATTATCGTCGTA) sums over all paths that emit 'G' at position 7: posterior decoding.

Motivation for the Backward Algorithm
We want to compute P(π_i = k | x), the probability distribution of the state at the i-th position, given x.
We start by computing
P(π_i = k, x) = P(x_1 … x_i, π_i = k, x_{i+1} … x_N)
             = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | x_1 … x_i, π_i = k)
             = P(x_1 … x_i, π_i = k) P(x_{i+1} … x_N | π_i = k)
             = [forward, f_k(i)] * [backward, b_k(i)]

The Backward Algorithm: derivation
Define the backward probability:
b_k(i) = P(x_{i+1} … x_N | π_i = k)
       = Σ_{πi+1…πN} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1}, …, π_N | π_i = k)
       = Σ_l Σ_{πi+1…πN} P(x_{i+1}, x_{i+2}, …, x_N, π_{i+1} = l, π_{i+2}, …, π_N | π_i = k)
       = Σ_l e_l(x_{i+1}) a_kl Σ_{πi+2…πN} P(x_{i+2}, …, x_N, π_{i+2}, …, π_N | π_{i+1} = l)
       = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)

Calculate the total probability to the end of the sequence recursively
Assume we know b_l for the next time step (i+1). Then calculate
b_k(i) = Σ_l ( e_l(x_{i+1}) * a_kl * b_l(i+1) )
i.e. sum over all possible next states l the transition to that next state, times the next emission, times the probability sum from state l to the end.

The Backward Algorithm
Input: x = x_1 … x_N
Initialization: b_k(N) = a_k0 for all k
Iteration: b_k(i) = Σ_l e_l(x_{i+1}) a_kl b_l(i+1)
Termination: P(x) = Σ_l a_0l e_l(x_1) b_l(1)
The dynamic-programming matrix b_k(i) has states 1..K by positions x_1 … x_N.
In practice: as for the forward algorithm, work with log scores (log-sum-exp) or rescale the probabilities.
Running time and space: time O(K^2 N), space O(KN).
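A matching backward sketch in log space; it reuses logsumexp from the forward sketch, and since the toy model has no explicit end state it initializes b_k(N) to 1 (log 0.0) rather than a_k0.

import math

def backward(x, states, trans, emit):
    """b_k(i) = sum_l a_kl e_l(x_{i+1}) b_l(i+1), kept in log space."""
    n = len(x)
    b = [dict() for _ in range(n)]
    b[n - 1] = {k: 0.0 for k in states}          # log 1: no explicit end transitions in this sketch
    for i in range(n - 2, -1, -1):
        b[i] = {k: logsumexp([math.log(trans[(k, l)]) + math.log(emit[l][x[i + 1]]) + b[i + 1][l]
                              for l in states])
                for k in states}
    return b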

Putting it all together: posterior decoding
P(π_i = k | x) = f_k(i) * b_k(i) / P(x): the probability that the i-th state is k, given all emissions x (a matrix of states 1..K by positions x_1 … x_N).
Posterior decoding: define the most likely state for every position of sequence x:
π̂_i = argmax_k P(π_i = k | x)
For classification, the posterior decoding 'path' π̂_i is more informative than the Viterbi path π*: it is a more refined measure of which hidden states generated x. However, it may give an invalid sequence of states, since not all j→k transitions may be possible.
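Putting the two passes together, a posterior-decoding sketch that assumes the forward and backward functions defined above; the returned path is the per-position argmax of the posterior, with the caveat from the slide that it need not be a valid state sequence.

import math

def posterior_decode(x, states, init, trans, emit):
    """P(pi_i = k | x) = f_k(i) * b_k(i) / P(x); returns (posterior path, posteriors)."""
    f, log_px = forward(x, states, init, trans, emit)     # forward sketch above
    b = backward(x, states, trans, emit)                  # backward sketch above
    post = [{k: math.exp(f[i][k] + b[i][k] - log_px) for k in states}
            for i in range(len(x))]
    return [max(p, key=p.get) for p in post], post

# e.g. path_hat, post = posterior_decode(rolls, STATES, INIT, TRANS, EMIT)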

Summary
Generative model: hidden states, observed sequence.
'Running' the model: generate a random sequence.
Observing a sequence:
  What is the most likely path generating it? The Viterbi algorithm.
  What is the total probability of generating it? Sum probabilities over all paths: the Forward algorithm.
Classification:
  What is the probability that "CGGTACG" came from CpG+? The Forward + Backward algorithms.
  What is the most probable state for every position? Posterior decoding.

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

5. Supervised learning: estimate model parameters based on labeled training data.

Two learning scenarios
Case 1. Estimation when the "right answer" is known. Examples:
  GIVEN: a genomic region x = x_1 … x_1,000,000 where we have good (experimental) annotations of the CpG islands.
  GIVEN: the casino player allows us to observe him one evening as he changes dice and produces 10,000 rolls.
Case 2. Estimation when the "right answer" is unknown. Examples:
  GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
  GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.
QUESTION: update the parameters θ of the model to maximize P(x | θ).

Case 1. When the right answer is known
Given x = x_1 … x_N for which the true path π = π_1 … π_N is known, define:
A_kl  = number of times the k→l transition occurs in π
E_k(b) = number of times state k in π emits b in x
We can show that the maximum likelihood parameters θ are:
a_kl = A_kl / Σ_i A_ki          e_k(b) = E_k(b) / Σ_c E_k(c)

Case 1. When the right answer is known
Intuition: when we know the underlying states, the best estimate is the average frequency of transitions and emissions that occur in the training data.
Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable, with zero probabilities (very bad).
Example: given 10 casino rolls, we observe
x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
π = F, F, F, F, F, F, F, F, F, F
Then: a_FF = 1, a_FL = 0; e_F(1) = e_F(3) = e_F(6) = 0.2; e_F(2) = 0.3; e_F(4) = 0; e_F(5) = 0.1.
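A counting sketch for the supervised case, reproducing the ten-roll example above; the helper name supervised_ml is mine. Note that it gives e_F(6) = 0.2, since the labeled rolls contain two sixes out of ten.

from collections import Counter

def supervised_ml(x, path, states, symbols):
    """a_kl = A_kl / sum_m A_km,  e_k(b) = E_k(b) / sum_c E_k(c); no pseudocounts."""
    A = Counter(zip(path, path[1:]))             # transition counts A_kl
    E = Counter(zip(path, x))                    # emission counts E_k(b)
    a = {(k, l): A[(k, l)] / max(1, sum(A[(k, m)] for m in states))
         for k in states for l in states}        # max(1, .) avoids 0/0 for unvisited states
    e = {k: {b: E[(k, b)] / max(1, sum(E[(k, c)] for c in symbols)) for b in symbols}
         for k in states}
    return a, e

x  = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]              # the ten labeled rolls from the slide
pi = ['F'] * 10
a, e = supervised_ml(x, pi, ['F', 'L'], range(1, 7))
# a[('F','F')] == 1.0,  e['F'][2] == 0.3,  e['F'][4] == 0.0,  e['F'][6] == 0.2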

Pseudocounts
Solution for small training sets: add pseudocounts.
A_kl  = number of times the k→l transition occurs in π, plus r_kl
E_k(b) = number of times state k in π emits b in x, plus r_k(b)
r_kl and r_k(b) are pseudocounts representing our prior belief.
Larger pseudocounts encode a strong prior belief; small pseudocounts (< 1) are used just to avoid zero probabilities.

Pseudocounts example: the dishonest casino
We will observe the player for one day, 500 rolls. Reasonable pseudocounts:
r_0F = r_0L = r_F0 = r_L0 = 1
r_FL = r_LF = r_FF = r_LL = 1
r_F(1) = r_F(2) = … = r_F(6) = 20   (strong belief that fair is fair)
r_L(1) = r_L(2) = … = r_L(6) = 5    (wait and see for the loaded die)
The numbers above are pretty arbitrary; assigning priors is an art.
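The same counting with pseudocounts folded in, a sketch only; the emission pseudocounts shown mirror the numbers on the slide, while the start/end pseudocounts (r_0F, r_F0, …) are not modeled because this sketch has no explicit begin/end state.

from collections import Counter

def ml_with_pseudocounts(x, path, states, symbols, r_trans, r_emit):
    """Same counts as supervised_ml, plus pseudocounts r_kl and r_k(b) before normalizing."""
    A = Counter(zip(path, path[1:]))
    E = Counter(zip(path, x))
    a = {(k, l): (A[(k, l)] + r_trans[(k, l)]) /
                 sum(A[(k, m)] + r_trans[(k, m)] for m in states)
         for k in states for l in states}
    e = {k: {b: (E[(k, b)] + r_emit[k][b]) /
                sum(E[(k, c)] + r_emit[k][c] for c in symbols) for b in symbols}
         for k in states}
    return a, e

# pseudocounts patterned on the slide
r_t = {('F','F'): 1, ('F','L'): 1, ('L','F'): 1, ('L','L'): 1}
r_e = {'F': {b: 20 for b in range(1, 7)}, 'L': {b: 5 for b in range(1, 7)}}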

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

6. Unsupervised learning: estimate model parameters based on unlabeled training data.

Learning case 2. When the right answer is unknown
We don't know the true A_kl, E_k(b).
Idea: estimate our "best guess" of what A_kl and E_k(b) are, update the parameters of the model based on that guess, and repeat.

Case 2. When the right answer is unknown
Starting with our best guess of a model M with parameters θ:
Given x = x_1 … x_N for which the true path π = π_1 … π_N is unknown, we can get to a provably more likely parameter set θ.
Principle: EXPECTATION MAXIMIZATION
1. Estimate A_kl, E_k(b) in the training data.
2. Update θ according to A_kl, E_k(b).
3. Repeat 1 & 2 until convergence.

Estimating new parameters
To estimate A_kl: at each position i of sequence x, find the probability that the k→l transition is used:
P(π_i = k, π_{i+1} = l | x) = [1/P(x)] * P(π_i = k, π_{i+1} = l, x_1 … x_N) = Q / P(x)
where
Q = P(x_1 … x_i, π_i = k, π_{i+1} = l, x_{i+1} … x_N)
  = P(π_{i+1} = l, x_{i+1} … x_N | π_i = k) * P(x_1 … x_i, π_i = k)
  = P(π_{i+1} = l, x_{i+1} x_{i+2} … x_N | π_i = k) * f_k(i)
  = P(x_{i+2} … x_N | π_{i+1} = l) * P(x_{i+1} | π_{i+1} = l) * P(π_{i+1} = l | π_i = k) * f_k(i)
  = b_l(i+1) * e_l(x_{i+1}) * a_kl * f_k(i)
So, for one such transition at time step i→i+1:
P(π_i = k, π_{i+1} = l | x, θ) = f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)

Estimating new parameters
Summing over all k→l transitions, at any time step i:
A_kl = Σ_i P(π_i = k, π_{i+1} = l | x, θ) = Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)
Similarly,
E_k(b) = [1/P(x)] Σ_{i : xi = b} f_k(i) b_k(i)

Estimating new parameters
If we have several training sequences x^1, …, x^M, each of length N, sum over all training sequences, all k→l transitions, and all time steps i:
A_kl = Σ_x Σ_i P(π_i = k, π_{i+1} = l | x, θ) = Σ_x Σ_i f_k(i) a_kl e_l(x_{i+1}) b_l(i+1) / P(x | θ)
Similarly,
E_k(b) = Σ_x [1/P(x)] Σ_{i : xi = b} f_k(i) b_k(i)

The Baum-Welch Algorithm
Initialization: pick the best guess for the model parameters (or arbitrary values).
Iteration:
1. Forward
2. Backward
3. Calculate A_kl, E_k(b)
4. Calculate the new model parameters a_kl, e_k(b)
5. Calculate the new log-likelihood P(x | θ)
The likelihood is guaranteed to be higher by expectation maximization.
Repeat until P(x | θ) does not change much.
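A single Baum-Welch update might look like the sketch below, assuming the forward and backward functions from the earlier sketches; the initial-state probabilities are kept fixed for simplicity, and in practice one would loop this step until the log-likelihood stops improving.

import math

def baum_welch_step(x, states, symbols, init, trans, emit):
    """One EM update: expected counts A_kl, E_k(b) from forward/backward, then re-normalize."""
    f, log_px = forward(x, states, init, trans, emit)     # forward sketch above
    b = backward(x, states, trans, emit)                  # backward sketch above
    A = {(k, l): 0.0 for k in states for l in states}
    E = {k: {s: 0.0 for s in symbols} for k in states}
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += math.exp(f[i][k] + b[i][k] - log_px)          # posterior of state k at i
            if i + 1 < len(x):
                for l in states:                                        # expected k->l transitions
                    A[(k, l)] += math.exp(f[i][k] + math.log(trans[(k, l)]) +
                                          math.log(emit[l][x[i + 1]]) + b[i + 1][l] - log_px)
    new_trans = {(k, l): A[(k, l)] / sum(A[(k, m)] for m in states)
                 for k in states for l in states}
    new_emit = {k: {s: E[k][s] / sum(E[k].values()) for s in symbols} for k in states}
    return new_trans, new_emit, log_px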

The Baum-Welch Algorithm: comments
Time complexity: (number of iterations) * O(K^2 N).
Guaranteed to increase the log likelihood of the model:
P(θ | x) = P(x, θ) / P(x) = P(x | θ) P(θ) / P(x)
Not guaranteed to find the globally best parameters: it converges to a local optimum, depending on the initial conditions.
Too many parameters / too large a model leads to overtraining.

Alternative: Viterbi Training
Initialization: same as Baum-Welch.
Iteration:
1. Perform Viterbi to find π*.
2. Calculate A_kl, E_k(b) according to π*, plus pseudocounts.
3. Calculate the new parameters a_kl, e_k(b).
Repeat until convergence.
Notes: convergence is guaranteed (why?). It does not maximize P(x | θ), and in general it performs worse than Baum-Welch.
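For comparison, a Viterbi-training sketch under the same assumptions, reusing the viterbi and ml_with_pseudocounts helpers defined earlier; the small pseudocounts and the fixed iteration count are arbitrary choices, not from the slides.

def viterbi_training(x, states, symbols, init, trans, emit, n_iter=20):
    """Alternate: decode with Viterbi, then re-estimate by counting (plus tiny pseudocounts)."""
    r_t = {(k, l): 1e-2 for k in states for l in states}        # small pseudocounts avoid zeros
    r_e = {k: {s: 1e-2 for s in symbols} for k in states}
    for _ in range(n_iter):                                     # a convergence test could replace this
        path, _ = viterbi(x, states, init, trans, emit)         # Viterbi sketch above
        trans, emit = ml_with_pseudocounts(x, path, states, symbols, r_t, r_e)
    return trans, emit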

The main questions on HMMs (recap of the six problems above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning).

The main questions on HMMs: pop quiz. (The same six problems as above: scoring, decoding, model evaluation, state likelihood, supervised learning, unsupervised learning.)

What have we learned?
Generative model: hidden states / observed sequence.
'Running' the model: generate a random sequence.
Observing a sequence:
  What is the most likely path generating it? The Viterbi algorithm.
  What is the total probability of generating it? Sum probabilities over all paths: the Forward algorithm.
Classification:
  What is the probability that "CGGTACG" came from CpG+? The Forward + Backward algorithms.
  What is the most probable state for every position? Posterior decoding.
Training: estimating the parameters of the HMM.
  When the state sequence is known: simply compute the maximum likelihood A and E.
  When the state sequence is not known:
    Baum-Welch: iterative estimation over all paths / expected frequencies.
    Viterbi training: iterative estimation over the best path / frequencies.