Class review. Sushmita Roy, BMI/CS 576, Dec 11th, 2014.

What you should know
Markov models and Hidden Markov models
– Forward, Viterbi and backward algorithms
– Parameter estimation
Clustering
– Hierarchical, flat, model-based (Gaussian mixture models)
Network modeling and analysis
– Bayesian networks vs dependency networks
– Network reconstruction algorithms
– Properties of networks
Phylogenetic trees
– Distance-based methods: Neighbor Joining, UPGMA
– Parsimony methods: Weighted Parsimony algorithm
– Probabilistic methods: Felsenstein algorithm
Sequence alignment
– Global and local alignment
– Pairwise and multiple sequence alignment

A Markov chain model for DNA sequence
[Figure: a Markov chain over the nucleotide states A, C, G and T with a begin state; the edges between states are transitions, each labeled with a transition probability]

Estimating parameters of a Markov chain
The parameters of a Markov chain are the transition probabilities
The parameters that need to be estimated are determined by the structure of the model
A Laplace correction (pseudocounts), if asked for, is applied only to the parameters that need to be estimated
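
As a concrete illustration, here is a minimal Python sketch of estimating Markov chain transition probabilities from training sequences, with an optional Laplace (add-one) correction; the function name and interface are illustrative choices (no begin/end states are modeled here).

from collections import defaultdict

def estimate_transition_probs(sequences, states="ACGT", pseudocount=1):
    """Estimate Markov chain transition probabilities from training sequences.

    pseudocount=1 gives the Laplace correction; pseudocount=0 gives plain
    maximum-likelihood estimates.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, curr in zip(seq, seq[1:]):
            counts[prev][curr] += 1

    probs = {}
    for s in states:
        total = sum(counts[s][t] for t in states) + pseudocount * len(states)
        probs[s] = {t: (counts[s][t] + pseudocount) / total for t in states}
    return probs

# Example: two short training sequences
print(estimate_transition_probs(["ACGTACGT", "AACCGGTT"]))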

Homework 4, problem 1a: estimated transition probabilities (rows = "from" state, columns = "to" state)

From \ To    A      C      G      End
Begin        4/6    2/6    -      -
A            3/14   2/14   9/14   -
C            4/11   3/11   1/11   3/11
G            4/16   4/16   5/16   3/16

The dashes mark transitions that are not part of the model structure, so their probabilities don't need to be estimated
1. The outgoing transition probabilities of each state must sum to 1
2. The Laplace correction for "Begin" adds a count of 2 to the denominator (two outgoing transitions)
3. The Laplace correction for "A" adds 3 to the denominator (three outgoing transitions)
4. The Laplace correction for "G" and "C" adds 4 to the denominator (four outgoing transitions each)

Hidden Markov models: the inference and learning problems
Forward algorithm
– Compute the likelihood of an observed sequence
Viterbi algorithm
– Find the most likely state assignment
Baum-Welch algorithm
– E-step: Forward algorithm, Backward algorithm
– M-step: parameter estimation based on expected counts of transitions and emissions

Formally defining an HMM
States
Emission alphabet
Parameters
– State transition probabilities: the probability of moving from the state at time t to the state at time t+1
– Emission probabilities: the probability of emitting each symbol from a given state

An example HMM
[Figure: an HMM with begin and end states and four emitting states; each state has an emission distribution over {A, C, G, T} (A 0.4, C 0.1, G 0.2, T 0.3; A 0.1, C 0.4, G 0.4, T 0.1; A 0.2, C 0.3, G 0.3, T 0.2; A 0.4, C 0.1, G 0.1, T 0.4), and the edges are labeled with transition probabilities. The callouts indicate the probability of emitting character A in state 2 and the probability (0.8) of a transition out of state 1.]

Three important questions in HMMs
How likely is an HMM to have generated a given sequence?
– Forward algorithm
What is the most likely "path" for generating a sequence of observations?
– Viterbi algorithm
How can we learn an HMM from a set of sequences?
– Forward-backward or Baum-Welch (an EM algorithm)

Reviewing the notation
States with emissions will be numbered from 1 to K
– 0 denotes the begin state, N the end state
x_t: observed character at position t
x = x_1 … x_T: observed sequence
π = π_1 … π_T: hidden state sequence or path
a_kl: transition probability from state k to state l
e_k(b): emission probability, the probability of emitting symbol b from state k

How likely is a given sequence: Forward algorithm
Define f_k(t) as the probability of observing the prefix x_1 … x_t and ending in state k at time t:
  f_k(t) = P(x_1, …, x_t, π_t = k)
This can be written recursively as
  f_l(t) = e_l(x_t) Σ_k f_k(t-1) a_kl

Steps of the Forward algorithm
Initialization (0 denotes the "begin" state):
  f_0(0) = 1, f_k(0) = 0 for k > 0
Recursion: for t = 1 to T
  f_l(t) = e_l(x_t) Σ_k f_k(t-1) a_kl
Termination:
  P(x) = Σ_k f_k(T) a_kN
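
A minimal Python sketch of the forward algorithm, assuming the HMM is given as dictionaries of transition probabilities trans[k][l] and emission probabilities emit[k][b], with begin state 0 and end state "N"; these names and the interface are illustrative, not prescribed by the slides.

def forward(x, states, trans, emit, begin=0, end="N"):
    """Forward algorithm: returns P(x) and the table f[k][t].

    states: list of emitting states (excluding begin/end)
    trans[k][l]: transition probability from state k to state l
    emit[k][b]: probability of emitting symbol b from state k
    """
    T = len(x)
    f = {k: [0.0] * (T + 1) for k in states}
    # Recursion: f_l(t) = e_l(x_t) * sum_k f_k(t-1) * a_kl
    for t in range(1, T + 1):
        for l in states:
            if t == 1:
                prev = trans[begin][l]  # initialization via the begin state
            else:
                prev = sum(f[k][t - 1] * trans[k][l] for k in states)
            f[l][t] = emit[l][x[t - 1]] * prev
    # Termination: P(x) = sum_k f_k(T) * a_k,end
    return sum(f[k][T] * trans[k][end] for k in states), f

In practice the products underflow for long sequences, so one usually works with scaled or log-space values.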

Learning without hidden information
Learning is simple if we know the correct path for each sequence in our training set
Estimate parameters by counting the number of times each parameter is used across the training set
[Figure: an example training sequence C A G T with its known path through the numbered states, from the begin state (0) to the end state (5)]

Learning without hidden information
Transition probabilities:
  a_kl = n_kl / Σ_l' n_kl'   where n_kl is the number of transitions from state k to state l
Emission probabilities:
  e_k(c) = n_k(c) / Σ_c' n_k(c')   where n_k(c) is the number of times c is emitted from state k

Learning with hidden information
[Figure: the same training sequence C A G T, but the path from the begin state (0) to the end state (5) is unknown, indicated by "????"]
If we don't know the correct path for each sequence in our training set, consider all possible paths for the sequence
Estimate parameters through a procedure that counts the expected number of times each parameter is used across the training set

Learning HMM parameters with the Baum-Welch algorithm
Algorithm sketch:
– initialize the parameters of the model
– iterate until convergence
  - calculate the expected number of times each transition or emission is used
  - adjust the parameters to maximize the likelihood of these expected values

The expectation step
We need to know the probability of the symbol at position t being produced by state k, given the entire sequence x: P(π_t = k | x)
We also need to know the probability of the symbols at t and t+1 being produced by states k and l respectively, given the sequence x: P(π_t = k, π_{t+1} = l | x)
Given these we can compute our expected counts for state transitions and character emissions

Computing P(π_t = k | x)
First we compute the probability of the entire observed sequence with the t-th symbol being generated by state k:
  P(π_t = k, x) = f_k(t) b_k(t)
Then our quantity of interest is computed as
  P(π_t = k | x) = f_k(t) b_k(t) / P(x)
where P(x) is obtained from the forward algorithm

Computing P(π_t = k | x)
To compute this quantity we need both the forward and the backward algorithm
– Forward algorithm: f_k(t)
– Backward algorithm: b_k(t)

Steps of the backward algorithm
Initialization (t = T):
  b_k(T) = a_kN for all k
Recursion (t = T-1 down to 1):
  b_k(t) = Σ_l a_kl e_l(x_{t+1}) b_l(t+1)
Termination:
  P(x) = Σ_l a_0l e_l(x_1) b_l(1)
Note: the same quantity P(x) can also be obtained from the forward algorithm
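
A companion Python sketch of the backward algorithm, using the same hypothetical trans/emit dictionaries and begin/end conventions as the forward sketch above.

def backward(x, states, trans, emit, begin=0, end="N"):
    """Backward algorithm: returns P(x) and the table b[k][t]."""
    T = len(x)
    b = {k: [0.0] * (T + 1) for k in states}
    # Initialization: b_k(T) = a_k,end
    for k in states:
        b[k][T] = trans[k][end]
    # Recursion: b_k(t) = sum_l a_kl * e_l(x_{t+1}) * b_l(t+1)
    for t in range(T - 1, 0, -1):
        for k in states:
            b[k][t] = sum(trans[k][l] * emit[l][x[t]] * b[l][t + 1] for l in states)
    # Termination: P(x) = sum_l a_0l * e_l(x_1) * b_l(1)
    return sum(trans[begin][l] * emit[l][x[0]] * b[l][1] for l in states), b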

Computing P(π_t = k | x)
Using the forward and backward variables, this is computed as
  P(π_t = k | x) = f_k(t) b_k(t) / P(x)

Computing P(π_t = k, π_{t+1} = l | x)
This is the probability of the symbols at t and t+1 being emitted from states k and l, given the entire sequence x:
  P(π_t = k, π_{t+1} = l | x) = f_k(t) a_kl e_l(x_{t+1}) b_l(t+1) / P(x)

Putting it all together
Assume we are given J training instances x^1, …, x^j, …, x^J
Expectation step
– Using current parameter values, apply the forward and backward algorithms to each x^j
– Compute
  - the expected number of transitions between all pairs of states
  - the expected number of emissions for all states
Maximization step
– Using the current expected counts, compute the transition and emission probabilities

The expectation step: emission counts
We need the expected number of times c is emitted by state k:
  n_k(c) = Σ_j Σ_{t : x^j_t = c} P(π_t = k | x^j)
where x^j is the j-th training sequence and the inner sum is over the positions where c occurs in x^j

The expectation step: transition counts
The expected number of transitions from state k to state l:
  n_kl = Σ_j Σ_t P(π_t = k, π_{t+1} = l | x^j)

The maximization step
Estimate new emission parameters by
  e_k(c) = n_k(c) / Σ_c' n_k(c')
Just like in the simple case, but typically we'll do some "smoothing" (e.g. add pseudocounts)
Estimate new transition parameters by
  a_kl = n_kl / Σ_l' n_kl'

The Baum-Welch algorithm
Initialize the parameters of the HMM
Iterate until convergence
– initialize the expected counts n_kl and n_k(c) with pseudocounts
– E-step: for each training set sequence j = 1…n
  - calculate the forward and backward values for sequence j
  - add the contribution of sequence j to n_kl and n_k(c)
– M-step: update the HMM parameters using n_kl and n_k(c)
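
Tying the E- and M-steps together, here is a hedged Python sketch of one Baum-Welch iteration built on the forward and backward functions sketched above; the data structures and pseudocount handling are simplifications, not a reference implementation, and every state is assumed to receive some expected count.

def baum_welch_step(seqs, states, alphabet, trans, emit, begin=0, end="N", pseudo=0.0):
    """One E+M iteration of Baum-Welch; returns updated (trans, emit)."""
    # Expected counts, initialized with pseudocounts
    n_trans = {k: {l: pseudo for l in list(states) + [end]} for k in [begin] + list(states)}
    n_emit = {k: {c: pseudo for c in alphabet} for k in states}

    for x in seqs:
        T = len(x)
        px, f = forward(x, states, trans, emit, begin, end)
        _, b = backward(x, states, trans, emit, begin, end)
        for t in range(1, T + 1):
            for k in states:
                # Expected emission count uses P(pi_t = k | x) = f_k(t) b_k(t) / P(x)
                n_emit[k][x[t - 1]] += f[k][t] * b[k][t] / px
                if t < T:
                    # Expected transition counts use P(pi_t = k, pi_{t+1} = l | x)
                    for l in states:
                        n_trans[k][l] += f[k][t] * trans[k][l] * emit[l][x[t]] * b[l][t + 1] / px
                else:
                    n_trans[k][end] += f[k][T] * trans[k][end] / px
        # Transitions out of the begin state
        for l in states:
            n_trans[begin][l] += trans[begin][l] * emit[l][x[0]] * b[l][1] / px

    # M-step: renormalize the expected counts into probabilities
    new_trans = {k: {l: n_trans[k][l] / sum(n_trans[k].values()) for l in n_trans[k]}
                 for k in n_trans}
    new_emit = {k: {c: n_emit[k][c] / sum(n_emit[k].values()) for c in alphabet}
                for k in states}
    return new_trans, new_emit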

Baum-Welch algorithm example
Given
– an HMM with the parameters initialized as shown
  [Figure: a two-state HMM with begin and end states; one state has emission probabilities A 0.1, C 0.4, G 0.4, T 0.1 and the other A 0.4, C 0.1, G 0.1, T 0.4]
– two training sequences: TAG and ACG
We'll work through one iteration of Baum-Welch

Baum-Welch example (cont)
Determining the forward values for TAG
Here we compute just the values that are needed for computing successive values; for example, there is no point in calculating f_2(1)
In a similar way, we also compute the forward values for ACG

Baum-Welch example (cont)
Determining the backward values for TAG
Again, here we compute just the values that are needed
In a similar way, we also compute the backward values for ACG

Baum-Welch example (cont)
Determining the expected emission counts for state 1
– contribution of TAG
– contribution of ACG
*Note that the forward/backward values in these two columns differ; in each column they are computed for the sequence associated with that column

Baum-Welch example (cont)
Determining the expected transition counts (not using pseudocounts)
– contribution of TAG + contribution of ACG

Baum-Welch example (cont) Maximization step: determining probabilities for state 1

Clustering
Find groups of entities that exhibit similar attributes
Hierarchical clustering
– Different types of linkage
– Obtaining a flat clustering from a dendrogram
Flat clustering
– K-means
– Gaussian mixture models
Key issues to think about
– How to pick K, or how to decide where to cut in hierarchical clustering
– Validation and interpretation of clusters
– Distance metric to define dissimilarity between entities

K-means algorithm
Input: K, the number of clusters, and a set X = {x_1, …, x_n} of data points, where each x_i is a p-dimensional vector
Initialize
– Select initial cluster means μ_1, …, μ_K
Repeat until convergence
– Assign each x_i to the cluster C(i) with the closest mean: C(i) = argmin_c ||x_i - μ_c||²
– Re-estimate the mean of each cluster based on its new members
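
An illustrative Python/NumPy sketch of the K-means loop described above; the initialization strategy (random data points as means) and the convergence test are simple choices of mine, not prescribed by the slides.

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """K-means clustering. X: (n, p) array. Returns (assignments, means)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=K, replace=False)]  # initial cluster means
    for _ in range(max_iters):
        # Assignment step: each point goes to the nearest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Update step: recompute each cluster mean from its members
        new_means = np.array([X[assign == c].mean(axis=0) if np.any(assign == c) else means[c]
                              for c in range(K)])
        if np.allclose(new_means, means):  # stop when the means no longer change
            break
        means = new_means
    return assign, means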

K-means clustering
[Figure: an example in which our vectors have 2 dimensions; the points are profiles and the crosses are cluster centers]

K-means clustering
Each iteration involves two steps
– assignment of profiles to clusters
– re-computation of the means
[Figure: two panels illustrating the assignment step and the re-computation of the means]

K-means: updating the mean
To compute the mean of the c-th cluster:
  μ_c = (1 / n_c) Σ_{x_i in cluster c} x_i
where n_c is the number of genes in cluster c and the sum is over all genes in cluster c

K-means stopping criteria
– The assignment of objects to clusters doesn't change
– A fixed maximum number of iterations is reached
– The optimization criterion changes by only a small value

Gaussian mixture model based clustering
K-means is hard clustering
– At each iteration, a data point is assigned to one and only one cluster
We can do soft clustering based on Gaussian mixture models
– Each cluster is represented by a distribution (in our case a Gaussian)
– We assume the data are generated by a mixture of the Gaussians

Gaussian mixture model clustering
A model-based clustering approach
For K clusters, we will have K Gaussians
The Gaussian mixture model describes the probability density of a data point x as
  p(x) = Σ_{c=1..K} P(c) N(x | μ_c, Σ_c)
where P(c) is the prior probability of the c-th Gaussian and N(x | μ_c, Σ_c) is the c-th Gaussian
Clustering entails learning a Gaussian mixture model for the data to be clustered

Learning a Gaussian mixture model (GMM)
Assume we have N training data points (e.g. genes) and we know what K is
The parameters of the GMM are
– the mixture probabilities P(1), …, P(K)
– the means μ_1, …, μ_K
– the covariances Σ_1, …, Σ_K
It is common to ignore the off-diagonal elements of the covariance matrices; for example, for a 2-dimensional Gaussian, Σ_c is then a diagonal matrix with the per-dimension variances σ²_c1 and σ²_c2 on the diagonal

Learning a Gaussian mixture model (GMM)
If we knew the cluster assignments, estimating the means and variances would be easy
– Take the data points in cluster c and estimate the parameters of the c-th Gaussian from them
But we don't, so we will use the expectation-maximization (EM) algorithm to learn the GMM parameters
Recall that the EM algorithm is useful when we have hidden variables
What are the hidden variables here?
– The cluster assignments

Expectation step
We would like to estimate the probability that Z_ic = 1, i.e. that the c-th Gaussian generated data point x_i
That is, P(Z_ic = 1 | x_i) = P(c) N(x_i | μ_c, Σ_c) / Σ_c' P(c') N(x_i | μ_c', Σ_c')
We will use γ_ic to denote this quantity
We can think of γ_ic as the contribution of data point x_i to cluster c

Maximization step
Here we need to estimate the parameters of each Gaussian and the mixing weights:
  μ_c = Σ_i γ_ic x_i / Σ_i γ_ic
  σ²_cr = Σ_i γ_ic (x_ir - μ_cr)² / Σ_i γ_ic   (the variance for the r-th dimension)
  P(c) = (1/N) Σ_i γ_ic

Putting the E and M steps all together
Initialize the parameters using initial partitions of the data
Repeat until convergence
– Expectation step: compute γ_ic for each data point x_i and Gaussian c
– Maximization step: use the γ_ic to update P(c), μ_c and Σ_c for all c = 1 to K
– Compute the likelihood to check for convergence
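
A compact Python/NumPy sketch of EM for a diagonal-covariance GMM along the lines of the slides; the function and variable names (gmm_em, resp, etc.) are illustrative choices and numerical safeguards are kept to a minimum.

import numpy as np

def gmm_em(X, K, n_iters=100, seed=0):
    """EM for a Gaussian mixture with diagonal covariances.
    X: (n, p) array. Returns (weights, means, variances, responsibilities)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    weights = np.full(K, 1.0 / K)                    # mixture probabilities P(c)
    means = X[rng.choice(n, size=K, replace=False)]  # initial means from random points
    variances = np.ones((K, p)) * X.var(axis=0)      # per-dimension variances

    for _ in range(n_iters):
        # E-step: responsibilities gamma_ic = P(Z_ic = 1 | x_i)
        log_dens = (-0.5 * ((X[:, None, :] - means[None, :, :]) ** 2 / variances[None, :, :]
                            + np.log(2 * np.pi * variances[None, :, :]))).sum(axis=2)
        log_post = np.log(weights)[None, :] + log_dens
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the mixture weights, means and variances from the responsibilities
        Nc = resp.sum(axis=0)
        weights = Nc / n
        means = (resp.T @ X) / Nc[:, None]
        variances = (resp.T @ (X ** 2)) / Nc[:, None] - means ** 2 + 1e-8
    return weights, means, variances, resp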

GMM clustering example
Consider a one-dimensional clustering problem in which the data given are:
  x_1 = -4, x_2 = -3, x_3 = -1, x_4 = 3, x_5 = 5
Assume the number of Gaussians is K = 2, so we have parameters for Gaussian 1 and Gaussian 2 (their mixture probabilities, means and variances)
Assume some initial values for these parameters and write out the corresponding density functions of Gaussian 1 and Gaussian 2

GMM clustering example

GMM clustering example: E-step

GMM clustering example: M-step

GMM clustering example
Here we have shown just one step of the EM procedure
We would continue the E- and M-steps until convergence
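
As a stand-in for the worked numbers, the snippet below reuses the gmm_em sketch above on the five data points, with randomly chosen initial parameters rather than the initial values assumed in the example, so the resulting numbers are illustrative only.

import numpy as np

X = np.array([[-4.0], [-3.0], [-1.0], [3.0], [5.0]])          # the five 1-D data points
weights, means, variances, resp = gmm_em(X, K=2, n_iters=1)   # a single E+M step
print("responsibilities (gamma_ic):\n", np.round(resp, 3))
print("updated means:", means.ravel(), "updated variances:", variances.ravel())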

Comparing K-means and GMMs
K-means
– Hard clustering
– Optimizes within-cluster scatter
– Requires estimation of the means
GMMs
– Soft clustering
– Optimizes the likelihood of the data
– Requires estimation of the means, covariances and mixture probabilities

Networks
Different types of molecular networks
Network reconstruction
– Algorithms vary depending upon how they represent the relationships between nodes
Bayesian networks
– Representing biological networks as Bayesian networks
– Types of conditional distributions
– Parameter and structure learning from data
Network analysis
– What properties characterize complex networks
  - Degree distributions
  - Centrality
  - Network motifs
  - Modularity
Network applications
– Interpreting gene sets
– Connecting two gene sets

Bayesian networks (BN)
A special type of probabilistic graphical model
Has two parts:
– a graph which is directed and acyclic
– a set of conditional distributions
Directed Acyclic Graph (DAG)
– The nodes denote random variables X_1 … X_N
– The edges encode statistical dependencies between the random variables and establish parent-child relationships
Each node X_i has a conditional probability distribution (CPD) representing P(X_i | Parents(X_i))
Provides a tractable way to represent large joint distributions

An example Bayesian network
[Figure: the sprinkler network, adapted from Kevin Murphy's "Intro to Graphical Models and Bayes Networks". Cloudy (C) is a parent of both Sprinkler (S) and Rain (R), and S and R are parents of WetGrass (W). Each node has a CPD table: P(C), P(S|C), P(R|C) and P(W|S,R); for example P(C=F) = P(C=T) = 0.5.]
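
A hedged Python sketch of how the sprinkler network factorizes the joint distribution; apart from P(C=T) = 0.5, the CPT values below are made-up placeholders, and the point is the factorization P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R) rather than the particular numbers.

from itertools import product

# Hypothetical CPTs (keys are True/False); all values except P(C) are illustrative only.
P_C = {True: 0.5, False: 0.5}
P_S_given_C = {True: {True: 0.1, False: 0.9}, False: {True: 0.5, False: 0.5}}   # P(S | C)
P_R_given_C = {True: {True: 0.8, False: 0.2}, False: {True: 0.2, False: 0.8}}   # P(R | C)
P_W_given_SR = {(True, True): {True: 0.99, False: 0.01},
                (True, False): {True: 0.9, False: 0.1},
                (False, True): {True: 0.9, False: 0.1},
                (False, False): {True: 0.0, False: 1.0}}                        # P(W | S, R)

def joint(c, s, r, w):
    """P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R) -- the DAG factorization."""
    return P_C[c] * P_S_given_C[c][s] * P_R_given_C[c][r] * P_W_given_SR[(s, r)][w]

# Query by brute-force enumeration: P(W = True)
p_wet = sum(joint(c, s, r, True) for c, s, r in product([True, False], repeat=3))
print("P(WetGrass = True) =", p_wet)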

Bayesian network representation of a transcriptional network
Assume HSP12's expression is dependent upon Hot1 and Sko1 binding to HSP12's promoter
[Figure: the regulators Hot1 and Sko1 (parents) and the target HSP12 (child) are mapped to random variables X_1, X_2 and X_3 that encode expression levels; the corresponding Bayesian network has CPDs P(X_1), P(X_2) and P(X_3 | X_1, X_2), and HSP12 is ON or OFF depending on its regulators]

Bayesian networks compactly represent joint distributions
  P(X_1, …, X_N) = Π_i P(X_i | Parents(X_i))
where each factor P(X_i | Parents(X_i)) is a CPD

Example Bayesian network of 5 variables
[Figure: a DAG over X_1, …, X_5 with CPDs P(X_1), P(X_2), P(X_4), P(X_3 | X_1, X_2) and P(X_5 | X_3, X_4); X_1 and X_2 are the parents of X_3, and X_3 and X_4 are the parents of X_5]
The joint distribution factorizes as P(X_1, …, X_5) = P(X_1) P(X_2) P(X_4) P(X_3 | X_1, X_2) P(X_5 | X_3, X_4)

CPDs in Bayesian networks
Conditional probability distributions (CPDs) are central to Bayesian networks
We have a CPD for each random variable in our graph
CPDs describe the distribution of a child variable given the state of its parents
The same structure can be parameterized in different ways
– For example, for discrete variables we can have table or tree representations

Representing CPDs as tables
Consider the following case with Boolean variables X_1, X_2, X_3, X_4, where X_1, X_2 and X_3 are the parents of X_4
[Table: P(X_4 | X_1, X_2, X_3) represented as a table with one row per configuration (ttt, ttf, tft, tff, ftt, ftf, fft, fff) of the parents X_1, X_2, X_3 and one column each for X_4 = t and X_4 = f; e.g. P(X_4 = t | X_1 = f, X_2 = t, X_3 = f) = 0.5]

A tree representation of a CPD
[Figure: P(X_4 | X_1, X_2, X_3) as a tree that splits on X_1 (f/t) and then on X_2 and X_3, with leaves giving P(X_4 = t) values such as 0.9, 0.5 and 0.8]
A tree allows a more compact representation of CPDs; for example, parent configurations that lead to the same distribution can share a single leaf

The learning problems
Parameter learning with known structure
– Given training data, estimate the parameters of the CPDs
Structure learning
– Given training data, find the statistical dependency structure (graph) and parameters that best describe the data
– Subsumes parameter learning: for every candidate graph, we need to estimate the parameters

Example of estimating a CPD table from data
Consider the four random variables X_1, X_2, X_3, X_4 and assume we observe the following samples of assignments to these variables (columns X_1, X_2, X_3, X_4):
  T F T T
  T T F T
  T T F T
  T F T T
  T F T F
  T F T F
  F F T F
To estimate P(X_4 | X_1, X_2, X_3), we need to consider all configurations of X_1, X_2, X_3 and estimate the probability of X_4 being T or F
For example, consider X_1 = T, X_2 = F, X_3 = T:
  P(X_4 = T | X_1 = T, X_2 = F, X_3 = T) = 2/4
  P(X_4 = F | X_1 = T, X_2 = F, X_3 = T) = 2/4
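
A small Python sketch of this counting procedure, estimating P(X_4 | X_1, X_2, X_3) from the seven samples above; the function name and the optional pseudocount argument are my additions.

from collections import Counter

samples = ["TFTT", "TTFT", "TTFT", "TFTT", "TFTF", "TFTF", "FFTF"]  # columns X1 X2 X3 X4

def estimate_cpt(samples, pseudocount=0.0):
    """Estimate P(X4 | X1, X2, X3) by counting each configuration of the parents."""
    counts = Counter((s[:3], s[3]) for s in samples)
    parent_configs = {s[:3] for s in samples}
    cpt = {}
    for cfg in sorted(parent_configs):
        total = sum(counts[(cfg, v)] for v in "TF") + 2 * pseudocount
        cpt[cfg] = {v: (counts[(cfg, v)] + pseudocount) / total for v in "TF"}
    return cpt

print(estimate_cpt(samples)["TFT"])   # {'T': 0.5, 'F': 0.5}, matching the 2/4 in the example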

Structure learning using score-based search
[Figure: candidate Bayesian network structures are scored against the data; for each candidate graph the maximum likelihood parameters are estimated, and the highest-scoring network is kept]

Scoring a Bayesian network
The score of a Bayesian network (BN) is determined by how well the BN describes the data, which in turn is a function of the data likelihood
Given data D = {x^1, …, x^M}, the score of a BN with graph G is the log likelihood
  Score(G) = Σ_d Σ_i log P(x^d_i | Pa(X_i)^d)
where Pa(X_i) denotes the parents of X_i and Pa(X_i)^d is the assignment to the parents of X_i in the d-th sample

Scoring a Bayesian network
The score of a graph G decomposes over the individual variables: the double sum can be re-arranged so that the outer sum is over variables,
  Score(G) = Σ_i ( Σ_d log P(x^d_i | Pa(X_i)^d) )
This enables us to efficiently compute the score effect of local changes
– that is, changes to the parent set of an individual random variable only affect that variable's term

Heuristic search of Bayesian network structures
Make local operations to the graph structure
– add an edge
– delete an edge
– reverse an edge
Evaluate the score and select the network configuration with the best score
We just need to check for cycles
Working with gene expression data requires additional considerations
– reduce the set of potential parents, statistically or using biological knowledge
– bootstrap-based confidence estimation
– permutation-based assessment of confidence
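
To make the local-move search concrete, here is a hedged Python sketch of greedy hill climbing with add/delete/reverse moves over a decomposable score; score_family is a stand-in for a per-variable (family) score such as the log-likelihood term above, and the acyclicity check is a simple depth-first search.

import itertools

def is_acyclic(parents):
    """parents: dict node -> set of parent nodes. Checks the graph has no directed cycle."""
    state = {}
    def visit(n):
        if state.get(n) == "doing":
            return False
        if state.get(n) == "done":
            return True
        state[n] = "doing"
        ok = all(visit(p) for p in parents[n])
        state[n] = "done"
        return ok
    return all(visit(n) for n in parents)

def greedy_search(nodes, score_family, max_moves=1000):
    """Greedy hill climbing with add/delete/reverse edge moves.
    score_family(node, parent_set) must return the decomposed score of one family."""
    parents = {n: set() for n in nodes}
    total = sum(score_family(n, parents[n]) for n in nodes)
    for _ in range(max_moves):
        best_gain, best_parents = 0.0, None
        for a, b in itertools.permutations(nodes, 2):
            candidates = []
            if a not in parents[b]:
                candidates.append({b: parents[b] | {a}})                       # add a -> b
            else:
                candidates.append({b: parents[b] - {a}})                       # delete a -> b
                candidates.append({b: parents[b] - {a}, a: parents[a] | {b}})  # reverse to b -> a
            for change in candidates:
                new_parents = {**parents, **change}
                if not is_acyclic(new_parents):
                    continue   # only the acyclicity constraint needs checking
                # Decomposability: only the changed families affect the score difference
                gain = sum(score_family(n, new_parents[n]) - score_family(n, parents[n])
                           for n in change)
                if gain > best_gain:
                    best_gain, best_parents = gain, new_parents
            # continue scanning all local moves
        if best_parents is None:
            break              # no move improves the score: local optimum
        parents, total = best_parents, total + best_gain
    return parents, total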

Bayesian network vs Module network Each variable takes three values: UP, DOWN, SAME

Bayesian network vs Module network
Bayesian network
– A different CPD per random variable
– Learning only requires searching for the parents of each variable
Module network
– One CPD per module: the same CPD for all random variables in the same module
– Learning requires both parent search and module membership assignment

Some comments about expression-based network inference methods
We have seen two types of algorithms to learn these networks
– Per-gene methods
  - Sparse candidate: learn regulators for individual genes
  - GENIE3
– Per-module methods
  - Module networks: learn regulators for sets of genes/modules
  - Other implementations of module networks exist
    LIRNET: Learning a Prior on Regulatory Potential from eQTL Data (Su-In Lee et al., PLoS Genetics 2009, journal.pgen)
    LeMoNe: Learning Module Networks (Michoel et al. 2007, 2105/8/S2/S5)

Phylogenetic trees
Phylogenetic tree construction
– Distance-based methods
  - UPGMA
  - Neighbor joining
– Parsimony methods
  - Weighted parsimony to find the minimal-cost tree
– Probabilistic methods
  - Felsenstein's algorithm to compute the probability of the observations at the leaf nodes
Scoring a given tree versus searching the space of trees