Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven

Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven March 2002

Announcements
– HW due on Monday
– reading for next week
  – Chapter 3 of Durbin et al. (already assigned)
  – Krogh, An Introduction to HMMs for Biological Sequences
– is everyone’s account set up?
– did everyone get the Lawrence et al. paper?

Sequence Motifs
– what is a sequence motif?
  – a sequence pattern of biological significance
– examples
  – protein binding sites in DNA
  – protein sequences corresponding to common functions or conserved pieces of structure

Motifs and Profile Matrices
– given a set of aligned sequences, it is straightforward to construct a profile matrix characterizing a motif of interest
– [figure: aligned sequences with a shared motif, and the resulting profile matrix with rows A, C, G, T over the motif’s sequence positions; numeric values not preserved]

Motifs and Profile Matrices
– how can we construct the profile if the sequences aren’t aligned?
  – in the typical case we don’t know what the motif looks like
– use an Expectation Maximization (EM) algorithm

The EM Approach
– EM is a family of algorithms for learning probabilistic models in problems that involve hidden state
– in our problem, the hidden state is where the motif starts in each training sequence

The MEME Algorithm [Bailey & Elkan, 1993]
– uses the EM algorithm to find multiple motifs in a set of sequences
– the first EM approach to motif discovery: Lawrence & Reilly, 1990

Representing Motifs
– a motif is assumed to have a fixed width, W
– a motif is represented by a matrix of probabilities: p_{c,k} represents the probability of character c in column k
– example: a DNA motif of width W [profile matrix with rows A, C, G, T; the slide’s W and numeric values were not preserved]
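To make this concrete, here is an illustrative matrix in the p_{c,k} notation; the numeric values are invented for illustration (the original slide’s numbers were not preserved), and column 0 holds the background probabilities introduced on the next slide.

```latex
% illustrative only: a width-3 DNA motif; column 0 is the background
\[
p \;=\;
\begin{array}{c|cccc}
    & k=0\ (\text{bg}) & k=1 & k=2 & k=3 \\ \hline
  A & 0.25 & 0.1 & 0.5 & 0.2 \\
  C & 0.25 & 0.4 & 0.2 & 0.1 \\
  G & 0.25 & 0.3 & 0.1 & 0.6 \\
  T & 0.25 & 0.2 & 0.2 & 0.1
\end{array}
\]
```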

Representing Motifs
– we will also represent the “background” (i.e. outside the motif) probability of each character
– p_{c,0} represents the probability of character c in the background
– example: A 0.26, C 0.24, G 0.23, T 0.27

Basic EM Approach
– the element Z_{ij} of the matrix Z represents the probability that the motif starts in position j in sequence i
– example: given 4 DNA sequences of length 6 [the slide’s W and the example Z matrix were not preserved]

Basic EM Approach
given: length parameter W, training set of sequences
set initial values for p
do
  re-estimate Z from p (E-step)
  re-estimate p from Z (M-step)
until change in p < ε
return: p, Z
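A minimal Python sketch of this loop for the one-occurrence-per-sequence (OOPS) setting; the function and variable names, the uniform random initialization, and the pseudocount of 1 are my own choices, not from the slides.

```python
import numpy as np

def em_motif_search(seqs, W, max_iters=100, eps=1e-4, alphabet="ACGT", seed=0):
    """Sketch of the basic (OOPS) EM loop: one motif occurrence per sequence.

    p has W + 1 columns: column 0 is the background, columns 1..W the motif.
    Z[i][j] is the probability that the motif starts at position j of sequence i.
    """
    rng = np.random.default_rng(seed)
    idx = {c: i for i, c in enumerate(alphabet)}
    p = rng.random((len(alphabet), W + 1))          # set initial values for p
    p /= p.sum(axis=0, keepdims=True)

    Z = []
    for _ in range(max_iters):
        # E-step: re-estimate Z from p
        Z = []
        for s in seqs:
            likes = np.empty(len(s) - W + 1)
            for j in range(len(likes)):
                like = 1.0
                for k, c in enumerate(s):
                    col = k - j + 1 if j <= k < j + W else 0   # motif column or background
                    like *= p[idx[c], col]
                likes[j] = like
            Z.append(likes / likes.sum())           # uniform prior over start positions

        # M-step: re-estimate p from Z (expected counts plus pseudocounts of 1)
        counts = np.ones_like(p)
        for s, z in zip(seqs, Z):
            for j, z_ij in enumerate(z):
                for k, c in enumerate(s):
                    col = k - j + 1 if j <= k < j + W else 0
                    counts[idx[c], col] += z_ij
        p_new = counts / counts.sum(axis=0, keepdims=True)

        converged = np.abs(p_new - p).max() < eps   # until change in p < eps
        p = p_new
        if converged:
            break
    return p, Z
```

Usage, with made-up sequences: p, Z = em_motif_search(["GCTGTAG", "ACGTTAC"], W=3) returns the learned motif matrix and the per-sequence distributions over start positions.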

Basic EM Approach
– we’ll need to calculate the probability of a training sequence given a hypothesized starting position
  – X_i is the i th sequence
  – Z_{ij} is 1 if the motif starts at position j in sequence i
  – c_k is the character at position k in sequence i
– the calculation factors into three parts: before the motif, the motif, and after the motif
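The equation dropped from this slide can be reconstructed from these definitions as a product of three terms, matching the “before motif / motif / after motif” labels (L denotes the length of sequence X_i):

```latex
\[
P(X_i \mid Z_{ij}=1,\, p) \;=\;
\underbrace{\prod_{k=1}^{j-1} p_{c_k,\,0}}_{\text{before motif}}
\;\underbrace{\prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}}_{\text{motif}}
\;\underbrace{\prod_{k=j+W}^{L} p_{c_k,\,0}}_{\text{after motif}}
\]
```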

Example
[worked example: a profile matrix with rows A, C, G, T applied to the sequence G C T G T A G; numeric values not preserved]

The E-step: Estimating Z
– to estimate the starting positions in Z at step t, apply Bayes’ rule to the sequence probabilities computed from the current model p^(t)

The E-step: Estimating Z assume that it is equally likely that the motif will start in any position
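With a uniform prior over start positions, Bayes’ rule from the previous slide reduces to normalizing the sequence likelihoods over all L − W + 1 candidate starts; this is the update the dropped equation expressed:

```latex
\[
Z_{ij}^{(t)} \;=\;
\frac{P(X_i \mid Z_{ij}=1,\, p^{(t)})}
     {\sum_{k=1}^{L-W+1} P(X_i \mid Z_{ik}=1,\, p^{(t)})}
\]
```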

Example: Estimating Z
[worked example continuing with the sequence G C T G T A G: compute the likelihood for each candidate start, then normalize so that the Z_{ij} for the sequence sum to 1; numeric values not preserved]

The M-step: Estimating p
– recall that p_{c,k} represents the probability of character c in position k; values for position 0 represent the background
– the update combines expected counts (weighted by Z) with pseudo-counts; for the background column, the count is based on the total # of c’s in the data set
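Reconstructed from the labels on this slide (“pseudo-counts”, “total # of c’s in data set”), the standard smoothed update is sketched below; d_{c,k} denotes the pseudo-counts and n_{c,k} the expected character counts, symbols I am introducing here.

```latex
\[
p_{c,k}^{(t+1)} \;=\; \frac{n_{c,k} + d_{c,k}}{\sum_{b}\,(n_{b,k} + d_{b,k})}
\qquad
n_{c,k} \;=\;
\begin{cases}
  \displaystyle\sum_i \;\sum_{\{j \,:\; c_{i,\,j+k-1} \,=\, c\}} Z_{ij}^{(t)} & k = 1,\dots,W \\[2ex]
  \big(\text{total \# of $c$'s in data set}\big) \;-\; \displaystyle\sum_{k'=1}^{W} n_{c,k'} & k = 0
\end{cases}
\]
```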

Example: Estimating p
[worked example over the short DNA sequences A G G C A G, A C A G C A, T C A G T C; the Z values and resulting p estimates were not preserved]

The EM Algorithm
– EM converges to a local maximum in the likelihood of the data given the model
– usually converges in a small number of iterations
– sensitive to the initial starting point (i.e. the values in p)

MEME Enhancements to the Basic EM Approach
MEME builds on the basic EM approach in the following ways:
– trying many starting points
– not assuming that there is exactly one motif occurrence in every sequence
– allowing multiple motifs to be learned
– incorporating Dirichlet prior distributions

Starting Points in MEME
– for every distinct subsequence of length W in the training set
  – derive an initial p matrix from this subsequence
  – run EM for 1 iteration
– choose the motif model (i.e. p matrix) with the highest likelihood
– run EM to convergence

Using Subsequences as Starting Points for EM
– set values corresponding to letters in the subsequence to X
– set other values to (1 − X)/(M − 1), where M is the length of the alphabet
– example: for the subsequence TAT [the slide’s X value and the resulting matrix, with rows A, C, G, T, were not preserved]
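A small illustration of this initialization, assuming X = 0.5 (an assumed value; the slide’s own X was not preserved) and the DNA alphabet, so M = 4:

```python
def init_from_subsequence(sub, x=0.5, alphabet="ACGT"):
    """Build an initial motif matrix from a seed subsequence.

    Column k gets probability x for the letter sub[k] and (1 - x)/(M - 1)
    for every other letter, as described on the slide. x = 0.5 is assumed.
    """
    m = len(alphabet)
    return [[x if c == sub[k] else (1 - x) / (m - 1) for k in range(len(sub))]
            for c in alphabet]   # rows in order A, C, G, T; columns 1..W of the motif

# e.g. init_from_subsequence("TAT") gives, per column (rows A, C, G, T):
#   column 1 (T): 1/6, 1/6, 1/6, 1/2
#   column 2 (A): 1/2, 1/6, 1/6, 1/6
#   column 3 (T): 1/6, 1/6, 1/6, 1/2
```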

The ZOOPS Model
– the approach as we’ve outlined it assumes that each sequence has exactly one motif occurrence; this is the OOPS model
– the ZOOPS model assumes zero or one occurrences per sequence

E-step in the ZOOPS Model
– we need to consider another alternative: the i th sequence doesn’t contain the motif
– we add another parameter λ (and its relative γ)
  – λ: the prior probability that any position in a sequence is the start of a motif
  – γ: the prior probability of a sequence containing a motif

E-step in the ZOOPS Model
– here, a random variable Q_i takes on the value 0 to indicate that sequence i doesn’t contain a motif occurrence

M-step in the ZOOPS Model
– update p the same as before
– update λ as follows: set it to the average of Z_{ij} across all sequences and positions
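A hedged reconstruction of the dropped ZOOPS updates, following the standard ZOOPS formulation rather than the lost slide content (n is the number of sequences and L their length, symbols introduced here): the E-step weighs each start position against the no-occurrence alternative, and the M-step sets λ to the average of Z over all sequences and positions, as stated above.

```latex
\[
Z_{ij}^{(t)} \;=\;
\frac{P(X_i \mid Z_{ij}=1,\, p^{(t)})\,\lambda^{(t)}}
     {P(X_i \mid Q_i=0,\, p^{(t)})\,(1-\gamma^{(t)})
      \;+\; \sum_{k=1}^{L-W+1} P(X_i \mid Z_{ik}=1,\, p^{(t)})\,\lambda^{(t)}}
\]
\[
\lambda^{(t+1)} \;=\; \frac{1}{n\,(L-W+1)} \sum_{i=1}^{n} \sum_{j=1}^{L-W+1} Z_{ij}^{(t)},
\qquad
\gamma^{(t+1)} \;=\; (L-W+1)\,\lambda^{(t+1)}
\]
```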

The TCM Model the TCM (two-component mixture model) assumes zero or more motif occurrences per sequence

Likelihood in the TCM Model
– the TCM model treats each length-W subsequence independently
– to determine the likelihood of such a subsequence, we compute it under two cases: assuming a motif starts there, and assuming a motif doesn’t start there
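Writing X_{ij} for the length-W subsequence of sequence i starting at position j (a symbol introduced here), the two expressions labeled on the slide can be written as:

```latex
\[
P(X_{ij} \mid Z_{ij}=1,\, p) \;=\; \prod_{k=j}^{j+W-1} p_{c_k,\,k-j+1}
\qquad \text{(a motif starts there)}
\]
\[
P(X_{ij} \mid Z_{ij}=0,\, p) \;=\; \prod_{k=j}^{j+W-1} p_{c_k,\,0}
\qquad \text{(a motif doesn't start there)}
\]
```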

E-step in the TCM Model
– the E-step weighs the two cases for each subsequence: the subsequence is a motif occurrence vs. the subsequence isn’t a motif occurrence
– the M-step is the same as before
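The E-step is then the usual two-component mixture posterior for each window, matching the “is a motif / isn’t a motif” labels above (λ is the motif-component prior from the ZOOPS slides); this form is a reconstruction consistent with those definitions:

```latex
\[
Z_{ij}^{(t)} \;=\;
\frac{P(X_{ij} \mid Z_{ij}=1,\, p^{(t)})\,\lambda^{(t)}}
     {P(X_{ij} \mid Z_{ij}=1,\, p^{(t)})\,\lambda^{(t)}
      \;+\; P(X_{ij} \mid Z_{ij}=0,\, p^{(t)})\,(1-\lambda^{(t)})}
\]
```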

Finding Multiple Motifs
– basic idea: discount the likelihood that a new motif starts in a given position if this motif would overlap with a previously learned one
– when re-estimating Z_{ij}, multiply it by a discount factor that is estimated using the Z values from previous passes of motif finding

Gibbs Sampling
– a general procedure for sampling from the joint distribution of a set of random variables by iteratively sampling each variable in turn from its conditional distribution given the current values of all the others
– application to motif finding: Lawrence et al.
– can view it as a stochastic analog of EM for this task
– less susceptible to local maxima than EM

Gibbs Sampling Approach
– in the EM approach we maintained a distribution over the possible motif starting points for each sequence
– in the Gibbs sampling approach, we’ll maintain a specific starting point for each sequence, but we’ll keep resampling these

Gibbs Sampling Approach
given: length parameter W, training set of sequences
choose random positions for a
do
  pick a sequence X_i
  estimate p given the current motif positions a, using all sequences but X_i (update step)
  sample a new motif position a_i for X_i (sampling step)
until convergence
return: p, a
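A minimal Python sketch of this procedure; the pseudocount of 1, the fixed iteration count in place of a convergence test, and the function and variable names are my own choices, not from the slides. The sampling weights it uses are the ones described on the next slide.

```python
import random

def gibbs_motif_search(seqs, W, n_iters=1000, alphabet="ACGT", seed=0):
    """Basic Gibbs sampler for motif finding (one occurrence per sequence).

    Maintains one motif start position a[i] per sequence and repeatedly
    resamples it against the current motif model p.
    """
    rng = random.Random(seed)
    idx = {c: k for k, c in enumerate(alphabet)}
    a = [rng.randrange(len(s) - W + 1) for s in seqs]   # choose random positions for a

    for _ in range(n_iters):
        i = rng.randrange(len(seqs))                    # pick a sequence
        held_out = seqs[i]

        # update step: estimate p from all sequences except i (pseudocount 1)
        counts = [[1.0] * (W + 1) for _ in alphabet]    # column 0 = background
        for m, s in enumerate(seqs):
            if m == i:
                continue
            for k, c in enumerate(s):
                col = k - a[m] + 1 if a[m] <= k < a[m] + W else 0
                counts[idx[c]][col] += 1
        totals = [sum(counts[r][col] for r in range(len(alphabet)))
                  for col in range(W + 1)]
        p = [[counts[r][col] / totals[col] for col in range(W + 1)]
             for r in range(len(alphabet))]

        # sampling step: weight each start by motif likelihood / background likelihood
        weights = []
        for j in range(len(held_out) - W + 1):
            w = 1.0
            for k in range(W):
                c = idx[held_out[j + k]]
                w *= p[c][k + 1] / p[c][0]
            weights.append(w)
        a[i] = rng.choices(range(len(weights)), weights=weights)[0]

    return p, a   # p here is the model from the final update step
```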

Sampling New Motif Positions
– for each possible starting position j in the held-out sequence, compute a weight
– randomly select a new starting position according to these weights
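The weight for start position j is the motif-versus-background likelihood ratio of the corresponding length-W window; the symbol A_j and this exact form are a reconstruction (it is also what the code sketch above computes):

```latex
\[
A_j \;=\; \prod_{k=j}^{\,j+W-1} \frac{p_{c_k,\,k-j+1}}{p_{c_k,\,0}}
\]
```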