Regulatory motif discovery
6.095/6.895 - Computational Biology: Genomes, Networks, Evolution
Lecture 10, Oct 12, 2005

The course so far…

   Fundamental bio problem        Fundamental comp. tool     Lect.
1  Introduction                   –                          1
2  Sequence alignment             Dynamic programming        2,3
3  Database search                Hashing                    4,5
4  Modeling biological signals    HMMs                       6,7
5  Microarray analysis            Clustering                 8,9
6  Regulatory motifs              Gibbs sampling / EM        10
7  Biological networks            Graph algorithms           11
8  Phylogenetic trees             Evolutionary modeling      12

Challenges in Computational Biology
(Course-overview figure: genome assembly, gene finding, sequence alignment, database lookup, comparative genomics, evolutionary theory, gene expression analysis, cluster discovery, regulatory motif discovery and Gibbs sampling, protein network analysis, regulatory network inference, and emerging network properties; the example sequences TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT illustrate motif discovery.)

Challenges in Computational Biology
(Figure: from DNA, to a group of co-regulated genes, to a common subsequence: regulatory motif discovery.)
Last time: discover groups of co-regulated genes. Today: find motifs within them.

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Regulatory motif discovery
(Figure: the GAL1 upstream region ATGACTAAATCTCATTCAGAAGAAGTGA…, with Gal4 binding sites (CGGCCG) and a Mig1 site marked.)
Regulatory motifs:
– Genes are turned on/off in response to changing environments
– No direct addressing: subroutines (genes) contain sequence tags (motifs)
– Specialized proteins (transcription factors) recognize these tags
What makes motif discovery hard?
– Motifs are short (6–8 bp), sometimes degenerate
– Can contain any set of nucleotides (no ATG or other rules)
– Act at variable distances upstream (or downstream) of the target gene

PS1: Evaluating genome-wide motif conservation
Genome-wide conservation reveals real motifs:
– Count conserved instances → not informative
– Count total instances → not informative
– Evaluate the ratio of conserved to total instances → real motifs!
Motif enumeration:
– Perfectly conserved motif instances
– Each motif is a fully-specified 6-mer
Algorithmic speed-up:
– Do not search the entire genome for every motif
– Scan the genome once, filling in a table of motif instances
– Content-based indexing
What's missing?
(Example figure: a genome segment and a table of the 6-mers AGTGAA, AGTGAC, AGTGAG, AGTGAT, AGTGCA, AGTGCC, AGTGCG, AGTGCT with total, conserved, and ratio columns; counts omitted.)
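The content-based indexing speed-up can be sketched in a few lines of Python (illustrative, not the PS1 code: one pass over the genome fills a table from each 6-mer to its positions, replacing repeated genome scans):

from collections import defaultdict

def index_kmers(genome, k=6):
    # One scan: map every k-mer to the list of positions where it occurs.
    table = defaultdict(list)
    for i in range(len(genome) - k + 1):
        table[genome[i:i + k]].append(i)
    return table

# e.g. len(index_kmers(genome)["AGTGCA"]) gives the total-instance count
# for the motif AGTGCA without rescanning the genome.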

Problem Definition
Given a collection of promoter sequences s_1, …, s_N of genes with common expression.
Combinatorial formulation:
– Motif M: m_1 … m_W, where some of the m_i may be blank
– Find M that occurs in all s_i with ≤ k differences, or find M with the smallest total Hamming distance
Probabilistic formulation:
– Motif: M_ij, 1 ≤ i ≤ W, 1 ≤ j ≤ 4, where M_ij = Prob[letter j at position i]
– Find the best M, and positions p_1, …, p_N in the sequences

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Discrete Formulations
Given sequences S = {x_1, …, x_n}, a motif W is a consensus string w_1 … w_K. Find the motif W* with the "best" match to x_1, …, x_n.
Definition of "best":
– d(W, x_i) = minimum Hamming distance between W and any K-long word in x_i
– d(W, S) = Σ_i d(W, x_i)
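These definitions translate directly into code. Below is a minimal Python sketch, not from the lecture (names like hamming, d_word, and d_total are illustrative):

def hamming(a, b):
    # Number of mismatching positions between two equal-length words.
    return sum(1 for x, y in zip(a, b) if x != y)

def d_word(W, x):
    # d(W, x): minimum Hamming distance between W and any K-long word in x.
    K = len(W)
    return min(hamming(W, x[j:j + K]) for j in range(len(x) - K + 1))

def d_total(W, S):
    # d(W, S): total distance of consensus W to the sequence set S.
    return sum(d_word(W, x) for x in S)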

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities), find d(W, S).
Report W* = argmin_W d(W, S).
Running time: O(K N 4^K), where N = Σ_i |x_i|.
Advantage: finds the provably "best" motif W.
Disadvantage: time.
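As a sketch (illustrative, reusing d_total from the sketch above), the pattern-driven search enumerates all 4^K consensus words and is feasible only for small K:

from itertools import product

def pattern_driven(S, K):
    # Try all 4^K candidate consensus words; keep the minimizer of d(W, S).
    best_W, best_d = None, float('inf')
    for letters in product("ACGT", repeat=K):
        W = "".join(letters)
        dist = d_total(W, S)
        if dist < best_d:
            best_W, best_d = W, dist
    return best_W, best_d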

Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some x_i, find d(W, S).
Report W* = argmin_W d(W, S), or report a local improvement of W*.
Running time: O(K N²).
Advantage: time.
Disadvantage: if the true motif is weak and does not occur exactly in the data, a random word may score better than any instance of the true motif.
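A sketch of the sample-driven variant (again reusing d_total from above); the only change is that candidates are restricted to words occurring in the input:

def sample_driven(S, K):
    # Candidate consensus words are the K-long words of the input itself.
    best_W, best_d = None, float('inf')
    for x in S:
        for j in range(len(x) - K + 1):
            W = x[j:j + K]
            dist = d_total(W, S)
            if dist < best_d:
                best_W, best_d = W, dist
    return best_W, best_d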

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Greedy motif clustering (CONSENSUS)
Algorithm, cycle 1:
For each word W in S (of fixed length!):
  For each word W' in S:
    Create a gap-free alignment of W, W'
Keep the C_1 best alignments, A_1, …, A_{C_1}
Example: ACGGTTG, CGAACTT, GGGCTCT … / ACGCCTG, AGAACTA, GGGGTGT …

Greedy motif clustering (CONSENSUS)
Algorithm, cycle t:
For each word W in S:
  For each alignment A_j from cycle t−1:
    Create a gap-free alignment of W, A_j
Keep the C_t best alignments, A_1, …, A_{C_t}
Example: ACGGTTG, CGAACTT, GGGCTCT … / ACGCCTG, AGAACTA, GGGGTGT … / … / ACGGCTC, AGATCTT, GGCGTCT …

Greedy motif clustering (CONSENSUS)
C_1, …, C_n are user-defined heuristic constants:
– N is the sum of the sequence lengths
– n is the number of sequences
Running time: O(N²) + O(N C_1) + O(N C_2) + … + O(N C_n) = O(N² + N C_total), where C_total = Σ_i C_i, typically O(nC) for some large constant C.
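A much-simplified Python sketch of this beam search: plain column agreement stands in for CONSENSUS's information-content score, a single beam width C stands in for the per-cycle constants C_1, …, C_n, and all names are illustrative:

from collections import Counter

def kmers(x, K):
    return [x[j:j + K] for j in range(len(x) - K + 1)]

def agreement(alignment):
    # For each column, count the most frequent letter; sum over columns.
    return sum(Counter(col).most_common(1)[0][1] for col in zip(*alignment))

def consensus_search(S, K, C=50):
    # Cycle 1: gap-free pairings of words from the first two sequences.
    beam = [(w, w2) for w in kmers(S[0], K) for w2 in kmers(S[1], K)]
    beam = sorted(beam, key=agreement, reverse=True)[:C]
    # Cycle t: extend each kept alignment with one word from sequence t.
    for x in S[2:]:
        beam = [a + (w,) for a in beam for w in kmers(x, K)]
        beam = sorted(beam, key=agreement, reverse=True)[:C]
    return beam[0]  # best alignment found; its consensus is the motif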

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Motif Refinement and wordlets (MULTIPROFILER)
An extended sample-driven approach. Given a K-long word W, define:
N_α(W) = { words W' in S such that d(W, W') ≤ α }
Idea: assume W is an occurrence of the true motif W*, and use N_α(W) to correct "errors" in W.

Motif Refinement and wordlets (MULTIPROFILER)
Assume W differs from the true motif W* in at most L positions. Define: a wordlet G of W is an L-long pattern with blanks, differing from W.
– L is smaller than the word length K
Example: K = 7, L = 3; W = ACGTTGA, G = --A--CG

Motif Refinement and wordlets (MULTIPROFILER)
Algorithm:
For each W in S:
  For L = 1 to L_max:
    1. Find the α-neighbors of W in S → N_α(W)
    2. Find all "strong" L-long wordlets G in N_α(W)
    3. For each wordlet G:
       – Modify W by the wordlet G → W'
       – Compute d(W', S)
Report W* = argmin d(W', S)
Step 1 above is a smaller motif-finding problem; use exhaustive search.

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling

Where do ambiguous bases come from?
Protein–DNA interactions:
– Proteins read DNA by "feeling" the chemical properties of the bases
– Without opening the double helix (i.e., not by base complementarity)
Sequence specificity:
– The topology of the 3D contact dictates the sequence specificity of binding
– Some positions are fully constrained; other positions are degenerate
– "Ambiguous / degenerate" positions are only loosely contacted by the transcription factor

Representing motif ambiguities
Entropy at position i: H(i) = − Σ_{letter x} freq(x, i) log₂ freq(x, i)
Height of letter x at position i: L(x, i) = freq(x, i) · (2 − H(i))
Examples:
– freq(A, i) = 1 ⇒ H(i) = 0, L(A, i) = 2
– A: ½, C: ¼, G: ¼ ⇒ H(i) = 1.5, L(A, i) = ¼, L(not T, i) = ¼
entropy, n.
1. (communication theory) A numerical measure of the uncertainty of an outcome; "the signal contained thousands of bits of information" [syn: information, selective information]
2. (thermodynamics) A thermodynamic quantity representing the amount of energy in a system that is no longer available for doing mechanical work; "entropy increases as matter and energy in the universe degrade to an ultimate state of inert uniformity" [syn: randomness]
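The H(i) and L(x, i) formulas above are easy to compute directly; a small Python sketch (illustrative names, one column of letter frequencies at a time):

from math import log2

def column_heights(freqs):
    # freqs: per-column letter frequencies, e.g. {'A': 0.5, 'C': 0.25, ...}.
    H = -sum(f * log2(f) for f in freqs.values() if f > 0)  # entropy H(i)
    return {x: f * (2 - H) for x, f in freqs.items()}       # heights L(x, i)

# Slide example: A = 1/2, C = 1/4, G = 1/4 gives H = 1.5 and L(A) = 0.25.
print(column_heights({'A': 0.5, 'C': 0.25, 'G': 0.25, 'T': 0.0}))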

Motifs and Profile Matrices
(Figure: a set of aligned sequences sharing a motif, and the corresponding profile matrix with rows A, C, G, T and columns for the sequence positions; values omitted.)
Given a set of aligned sequences, it is straightforward to construct a profile matrix characterizing a motif of interest. Conversely, given a profile matrix characterizing the motif, it is straightforward to find the set of aligned sequences matching it.

Motifs and Profile Matrices
How can we construct the profile if the sequences aren't aligned?
– In the typical case we don't know what the motif looks like or where the starting positions are
→ Use expectation maximization

The EM Approach
EM is a family of algorithms for learning probabilistic models in problems that involve hidden state. In our problem, the hidden state is where the motif starts in each training sequence.

Representing Motifs
– A motif is assumed to have a fixed width, W
– A motif is represented by a matrix of probabilities: p_{c,k} represents the probability of character c in column k
(Example: a width-W DNA motif probability matrix; values omitted.)

Representing Motifs
We also represent the "background" (i.e., outside the motif) probability of each character: p_{c,0} represents the probability of character c in the background.
Example: A 0.26, C 0.24, G 0.23, T 0.27

Basic EM Approach
The element Z_{ij} of the matrix Z represents the probability that the motif starts in position j in sequence i.
(Example: given 4 DNA sequences of length 6, Z is shown as a matrix with one row per sequence and one column per candidate start; values and the width W omitted.)

Basic EM Approach
given: length parameter W, training set of sequences
set initial values for p
do
  re-estimate Z from p (E-step)
  re-estimate p from Z (M-step)
until change in p < ε
return: p, Z
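To make the loop concrete, here is a compact OOPS-style EM sketch in Python. It is illustrative rather than MEME itself: it assumes all sequences have the same length, uses a uniform prior over start positions, and implements the E- and M-step formulas spelled out on the following slides.

import random

ALPHABET = "ACGT"

def seq_prob(x, j, p, W):
    # P(X_i | Z_ij = 1, p): motif columns 1..W inside the window starting
    # at j, background column 0 everywhere else.
    prob = 1.0
    for k, c in enumerate(x):
        col = k - j + 1 if j <= k < j + W else 0
        prob *= p[col][c]
    return prob

def em(S, W, iters=100, pseudo=1.0):
    L = len(S[0])
    # p[0] is the background; p[1..W] are randomly initialized motif columns.
    p = [{c: 0.25 for c in ALPHABET}]
    for _ in range(W):
        r = [random.random() for _ in ALPHABET]
        p.append({c: v / sum(r) for c, v in zip(ALPHABET, r)})
    for _ in range(iters):
        # E-step: Z[i][j] proportional to P(X_i | Z_ij = 1, p).
        Z = []
        for x in S:
            row = [seq_prob(x, j, p, W) for j in range(L - W + 1)]
            total = sum(row)
            Z.append([z / total for z in row])
        # M-step: expected letter counts per motif column, plus pseudocounts.
        counts = [{c: pseudo for c in ALPHABET} for _ in range(W + 1)]
        for x, row in zip(S, Z):
            for j, z in enumerate(row):
                for k in range(W):
                    counts[k + 1][x[j + k]] += z
        # Background counts: total letter counts minus expected motif counts.
        for c in ALPHABET:
            motif_c = sum(counts[k][c] - pseudo for k in range(1, W + 1))
            counts[0][c] = pseudo + sum(x.count(c) for x in S) - motif_c
        for col in range(W + 1):
            t = sum(counts[col].values())
            p[col] = {c: v / t for c, v in counts[col].items()}
    return p, Z

For simplicity this runs a fixed number of iterations instead of testing the change in p against ε.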

Basic EM Approach
We'll need to calculate the probability of a training sequence given a hypothesized starting position:

P(X_i | Z_ij = 1, p) = ∏_{k=1}^{j−1} p_{c_k,0} · ∏_{k=j}^{j+W−1} p_{c_k,k−j+1} · ∏_{k=j+W}^{L} p_{c_k,0}

where X_i is the ith sequence, Z_ij is 1 if the motif starts at position j in sequence i, and c_k is the character at position k in sequence i. The three products run over the positions before the motif, within the motif, and after the motif.

Example
(Worked example: a probability matrix p over A, C, G, T applied to the sequence G C T G T A G; numeric values omitted.)

The E-step: Estimating Z
To estimate the starting positions in Z at step t:

Z_ij^(t) = P(X_i | Z_ij = 1, p^(t)) P(Z_ij = 1) / Σ_{k=1}^{L−W+1} P(X_i | Z_ik = 1, p^(t)) P(Z_ik = 1)

This comes from Bayes' rule applied to P(Z_ij = 1 | X_i, p^(t)).

The E-step: Estimating Z
Assume that it is equally likely that the motif will start in any position: P(Z_ij = 1) = 1/(L − W + 1) for all j, so the priors cancel and Z_ij^(t) ∝ P(X_i | Z_ij = 1, p^(t)).

E-step: Estimating Z  Example then normalize so that A C G T G C T G T A G

The M-step: Estimating the motif p

p_{c,k} = (n_{c,k} + d_{c,k}) / Σ_b (n_{b,k} + d_{b,k})

where n_{c,k} is the expected number of times character c occurs at motif position k (for k > 0, summed with the weights Z; for k = 0, the total number of c's in the data set minus the expected motif counts), and the d_{c,k} are pseudocounts. Recall that p_{c,k} represents the probability of character c in position k; values for position 0 represent the background.

M-step example: Estimating p
(Worked example: sequences A G G C A G, A C A G C A, T C A G T C with their Z values; the resulting p values are omitted.)

The EM Algorithm
EM converges to a local maximum in the likelihood of the data given the model, Σ_i log P(X_i | p):
– usually converges in a small number of iterations
– sensitive to the initial starting point (i.e., the values in p)

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – MEME extensions
  – Gibbs sampling

MEME Enhancements to the Basic EM Approach
MEME builds on the basic EM approach in the following ways:
– trying many starting points
– not assuming that there is exactly one motif occurrence in every sequence
– allowing multiple motifs to be learned
– incorporating Dirichlet prior distributions

Starting Points in MEME
For every distinct subsequence of length W in the training set:
– derive an initial p matrix from this subsequence
– run EM for 1 iteration
Choose the motif model (i.e., the p matrix) with the highest likelihood, and run EM to convergence.

Using Subsequences as Starting Points for EM
– Set the values corresponding to the letters of the subsequence to X
– Set the other values to (1 − X)/(M − 1), where M is the size of the alphabet
(Example: the initial matrix for the subsequence TAT; the value of X and the matrix entries are omitted.)
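A sketch of this initialization; the slide's value of X did not survive the transcript, so X = 0.5 below is an assumed illustrative choice:

def init_from_seed(seed, X=0.5, alphabet="ACGT"):
    # The seed letter gets probability X; the other M - 1 letters share 1 - X.
    M = len(alphabet)
    return [{c: X if c == s else (1 - X) / (M - 1) for c in alphabet}
            for s in seed]

# e.g. init_from_seed("TAT"): T gets 0.5 in columns 1 and 3, others get 1/6.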

The ZOOPS Model (zero or one occurrence per sequence)
The approach as we've outlined it assumes that each sequence has exactly one motif occurrence; this is the OOPS model. The ZOOPS model assumes zero or one occurrences per sequence.

E-step in the ZOOPS Model
We need to consider another alternative: the ith sequence doesn't contain the motif. We add another parameter λ (and its relative γ):
λ = prior probability that any given position in a sequence is the start of a motif
γ = (L − W + 1) λ = prior probability that a sequence contains a motif

E-step in the ZOOPS Model
Here Q_i is a random variable that takes the value 0 to indicate that the ith sequence doesn't contain a motif occurrence; the E-step weighs each motif hypothesis Z_ij = 1 (with prior λ) against the no-motif hypothesis Q_i = 0 (with prior 1 − γ).

M-step in the ZOOPS Model
Update p the same as before. Update λ (and hence γ) as the average of Z_ij^(t) across all sequences and positions:

λ^(t+1) = (1 / (n (L − W + 1))) Σ_{i=1}^{n} Σ_{j=1}^{L−W+1} Z_ij^(t)

The TCM Model (two-component mixture model)
The TCM model assumes zero or more motif occurrences per sequence.

Likelihood in the TCM Model
The TCM model treats each length-W subsequence independently. To determine the likelihood of such a subsequence, starting at position j of sequence i:
– assuming a motif starts there: the product of the motif-column probabilities p_{c,1}, …, p_{c,W} over its letters
– assuming a motif doesn't start there: the product of the background probabilities p_{c,0} over its letters

E-step in the TCM Model

Z_ij^(t) = λ^(t) P(X_ij | motif) / ( λ^(t) P(X_ij | motif) + (1 − λ^(t)) P(X_ij | background) )

where X_ij is the length-W subsequence starting at position j of sequence i: the numerator assumes the subsequence is a motif, and the second term in the denominator assumes it isn't. The M-step is the same as before.

Finding Multiple Motifs
Basic idea: discount the likelihood that a new motif starts in a given position if this motif would overlap with a previously learned one. When re-estimating Z_ij, multiply by the probability that no previously found motif occurrence overlaps a width-W window starting at j; this quantity is estimated using the Z values from previous passes of motif finding.

Overview
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – MEME extensions
  – Gibbs sampling

Gibbs Sampling
A general procedure for sampling from the joint distribution of a set of random variables P(U_1, …, U_n) by iteratively sampling from P(U_j | U_1, …, U_{j−1}, U_{j+1}, …, U_n) for each j.
Application to motif finding: Lawrence et al. 1993.
– Can be viewed as a stochastic analog of EM for this task
– Less susceptible to local maxima than EM

Gibbs Sampling Approach
In the EM approach we maintained a distribution over the possible motif starting points for each sequence. In the Gibbs sampling approach, we maintain a specific starting point for each sequence, but keep resampling these.

Gibbs Sampling Approach
given: length parameter W, training set of sequences
choose random motif positions a
do
  pick a sequence x_i
  estimate p given the current motif positions a, using all sequences but x_i (update step)
  sample a new motif position a_i for x_i (sampling step)
until convergence
return: p, a
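A minimal Python sketch of this loop (illustrative names; the pseudocounted model and the motif-versus-background sampling weights follow the AlignACE slides below):

import random

ALPHABET = "ACGT"

def build_model(S, starts, K, skip, beta=1.0):
    # Pseudocounted profile M[k][c] from all sequences except index `skip`.
    M = [{c: beta for c in ALPHABET} for _ in range(K)]
    for i, (x, a) in enumerate(zip(S, starts)):
        if i != skip:
            for k in range(K):
                M[k][x[a + k]] += 1
    return [{c: v / sum(col.values()) for c, v in col.items()} for col in M]

def gibbs(S, K, B, iters=1000):
    # B: background letter probabilities, e.g. {'A': 0.25, ...}.
    starts = [random.randrange(len(x) - K + 1) for x in S]
    for _ in range(iters):
        i = random.randrange(len(S))            # pick a sequence
        M = build_model(S, starts, K, skip=i)   # update step
        x, weights = S[i], []
        for j in range(len(x) - K + 1):         # sampling step: A_j = Q_j / P_j
            Q = P = 1.0
            for k in range(K):
                Q *= M[k][x[j + k]]
                P *= B[x[j + k]]
            weights.append(Q / P)
        starts[i] = random.choices(range(len(weights)), weights=weights)[0]
    return starts

Convergence testing and restarts (described on the later slides) are omitted; a fixed iteration count stands in.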

Sampling New Motif Positions
For each possible starting position a_i = j, compute a weight A_j proportional to the probability of the word at j under the motif model relative to the background (the ratio A_j = Q_j / P_j made precise on the AlignACE sampling slide below); then randomly select a new starting position according to these weights.

Gibbs Sampling (AlignACE)
Given: sequences x_1, …, x_N; motif length K; background model B.
Find: model M and locations a_1, …, a_N in x_1, …, x_N maximizing the log-odds likelihood ratio:

F = Σ_{i=1}^{N} Σ_{k=1}^{K} log( M(k, x_{a_i+k−1}) / B(x_{a_i+k−1}) )

Gibbs Sampling (AlignACE)
AlignACE: first statistical motif finder. BioProspector: an improved version of AlignACE.
Algorithm (sketch):
1. Initialization:
   a. Select random locations in the sequences x_1, …, x_N
   b. Compute an initial model M from these locations
2. Sampling iterations:
   a. Remove one sequence x_i
   b. Recalculate the model
   c. Pick a new location for the motif in x_i according to the probability that the location is a motif occurrence

Gibbs Sampling (AlignACE)
Initialization:
– Select random locations a_1, …, a_N in x_1, …, x_N
– For these locations, compute M:

M_kj = (1/N) Σ_{i=1}^{N} I( x_{a_i + k − 1} = j )

That is, M_kj is the number of occurrences of letter j at motif position k, over the total.

Gibbs Sampling (AlignACE)
Predictive update:
– Select a sequence x = x_i
– Remove x_i and recompute the model:

M_kj = (1/(N − 1 + B)) ( Σ_{h ≠ i} I( x_{a_h + k − 1} = j ) + β_j )

where the β_j are pseudocounts to avoid zeros, and B = Σ_j β_j.

Gibbs Sampling (AlignACE)
Sampling: for every K-long word x_j, …, x_{j+K−1} in x:
Q_j = Prob[ word | motif ] = M(1, x_j) × … × M(K, x_{j+K−1})
P_j = Prob[ word | background ] = B(x_j) × … × B(x_{j+K−1})
Let A_j = Q_j / P_j, normalized over all positions. Sample a random new position a_i according to the probabilities A_1, …, A_{|x|−K+1}.

Gibbs Sampling (AlignACE)
Running Gibbs sampling:
1. Initialize
2. Run until convergence
3. Repeat steps 1–2 several times; report the common motifs

Advantages / Disadvantages
Very similar to EM.
Advantages:
– Easier to implement
– Less dependent on initial parameters
– More versatile; easier to enhance with heuristics
Disadvantages:
– More dependent on all sequences exhibiting the motif
– Less systematic search of the initial-parameter space

Repeats, and a Better Background Model
Repeat DNA can be confused with a motif:
– especially low-complexity sequence (CACACA…, AAAAA…, etc.)
Solution: a more elaborate background model:
– 0th order: B = { p_A, p_C, p_G, p_T }
– 1st order: B = { P(A|A), P(A|C), …, P(T|T) }
– …
– Kth order: B = { P(X | b_1 … b_K); X, b_i ∈ {A,C,G,T} }
Higher-order background models have been applied to both EM and Gibbs sampling (up to 3rd order).
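A sketch of estimating such a K-th order background model (illustrative: each letter is conditioned on the K preceding letters via pseudocounted (K+1)-mer counts):

from collections import Counter

def kth_order_background(seqs, K, beta=1.0):
    counts = Counter()
    for x in seqs:
        for i in range(len(x) - K):
            counts[x[i:i + K + 1]] += 1   # count (K+1)-mers
    def prob(X, context):
        # P(X | b_1 ... b_K) with pseudocount beta; context is a K-long string.
        total = sum(counts[context + c] + beta for c in "ACGT")
        return (counts[context + X] + beta) / total
    return prob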

Example Application: Motifs in Yeast
Group: Tavazoie et al. 1999 (G. Church's lab, Harvard)
Data: microarray measurements of 6,220 mRNAs in yeast
– Affymetrix chips (Cho et al.)
– 15 time points across two cell cycles

Processing of Data
1. Selection of 3,000 genes:
   – genes with the most variable expression were selected
   – clustering according to common expression: K-means, 30 clusters
   – clusters correlate well with known function
2. AlignACE motif finding:
   – 600-bp upstream regions
   – 50 regions per trial

Motifs in Periodic Clusters

Motifs in Non-periodic Clusters

Limits of Motif Finders
Given upstream regions of coregulated genes:
– increasing the region length makes motif finding harder: random motifs clutter the true ones
– decreasing the region length makes motif finding harder: the true motif may be missing in some sequences

Limits of Motif Finders
A (k, d)-motif is a k-long motif with d random differences per copy.
Motif challenge problem: find a (15, 4)-motif in N sequences of length L.
CONSENSUS, MEME, AlignACE, and most other programs fail for N = 20, L = 1000.
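For experimenting with the challenge problem, a sketch of a planted (k, d)-motif instance generator (illustrative; each copy receives exactly d substitutions at distinct positions):

import random

def planted_instance(N=20, L=1000, k=15, d=4):
    motif = "".join(random.choice("ACGT") for _ in range(k))
    seqs = []
    for _ in range(N):
        copy = list(motif)
        for pos in random.sample(range(k), d):  # exactly d differences
            copy[pos] = random.choice([c for c in "ACGT" if c != copy[pos]])
        rest = [random.choice("ACGT") for _ in range(L - k)]
        j = random.randrange(L - k + 1)         # implant position
        seqs.append("".join(rest[:j] + copy + rest[j:]))
    return motif, seqs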

Summary
• Introduction
  – Bio review: Where do ambiguities come from?
  – Computational formulation of the problem
• Combinatorial solutions
  – Exhaustive search
  – Greedy motif clustering
  – Wordlets and motif refinement
• Probabilistic solutions
  – Expectation maximization
  – Gibbs sampling