(Regulatory-) Motif Finding

Clustering of Genes
Find binding sites responsible for common expression patterns.

Chromatin Immunoprecipitation + Microarray Chip ("ChIP-chip")

Finding Regulatory Motifs
Given a collection of genes with common expression, find the TF-binding motif they have in common.

Characteristics of Regulatory Motifs
- Tiny
- Highly variable
- ~Constant size (because a constant-size transcription factor binds)
- Often repeated
- Low-complexity-ish

Essentially a Multiple Local Alignment
Find the "best" multiple local alignment; the alignment score is defined differently in the probabilistic and combinatorial cases.

Algorithms
- Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, others
- Probabilistic
  - Expectation Maximization: MEME
  - Gibbs Sampling: AlignACE, BioProspector

Combinatorial Approaches to Motif Finding

Discrete Formulations
Given sequences S = {x1, …, xn}, a motif W is a consensus string w1…wK. Find the motif W* with the "best" match to x1, …, xn.
Definition of "best":
- d(W, xi) = minimum Hamming distance between W and any K-long word in xi
- d(W, S) = Σi d(W, xi)
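The distance d(W, S) defined above is straightforward to compute. A minimal sketch in Python (function names are illustrative):

```python
def hamming(a, b):
    # Number of mismatching positions between equal-length strings
    return sum(ca != cb for ca, cb in zip(a, b))

def d_seq(W, x):
    # d(W, x): minimum Hamming distance between W and any |W|-long word in x
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_set(W, S):
    # d(W, S) = sum over sequences x_i of d(W, x_i)
    return sum(d_seq(W, x) for x in S)

d_set("ACG", ["TTACGTT", "TTTACCTT"])  # → 1 (exact hit in x1, one mismatch in x2)
```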

Approaches
- Exhaustive searches
- CONSENSUS
- MULTIPROFILER
- TEIRESIAS, SP-STAR, WINNOWER

Exhaustive Searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities):
  Find d(W, S)
Report W* = argmin d(W, S)
Running time: O(K N 4^K), where N = Σi |xi|
Advantage: finds the provably "best" motif W
Disadvantage: time
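The pattern-driven search can be sketched directly (a self-contained illustration; for realistic K the 4^K enumeration is the bottleneck, exactly as the running time above says):

```python
from itertools import product

def d_set(W, S):
    # d(W, S): sum over sequences of the min Hamming distance to any word in x
    K = len(W)
    def d_seq(x):
        return min(sum(a != b for a, b in zip(W, x[i:i + K]))
                   for i in range(len(x) - K + 1))
    return sum(d_seq(x) for x in S)

def pattern_driven(S, K):
    # Enumerate all 4^K candidate motifs; return the minimizer of d(W, S)
    best = min(("".join(p) for p in product("ACGT", repeat=K)),
               key=lambda W: d_set(W, S))
    return best, d_set(best, S)

pattern_driven(["TACGT", "GACGA", "ACGTT"], 3)  # → ('ACG', 0)
```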

Exhaustive Searches
2. Sample-driven algorithm:
For W = any K-long word occurring in some xi:
  Find d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N^2)
Advantage: time
Disadvantage: if the true motif is weak and does not occur exactly in the data, a random motif may score better than any instance of the true motif
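The sample-driven variant restricts the candidate set to words that actually occur in the input, trading the 4^K factor for an O(N) candidate pool (again a sketch with illustrative names):

```python
def sample_driven(S, K):
    # Score only the K-long words that actually occur in the input sequences
    def d_set(W):
        return sum(min(sum(a != b for a, b in zip(W, x[i:i + K]))
                       for i in range(len(x) - K + 1))
                   for x in S)
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    best = min(candidates, key=d_set)
    return best, d_set(best)

sample_driven(["TACGT", "GACGA", "ACGTT"], 3)  # → ('ACG', 0)
```

If the true motif never occurs verbatim in any sequence, it is simply absent from `candidates`, which is exactly the disadvantage noted above.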

MULTIPROFILER
Extended sample-driven approach. Given a K-long word W, define:
Nα(W) = { words W’ in S s.t. d(W, W’) ≤ α }
Idea: assume W is an occurrence of the true motif W*; use Nα(W) to correct "errors" in W.

MULTIPROFILER
Assume W differs from the true motif W* in at most L positions, where L is smaller than the word length K.
Define: a wordlet G of W is an L-long pattern with blanks, differing from W.
Example: K = 7, L = 3
W = ACGTTGA
G = --A--CG

MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax:
    1. Find the α-neighbors of W in S → Nα(W)
    2. Find all "strong" L-long wordlets G in Nα(W)
    3. For each wordlet G:
       Modify W by the wordlet G → W’
       Compute d(W’, S)
Report W* = argmin d(W’, S)
The wordlet-finding step is itself a smaller motif-finding problem; use exhaustive search.

CONSENSUS
Algorithm, cycle 1:
For each word W in S (of fixed length!):
  For each word W’ in S:
    Create a gap-free alignment of W, W’
Keep the C1 best alignments, A1, …, AC1
Example input sequences:
ACGGTTG, CGAACTT, GGGCTCT
ACGCCTG, AGAACTA, GGGGTGT

CONSENSUS
Algorithm, cycle t:
For each word W in S:
  For each alignment Aj from cycle t-1:
    Create a gap-free alignment of W, Aj
Keep the Ct best alignments A1, …, ACt
The alignments grow by one word per cycle:
ACGGTTG, CGAACTT, GGGCTCT
ACGCCTG, AGAACTA, GGGGTGT
ACGGCTC, AGATCTT, GGCGTCT
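CONSENSUS proper ranks alignments by the information content of the column profile; the sketch below substitutes a simpler majority-count column score and grows the alignment one sequence at a time, which still captures the greedy beam-search idea (the scoring choice and all names are illustrative, not the published algorithm):

```python
def consensus_score(words):
    # Simple column score: sum over columns of the majority-letter count
    # (CONSENSUS proper uses information content; this is a stand-in)
    K = len(words[0])
    return sum(max(sum(w[j] == c for w in words) for c in "ACGT")
               for j in range(K))

def consensus_greedy(S, K, C=10):
    # Greedy beam search over gap-free alignments, one word per sequence
    def words(x):
        return [x[i:i + K] for i in range(len(x) - K + 1)]
    # Cycle 1: all word pairs from the first two sequences, keep the C best
    beams = [[w1, w2] for w1 in words(S[0]) for w2 in words(S[1])]
    beams = sorted(beams, key=consensus_score, reverse=True)[:C]
    # Cycle t: extend each kept alignment with a word from the next sequence
    for x in S[2:]:
        beams = [A + [w] for A in beams for w in words(x)]
        beams = sorted(beams, key=consensus_score, reverse=True)[:C]
    return beams[0]
```

Keeping only the C best partial alignments per cycle is what makes the search heuristic: a word dropped early can never rejoin a later alignment.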

CONSENSUS
C1, …, Cn are user-defined heuristic constants; N is the sum of sequence lengths; n is the number of sequences.
Running time:
O(N^2) + O(N C1) + O(N C2) + … + O(N Cn) = O(N^2 + N Ctotal)
where Ctotal = Σi Ci, typically O(nC) for some large constant C.

Expectation Maximization in Motif Finding

Expectation Maximization
Algorithm (sketch):
1. Given genomic sequences, find all K-long words
2. Assume each word is either motif or background
3. Find the likeliest:
   - Motif model
   - Background model
   - Classification of words into either motif or background

Expectation Maximization
Given sequences x1, …, xN, find all K-long words X1, …, Xn.
Define the motif model:
M = (M1, …, MK), with Mi = (Mi1, …, Mi4) (assume alphabet {A, C, G, T})
where Mij = Prob[ letter j occurs in motif position i ]
Define the background model:
B = (B1, …, B4)
where Bj = Prob[ letter j in background sequence ]

Expectation Maximization
Define:
Zi1 = 1 if Xi is motif, 0 otherwise
Zi2 = 1 - Zi1
Given a word Xi = x[1]…x[K]:
P[ Xi, Zi1 = 1 ] = λ M1x[1] … MKx[K]
P[ Xi, Zi2 = 1 ] = (1 - λ) Bx[1] … Bx[K]
Let λ1 = λ; λ2 = 1 - λ.
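The two joint probabilities above are just products over the letters of the word. A small sketch (the dict-based parameter representation is illustrative):

```python
def word_probs(X, M, B, lam):
    # P[X, Z1=1] = lam * prod_i M[i][x_i]      (motif component)
    # P[X, Z2=1] = (1 - lam) * prod_i B[x_i]   (background component)
    p_motif, p_bg = lam, 1.0 - lam
    for i, c in enumerate(X):
        p_motif *= M[i][c]
        p_bg *= B[c]
    return p_motif, p_bg
```

In practice these products underflow for long words, so implementations work in log space; the plain products are kept here to mirror the formulas.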

Expectation Maximization
Define the parameter space θ = (M, B); component 1: motif, component 2: background.
Objective: maximize the log likelihood of the model:
log P(X, Z | θ, λ) = Σi Σj Zij ( log λj + log P(Xi | θj) )

Expectation Maximization
Maximize the expected likelihood by iterating two steps:
- Expectation: find the expected value of the log likelihood over the Z values
- Maximization: maximize that expected value over θ and λ

Expectation Maximization: E-step
Find the expected value of the log likelihood:
E[ log P(X, Z | θ, λ) ] = Σi Σj Z*ij ( log λj + log P(Xi | θj) )
where the expected values Z* of Z are the posteriors:
Z*i1 = λ P(Xi | M) / ( λ P(Xi | M) + (1 - λ) P(Xi | B) ),  Z*i2 = 1 - Z*i1

Expectation Maximization: M-step
Maximize the expected value over θ and λ independently.
For λ, this is easy:
λ* = (1/n) Σi Z*i1

Expectation Maximization: M-step
For θ = (M, B), define:
cjk = E[ # times letter k appears in motif position j ]
c0k = E[ # times letter k appears in the background ]
The cjk values are calculated easily from the Z* values. It easily follows:
Mjk = cjk / Σk’ cjk’ ,  Bk = c0k / Σk’ c0k’
To not allow any 0’s, add pseudocounts.
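One full EM iteration, combining the E-step posteriors with the M-step count formulas above (a sketch under the two-component motif/background mixture; parameter representation and pseudocount value are illustrative):

```python
def em_iteration(words, M, B, lam, pseudo=1.0):
    # words: the K-long words X_1..X_n; M: list of per-position letter dicts;
    # B: background letter dict; lam: motif prior λ
    K = len(M)
    # E-step: posterior probability Z*_i1 that each word is a motif occurrence
    Z = []
    for X in words:
        pm, pb = lam, 1.0 - lam
        for i, c in enumerate(X):
            pm *= M[i][c]
            pb *= B[c]
        Z.append(pm / (pm + pb))
    # M-step: re-estimate λ, M, B from expected counts, with pseudocounts
    lam_new = sum(Z) / len(Z)
    counts = [{c: pseudo for c in "ACGT"} for _ in range(K)]
    bg = {c: pseudo for c in "ACGT"}
    for X, z in zip(words, Z):
        for i, c in enumerate(X):
            counts[i][c] += z            # c_jk accumulates Z*_i1
            bg[c] += 1.0 - z             # c_0k accumulates Z*_i2
    M_new = [{c: counts[i][c] / sum(counts[i].values()) for c in "ACGT"}
             for i in range(K)]
    total = sum(bg.values())
    B_new = {c: bg[c] / total for c in "ACGT"}
    return M_new, B_new, lam_new
```

Iterating this function until the parameters stop changing is the whole algorithm; the pseudocounts keep every Mjk and Bk strictly positive, as the slide requires.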

Initial Parameters Matter!
Consider the following "artificial" example. x1, …, xN contain:
- 2^12 patterns on {A, T}: A…A, A…AT, …, T…T
- 2^12 patterns on {C, G}: C…C, C…CG, …, G…G
- D << 2^12 occurrences of the 12-mer ACTGACTGACTG
Some local maxima:
- λ ≈ ½; B = ½C, ½G; Mi = ½A, ½T for i = 1, …, 12
- λ ≈ D/2^(K+1); B = ¼A, ¼C, ¼G, ¼T; M1 = 100% A, M2 = 100% C, M3 = 100% T, etc.

Overview of EM Algorithm
1. Initialize parameters θ = (M, B), λ: try different values of λ from N^(-1/2) up to 1/(2K)
2. Repeat:
   - Expectation
   - Maximization
   until the change in θ = (M, B), λ falls below ε
3. Report results for several "good" λ

Overview of EM Algorithm
- One iteration running time: O(NK)
- Usually need < N iterations for convergence, and < N starting points
- Overall complexity: unclear; typically O(N^2 K) to O(N^3 K)
- EM is a local optimization method: initial parameters matter
MEME: Bailey and Elkan, ISMB 1994.