1
(Regulatory-) Motif Finding
2
Clustering of Genes
Find binding sites responsible for common expression patterns
3
Chromatin Immunoprecipitation + Microarray Chip (“ChIP-chip”)
4
Finding Regulatory Motifs
Given a collection of genes with common expression, find the TF-binding motif they have in common.
5
Characteristics of Regulatory Motifs
- Tiny
- Highly variable
- ~Constant size (because a constant-size transcription factor binds)
- Often repeated
- Low-complexity-ish
6
Essentially a Multiple Local Alignment
- Find the “best” multiple local alignment
- The alignment score is defined differently in the combinatorial and probabilistic cases
7
Algorithms
- Combinatorial: CONSENSUS, TEIRESIAS, SP-STAR, others
- Probabilistic:
  - Expectation Maximization: MEME
  - Gibbs Sampling: AlignACE, BioProspector
8
Combinatorial Approaches to Motif Finding
9
Discrete Formulations
- Given sequences S = {x1, …, xn}
- A motif W is a consensus string w1…wK
- Find the motif W* with the “best” match to x1, …, xn
- Definition of “best”:
  d(W, xi) = minimum Hamming distance between W and any K-long word in xi
  d(W, S) = Σi d(W, xi)
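This scoring function translates directly into code. Below is a minimal Python sketch; the helper names (hamming, d_word, d_total) are illustrative, not from any motif-finding package.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(1 for p, q in zip(a, b) if p != q)

def d_word(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-long word in x."""
    K = len(W)
    return min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1))

def d_total(W, S):
    """d(W, S) = sum over sequences of d(W, x_i)."""
    return sum(d_word(W, x) for x in S)

# Example: a small score means W occurs (almost) exactly in every sequence.
S = ["ACGTTGACCT", "TTACGTTGAA", "GGACGATGAC"]
print(d_total("ACGTTGA", S))
```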
10
Approaches
- Exhaustive searches
- CONSENSUS
- MULTIPROFILER
- TEIRESIAS, SP-STAR, WINNOWER
11
Exhaustive Searches
1. Pattern-driven algorithm:
   For W = AA…A to TT…T (4^K possibilities)
     Find d(W, S)
   Report W* = argmin d(W, S)
   Running time: O(K N 4^K), where N = Σi |xi|
   Advantage: finds the provably “best” motif W
   Disadvantage: time
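As a rough illustration, here is a Python sketch of the pattern-driven search that enumerates all 4^K consensus strings; it is only practical for very small K, and the function names are illustrative.

```python
from itertools import product

def hamming(a, b):
    return sum(1 for p, q in zip(a, b) if p != q)

def d_total(W, S):
    """d(W, S): sum over sequences of the best-matching-word distance."""
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1)) for x in S)

def pattern_driven(S, K):
    """Try every K-mer over {A,C,G,T}: O(K * N * 4^K), provably optimal consensus."""
    best_W, best_d = None, float("inf")
    for letters in product("ACGT", repeat=K):
        W = "".join(letters)
        score = d_total(W, S)
        if score < best_d:
            best_W, best_d = W, score
    return best_W, best_d

S = ["TTACGATTGC", "CCACGTTTGA", "GACGATTGAT"]
print(pattern_driven(S, 5))
```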
12
Exhaustive Searches
2. Sample-driven algorithm:
   For W = each K-long word occurring in some xi
     Find d(W, S)
   Report W* = argmin d(W, S), or report a local improvement of W*
   Running time: O(K N^2)
   Advantage: time
   Disadvantage: if the true motif is weak and does not occur verbatim in the data, a random motif may score better than any instance of the true motif
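A corresponding sketch of the sample-driven variant, which restricts candidates to K-mers that actually occur in the data (again, names are illustrative):

```python
def hamming(a, b):
    return sum(1 for p, q in zip(a, b) if p != q)

def d_total(W, S):
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1)) for x in S)

def sample_driven(S, K):
    """Only score K-mers that occur in the data: O(K * N^2)."""
    candidates = {x[i:i + K] for x in S for i in range(len(x) - K + 1)}
    return min(candidates, key=lambda W: d_total(W, S))

S = ["TTACGATTGC", "CCACGTTTGA", "GACGATTGAT"]
print(sample_driven(S, 5))   # may miss a weak motif that never occurs verbatim
```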
13
MULTIPROFILER
- Extended sample-driven approach
- Given a K-long word W, define:
  Nα(W) = words W’ in S such that d(W, W’) ≤ α
- Idea: assume W is an occurrence of the true motif W*
- Use Nα(W) to correct “errors” in W
14
MULTIPROFILER
- Assume W differs from the true motif W* in at most L positions
- Define: a wordlet G of W is an L-long pattern with blanks, differing from W
  (L is smaller than the word length K)
- Example: K = 7, L = 3
  W = ACGTTGA
  G = --A--CG
15
MULTIPROFILER
Algorithm:
For each W in S:
  For L = 1 to Lmax:
    Find the α-neighbors of W in S → Nα(W)
    Find all “strong” L-long wordlets G in Nα(W)
    For each wordlet G, modify W by G → W’ and compute d(W’, S)
Report W* = argmin d(W’, S)
Finding the strong wordlets is itself a smaller motif-finding problem; solve it by exhaustive search.
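The sketch below only captures the spirit of this algorithm under a strong simplification: the “strong wordlet” step is replaced by a per-position majority vote among the α-neighbors, so this is not the actual MULTIPROFILER procedure, just an illustration of neighbor-based error correction.

```python
from collections import Counter

def hamming(a, b):
    return sum(1 for p, q in zip(a, b) if p != q)

def kmers(S, K):
    return [x[i:i + K] for x in S for i in range(len(x) - K + 1)]

def d_total(W, S):
    K = len(W)
    return sum(min(hamming(W, x[i:i + K]) for i in range(len(x) - K + 1)) for x in S)

def multiprofiler_like(S, K, alpha):
    """For each word W in S, correct W using its alpha-neighborhood, then score it."""
    best_W, best_d = None, float("inf")
    words = kmers(S, K)
    for W in set(words):
        neighbors = [w for w in words if hamming(W, w) <= alpha]      # N_alpha(W)
        # Simplified correction: majority letter among neighbors at each position.
        corrected = "".join(Counter(w[j] for w in neighbors).most_common(1)[0][0]
                            for j in range(K))
        score = d_total(corrected, S)
        if score < best_d:
            best_W, best_d = corrected, score
    return best_W, best_d

S = ["TTACGATTGC", "CCACGTTTGA", "GACGATTGAT"]
print(multiprofiler_like(S, 5, alpha=2))
```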
16
CONSENSUS
Example word sets: ACGGTTG, CGAACTT, GGGCTCT, … / ACGCCTG, AGAACTA, GGGGTGT, …
Algorithm, cycle 1:
For each word W in S (of fixed length!)
  For each word W’ in S
    Create a gap-free alignment of W, W’
Keep the C1 best alignments, A1, …, AC1
17
CONSENSUS
Algorithm, cycle t:
For each word W in S
  For each alignment Aj from cycle t−1
    Create a gap-free alignment of W, Aj
Keep the Ct best alignments A1, …, ACt
Example word sets being merged: ACGGTTG, CGAACTT, GGGCTCT, … / ACGCCTG, AGAACTA, GGGGTGT, … / ACGGCTC, AGATCTT, GGCGTCT, …
18
CONSENSUS
- C1, …, Cn are user-defined heuristic constants
- N is the sum of sequence lengths; n is the number of sequences
- Running time:
  O(N^2) + O(N C1) + O(N C2) + … + O(N Cn) = O(N^2 + N Ctotal),
  where Ctotal = Σi Ci, typically O(nC) for a large constant C
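A simplified beam-search sketch of the CONSENSUS idea follows. It grows gap-free alignments one sequence at a time, keeping only the C best partial alignments per cycle, and scores columns by majority agreement as a stand-in for the information-content score the real program uses; all names and the scoring choice are illustrative.

```python
from collections import Counter

def kmers(x, K):
    return [x[i:i + K] for i in range(len(x) - K + 1)]

def alignment_score(words):
    """Sum over columns of the count of the most common letter (higher = more conserved)."""
    K = len(words[0])
    return sum(Counter(w[j] for w in words).most_common(1)[0][1] for j in range(K))

def consensus_like(S, K, C=50):
    # Cycle 1: all single words from the first sequence.
    beam = [[w] for w in kmers(S[0], K)]
    # Cycle t: extend every kept alignment with every word of the next sequence.
    for x in S[1:]:
        extended = [aln + [w] for aln in beam for w in kmers(x, K)]
        extended.sort(key=alignment_score, reverse=True)
        beam = extended[:C]              # keep the C best alignments
    return beam[0]                       # best gap-free alignment found

S = ["TTACGATTGC", "CCACGTTTGA", "GACGATTGAT"]
print(consensus_like(S, 5, C=20))
```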
19
Expectation Maximization in Motif Finding
20
Expectation Maximization
Algorithm (sketch):
- Given genomic sequences, find all K-long words
- Assume each word is either motif or background
- Find the likeliest motif model, background model, and classification of words into motif or background
(Figure: all K-long words partitioned between the motif model and the background model)
21
Expectation Maximization
- Given sequences x1, …, xN, find all K-long words X1, …, Xn
- Define the motif model:
  M = (M1, …, MK), with Mi = (Mi1, …, Mi4) (assume the alphabet {A, C, G, T}),
  where Mij = Prob[ letter j occurs in motif position i ]
- Define the background model:
  B = (B1, …, B4),
  where Bj = Prob[ letter j occurs in background sequence ]
22
Expectation Maximization
- Define
  Zi1 = 1 if Xi is motif, 0 otherwise
  Zi2 = 1 if Xi is background, 0 otherwise
- Given a word Xi = x[1]…x[K], and letting λ be the prior probability that a word is motif:
  P[ Xi, Zi1 = 1 ] = λ M1,x[1] … MK,x[K]
  P[ Xi, Zi2 = 1 ] = (1 − λ) Bx[1] … Bx[K]
- Let λ1 = λ; λ2 = 1 − λ
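A small numpy sketch of these joint probabilities, assuming M is stored as a K×4 matrix (rows = motif positions, columns = A, C, G, T), B as a length-4 vector, and lam standing in for λ; the layout and names are illustrative.

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def joint_probs(X, M, B, lam):
    """Return (P[X, Z1=1], P[X, Z2=1]) for a K-long word X."""
    cols = [IDX[c] for c in X]
    p_motif = lam * np.prod([M[j, cols[j]] for j in range(len(X))])
    p_background = (1.0 - lam) * np.prod([B[c] for c in cols])
    return p_motif, p_background

K = 4
M = np.full((K, 4), 0.1); M[:, 0] = 0.7          # a motif model that strongly prefers 'A'
B = np.array([0.25, 0.25, 0.25, 0.25])
print(joint_probs("AAAA", M, B, lam=0.1))        # motif term dominates
print(joint_probs("CGTC", M, B, lam=0.1))        # background term dominates
```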
23
Expectation Maximization
- Define the parameter space θ = (M, B), with mixture components 1: motif and 2: background (priors λ1 = λ, λ2 = 1 − λ)
- Objective: maximize the log likelihood of the model,
  log P(X, Z | θ, λ) = Σi Σj Zij ( log λj + log P(Xi | θj) )
24
Expectation Maximization
Maximize the expected log likelihood by iterating two steps:
- Expectation: find the expected value of the log likelihood over the Zij
- Maximization: maximize that expected value over θ and λ
25
Expectation Maximization: E-step
Find the expected value of the log likelihood:
  E[ log P(X, Z | θ, λ) ] = Σi Σj Z*ij ( log λj + log P(Xi | θj) )
where the expected values of Z are posterior probabilities computed as follows:
  Z*ij = E[ Zij | Xi, θ, λ ] = λj P(Xi | θj) / ( λ1 P(Xi | θ1) + λ2 P(Xi | θ2) )
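A sketch of this E-step in Python, applying Bayes’ rule to the model defined above; the parameter layout mirrors the earlier snippet and the example values are illustrative.

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def e_step(words, M, B, lam):
    """Return an (n x 2) array Z*, rows = words, columns = (motif, background)."""
    Z = np.empty((len(words), 2))
    for i, X in enumerate(words):
        cols = [IDX[c] for c in X]
        p1 = lam * np.prod([M[j, c] for j, c in enumerate(cols)])       # P[X_i, Z_i1 = 1]
        p2 = (1 - lam) * np.prod([B[c] for c in cols])                  # P[X_i, Z_i2 = 1]
        Z[i] = [p1, p2]
    return Z / Z.sum(axis=1, keepdims=True)    # normalize: Z*_ij = p_j / (p_1 + p_2)

K = 4
M = np.full((K, 4), 0.1); M[:, 0] = 0.7
B = np.array([0.25, 0.25, 0.25, 0.25])
print(e_step(["AAAA", "CGTC", "ACGT"], M, B, lam=0.1))
```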
26
Expectation Maximization: M-step
Maximize the expected value over θ and λ independently.
For λ, this is easy:
  λ* = (1/n) Σi Z*i1  (and λ2* = 1 − λ*)
27
Expectation Maximization: M-step
For θ = (M, B), define:
  cjk = E[ # times letter k appears in motif position j ]
  c0k = E[ # times letter k appears in the background ]
The cjk values are computed easily from the Z* values, and it follows that
  Mjk = cjk / Σk' cjk'
  Bk = c0k / Σk' c0k'
To avoid zeros, add pseudocounts to the counts.
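A sketch of the M-step under the same illustrative layout, with pseudocounts added to the expected counts; it assumes Z is the (n × 2) posterior matrix produced by the E-step sketch above.

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def m_step(words, Z, K, pseudocount=1.0):
    lam = Z[:, 0].mean()                                  # lambda* = (1/n) * sum_i Z*_i1
    c_motif = np.full((K, 4), pseudocount)                # c_jk, j = 1..K
    c_background = np.full(4, pseudocount)                # c_0k
    for X, (z_motif, z_background) in zip(words, Z):
        for j, letter in enumerate(X):
            c_motif[j, IDX[letter]] += z_motif
            c_background[IDX[letter]] += z_background
    M = c_motif / c_motif.sum(axis=1, keepdims=True)      # M_jk = c_jk / sum_k' c_jk'
    B = c_background / c_background.sum()                 # B_k  = c_0k / sum_k' c_0k'
    return M, B, lam

# Example: pretend the E-step said the first word is almost surely motif.
words = ["AAAA", "CGTC", "ACGT"]
Z = np.array([[0.95, 0.05], [0.02, 0.98], [0.30, 0.70]])
M, B, lam = m_step(words, Z, K=4)
print(lam); print(M.round(2)); print(B.round(2))
```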
28
Initial Parameters Matter!
Consider the following “artificial” example. The sequences x1, …, xN contain:
- 2^12 patterns on {A, T}: A…A, A…AT, …, T…T
- 2^12 patterns on {C, G}: C…C, C…CG, …, G…G
- D << 2^12 occurrences of the 12-mer ACTGACTGACTG
Some local maxima:
- λ = ½; B = ½ C, ½ G; Mi = ½ A, ½ T for i = 1, …, 12
- λ = D / 2^(K+1); B = ¼ A, ¼ C, ¼ G, ¼ T; M1 = 100% A, M2 = 100% C, M3 = 100% T, etc.
29
Overview of EM Algorithm
Initialize the parameters θ = (M, B) and λ:
  try different values of λ, from N^(−1/2) up to 1/(2K)
Repeat:
  Expectation
  Maximization
until the change in θ = (M, B) and λ falls below ε
Report results for several “good” values of λ
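To tie the pieces together, here is a self-contained Python sketch of the whole loop: EM runs from several initial λ values between N^(−1/2) and 1/(2K), and the run with the best log likelihood is kept. Every detail here (random motif initialization, geometric spacing of λ, the convergence test) is an illustrative choice, not MEME’s actual implementation.

```python
import numpy as np

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def word_joints(words, M, B, lam):
    """P[X_i, Z_i1 = 1] and P[X_i, Z_i2 = 1] for every word."""
    P = np.empty((len(words), 2))
    for i, X in enumerate(words):
        cols = [IDX[c] for c in X]
        P[i, 0] = lam * np.prod([M[j, c] for j, c in enumerate(cols)])
        P[i, 1] = (1 - lam) * np.prod([B[c] for c in cols])
    return P

def em_run(words, K, lam, iters=200, epsilon=1e-6):
    rng = np.random.default_rng(0)
    M = rng.dirichlet(np.ones(4), size=K)              # random initial motif model
    B = np.full(4, 0.25)
    for _ in range(iters):
        P = word_joints(words, M, B, lam)
        Z = P / P.sum(axis=1, keepdims=True)            # E-step: posteriors Z*
        new_lam = Z[:, 0].mean()                        # M-step: lambda
        cM = np.full((K, 4), 1.0)                       # pseudocounts
        cB = np.full(4, 1.0)
        for X, (zm, zb) in zip(words, Z):
            for j, ch in enumerate(X):
                cM[j, IDX[ch]] += zm
                cB[IDX[ch]] += zb
        newM = cM / cM.sum(axis=1, keepdims=True)       # M-step: motif model
        newB = cB / cB.sum()                            # M-step: background model
        converged = abs(new_lam - lam) + np.abs(newM - M).sum() < epsilon
        M, B, lam = newM, newB, new_lam
        if converged:
            break
    loglik = np.log(word_joints(words, M, B, lam).sum(axis=1)).sum()
    return loglik, lam, M, B

def em_driver(sequences, K, n_starts=5):
    words = [x[i:i + K] for x in sequences for i in range(len(x) - K + 1)]
    N = sum(len(x) for x in sequences)
    best = None
    for lam0 in np.geomspace(N ** -0.5, 1.0 / (2 * K), n_starts):   # lambda schedule
        result = em_run(words, K, lam0)
        if best is None or result[0] > best[0]:
            best = result
    return best

sequences = ["TTACGATTGC", "CCACGTTTGA", "GACGATTGAT"]
loglik, lam, M, B = em_driver(sequences, K=5)
print(round(lam, 3))
print(M.round(2))
```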
30
Overview of EM Algorithm
- One iteration running time: O(NK)
- Usually need < N iterations for convergence, and < N starting points
- Overall complexity: unclear; typically O(N^2 K) to O(N^3 K)
- EM is a local optimization method: initial parameters matter
- MEME: Bailey and Elkan, ISMB 1994