Download presentation
Presentation is loading. Please wait.
Published byHugh Walker Modified over 9 years ago
1
SPLASH: Structural Pattern Localization Analysis by Sequential Histograms A. Califano, IBM TJ Watson Presented by Tao Tao April 14 th, 2004
2
Motif: A functional regain of a DNA or protein sequence How to discover the functional regains automatically? Amino Acids sequences
3
Automatic Motif discovery Problem - Use A, B, C, … stands for different amino acids - A protein sequence: ABABAABCDBAA… - Motifs are certain patterns in sequences for example: ABCA Previous Methods: small scale discovery - Several sequences similar functions alignment Can we use data mining to generate motifs candidates first?
4
Automatically discover motifs: What properties should a motif have? It has a specific function conservative frequent appearing in sequences Evolution likely not continually identical For example: ABCBABABA AB--ABAB- string matching, suffix tree … AB-BAB-B- how?
5
Formal problem definition Input: A string of characters: S=s 1 s 2,…,s L Output: A frequent pattern: (∑ U ●)* ●: a wild card to match a single character, ∑: a full character * : repeat arbitrary times Note: NO arbitrary-length gap. ABCD, AED are different
6
Regular Expression: to describe a certain type of patterns | or : A|B means A or B ● wild card to match any characters A●B means: AAB, ABB, ACB, … * to repeat any times (including 0 times) (AB)* means null, AB, ABAB, ABABAB, … + to repeat any times (not including 0 times) …
7
Any requirements for output patterns? Can wild card be anywhere? Do we need some constraints on wild cards? What means “frequent”? How long should a qualified pattern be?
8
Can wild card be anywhere? A pattern can have ●: for example A●BA●●B But, A●●●●●●●●●●●●●●●●BBA ?? Probably, it cannot be too “sparse”… Naïve solution: no more than n ● But, for example n=5 A●●●●●B : 5 ● A●BB●●A●●B●●A : 7●
9
Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… Density: how “sparse” do we allow? Two ● at most
10
Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… Density: how “sparse” do we allow? Two ● at most
11
Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… Density: how “sparse” do we allow? Two ● at most
12
Given a pattern P, any length l 0 region in P must have k 0 full characters Example: l 0 = 5, k 0 = 3 s 1 s 2 s 3 s 4 s 5 s 6 s 7 s 8 s 9 …… A●●ABB●A √ BA●●A●BB X Density: how “sparse” do we allow?
13
Frequency and length At least, the patterns have K 0 full characters repeating J 0 times Example J 0 =3 ABCBABABA √ ABCBABABA X Example K 0 =3 ABCBABABA X ABCBABABA √
14
Summary of parameters for a pattern Sequence S, and its length L Pattern P, K full character, appears J times Length constraints: K ≥ K 0 Frequency constraint: J ≥ J 0 Density constraint: l 0, k 0
15
Apriori property A constraint has a-priori property means: If a set violates this constraint, any its superset will violate this constraint as well. For example max(S) < 5 Frequency constraint has a-priori property! For example, BA●A●BB appears less than J 0 times, any its super patterns CANNOT appears more than J 0 times!
16
A whole picture of the algorithm To form longer pattern only from short qualified patterns. - First, to generate candidates/seed (length l 0 ): every seed should repeat at least J 0 times - To generate longer patterns from short patterns, iteratively 1. Two patterns are together 2. Longer patterns repeat at least J 0 ……
17
Generate the seeds: enumerating … To generate seeds (shortest patterns) first ABAABBCBACBDB… J 0 =4 A: 4, B: 6, C:2, D: 1 Are length 1 seeds too short? How long could those seeds be? - Too long: enumerating costs too much time - Too short: maybe not efficient, also not consider the density constraints Maybe, we should start from the patterns with length l 0.
18
How to generate seeds with length l 0 ? Give l 0 and k 0, and character sets ABC… Enumerating all possible patterns with length l 0 Scan the sequence the count the frequency For example, l 0 =3, k 0 =2, ABC AAA, AAB, AAC, ABA, … AA●, AB●, AC●, … A●A, A●B, A●C, … …
19
Can we do it more efficiently? Give l 0 and k 0, 1: full character 0:wild card Enumerating all possible patterns by 1 and 0? Example l 0 =5, k 0 =3, to find comb 11111, 11110, 11101, 11100, 11011, 11010, 11001, 10111, 10110, 10101 10011, 01111, 01110, 01101, 01011, 00111
20
How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●A●B
21
How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B B●A●A
22
How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●B●B
23
How to use comb? For example 10101 A B A A B A B B A B A B A B B A A B A●A●B 3, B●A●A 2, A●B●B 2 B●B●A 2, B●B●B 2, A●A●A 1 A●B●A 1, B●A●B 1 J 0 = 3? only A●A●B left By the same way, use others combs to generate other seeds, different combs won’t generate the same patterns
24
How to get long patterns? Long pattern two patterns could be merged need short patterns and their locations - Pattern: A●B●●C {A:0,B:2,C:5} - Locus: the locations where a pattern occurs: Patten AB in string ABBCABAB Its locus {0, 4, 6}
25
Append operation: to connect two small patterns to a longer pattern Patten S 1 : A●●B●C and S 2 : B●D● S 1 S 2 : A●●B●CB●D● conditional on: Their locus have intersection S 1 locus: {1, 20, 32, 57 …} S 2 locus: {7, 13, 38, 63 … } {1,7,32,57,…} S 1 S 2 locus: {1,32,57,…} -6
26
Add: to make the patterns more “dense” Patten A●●B●C●●●D and ●●●B●CE A●●B●CE●●D on the conditions: Their locus (with shifting) have intersection
27
Significance: whether it can be generated by randomly sampling ? hypothesis: A pattern is not randomly generated Given: character set: {A,B,C,D,E} sequence length: L A pattern: A●BA●AA●B Its frequency j Probability to generate this pattern j times pure randomly?
28
Statistical significance Pure random sampling, the frequency should satisfy normal distribution Z score, (A-E[A])/σ A --- normalized into N(0,1)
29
Experiments Two questions to answer. - How efficient is this algorithm? - How effective is this algorithm? Baseline algorithm - PRATT(EBI), MEME(UCSD)
30
Efficiency SPLASH PRATT
31
Effectiveness Search against SWISS-PROT Rel. 36, 578 GPCR proteins returned, only 4 false positive MEME cannot find it, PRATT program crashed
32
Conclusions Deterministic algorithm: It can discover all patterns satisfying the requirements Efficient and scalable: It beats PRATT and MEME. More scalable … Effective: It can discover useful patterns.
33
Problems All problems that A-priori algorithm could have: too many results, cannot really avoid worse-case exponential … Doesn’t really consider the 3D structure of proteins The software crashes sometimes
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.