Download presentation
Presentation is loading. Please wait.
1
Multi-seed lossless filtration Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology, Puschino, Russia CPM (Istanbul) July 5-7, 2004
2
Text filtration: general principle potential matches
3
Text filtration: general principle potential matches
4
Text filtration: general principle lossless and lossy filters true match
5
Filtration applied to sequence comparison potential similarities
6
Filtration applied to sequence alignment potential similarities
7
Filtration applied to sequence alignment true similarities
8
Gapless similarities. Hamming distance. Similarities are defined through Hamming distance GCTACGACTTCGAGCTGC...CTCAGCTATGACCTCGAGCGGCCTATCTA...
9
Gapless similarities. Hamming distance. Similarities are defined through Hamming distance
10
Gapless similarities. Hamming distance. Similarities are defined through Hamming distance ( m, k )-problem, ( m, k )-instances m k
11
Gapless similarities. Hamming distance. Similarities are defined through Hamming distance ( m, k )-problem, ( m, k )-instances This work: lossless filtering m k
12
Filtering by contiguous fragment PEX (Navarro&Raffinot 2002) –Searching for a contiguous pattern PEX with errors –Searching for a contiguous pattern with l possible errors requires retrieval of all l-variants in the index. Efficient for –small alphabets (ADN,ARN) –relatively small l (<= 2) m=18 k=3 #### ######### (1) (m,k)
13
Superposition of two filters Pevzner&Waterman 1995 Idea: combine PEX with another filter based on a regularly-spaced seed PEX : spaced PEX (matches occurring at every k positions). #### #---#---#---# k+1
14
Spaced seeds Spaced seeds (spaced Q-grams) –proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems Principle –Searching for spaced rather than contiguous patterns –Selectivity defined by the weight of the seed (number of #’s) ###-##
15
Example: (18,3)-problem ###-##
16
Spaced seeds for sequence comparison Ma, Tromp, Li 2002 (PatternHunter) Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004,... Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004,...
17
This work: lossless filtration using spaced seed families (extension of Burkhard&Karkkainen 2001) single filter based on several distinct seeds each seed detects a part of (m,k)-instances but together they must detect all (m,k)-instances Families of spaced seeds Independent work (lossy seed families for sequence alignment): Li, Ma, Kisman, Tromp 2004 (PatternHunter II) Xu, Brown, Li, Ma, this conference Sun, Buhler, RECOMB 2004 (Mandala)
18
–every (18,3)-instance contains an occurrence of a seed of F –all seeds of the family have the same weight 7 Example: (18.3)-problem (cont) Family F solves the (18,3)-problem ##-#-#### ###---#--##-# F
19
##-##-##### ###-####--## ###-##---#-### ##----####-### ###---#-#-##-## ###-#-#-#-----### Example: (18.3)-problem (cont) ##-#-#### ###---#--##-# ###-##---#-### ###---#--##-# w=7 w=9
20
#### ###-## ##-##-##### ###-####--## ###-##---#-### ##----####-### ###---#-#-##-## ###-#-#-#-----### Comparative selectivity ##-#-#### ###---#--##-# w=4 ~39. 10 -4 w=5 ~9.8 10 -4 w=7 ~1.2 10 -4 w=9 ~0.23 10 -4 Selectivity of families on Bernoulli similarities ( p(match) = 1/4 ) estimated as the probability for one of the seeds to occur at a given position
21
How far should we go A trivial extreme solution... –would be to pick all seeds of weight m - k. –prohibitive cost except for very small problems We are interested in intermediate solutions: –relatively small number of seeds (< 10) to keep the hash table of a reasonable size, –the seed weight sufficiently large to obtain a good selectivity
22
Results Computing properties of seed families Seed design –Seed expansion/contraction –Periodic seeds –Seed optimality –Heuristic seed design Experiments –Examples of designed seed families –Application to computing specific oligonucleotides Conclusions
23
Measuring the efficiency of a family Optimal threshold (Burkhard&Karkkainen): minimal number of seed occurrences over all (m,k)-instances A seed family F is lossless iff the optimal threshold T F (m,k)1 T F (m,k) can be computed by a dynamic programming algorithm in time O(m·k·2 (S+1) ) and space O(k·2 (S+1) ), where S is the maximal length of a seed from F optimizations are possible (see the paper) the resulting space complexity is the same as in the Burkhard&Karkkainen algorithm
24
Measuring the efficiency of a family (cont) Using a similar DP technique we can compute, within the same time complexity bound: the number U F (m,k) of undetected (m,k)-similarities for a (lossy) family F the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed [see the paper for details]
25
Design of seed families Pruning exhaustive search tree (Burkhard&Karkkainen) –Construct all solutions of weight w from solutions of weight w – 1 –Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w. –Prohibitive cost: more than a week for computing all single-seed solutions of the (50,5)-problem the search space blows up for multi-seed families
26
Seed expansion/contraction Burkhard&Karkkainen : the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# #-#-#---#-----#-#-#---#-----#-#-#---#
27
Seed expansion/contraction Burkhard&Karkkainen : the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# #-#-#---#-----#-#-#---#-----#-#-#---# the only solution of weight 12 of the (25,2)-problem
28
Seed expansion/contraction Burkhard&Karkkainen : the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# #-#-#---#-----#-#-#---#-----#-#-#---# –Let be the i -regular expansion of F obtained by inserting i-1 jokers between successive positions of each seed of F –Example: If F = { ###-#, ##-## } then = { #-#-#---#, #-#---#-# } = { #--#--#-----#, #--#-----#--# } the only solution of weight 12 of the (25,2)-problem
29
Seed expansion/contraction (cont) Lemma: –If a family F solves an (m,k) –problem, then both F and solves the (i·m, (i+1)·k- 1) –problem –If a family solves the (i·m,k) –problem, then its i-contraction F solves the (m, ) -problem ##-#-#### ###---#--##-# ##-#-#### ###---#--##-# #-#---#---#-#-#-# #-#-#-------#-----#-#-# (18,3) (36,7)
30
Periodic seeds Iterating short seeds with good properties into longer seeds ###-#--###-#--###-# ###-#--
31
Cyclic problem Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Q i =[Q,- (m-s(Q)) ] i solves the linear (m·(i+1)+s(Q)-1,k)- problem. Cyclic ( 11,3 )-problem Linear ( 30,3 )-problem ###-#--#--- ###-#--#---###-#--#
32
Extension to multi-seed case Cyclic ( 11,3 )-problem Linear ( 25,3 )-problem ###-#--#--- ###-#--#---###-#--# #--#---###-#--#---###
33
Extension to multi-seed case Cyclic ( 11,3 )-problem Linear ( 25,3 )-problem ###-#--#--- ###-#--#---###-#--# #--#---###-#--#---###
34
Asymptotic optimality Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then the fraction of the number of jokers tends to 0 but the convergence speed depends on k seed expansion cannot provide an asymptotically optimal solution ( )
35
Non-asymptotic optimality Fix a number of errors k. For each seed (seed family) Q there exists m Q s.t. mm Q, Q solves the (m,k)-problem For a class of seeds , Q is an optimal seed in iff Q realizes the minimal m Q over all seeds of Lemma: Let n be an integer and r=n/3. For every k2, seed # n-r -# r is optimal among seeds of weight n with one joker.
36
Heuristic seed design: genetic algorithm a population of seed families is evolving by mutating and crossing over seed families are screened against sets of difficult (m,k)- instances for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families compute the contribution of each seed of the family. Mutate the less “valuable” seeds. difficult (m,k)- instances seed families select and reorder select
37
Example: (25,2)-problem
38
Example: (25,3)-problem
39
Application of lossless filtering: oligo design Specific oligonucleotides: small DNA molecules (10- 50bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome) Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
40
Seed design: (32,5)-problem
41
Experiment This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp) All 32-windows occurring elsewhere within 5 errors have been computed The computation took slightly more than 1 hour on a P4 3GHz computer 87% of the database have been “filtered out”
42
Further questions Combinatorial structure of optimal seed families Efficient design algorithm
43
Questions ? ? ?
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.