
Multi-seed lossless filtration Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology,


1 Multi-seed lossless filtration Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology, Puschino, Russia CPM (Istanbul) July 5-7, 2004

2 Text filtration: general principle potential matches

3 Text filtration: general principle potential matches

4 Text filtration: general principle lossless and lossy filters true match

5 Filtration applied to sequence comparison potential similarities

6 Filtration applied to sequence alignment potential similarities

7 Filtration applied to sequence alignment true similarities

8 Gapless similarities. Hamming distance.  Similarities are defined through Hamming distance GCTACGACTTCGAGCTGC...CTCAGCTATGACCTCGAGCGGCCTATCTA...

9 Gapless similarities. Hamming distance.  Similarities are defined through Hamming distance

10 Gapless similarities. Hamming distance. Similarities are defined through Hamming distance. (m,k)-problem, (m,k)-instances: similarities of length m with at most k mismatches

11 Gapless similarities. Hamming distance. Similarities are defined through Hamming distance. (m,k)-problem, (m,k)-instances: similarities of length m with at most k mismatches. This work: lossless filtering

12 Filtering by contiguous fragment. PEX (Navarro & Raffinot 2002) –Searching for a contiguous pattern. PEX with errors –Searching for a contiguous pattern with l possible errors requires retrieving all l-variants in the index. Efficient for –small alphabets (DNA, RNA) –relatively small l (<= 2). Example for the (m,k)-problem with m=18, k=3: #### searched exactly, or ######### searched with 1 error
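The fragment lengths in the example follow from a pigeonhole argument, which can be sketched as below. The function names `pex_pieces` and `pex_fragment_length` are illustrative and do not come from the talk; the sketch only reproduces the arithmetic, not the index search itself.

```python
def pex_pieces(k: int, l: int = 0) -> int:
    """Smallest number of pieces such that, with at most k errors overall,
    at least one piece contains at most l errors (pigeonhole principle)."""
    # If every one of j pieces had l+1 errors or more, there would be at
    # least j*(l+1) errors; this exceeds k as soon as j > k / (l+1).
    return k // (l + 1) + 1


def pex_fragment_length(m: int, k: int, l: int = 0) -> int:
    """Length of the contiguous fragment searched with at most l errors
    when a window of length m is cut into equal pieces."""
    return m // pex_pieces(k, l)


if __name__ == "__main__":
    print(pex_fragment_length(18, 3))     # exact search
    print(pex_fragment_length(18, 3, 1))  # search with one error
```

For m=18 and k=3 this gives fragments of length 4 (exact) and 9 (one error), the two patterns drawn on the slide.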

13 Superposition of two filters (Pevzner & Waterman 1995). Idea: combine PEX with another filter based on a regularly spaced seed. PEX: ####. Spaced PEX (matches occurring at every (k+1)-th position): #---#---#---#

14 Spaced seeds  Spaced seeds (spaced Q-grams) –proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems  Principle –Searching for spaced rather than contiguous patterns –Selectivity defined by the weight of the seed (number of #’s) ###-##
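As a minimal sketch of the principle, a spaced seed can be represented as a string over '#' (must match) and '-' (joker), and a hit is a position where two sequences agree on every '#'. The helper names `seed_weight` and `seed_hits` are hypothetical, and the direct scan stands in for the index lookup a real filter would use.

```python
def seed_weight(seed: str) -> int:
    """Weight of a seed: the number of '#' (must-match) positions."""
    return seed.count("#")


def seed_hits(seed: str, a: str, b: str) -> list[int]:
    """Start positions p where a and b agree on every '#' position of the
    seed placed at p; joker positions ('-') are ignored."""
    n = min(len(a), len(b))
    required = [i for i, c in enumerate(seed) if c == "#"]
    return [
        p
        for p in range(n - len(seed) + 1)
        if all(a[p + i] == b[p + i] for i in required)
    ]


if __name__ == "__main__":
    # One mismatch at index 3; the joker lets the seed straddle it.
    print(seed_hits("##-#", "ACGTACGT", "ACGAACGT"))
```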

15 Example: (18,3)-problem ###-##

16 Spaced seeds for sequence comparison  Ma, Tromp, Li 2002 (PatternHunter)  Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004,...  Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004,...

17 Families of spaced seeds. This work: lossless filtration using spaced seed families (extension of Burkhardt & Kärkkäinen 2001): a single filter based on several distinct seeds; each seed detects a part of the (m,k)-instances, but together they must detect all (m,k)-instances. Independent work (lossy seed families for sequence alignment): Li, Ma, Kisman, Tromp 2004 (PatternHunter II); Xu, Brown, Li, Ma, this conference; Sun, Buhler, RECOMB 2004 (Mandala)

18 Example: (18,3)-problem (cont). Family F = { ##-#-####, ###---#--##-# } solves the (18,3)-problem: –every (18,3)-instance contains an occurrence of a seed of F –all seeds of the family have the same weight 7

19 Example: (18,3)-problem (cont)
w=7: { ##-#-####, ###---#--##-# }
w=9: { ##-##-#####, ###-####--##, ###-##---#-###, ##----####-###, ###---#-#-##-##, ###-#-#-#-----### }

20 Comparative selectivity
w=4: { #### } ~39·10^-4
w=5: { ###-## } ~9.8·10^-4
w=7: { ##-#-####, ###---#--##-# } ~1.2·10^-4
w=9: { ##-##-#####, ###-####--##, ###-##---#-###, ##----####-###, ###---#-#-##-##, ###-#-#-#-----### } ~0.23·10^-4
Selectivity of families on Bernoulli similarities (p(match) = 1/4), estimated as the probability for one of the seeds to occur at a given position
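The quoted selectivity figures can be reproduced with a short calculation. The sketch below assumes the estimate is the sum over the seeds of p^weight, i.e. a union bound that ignores positions where several seeds occur at once; the function name `selectivity` is illustrative.

```python
def selectivity(seeds: list[str], p_match: float = 0.25) -> float:
    """Estimated probability that at least one seed of the family occurs
    at a given position of a Bernoulli sequence (union-bound estimate)."""
    return sum(p_match ** seed.count("#") for seed in seeds)


if __name__ == "__main__":
    families = {
        4: ["####"],
        5: ["###-##"],
        7: ["##-#-####", "###---#--##-#"],
        9: ["##-##-#####", "###-####--##", "###-##---#-###",
            "##----####-###", "###---#-#-##-##", "###-#-#-#-----###"],
    }
    for w, family in families.items():
        print(f"w={w}: {selectivity(family) * 1e4:.2f} x 10^-4")
```

Under this assumption, one seed of weight 4 gives 1/4^4, two seeds of weight 7 give 2/4^7 and six seeds of weight 9 give 6/4^9, which round to the values on the slide.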

21 How far should we go  A trivial extreme solution... –would be to pick all seeds of weight m - k. –prohibitive cost except for very small problems  We are interested in intermediate solutions: –relatively small number of seeds (< 10) to keep the hash table of a reasonable size, –the seed weight sufficiently large to obtain a good selectivity

22 Results  Computing properties of seed families  Seed design –Seed expansion/contraction –Periodic seeds –Seed optimality –Heuristic seed design  Experiments –Examples of designed seed families –Application to computing specific oligonucleotides  Conclusions

23 Measuring the efficiency of a family. Optimal threshold T_F(m,k) (Burkhardt & Kärkkäinen): minimal number of seed occurrences over all (m,k)-instances. A seed family F is lossless iff the optimal threshold T_F(m,k) ≥ 1. T_F(m,k) can be computed by a dynamic programming algorithm in time O(m·k·2^(S+1)) and space O(k·2^(S+1)), where S is the maximal length of a seed from F; optimizations are possible (see the paper); the resulting space complexity is the same as in the Burkhardt & Kärkkäinen algorithm
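For small problems the optimal threshold can be checked by plain enumeration. The sketch below is not the dynamic programming algorithm mentioned on the slide; it is an exponential brute force, and the names `count_occurrences` and `optimal_threshold` are illustrative. It enumerates only instances with exactly k mismatches, on the grounds that removing a mismatch can never remove a seed occurrence.

```python
from itertools import combinations


def count_occurrences(seed: str, mismatches: set[int], m: int) -> int:
    """Number of placements of the seed inside a window of length m such
    that no '#' position falls on a mismatch."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return sum(
        all(p + i not in mismatches for i in required)
        for p in range(m - len(seed) + 1)
    )


def optimal_threshold(seeds: list[str], m: int, k: int) -> int:
    """Brute-force T_F(m,k): minimum, over all placements of k mismatches
    in a window of length m, of the total number of seed occurrences."""
    return min(
        sum(count_occurrences(s, set(errs), m) for s in seeds)
        for errs in combinations(range(m), k)
    )


def is_lossless(seeds: list[str], m: int, k: int) -> bool:
    """A family is lossless for the (m,k)-problem iff its threshold is >= 1."""
    return optimal_threshold(seeds, m, k) >= 1


if __name__ == "__main__":
    print(optimal_threshold(["####"], 18, 3))
    print(is_lossless(["#####"], 18, 3))
```

The same function accepts a multi-seed family such as { ##-#-####, ###---#--##-# }; for (18,3) the enumeration covers only 816 mismatch placements.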

24 Measuring the efficiency of a family (cont). Using a similar DP technique we can compute, within the same time complexity bound: the number U_F(m,k) of undetected (m,k)-similarities for a (lossy) family F; the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed [see the paper for details]
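Both quantities can be illustrated by enumeration on small problems. This is a brute-force sketch rather than the DP technique of the talk, and it assumes that an (m,k)-similarity is counted as one placement of exactly k mismatches in a window of length m; the function names are hypothetical.

```python
from itertools import combinations


def detects(seed: str, mismatches: set[int], m: int) -> bool:
    """True if the seed fits somewhere in the window without touching a mismatch."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(p + i not in mismatches for i in required)
        for p in range(m - len(seed) + 1)
    )


def undetected_and_contributions(seeds: list[str], m: int, k: int):
    """Return (number of undetected instances, {seed: number of instances
    detected by that seed and by no other seed of the family})."""
    undetected = 0
    exclusive = {s: 0 for s in seeds}
    for errs in combinations(range(m), k):
        mism = set(errs)
        hitting = [s for s in seeds if detects(s, mism, m)]
        if not hitting:
            undetected += 1
        elif len(hitting) == 1:
            exclusive[hitting[0]] += 1
    return undetected, exclusive


if __name__ == "__main__":
    print(undetected_and_contributions(["#####"], 18, 3))
    print(undetected_and_contributions(["#####", "####"], 18, 3))
```

A seed with a small exclusive count is a natural candidate for replacement, since the rest of the family already covers most of what it detects.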

25 Design of seed families. Pruning exhaustive search tree (Burkhardt & Kärkkäinen) –Construct all solutions of weight w from solutions of weight w – 1 –Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w. –Prohibitive cost: more than a week for computing all single-seed solutions of the (50,5)-problem; the search space blows up for multi-seed families

26 Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# and #-#-#---#-----#-#-#---#-----#-#-#---#

27 Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# and #-#-#---#-----#-#-#---#-----#-#-#---#. The former, ###-#--###-#--###-#, is the only solution of weight 12 of the (25,2)-problem

28 Seed expansion/contraction. Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem: ###-#--###-#--###-# and #-#-#---#-----#-#-#---#-----#-#-#---#. The former, ###-#--###-#--###-#, is the only solution of weight 12 of the (25,2)-problem –The i-regular expansion of a family F is obtained by inserting i-1 jokers between successive positions of each seed of F –Example: if F = { ###-#, ##-## }, then its 2-regular expansion is { #-#-#---#, #-#---#-# } and its 3-regular expansion is { #--#--#-----#, #--#-----#--# }
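The expansion and its inverse are simple string operations, sketched here with seeds written as strings over '#' and '-'. The function names `expand_seed`, `expand_family` and `contract_seed` are illustrative.

```python
def expand_seed(seed: str, i: int) -> str:
    """i-regular expansion: insert i-1 jokers between successive positions."""
    return ("-" * (i - 1)).join(seed)


def expand_family(family: list[str], i: int) -> list[str]:
    """Apply the i-regular expansion to every seed of a family."""
    return [expand_seed(seed, i) for seed in family]


def contract_seed(seed: str, i: int) -> str:
    """Inverse operation: keep every i-th position. Raises ValueError if the
    seed is not an i-regular expansion (a dropped position holds a '#')."""
    dropped = [c for pos, c in enumerate(seed) if pos % i != 0]
    if "#" in dropped:
        raise ValueError(f"{seed!r} is not an {i}-regular expansion")
    return seed[::i]


if __name__ == "__main__":
    print(expand_family(["###-#", "##-##"], 2))
    print(expand_family(["###-#", "##-##"], 3))
    print(expand_seed("###-#--###-#--###-#", 2))
```

Applied to ###-#--###-#--###-# with i=2, the expansion yields the second of the two weight-12 seeds listed for the (50,5)-problem.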

29 Seed expansion/contraction (cont). Lemma: –If a family F solves the (m,k)-problem, then both F and its i-regular expansion solve the (i·m, (k+1)·i-1)-problem –If the i-regular expansion of a family F solves the (i·m,k)-problem, then its i-contraction F solves the (m, ⌊k/i⌋)-problem. Example (i=2): F = { ##-#-####, ###---#--##-# } solves the (18,3)-problem; both F and its 2-regular expansion { #-#---#---#-#-#-#, #-#-#-------#-----#-#---# } solve the (36,7)-problem
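The parameter arithmetic of the lemma can be sketched as follows. The error bound for expansion is taken as (k+1)·i-1, which is what the (18,3) to (36,7) example requires, and the bound for contraction is assumed to be the integer part of k/i, consistent with the (50,5) and (25,2) seeds quoted earlier. Function names are hypothetical.

```python
def expansion_problem(m: int, k: int, i: int) -> tuple[int, int]:
    """Problem solved by F and by its i-regular expansion when F solves (m,k).

    Pigeonhole: cut a window of length i*m into i parts of length m (blocks
    for F, residue classes modulo i for the expansion). With at most
    (k+1)*i - 1 errors, some part carries at most k of them.
    """
    return i * m, (k + 1) * i - 1


def contraction_problem(im: int, k: int, i: int) -> tuple[int, int]:
    """Problem solved by the i-contraction when the expanded family solves (im,k)."""
    if im % i != 0:
        raise ValueError("window length must be a multiple of i")
    return im // i, k // i


if __name__ == "__main__":
    print(expansion_problem(18, 3, 2))
    print(contraction_problem(50, 5, 2))
```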

30 Periodic seeds Iterating short seeds with good properties into longer seeds ###-#--###-#--###-# ###-#--

31 Cyclic problem. Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Q_i = [Q, -^(m-s(Q))]^i · Q solves the linear (m·(i+1)+s(Q)-1, k)-problem, where s(Q) is the span of Q. Cyclic (11,3)-problem: ###-#--#--- Linear (30,3)-problem: ###-#--#---###-#--#
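The construction of the lemma can be sketched as a string operation: pad Q with jokers up to the cycle length m, repeat the padded block i times, and append one more copy of Q. The trailing copy is read off the example on the slide, and the function names `periodic_seed` and `linear_window_length` are illustrative.

```python
def periodic_seed(q: str, m: int, i: int) -> str:
    """Build Q_i: i copies of Q padded with jokers to length m, then Q."""
    if len(q) > m:
        raise ValueError("seed span exceeds the cycle length m")
    padded = q + "-" * (m - len(q))
    return padded * i + q


def linear_window_length(q: str, m: int, i: int) -> int:
    """Window length m*(i+1) + s(Q) - 1 given by the lemma, with s(Q) = len(q)."""
    return m * (i + 1) + len(q) - 1


if __name__ == "__main__":
    print(periodic_seed("###-#--#", 11, 1))
    print(periodic_seed("###-#", 7, 2), linear_window_length("###-#", 7, 2))
```

With Q = ###-#, m = 7 and i = 2 the sketch yields ###-#--###-#--###-# and a window length of 25, matching the weight-12 seed quoted for the (25,2)-problem.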

32 Extension to multi-seed case Cyclic ( 11,3 )-problem Linear ( 25,3 )-problem ###-#--#--- ###-#--#---###-#--# #--#---###-#--#---###

33 Extension to multi-seed case Cyclic ( 11,3 )-problem Linear ( 25,3 )-problem ###-#--#--- ###-#--#---###-#--# #--#---###-#--#---###

34 Asymptotic optimality. Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then the fraction of jokers tends to 0, but the convergence speed depends on k. Seed expansion cannot provide an asymptotically optimal solution

35 Non-asymptotic optimality. Fix a number of errors k. For each seed (seed family) Q there exists m_Q such that for all m ≥ m_Q, Q solves the (m,k)-problem. For a class of seeds, Q is an optimal seed in the class iff Q realizes the minimal m_Q over all seeds of the class. Lemma: Let n be an integer and r = n/3. For every k ≥ 2, the seed #^(n-r) - #^r is optimal among seeds of weight n with one joker.

36 Heuristic seed design: genetic algorithm. A population of seed families evolves by mutation and crossover; seed families are screened against sets of difficult (m,k)-instances; for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families; compute the contribution of each seed of the family. Mutate the less “valuable” seeds. [Diagram labels: difficult (m,k)-instances; seed families; select and reorder; select]
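The loop below is a heavily simplified stand-in for this heuristic, not the genetic algorithm of the talk: it keeps a single family, uses mutation only (no crossover, no population), mutates a randomly chosen seed rather than the least valuable one, and scores a family by brute-force enumeration of all placements of exactly k mismatches instead of screening against difficult instances and running the DP. All names are hypothetical.

```python
import random
from itertools import combinations


def detects(seed, mismatches, m):
    """True if the seed fits somewhere in the window without touching a mismatch."""
    required = [i for i, c in enumerate(seed) if c == "#"]
    return any(
        all(p + i not in mismatches for i in required)
        for p in range(m - len(seed) + 1)
    )


def undetected(family, instances, m):
    """Number of instances that no seed of the family detects."""
    return sum(
        1 for mism in instances if not any(detects(s, mism, m) for s in family)
    )


def mutate(seed, rng):
    """Swap one interior '#' with one '-' so that weight and span are kept."""
    interior = range(1, len(seed) - 1)
    hashes = [i for i in interior if seed[i] == "#"]
    jokers = [i for i in interior if seed[i] == "-"]
    if not hashes or not jokers:
        return seed
    i, j = rng.choice(hashes), rng.choice(jokers)
    chars = list(seed)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def evolve(family, m, k, generations=200, rng=None):
    """Hill climbing: accept a mutated family only if it misses fewer instances."""
    rng = rng or random.Random(0)
    instances = [set(c) for c in combinations(range(m), k)]
    best, best_score = list(family), undetected(family, instances, m)
    for _ in range(generations):
        if best_score == 0:  # the family is already lossless
            break
        candidate = list(best)
        idx = rng.randrange(len(candidate))
        candidate[idx] = mutate(candidate[idx], rng)
        score = undetected(candidate, instances, m)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score


if __name__ == "__main__":
    print(evolve(["###--##", "##-#-##"], 14, 2, generations=50))
```

Because the score enumerates every mismatch placement, this sketch is only practical for small m and k.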

37 Example: (25,2)-problem

38 Example: (25,3)-problem

39 Application of lossless filtering: oligo design. Specific oligonucleotides: small DNA molecules (10-50 bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome). Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
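The formalization can be written down directly as a quadratic brute force, which is what a lossless seed filter is meant to speed up. The sketch assumes that "elsewhere" means any other start position in the same sequence, including overlapping windows; the function names are illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def specific_windows(seq: str, m: int, k: int) -> list[int]:
    """Start positions of length-m windows that have no other window of seq
    within Hamming distance k (no filtering, every pair is compared)."""
    windows = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    return [
        i
        for i, w in enumerate(windows)
        if all(j == i or hamming(w, v) > k for j, v in enumerate(windows))
    ]


if __name__ == "__main__":
    # ACGT occurs twice, so windows 0 and 4 are not specific.
    print(specific_windows("ACGTACGT", 4, 0))
```

A lossless filter would replace the inner comparison with seed hits: only window pairs sharing an occurrence of some seed of the family need the full Hamming check, and no pair within k errors is missed.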

40 Seed design: (32,5)-problem

41 Experiment. This filter has been applied to the rice EST database (100015 sequences of total size ~42 Mbp). All 32-windows occurring elsewhere within 5 errors have been computed. The computation took slightly more than 1 hour on a P4 3GHz computer. 87% of the database has been “filtered out”

42 Further questions  Combinatorial structure of optimal seed families  Efficient design algorithm

43 Questions ? ? ?

