Multi-seed lossless filtration Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology,

Presentation transcript:

Multi-seed lossless filtration Gregory Kucherov Laurent Noé LORIA/INRIA, Nancy, France Mikhail Roytberg Institute of Mathematical Problems in Biology, Puschino, Russia CPM (Istanbul) July 5-7, 2004

Text filtration: general principle potential matches


Text filtration: general principle lossless and lossy filters true match

Filtration applied to sequence comparison potential similarities

Filtration applied to sequence alignment potential similarities

Filtration applied to sequence alignment true similarities

Gapless similarities. Hamming distance.
- Similarities are defined through the Hamming distance
  GCTACGACTTCGAGCTGC...CTCAGCTATGACCTCGAGCGGCCTATCTA...
- (m,k)-problem, (m,k)-instances: an (m,k)-instance is a gapless similarity of length m with at most k mismatches; the (m,k)-problem is to detect all (m,k)-instances
- This work: lossless filtering
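The definition above can be illustrated with a minimal Python sketch. The function names `hamming` and `is_mk_instance` are hypothetical and chosen only for this illustration; they do not come from the talk.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_mk_instance(a: str, b: str, m: int, k: int) -> bool:
    """True if a and b are two length-m fragments with at most k mismatches."""
    return len(a) == len(b) == m and hamming(a, b) <= k
```

For instance, `is_mk_instance("ACGT", "ACGA", 4, 1)` holds because the two fragments differ in a single position.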

Filtering by contiguous fragment
- PEX (Navarro & Raffinot 2002)
  - Searching for a contiguous pattern
- PEX with errors
  - Searching for a contiguous pattern with l possible errors requires retrieval of all l-variants in the index. Efficient for
    - small alphabets (DNA, RNA)
    - relatively small l (<= 2)
- Example for the (m,k)-problem with m=18, k=3: #### (exact search), ######### (1) (search with one error)
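The pigeonhole idea behind PEX can be sketched as follows: a pattern cut into k + 1 contiguous pieces must contain at least one piece untouched by any of k errors. The helper name `pex_pieces` and the choice of near-equal piece lengths are assumptions made for this illustration.

```python
def pex_pieces(pattern: str, k: int) -> list[str]:
    """Split pattern into k + 1 contiguous pieces of near-equal length.

    With at most k substitution errors, at least one piece is error-free,
    so searching each piece exactly cannot miss an occurrence.
    """
    if k < 0 or k + 1 > len(pattern):
        raise ValueError("need 0 <= k < len(pattern)")
    base, extra = divmod(len(pattern), k + 1)
    pieces, start = [], 0
    for idx in range(k + 1):
        # The first `extra` pieces get one additional character.
        size = base + (1 if idx < extra else 0)
        pieces.append(pattern[start:start + size])
        start += size
    return pieces
```

For m = 18 and k = 3 the shortest piece has length 4, which corresponds to the contiguous seed #### in the example.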

Superposition of two filters
Pevzner & Waterman 1995
Idea: combine PEX with another filter based on a regularly-spaced seed
- PEX: ####
- spaced PEX (matches occurring at every (k+1)-th position): #---#---#---#

Spaced seeds
- Spaced seeds (spaced Q-grams)
  - proposed by Burkhardt & Kärkkäinen (CPM 2001) for solving (m,k)-problems
- Principle
  - Searching for spaced rather than contiguous patterns
  - Selectivity defined by the weight of the seed (number of #'s)
- Example: ###-##

Example: (18,3)-problem ###-##
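A spaced seed can be matched against a similarity written as a string of matches and mismatches. The sketch below assumes the encoding '1' = match, '0' = mismatch for the similarity and '#' = required match, '-' = joker for the seed; `seed_hits` and `weight` are hypothetical names.

```python
def weight(seed: str) -> int:
    """Weight of a seed: the number of '#' (required match) positions."""
    return seed.count('#')


def seed_hits(seed: str, similarity: str) -> list[int]:
    """Start positions at which the seed occurs in the similarity.

    The seed occurs at a position if every '#' of the seed is aligned
    with a '1' (match); jokers '-' accept both matches and mismatches.
    """
    span = len(seed)
    return [
        pos
        for pos in range(len(similarity) - span + 1)
        if all(c == '-' or similarity[pos + j] == '1' for j, c in enumerate(seed))
    ]
```

A mismatch under the joker is tolerated (`seed_hits("###-##", "111011")` reports position 0), while a mismatch under a '#' prevents the hit.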

Spaced seeds for sequence comparison
- Ma, Tromp, Li 2002 (PatternHunter)
- Estimating seed sensitivity: Keich et al 2002, Buhler et al 2003, Brejova et al 2003, Choi&Zhang 2004, Choi et al 2004, Kucherov et al 2004,...
- Extended seed models: BLASTZ 2003, Brejova et al 2003, Chen&Sung 2003, Noé&Kucherov 2004,...

Families of spaced seeds
This work: lossless filtration using spaced seed families (extension of Burkhardt & Kärkkäinen 2001)
- single filter based on several distinct seeds
- each seed detects a part of the (m,k)-instances, but together they must detect all (m,k)-instances
Independent work (lossy seed families for sequence alignment):
- Li, Ma, Kisman, Tromp 2004 (PatternHunter II)
- Xu, Brown, Li, Ma, this conference
- Sun, Buhler, RECOMB 2004 (Mandala)

Example: (18,3)-problem (cont)
Family F solves the (18,3)-problem:
F = { ##-#-####, ###---#--##-# }
- every (18,3)-instance contains an occurrence of a seed of F
- all seeds of the family have the same weight 7
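For small parameters, the claim that a family solves an (m,k)-problem can be checked by exhaustive enumeration. The sketch below is a brute-force check, not the dynamic programming algorithm mentioned later in the talk; the function names are hypothetical. It enumerates only instances with exactly k mismatches, which suffices because removing a mismatch can never destroy a seed occurrence.

```python
from itertools import combinations


def occurs(seed: str, mismatches: set[int], pos: int) -> bool:
    """True if the seed placed at pos has no '#' over a mismatch position."""
    return all(c == '-' or (pos + j) not in mismatches for j, c in enumerate(seed))


def detects(family: list[str], m: int, mismatches: set[int]) -> bool:
    """True if some seed of the family occurs somewhere in the length-m window."""
    return any(
        occurs(seed, mismatches, pos)
        for seed in family
        for pos in range(m - len(seed) + 1)
    )


def is_lossless(family: list[str], m: int, k: int) -> bool:
    """Brute-force test that the family detects every (m,k)-instance (k <= m)."""
    return all(detects(family, m, set(errs)) for errs in combinations(range(m), k))
```

The call `is_lossless(["##-#-####", "###---#--##-#"], 18, 3)` runs this check for the family F above over all 816 placements of three mismatches.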

Example: (18,3)-problem (cont)
w=7: ##-#-#### ###---#--##-#
w=9: ##-##-##### ###-####--## ###-##---#-### ##----####-### ###---#-#-##-## ###-#-#-#-----###

Comparative selectivity
w=4  ####  ~
w=5  ###-##  ~
w=7  ##-#-#### ###---#--##-#  ~
w=9  ##-##-##### ###-####--## ###-##---#-### ##----####-### ###---#-#-##-## ###-#-#-#-----###  ~
Selectivity of families on Bernoulli similarities (p(match) = 1/4), estimated as the probability for one of the seeds to occur at a given position
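A rough way to compare families is to sum, over the seeds, the probability that a single seed occurs at a given position under the Bernoulli model. This is an upper bound (union bound) on the probability that at least one seed occurs, and it is an assumption of this sketch rather than necessarily the estimate used for the slide; `selectivity_bound` is a hypothetical name.

```python
from fractions import Fraction


def selectivity_bound(family: list[str], p_match: Fraction = Fraction(1, 4)) -> Fraction:
    """Upper bound on P(some seed of the family occurs at a given position).

    A seed of weight w occurs at a fixed position with probability p_match ** w;
    summing over the seeds ignores overlaps between the events (union bound).
    """
    return sum(p_match ** seed.count('#') for seed in family)
```

Using exact fractions keeps the comparison between families free of floating-point rounding; convert with `float(...)` for display.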

How far should we go?
- A trivial extreme solution...
  - would be to pick all seeds of weight m - k.
  - prohibitive cost except for very small problems
- We are interested in intermediate solutions:
  - relatively small number of seeds (< 10) to keep the hash table of a reasonable size,
  - the seed weight sufficiently large to obtain a good selectivity

Results
- Computing properties of seed families
- Seed design
  - Seed expansion/contraction
  - Periodic seeds
  - Seed optimality
  - Heuristic seed design
- Experiments
  - Examples of designed seed families
  - Application to computing specific oligonucleotides
- Conclusions

Measuring the efficiency of a family
- Optimal threshold (Burkhardt & Kärkkäinen): minimal number of seed occurrences over all (m,k)-instances
- A seed family F is lossless iff the optimal threshold T_F(m,k) ≥ 1
- T_F(m,k) can be computed by a dynamic programming algorithm in time O(m·k·2^(S+1)) and space O(k·2^(S+1)), where S is the maximal length of a seed from F
- optimizations are possible (see the paper)
- the resulting space complexity is the same as in the Burkhardt & Kärkkäinen algorithm
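The optimal threshold can be illustrated on small problems by direct enumeration. The sketch below is a brute-force substitute for the dynamic programming algorithm named above, whose details are in the paper; its cost grows with the number of mismatch placements, and the function names are hypothetical.

```python
from itertools import combinations


def count_occurrences(family: list[str], m: int, mismatches: set[int]) -> int:
    """Total number of seed occurrences of the family in one length-m instance."""
    return sum(
        1
        for seed in family
        for pos in range(m - len(seed) + 1)
        if all(c == '-' or (pos + j) not in mismatches for j, c in enumerate(seed))
    )


def optimal_threshold(family: list[str], m: int, k: int) -> int:
    """Minimal number of seed occurrences over all (m,k)-instances (k <= m).

    Instances with exactly k mismatches are enough: adding a mismatch
    never creates a new occurrence, so the minimum is reached there.
    """
    return min(
        count_occurrences(family, m, set(errs))
        for errs in combinations(range(m), k)
    )
```

A family is lossless exactly when this value is at least 1; for example, the contiguous seed ##### gives threshold 0 on the (18,3)-problem.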

Measuring the efficiency of a family (cont)
Using a similar DP technique we can compute, within the same time complexity bound:
- the number U_F(m,k) of undetected (m,k)-similarities for a (lossy) family F
- the contribution of a seed of F, i.e. the number of (m,k)-similarities detected exclusively by this seed
[see the paper for details]

Design of seed families
Pruning exhaustive search tree (Burkhardt & Kärkkäinen)
- Construct all solutions of weight w from solutions of weight w - 1
- Example: if ##--#--# and ##-#---# are solutions of weight w-1, consider their «union» ##-##--# of weight w.
- Prohibitive cost: more than a week for computing all single-seed solutions of the (50,5)-problem
- The search space blows up for multi-seed families
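The «union» step in the example can be written in a few lines. This sketch assumes both seeds have the same span and are aligned at their first position, as in the example; `seed_union` is a hypothetical name.

```python
def seed_union(a: str, b: str) -> str:
    """Position-wise union of two equal-span seeds: '#' wherever either has '#'."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles seeds of equal span")
    return ''.join('#' if '#' in (x, y) else '-' for x, y in zip(a, b))
```

Applied to the two weight-4 seeds of the example, it yields the weight-5 candidate ##-##--#.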


Seed expansion/contraction
Burkhardt & Kärkkäinen: the only two solutions of weight 12 solving the (50,5)-problem:
###-#--###-#--###-#
#-#-#---#-----#-#-#---#-----#-#-#---#
The first of them is the only solution of weight 12 of the (25,2)-problem.
- Let the i-regular expansion of F be the family obtained by inserting i-1 jokers between successive positions of each seed of F
- Example: if F = { ###-#, ##-## }, then its 2-regular expansion is { #-#-#---#, #-#---#-# } and its 3-regular expansion is { #--#--#-----#, #--#-----#--# }
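The i-regular expansion described above is easy to express in code. In this sketch, `expand` and `contract` are hypothetical names; `contract` is written as the inverse operation and rejects seeds that are not i-regular expansions.

```python
def expand(seed: str, i: int) -> str:
    """i-regular expansion: insert i - 1 jokers between successive positions."""
    if i < 1:
        raise ValueError("i must be at least 1")
    return ('-' * (i - 1)).join(seed)


def contract(seed: str, i: int) -> str:
    """Inverse of expand: keep every i-th position of an expanded seed."""
    contracted = seed[::i]
    if expand(contracted, i) != seed:
        raise ValueError("seed is not an i-regular expansion")
    return contracted


def expand_family(family: list[str], i: int) -> list[str]:
    """Apply the i-regular expansion to every seed of a family."""
    return [expand(seed, i) for seed in family]
```

Expanding ###-#--###-#--###-# with i = 2 reproduces the second of the two weight-12 seeds listed above.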

Seed expansion/contraction (cont)
Lemma:
- If a family F solves an (m,k)-problem, then both F and its i-regular expansion solve the (i·m, (k+1)·i-1)-problem
- If the i-regular expansion of a family F solves the (i·m,k)-problem, then its i-contraction F solves the (m, ⌊k/i⌋)-problem
Example (i = 2): F = { ##-#-####, ###---#--##-# } solves the (18,3)-problem, so F and its 2-regular expansion { #-#---#---#-#-#-#, #-#-#-------#-----#-#---# } solve the (36,7)-problem
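A short pigeonhole argument, written out for the (18,3) to (36,7) example with i = 2, may help to see where the second parameter comes from. It is a sketch of one possible justification and not necessarily the proof given in the paper; the symbols e_1, e_2, e_even, e_odd (mismatch counts in each part of the window) are introduced here for illustration.

```latex
% Window of length 36 = 2 * 18 containing at most 7 mismatches.
\begin{align*}
&\text{Contiguous halves: } e_1 + e_2 \le 7 \;\Rightarrow\; \min(e_1, e_2) \le 3, \\
&\quad\text{so one half is an } (18,3)\text{-instance and is detected by } F. \\
&\text{Even/odd positions: } e_{\mathrm{even}} + e_{\mathrm{odd}} \le 7
  \;\Rightarrow\; \min(e_{\mathrm{even}}, e_{\mathrm{odd}}) \le 3, \\
&\quad\text{so one parity class, read as a word of length 18, is an } (18,3)\text{-instance;} \\
&\quad\text{a seed of } F \text{ occurring in it is an occurrence of its 2-regular expansion in the window.} \\
&\text{With 8 mismatches both parts may contain 4, so 7 is the largest value this argument gives.}
\end{align*}
```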

Periodic seeds
Iterating short seeds with good properties into longer seeds
Example: ###-#--###-#--###-# is obtained by iterating the period ###-#--

Cyclic problem
Lemma: If a seed Q solves a cyclic (m,k)-problem, then the seed Q_i = [Q, -^(m-s(Q))]^i · Q solves the linear (m·(i+1)+s(Q)-1, k)-problem, where s(Q) is the span (length) of Q.
Example: ###-#--#--- (cyclic (11,3)-problem), ###-#--#---###-#--# (linear (30,3)-problem)
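The construction in the lemma, i copies of the seed padded with jokers to the cyclic length m followed by one more copy of the seed, can be sketched as below. The name `periodic_seed` is hypothetical, and the sketch only builds the seed; it does not verify that the seed solves the cyclic problem.

```python
def periodic_seed(q: str, m: int, i: int) -> str:
    """Build i copies of q padded with jokers to length m, followed by q."""
    pad = m - len(q)
    if pad < 0 or i < 0:
        raise ValueError("need len(q) <= m and i >= 0")
    return (q + '-' * pad) * i + q
```

With q = ###-#--#, m = 11 and i = 1 this gives the seed ###-#--#---###-#--# shown in the example.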

Extension to multi-seed case Cyclic ( 11,3 )-problem Linear ( 25,3 )-problem ###-#--#--- ###-#--#---###-#--# #--#---###-#--#---###


Asymptotic optimality
Theorem: Fix a number of errors k. Let w(m) be the maximal weight of a seed solving the linear (m,k)-problem. Then the fraction of the number of jokers tends to 0, but the convergence speed depends on k.
- seed expansion cannot provide an asymptotically optimal solution

Non-asymptotic optimality
- Fix a number of errors k.
- For each seed (seed family) Q there exists m_Q such that for all m ≥ m_Q, Q solves the (m,k)-problem
- For a class of seeds, Q is an optimal seed in the class iff Q realizes the minimal m_Q over all seeds of the class
Lemma: Let n be an integer and r=n/3. For every k ≥ 2, the seed #^(n-r)-#^r is optimal among seeds of weight n with one joker.

Heuristic seed design: genetic algorithm
- a population of seed families is evolving by mutating and crossing over
- seed families are screened against sets of difficult (m,k)-instances
- for a family that detects all difficult instances, the number of undetected similarities is computed by a DP algorithm. A family is kept if it yields a smaller number than currently known families
- compute the contribution of each seed of the family. Mutate the less “valuable” seeds.
[Diagram: seed families and difficult (m,k)-instances, with "select" and "select and reorder" steps]

Example: (25,2)-problem

Example: (25,3)-problem

Application of lossless filtering: oligo design
- Specific oligonucleotides: small DNA molecules (10-50 bp) that hybridize with a given target sequence and do not hybridize with the other background sequences (e.g. the rest of the genome)
- Formalization: given a sequence, find all windows of length m which do not occur elsewhere within k substitution errors
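The formalization above can be stated directly as a quadratic-time reference procedure, useful for checking a filter on tiny inputs. This naive sketch is not the filtration method of the talk; it assumes that "elsewhere" means any other start position in the same sequence, overlapping windows included, and `specific_windows` is a hypothetical name.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def specific_windows(seq: str, m: int, k: int) -> list[int]:
    """Start positions of length-m windows with no other window within k substitutions."""
    windows = [seq[i:i + m] for i in range(len(seq) - m + 1)]
    return [
        i
        for i, w in enumerate(windows)
        if all(i == j or hamming(w, v) > k for j, v in enumerate(windows))
    ]
```

A lossless filter reduces the number of window pairs that need this explicit comparison without changing the result.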

Seed design: (32,5)-problem

Experiment
- This filter has been applied to the rice EST database (sequences of total size ~42 Mbp)
- All 32-windows occurring elsewhere within 5 errors have been computed
- The computation took slightly more than 1 hour on a P4 3GHz computer
- 87% of the database has been “filtered out”

Further questions
- Combinatorial structure of optimal seed families
- Efficient design algorithm

Questions ? ? ?