Filter Algorithms for Approximate String Matching Stefan Burkhardt.

Presentation transcript:

Filter Algorithms for Approximate String Matching Stefan Burkhardt

Outline
- Motivation
- Filter Algorithms
- Gapped q-grams
- Experimental Analysis

Motivation: Why Approximate String Matching?
Computational Biology:
- EST Clustering
- Assembly
- Genome comparison (e.g. Human/Mouse)
Information Retrieval:
- Phonebooks
- Dictionaries
- Search Engines
Many more...

The global approximate string matching problem
Given a pattern P, a target S, an error level k and a string distance d(x,y):
Find all substrings y of S with d(P,y) ≤ k.
Example: P = GAT, S = ACTGATAACGTTAGCCATGG

The global approximate string matching problem
- d(x,y) = Hamming Distance: the k-mismatches problem
- d(x,y) = Edit Distance: the k-differences problem
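The two distances named on this slide can be sketched in Python as follows. This is a minimal illustration, not code from the talk: the function names are hypothetical, the edit distance uses the textbook dynamic-programming recurrence with unit costs, and `k_mismatch_occurrences` is a naive scan that solves the k-mismatches problem without any filtering.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]


def k_mismatch_occurrences(pattern: str, target: str, k: int) -> list[int]:
    """Start positions in target whose window is within Hamming distance k."""
    m = len(pattern)
    return [i for i in range(len(target) - m + 1)
            if hamming_distance(pattern, target[i:i + m]) <= k]
```

With the strings from the slides, `k_mismatch_occurrences("GAT", "ACTGATAACGTTAGCCATGG", 1)` reports the start positions 3, 9 and 15 (the windows GAT, GTT and CAT).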

Filter Algorithms
Input: pattern P and target S.
1. Filtration Phase: the filter algorithm applies the filter criterion to P and S and reports potential matches.
2. Verification Phase: an exact algorithm examines the potential matches and separates true matches from false matches.

BLAST (Altschul, Karlin, et al.):
- Sequential scan of S locates all q-grams matching P
- Iterative extension with cutoff to find good matches
Problems for high similarity:
- sequential scan is quite time consuming
- single q-grams are unspecific

Indexed Filter Algorithm
1. Preprocess: build an index of S.
2. Filtration Phase: the indexed filter algorithm uses P and the index to report potential matches.
3. Verification Phase: an exact algorithm examines the potential matches and separates true matches from false matches.

Indexed Filter Algorithm
Pro:
- potentially faster evaluation of the filter criterion
Con:
- preprocessing time
- extra space required
- only good for some filter criteria

QUASAR (Burkhardt, Rivals et al. 99):
- Filter Criterion: q-gram Lemma (Jokinen, Ukkonen 91)
- Index Structure: lookup table (Jokinen, Ukkonen 91) with suffix array (Manber, Myers 90)
- Match Detection: overlapping rectangles in DP-Matrix
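A q-gram lookup table of the kind an indexed filter relies on might look like the sketch below. It is a plain Python dictionary from each q-gram of S to its start positions. QUASAR combines a lookup table with a suffix array, which this stand-in does not reproduce, and the name `build_qgram_index` is hypothetical.

```python
from collections import defaultdict


def build_qgram_index(target: str, q: int) -> dict[str, list[int]]:
    """Map every q-gram of target to the sorted list of its start positions."""
    index: dict[str, list[int]] = defaultdict(list)
    for i in range(len(target) - q + 1):
        index[target[i:i + q]].append(i)
    return dict(index)  # plain dict, so missing q-grams raise KeyError
```

The preprocessing cost and the extra space are paid once per target. Each q-gram of P is then located with a single dictionary lookup instead of a scan of S.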

The q-gram Lemma (Jokinen, Ukkonen, 1991)
For a pattern P, a substring y of S and a value k, matches between P and y with at most k errors share at least t = |P| - q + 1 - kq substrings of length q (q-grams).
Example: P = TCGATTAC, |P| = 8, q = 3
- total # of q-grams: |P| - q + 1 = 6 (TCG, CGA, GAT, ATT, TTA, TAC)
- each error can 'destroy' up to q matching q-grams => for k errors lose at most kq q-grams
- with one error (TCGATTAC vs. TCGAATAC) the q-grams GAT, ATT and TTA are lost; TCG, CGA and TAC remain
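The lemma can be checked on the slide example with a few lines of Python. This is a sketch with hypothetical helper names. It uses the standard form of the threshold, t = |P| - q + 1 - kq, and counts shared q-grams as a multiset intersection.

```python
from collections import Counter


def qgrams(s: str, q: int) -> list[str]:
    """All overlapping substrings of length q, in order of occurrence."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m: int, q: int, k: int) -> int:
    """Lower bound on shared q-grams for a pattern of length m and k errors."""
    return m - q + 1 - k * q


def shared_qgram_count(p: str, y: str, q: int) -> int:
    """Number of q-grams common to p and y, counted with multiplicity."""
    common = Counter(qgrams(p, q)) & Counter(qgrams(y, q))
    return sum(common.values())
```

For P = TCGATTAC and y = TCGAATAC with q = 3 and k = 1, the threshold is 3 and the two strings share exactly the three q-grams TCG, CGA and TAC, so the bound is tight here. A threshold of zero or less means the lemma gives no filter for that choice of q and k.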

Match Detection (Jokinen, Ukkonen 91):
- overlapping rectangles of width 2|P| in DP-Matrix
- rectangle with at least t hits => potential match
Figure: DP-Matrix of P against S with rectangles containing 3 hits, 2 hits and 1 hit; for t = 3 only the rectangle with 3 hits is a potential match.
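Counting hits per rectangle can be sketched as below, under simplifying assumptions that are not from the talk: blocks of S of width 2|P| start every |P| positions, each q-gram hit is credited to the (at most two) blocks that contain its start position, and diagonals inside a block are ignored. All names are hypothetical.

```python
from collections import defaultdict


def candidate_blocks(pattern: str, target: str, q: int, k: int) -> list[tuple[int, int]]:
    """Return (start, end) ranges of target blocks holding at least t q-gram hits."""
    m = len(pattern)
    t = m - q + 1 - k * q  # q-gram lemma threshold
    if t <= 0:
        raise ValueError("threshold is not positive: the lemma gives no filter")

    # Lookup table: q-gram of the target -> start positions.
    index = defaultdict(list)
    for j in range(len(target) - q + 1):
        index[target[j:j + q]].append(j)

    # Block b covers target[b*m : b*m + 2*m]; consecutive blocks overlap by m.
    hits = defaultdict(int)
    for i in range(m - q + 1):
        for j in index.get(pattern[i:i + q], ()):
            b = j // m
            hits[b] += 1
            if b > 0:
                hits[b - 1] += 1  # j also lies in the previous block

    return sorted((b * m, min(b * m + 2 * m, len(target)))
                  for b, count in hits.items() if count >= t)
```

Because the blocks overlap by |P|, every window of length |P| lies completely inside at least one block, so under Hamming distance no occurrence is lost. Over-counting can only add false positives, which the verification phase removes.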

Match Detection (Jokinen, Ukkonen 91):
- overlapping rectangles of width 2|P| in DP-Matrix
- rectangle with at least t hits => potential match
QUASAR (Burkhardt, Rivals et al. 1999):
- wider rectangles are efficient in practice (2048 for QUASAR)

QUASAR (Burkhardt, Rivals et al. 1999):
- BLAST for the verification of the potential matches
- wider rectangles as match regions
- index is a combination of lookup table and suffix array
- used for EST-Clustering at the DKFZ in Heidelberg
- searches for EST-Clustering about 30 times faster than BLAST

Gapped q-grams
- A new (old?) idea
- Hamming Distance
- Finding good shapes

Gapped q-grams: A new (old?) idea
General idea:
- use gapped q-grams
- call the arrangement of gaps the shape
Example: gapped 3-shape ##.# (# = match, . = don't care)
TCGATTAC yields TC.A, CG.T, GA.T, AT.A, TT.C
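Extracting gapped q-grams for a given shape takes only a few lines. The sketch below assumes the shape is written as on the slide, with `#` for a match position and `.` for a don't-care position; the function name is hypothetical.

```python
def gapped_qgrams(text: str, shape: str) -> list[str]:
    """Slide the shape over text and keep only the characters under '#'."""
    span = len(shape)
    return ["".join(text[start + i] if mark == "#" else "."
                    for i, mark in enumerate(shape))
            for start in range(len(text) - span + 1)]
```

For the text TCGATTAC and the shape `##.#` this yields the five gapped q-grams TC.A, CG.T, GA.T, AT.A and TT.C. A shape without gaps, such as `###`, gives the ordinary contiguous q-grams.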

Previous work...
- Califano, Rigoutsos (1993)
- Pevzner, Waterman (1995)
- Lehtinen, Sutinen, Tarhio (1996)
- limited attention paid to choice of shapes
- no exact threshold for the general case given
Recently...
- Buhler (2001): Multiple Shapes
- Ma, Tromp, Li (2002): PatternHunter
- threshold t = 1

The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors (O = match, X = error).
classic 3-shape ###, k = 3:
  OOXOOXOOXOO
  q-grams: OOX OXO XOO OOX OXO XOO OOX OXO XOO
  t = 0, no filter!
gapped 3-shape ##.#, k = 3:
  OOOXXOOXOOO
  q-grams: OO.X OO.X OX.O XX.O XO.X OO.O OX.O XO.O
  t = 1 (only OO.O survives)

The Threshold t
Definition: t is the number of remaining q-grams in a worst-case placement of k errors.
- classic 3-shape ###, k = 3: t = 0, no filter!
- gapped 3-shape ##.#, k = 3 (OOOXXOOXOOO): t = 1
- gapped shapes can have higher(!) thresholds t than ungapped shapes
- no simple formula for t
- we used a DP-based approach to compute t
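For small patterns the threshold can be checked by exhaustive search. The sketch below is not the DP-based approach mentioned on the slide: it enumerates every placement of k errors under Hamming distance and is only practical for short patterns and small k. The function name and the `#`/`.` shape encoding are assumptions.

```python
from itertools import combinations


def threshold(shape: str, m: int, k: int) -> int:
    """Fewest error-free shape placements over all ways to put k errors in m positions."""
    offsets = [i for i, mark in enumerate(shape) if mark == "#"]
    starts = range(m - len(shape) + 1)
    worst = len(starts)  # upper bound: no placement destroyed
    for errors in combinations(range(m), k):
        hit = set(errors)
        surviving = sum(1 for s in starts
                        if all(s + o not in hit for o in offsets))
        worst = min(worst, surviving)
    return worst
```

For a pattern of length 11 and k = 3 this reproduces the slide: `threshold("###", 11, 3)` is 0 and `threshold("##.#", 11, 3)` is 1.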

Gapped q-grams: Finding good shapes
Figure: filter tradeoff diagram. One axis shows the # of q-gram hits (filtration time), the other the # of potential matches (verification time), each running from low to high. A tradeoff line is parameterised by q (low to high), with regions marked 'good filters' and 'bad filters'.

Finding good shapes
Figure: tradeoff diagram (tradeoff line parameterised by q, regions marked 'good filters' and 'bad filters'). The # of q-gram hits scales as |S| * 1/|Σ|^q; the corresponding expression for the # of potential matches is left open ('?').

Finding good shapes
For |P| = 13, k = 3 and q = 3 the shapes ##.# and ### both have a threshold of t = 2. However, the gapped shape returns fewer potential matches.
Reason: with ##.# a random match requires 5 matching characters instead of only 4 for the ungapped q-gram ###. This makes random matches less likely.

Finding good shapes
We define the minimum coverage c_m as the minimum number of matching characters for any distinct arrangement of t matching shapes in P and S.
For t = 2 and the shape ##.# the minimum coverage is 5.
Example: CGACGATTGAT and ACTCGATTAGA agree only on the five characters CGATT (positions 4 to 8), which two overlapping ##.# shapes cover exactly.
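The minimum coverage can also be computed by brute force for small cases. The sketch below takes the definition literally: it tries every choice of t distinct start positions of the shape inside a pattern of length m and counts how many distinct characters the chosen placements cover. The function name is hypothetical, and the enumeration is only meant for small m and t.

```python
from itertools import combinations


def minimum_coverage(shape: str, m: int, t: int) -> int:
    """Smallest number of positions covered by t distinct placements of shape."""
    offsets = [i for i, mark in enumerate(shape) if mark == "#"]
    starts = range(m - len(shape) + 1)
    return min(len({s + o for s in chosen for o in offsets})
               for chosen in combinations(starts, t))
```

With |P| = 13 and t = 2 this gives 5 for `##.#` and 4 for `###`, matching the numbers quoted on the slides.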

Finding good shapes
Figure: tradeoff diagram with both axes labelled (tradeoff line, regions marked 'good filters' and 'bad filters').
- # of q-gram hits: |S| * 1/|Σ|^q, with q from low to high
- # of potential matches: |S| * 1/|Σ|^c_m, with c_m from low to high

Finding good shapes
Compute t and minimum coverage for all shapes with |P| = 50 and k = 3, 4, 5, 6.
Figure: number of shapes with a given minimum coverage for k = 5, q = 8, grouped by threshold (t = 1 to t = 5), with the median, contiguous and best shapes marked.

Experimental Analysis
- Speed and Filtration Efficiency
- The Heuristic Zone

Experimental Analysis: Speed and Filtration Efficiency
Figure: hits and matches plotted against q and minimum coverage for gapped (Hamming) and contiguous filters; k = 5, |P| = 50, |S| = 50 Mbp.

From Hits to Matches: Describing Filter Properties
Filters usually have 3 'recognition zones' depending on k:
1. Guarantee zone (finds all approximate matches)
2. Heuristic zone (finds some of the approximate matches)
3. Negative zone (guaranteed not to find matches)
Figure: recognition rate (0% to 100%) plotted against the number of errors (0 to |P|), with k and |P|-mc marked on the error axis.

The Heuristic Zone
Figure: recognition rate (0% to 100%) plotted against the number of errors (0 to |P|), with the heuristic zone marked between k and |P|-mc.
Problem: behaviour in the heuristic zone is hard to predict.

The Heuristic Zone
A simple idea: Sampling!
For a value i:
1. Generate s sample strings with i random errors each
2. Run a filter algorithm on these samples
3. Record how many strings were recognized (in percent)
This allows an experimental evaluation of the heuristic zone.
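The sampling procedure can be sketched as follows, under a simplified model that is an assumption of this example and not taken from the talk: errors are substitutions (Hamming distance), a sample is reduced to the set of its error positions, and the filter recognizes a sample when at least t placements of the shape are free of errors. Chance hits at shifted positions are ignored, and all names are hypothetical.

```python
import random


def recognition_rate(shape: str, m: int, t: int, errors: int,
                     samples: int, rng: random.Random) -> float:
    """Percentage of random error placements that still leave >= t intact q-grams."""
    offsets = [i for i, mark in enumerate(shape) if mark == "#"]
    starts = range(m - len(shape) + 1)
    recognized = 0
    for _ in range(samples):
        # Step 1: one sample = a pattern copy with `errors` random error positions.
        hit = set(rng.sample(range(m), errors))
        # Step 2: apply the filter criterion (count error-free shape placements).
        surviving = sum(1 for s in starts
                        if all(s + o not in hit for o in offsets))
        # Step 3: record whether the sample was recognized.
        if surviving >= t:
            recognized += 1
    return 100.0 * recognized / samples
```

Calling the function for every error level from 0 to |P| traces out a recognition-rate curve. In this model the rate is 100% up to the k for which t was computed and 0% once fewer than c_m characters remain error-free; the values in between describe the heuristic zone.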

The Heuristic Zone
Figure: recognition rate (0% to 100%; the final build zooms to 50% to 100%) plotted against the number of errors, from samples at each error level (the value of |P| and the sample count are missing from the transcript). Legend entries recoverable from the transcript:
- contiguous: k=3, q=11 and k=4, q=9
- gapped, edit: k=5, q=10; k=4, q=11; k=3, q=11
- BLAST
- two further curves labelled k=3, q=11 and k=4, q=10

Conclusion - Future Work
Our Work:
- Significant sensitivity improvement over existing filters
- Required modifications easy to implement
- Methods for describing filter properties
Future Work:
- Combination of 'orthogonal' shapes into one filter
- Use of word neighborhoods
- Database of filter properties for good shapes