Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science,


Optimizing multi-pattern searches for compressed suffix arrays Kalle Karhu Department of Computer Science and Engineering Aalto University, School of Science, Finland

Outline of the presentation
  Problem setting
  Methods used in the work
    – Preprocessing of text
    – Preprocessing of patterns
    – Executing the search
  Data
  Runs
  Results
  Conclusive remarks
  Future work

Problem setting
  We want to search for text patterns (queries) in text databases
  Multiple sets, each containing a large number of long patterns
    – Here we handle a single set of 1000 patterns, each 1000 nucleotides long
  Multiple instances of preprocessed text
    – A compressed suffix array (CSA) can be used purely as an index alongside a separate plain copy of the text, or it can be stored alone, since it is a self-index
  In these experiments, both patterns and text were DNA

Problem setting
  Because a single pattern set can be searched in multiple texts and vice versa, preprocessing time does not limit the usefulness of the method
    – Time taken by preprocessing is amortized over a large number of searches
  It therefore makes sense to store the patterns in preprocessed form
  This leads to searching a preprocessed set of patterns in a preprocessed text

Methods – Preprocessing the text
  A compressed suffix array (CSA) was constructed from the text using the package available on the Pizza & Chili website (P. Ferragina and G. Navarro)
  The CSA has two main parameters:
    – Samplerate: the interval between two indices of the suffix array that are stored explicitly. The default value, 16, was used
    – Samplepsi: the interval between two indices of the psi function that are stored explicitly. The default value, 128, was used
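To illustrate what a suffix array sampling interval trades off, here is a minimal toy sketch in Python. It is not the Pizza & Chili implementation. The function names are hypothetical, psi is kept as a plain uncompressed list (so the Samplepsi parameter is not modelled), and samples are chosen by text position, which is one common choice and an assumption here. An unsampled suffix array entry is recovered by following psi until a sampled entry is reached.

```python
def build_suffix_array(text):
    # Naive construction by sorting suffixes; fine for a toy example only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_psi(sa):
    # psi[r] is the rank of the suffix that starts one text position
    # after the suffix of rank r (wrapping around at the end of the text).
    n = len(sa)
    rank = [0] * n
    for r, pos in enumerate(sa):
        rank[pos] = r
    return [rank[(pos + 1) % n] for pos in sa]


def sample_suffix_array(sa, sample_rate):
    # Keep only entries whose text position is a multiple of sample_rate.
    # Position 0 is always kept, so every psi walk terminates.
    return {r: pos for r, pos in enumerate(sa) if pos % sample_rate == 0}


def lookup(r, psi, samples, n):
    # Recover SA[r]: each psi step moves one position forward in the text,
    # so the answer is the sampled position minus the number of steps taken.
    steps = 0
    while r not in samples:
        r = psi[r]
        steps += 1
    return (samples[r] - steps) % n
```

A larger sampling interval stores fewer explicit entries but needs up to `sample_rate - 1` psi steps per lookup, which is the space-time tradeoff the parameter controls.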

Methods – Preprocessing patterns
  A collection of subpatterns of the patterns was extracted using the compression tool Re-Pair
  The principal idea is to find a set of subpatterns that occur in a large number of patterns but are rare in the text
  Assuming the letters of the text are independent and identically distributed, long patterns occur rarely
  Conveniently, Re-Pair produces phrases, which are long substrings that occur more than once in its input
  These phrases were scored simply by the number of times they occur in the pattern set
  Additionally, each subpattern was required to exceed a set length threshold
    – This limits the expected number of occurrences of the subpattern in the text
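The scoring and length filter can be sketched as follows. This is only an illustration under stated assumptions: the candidate phrases are taken as given (Re-Pair itself is not implemented), the score counts the patterns that contain a phrase, which is one possible reading of "number of times they occur in the pattern set", and the expected-occurrence estimate additionally assumes a uniform distribution over a four-letter alphabet. All function names are hypothetical.

```python
def expected_occurrences(text_length, phrase_length, alphabet_size=4):
    # Expected number of occurrences of one fixed phrase in an i.i.d.,
    # uniformly distributed text: (n - m + 1) / sigma^m.
    positions = max(text_length - phrase_length + 1, 0)
    return positions / alphabet_size ** phrase_length


def score_phrases(phrases, patterns, min_length):
    # Drop phrases below the length threshold, then score each remaining
    # phrase by the number of patterns that contain it.
    scored = []
    for phrase in set(phrases):
        if len(phrase) < min_length:
            continue
        count = sum(1 for pattern in patterns if phrase in pattern)
        if count > 0:
            scored.append((count, phrase))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored
```

Under these assumptions, a phrase of length 25 in a text of about 5 × 10^7 characters has an expected occurrence count on the order of 10^-8, which is why a length threshold keeps spurious text hits rare. Real DNA is not i.i.d., so the true counts can be much higher.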

Methods – doing the search
  Search for the preprocessed subpatterns in the CSA using locate
    – Locate takes $O(m \log n + \mathrm{occ} \cdot \log^{\varepsilon} n)$ time for a query of length $m$ with $\mathrm{occ}$ occurrences in a text of length $n$, where $0 < \varepsilon < 1$ sets the space-time tradeoff
  Extend each initial subpattern match by character-by-character comparison to check whether it is an exact match of a full pattern
    – This is done for each pattern that includes the subpattern
  Stop after a set number of patterns have been handled this way
  Finish the search by using locate on the remaining full patterns
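The search procedure can be sketched roughly as below. A plain, uncompressed suffix array with binary search stands in for the CSA locate, which is an assumption made to keep the example self-contained. The names `locate` and `multi_search` and the `max_patterns` parameter are hypothetical, not the author's code.

```python
def build_suffix_array(text):
    # Naive construction; a stand-in for the CSA index.
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate(text, sa, query):
    # Binary search for the block of suffixes that start with query.
    m = len(query)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


def multi_search(text, sa, patterns, subpatterns, max_patterns):
    results = {}
    for sub in subpatterns:
        if len(results) >= max_patterns:
            break
        hits = None  # locate each subpattern at most once
        for idx, pattern in enumerate(patterns):
            if idx in results or len(results) >= max_patterns:
                continue
            offset = pattern.find(sub)
            if offset < 0:
                continue
            if hits is None:
                hits = locate(text, sa, sub)
            # Extend every subpattern hit to a full-pattern comparison.
            starts = [hit - offset for hit in hits]
            results[idx] = [s for s in starts
                            if s >= 0 and text.startswith(pattern, s)]
    # Remaining patterns are searched in full with locate.
    for idx, pattern in enumerate(patterns):
        if idx not in results:
            results[idx] = locate(text, sa, pattern)
    return results
```

Each subpattern is located once and its hits are reused for every pattern that contains it, which is where the saving over locating every full pattern would come from.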

Data
  The 50 MB DNA text was retrieved from the Pizza & Chili website
  1000 patterns, each 1000 nucleotides long, were generated from this text at random
    – That is, substrings of the text were extracted from random locations
  It later became apparent that every pattern occurs only once in the text, which would not necessarily always be the case
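Generating patterns this way, and checking how often each one occurs, could look like the following sketch. The function names, the fixed seed and the use of Python's `random` module are assumptions for illustration only. The original tooling is not described in the slides.

```python
import random


def sample_patterns(text, count, length, seed=0):
    # Draw substrings of the given length from uniformly random start
    # positions; a fixed seed makes the sample reproducible.
    last_start = len(text) - length
    if last_start < 0:
        raise ValueError("pattern length exceeds text length")
    rng = random.Random(seed)
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [text[s:s + length] for s in starts]


def count_occurrences(text, pattern):
    # Count all occurrences, including overlapping ones.
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count
```

Running `count_occurrences` on every sampled pattern is a simple way to detect the situation noted above, where each pattern happens to occur exactly once.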

Data
  The patterns were searched in the text index as described in the methods section
  Five thresholds were used for the minimum subpattern length: 25, 28, 30, 33 and 35
  For each threshold, the number of patterns handled by locating subpatterns was capped by ending this phase after 100, 300 or 500 patterns
    – However, because the subpatterns did not always occur in the full allowed number of patterns, fewer patterns were handled by locating and extending subpatterns in some runs
  The run times were compared with searching all of the patterns using the CSA's locate

Multi-pattern search on CSA
  Results, set of 1000 patterns
  Minimum subpattern length (Msl) = 30 → 14.0% decrease in run times

Multi-pattern search on CSA
  Results, level of individual patterns
  Msl = 35 → time per pattern was 71.6% less than with the traditional CSA

Results
  With the new method, searching for the subpatterns generally took around 85% of the time and checking for exact matches around 15%
  Memory consumption is not notably different
    – Phrases and their pattern-related information must be stored, but in practice this takes far less memory than the CSA
  Total preprocessing time for the pattern set was roughly 0.8 s

Conclusive remarks
  As the minimum subpattern length increases, the average time per pattern decreases
  Interestingly, the average time per pattern also decreases when more patterns are handled by the proposed method
    – This suggests that the subpatterns occurring most commonly in the pattern set are not the best ones
  A more sophisticated method for choosing subpatterns that occur in multiple patterns would be helpful
    – The proposed method would work on independent and identically distributed text, but DNA definitely does not have these properties

Future work
  Consider the k-mer distributions of the subpatterns and compare them with the k-mer distribution of the text
    – If the k-mer distribution of the text is unknown, sampling or other methods could be used
    – Hopefully this would lead to better estimates of the probability that a subpattern occurs in the text
  More work remains to be done on the sorting of the subpatterns
  This approach could also be implemented for searches using other index structures
    – Anything where the time taken by locate correlates strongly with the length of the query should work well
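One way the k-mer idea might be prototyped is sketched below. This is a heuristic of my own choosing for illustration, not a method from the presentation: a subpattern is scored by the mean log frequency of its k-mers in the text, so lower scores suggest a subpattern built from k-mers that are rare in the text. The function names and the `floor` value used for unseen k-mers are assumptions.

```python
import math
from collections import Counter


def kmer_distribution(sequence, k):
    # Relative frequency of every k-mer in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def mean_log_frequency(subpattern, text_dist, k, floor=1e-12):
    # Average log text-frequency of the subpattern's k-mers; k-mers never
    # seen in the text get the small floor frequency instead of zero.
    kmers = [subpattern[i:i + k] for i in range(len(subpattern) - k + 1)]
    if not kmers:
        raise ValueError("subpattern is shorter than k")
    return sum(math.log(text_dist.get(km, floor)) for km in kmers) / len(kmers)
```

If the full text is too large to scan, `kmer_distribution` could be run on a random sample of it instead, in line with the sampling suggestion above.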