An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data May/23/2008 PAKDD 2008 Takeaki Uno National Institute of Informatics,

Presentation transcript:

An Efficient Algorithm for Finding Similar Short Substrings from Large Scale String Data May/23/2008 PAKDD 2008 Takeaki Uno National Institute of Informatics, JAPAN & The Graduate University for Advanced Science

Motivation: Analyzing Huge Data
Recent information technology has given us many huge databases:
- Web, genome, POS, log, ...
"Construction" and "keyword search" can be done efficiently.
The next step is analysis: capturing features of the data
- statistics such as size, #rows, density, attributes, distribution, ...
Can we get more? Look at (simple) local structures, while keeping the analysis simple and basic.
[Figure: a genome database (strings such as ATGCGCCGTA, TAGCGGGTGG, TTCGCGTTAG, ...) and a table of experiment results (Experiment 1 to Experiment 4)]

Our Focus
Find all pairs of similar objects (or structures) (or, instead, a binary relation).
This is very basic and fundamental, so there would be many applications:
- detecting globally similar structures,
- constructing neighbor graphs,
- detecting locally dense structures (groups of related objects).
In this talk, we look at strings.

Existing Studies
There are many studies on similarity search (homology search): given a database, construct a data structure that lets us quickly find the objects similar to a given query object.
- strings with Hamming distance or edit distance
- points in the plane (k-d trees) or in Euclidean space
- sets
- constructing neighbor graphs (for smaller dimensions)
- genome sequence comparison (heuristics)
Both exact and approximate approaches exist.
All-pairs comparison does not work for large-scale data.

Approach from Algorithm Theory
Parallel computation is a popular way to speed up computation, but its high cost, including the difficulty of programming, is a disadvantage.
Improving the algorithm slows the growth of computation time with respect to the database size, by changing the design of the computation itself.
The gain in efficiency increases as the database size increases.
We approach the problem from the algorithmic point of view.
[Figure: speed-up versus database size; the legible labels are "size = 1,000,000" and "10,000 times"]

Our Problem
We address databases whose records are short strings.
Problem: given a database composed of n strings of the same fixed length l, and a threshold d, find all pairs of strings whose Hamming distance is at most d.
We propose an efficient algorithm, SACHICA (Scalable Algorithm for Characteristic/Homogenous Interval Calculation), and a method to detect long similar substrings of the input strings (especially efficient for genomic data).
[Figure: input strings ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, ... and the output pairs]
- ATGCCGCG, AAGCCGCC
- GCCTCTAT, GCTTCTAA
- TGTAATGA, GGTAATGG
- ...
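As a reference point for the problem statement, the straightforward all-pairs comparison can be sketched as follows. This is only the O(l n^2) baseline, not SACHICA; the function names `hamming` and `similar_pairs_naive` are hypothetical, and the four records are taken from the output pairs shown on the slide.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def similar_pairs_naive(strings, d):
    """Return all index pairs (i, j), i < j, with Hamming distance at most d.

    Baseline only: every pair is compared, so the cost is O(l * n^2).
    """
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(strings), 2)
        if hamming(a, b) <= d
    ]


records = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA"]
print(similar_pairs_naive(records, 2))  # [(0, 1), (2, 3)]
```

Any faster method should return the same set of pairs, so this baseline is mainly useful for checking results on small inputs.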

Approaching Long-string Similarity
When two strings S1 and S2 are similar, they must have several pairs of similar short substrings.
"Having several similar substrings" is a necessary condition for being similar strings.
Ex) for strings of length 3000 with Hamming distance 290 (about 10%):
- they have at least 3 pairs of substrings of length 30 with Hamming distance at most 2;
- the positions of these substrings differ by at most 30, if we allow deletions and insertions.
This gives a condition: substrings of length β are similar only if k pairs of their short substrings are similar and their start positions differ by at most α.
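The count in the example can be checked with a pigeonhole argument for the pure Hamming case (no insertions or deletions). The sketch below assumes the strings are cut into non-overlapping aligned blocks, which is a simplification; the function name `min_similar_blocks` is hypothetical.

```python
def min_similar_blocks(length, block_len, distance, max_mismatch):
    """Lower bound on the number of aligned blocks with at most
    `max_mismatch` mismatches, given `distance` mismatches in total.

    Every "bad" block needs at least max_mismatch + 1 mismatches, so at most
    distance // (max_mismatch + 1) blocks can be bad.
    """
    blocks = length // block_len
    bad_at_most = distance // (max_mismatch + 1)
    return max(0, blocks - bad_at_most)


print(min_similar_blocks(3000, 30, 290, 2))  # 4
```

With the numbers from the example the bound is 4, which is consistent with the "at least 3" stated above.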

Detecting Long Similar Substrings
Consider finding long similar substrings of given strings S1 and S2.
Comparing all substrings of length β:
- needs quadratic time;
- produces redundant overlapping pairs.
Instead, use our similarity condition:
(1) find all pairs of similar short substrings;
(2) scan a diagonal belt of width 2α to find an interval of length β that includes k pairs;
(3) shift the diagonal belt by α, and repeat.
We can always find the substrings of length β satisfying the condition, so an approach from similar short substrings is possible.
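Steps (2) and (3) can be sketched as follows, under stated assumptions. The input `hits` is a list of (i, j) start positions of similar short substrings in S1 and S2, as produced by step (1). A belt with offset o is assumed to cover the diagonals j - i in [o, o + 2α), belts are placed every α, and an "interval of length β" is taken to be a window of start positions in S1 spanning less than β. The function name and the output format are hypothetical.

```python
from collections import defaultdict


def find_long_similar(hits, alpha, beta, k):
    """Report (belt_offset, first_i, last_i) for every window of fewer than
    `beta` positions inside one diagonal belt that contains at least k hits."""
    belts = defaultdict(list)
    for i, j in hits:
        diag = j - i
        base = (diag // alpha) * alpha
        # Belts overlap by alpha, so each hit falls into exactly two belts.
        belts[base].append(i)
        belts[base - alpha].append(i)

    found = set()
    for offset, positions in belts.items():
        positions.sort()
        left = 0
        for right, pos in enumerate(positions):
            # Shrink the window until it spans fewer than beta positions.
            while pos - positions[left] >= beta:
                left += 1
            if right - left + 1 >= k:
                found.add((offset, positions[left], pos))
    return sorted(found)


hits = [(0, 2), (40, 45), (90, 88), (500, 100)]
print(find_long_similar(hits, alpha=30, beta=100, k=3))  # [(-30, 0, 90)]
```

Because consecutive belts overlap, a group of hits whose diagonals straddle a belt boundary is still seen together in the neighboring belt; here the three nearby hits meet only in the belt covering diagonals -30 to 29.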

Related Works
Computing the edit/Hamming distance takes quadratic/linear time:
- the whole strings have to be similar;
- local exchanges cannot be detected.
Heuristic homology search such as BLAST and PatternHunter usually finds exact matches of short substrings (11 letters) and extends them:
- it must find a terrible number of pairs when the input strings are huge;
- lengthening the 11 letters loses accuracy;
- it relies on heuristics, such as ignoring frequent substrings or dealing only with gene areas.
Similarity search:
- involves a huge number of queries, taking much longer than exact search.

Trivial Bound of the Complexity
If all the strings are exactly the same, we have to output all the pairs, which takes Θ(n^2) time:
- simple all-pairs comparison in O(l n^2) time is optimal if l is a fixed constant;
- is there no improvement?
In practice, we would analyze only when the output is small; otherwise the analysis is meaningless:
- consider the complexity in terms of the output size.
We propose an O(2^l (n + lM)) time algorithm, where M is the number of outputs.

Basic Idea: Solve Subproblem
Consider partitioning each string into k blocks, and the following subproblem.
Subproblem: for given k-d block positions, find all pairs of strings with distance at most d such that "the given blocks are the same".
Ex) the 2nd, 4th, and 5th blocks of S1 and S2 (length 30) are the same.
This needs far fewer comparisons.
We can solve it by "radix sort" on the combined blocks, in O(l n) time.
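A minimal sketch of one subproblem, under assumptions: strings are grouped with a dictionary keyed by the chosen blocks, which stands in for the radix sort named above, and blocks are cut at near-equal lengths. The names `split_blocks` and `solve_subproblem` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def split_blocks(s, k):
    """Split s into k contiguous blocks of near-equal length."""
    n = len(s)
    bounds = [n * t // k for t in range(k + 1)]
    return [s[bounds[t]:bounds[t + 1]] for t in range(k)]


def solve_subproblem(strings, k, d, chosen):
    """Pairs with Hamming distance at most d whose blocks at the indices in
    `chosen` (a tuple of k - d block indices) are identical."""
    groups = defaultdict(list)
    for idx, s in enumerate(strings):
        blocks = split_blocks(s, k)
        groups[tuple(blocks[b] for b in chosen)].append(idx)

    pairs = []
    for members in groups.values():
        # Only strings in the same group are compared letter by letter.
        for i, j in combinations(members, 2):
            if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                pairs.append((i, j))
    return pairs


strings = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
# k = 3 blocks (A|BC|DE), d = 1, fix the 1st and 3rd blocks.
print(solve_subproblem(strings, k=3, d=1, chosen=(0, 2)))  # [(0, 1), (0, 2)]
```

A single subproblem finds only the pairs that agree on the chosen blocks, so other block choices are still needed to find the remaining pairs.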

Examine All Cases
Solve the subproblem for all combinations of the block positions:
- if the distance between two strings S1 and S2 is at most 2, the letters on at least k-2 blocks are the same;
- so in at least one combination of blocks, the pair "S1 and S2" is found (in the subproblem of combination P).
The number of combinations is C(k, d). When k=5 and d=2, it is 10:
- the computation is "radix sorts + α", taking O(C(k, d) l n) time for sorting;
- a recursive radix sort reduces this to O(C(k, d) n).

Example
Find all pairs of strings with Hamming distance at most 1.
Input strings: ABCDE, ABDDE, ADCDE, CDEFG, CDEFF, CDEGG, AAGAB
[Figure: the seven strings split into blocks in several ways (for example A|BC|DE and ABC|DE) and grouped by identical blocks]
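The example can be reproduced by running the subproblem for every combination of k-d blocks. This is a sketch under assumptions: dictionary grouping replaces the radix sort, duplicates are removed with a set rather than by the talk's own mechanism, and the choice k = 3 (blocks A|BC|DE) is ours. The function name `similar_pairs_by_blocks` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def similar_pairs_by_blocks(strings, k, d):
    """All index pairs with Hamming distance at most d, found by fixing
    every combination of k - d blocks in turn."""
    length = len(strings[0])
    bounds = [length * t // k for t in range(k + 1)]
    found = set()
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                if sum(x != y for x, y in zip(strings[i], strings[j])) <= d:
                    found.add((i, j))  # the set hides repeated discoveries
    return sorted(found)


example = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
print(similar_pairs_by_blocks(example, k=3, d=1))
# [(0, 1), (0, 2), (3, 4), (3, 5)]
```

The four pairs are ABCDE/ABDDE, ABCDE/ADCDE, CDEFG/CDEFF and CDEFG/CDEGG; AAGAB has no partner within distance 1.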

Figure out Intuition
Finding pairs of similar records is like finding certain cells in a matrix.
All-pairs comparison sweeps over and looks at all cells.
Our multi-classification algorithm recursively reduces the areas to be checked in many ways, so the search route forms a tree whose leaves each correspond to a group of strings to be compared.

Avoid Duplications by Canonical Positions
For two strings S1 and S2, their canonical positions are the first l-d positions at which the letters are the same.
We output the pair S1 and S2 only in the subproblem of their canonical positions.
Computing the canonical positions takes O(l) time, so the "+α" part needs O(M l C(k, d)) time.
This avoids duplications without keeping the solutions in memory.
In total, the time is O(C(l, d) (n + dM)) = O(2^l (n + lM)) (if we set k = l).
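The duplicate check can be sketched as follows. As an assumption, the idea is stated here for blocks rather than single letters: the canonical blocks of a pair are the first k-d blocks on which the two strings agree (with k = l this matches the definition above). Dictionary grouping again stands in for the radix sort, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def canonical_blocks(a, b, bounds, need):
    """Indices of the first `need` blocks on which a and b agree."""
    same = [
        t for t in range(len(bounds) - 1)
        if a[bounds[t]:bounds[t + 1]] == b[bounds[t]:bounds[t + 1]]
    ]
    return tuple(same[:need])


def similar_pairs_no_duplicates(strings, k, d):
    """Each pair within distance d is output exactly once, without
    remembering earlier outputs."""
    length = len(strings[0])
    bounds = [length * t // k for t in range(k + 1)]
    out = []
    for chosen in combinations(range(k), k - d):
        groups = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in chosen)
            groups[key].append(idx)
        for members in groups.values():
            for i, j in combinations(members, 2):
                a, b = strings[i], strings[j]
                if sum(x != y for x, y in zip(a, b)) > d:
                    continue
                # Output only in the subproblem of the canonical blocks.
                if canonical_blocks(a, b, bounds, k - d) == chosen:
                    out.append((i, j))
    return out


print(similar_pairs_no_duplicates(["AAAA", "AAAA", "AAAB"], k=4, d=1))
# [(0, 1), (0, 2), (1, 2)]
```

The identical pair (0, 1) shows up in all four subproblems but is output only in the one that fixes blocks 0, 1 and 2.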

Difference from BLAST
The original "BLAST" algorithm finds pairs of identical intervals of 11 letters:
- roughly, it classifies into 4^11 = 4 million groups;
- it may take a long time for 100 million letters.
Our method, for length 30 with Hamming distance 3 (quality equal to finding identical intervals of 7 letters), dividing into 6 blocks:
- roughly, classifies into 4^15 = 1,000 million groups (20 times);
- may take a long time for 2000 million letters, but we can increase the number of blocks.
However, it is not good at searching for a given short string (there is no difference in time between many strings and one string).
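The group counts follow from the alphabet size of 4 and the number of letters fixed in each classification. A small check, assuming equal blocks of 30 / 6 = 5 letters so that 6 - 3 = 3 fixed blocks cover 15 letters; the variable names are ours:

```python
from math import comb

ALPHABET_SIZE = 4  # A, C, G, T

# Exact match on 11 letters.
blast_groups = ALPHABET_SIZE ** 11

# Length 30, 6 blocks, distance 3: three blocks of 5 letters are fixed.
length, blocks, d = 30, 6, 3
letters_fixed = (length // blocks) * (blocks - d)
our_groups = ALPHABET_SIZE ** letters_fixed

print(blast_groups)             # 4194304 (about 4 million)
print(our_groups)               # 1073741824 (about 1,000 million)
print(comb(blocks, blocks - d)) # 20 combinations of blocks
```

The 20 block combinations may be what "(20 times)" refers to, since one classification is run per combination; that reading is an assumption.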

Experiments: l = 20 and d = 0,1,2,3
Prefixes of the human Y chromosome
Notebook PC with Pentium M 1.1GHz, 256MB RAM

Comparison of Chromosome
Human 21st and chimpanzee 22nd chromosomes
Take strings of 30 letters from both, with overlaps.
Intensity is given by the number of pairs:
- white: possibly similar
- black: never similar
Grid lines detect "repetitions of similar structures".
[Figure: human 21st chr. versus chimpanzee 22nd chr.]
20 min. by PC

Homology Search on Mouse X Chr.
Human X and mouse X chromosomes (150M strings for each)
Take strings of 30 letters beginning at every position.
- For human X, without overlaps
- d=2, k=7
- dots if 3 points are in an area of width 300 and length (value lost)
[Figure: human X chr. versus mouse X chr.]
(value lost) hour by PC

Comparison of Many Bacteria
Comparison of the genomes of 30 bacteria
The genomes are concatenated and compared in the same way.
1 hour by PC

Comparison of BAC clones
Sequencing a genome is done by detecting overlaps of fragments.
When the genome has complex repeating structures, detection is hard.
We detected the overlaps in the mouse genome and completed some undetermined complex repeating parts (joint research with Koide and Umemori of the National Institute of Genetics, Japan).
1 sec. by PC for a pair

Extensions ???
Can we solve the problem for other objects (sets, sequences, graphs, ...)?
For graphs, maybe yes, but the practical performance is uncertain.
For sets, Hamming distance is not preferable: for large sets, many differences should be allowed.
For continuous objects, such as points in Euclidean space, we can hardly bound the complexity in the same way. (In the discrete version, the neighbors are finite, and are actually classified into a constant number of groups.)

Conclusion
- Output-sensitive algorithm for finding pairs of similar strings (in terms of Hamming distance)
- Multi-classification for speeding up
- Application to genome sequence comparison
Future works
- Models and algorithms for natural language text
- Extension to other objects (sets, sequences, graphs)
- Extension to continuous objects (points in Euclidean space)
- Efficient spin-out heuristics for practice
- Genome analysis tools and systems