Published by Shawn Hubbard. Modified over 9 years ago.
1
Output Sensitive Algorithm for Finding Similar Objects
Jul/2/2007, Combinatorial Algorithms Day
Takeaki Uno
National Institute of Informatics, The Graduate University for Advanced Studies (Sokendai)
2
Motivation: Analyzing Huge Data
Recent information technology has given us many huge databases
- Web, genome, POS, log, …
"Construction" and "keyword search" can be done efficiently.
The next step is analysis: capturing features of the data
- statistics, such as size, #rows, density, attributes, distribution…
Can we get more? Look at (simple) local structures, but keep them simple and basic.
[Figure: two example databases, genome sequences (ATGCGCCGTA, TAGCGGGTGG, TTCGCGTTAG, ...) and a table of results for Experiments 1 to 4]
3
Our Focus
Find all pairs of similar objects (or structures), or pairs in a binary relation instead of similarity.
This is probably very basic and fundamental.
There would be many applications:
- finding globally similar structures,
- constructing neighbor graphs,
- detecting locally dense structures (groups of related objects)
In this talk, we look at strings.
4
Existing Studies
There are many studies on similarity search (homology search):
given a database, construct a data structure that lets us quickly find the objects similar to a given query object
- strings with Hamming distance, edit distance
- points in the plane (k-d trees), Euclidean space
- sets
- constructing neighbor graphs (for smaller dimensions)
- genome sequence comparison (heuristics)
Both exact and approximate approaches exist.
All-pairs comparison is not popular.
5
Our Problem
Problem: Given a database composed of n strings of the same fixed length l, and a threshold d, find all the pairs of strings such that the Hamming distance of the two strings is at most d.
Input: ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, ...
Output:
- ATGCCGCG, AAGCCGCC
- GCCTCTAT, GCTTCTAA
- TGTAATGA, GGTAATGG
- ...
6
Trivial Bound of the Complexity
If all the strings are exactly the same, we have to output all the pairs, which takes $\Theta(n^2)$ time.
So simple all-pairs comparison in $O(l n^2)$ time is optimal if l is a fixed constant.
Is there no improvement?
In practice, we would analyze only when the output is small; otherwise the analysis is meaningless.
So consider the complexity in terms of the output size.
We propose an $O(2^l (n + lM))$ time algorithm, where M is the number of output pairs.
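As a point of reference, the all-pairs baseline can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the sample strings are the output pairs shown on the "Our Problem" slide, used here with d = 2.

```python
from itertools import combinations

def hamming(s, t):
    """Number of positions where the equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def similar_pairs_naive(strings, d):
    """Compare every pair of indices: O(l * n^2) time, whatever the output size."""
    return [(i, j)
            for i, j in combinations(range(len(strings)), 2)
            if hamming(strings[i], strings[j]) <= d]

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(similar_pairs_naive(strings, 2))
```

The running time does not depend on how many pairs are reported, which is the behaviour an output-sensitive algorithm tries to avoid.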
7
Basic Idea: Fixed Position Subproblem
Consider the following subproblem:
For given l-d positions of letters, find all pairs of strings with Hamming distance at most d such that "the letters on the l-d positions are the same".
Ex) 2nd, 4th, 5th positions of strings with length 5
We can solve this by "radix sort" on the letters at those positions, in O(l n) time.
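A minimal sketch of the fixed-position subproblem follows. It groups strings with a hash map rather than the radix sort named on the slide, which is a substitution made here for brevity; the function name and the sample strings are hypothetical. Since the strings in a group agree on l-d positions, every pair inside a group has Hamming distance at most d and needs no further check.

```python
from collections import defaultdict
from itertools import combinations

def solve_subproblem(strings, positions):
    """Return index pairs whose letters agree on every index in `positions`."""
    buckets = defaultdict(list)
    for idx, s in enumerate(strings):
        # The key is the tuple of letters at the fixed positions.
        buckets[tuple(s[p] for p in positions)].append(idx)
    pairs = []
    for group in buckets.values():
        pairs.extend(combinations(group, 2))
    return pairs

# 2nd, 4th, 5th positions (0-based: 1, 3, 4) of strings of length 5, so d = 2.
strings = ["ABCDE", "XBYDE", "ABCDF", "QBRDE"]
print(solve_subproblem(strings, (1, 3, 4)))
```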
8
Examine All Cases
Solve the subproblem for all combinations of the positions.
If the distance of two strings S1 and S2 is at most d, the letters on some l-d positions (say P) are the same, so in at least one combination S1 and S2 are found (in the subproblem of combination P).
The number of combinations is $\binom{l}{d}$; when l=5 and d=2, it is 10.
Computation is "radix sorts +α": $O(\binom{l}{d} l n)$ time for sorting.
Using branch-and-bound on the radix sort, this becomes $O(\binom{l}{d} n)$ time.
9
Exercise
- Find all pairs of strings with Hamming distance at most 1
Input strings: GAB, ABC, ABD, ACC, EFG, FFG, AFG
[Figure: the list re-sorted for different choices of two positions, e.g. GAB, ABC, ACC, ABD, AFG, EFG, FFG and GAB, ABC, ABD, ACC, AFG, EFG, FFG]
10
Duplication: How long is "+α"
If two strings S1 and S2 are exactly the same, their pair is found in all subproblems, $\binom{l}{d}$ times.
If we allow the duplications, "+α" needs $O(M \binom{l}{d})$ time.
To avoid the duplication, use "canonical positions".
11
Avoid Duplications by Canonical Positions
For two strings S1 and S2, their canonical positions are the first l-d positions at which they have the same letters.
We output the pair S1 and S2 only in the subproblem of their canonical positions.
Computing the canonical positions takes O(d) time, so "+α" needs $O(M d \binom{l}{d})$ time.
This avoids duplications without keeping the solutions in memory.
$O(\binom{l}{d} (n + dM)) = O(2^l (n + lM))$ time in total ($O(n + M)$ if l is a fixed constant).
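The whole procedure, with canonical positions used to report each pair exactly once, might be sketched as follows. This is an illustration under assumptions, not the author's implementation: it groups with a hash map instead of a radix sort, omits the branch-and-bound speed-up, and finds the canonical positions by a plain O(l) scan rather than the O(d) computation the slide refers to. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def canonical_positions(s, t, keep):
    """First `keep` indices at which s and t carry the same letter."""
    same = [p for p in range(len(s)) if s[p] == t[p]]
    return tuple(same[:keep])

def similar_pairs(strings, d):
    """All index pairs with Hamming distance <= d, each reported once."""
    l = len(strings[0])
    keep = l - d
    out = []
    # One subproblem per choice of l-d positions that must agree.
    for positions in combinations(range(l), keep):
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            buckets[tuple(s[p] for p in positions)].append(idx)
        for group in buckets.values():
            for i, j in combinations(group, 2):
                # Report the pair only in the subproblem of its canonical positions.
                if canonical_positions(strings[i], strings[j], keep) == positions:
                    out.append((i, j))
    return sorted(out)

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(similar_pairs(strings, 2))
```

No set of already-reported pairs is kept: a pair is skipped purely by comparing its canonical positions with the current subproblem, which matches the memory argument on the slide.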
12
In Practice
Is $\binom{l}{d}$ small in practice? In some cases, yes (e.g., genome sequences).
If we want to find strings with at most 10% error:
$\binom{20}{2} = 190$, $\binom{30}{3} = 4060$, $\binom{60}{6} = 50063860$, …
This may be too large for slightly larger l.
To deal with slightly larger l, we use a variant of this algorithm.
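The counts quoted above can be checked with Python's standard `math.comb` (available from Python 3.8); the dictionary name is just for illustration.

```python
from math import comb

# Number of subproblems C(l, d) when about 10% of the letters may differ.
counts = {(l, d): comb(l, d) for l, d in [(20, 2), (30, 3), (60, 6)]}
for (l, d), c in counts.items():
    print(f"l={l}, d={d}: {c} combinations")
```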
13
Partition to Blocks
Consider a partition of each string into k blocks.
For given k-d positions of blocks, find all pairs of strings with distance at most d such that "the blocks on the positions are the same".
Radix sorts are done in $O(\binom{k}{d} n)$ time.
Ex) 2nd, 4th, 5th positions of blocks of strings of length 5
14
Small "+α" is Expected
The Hamming distance of two strings may be larger than d even if their k-d blocks are the same.
In the worst case, "+α" is not linear in the number of outputs.
However, if the number of letters in the k-d blocks is large enough, few strings have the same blocks.
So "+α" is not large in practice, and the running time is almost $O(\binom{k}{d} n)$.
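The block variant can be sketched in the same style. Two strings within distance d differ in at most d blocks, so they agree on at least k-d blocks and land in a common group for some choice of blocks; candidates must then be verified because agreeing blocks do not guarantee a small distance. The equal-size block boundaries, the hash-map grouping, and the use of a set to drop duplicate pairs are assumptions of this sketch (the slides do not say how duplicates are handled in this variant), and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def hamming(s, t):
    """Number of positions where the equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def block_bounds(l, k):
    """Split range(l) into k contiguous blocks of near-equal size."""
    return [(b * l // k, (b + 1) * l // k) for b in range(k)]

def similar_pairs_blocks(strings, d, k):
    """Index pairs with Hamming distance <= d, found via k-d identical blocks."""
    bounds = block_bounds(len(strings[0]), k)
    found = set()
    for chosen in combinations(range(k), k - d):
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)
            buckets[key].append(idx)
        for group in buckets.values():
            for i, j in combinations(group, 2):
                # Identical blocks do not imply distance <= d, so verify.
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return sorted(found)

strings = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(similar_pairs_blocks(strings, d=2, k=4))
```

With l = 8 and k = 4 there are only 6 block choices instead of the 28 position choices of the letter-level version, at the price of the verification step.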
15
Experiments: l = 20 and d = 0, 1, 2, 3
Prefixes of the human Y chromosome
Notebook PC with Pentium M 1.1 GHz, 256 MB RAM
16
Comparison of Long Strings
- Slice one of the long strings into substrings with overlaps
- Partition the other long string into substrings without overlap
- Compare all pairs
Two ways to draw the result:
1. Draw a matrix: the intensity of a cell is given by the number of pairs inside it
2. Draw a point if 3 pairs fall in an area of length α and width β
Reason: if two substrings of length α have an error rate a bit less than k%, they have at least some short similar substrings.
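The slicing step might look like the sketch below. The window length `w`, the function names, and the tiny input strings are hypothetical, and the brute-force comparison merely stands in for the similar-pair algorithm of the earlier slides.

```python
def hamming(s, t):
    """Number of positions where the equal-length strings s and t differ."""
    return sum(a != b for a, b in zip(s, t))

def overlapping_windows(s, w):
    """(start, substring) for every start position; consecutive windows overlap."""
    return [(i, s[i:i + w]) for i in range(len(s) - w + 1)]

def disjoint_windows(s, w):
    """(start, substring) for consecutive windows that do not overlap."""
    return [(i, s[i:i + w]) for i in range(0, len(s) - w + 1, w)]

def matching_points(x, y, w, d):
    """Points (i, j): the window of x at i is within distance d of the window of y at j."""
    xs = overlapping_windows(x, w)
    ys = disjoint_windows(y, w)
    return [(i, j) for i, a in xs for j, b in ys if hamming(a, b) <= d]

print(matching_points("AAACGTAA", "ACGTACGT", w=4, d=0))
```

Each returned point corresponds to one similar pair; counting points per cell gives the matrix intensities, and thresholding the count in a region gives the dot plot.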
17
Comparison of Chromosome
Human 21st and chimpanzee 22nd chromosomes
Take strings of 30 letters from both, with overlaps
Intensity is given by the number of pairs
- White: possibly similar
- Black: never similar
Grid lines detect "repetitions of similar structures"
20 min. by PC
[Figure: matrix plot, human 21st chr. vs. chimpanzee 22nd chr.]
18
Homology Search on Chromosomes
Human X and mouse X chromosomes (150M strings for each)
Take strings of 30 letters beginning at every position
- For human X, without overlaps
- d=2, k=7
- Draw a dot if 3 points are in an area of width 300 and length 3000
1 hour by PC
[Figure: dot plot, human X chr. vs. mouse X chr.]
19
Extensions ???
Can we solve the problem for other objects (sets, sequences, graphs, …)?
For graphs, maybe yes, but the practical performance is not certain.
For sets, Hamming distance is not preferable: for large sets, many differences should be allowed.
For continuous objects, such as points in Euclidean space, we can hardly bound the complexity in the same way. (In the discrete version, the neighbors are finite, and are actually classified into a constant number of groups.)
20
Conclusion
- Output sensitive algorithm for finding pairs of similar strings (in terms of Hamming distance)
- Multiple classification by the positions required to be the same
- Using blocks to reduce the practical computation
- Application to genome sequence comparison
Future works
- Extension to other objects (sets, sequences, graphs)
- Extension to continuous objects (points in Euclidean space)
- Efficient spin-out heuristics for practice
- Genome analysis system