Efficient Approximate Search on String Collections Part I


1 Efficient Approximate Search on String Collections Part I
Marios Hadjieleftheriou, Chen Li

2 DBLP Author Search

3 Try their names (good luck!)
Yannis Papakonstantinou (UCSD)
Meral Ozsoyoglu (Case Western)
Marios Hadjieleftheriou (AT&T Research)

4

5 Better system?

6 People Search at UC Irvine

7 Web Search
Errors in queries
Errors in data
Bring query and meaningful results closer together
Actual queries gathered by Google

8 Data Cleaning
R: informix, microsoft
S: infromix, mcrosoft

9 Problem Formulation
Find strings similar to a given string: dist(Q,D) <= δ
Example: find strings similar to “hadjeleftheriou”
Performance is important!
10 ms: 100 queries per second (QPS)
5 ms: 200 QPS

10 Outline
Part I: Motivation; Preliminaries; Trie-based approach; Gram-based algorithms
Part II: Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion

11 Next… Preliminaries

12 Similarity Functions
“Similar to”: a domain-specific function that returns a similarity value between two strings
Examples: edit distance, Hamming distance, Jaccard similarity, Soundex, TF/IDF, BM25, DICE
See [KSS06] for an excellent survey

13 Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
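
The definition above corresponds to the classic dynamic-programming recurrence over prefixes of the two strings. The following is a minimal sketch in Python; the function name `edit_distance` is illustrative and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # the slide's example pair
```

The sketch keeps only two rows of the table, so memory is proportional to the length of `s2`.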

14 Next… Gram-based algorithms
List-merging algorithms [LLL08]
Variable-length grams (VGRAM) [LWY07, YWL08]

15 “q-grams” of strings
String: universal
2-grams: un, ni, iv, ve, er, rs, sa, al
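
Extracting q-grams amounts to sliding a window of width q over the string. A small sketch, assuming no padding characters are added at the string boundaries (the slide shows none); the name `qgrams` is illustrative.

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """All overlapping substrings of length q, in left-to-right order.
    A string shorter than q yields an empty list."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal", 2))
```

A string of length n yields n - q + 1 grams, which is the quantity used in the count bound on the next slide.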

16 Edit operation’s effect on grams
Fixed length: q
String: universal
k operations could affect k * q grams
If ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
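
The bound can be checked numerically on a pair of strings. The sketch below counts common grams as a multiset intersection, which is one reasonable reading of "# of common grams"; the helper names are illustrative.

```python
from collections import Counter


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def count_lower_bound(s1_len, q, k):
    """(|s1| - q + 1) - k * q, the bound stated on the slide.
    A value <= 0 means the bound gives no pruning power."""
    return (s1_len - q + 1) - k * q


def common_grams(s1, s2, q):
    """Number of q-grams shared by s1 and s2, counted with multiplicity."""
    c1, c2 = Counter(qgrams(s1, q)), Counter(qgrams(s2, q))
    return sum((c1 & c2).values())


# "shtick" and "stick" are within edit distance 1 (delete the 'h').
bound = count_lower_bound(len("shtick"), q=2, k=1)
shared = common_grams("shtick", "stick", q=2)
print(bound, shared)
```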

17 q-gram inverted lists
Strings: rich, stick, stich, stuck, static
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
[Figure: each 2-gram points to the list of ids of the strings that contain it]
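
A q-gram inverted index maps each gram to the sorted list of ids of the strings containing it. The sketch below builds one for the five strings on this slide. The 0-based ids are an assumption, since the id column did not survive in the transcript, and each id is stored once per list even if a gram repeats inside a string.

```python
from collections import defaultdict


def qgrams(s, q):
    """All overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings, q=2):
    """Map each q-gram to an ascending list of string ids."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):      # ids assigned in input order
        for gram in set(qgrams(s, q)):     # one posting per (gram, string)
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_inverted_index(strings, q=2)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list is already sorted, which the list-merging algorithms on the following slides rely on.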

18 Searching using inverted lists
Query: “shtick”, ED(shtick, ?) ≤ 1
2-grams of the query: sh, ht, ti, ic, ck
Query grams with inverted lists: ti, ic, ck
# of common grams >= 3
Strings: rich, stick, stich, stuck, static
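
The search on this slide can be sketched end to end: compute the count threshold, count how often each string id appears on the query's inverted lists, and verify the surviving candidates with the real edit distance. This is an illustrative sketch, not the authors' implementation; all function names are hypothetical and the simple counting step stands in for the list-merging algorithms discussed next.

```python
from collections import Counter, defaultdict


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    """Classic dynamic program over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_index(strings, q):
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(qgrams(s, q)):
            index[g].append(sid)
    return index


def search(query, strings, index, q, k):
    """Return the strings within edit distance k of the query."""
    threshold = (len(query) - q + 1) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything: check every string.
        candidates = range(len(strings))
    else:
        counts = Counter()
        for g in qgrams(query, q):
            counts.update(index.get(g, ()))   # +1 for every id on the list
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    # The filter may admit false positives, so verify each candidate.
    return sorted(strings[sid] for sid in candidates
                  if edit_distance(query, strings[sid]) <= k)


STRINGS = ["rich", "stick", "stich", "stuck", "static"]
print(search("shtick", STRINGS, build_index(STRINGS, 2), q=2, k=1))
```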

19 T-occurrence Problem
Merge lists that are sorted in ascending order
Find elements whose occurrences ≥ T

20 Example
T = 4
Elements in the lists: 1 3 5 10 13 10 13 15 5 7 13 13 15
Result: 13

21 List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

22 Heap-based Algorithm
Push to heap (min-heap)
Count # of occurrences of each element using a heap
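
A heap-based merge keeps the current front element of every list in a min-heap, so equal elements come out consecutively and can be counted. The sketch below is one way to write it with Python's `heapq`. The five-list split of the example's elements is an assumption, since the transcript gives the elements without list boundaries; it is chosen to agree with the stated result (13) and with the "3 long lists, 2 short lists" of the MergeOpt example.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Heap entries: (element, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, li, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[li]):           # advance the popped list
            heapq.heappush(heap, (lists[li][pos + 1], li, pos + 1))
    if current is not None and count >= T:     # flush the last run
        result.append(current)
    return result


# Assumed list boundaries for the example on the previous slide.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(LISTS, T=4))
```

Every element of every list passes through the heap, which is the cost that MergeSkip and DivideSkip try to avoid.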

23 MergeOpt Algorithm [SK04]
Long lists: the T-1 longest lists, probed with binary search
Short lists: the remaining lists

24 Example of MergeOpt
Count threshold: T = 4
Long lists: 3
Short lists: 2
Elements in the lists: 1 3 5 10 13 10 13 15 5 7 13 13 15
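
The idea on the two MergeOpt slides can be sketched as follows: set aside the T-1 longest lists, merge only the short lists, and then look up each candidate in the long lists by binary search. An element that appears on no short list can occur at most T-1 times, so it is safe to ignore. This sketch is a simplification: it counts the short lists with a `Counter` rather than the heap used in [SK04], and the five-list split of the example data is an assumption.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]

    # Step 1: count occurrences on the short lists only.
    counts = Counter()
    for lst in short_lists:
        counts.update(lst)

    # Step 2: complete each candidate's count by probing the long lists.
    result = []
    for element in sorted(counts):
        total = counts[element]
        for lst in long_lists:
            i = bisect_left(lst, element)
            if i < len(lst) and lst[i] == element:
                total += 1
        if total >= T:
            result.append(element)
    return result


# Assumed list boundaries: 3 long lists and 2 short lists for T = 4.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(LISTS, T=4))
```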

25 ScanCount
Count threshold: T = 4
Keep an array indexed by string id that holds the # of occurrences of each id
Scan every list and increment the count of each id by 1
Elements in the lists: 1 3 5 10 13 10 13 15 5 7 13 13 15
Id 13 reaches count 4: result! Id 15 ends with count 2
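
ScanCount trades the heap for a plain array of counters. A minimal sketch, assuming string ids are small non-negative integers with a known maximum (`max_id` is an illustrative parameter) and using the same assumed five-list split of the example data:

```python
def scan_count(lists, T, max_id):
    """Return the ids that occur on at least T of the lists."""
    counts = [0] * (max_id + 1)       # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1
            if counts[sid] == T:      # report each id once, when it hits T
                result.append(sid)
    return sorted(result)


# Assumed list boundaries for the slide's example elements.
LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(LISTS, T=4, max_id=15))
```

The lists do not need to be merged in any particular order here; each list is simply scanned once.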

26 List-Merging Algorithms
HeapMerger, MergeOpt [SK04]
ScanCount, MergeSkip, DivideSkip [LLL08, BK02]

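27 MergeSkip algorithm [BK02, LLL08]
Min-heap over the front elements of the lists
Pop T-1 elements
Jump each popped list to its first element greater than or equal to the new heap top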

28 Example of MergeSkip
Count threshold: T = 4
[Figure: minHeap holding 1, 5, 10, 13, 15, with Jump steps over the lists]
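
MergeSkip can be sketched as a heap merge that skips ahead whenever the smallest element cannot reach the threshold. If the top element occurs fewer than T times, pop until T-1 lists have been popped; anything smaller than the new heap top can occur on those T-1 lists only, so each of them jumps by binary search to its first element at or above the new top. The code below is an illustrative reading of the slides, not the authors' code, and it reuses the assumed five-list data from the earlier T-occurrence example rather than the lists in this slide's figure.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the elements that occur on at least T of the sorted lists.
    Assumes each list is strictly ascending."""
    pos = [0] * len(lists)                      # cursor into each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:                       # fewer than T live lists: done
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:       # pop every copy of the top
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:                    # advance by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < T - 1:  # pop up to T-1 lists in total
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break                           # at most T-1 lists remain
            target = heap[0][0]
            for i in popped:                    # jump to first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


LISTS = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(LISTS, T=4))
```

On this data with T = 4, the first round pops 1, 5 and 10 and jumps all three lists straight to 13, so elements 3 and 7 are never visited.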

29 DivideSkip Algorithm [LLL08]
Long lists: binary search
Short lists: MergeSkip

30 How many lists are treated as long lists?

31 Length Filtering s: t: Length: 10 By length only! Ed(s,t) ≤ 2

32 Positional Filtering
Ed(s,t) ≤ 2
s: gram ab at position 1, written (ab,1)
t: gram ab at position 12, written (ab,12)
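
Both filters on slides 31 and 32 reduce to simple arithmetic checks. Each edit operation changes a string's length by at most 1 and shifts a gram's position by at most 1, so strings within edit distance k differ in length by at most k, and a shared gram can only be matched if its two positions differ by at most k. A minimal sketch with illustrative function names:

```python
def passes_length_filter(len_s: int, len_t: int, k: int) -> bool:
    """Strings within edit distance k differ in length by at most k."""
    return abs(len_s - len_t) <= k


def positions_compatible(pos_s: int, pos_t: int, k: int) -> bool:
    """A gram shared by s and t only counts as common if its two
    positions are at most k apart."""
    return abs(pos_s - pos_t) <= k


# Slide 32: (ab,1) in s and (ab,12) in t with Ed(s,t) <= 2.
print(positions_compatible(1, 12, k=2))   # the two 'ab' grams cannot match
print(passes_length_filter(10, 12, k=2))  # lengths 10 and 12 survive for k = 2
```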

33 A filter tree Combine filters with list-merging algorithms [LLL08]

34 Next… Variable-length grams (VGRAM) [LWY07, YWL08]

35 2-grams -> 3-grams?
Query: “shtick”, ED(shtick, ?) ≤ 1
3-grams of the query: sht, hti, tic, ick
Query grams with inverted lists: tic, ick
# of common grams >= 1
3-grams in the index: ati, ich, ick, ric, sta, sti, stu, tat, tic, tuc, uck
Strings: rich, stick, stich, stuck, static

36 Observation 1: dilemma of choosing “q”
Increasing “q” causes:
Longer grams -> shorter lists
Smaller # of common grams of similar strings
2-grams: at, ch, ck, ic, ri, st, ta, ti, tu, uc
Strings: rich, stick, stich, stuck, static

37 Observation 2: skew distributions of gram frequencies
DBLP: 276,699 article titles
Popular 5-grams: ation (>114K times), tions, ystem, catio

38 VGRAM: Main idea
Grams with variable lengths (between qmin and qmax)
zebra: ze(123)
corrasion: co(5213), cor(859), corr(171)
Advantages:
Reduces index size
Reduces running time
Adoptable by many algorithms

39 Challenges
Generating variable-length grams?
Constructing a high-quality gram dictionary?
Relationship between string similarity and their gram-set similarity?
Adopting VGRAM in existing algorithms?

40 Challenge 1: String -> Variable-length grams?
String: universal
Fixed-length 2-grams: un, ni, iv, ve, er, rs, sa, al
[2,4]-gram dictionary: ni, ivr, sal, uni, vers
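
One way to decompose a string with a gram dictionary is sketched below. It rests on two assumptions that the slide does not spell out: at each position the longest dictionary gram is taken (falling back to the qmin-gram when nothing matches), and a gram is dropped when it lies entirely inside a gram already emitted. The function name `vgrams` and the plain Python set standing in for the trie are illustrative.

```python
def vgrams(s, dictionary, qmin, qmax):
    """Decompose s into positional variable-length grams (position, gram)."""
    grams = []
    max_end = -1   # last index covered by the grams emitted so far
    for p in range(len(s) - qmin + 1):
        # Longest dictionary gram starting at p, tried from qmax down to qmin.
        gram = None
        for length in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + length] in dictionary:
                gram = s[p:p + length]
                break
        if gram is None:
            gram = s[p:p + qmin]        # fall back to the fixed qmin-gram
        end = p + len(gram) - 1
        if end > max_end:               # skip grams subsumed by an earlier one
            grams.append((p, gram))
            max_end = end
    return grams


DICTIONARY = {"ni", "ivr", "sal", "uni", "vers"}
print(vgrams("universal", DICTIONARY, qmin=2, qmax=4))
```

Under these assumptions "universal" yields four grams instead of the eight fixed-length 2-grams, which is where the index-size reduction comes from.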

41 Representing gram dictionary as a trie
ni ivr sal uni vers

42 Step 2: Constructing a gram dictionary
qmin=2, qmax=4
Frequency-based [LWY07]
Cost-based [YWL08]

43 Challenge 3: Edit operation’s effect on grams
Fixed length: q
String: universal
k operations could affect k * q grams

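44 Deletion affects variable-length grams
Deletion at position i
Affected: grams that overlap the window from i-qmax+1 to i+qmax-1
Not affected: grams that lie entirely before i-qmax+1 or entirely after i+qmax-1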

45 Main idea
For a string, for each position, compute the number of grams that could be destroyed by an operation at this position
Compute the number of grams possibly destroyed by k operations
Store these numbers (for all data strings) as part of the index
Vector of s = <2,4,6,8,9>
With 2 edit operations, at most 4 grams can be affected
Use this number to do count filtering

46 Summary of VGRAM index

47 Challenge 4: adopting VGRAM
Easily adoptable by many algorithms
Basic interfaces:
String s -> grams
Strings s1, s2 such that ed(s1,s2) <= k -> min # of their common grams

48 Lower bound on # of common grams
String: universal
Fixed length (q): if ed(s1,s2) <= k, then their # of common grams >= (|s1| - q + 1) - k * q
Variable lengths: # of grams of s1 - NAG(s1,k)
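
The two bounds can be put side by side in a few lines. The sketch assumes that the stored vector from slide 45 is indexed so that entry k-1 holds NAG(s,k), which matches "with 2 edit operations, at most 4 grams can be affected" for the vector <2,4,6,8,9>. The gram count of 10 used in the example is hypothetical, since the slides do not give the string behind that vector.

```python
def fixed_lower_bound(s1_len, q, k):
    """Fixed-length grams: (|s1| - q + 1) - k * q."""
    return (s1_len - q + 1) - k * q


def vgram_lower_bound(num_grams, nag_vector, k):
    """Variable-length grams: (# of grams of s1) - NAG(s1, k).
    Assumes nag_vector[k - 1] stores NAG(s1, k)."""
    return num_grams - nag_vector[k - 1]


NAG_VECTOR = [2, 4, 6, 8, 9]   # vector of s from slide 45
print(vgram_lower_bound(num_grams=10, nag_vector=NAG_VECTOR, k=2))  # 10 is hypothetical
print(fixed_lower_bound(len("shtick"), q=2, k=1))
```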

49 Example: algorithm using inverted lists
Query: “shtick”, ED(shtick, ?) ≤ 1
2-grams: query grams with inverted lists ck, ic, ti; lower bound = 3
2-4 grams: query grams sh, ht, tick; index grams ck, ic, ich, tic, tick; lower bound = 1
Strings: rich, stick, stich, stuck, static

50 End of part I
Part I: Motivation; Preliminaries; Trie-based approach; Gram-based algorithms
Part II: Sketch-based algorithms; Compression; Selectivity estimation; Transformations/Synonyms; Conclusion

