Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1.

Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1

Outline Part 1: Motivation and preliminaries Inverted list based algorithms Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 2

Web Search http://www.google.com/jobs/britney.html  Errors in queries  Errors in data  Bring query and meaningful results closer together Actual queries gathered by Google 3

Record Linkage R informix microsoft … … S infromix … mcrosoft … Record linkage Edit distance Jaccard Cosine … 4

Document Cleaning Source: http://en.wikipedia.org/wiki/Heisenberg's_microscope Should be “Niels Bohr” 5

Demos http://directory.uci.edu/ http://psearch.ics.uci.edu/advanced/ http://psearch.ics.uci.edu/ 6

State-of-the-art: Oracle 10g and older Supported by Oracle Text CREATE TABLE engdict(word VARCHAR(20), len INT); Create preferences for text indexing: begin ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE'); ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH'); end; / CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word ) INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF'); Usage: SELECT * FROM engdict WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0; Limitation: cannot handle errors in the first letters: Katherine versus Catherine 7

8 Microsoft SQL Server [CGG+05] Data cleaning tools available in SQL Server 2005 Part of Integration Services Supports fuzzy lookups Uses data flow pipeline of transformations Similarity function: tokens with TF/IDF scores 8

Lucene Using Levenshtein Distance (Edit Distance). Example: roam~0.8 Prefix filtering followed by a scan (Efficiency?) 9

Problem Formulation Find strings similar to a given string Performance is important! -10 ms: 100 queries per second (QPS) - 5 ms: 200 QPS

Similarity Functions Similar to: a domain-specific function returns a similarity value between two strings Examples: Edit distance Hamming distance Jaccard similarity Soundex TF/IDF, BM25, DICE See [KSS06] for an excellent survey 11

12 A widely used metric to define string similarity Ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2 Edit Distance 12

Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM List-compression techniques Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 13

“ q-grams ” of strings u n i v e r s a l 2-grams 14

Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 15

q-gram inverted lists id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 16

# of common grams >= 3 Searching using inverted lists Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 2-grams at ch ck ic ri st ta ti tu uc 4 23 0 1 4 2 0 13 0124 4 124 3 3 ti iccksh ht ti ic ck 17

T-occurrence Problem Find elements whose occurrences ≥ T Ascending order Merge 18

Example T = 4 Result: 13 1 3 5 10 13 10 13 15 5 7 13 15 19

List-Merging Algorithms HeapMergerMergeOpt [SK04] [LLL08, BK02] ScanCount MergeSkipDivideSkip 20

Heap-based Algorithm Min-heap Count # of occurrences of each element using a heap Push to heap …… 21

MergeOpt Algorithm [SK04] Long Lists: T-1Short Lists Binary search 22

Example of MergeOpt 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 Long Lists: 3 Short Lists: 2 23

24 ScanCount 123…123… 1 3 5 10 13 10 13 15 5 7 13 15 Count threshold T≥ 4 # of occurrences 0 0 0 4 1 Increment by 1 1 String ids 13 14 15 0 2 0 0 Result!

List-Merging Algorithms HeapMergerMergeOpt ScanCount MergeSkipDivideSkip [SK04] [LLL08, BK02] 25

MergeSkip algorithm [BK02, LLL08] Min-heap …… Pop T-1 T-1 Jump Greater or equals 26

Example of MergeSkip 1 3 5 10 15 5757 1315 Count threshold T≥ 4 minHeap 10 1315 1 5 Jump 15 13 17 27

DivideSkip Algorithm [LLL08] Long ListsShort Lists Binary search MergeSkip 28

How many lists are treated as long lists? ? Short Lists Merge Long Lists Lookup A good balance in the tradeoff: # of long lists = T / ( μ logM +1) 29

Length Filtering Ed(s,t) ≤ 2 s: t: Length: 19 Length: 10 By length only! 30

Positional Filtering ab ab Ed(s,t) ≤ 2 s t (ab,1) (ab,12) 31

Filter tree [LLL08] … Length level Gram level Position level Inverted list 5 12 17 28 44 root 2 n1 3 … zyzz abaa 12m … 32

Surprising experimental results (DBLP) No filter (ms) Length (ms) Length+Pos (ms) DivideSkip2.230.76 1.96 Adding position filter could increase running time 33

Filters fragment inverts lists Applying filters Merge Saving: reduce list size. Cost: - Tree traversal, - More merging 34

Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM [LWY07,YWL08] List-compression techniques Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 35

# of common grams >= 1 2-grams -> 3-grams? Query: “ shtick ”, ED(shtick, ?)≤1 id strings 0123401234 rich stick stich stuck static 3-grams ati ich ick ric sta sti stu tat tic tuc uck tic icksht hti tic ick 4 2 4 1 2 0 1 0 3 4 1 3 4 2 3 id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 36

Observation 1: dilemma of choosing “q” Increasing “q” causing: Longer grams  Shorter lists Smaller # of common grams of similar strings id strings 0123401234 rich stick stich stuck static 4 23 0 1 4 2-grams at ch ck ic ri st ta ti tu uc 2 0 13 0124 4 124 3 3 37

Observation 2: skew distributions of gram frequencies DBLP: 276,699 article titles Popular 5-grams: ation (>114K times), tions, ystem, catio 38

VGRAM: Main idea Grams with variable lengths (between q min and q max ) zebra - ze(123) corrasion - co(5213), cor(859), corr(171) Advantages Reduce index size Reducing running time Adoptable by many algorithms 39

Challenges Generating variable-length grams? Constructing a high-quality gram dictionary? Relationship between string similarity and their gram-set similarity? Adopting VGRAM in existing algorithms? 40

Challenge 1: String  Variable-length grams? Fixed-length 2-grams Variable-length grams u n i v e r s a l ni ivr sal uni vers [2,4]-gram dictionary u n i v e r s a l 41

Representing gram dictionary as a trie ni ivr sal uni vers 42

Challenge 2: Constructing gram dictionary st  0, 1, 3 sti  0, 1 stu  3 stic  0, 1 stuc  3 Step 1: Collecting frequencies of grams with length in [qmin, qmax] Gram trie with frequencies 43

Step 2: selecting grams Pruning trie using a frequency threshold F (e.g., 2) 44

Step 2: selecting grams (cont) Threshold T = 2 45

Final gram dictionary A cost-based approach to choosing a gram dictionary [YWL08] 46

Challenge 3: Edit operation’s effect on grams k operations could affect k * q grams u n i v e r s a l Fixed length: q 47

Deletion affects variable-length grams i-q max +1i+q max - 1 Deletion Not affected Affected i 48

Grams affected by a deletion ni ivr sal uni vers [2,4]-grams u n i v e r s a l i-q max +1i+q max - 1 Deletion i Affected? Deletion Affected? 49

Grams affected by a deletion (cont) i-q max +1i+q max - 1 Deletion i Affected? Trie of gramsTrie of reversed grams 50

# of grams affected by each operation _ u _ n _ i _ v _ e _ r _ s _ a _ l _ 01 11 12 12 22 11 12 11 1 10 Deletion/substitutionInsertion 51

Max # of grams affected by k operations Vector of s = With 2 edit operations, at most 4 grams can be affected Called NAG vector (# of affected grams) Precomputed and stored Dynamic programming to compute tight bounds [YWL08] 52

Summary of VGRAM index 53

Challenge 4: adopting VGRAM Easily adoptable by many algorithms Basic interfaces: String s  grams String s1, s2 such that ed(s1,s2) <= k  min # of their common grams 54

Lower bound on # of common grams If ed(s1,s2) =: (|s 1 |- q + 1) – k * q u n i v e r s a l Fixed length (q) Variable lengths: # of grams of s1 – NAG(s1,k) 55

Example: algorithm using inverted lists Query: “shtick”, ED(shtick, ?)≤1 124 12 1 0 4 3 … ck ic … ti … Lower bound = 3 Lower bound = 1 sh ht tick 24 1 4 0 1 1 2 3 … ck ic ich … tic tick … 2-4 grams2-grams tick id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static id strings 0123401234 rich stick stich stuck static 56

Outline Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM List-compression techniques [BJL+09] Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 57

Motivation Inverted index: very large IR : lossless compression (delta encoding, mostly disk-based) Decompression overhead Difficult to tune compression ratio 58

Solution Two lossy-compression techniques Queries become faster Flexibility to choose space / time tradeoff 59

Approach 1: Discarding Lists tfviirefrvneun in …… 2-grams Inverted Lists 5959 1515 7979 569569 1245612456 Discarded 60

Approach 2: Combining Lists 2-grams Inverted Lists tfviirefrvneun in …… 134579134579 569569 12391239 7979 6969 1245612456 Combined 61

Technical challenges  Effect on list-merging algorithms  How to choose lists to discard/merge 62

Outline (end of part 1) Part 1: Motivation and preliminaries Inverted list based algorithms List-merging algorithms VGRAM [LWY07,YWL08] List-compression techniques Part 2: Gram signature algorithms Length normalized algorithms Selectivity estimation Conclusion and future directions 63

Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1.

Similar presentations

Presentation on theme: "Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1.

Similar presentations

Presentation on theme: "Efficient Approximate Search on String Collections Marios Hadjieleftheriou Chen Li 1."— Presentation transcript:

Similar presentations

About project

Feedback