1
Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.
2
Consider searching for a subsequence in a collection of genome sequences: …gcaagctttatagtgacaacaataaggtatcactcggtt…
N-gram inverted indexes are the traditional solution, but have 10-100 times more terms than ordinary word-based inverted indexes
TinyLex indexes achieve similar query performance with 7-17 times fewer terms
TinyLex provides good worst-case query performance
TinyLex: Static N-Gram Index Pruning - Derrick Coetzee
3
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}
4
1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.
Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
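The word-based index and the "sack and cat" query above can be sketched in a few lines of Python. This is only an illustration: the `LEMMAS` table and the `STOP_WORDS` set are assumptions introduced here to reproduce the singular terms and the missing "and" seen on the slide; the presentation does not say how words were normalized.

```python
import re

DOCS = {
    1: "Each wife had seven sacks,",
    2: "Each sack had seven cats,",
    3: "Each cat had seven kits.",
    4: "Kits, cats, sacks, and wives.",
}

# Hypothetical normalization, chosen only to match the terms on the slide.
LEMMAS = {"wives": "wife", "sacks": "sack", "cats": "cat", "kits": "kit"}
STOP_WORDS = {"and"}


def build_word_index(docs):
    """Map each normalized word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            term = LEMMAS.get(word, word)
            if term in STOP_WORDS:
                continue
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, *terms):
    """Intersect the posting lists of all query terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


index = build_word_index(DOCS)
print(index["sack"], index["cat"])
print(query_and(index, "sack", "cat"))
```

A conjunctive query is just the intersection of posting lists, which is why the size of those lists drives query cost.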
5
Partial word or punctuation queries
◦ Searching a dictionary for all words ending in “ment”
◦ Searching for in HTML files
◦ Searching for "%s" in C source files
◦ Searching for x^2/2 in LaTeX source files
Searching East Asian language text
◦ No spaces, word extraction is complex
Phrase searching
6
Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...
Example query: Documents containing subsequence “cact”
7
Simplified example: Two-letter alphabet
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
8
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Query: aaba
Split the query into its 3-grams: aab and aba
9
1. babbbbabab
2. aababaaabb
3. babababaab (false positive)
4. bbbbaabbbb
Query: aaba
aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
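The following Python sketch reproduces this fixed-length 3-gram query on the four example documents. The function names are illustrative, and the final verification step (scanning each candidate document for the query string) is an assumption about how false positives are filtered out; the slides only show the candidate set.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


def candidate_docs(index, query, n):
    """Intersect the posting lists of all n-grams of the query."""
    grams = [query[i:i + n] for i in range(len(query) - n + 1)]
    if not grams:
        raise ValueError("query is shorter than n")
    result = None
    for gram in grams:
        postings = index.get(gram, set())
        result = set(postings) if result is None else result & postings
    return result


def search(docs, index, query, n):
    """Keep only the candidates that really contain the query."""
    return {d for d in candidate_docs(index, query, n) if query in docs[d]}


index3 = build_ngram_index(DOCS, 3)
print(candidate_docs(index3, "aaba", 3))  # candidates, including document 3
print(search(DOCS, index3, "aaba", 3))    # candidates that survive verification
```

Document 3 contains both "aab" and "aba" but never adjacent as "aaba", which is exactly the kind of false positive the index cannot rule out on its own.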
10
length = 1
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
a: {1, 2, 3, 4}
b: {1, 2, 3, 4}
Small number of terms
Slow queries
Long posting lists
Too many false positives
11
length = 6
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aababa: {2}
aabbbb: {4}
abaaab: {2}
ababaa: {2, 3}
ababab: {3}
abbbba: {1}
baaabb: {2}
baabbb: {4}
babaaa: {2}
babaab: {3}
bababa: {3}
babbbb: {1}
bbaabb: {4}
bbabab: {1}
bbbaab: {4}
bbbaba: {1}
bbbbaa: {4}
bbbbab: {1}
Fast queries
Too many terms
Queries must be ≥6 characters
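The trade-off between n-gram length and vocabulary size can be checked directly on the toy collection. The small Python sketch below (the helper name `count_terms` is made up for this illustration) counts the distinct n-grams for the lengths shown on the slides.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def count_terms(docs, n):
    """Number of distinct length-n substrings across the whole collection."""
    return len({
        text[i:i + n]
        for text in docs.values()
        for i in range(len(text) - n + 1)
    })


for n in (1, 3, 6):
    print(n, count_terms(DOCS, n))
```

Even with four ten-character documents the term count grows from 2 at length 1 to 18 at length 6, while the posting lists shrink to one or two documents each.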
12
Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions
13
Goal: fewer terms without sacrificing query performance
Consider the n-grams “juggl” and “uggle”
◦ Almost exactly the same posting list in a typical English language collection
◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”
juggl: {2, 7, 33}
uggle: {2, 7, 33}
uggl: {2, 7, 33}
14
Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.
Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.
Allow variable-length n-grams.
15
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}
In this example t = 1. A query stops as soon as it meets 1 false positive, and a query for a string that occurs in the collection produces none.
Only 10 terms!
16
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Query: abaab
aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
17
The construction guarantees that if the query term occurs in the collection, it will have at most t - 1 false positives (zero in this case).
If we observe t false positives, we can halt immediately.
18
Query: bbbbb
bbb and bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} ∩ {1, 4} = {1, 4}
Document 1 (babbbbabab) is a false positive. With t = 1 that can’t happen unless the query result is empty. Halt.
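A minimal Python sketch of a TinyLex-style query with the early halt, using the ten-term example index from the slides. Two things are assumptions of this sketch rather than statements from the presentation: candidates are found by intersecting the posting lists of every indexed term that is a substring of the query, and each candidate is verified by scanning the document text. The halt is only sound when the index really satisfies the t - 1 guarantee for the query being asked; the toy index is used here purely for illustration.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

# The ten-term example index (t = 1).
INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}


def tinylex_query(index, docs, query, t):
    """Return the documents containing `query`, halting after t false positives."""
    # Assumed lookup rule: use every indexed term that occurs inside the query.
    candidates = set(docs)
    for term, postings in index.items():
        if term in query:
            candidates &= postings

    matches = set()
    false_positives = 0
    for doc_id in sorted(candidates):  # sorted only to make the run deterministic
        if query in docs[doc_id]:
            matches.add(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                # Under the guarantee, t false positives imply no true matches.
                return set()
    return matches


print(tinylex_query(INDEX, DOCS, "abaab", t=1))
print(tinylex_query(INDEX, DOCS, "bbbbb", t=1))
```

For "bbbbb" the first candidate read is already a false positive, so the query stops after reading a single document instead of verifying every candidate.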
19
Achieves query performance similar to that of classical n-gram indexes, which have a much larger number of terms
Worst-case bound on the number of false positives
Query can be any length
20
Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions
21
The problem:
◦ Input: a set of documents, a threshold t
◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t - 1 false positives
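The output condition can be written as a small checker. The Python sketch below tests, for every substring of the collection up to a chosen length, that the index yields at most t - 1 false positives. The lookup rule (intersect all indexed terms contained in the query) and the `max_len` cut-off are assumptions of this sketch; the example data is the four-document toy collection and its ten-term index, checked only up to length 4.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}

INDEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}


def false_positives(index, docs, query):
    """Count candidate documents that do not actually contain the query."""
    candidates = set(docs)
    for term, postings in index.items():
        if term in query:
            candidates &= postings
    actual = {doc_id for doc_id, text in docs.items() if query in text}
    return len(candidates - actual)


def satisfies_threshold(index, docs, t, max_len):
    """True if every occurring substring up to max_len has at most t - 1 false positives."""
    for text in docs.values():
        for n in range(1, max_len + 1):
            for i in range(len(text) - n + 1):
                if false_positives(index, docs, text[i:i + n]) > t - 1:
                    return False
    return True


print(satisfies_threshold(INDEX, DOCS, t=1, max_len=4))
```

Removing any one of the ten terms, for example "babb", makes the check fail, which is one way to see that each term in the example index is needed.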
22
Basic construction:
For each n-gram length from 1 to max:
◦ Make a list of all n-grams in the collection and what documents they occur in.
◦ Perform a query on each term using the partially constructed index.
◦ If a term has too many false positives, add it to the index.
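The basic construction can be sketched in Python as follows. This is a direct, unoptimized reading of the three steps above, not the author's implementation: "too many" is taken to mean at least t false positives, and the lookup rule (intersect the posting lists of all indexed terms contained in the query) is an assumption. The names `build_tinylex` and `candidates` are invented for the example.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def candidates(index, docs, query):
    """Query the partial index: intersect all indexed terms found in the query."""
    result = set(docs)
    for term, postings in index.items():
        if term in query:
            result &= postings
    return result


def build_tinylex(docs, t, max_len):
    index = {}
    for n in range(1, max_len + 1):
        # Step 1: every n-gram in the collection with its true posting list.
        actual = {}
        for doc_id, text in docs.items():
            for i in range(len(text) - n + 1):
                actual.setdefault(text[i:i + n], set()).add(doc_id)
        # Steps 2 and 3: query each n-gram and keep those with >= t false positives.
        new_terms = {}
        for gram, postings in actual.items():
            excess = len(candidates(index, docs, gram)) - len(postings)
            if excess >= t:
                new_terms[gram] = postings
        index.update(new_terms)
    return index


index = build_tinylex(DOCS, t=1, max_len=4)
for term in sorted(index, key=lambda s: (len(s), s)):
    print(term, sorted(index[term]))
```

With t = 1 and a maximum length of 4, this yields the same ten terms as the example TinyLex index for the four-document collection.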
23
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
t = 1
(index empty)
1-grams   Query result   Actual
a         {1,2,3,4}      {1,2,3,4}
b         {1,2,3,4}      {1,2,3,4}
If the difference between the query result size and the actual posting list size is at least 1, add the term to the index.
24
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
(index empty)
2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}
25
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Of the 2-grams, aa and bb each have 1 false positive, so they are added to the index:
aa: {2,3,4}
bb: {1,2,4}
26
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aa: {2,3,4}
bb: {1,2,4}
3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}
27
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Of the 3-grams, aaa, aba, bab, bba, and bbb each have at least 1 false positive, so they are added to the index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
28
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}
29
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb
Of the 4-grams, aaba, baab, and babb each have 1 false positive, so they are added, giving the 10-term index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
aaba: {2}
baab: {3,4}
babb: {1}
30
Review of inverted n-gram indexes
Example TinyLex index
TinyLex index construction
Results
Disadvantages
Questions
31
Test set: 100 MB TREC WSJ collection, 37000 documents, English text
Same query performance with 7-17 times fewer terms
32
Overall compressed index size is 2-20% smaller
A TinyLex index has more information per term
33
Dramatic 50x improvement in worst-case query performance for long queries
34
Applications to phrase searching using variable-length word n-grams
Making the construction more efficient
Performance on genome sequences
Empirical evaluation of scaling
35
Suffix arrays (Manber and Myers 1991)
◦ Faster queries, but indexes 3-10 times larger
agrep and GLIMPSE (Wu and Manber 1994)
◦ More general queries, but relies on a word concept
n-Gram/2L (Kim et al. 2005)
◦ Orthogonal; examines fewer document offsets
“Growing an n-gram language model” (Siivola and Pellom 2005)
◦ Similar idea applied to language modeling
36
Faster construction time
◦ Currently about 10 times slower to construct than a classical n-gram index.
Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
Generalize to dynamic collections
37
N-gram indexes enable practical queries for subsequences
TinyLex indexes achieve similar query performance to classical n-gram indexes with 7-17 times fewer terms
TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives