

Derrick Coetzee, Microsoft Research
CC0 waiver: To the extent possible under law, I waive all copyright and related or neighboring rights to all content in this presentation.

• Consider searching for a subsequence in a collection of genome sequences: …gcaagctttatagtgacaacaataaggtatcactcggtt…
• N-gram inverted indexes are the traditional solution, but have many times more terms than ordinary word-based inverted indexes
• TinyLex indexes achieve similar query performance with 7-17 times fewer terms
• TinyLex provides good worst-case query performance
TinyLex: Static N-Gram Index Pruning - Derrick Coetzee

1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.

each: {1, 2, 3}
had: {1, 2, 3}
seven: {1, 2, 3}
wife: {1, 4}
sack: {1, 2, 4}
cat: {2, 3, 4}
kit: {3, 4}

1. Each wife had seven sacks,
2. Each sack had seven cats,
3. Each cat had seven kits.
4. Kits, cats, sacks, and wives.

Query: sack and cat
sack: {1, 2, 4}
cat: {2, 3, 4}
{1, 2, 4} ∩ {2, 3, 4} = {2, 4}
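The word-index lookup above can be sketched in Python. This is an illustration rather than code from the presentation: the `SINGULAR` table is a hypothetical stand-in for whatever normalization maps "sacks" to "sack", the function names are made up, and no stop-word filtering is applied (so "and" becomes a term here).

```python
import re

# The four lines of the rhyme, keyed by document number.
DOCS = {
    1: "Each wife had seven sacks,",
    2: "Each sack had seven cats,",
    3: "Each cat had seven kits.",
    4: "Kits, cats, sacks, and wives.",
}

# Hypothetical hand-written table standing in for a real stemmer.
SINGULAR = {"sacks": "sack", "cats": "cat", "kits": "kit", "wives": "wife"}


def build_word_index(docs):
    """Map each normalized word to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            term = SINGULAR.get(word, word)
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, *terms):
    """Documents containing every term: intersect the posting lists."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


word_index = build_word_index(DOCS)
print(sorted(query_and(word_index, "sack", "cat")))  # [2, 4]
```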

• Partial word or punctuation queries
  ◦ Searching a dictionary for all words ending in “ment”
  ◦ Searching for tags in HTML files
  ◦ Searching for "%s" in C source files
  ◦ Searching for x^2/2 in LaTeX source files
• Searching East Asian language text
  ◦ No spaces, word extraction is complex
• Phrase searching

Genome sequences:
1. gcaagctttatagtgacaac...
2. aataaggtatcactcggtta...
3. caattacccccacttcccct...
4. cattataaagaaatgatcaa...

Example query: Documents containing subsequence “cact”

Simplified example: Two-letter alphabet
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

aaa: {2}
aab: {2, 3, 4}
aba: {1, 2, 3}
abb: {1, 2, 4}
baa: {2, 3, 4}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
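A minimal sketch of how the 3-gram index above could be built: slide a window of length n over each document and record which documents each n-gram occurs in. The function name `build_ngram_index` is an assumption made for this example.

```python
# The four example documents over the alphabet {a, b}.
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


index3 = build_ngram_index(DOCS, 3)
for term in sorted(index3):
    print(term, sorted(index3[term]))
```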

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Query: aaba
The query aaba is decomposed into its 3-grams: aab and aba

1. babbbbabab
2. aababaaabb
3. babababaab (false positive)
4. bbbbaabbbb

Query: aaba
aab and aba
aab: {2, 3, 4}
aba: {1, 2, 3}
{2, 3, 4} ∩ {1, 2, 3} = {2, 3}
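The query above, including the false positive on document 3, can be reproduced with a short sketch. The index lookup only yields candidates; each candidate still has to be read to confirm the match. The helper names `candidates` and `verify` are assumptions for this example.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}
N = 3


def build_ngram_index(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


def candidates(index, n, query):
    """Intersect the posting lists of every n-gram of the query."""
    if len(query) < n:
        raise ValueError("query is shorter than the indexed n-gram length")
    result = None
    for i in range(len(query) - n + 1):
        postings = index.get(query[i:i + n], set())
        result = set(postings) if result is None else result & postings
    return result


def verify(docs, doc_ids, query):
    """Read each candidate document and keep only the true matches."""
    return {doc_id for doc_id in doc_ids if query in docs[doc_id]}


index = build_ngram_index(DOCS, N)
found = candidates(index, N, "aaba")
matches = verify(DOCS, found, "aaba")
print(sorted(found))            # [2, 3]
print(sorted(matches))          # [2]
print(sorted(found - matches))  # [3]  (the false positive)
```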

length = 1
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

a: {1, 2, 3, 4}
b: {1, 2, 3, 4}

• Small number of terms
• Slow queries
• Long posting lists
• Too many false positives

length = 6
1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

aababa: {2}
aabbbb: {4}
abaaab: {2}
ababaa: {2, 3}
ababab: {3}
abbbba: {1}
baaabb: {2}
baabbb: {4}
babaaa: {2}
babaab: {3}
bababa: {3}
babbbb: {1}
bbaabb: {4}
bbabab: {1}
bbbaab: {4}
bbbaba: {1}
bbbbaa: {4}
bbbbab: {1}

• Fast queries
• Too many terms
• Queries must be ≥6 characters
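The trade-off between the two extremes can be made concrete by counting terms and average posting-list length for each fixed n-gram length on the example collection. This is an illustrative sketch; `ngram_postings` and `term_counts` are names chosen for the example.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def ngram_postings(docs, n):
    """Map every length-n substring to the set of documents containing it."""
    index = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            index.setdefault(text[i:i + n], set()).add(doc_id)
    return index


term_counts = {}
for n in range(1, 7):
    index = ngram_postings(DOCS, n)
    term_counts[n] = len(index)
    average = sum(len(p) for p in index.values()) / len(index)
    print(f"length {n}: {len(index)} terms, average posting list {average:.2f}")
```

On this collection the term count grows from 2 at length 1 to 18 at length 6, while the posting lists shrink, which is the tension the two slides describe.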

• Review of inverted n-gram indexes
• Example TinyLex index
• TinyLex index construction
• Results
• Disadvantages
• Questions

• Goal: fewer terms without sacrificing query performance
• Consider the n-grams “juggl” and “uggle”
  ◦ Almost exactly the same posting list in a typical English language collection
  ◦ Just put the n-gram “uggl” in the index, and leave out “juggl” and “uggle”

juggl: {2, 7, 33}
uggle: {2, 7, 33}
uggl: {2, 7, 33}

• Insight: The more false positives a term produces when it is queried for, the more information it adds when it is added to the index.
• Choose a false positive threshold t and choose the smallest possible set of index terms that satisfies it.
• Allow variable-length n-grams.

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

aa: {2, 3, 4}
bb: {1, 2, 4}
aaa: {2}
aba: {1, 2, 3}
bab: {1, 2, 3}
bba: {1, 4}
bbb: {1, 4}
aaba: {2}
baab: {3, 4}
babb: {1}

In this example t = 1: any term that would produce at least 1 false positive is added to the index, so a query for a term occurring in the collection has at most t – 1 = 0 false positives.
Only 10 terms!

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Query: abaab
aba and baab
aba: {1, 2, 3}
baab: {3, 4}
{1, 2, 3} ∩ {3, 4} = {3}
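A sketch of querying the variable-length index from the example. One assumption is made here: the query intersects the posting list of every substring that is an index term. The slide uses only aba and baab; intersecting the shorter term aa as well cannot change the result, because any document containing baab also contains aa. The name `query_tinylex` is made up for this example.

```python
# The 10-term example index with t = 1, copied from the slide.
TINYLEX = {
    "aa": {2, 3, 4}, "bb": {1, 2, 4},
    "aaa": {2}, "aba": {1, 2, 3}, "bab": {1, 2, 3},
    "bba": {1, 4}, "bbb": {1, 4},
    "aaba": {2}, "baab": {3, 4}, "babb": {1},
}
ALL_DOCS = {1, 2, 3, 4}


def query_tinylex(index, all_docs, query):
    """Intersect the posting lists of all substrings of query that are index terms."""
    result = set(all_docs)  # no indexed substring means every document is a candidate
    for start in range(len(query)):
        for end in range(start + 1, len(query) + 1):
            postings = index.get(query[start:end])
            if postings is not None:
                result &= postings
    return result


print(sorted(query_tinylex(TINYLEX, ALL_DOCS, "abaab")))  # [3]
```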

• The construction guarantees that if the query term occurs in the collection, it will have at most t – 1 false positives (zero in this case).
• If we observe t false positives, we can halt immediately.

Query: bbbbb
bbb and bbb and bbb
bbb: {1, 4}
{1, 4} ∩ {1, 4} ∩ {1, 4} = {1, 4}

1. babbbbabab (false positive)
A false positive can’t happen unless the query result is empty. Halt.
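The early halt can be sketched as a verification loop that stops as soon as t false positives have been seen. This is a hypothetical illustration; `verify_with_halt` and its return shape (matches plus number of documents read) are assumptions, and the candidate set is passed in directly.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def verify_with_halt(docs, candidates, query, t):
    """Check candidates in order; stop once t false positives are observed.

    Returns (matching document ids, number of documents read).
    """
    matches = []
    false_positives = 0
    docs_read = 0
    for doc_id in sorted(candidates):
        docs_read += 1
        if query in docs[doc_id]:
            matches.append(doc_id)
        else:
            false_positives += 1
            if false_positives >= t:
                # A term that occurs in the collection has at most t - 1
                # false positives, so the query term cannot occur anywhere.
                return [], docs_read
    return matches, docs_read


# Candidates {1, 4} come from intersecting the bbb posting lists.
print(verify_with_halt(DOCS, {1, 4}, "bbbbb", t=1))  # ([], 1)
```

With t = 1 the loop reads document 1, sees the false positive, and returns an empty result without reading document 4.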

• Achieve similar query performance to classical n-gram indexes with a much smaller number of terms
• Worst-case bound on number of false positives
• Query can be any length

• The problem:
  ◦ Input: a set of documents, a threshold t
  ◦ Output: a list of terms such that any query for a term occurring in the collection will have at most t – 1 false positives

• Basic construction:
• For each n-gram length from 1 to max:
  ◦ Make a list of all n-grams in the collection and what documents they occur in.
  ◦ Perform a query on each term using the partially constructed index.
  ◦ If a term has too many false positives, add it to the index.
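The basic construction can be sketched as follows. This is a minimal, unoptimized sketch under stated assumptions, not the author's implementation: a term counts as having "too many" false positives when the query result exceeds its actual posting list by at least t documents, a query intersects the posting lists of all indexed substrings, and `max_len` is a parameter chosen by the caller. All function names are made up for the example.

```python
DOCS = {1: "babbbbabab", 2: "aababaaabb", 3: "babababaab", 4: "bbbbaabbbb"}


def ngram_postings(docs, n):
    """All length-n substrings in the collection and the documents they occur in."""
    postings = {}
    for doc_id, text in docs.items():
        for i in range(len(text) - n + 1):
            postings.setdefault(text[i:i + n], set()).add(doc_id)
    return postings


def indexed_query(index, all_docs, term):
    """Query the partial index: intersect posting lists of indexed substrings."""
    result = set(all_docs)
    for start in range(len(term)):
        for end in range(start + 1, len(term) + 1):
            postings = index.get(term[start:end])
            if postings is not None:
                result &= postings
    return result


def build_tinylex(docs, t, max_len):
    """Add a term whenever querying for it would yield at least t false positives."""
    all_docs = set(docs)
    index = {}
    for n in range(1, max_len + 1):
        additions = {}
        for term, actual in ngram_postings(docs, n).items():
            result = indexed_query(index, all_docs, term)
            if len(result) - len(actual) >= t:
                additions[term] = actual
        # Terms of equal length are never substrings of one another,
        # so applying the additions after the pass changes nothing.
        index.update(additions)
    return index


tinylex = build_tinylex(DOCS, t=1, max_len=4)
for term in sorted(tinylex, key=lambda s: (len(s), s)):
    print(term, sorted(tinylex[term]))
```

Run with t = 1 and a maximum length of 4, this yields the same 10 terms as the example index shown earlier.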

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index: (empty)
t = 1

1-grams   Query result   Actual
a         {1,2,3,4}      {1,2,3,4}
b         {1,2,3,4}      {1,2,3,4}

If the difference between the query result size and the actual posting list size is at least 1, add the term to the index.

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index: (empty)

2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}

1. babbbbabab
2. aababaaabb
3. babababaab
4. bbbbaabbbb

Index:
aa: {2,3,4}
bb: {1,2,4}

2-grams   Query result   Actual
aa        {1,2,3,4}      {2,3,4}
ab        {1,2,3,4}      {1,2,3,4}
ba        {1,2,3,4}      {1,2,3,4}
bb        {1,2,3,4}      {1,2,4}

Index:
aa: {2,3,4}
bb: {1,2,4}

3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}

3-grams   Query result   Actual
aaa       {2,3,4}        {2}
aab       {2,3,4}        {2,3,4}
aba       {1,2,3,4}      {1,2,3}
abb       {1,2,4}        {1,2,4}
baa       {2,3,4}        {2,3,4}
bab       {1,2,3,4}      {1,2,3}
bba       {1,2,4}        {1,4}
bbb       {1,2,4}        {1,4}

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}

4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}

Index:
aa: {2,3,4}
bb: {1,2,4}
aaa: {2}
aba: {1,2,3}
bab: {1,2,3}
bba: {1,4}
bbb: {1,4}
aaba: {2}
baab: {3,4}
babb: {1}

4-grams   Query result   Actual
aaab      {2}            {2}
aaba      {2,3}          {2}
aabb      {2,4}          {2,4}
abaa      {2,3}          {2,3}
abab      {1,2,3}        {1,2,3}
abbb      {1,4}          {1,4}
baaa      {2}            {2}
baab      {2,3,4}        {3,4}
baba      {1,2,3}        {1,2,3}
babb      {1,2}          {1}
bbaa      {4}            {4}
bbab      {1}            {1}
bbba      {1,4}          {1,4}
bbbb      {1,4}          {1,4}

• Test set: 100MB TREC WSJ collection
• documents, English text
• Same query performance with 7-17 times fewer terms

• Overall compressed index size 2-20% smaller
• TinyLex index has more information per term

• Dramatic 50x improvement in worst-case query performance for long queries

• Applications to phrase searching using variable-length word n-grams
• Making the construction more efficient
• Performance on genome sequences
• Empirical evaluation of scaling

• Suffix arrays (Manber and Myers 1991)
  ◦ Faster queries, but indexes 3-10 times larger
• agrep and GLIMPSE (Wu and Manber 1994)
  ◦ More general queries, but relies on a word concept
• n-Gram/2L (Kim et al. 2005)
  ◦ Orthogonal; examines fewer document offsets
• “Growing an n-gram language model” (Siivola and Pellom 2005)
  ◦ Similar idea applied to language modeling

• Faster construction time
  ◦ Currently about 10 times slower to construct than a classical n-gram index.
• Queries for nonoccurring terms are more expensive than with classical n-gram indexes (t documents must be read).
• Generalize to dynamic collections

• N-gram indexes enable practical queries for subsequences
• TinyLex indexes achieve similar query performance to classical n-gram indexes with 7-17 times fewer terms
• TinyLex yields good worst-case query performance by placing an upper bound on the number of false positives
