A Fast Regular Expression Indexing Engine
Junghoo “John” Cho (UCLA), Sridhar Rajagopalan (IBM)

Junghoo "John" Cho (UCLA Computer Science)

Problem: How can we match a regular expression fast? On a large text corpus, it takes several days to match a simple regular expression! Our solution: use an index!

Motivation: Advanced search interface. What is the middle name of Thomas Edison? State of the art: keyword-based → Thomas Edison. Regular expression → Thomas [a-z]+ Edison. Data extraction [Brin 98].

Outline: Index key selection; useful gram; algorithm for key selection; other issues; experiments.

Motivating example: Find all mp3 URLs on the Web. Every matching string contains “mp3”. Questions: Should we index “mp3”? Should we index “<a href=”?

What index entries? Solution 1: Inverted index (English words). Cannot handle many regular expressions. Solution 2: k-grams for k = 1, 2, …, 10. Index too large (10 times as large!). Our solution: multigram.

Main idea: “mp3” is helpful: not many pages have it. “<a href=” is not: all pages have it. We index only “useful” grams.

Gram selectivity: Sel(x) is the selectivity of gram x, defined as Sel(x) = M(x)/N, where M(x) is the number of pages containing gram x and N is the total number of pages. C-useful grams: all grams with Sel(x) < C, where C is a system parameter (random access vs. sequential access time). We index only “C-useful” grams.
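
The selectivity definition above can be sketched in a few lines of Python. This is only an illustration: the function names, the toy pages, and the threshold value C = 0.5 are assumptions made for the example and do not come from the slides.

```python
def selectivity(gram, pages):
    """Sel(x) = M(x) / N: the fraction of pages that contain the gram."""
    matching = sum(1 for page in pages if gram in page)  # M(x)
    return matching / len(pages)                         # divided by N


def is_useful(gram, pages, c):
    """A gram is C-useful when its selectivity is strictly below C."""
    return selectivity(gram, pages) < c


# Toy corpus (hypothetical): every page has a link, only one has an mp3.
pages = [
    '<a href="song.mp3">song</a>',
    '<a href="index.html">home</a>',
    '<a href="about.html">about</a>',
    '<a href="news.html">news</a>',
]
C = 0.5  # hypothetical threshold, not a value from the talk

print(selectivity("mp3", pages))       # 0.25
print(selectivity("<a href=", pages))  # 1.0
print(is_useful("mp3", pages, C))      # True
print(is_useful("<a href=", pages, C)) # False
```

On this toy corpus “mp3” would be indexed and “<a href=” would not, which mirrors the motivating example.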

Minimal useful gram: Take “Unix is great”. If “Unix” is useful, then “Unix i”, “Unix is”, “Unix is g”, … are all useful. “Unix” is the minimal useful gram. We index only the minimal useful gram.

Advantages: Versatile: we can look up “Unix” for all grams like “Unix i”, “Unix is g”, etc. Easy to find: reduction to the “A priori” algorithm. Index size guarantee.
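
The versatility point can be illustrated with a small lookup sketch: a longer gram is served by the indexed key that is its prefix, and the regular expression is then run only on the candidate pages. Everything here is an assumption for illustration: the index layout (a dict from key to page ids), the function names, and the fact that the caller supplies a gram that every match must contain. The slides do not say how that gram is derived from the expression.

```python
import re


def find_index_key(gram, index):
    """Return the shortest indexed key that is a prefix of gram, or None."""
    for end in range(1, len(gram) + 1):
        if gram[:end] in index:
            return gram[:end]
    return None


def match_with_index(pattern, required_gram, index, pages):
    """Filter pages through the index, then verify with the real regex.

    required_gram is a string that every match must contain; it is
    supplied by the caller in this sketch (an assumption).
    """
    key = find_index_key(required_gram, index)
    # Without a usable key we fall back to scanning every page.
    candidates = range(len(pages)) if key is None else index[key]
    regex = re.compile(pattern)
    return [pid for pid in candidates if regex.search(pages[pid])]


pages = ["Unix is great", "Linux is great", "I use Unix is it good"]
index = {"Unix": [0, 2]}  # hypothetical postings: page ids containing "Unix"

print(find_index_key("Unix is g", index))                          # Unix
print(match_with_index(r"Unix is g\w+", "Unix is g", index, pages))  # [0]
```

The index only narrows the candidate set; the final regex check is what guarantees correct results.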

Algorithm, main idea: If “abcde” is a minimal useful gram, then “abcd” is not useful. If “abcd” is not useful, then “a”, “ab”, and “abc” are not useful. Minimal useful gram identification is equivalent to useless gram identification.

A priori algorithm: Useless gram identification: find all sequences of characters that occur in more than k pages. A priori algorithm: find all sets of items that occur in more than k baskets. Fewer than 4 scans of the corpus to find all minimal useful grams.
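
A level-wise search in the spirit of the a priori idea can be sketched as follows. This is a minimal sketch and not the authors' implementation: it makes one pass over the corpus per gram length, extends only grams whose one-character-shorter prefix was useless, and the names `minimal_useful_grams` and `max_len` are hypothetical. It assumes 0 < c <= 1, so the empty gram counts as useless.

```python
from collections import defaultdict


def minimal_useful_grams(pages, c, max_len=10):
    """Find grams with Sel(x) < c whose proper prefixes are all useless."""
    n = len(pages)
    useless_prev = {""}  # the empty gram occurs in every page
    minimal = set()
    for k in range(1, max_len + 1):
        # One scan: count pages for length-k grams that extend a useless prefix.
        page_count = defaultdict(int)
        for page in pages:
            seen = set()
            for i in range(len(page) - k + 1):
                gram = page[i:i + k]
                if gram[:-1] in useless_prev:
                    seen.add(gram)
            for gram in seen:
                page_count[gram] += 1
        useless_now = set()
        for gram, m in page_count.items():
            if m / n < c:
                minimal.add(gram)      # useful, and its prefix was useless
            else:
                useless_now.add(gram)  # still too common, extend next round
        if not useless_now:
            break
        useless_prev = useless_now
    return minimal


toy_pages = ["ab", "ac", "ad", "bc"]
print(sorted(minimal_useful_grams(toy_pages, 0.5)))
# ['ab', 'ac', 'ad', 'bc', 'd']
```

Because a prefix occurs in at least as many pages as any of its extensions, checking only the one-character-shorter prefix is enough to establish that every shorter prefix is useless.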

Prefix free set: A set of grams X is prefix free if no x ∈ X is a prefix of any other x’ ∈ X. E.g., X = {ab, ac, abc} is not prefix free. A set of minimal useful grams is a prefix free set.
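
The prefix free property is easy to check mechanically. The sketch below (function name hypothetical) relies on a standard fact about sorted strings: if any gram is a prefix of another gram in the set, it is also a prefix of its immediate successor in lexicographic order, so only adjacent pairs need to be compared.

```python
def is_prefix_free(grams):
    """True if no gram in the set is a prefix of a different gram."""
    ordered = sorted(set(grams))
    # A prefix relation anywhere in the set shows up between neighbours.
    return not any(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


print(is_prefix_free({"ab", "ac", "abc"}))  # False
print(is_prefix_free({"ab", "ac", "bc"}))   # True
```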

Size of a prefix free set: Let X be a prefix free set of grams extracted from corpus D. Then |X| ≤ |D|, where |X| is the number of grams in X and |D| is the number of characters in D. The size of an index with minimal useful grams does not exceed the size of the corpus!
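
One way to see why the bound holds is a short counting argument. It is offered as a sketch and may differ from the proof the authors had in mind; the position map p is a symbol introduced here for the argument.

```latex
\begin{align*}
&\text{For each } x \in X \text{ choose a position } p(x) \in \{1, \dots, |D|\}
  \text{ at which an occurrence of } x \text{ starts in } D.\\
&\text{If } p(x) = p(x') \text{ with } |x| \le |x'|, \text{ then } x \text{ is a prefix of } x',
  \text{ so } x = x' \text{ because } X \text{ is prefix free.}\\
&\text{Hence } p \text{ is injective, and } |X| \le |D|.
\end{align*}
```

The same reasoning shows that at most one gram of X can start at any given position of D, so the total number of gram occurrences is also at most |D|.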

Shortest suffix gram: Take <a href=“k. If =“k is useful, then <a href=“k, a href=“k, href=“k, etc. are all useful. =“k is the shortest suffix gram. We index only the shortest suffix gram. Pre-suf shell.
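
The suffix pruning step can be sketched as a filter over a set of useful grams: a gram is dropped when one of its proper suffixes is already in the set. This is an illustrative reading of the slide, not the authors' code, and the function name is hypothetical.

```python
def shortest_suffix_grams(grams):
    """Keep only grams that have no proper suffix in the same set."""
    grams = set(grams)
    return {
        g for g in grams
        if not any(g[i:] in grams for i in range(1, len(g)))
    }


useful = {'<a href="k', 'a href="k', 'href="k', '="k'}
print(shortest_suffix_grams(useful))  # {'="k'}
```

Applied to a prefix free set of minimal useful grams, this filter leaves grams that are minimal on both sides, which appears to be what the slide calls the pre-suf shell.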

Other issues: Given a regular expression, how do we find an index entry to look up? Optimization?

Experiments: Half a million Web documents. Comparison: raw scanning; multigram index; complete index (k-grams for k = 1, 2, …, 10). Benchmark queries: no standard; collected from IBM Almaden researchers.

Example queries (simplified):
MP3 URLs:
Invalid HTML: ]*<
Phone numbers: (\d\d\d) \d\d\d-\d\d\d\d
PowerPC chip number: (xpc|mpc)[0-9]+[0-9a-z]+
Middle name of Clinton: William [a-z]+ Clinton

Evaluation metrics: Index construction time; index size; matching time; overall throughput; response time for first 10 matches.

Construction time & index size

                    Complete          Multigram
Construction time   63 hours          6 hours
No. of keys         103,151,302       64,656
No. of postings     18,193,048,399    820,396,717

An order of magnitude reduction in index size (about 22 times fewer postings).

Matching time

Query      Scanning   Complete   Multigram
mp3        573 sec    11 sec     15 sec
PowerPC    548 sec    1 sec      2 sec
phone      540 sec

On average, Complete is faster than Multigram only by 33%.

Result size & Improvement

Related work: Suffix tree (Baeza-Yates et al., JACM, 1998): main-memory based. Disk-based string index (Cooper et al., VLDB, 2001): good for exact string matching. Inverted index: English words.

Conclusion: Fast matching of regular expressions. Multigram index: small size, significant improvement in matching time. Future work: optimization?