On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.

On the Use of Regular Expressions for Searching Text
Charles L.A. Clarke and Gordon V. Cormack

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Ricardo Baeza-Yates, Gaston H. Gonnet

On the Use of Regular Expressions for Searching Text
New perspective, particularly relevant to structured text
Definition of the search problem
– Does a given string of text match a particular pattern? (regular expression recognition problem)
– Locate the substrings of a text that match a particular pattern (searching problem)
– Given a universe U, identify all elements of U that contain a substring x matching a particular pattern r (more precise definition)

On the Use of Regular Expressions for Searching Text
Given a string x and a regular expression r, locate all substrings of x that match r (continuous stream of text; problem: the number of matching substrings can be quadratic in the length of x, and the results may overlap and nest)
– Restrict the search to linearize the solutions; this is not simple
– The most common restriction is the “leftmost longest match” rule
– Problems: what is the next match? Where should a new search start from?
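The quadratic blow-up mentioned above can be seen on a tiny input. The following is a minimal sketch, not anything from the article: the helper names `all_matching_substrings` and `greedy_scan` are hypothetical, and Python's `re` module is used only as a convenient stand-in. Note that `re` follows a leftmost-first (backtracking) rule rather than a true leftmost-longest rule; for the simple pattern `a+` used here the two coincide.

```python
import re

def all_matching_substrings(pattern, text):
    """Return (start, end) for every non-empty substring of text that matches pattern in full."""
    compiled = re.compile(pattern)
    n = len(text)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n + 1)
            if compiled.fullmatch(text, i, j)]

def greedy_scan(pattern, text):
    """Non-overlapping left-to-right scan, as most search tools report matches."""
    return [m.span() for m in re.finditer(pattern, text)]

text = "aaaa"
# Every substring of "aaaa" matches a+, so there are n(n+1)/2 = 10 results.
print(len(all_matching_substrings("a+", text)))
# The restricted scan collapses them into a single match, (0, 4).
print(greedy_scan("a+", text))
```

The unrestricted result set contains overlapping and nested spans, which is why some linearizing rule is needed before results can be reported one after another.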

On the Use of Regular Expressions for Searching Text
This article proposes an alternative linearizing restriction: “Locate the set of shortest nonnested (but possibly overlapping) strings that each match the pattern.”
Related work: Thompson’s algorithm, Baeza-Yates

On the Use of Regular Expressions for Searching Text
Shortest substring
– Definition of the search problem
– Comparison between longest and shortest match search: shortest-match reports all occurrences of the members of L that are in G(L) and no others; longest-match depends on the entire text. A string may be recognized as a member of a regular language by a single left-to-right scan with constant store. Longest-match does not have such properties.
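To make the comparison concrete, here is a brute-force sketch of the shortest nonnested rule. It is an illustration under stated assumptions, not the single-scan algorithm of the article: it enumerates all matching substrings and keeps those that contain no other matching substring. The function names are hypothetical, and Python's greedy `re.search` stands in for a longest-match search (it is leftmost-first, which agrees with leftmost-longest for this pattern).

```python
import re

def matching_spans(pattern, text):
    """All (start, end) spans of non-empty substrings that match pattern in full."""
    compiled = re.compile(pattern)
    n = len(text)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n + 1)
            if compiled.fullmatch(text, i, j)]

def shortest_nonnested(pattern, text):
    """Keep only the spans that do not contain another matching span."""
    spans = matching_spans(pattern, text)
    return [(i, j) for (i, j) in spans
            if not any((k, l) != (i, j) and i <= k and l <= j
                       for (k, l) in spans)]

def first_greedy(pattern, text):
    """First match reported by a greedy left-to-right search, or None."""
    m = re.search(pattern, text)
    return m.span() if m else None

pattern = "a[ab]*b"
for text in ("ab", "abab"):
    print(text, shortest_nonnested(pattern, text), first_greedy(pattern, text))
```

On `"ab"` both rules report the span (0, 2). After the text grows to `"abab"`, the shortest rule still reports (0, 2) and adds (2, 4), while the greedy search now reports (0, 4): its answer changed because of text that arrived later. Overlapping results remain possible under the shortest rule, for example `shortest_nonnested("aba", "ababa")` keeps both (0, 3) and (2, 5).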

On the Use of Regular Expressions for Searching Text
Explicit containment
– A regular expression may be used to define an explicit universe for search. It is implemented by running two concurrent copies of the algorithm.
Search tool: CGREP was developed on the basis of the theory in this article.

On the Use of Regular Expressions for Searching Text
Concluding comments:
– Explores the properties of the shortest match search rule for regular expressions
– The shortest substring rule provides a precise definition of which strings will be selected during a search, without any dependence on the contents of the remainder of the text
– A single left-to-right scan is enough
– Storage requirements depend on the properties of the regular expression only
– Can define search universes; useful in structured text (no predefined retrieval units)

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Presents algorithms for efficient searching of regular expressions on preprocessed text, using a Patricia tree as a logical model for the index. The algorithms run in logarithmic expected time in the size of the text for some restricted regular expressions, and in sublinear expected time for any regular expression.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Pattern matching: find occurrences of a given pattern in a long string
Variations depend on whether the text is preprocessed and on the language used to specify the query
In this article the authors consider preprocessed text and a query specified by a regular expression
The problem: determine whether the text string t ∈ Σ* q Σ* (q is the query), and report any combination of 1) the location of an occurrence, 2) the number of occurrences, 3) all locations where the pattern occurs

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Main idea: simulation of the finite automaton of the query over a digital tree (or Patricia tree) of the text. Run the automaton on all paths of the digital tree from the root to the leaves, stopping when possible. Time savings come from the fact that each edge of the tree is traversed at most once, and that every edge represents pairs of symbols in many places of the text.

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
– Static databases
– Logical index for text
– Definition of sistrings
– Construction of the text index, which is a binary trie consisting of the set of sistrings of the text
– Use of a Patricia tree to reduce the number of internal nodes
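A small sketch may help with the terms above. A sistring (semi-infinite string) is the text read from some starting position onward. The code below is illustrative only and makes two simplifying assumptions that are not in the slides: it uses a character trie instead of a binary trie, and it appends a `$` terminator so that every sistring ends at its own leaf. All function names are hypothetical. It counts how many internal nodes have a single child, which are the nodes a Patricia tree eliminates.

```python
def sistrings(text):
    """All sistrings of text: the text read from each start position onward."""
    return [text[i:] for i in range(len(text))]

def build_trie(strings):
    """Plain character trie as nested dicts; an empty dict is a leaf."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root

def count_internal(node, branching_only=False):
    """Count internal nodes; with branching_only, count only nodes with 2+ children."""
    if not node:
        return 0  # leaf
    here = 1 if (not branching_only or len(node) >= 2) else 0
    return here + sum(count_internal(child, branching_only)
                      for child in node.values())

text = "aba$"  # "$" is an assumed end marker
trie = build_trie(sistrings(text))
print(sistrings(text))                            # ['aba$', 'ba$', 'a$', '$']
print(count_internal(trie))                       # 6 internal nodes in the plain trie
print(count_internal(trie, branching_only=True))  # 2 would remain after path compression
```

Only the branching nodes carry information needed to tell sistrings apart, so collapsing single-child chains keeps the index size proportional to the number of sistrings rather than to their total length.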

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
General automaton searching
– The authors present an algorithm that can search for arbitrary regular expressions in time sublinear in n on the average. They simulate a DFA in a binary trie built from all the sistrings of a text.
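The idea of simulating a DFA over a trie of sistrings can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes an uncompressed character trie (quadratic in space, where the article uses a binary trie or Patricia tree), a hand-written DFA given as a transition dictionary, and hypothetical function names. Each trie node records the start positions of the sistrings passing through it, so reaching an accepting state reports a whole subtree at once, and a missing transition prunes a whole subtree.

```python
def build_suffix_trie(text):
    """Character trie of all sistrings; each node lists the start positions passing through it."""
    root = {"children": {}, "positions": []}
    for start in range(len(text)):
        node = root
        node["positions"].append(start)
        for ch in text[start:]:
            node = node["children"].setdefault(
                ch, {"children": {}, "positions": []})
            node["positions"].append(start)
    return root

def search(trie, transitions, start_state, accepting):
    """Return the start positions whose sistring has a prefix accepted by the DFA."""
    hits = []
    stack = [(trie, start_state)]
    while stack:
        node, state = stack.pop()
        if state in accepting:
            hits.extend(node["positions"])  # every sistring below this node matches
            continue                        # stop: no need to descend further
        for ch, child in node["children"].items():
            nxt = transitions.get((state, ch))
            if nxt is not None:             # missing transition prunes the subtree
                stack.append((child, nxt))
    return sorted(hits)

# Assumed example DFA for the regular expression ab*c.
dfa = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
trie = build_suffix_trie("abcabbcac")
print(search(trie, dfa, start_state=0, accepting={2}))  # [0, 3, 7]
```

In this example the subtrees under `b` and `c` at the root are never visited, because the automaton has no transition for them from its start state; this pruning is the source of the savings compared with scanning the text.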

Fast Text Searching for Regular Expressions or Automaton Searching on Tries
Concluding comments
– Using a trie or Patricia tree, we can search for many types of string searching queries in logarithmic average time, independently of the size of the answer
– Automaton searching in a trie is sublinear in the size of the text on average for any regular expression
– Worst case of automaton searching is linear (for unusual pieces of text)