Information Retrieval (Recuperació de la informació)

Information Retrieval (Recuperació de la informació) 06/04/2019
- Modern Information Retrieval (1999), Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- Flexible Pattern Matching in Strings (2002), Gonzalo Navarro and Mathieu Raffinot
- Algorithms on Strings (2001), M. Crochemore, C. Hancart and T. Lecroq
http://www-igm.univ-mlv.fr/~lecroq/string/index.html

String Matching
String matching: definition of the problem (text, pattern)
Exact matching: depends on what we have, the text or the patterns
  The patterns ---> Data structures for the patterns
    1 pattern ---> The algorithm depends on |p| and |Σ|
    k patterns ---> The algorithm depends on k, |p| and |Σ|
    Extensions
    Regular Expressions
  The text ---> Data structure for the text (suffix tree, ...)
Approximate matching:
  Dynamic programming
  Sequence alignment (pairwise and multiple)
  Sequence assembly: hash algorithm
  Probabilistic search: Hidden Markov Models

Regular expression
A regular expression ℛ is a string over the set of symbols Σ ∪ { ε, |, ·, *, (, ) } that is recursively defined as follows:
- ε (the empty string) is a regular expression
- A character of Σ is a regular expression
- ( ℛ ) is a regular expression
- ℛ1 · ℛ2 is a regular expression
- ℛ1 | ℛ2 is a regular expression
- ℛ * is a regular expression
(As you have seen this morning.)

Regular language
The language defined by a regular expression ℛ is the set of strings generated by ℛ.
The problem of searching for a regular expression in a text T is to find all the factors (substrings) of T that belong to this language.

Methods
Regular expression ---> Parse tree ---> NFA ---> DFA
DFA ---> Strings found: search with a deterministic finite automaton
NFA ---> Strings found: search with the bit-parallel Thompson automaton

Search with a deterministic finite automaton
Given the regular expression bb*(b|b*a), we first build its NFA.
[Figure: NFA for bb*(b|b*a)]
The text cannot be spelled out directly on the NFA, because several states can be active at the same time, so the NFA is transformed into a DFA.
[Figure: DFA for bb*(b|b*a)]
What is the cost? And what does the search process look like?

Search example with DFA
Given the regular expression bb*(b|b*a) and the DFA obtained from its NFA:
[Figure: DFA for bb*(b|b*a)]
The search on the text: b b b a a b a a b b …
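The NFA-to-DFA step and the search can be sketched in Python as follows. This is only an illustration under stated assumptions: the three-state ε-free NFA below is hand-written for bb*(b|b*a) and is not the automaton drawn on the slide, and the names `NFA`, `build_search_dfa` and `dfa_search` are hypothetical. The alphabet is assumed to be {a, b}.

```python
# Hypothetical epsilon-free NFA for bb*(b|b*a) (an assumption, not the slide's figure):
# state 0 is initial, state 2 is final.
NFA = {(0, "b"): {1}, (1, "b"): {1, 2}, (1, "a"): {2}}
START, FINAL, ALPHABET = 0, 2, "ab"


def build_search_dfa():
    """Subset construction; each DFA state is a frozenset of NFA states."""
    start = frozenset({START})
    dfa = {}
    todo = [start]
    while todo:
        subset = todo.pop()
        if subset in dfa:
            continue
        dfa[subset] = {}
        for char in ALPHABET:
            # START stays active so that an occurrence may begin at any text position.
            target = {START}
            for state in subset:
                target |= NFA.get((state, char), set())
            dfa[subset][char] = frozenset(target)
            todo.append(frozenset(target))
    return start, dfa


def dfa_search(text):
    """Return the 0-based text positions at which an occurrence ends."""
    state, dfa = build_search_dfa()
    ends = []
    for i, char in enumerate(text):
        state = dfa[state][char]  # exactly one active state per character
        if FINAL in state:
            ends.append(i)
    return ends


print(dfa_search("bbbaabaabb"))
```

On the cost question: once the DFA exists, the search does one table lookup per text character, but the subset construction can produce up to 2^m DFA states for an NFA with m states.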

Parse tree
A parse tree is a tree such that:
- internal nodes are labeled by operators
- leaves are labeled by characters of Σ and ε
Each form of regular expression corresponds to a tree:
- ( ℛ ): the tree of ℛ
- ℛ1 · ℛ2: a node "·" with children ℛ1 and ℛ2
- ℛ1 | ℛ2: a node "|" with children ℛ1 and ℛ2
- ℛ *: a node "*" with the single child ℛ

Parse tree: example
Given the regular expression bb*(b|b*a), the parse tree is:
[Figure: parse tree of bb*(b|b*a), with operator nodes ·, | and * and leaves b and a]
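A parse tree like the one in the example can be obtained with a small recursive-descent parser. The sketch below is an assumption-laden illustration, not the slides' method: the function name `parse` and the nested-tuple representation are hypothetical, concatenation is written implicitly (as in bb*) and grouped to the left, and ε and escaped characters are not handled.

```python
def parse(regex):
    """Parse a regular expression into nested tuples.

    Leaves are single characters; internal nodes are
    ('.', left, right), ('|', left, right) or ('*', child).
    """
    pos = 0

    def peek():
        return regex[pos] if pos < len(regex) else None

    def union():
        nonlocal pos
        node = concat()
        while peek() == "|":
            pos += 1
            node = ("|", node, concat())
        return node

    def concat():
        node = star()
        # Anything that is not '|' or ')' starts another concatenated factor.
        while peek() is not None and peek() not in "|)":
            node = (".", node, star())
        return node

    def star():
        nonlocal pos
        node = atom()
        while peek() == "*":
            pos += 1
            node = ("*", node)
        return node

    def atom():
        nonlocal pos
        char = peek()
        if char == "(":
            pos += 1
            node = union()
            if peek() != ")":
                raise ValueError("missing ')'")
            pos += 1
            return node
        if char is None or char in "|)*":
            raise ValueError(f"unexpected input at position {pos}")
        pos += 1
        return char

    tree = union()
    if pos != len(regex):
        raise ValueError(f"unexpected input at position {pos}")
    return tree


print(parse("bb*(b|b*a)"))
```

The three nested functions mirror operator precedence: star binds tighter than concatenation, which binds tighter than union.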

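NFA (Thompson automaton)
From the regular expression or from the parse tree we define the automaton recursively:
- For a character a of Σ: two states joined by a transition labeled a
- For ℛ1 · ℛ2: the automaton of ℛ1 followed by the automaton of ℛ2
- For ℛ1 | ℛ2: the automata of ℛ1 and ℛ2 in parallel, joined by ε-transitions
- For ℛ *: the automaton of ℛ with ε-transitions that allow repeating it or skipping it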

Thompson automaton construction
bb*(b|b*a)
[Figure: the Thompson automaton built step by step from the parse tree of bb*(b|b*a)]
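The construction can be sketched as a recursive walk over the parse tree. This is a minimal sketch under assumptions: the nested-tuple tree format and the function name `thompson` are hypothetical, states are numbered in creation order, and the two branches of the union are listed with b*a first so that the numbering agrees with the character transitions used later in the slides (0 -b-> 1, 2 -b-> 3, 6 -b-> 7, 8 -a-> 9, 10 -b-> 11).

```python
import itertools


def thompson(tree):
    """Return (char_edges, eps_edges, final_state); the initial state is 0."""
    counter = itertools.count(1)  # state 0 is reserved for the initial state
    char_edges = []               # (source, character, target)
    eps_edges = []                # (source, target)

    def build(node, start):
        """Build the fragment of `node` leaving from `start`; return its end state."""
        if isinstance(node, str):  # a single character
            end = next(counter)
            char_edges.append((start, node, end))
            return end
        operator = node[0]
        if operator == ".":        # concatenation: chain the two fragments
            return build(node[2], build(node[1], start))
        if operator == "*":        # star: ε-edges to enter, repeat, skip and leave
            inner_start = next(counter)
            inner_end = build(node[1], inner_start)
            end = next(counter)
            eps_edges.extend([(start, inner_start), (start, end),
                              (inner_end, inner_start), (inner_end, end)])
            return end
        if operator == "|":        # union: one ε-edge into and out of each branch
            branch_ends = []
            for branch in node[1:]:
                branch_start = next(counter)
                eps_edges.append((start, branch_start))
                branch_ends.append(build(branch, branch_start))
            end = next(counter)
            eps_edges.extend((branch_end, end) for branch_end in branch_ends)
            return end
        raise ValueError(f"unknown operator {operator!r}")

    final = build(tree, 0)
    return char_edges, eps_edges, final


# bb*(b|b*a), with the union branches in the order assumed above (b*a first).
TREE = (".", (".", "b", ("*", "b")), ("|", (".", ("*", "b"), "a"), "b"))
char_edges, eps_edges, final = thompson(TREE)
print(char_edges, final)
```

Every character transition goes from a state i to state i + 1, which is the property the bit-parallel simulation relies on when it shifts the state vector.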

NFA: ε-closure (ε-equivalent states)
bb*(b|b*a)
[Figure: Thompson automaton with states 0 to 12; character transitions 0 -b-> 1, 2 -b-> 3, 6 -b-> 7, 8 -a-> 9, 10 -b-> 11]
State   ε-closure
1       1, 2, 4, 5, 6, 8, 10
3       2, 3, 4, 5, 6, 8, 10
4       4, 5, 6, 8, 10
5       5, 6, 8
7       6, 7, 8
9       9, 12
11      11, 12
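The closure table can be recomputed with a breadth-first traversal of the ε-transitions. In this sketch the edge list `EPS` is an assumption: the figure is not recoverable from the transcript, so the ε-edges were inferred from the closure table itself. The name `eps_closure` is hypothetical.

```python
from collections import deque

# ε-transitions of the Thompson automaton for bb*(b|b*a).
# Assumption: inferred from the closure table, not read off the original figure.
EPS = {1: [2, 4], 3: [2, 4], 4: [5, 10], 5: [6, 8], 7: [6, 8], 9: [12], 11: [12]}


def eps_closure(state, eps=EPS):
    """Return the sorted list of states reachable from `state` using only ε-edges."""
    seen = {state}
    queue = deque([state])
    while queue:
        current = queue.popleft()
        for target in eps.get(current, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return sorted(seen)


for state in (1, 3, 4, 5, 7, 9, 11):
    print(state, eps_closure(state))
```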

Bit-parallel Thompson algorithm
bb*(b|b*a)
Text: ababbbaab
[Figure: Thompson automaton with states 0 to 12]
ε-closure table E:
  State   ε-closure
  1       1, 2, 4, 5, 6, 8, 10
  3       2, 3, 4, 5, 6, 8, 10
  4       4, 5, 6, 8, 10
  5       5, 6, 8
  7       6, 7, 8
  9       9, 12
  11      11, 12
The bit-vector D marks the active states. At the beginning only state 0 is active:
          0  1  2  3  4  5  6  7  8  9 10 11 12
  D0      1  0  0  0  0  0  0  0  0  0  0  0  0
At every step we shift D to the right, apply an "and" with the mask of the last character read, and then extend the active states with their ε-closure.
The masks are:
          0  1  2  3  4  5  6  7  8  9 10 11 12
  B[a]    0  0  0  0  0  0  0  0  0  1  0  0  0
  B[b]    0  1  0  1  0  0  0  1  0  0  0  1  0
First step, reading a:
          0  1  2  3  4  5  6  7  8  9 10 11 12
  shift   0  1  0  0  0  0  0  0  0  0  0  0  0
  B[a]    0  0  0  0  0  0  0  0  0  1  0  0  0
  and     0  0  0  0  0  0  0  0  0  0  0  0  0

Bit-parallel Thompson algorithm: trace on the text ababbbaab
Each step takes the current D with state 0 kept active, shifts it to the right, applies the "and" with the mask B of the character read, and extends the result with the ε-closure table E.
Read a:
              0  1  2  3  4  5  6  7  8  9 10 11 12
  D0          1  0  0  0  0  0  0  0  0  0  0  0  0
  shift       0  1  0  0  0  0  0  0  0  0  0  0  0
  B[a]        0  0  0  0  0  0  0  0  0  1  0  0  0
  and         0  0  0  0  0  0  0  0  0  0  0  0  0
  ε-closure   0  0  0  0  0  0  0  0  0  0  0  0  0
Read b:
              0  1  2  3  4  5  6  7  8  9 10 11 12
  D + state 0 1  0  0  0  0  0  0  0  0  0  0  0  0
  shift       0  1  0  0  0  0  0  0  0  0  0  0  0
  B[b]        0  1  0  1  0  0  0  1  0  0  0  1  0
  and         0  1  0  0  0  0  0  0  0  0  0  0  0
  D1          0  1  1  0  1  1  1  0  1  0  1  0  0
Read a:
              0  1  2  3  4  5  6  7  8  9 10 11 12
  D1+state 0  1  1  1  0  1  1  1  0  1  0  1  0  0
  shift       0  1  1  1  0  1  1  1  0  1  0  1  0
  B[a]        0  0  0  0  0  0  0  0  0  1  0  0  0
  and         0  0  0  0  0  0  0  0  0  1  0  0  0
  D2          0  0  0  0  0  0  0  0  0  1  0  0  1
State 12, the final state, is active in D2: an occurrence (ba) ends at this position.
Read b:
              0  1  2  3  4  5  6  7  8  9 10 11 12
  D2+state 0  1  0  0  0  0  0  0  0  0  1  0  0  1
  shift       0  1  0  0  0  0  0  0  0  0  1  0  0
  B[b]        0  1  0  1  0  0  0  1  0  0  0  1  0
  and         0  1  0  0  0  0  0  0  0  0  0  0  0
  ε-closure   0  1  1  0  1  1  1  0  1  0  1  0  0
The remaining characters of the text are processed in the same way.
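The shift, mask and ε-closure steps can be sketched with Python integers used as bit-vectors. This is a sketch under assumptions, not the reference implementation: bit i of the integer stands for state i, so the slides' "shift to the right" (towards higher state numbers) becomes `<< 1`; the closure is applied state by state rather than through a precomputed lookup table; and the names `to_bits`, `extend` and `bit_parallel_search` are hypothetical. The masks B and the table E are copied from the slides.

```python
INITIAL = 1 << 0   # state 0, kept active so a match may start anywhere
FINAL = 1 << 12    # state 12 is the final state


def to_bits(states):
    """Pack a collection of state numbers into an integer (bit i = state i)."""
    value = 0
    for state in states:
        value |= 1 << state
    return value


# Masks B: the states entered by a transition labeled with each character.
B = {"a": to_bits([9]), "b": to_bits([1, 3, 7, 11])}

# ε-closure table E, as given in the slides.
E = {
    1: to_bits([1, 2, 4, 5, 6, 8, 10]),
    3: to_bits([2, 3, 4, 5, 6, 8, 10]),
    4: to_bits([4, 5, 6, 8, 10]),
    5: to_bits([5, 6, 8]),
    7: to_bits([6, 7, 8]),
    9: to_bits([9, 12]),
    11: to_bits([11, 12]),
}


def extend(d):
    """Add the ε-closure of every active state of d."""
    result = d
    for state, closure in E.items():
        if d & (1 << state):
            result |= closure
    return result


def bit_parallel_search(text):
    """Return the 0-based text positions at which an occurrence ends."""
    d = 0
    ends = []
    for i, char in enumerate(text):
        # Shift, "and" with the mask of the character read, then ε-closure.
        d = extend(((d | INITIAL) << 1) & B.get(char, 0))
        if d & FINAL:
            ends.append(i)
    return ends


print(bit_parallel_search("ababbbaab"))
```

The shift works because every character transition of the Thompson automaton goes from a state i to state i + 1; the mask then keeps only the targets whose transition is labeled with the character just read.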