On Learning Regular Expressions and Patterns via Membership and Correction Queries Efim Kinber Sacred Heart University Fairfield, CT, USA.

Slides:



Advertisements
Similar presentations
Exercises for Chapter 2. Summary for the Quiz Some numbers –199  less than 100  55 Conclusion –There are students who did read textbook at least for.
Advertisements

Lecture 24 MAS 714 Hartmut Klauck
Lecture 9,10 Theory of AUTOMATA
CS 461 – Nov. 9 Chomsky hierarchy of language classes –Review –Let’s find a language outside the TM world! –Hints: languages and TM are countable, but.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 2 Mälardalen University 2005.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
YES-NO machines Finite State Automata as language recognizers.
The Power of Correction Queries Cristina Bibire Research Group on Mathematical Linguistics, Rovira i Virgili University Pl. Imperial Tarraco 1, 43005,
1 Section 14.1 Computability Some problems cannot be solved by any machine/algorithm. To prove such statements we need to effectively describe all possible.
1 Undecidability Andreas Klappenecker [based on slides by Prof. Welch]
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture12: Decidable Languages Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture4: Regular Expressions Prof. Amos Israeli.
1 Introduction to Computability Theory Lecture3: Regular Expressions Prof. Amos Israeli.
Introduction to Computability Theory
CSCI 2670 Introduction to Theory of Computing October 19, 2005.
CPSC 411, Fall 2008: Set 12 1 CPSC 411 Design and Analysis of Algorithms Set 12: Undecidability Prof. Jennifer Welch Fall 2008.
1 Lecture 4: Formal Definition of Solvability Analysis of decision problems –Two types of inputs:yes inputs and no inputs –Language recognition problem.
1 Undecidability Andreas Klappenecker [based on slides by Prof. Welch]
Fall 2003Costas Busch - RPI1 Decidable Problems of Regular Languages.
CS5371 Theory of Computation Lecture 6: Automata Theory IV (Regular Expression = NFA = DFA)
Automata & Formal Languages, Feodor F. Dragan, Kent State University 1 CHAPTER 5 Reducibility Contents Undecidable Problems from Language Theory.
1 Module 4: Formal Definition of Solvability Analysis of decision problems –Two types of inputs:yes inputs and no inputs –Language recognition problem.
MA/CSSE 474 Theory of Computation
Prof. Busch - RPI1 Decidable Problems of Regular Languages.
Syntactic Pattern Recognition Statistical PR:Find a feature vector x Train a system using a set of labeled patterns Classify unknown patterns Ignores relational.
CS5371 Theory of Computation Lecture 12: Computability III (Decidable Languages relating to DFA, NFA, and CFG)
Finite State Machines Data Structures and Algorithms for Information Processing 1.
Regular Model Checking Ahmed Bouajjani,Benget Jonsson, Marcus Nillson and Tayssir Touili Moran Ben Tulila
1 Learning languages from bounded resources: the case of the DFA and the balls of strings ICGI -Saint Malo September 2008 de la Higuera, Janodet and Tantini.
Welcome to Honors Intro to CS Theory Introduction to CS Theory (Honors & Traditional): - formalization of computation - various models of computation (increasing.
CS490 Presentation: Automata & Language Theory Thong Lam Ran Shi.
Learning DFA from corrections Leonor Becerra-Bonache, Cristina Bibire, Adrian Horia Dediu Research Group on Mathematical Linguistics, Rovira i Virgili.
Languages & Grammars. Grammars  A set of rules which govern the structure of a language Fritz Fritz The dog The dog ate ate left left.
Learning Automata and Grammars Peter Černo.  The problem of learning or inferring automata and grammars has been studied for decades and has connections.
Welcome to Honors Intro to CS Theory Introduction to CS Theory (Honors & Traditional): - formalization of computation - various models of computation (increasing.
Approximate schemas Michel de Rougemont, LRI, University Paris II.
LING/C SC/PSYC 438/538 Lecture 12 10/4 Sandiway Fong.
1 Prove the following languages over Σ={0,1} are regular by giving regular expressions for them: 1. {w contains two or more 0’s} 2. {|w| = 3k for some.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
Regular Expressions and Languages A regular expression is a notation to represent languages, i.e. a set of strings, where the set is either finite or contains.
Regular Expressions Chapter 6 1. Regular Languages Regular Language Regular Expression Finite State Machine L Accepts 2.
Inferring Finite Automata from queries and counter-examples Eggert Jón Magnússon.
Overview of Previous Lesson(s) Over View  Symbol tables are data structures that are used by compilers to hold information about source-program constructs.
Automata & Formal Languages, Feodor F. Dragan, Kent State University 1 CHAPTER 3 The Church-Turing Thesis Contents Turing Machines definitions, examples,
1 Turing’s Thesis. 2 Turing’s thesis: Any computation carried out by mechanical means can be performed by a Turing Machine (1930)
CS 203: Introduction to Formal Languages and Automata
Recognising Languages We will tackle the problem of defining languages by considering how we could recognise them. Problem: Is there a method of recognising.
Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.
Regular Grammars Reading: 3.3. What we know so far…  FSA = Regular Language  Regular Expression describes a Regular Language  Every Regular Language.
Representing Languages by Learnable Rewriting Systems Rémi Eyraud Colin de la Higuera Jean-Christophe Janodet.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2007.
CSCI 2670 Introduction to Theory of Computing October 13, 2005.
Mathematical Foundations of Computer Science Chapter 3: Regular Languages and Regular Grammars.
1.2 Three Basic Concepts Languages start variables Grammars Let us see a grammar for English. Typically, we are told “a sentence can Consist.
Learning regular tree languages from correction and equivalence queries Cătălin Ionuţ Tîrnăucă Research Group on Mathematical Linguistics, Rovira i Virgili.
CSE 311 Foundations of Computing I Lecture 24 FSM Limits, Pattern Matching Autumn 2011 CSE 3111.
1 Undecidability Andreas Klappenecker [based on slides by Prof. Welch]
1 Chapter 3 Regular Languages.  2 3.1: Regular Expressions (1)   Regular Expression (RE):   E is a regular expression over  if E is one of:
Lecture 02: Theory of Automata:2014 Asif Nawaz Theory of Automata.
Set, Alphabets, Strings, and Languages. The regular languages. Clouser properties of regular sets. Finite State Automata. Types of Finite State Automata.
Languages and Strings Chapter 2. (1) Lexical analysis: Scan the program and break it up into variable names, numbers, etc. (2) Parsing: Create a tree.
General Discussion of “Properties” The Pumping Lemma Membership, Emptiness, Etc.
Implementation of Haskell Modules for Automata and Sticker Systems
Properties of Regular Languages
REGULAR LANGUAGES AND REGULAR GRAMMARS
CSCE 411 Design and Analysis of Algorithms
Recap lecture 29 Example of prefixes of a language, Theorem: pref(Q in R) is regular, proof, example, Decidablity, deciding whether two languages are equivalent.
Decidable Problems of Regular Languages
Instructor: Aaron Roth
Presentation transcript:

On Learning Regular Expressions and Patterns via Membership and Correction Queries Efim Kinber Sacred Heart University Fairfield, CT, USA

Query Learning Model (D. Angluin): L – target language, w – a string Membership queries: “w in L?” Answer: “YES” OR “NO” Correction queries: “w in L?” Answer: “YES” or a correction: a string in L “closest” to w. A learner asks a finite number of queries and returns a description for the target language (e.g., a regular expression, a dfa, a pattern)

Corrections: Not seen in the learning process so far At the shortest Levenshtein edit distance (preference given to corrections of the same length as the queried string) Smallest in the lexicographic order among those satisfying the above conditions (Originally, corrections – shortest extensions of the queried string:  Introduced by L. Beccera-Bonache, A. Dediu, C. Tîrn ă uc ă : learning DFA  C. Tîrn ă uc ă, T. Knuutila: learning k- reversible languages, patterns  Corrections – at the shortest edit distance: L. Beccera-Bonache, C. de la Higuera, J. Janodet, F. Tantini: learning balls of strings )

A Class of Regular Expressions: u 1 (v 1 ) + u 2 (v 2 ) + …u n (v n ) + u n+1 u i, v i – strings over an alphabet Σ. Example: a(aa) + bccd(cd) + ddabb + b(abb) + Two more conditions: Left-aligned: (aa) + abc(cd) + cddab + bb(abb) + No subexpressions of the type ((ab) 3 ) + ((ab) 5 ) +

Learning: Using One correction query “empty string in L?” (to find the shortest string in L) Membership queries (another way to find the shortest string: given an arbitrary string in L, use membership queries)

Example: Target expression: a + a(ab) + ab + (ab) + ab + a(bb) + Shortest string: aaabababababb First step: finding loops of length 1  Query “aaaabababababb in L?” (one extra a in the first block of a-s)  Query “aaabbababababb in L?” (answer is “no”)  Etc. RESULT: a + aabab + abab + abb

Example (continued): Target expression: a + a(ab) + ab + (ab) + ab + a(bb) + Result of Step 1: a + aabab + abab + abb Step 2: finding loops of length 2:  Query “aaabababbabababb in L?” (“yes”) ( “aaababababababb in L?” does not work!) New conjecture: a + a(ab) + ab + abab + abb  Query “aaababbabababbabb in L?” (“yes”) New conjecture: a + a(ab) + ab + (ab) + ab + abb  Etc. Key point: (under certain conditions,) the neighboring loops on the left and/or on the right in the current conjecture must be expanded for the next queried string

Complexity: Running time: O(n 5 ) Number of queries: O(n 3 ) n - the length of the target expression (equivalently, the length of the shortest example)

Discussion A similar algorithm for the following modification: * instead of + one-letter loops only nonempty strings between any two loops (example: a*aab*ac*ca*aa) Similar to a class considered by H. Fernau (learnable in the limit from positive data) Most of the classes of regular expressions considered in literature are deterministic (or one-unambiguous), whereas our class is not

Patterns (introduced by D. Angluin): Σ – an alphabet (of constants) X - a countable (infinite) set of variables A pattern π – a string over ( Σ U X)* Example: 0xx10yxy0z A substitution: 01  x, 1  y, 10  z (all variables are substituted by nonempty strings over Σ) Get the string: The language L(π): all strings obtained by such substitutions

Learning Patterns A hard problem: Membership problem: NP-complete Inclusion problem: undecidable (Equivalence problem: decidable in linear time, but does not help much) LEARNABILITY (under various protocols) extensively studied Recently: C. Tîrn ă uc ă, T. Knuutila: learning from correction queries, where a correction is the shortest extension of the queried string Learnability can be applied for various practical problems, including learning DNA structures (a survey by T. Shinohara, S. Arikawa)

Our Learning Algorithm (using correction queries): On example: x01yzuuuuzzzzvvv  To find a shortest string, query “empty string in L?” Get the correction string:  Query “1 16 in L?” Get the correction string: Constants 0 and 1 have been found. To find variables, let U01UUUUUUUUUUUUU be the current conjecture, where U is a placeholder for a variable

Learning Algorithm (continued): Current conjecture: U01UUUUUUUUUUUUU  Query “ in L?” (trying 1 for the first position). Answer is “yes”. The new conjecture is x01UUUUUUUUUUUUUU  Query “ in L?” (trying 1 for the 4 th postion). Answer is “yes”. The new conjecture is x01yUUUUUUUUUUUUU  Query “ in L?” Answer is “no”. Correction string: (where x, y are substituted by 1)  Query “ in L?” Correction string: The new conjecture: x01yUUUUUUUUUvvv

Learning Algorithm (continued) Target pattern: x01yzuuuuzzzzvvv Current conjecture: x01yUUUUUUUUUUvvv  Query “ in L?” (trying 1 for the 5 th position one more time). Correction string: New conjecture: x01yzUUUUzzzzvvv  Query “ in L?” Correction: (as closer correction strings have been seen). The final result is: x01yzuuuuzzzzvvv

One more example: Target pattern: 10xx Query empty string. Get correction: Query Get correction Conjecture 10UU (U - placeholder) Query The answer is “no”. Correction string must be longer than 4 (as both 1000 and 1011 have already been used). The algorithm terminates and returns 10xx (replacing all placeholders by the same variable)

Complexity  Running time: O(n 3 )  Number of queries: O(n 2 ) n - the length of the target pattern

Discussion  The algorithm can be extended for an arbitrary alphabet Σ  Why corrections of the same length? Strings of arbitrary length are of little help (Lange, Wiehagen)  Requirement of using the lexicographically smallest shorted example can be slightly relaxed  Conjecture: patterns cannot be learned in poly- time using corrections at the shortest edit distance with no constraints