Grammatical inference: learning from text Module X9IT050, Colin de la Higuera, Nantes, 2014

Outline
1 Motivations, definition and difficulties
2 Some negative results
3 Learning k-testable languages from text
4 Learning k-reversible languages from text
5 Conclusions
http://pagesperso.lina.univ-nantes.fr/~cdlh/X9IT050.html
Chapters 8 and 11

Motivations, definition and difficulties

1.1 Identification in the limit
[Diagram: a class of presentations Pres ⊆ ℕ → X, a class of languages ℒ, a class of grammars 𝒢, the function yields from Pres to ℒ, the naming function L from 𝒢 to ℒ, and a learner a that maps finite prefixes of presentations to grammars]
L(a(φ_n)) = yields(φ)
φ(ℕ) = ψ(ℕ) ⇒ yields(φ) = yields(ψ)

1.2 Learning from text
Only positive examples are available (a sample S)
Danger of over-generalization: why not return Σ*?
The problem is “basic”:
Negative examples might not be available
Or they might be heavily biased: near-misses, absurd examples…
Baseline: all the rest is learning with help

1.3 From the sample to the PTA
[Figure: the prefix tree acceptor PTA(S+), with states numbered 1 to 15 and transitions labeled a and b]
S+ = {λ, aaa, aaba, ababa, bb, bbaaa}
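The construction can be sketched in a few lines. This is only an illustration: the function names `build_pta` and `pta_accepts`, and the choice of using the prefixes themselves as state names (with `""` standing for λ), are assumptions made here, not part of the slides.

```python
def build_pta(sample):
    """Build the prefix tree acceptor of a finite set of strings.

    States are the prefixes of the sample (an illustrative encoding);
    the empty string "" plays the role of λ and is the initial state.
    """
    states = {w[:i] for w in sample for i in range(len(w) + 1)}
    # One transition per non-empty prefix, coming from that prefix
    # with its last letter removed.
    delta = {(s[:-1], s[-1]): s for s in states if s}
    finals = set(sample)
    return states, delta, finals


def pta_accepts(pta, w):
    """Run w through the PTA and report whether it ends in a final state."""
    _, delta, finals = pta
    q = ""
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


pta = build_pta({"", "aaa", "aaba", "ababa", "bb", "bbaaa"})
print(len(pta[0]))  # number of states of PTA(S+)
```

The PTA accepts exactly the sample, so it is the most specific automaton consistent with S+.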

1.4 GI as a search problem
[Figure: the search space, with PTA(S+) at one end and the universal automaton (the unigram), which accepts Σ*, at the other]

Some negative results

2.1 The theory
Gold 1967: No super-finite class can be identified from positive examples (or text) only
Necessary and sufficient conditions for learning exist
Literature: inductive inference, ALT series, …

2.2 Limit point
A class ℒ of languages has a limit point if there exists an infinite sequence Ln (n∈ℕ) of languages in ℒ such that L0 ⊊ L1 ⊊ … ⊊ Lk ⊊ …, and there exists another language L∈ℒ such that L = ∪_(n∈ℕ) Ln
L is called a limit point of ℒ

2.3 L is a limit point
[Figure: nested languages L0, L1, L2, L3, …, Li, all contained in L]

2.4 Theorem
If ℒ admits a limit point, then ℒ is not learnable from text
Proof: Let s^i be a presentation in length-lex order for Li, and s be a presentation in length-lex order for L. Then ∀n∈ℕ ∃i : ∀k ≤ n, s^i_k = s_k. Hence no finite prefix of s distinguishes L from every Li.
Note: having a limit point is a sufficient condition for non-learnability, not a necessary condition
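A small worked instance may make the proof idea concrete. The family used here, Li = {a^j : j ≤ i} with limit point L = a*, is chosen for illustration, and the convention of repeating the last string once a finite language is exhausted is an assumption (any way of padding a finite language into an infinite presentation would do).

```python
from itertools import count, islice


def presentation_of_limit():
    """Length-lex presentation of L = a*: λ, a, aa, ..."""
    for j in count():
        yield "a" * j


def presentation_of(i):
    """Length-lex presentation of Li = {a^j : j <= i}.

    A presentation is infinite, so once Li is exhausted the last
    string is repeated (one possible convention among several).
    """
    for j in count():
        yield "a" * min(i, j)


def agree_up_to(n, i):
    """True if the presentations of L and Li coincide on positions 0..n."""
    s = list(islice(presentation_of_limit(), n + 1))
    s_i = list(islice(presentation_of(i), n + 1))
    return s == s_i


# For every n, choosing i = n gives a language Li whose presentation
# is indistinguishable from that of L on the first n + 1 elements.
print(all(agree_up_to(n, n) for n in range(50)))
```

Whatever finite prefix the learner has seen, it is therefore still consistent with some Li as well as with L.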

2.5 Mincons classes
A class is mincons if there is an algorithm which, given a sample S, builds a grammar G∈𝒢 such that S ⊆ L ⊆ L(G) ⇒ L = L(G)
I.e. there is a unique minimum (for inclusion) consistent grammar

2.6 Accumulation point (Kapur 91)
A class ℒ of languages has an accumulation point if there exists an infinite sequence Sn (n∈ℕ) of sets such that S0 ⊆ S1 ⊆ … ⊆ Sn ⊆ …, and L = ∪_(n∈ℕ) Sn ∈ ℒ
…and for any n∈ℕ there exists a language Ln' in ℒ such that Sn ⊆ Ln' ⊊ L
The language L is called an accumulation point of ℒ

2.7 L is an accumulation point
[Figure: nested sets S0, S1, S2, S3, …, Sn inside L, with a language Ln' lying between Sn and L]

2.8 Theorem (for mincons classes)
ℒ admits an accumulation point iff ℒ is not learnable from text

2.9 Infinite Elasticity
If a class of languages has a limit point, there exists an infinite ascending chain of languages L0 ⊊ L1 ⊊ … ⊊ Ln ⊊ …
This property is called infinite elasticity

2.10 Infinite Elasticity
[Figure: a sequence of points x0, x1, x2, x3, …, xi, xi+1, xi+2, xi+3, xi+4, …]

2.12 Theorem (Wright)
If L(𝒢) has finite elasticity and is mincons, then 𝒢 is learnable.

2.13 Tell tale sets
[Figure: a tell-tale set TG and points x1, x2, x3, x4 inside L(G); a language L(G') is marked “Forbidden”]

2.14 Theorem (Angluin)
𝒢 is learnable iff there is a computable partial function ψ: 𝒢×ℕ → Σ* such that:
∀n∈ℕ, ψ(G,n) is defined iff G∈𝒢 and L(G) ≠ ∅
∀G∈𝒢, TG = {ψ(G,n): n∈ℕ} is a finite subset of L(G) called a tell-tale subset
∀G,G'∈𝒢, if TG ⊆ L(G') then L(G') ⊄ L(G) (L(G') is not a proper subset of L(G))

2.15 Proposition (Kapur 91)
A language L in ℒ has a tell-tale subset iff L is not an accumulation point (for mincons classes)

2.16 Summarizing
Many alternative ways of proving that identification in the limit is not feasible
Methodological-philosophical discussion
We still need practical solutions

Learning k-testable languages

Bibliography
P. García and E. Vidal. Inference of k-testable languages in the strict sense and applications to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9):920–925, 1990
P. García, E. Vidal, and J. Oncina. Learning locally testable languages in the strict sense. In Workshop on Algorithmic Learning Theory (ALT 90), pages 325–338, 1990

3.1 Definition
Let k > 0. A k-testable language in the strict sense (k-TSS) is defined by a 5-tuple Zk = (Σ, I, F, T, C) with:
Σ a finite alphabet
I, F ⊆ Σ^(k-1) (allowed prefixes of length k-1 and suffixes of length k-1)
T ⊆ Σ^k (allowed segments)
C ⊆ Σ^(<k) contains the allowed strings of length less than k
Note that I∩F = C∩Σ^(k-1)

3.1 Definition
The k-testable language for Zk is L(Zk) = (IΣ* ∩ Σ*F − Σ*(Σ^k − T)Σ*) ∪ C
Strings (of length at least k) have to use a good prefix and a good suffix of length k-1, and all substrings of length k have to belong to T. Strings of length less than k should be in C
Or: Σ^k − T defines the prohibited segments
Key idea: use a window of size k

3.2 Window languages
We use a window to decide if computation is OK
At any moment, the content of the window has to be permitted

3.3 An example (2-testable)
I = {a}, F = {a}, T = {aa, ab, ba}, C = {λ, a}
[Figure: the corresponding automaton, with states λ, a and b]

3.4 Window languages
By sliding a window of size 2 over a string we can parse:
ababaaababababaaaa OK
aaabbaaaababab not OK
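The sliding-window test can be written directly from the definition of L(Zk). The sketch below is illustrative: the function name `ktss_accepts` and the use of Python sets of strings for I, F, T and C are choices made here, and `""` stands for λ.

```python
def ktss_accepts(w, k, I, F, T, C):
    """Decide whether w belongs to L(Zk) for Zk = (Σ, I, F, T, C)."""
    if len(w) < k:
        # Short strings are listed explicitly in C.
        return w in C
    prefix = w[:k - 1]
    suffix = w[len(w) - (k - 1):]  # written this way so that k = 1 gives ""
    # Slide a window of size k over w; every window must be allowed.
    windows = (w[i:i + k] for i in range(len(w) - k + 1))
    return prefix in I and suffix in F and all(v in T for v in windows)


# The 2-testable example: I = {a}, F = {a}, T = {aa, ab, ba}, C = {λ, a}
Z2 = dict(k=2, I={"a"}, F={"a"}, T={"aa", "ab", "ba"}, C={"", "a"})
print(ktss_accepts("ababaaababababaaaa", **Z2))  # the "OK" string
print(ktss_accepts("aaabbaaaababab", **Z2))      # the "not OK" string
```

The second string fails because the window bb is not in T (and its last letter b is not an allowed suffix).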

3.5 The hierarchy of k-TSS languages
k-TSS(Σ) = {L ⊆ Σ*: L is k-TSS}
All finite languages are in k-TSS(Σ) if k is large enough!
k-TSS(Σ) ⊊ [k+1]-TSS(Σ)
(ba^k)* ∈ [k+1]-TSS(Σ)
(ba^k)* ∉ k-TSS(Σ)
With a window of length only k, if we accept the window a^k, then any longer run of a's passes as well, so a string such as ba^(k+22) is in L

3.6 A language that is not k-testable
[Figure: automaton of a language that is not k-testable]

3.7 K-TSS inference
Given a sample S, a_(k-TSS)(S) = Zk where Zk = (Σ(S), I(S), F(S), T(S), C(S)) and
Σ(S) is the alphabet used in S
C(S) = Σ(S)^(<k) ∩ S
I(S) = Σ(S)^(k-1) ∩ Pref(S)
F(S) = Σ(S)^(k-1) ∩ Suff(S)
T(S) = Σ(S)^k ∩ {v: uvw∈S}

3.8 Example
S = {a, aa, abba, abbbba}
Let k = 3
Σ(S) = {a, b}
I(S) = {aa, ab}
F(S) = {aa, ba}
C(S) = {a, aa}
T(S) = {abb, bbb, bba}
L(a_(3-TSS)(S)) = a + aa + abbb*a (aba is excluded because the segment aba is not in T(S))
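The inference step is a direct reading of a sample into the five components. A minimal sketch follows; the function name `ktss_infer` and the representation of each component as a Python set of strings are assumptions made for illustration.

```python
def ktss_infer(sample, k):
    """Compute (Σ(S), I(S), F(S), T(S), C(S)) from a sample S."""
    sigma = {a for w in sample for a in w}
    # Prefixes and suffixes of length k-1 of the strings that are long enough.
    I = {w[:k - 1] for w in sample if len(w) >= k - 1}
    F = {w[len(w) - (k - 1):] for w in sample if len(w) >= k - 1}
    # All segments of length k occurring in some string of the sample.
    T = {w[i:i + k] for w in sample for i in range(len(w) - k + 1)}
    # Strings of the sample that are shorter than k.
    C = {w for w in sample if len(w) < k}
    return sigma, I, F, T, C


sigma, I, F, T, C = ktss_infer({"a", "aa", "abba", "abbbba"}, 3)
print(sorted(I), sorted(F), sorted(T), sorted(C))
```

On the sample of the example this reproduces the sets I(S), F(S), T(S) and C(S) listed on the slide.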

3.9 Building the corresponding automaton
Each string in I∪C and PREF(I∪C) is a state
Each substring of length k-1 of strings in T is a state
λ is the initial state
Add a transition labeled b from u to ub for each state ub
Add a transition labeled b from au to ub for each aub in T
Each state/substring that is in F is a final state
Each state/substring that is in C is a final state
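These rules can be sketched as follows. The function names and the encoding of states as strings are illustrative assumptions. One reading is also assumed: the rule "from u to ub for each state ub" is applied to the prefix states only (those in PREF(I∪C)), since a state that comes only from a segment of T, such as bb below, has no state b to start from.

```python
def ktss_to_dfa(k, I, F, T, C):
    """Build a DFA (states, delta, finals) from Zk; "" is the initial state λ."""
    prefix_states = {w[:i] for w in I | C for i in range(len(w) + 1)}
    # Each segment of T (length k) has two substrings of length k-1.
    segment_states = {t[:-1] for t in T} | {t[1:] for t in T}
    states = prefix_states | segment_states
    delta = {}
    for s in prefix_states:
        if s:  # transition labeled b from u to ub
            delta[(s[:-1], s[-1])] = s
    for t in T:  # transition labeled b from au to ub, for aub in T
        delta[(t[:-1], t[-1])] = t[1:]
    finals = {s for s in states if s in F or s in C}
    return states, delta, finals


def dfa_accepts(dfa, w):
    _, delta, finals = dfa
    q = ""
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


# Tuple inferred from S = {a, aa, abba, abbbba} with k = 3
dfa = ktss_to_dfa(3, I={"aa", "ab"}, F={"aa", "ba"},
                  T={"abb", "bbb", "bba"}, C={"a", "aa"})
print(sorted(dfa[0]))
```

The resulting automaton has the six states λ, a, aa, ab, bb and ba; it accepts abba and abbbba but rejects aba.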

3.10 Example: the automaton built from S
S = {a, aa, abba, abbbba}
I = {aa, ab}, F = {aa, ba}, T = {abb, bbb, bba}, C = {a, aa}
[Figure: the automaton with states λ, a, aa, ab, bb and ba, and transitions labeled a and b]

3.11 Properties (1)
S ⊆ L(a_(k-TSS)(S))
L(a_(k-TSS)(S)) is the smallest k-TSS language that contains S
If there is a smaller one, some prefix, suffix or substring has to be absent

3.11 Properties (2)
a_(k-TSS) identifies any k-TSS language in the limit from polynomial data
Once all the prefixes, suffixes and substrings have been seen, the correct automaton is returned
If Y ⊆ S, L(a_(k-TSS)(Y)) ⊆ L(a_(k-TSS)(S))

3.11 Properties (3)
L(a_((k+1)-TSS)(S)) ⊆ L(a_(k-TSS)(S))
In I_(k+1) (resp. F_(k+1) and T_(k+1)) there are fewer allowed prefixes (resp. suffixes or substrings) than in I_k (resp. F_k and T_k)
∀k > max{│x│: x∈S}, L(a_(k-TSS)(S)) = S
Because for a large k, T_k(S) = ∅

Learning k-reversible languages from text

Bibliography
D. Angluin. Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3):741–765, 1982

4.1 The k-reversible languages
The class was proposed by Angluin (1982)
The class is identifiable in the limit from text
The class is composed of regular languages that can be accepted by a DFA whose reverse is deterministic with a look-ahead of k

Let A = (Σ, Q, δ, I, F) be an NFA. We denote by A^T = (Σ, Q, δ^T, F, I) the reversal automaton, with:
δ^T(q,a) = {q'∈Q: q∈δ(q',a)}
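As a minimal sketch of this definition, the reversal flips every transition and swaps the initial and final states. The encoding (a dictionary from (state, symbol) to a set of states) and the small example automaton are hypothetical, chosen only for illustration.

```python
def reverse_nfa(delta, initials, finals):
    """Return (delta_T, initials_T, finals_T) for the reversal automaton.

    delta maps (state, symbol) to the set of successor states.
    """
    delta_t = {}
    for (p, a), targets in delta.items():
        for q in targets:
            # q is in δ(p, a), so p is in δ^T(q, a).
            delta_t.setdefault((q, a), set()).add(p)
    # Initial and final states exchange their roles.
    return delta_t, set(finals), set(initials)


# Hypothetical NFA: 0 -a-> 1, 0 -a-> 2, 1 -b-> 2
delta = {(0, "a"): {1, 2}, (1, "b"): {2}}
delta_t, initials_t, finals_t = reverse_nfa(delta, initials={0}, finals={2})
print(delta_t)
```

Reversing twice gives back the original automaton, which is a quick sanity check on the encoding.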

[Figure: an automaton A with states 0 to 4 over {a, b}, and its reversal A^T]

4.2 Some definitions
u is a k-successor of q if │u│ = k and δ(q,u) ≠ ∅
u is a k-predecessor of q if │u│ = k and δ^T(q,u^T) ≠ ∅
λ is 0-successor and 0-predecessor of any state

[Figure: the automaton A, with states 0 to 4 over {a, b}]
aa is a 2-successor of 0 and 1 but not of 3
a is a 1-successor of 3
aa is a 2-predecessor of 3 but not of 1
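Both notions can be computed by enumerating paths of length k. The sketch below uses a small hypothetical automaton (not the one in the figure, whose transitions are not recoverable from the transcript); the function names and the dictionary encoding are likewise assumptions.

```python
def k_successors(delta, q, k):
    """Strings u with |u| = k such that δ(q, u) is not empty."""
    frontier = {("", q)}
    for _ in range(k):
        frontier = {(u + a, r)
                    for (u, p) in frontier
                    for (p2, a), targets in delta.items() if p2 == p
                    for r in targets}
    return {u for u, _ in frontier}


def k_predecessors(delta, q, k):
    """Strings u with |u| = k labeling some path that ends in q."""
    # Build δ^T, read u backwards from q, then mirror the strings found.
    delta_t = {}
    for (p, a), targets in delta.items():
        for r in targets:
            delta_t.setdefault((r, a), set()).add(p)
    return {u[::-1] for u in k_successors(delta_t, q, k)}


# Hypothetical automaton: 0 -a-> 1, 1 -a-> 2, 2 -b-> 2
delta = {(0, "a"): {1}, (1, "a"): {2}, (2, "b"): {2}}
print(k_successors(delta, 0, 2))
print(k_predecessors(delta, 2, 2))
```

With k = 0 both functions return {λ} for every state, in line with the last remark of the definitions.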

4.2 Some definitions
An NFA is deterministic with look-ahead k if ∀q,q'∈Q with q ≠ q':
[(q,q'∈I) ∨ (q,q'∈δ(q'',a))] ∧ (u is a k-successor of q) ∧ (v is a k-successor of q') ⇒ u ≠ v

4.3 Prohibited:
[Figure: two states 1 and 2, each entered by a transition labeled a and each followed by the same string u, with │u│ = k]

4.4 Example:
[Figure: an automaton with states 1 to 4 over {a, b}]
This automaton is not deterministic with look-ahead 1 but is deterministic with look-ahead 2

4.5 K-reversible automata
A is k-reversible if A is deterministic and A^T is deterministic with look-ahead k
[Figure: a two-state automaton that is deterministic, and its reversal, which is deterministic with look-ahead 1]

4.6 Notations
RL(Σ, k) is the set of all k-reversible languages over alphabet Σ
RL(Σ) is the set of all k-reversible languages over alphabet Σ (i.e. for all values of k)
a_(k-RL) is the learning algorithm we describe

4.7 Properties
There are some regular languages that are not in RL(Σ)
RL(Σ, k) ⊋ RL(Σ, k-1)

4.8 Violation of k-reversibility
Two states q, q' violate the k-reversibility condition if
they violate the deterministic condition: q,q'∈δ(q'',a)
or they violate the look-ahead condition:
q,q'∈F, ∃u∈Σ^k: u is k-predecessor of both q and q'
∃u∈Σ^k, δ(q,a) = δ(q',a) and u is k-predecessor of both q and q'

4.9 Learning k-reversible automata
Key idea: the order in which the merges are performed does not matter!
Just merge states that do not comply with the conditions for k-reversibility

4.10 K-RL algorithm (a_(k-RL))
Data: k∈ℕ, S sample of a k-RL language L
A0 = PTA(S)
π = {{q}: q∈Q}
while ∃B,B'∈π k-reversibility violators do
  π = (π − {B, B'}) ∪ {B∪B'}
A = A0/π

4.10 K-RL algorithm (a_(k-RL))
Data: k∈ℕ, S sample of a k-RL language L
A = PTA(S)
while ∃q,q' k-reversibility violators do
  A = merge(A,q,q')

Let S = {a, aa, abba, abbbba}, k = 2
[Figure: PTA(S), with states λ, a, aa, ab, abb, abba, abbb, abbbb, abbbba]
Violators, for u = ba: the final states abba and abbbba

Let S = {a, aa, abba, abbbba}, k = 2
[Figure: the automaton after merging abba and abbbba]
Violators, for u = bb: the states abb and abbbb, which now have the same a-successor

Let S = {a, aa, abba, abbbba}, k = 2
[Figure: the automaton obtained for k = 2, after merging abb and abbbb]
Suppose k = 1. Then now a, aa and abba violate.
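The merging loop can be sketched as below. This is an illustrative implementation under stated assumptions, not the authors' code: blocks of the partition are tracked with a simple parent map, violators are recomputed from scratch after each merge (simple rather than efficient), and the names `k_rl` and `accepts` are made up for this example.

```python
def k_rl(sample, k):
    """State-merging sketch of the k-RL algorithm, starting from PTA(sample)."""
    states = {w[:i] for w in sample for i in range(len(w) + 1)}
    edges = {(s[:-1], s[-1], s) for s in states if s}
    parent = {s: s for s in states}  # partition π as a forest

    def find(s):
        while parent[s] != s:
            s = parent[s]
        return s

    def violators():
        q_edges = {(find(p), a, find(q)) for p, a, q in edges}
        succ = {}
        for p, a, q in q_edges:  # determinism condition
            if succ.setdefault((p, a), q) != q:
                return succ[(p, a)], q
        blocks = sorted({find(s) for s in states})
        preds = {b: {""} for b in blocks}  # 0-predecessors
        for _ in range(k):  # extend to k-predecessors, one letter at a time
            longer = {b: set() for b in blocks}
            for p, a, q in q_edges:
                longer[q] |= {u + a for u in preds[p]}
            preds = longer
        final_blocks = {find(w) for w in sample}
        for i, b1 in enumerate(blocks):  # look-ahead conditions
            for b2 in blocks[i + 1:]:
                if not preds[b1] & preds[b2]:
                    continue
                both_final = b1 in final_blocks and b2 in final_blocks
                same_succ = any(p == b1 and succ.get((b2, a)) == q
                                for (p, a), q in succ.items())
                if both_final or same_succ:
                    return b1, b2
        return None

    pair = violators()
    while pair is not None:
        parent[pair[0]] = pair[1]  # merge the two blocks
        pair = violators()
    delta = {(find(p), a): find(q) for p, a, q in edges}
    return find(""), {find(w) for w in sample}, delta


def accepts(dfa, w):
    q, finals, delta = dfa
    for a in w:
        if (q, a) not in delta:
            return False
        q = delta[(q, a)]
    return q in finals


S = {"a", "aa", "abba", "abbbba"}
print(accepts(k_rl(S, 2), "abbbbbba"), accepts(k_rl(S, 2), "abbba"))
```

For k = 2 the two merges performed are the ones shown on the slides (abba with abbbba, then abb with abbbb), which generalizes the sample to strings with an even, non-zero number of b's between two a's.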

4.11 Properties (1)
∀k ≥ 0, ∀S, a_(k-RL)(S) recognizes a k-reversible language
L(a_(k-RL)(S)) is the smallest k-reversible language that contains S
The class RL(Σ, k) is identifiable in the limit from text

4.11 Properties (2)
A regular language L is k-reversible iff
(u1v)^-1 L ∩ (u2v)^-1 L ≠ ∅ and │v│ = k ⇒ (u1v)^-1 L = (u2v)^-1 L
(if two strings share a suffix v of length k and can be completed into L by a common string, then they are Nerode-equivalent)

4.11 Properties (3)
L(a_(k-RL)(S)) ⊆ L(a_((k-1)-RL)(S))
RL(Σ, k) ⊊ RL(Σ, k+1)

4.11 Properties (4)
The time complexity is O(k║S║^3)
The space complexity is O(║S║)

4.11 Properties (4)
Polynomial aspects
Polynomial characteristic sets
Polynomial update time
But not necessarily a polynomial number of mind changes

4.12 Extensions
Sakakibara built an extension for context-free grammars whose tree language is k-reversible
Marion & Besombes propose an extension to tree languages
Different authors propose to learn these automata and then estimate the probabilities as an alternative to learning stochastic automata

4.13 Exercises
Build a language L that is not k-reversible, ∀k ≥ 0
Prove that the class of all k-reversible languages is not learnable from text
Run a_(k-RL) on S = {aa, aba, abb, abaaba, baaba} for k = 0, 1, 2, 3

4.14 Solution (idea)
Lk = {a^i: i ≤ k}
Then for each k: Lk is k-reversible but not (k-1)-reversible
And ∪Lk = a*
So there is an accumulation point…

4.15 Exercise (1)
Let Jn = {w∈Σ*: │w│ ≤ n} and J = ∪{Jn}
Find an algorithm that identifies J in the limit from text
Prove that this algorithm works in polynomial update time
Prove that it admits a polynomial characteristic set

4.15 Exercise (2)
Let Bn,w = {u∈Σ*: d_edit(u,w) ≤ n} and B = ∪{Bn,w}
Find an algorithm that identifies B in the limit from text

Conclusions

5.1 Conclusions
Window languages are rich families using simple mechanisms
Work has been done by J. Heinz or H. Fernau using alternative mechanisms
