1 Grammatical inference vs. grammar induction. London, June 2007. Colin de la Higuera
cdlh 2 Summary 1. Why study the algorithms and not the grammars 2. Learning in the exact setting 3. Learning in a probabilistic setting
cdlh 3 1 Why study the process and not the result? The usual approach in grammatical inference is to build a grammar (automaton), small and adapted in some way to the data from which we are supposed to learn.
cdlh 4 Grammatical inference Is about learning a grammar given information about a language.
cdlh 5 Grammar induction Is about learning a grammar given information about a language.
cdlh 6 Difference? [Diagram: data → grammar G, with "grammar induction" pointing at the resulting grammar G and "grammatical inference" pointing at the arrow, i.e. the process.]
cdlh 7 Motivating* example #1 Is 17 a random number? Is 17 more random than 25? Suppose I had a random number generator, would I convince you by showing how well it does on an example? On various examples? *(and only slightly provocative)
cdlh 8 Motivating example #2 Is … a random sequence? What about aaabaaabababaabbba?
cdlh 9 Motivating example #3 Let X be a sample of strings. Is grammar G the correct grammar for sample X? Or is it G’? “Correct” here means something like “the one we should learn”.
cdlh 10 Back to the definition Grammar induction and grammatical inference are about finding a/the grammar from some information about the language. But once we have done that, what can we say?
cdlh 11 What would we like to say? That the grammar is the smallest, or the best with respect to some score: a combinatorial characterisation. What we really want to say is that, having solved some complex combinatorial question, we have an Occam / compression / MDL / Kolmogorov-like argument proving that what we have found is of interest.
cdlh 12 What else might we like to say? That in the near future, given some string, we can predict if this string belongs to the language or not. It would be nice to be able to bet £100 on this.
cdlh 13 What else would we like to say? That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased). Idea: blame the data, not the algorithm.
cdlh 14 Suppose we cannot say anything of the sort? Then that means that we may be terribly wrong even in a favourable setting.
cdlh 15 Motivating example #4 Suppose we have an algorithm that ‘learns’ a grammar by applying the following two operations iteratively: merge two non-terminals whenever some nice MDL-like rule holds; add a new non-terminal and a rule corresponding to a substring when needed.
cdlh 16 Two learning operators. Creation of non-terminals and rules: NP → ART ADJ NOUN, NP → ART ADJ ADJ NOUN becomes NP → ART AP1, NP → ART ADJ AP1, AP1 → ADJ NOUN
cdlh 17 Merging two non-terminals: NP → ART AP1, NP → ART AP2, AP1 → ADJ NOUN, AP2 → ADJ AP1 becomes NP → ART AP1, AP1 → ADJ NOUN, AP1 → ADJ AP1
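To make the two operators concrete, here is a minimal Python sketch, my own and not the talk's implementation, over a grammar stored as a dictionary mapping non-terminals to sets of right-hand sides; the encoding and the function names are assumptions, while the NP/ART/ADJ/NOUN/AP1/AP2 symbols are those of the two slides above.

```python
def create_nonterminal(grammar, pattern, new_nt):
    """Replace every occurrence of `pattern` (a tuple of symbols) in the
    right-hand sides by `new_nt`, and add the rule new_nt -> pattern."""
    def rewrite(rhs):
        out, i, k = [], 0, len(pattern)
        rhs = list(rhs)
        while i < len(rhs):
            if tuple(rhs[i:i + k]) == pattern:
                out.append(new_nt)
                i += k
            else:
                out.append(rhs[i])
                i += 1
        return tuple(out)

    new = {nt: {rewrite(r) for r in rhss} for nt, rhss in grammar.items()}
    new[new_nt] = new.get(new_nt, set()) | {pattern}
    return new


def merge_nonterminals(grammar, keep, drop):
    """Merge `drop` into `keep`: rename it everywhere and pool the rules."""
    def rename(sym):
        return keep if sym == drop else sym

    new = {}
    for nt, rhss in grammar.items():
        new.setdefault(rename(nt), set()).update(
            tuple(rename(s) for s in rhs) for rhs in rhss)
    return new


if __name__ == "__main__":
    g = {"NP": {("ART", "ADJ", "NOUN"), ("ART", "ADJ", "ADJ", "NOUN")}}
    g = create_nonterminal(g, ("ADJ", "NOUN"), "AP1")   # slide 16
    g = create_nonterminal(g, ("ADJ", "AP1"), "AP2")    # left-hand side of slide 17
    g = merge_nonterminals(g, "AP1", "AP2")             # right-hand side of slide 17
    for nt, rhss in g.items():
        for rhs in sorted(rhss):
            print(nt, "->", " ".join(rhs))
```

Running the demo reproduces the grammars of slides 16 and 17, ending with the right-recursive AP1 → ADJ AP1, which is the kind of recursion that only yields regular structure, as the next slide points out.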
cdlh 18 What is bound to happen? We will learn a context-free grammar that can only generate a regular language. Brackets are not found. This is a hidden bias.
cdlh 19 But how do we say that a learning algorithm is good? By accepting the existence of a target. The question is that of studying the process of finding this target (or something close to this target). This is an inference process.
cdlh 20 If you don’t believe there is a target? Or believe that the target belongs to another class? Then you will have to come up with another bias: for example, believing that simplicity (e.g. MDL) is the correct way to handle the question.
cdlh 21 If you are prepared to accept there is a target but… Either the target is known, and then what is the point of learning? Or we don’t know it in the practical case (with this data set) and it is of no use…
cdlh 22 Then you are doing grammar induction.
cdlh 23 Careful Some statements that are dangerous: “Algorithm A can learn {a^n b^n c^n : n ∈ ℕ}”; “Algorithm B can learn this rule with just 2 examples”. Looks to me close to wanting a free lunch.
cdlh 24 A compromise You only need to believe there is a target while evaluating the algorithm. Then, in practice, there may not be one!
cdlh 25 End of provocative example If I run my random number generator and get …, I can only keep this number if I believe in the generator itself.
cdlh 26 Credo (1) Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.
cdlh 27 Credo (2) Typical can be: in the limit (learning is always achieved, one day); probabilistic: either there is a distribution to be used (errors are measurably small), or there is a distribution to be found.
cdlh 28 Credo (3) Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors… should be measured and limited.
cdlh 29 2 Non-probabilistic setting Identification in the limit Resource-bounded identification in the limit Active learning (query learning)
cdlh 30 Identification in the limit. The definitions, presentations. The alternatives: order-free or not, randomised algorithms.
cdlh 31 A presentation is a function f : ℕ → X, where X is any set. yields : Presentations → Languages. If f(ℕ) = g(ℕ) then yields(f) = yields(g).
cdlh 32 Some presentations (1) A text presentation of a language L ⊆ Σ* is a function f : ℕ → Σ* such that f(ℕ) = L. f is an infinite succession of all the elements of L. (Note: small technical difficulty with the empty language.)
cdlh 33 Some presentations (2) An informed presentation (or an informant) of L ⊆ Σ* is a function f : ℕ → Σ* × {−, +} such that f(ℕ) = (L × {+}) ∪ ((Σ* \ L) × {−}). f is an infinite succession of all the elements of Σ*, labelled to indicate whether or not they belong to L.
cdlh 34 Learning function Given a presentation f, f_n is the set of the first n elements in f. A learning algorithm a is a function that takes as input a set f_n = {f(0), …, f(n−1)} and returns a grammar. Given a grammar G, L(G) is the language generated/recognised/represented by G.
cdlh 35 Identification in the limit. Pres ⊆ (ℕ → X) is the class of presentations, L a class of languages, G a class of grammars, a a learner, and yields the naming function from presentations to languages, with f(ℕ) = g(ℕ) ⇒ yields(f) = yields(g). Identification in the limit requires, for every presentation f: ∃n ∈ ℕ : ∀k > n, L(a(f_k)) = yields(f).
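As a toy illustration of this definition (mine, not from the talk), the class of finite languages is identifiable in the limit from text by the learner that conjectures exactly the set of strings seen so far; the sketch below assumes a finite prefix of a presentation given as a Python list.

```python
def learner(sample):
    """a(f_n): a 'grammar' for the finite language seen so far."""
    return frozenset(sample)

def hypotheses(presentation_prefix):
    """Run the learner on every prefix f_1, f_2, ... of the given prefix of f."""
    return [learner(presentation_prefix[:n])
            for n in range(1, len(presentation_prefix) + 1)]

# A text presentation of L = {a, ab, abb}: every element of L appears at least
# once, and the presentation then keeps repeating elements of L forever.
f_prefix = ["ab", "a", "ab", "abb", "a", "abb", "a"]
print(hypotheses(f_prefix)[-1])   # once all of L has appeared, a(f_k) = L for all k
```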
cdlh 36 What about efficiency? We can try to bound: global time, update time, errors before converging, mind changes, queries, good examples needed.
cdlh 37 What should we try to measure? The size of G? The size of L? The size of f? The size of f_n?
cdlh 38 Some candidates for polynomial learning: total runtime polynomial in ‖L‖; update runtime polynomial in ‖L‖; # mind changes polynomial in ‖L‖; # implicit prediction errors polynomial in ‖L‖; size of characteristic sample polynomial in ‖L‖.
cdlh 39 [Diagram: the learner a maps each prefix f_1, f_2, …, f_n of the presentation to hypotheses G_1, G_2, …, G_n; after the convergence point, every longer prefix f_k still yields G_n.]
cdlh 40 Some selected results (1): DFA, from text / from informant. Runtime: no. Update-time: no / yes. #IPE: no / no. #MC: no / ?. CS: no / yes.
cdlh 41 Some selected results (2): CFG, from text / from informant. Runtime: no. Update-time: no / yes. #IPE: no / no. #MC: no / ?. CS: no / no.
cdlh 42 Some selected results (3): good balls, from text / from informant. Runtime: no. Update-time: yes. #IPE: yes / no. #MC: yes / no. CS: yes.
cdlh 43 3 Probabilistic setting Using the distribution to measure error Identifying the distribution Approximating the distribution
cdlh 44 Probabilistic settings PAC learning Identification with probability 1 PAC learning distributions
cdlh 45 Learning a language from sampling. We have a distribution over Σ*. We sample twice: once to learn, once to see how well we have learned. The PAC setting: probably approximately correct.
cdlh 46 PAC learning (Valiant 84, Pitt 89) L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
cdlh 47 Polynomially PAC learnable There is an algorithm that samples reasonably and returns, with probability at least 1 − δ, a grammar that will make at most ε errors.
cdlh 48 Results Under cryptographic assumptions, we cannot PAC learn DFA. We cannot PAC learn NFA or CFGs with membership queries either.
cdlh 49 Learning distributions No error Small error
cdlh 50 No error This calls for identification in the limit with probability 1. Means that the probability of not converging is 0.
cdlh 51 Results If probabilities are computable, we can learn finite state automata with probability 1. But not with bounded (polynomial) resources.
cdlh 52 With error: the PAC definition. But error should be measured by a distance between the target distribution and the hypothesis: L_1, L_2, L_∞?
cdlh 53 Results Too easy with L_∞, too hard with L_1. Nice algorithms for biased classes of distributions.
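For concreteness, a small sketch (my own toy code, with distributions assumed to be finite dictionaries from strings to probabilities) of the L_1 and L_∞ distances in question:

```python
def l1(d1, d2):
    """Sum of absolute differences over the joint support."""
    support = set(d1) | set(d2)
    return sum(abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in support)

def linf(d1, d2):
    """Largest absolute difference on any single string."""
    support = set(d1) | set(d2)
    return max(abs(d1.get(x, 0.0) - d2.get(x, 0.0)) for x in support)

target = {"a": 0.5, "ab": 0.3, "abb": 0.2}
hypothesis = {"a": 0.6, "ab": 0.3, "b": 0.1}
print(l1(target, hypothesis), linf(target, hypothesis))
```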
cdlh 54 For those who are not convinced there is a difference…
cdlh 55 Structural completeness. Given a sample and a DFA: each edge is used at least once; each final state accepts at least one string. Look only at DFAs for which the sample is structurally complete!
cdlh 56 Not structurally complete: X+ = {aab, b, aaaba, bbaba}. [Figure: a DFA over {a, b}; to make the sample structurally complete, add … and abb.]
cdlh 57 Question Why is the automaton structurally complete for the sample? And not the sample structurally complete for the automaton?
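A possible checker for structural completeness, assuming a DFA encoded as a transition dictionary and a set of final states; the encoding and the function name are mine, not the talk's.

```python
def structurally_complete(x_plus, start, delta, finals):
    """delta: dict mapping (state, symbol) -> state; finals: set of states.
    Checks that the sample exercises every edge and reaches every final state."""
    used_edges, used_finals = set(), set()
    for w in x_plus:
        state = start
        for c in w:
            if (state, c) not in delta:      # the string falls off the DFA
                break
            used_edges.add((state, c))
            state = delta[(state, c)]
        else:
            if state in finals:
                used_finals.add(state)
    return used_edges == set(delta) and used_finals == set(finals)

# Tiny usage example on a two-state DFA accepting strings that end in b:
delta = {(0, 'a'): 0, (0, 'b'): 1, (1, 'a'): 0, (1, 'b'): 1}
print(structurally_complete({"ab", "b", "bab", "bb"}, 0, delta, {1}))   # True
```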
cdlh 58 Some of the many things I have not talked about: grammatical inference is about new algorithms; grammatical inference is applied to various fields: pattern recognition, machine translation, computational biology, NLP, software engineering, web mining, robotics…
cdlh 59 And… Next ICGI in Brittany in 2008. Some references in the 1-page abstract, others on the grammatical inference webpage.
cdlh 60 Appendix: some technicalities. Size of G, size of L, #MC, PAC, size of f, #IPE, runtimes, #CS.
cdlh 61 The size of G: ‖G‖. The size of a grammar is the number of bits needed to encode the grammar. Better: some value polynomial in the desired quantity. Examples: DFA: number of states; CFG: number of rules × length of the rules; …
cdlh 62 The size of L If no grammar system is given, this is meaningless. If G is the class of grammars, then ‖L‖ = min{‖G‖ : G ∈ G, L(G) = L}. Example: the size of a regular language when considering DFA is the number of states of the minimal DFA that recognizes it.
cdlh 63 Is a grammar representation reasonable? Difficult question: typical arguments are that NFA are better than DFA because you can encode more languages with fewer bits. Yet redundancy is necessary!
cdlh 64 Proposal A grammar class is reasonable if it encodes sufficiently many different languages. I.e. with n bits you have 2^(n+1) encodings, so optimally you should have 2^(n+1) different languages. Allow for redundancy and syntactic sugar, so p(2^(n+1)) different languages.
cdlh 65 But We should allow for redundancy and for some strings that do not encode grammars. Therefore a grammar representation is reasonable if there exists a polynomial p() such that, for any n, the number of different languages encoded by grammars of size n is at least p(2^n).
cdlh 66 The size of a presentation f Meaningless, or at least no convincing definition comes up. But when associated with a learner a we can define the convergence point Cp(f, a), which is the point at which the learner a finds a grammar for the correct language L and does not change its mind: Cp(f, a) = the smallest n such that ∀m ≥ n, a(f_m) = a(f_n) and L(a(f_n)) = L.
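A sketch of how one could check the convergence point on a finite prefix of a presentation; the learner a and the helper language_of are assumed to be supplied, and on an infinite presentation Cp(f, a) remains a definition rather than something computable.

```python
def convergence_point(a, language_of, prefix, L):
    """Smallest n such that a(f_m) = a(f_n) for all m >= n within this prefix
    and L(a(f_n)) = L; None if the learner has not converged yet."""
    hyps = [a(prefix[:n]) for n in range(len(prefix) + 1)]
    for n, h in enumerate(hyps):
        if language_of(h) == L and all(later == h for later in hyps[n:]):
            return n
    return None
```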
cdlh 67 The size of a finite presentation f_n An easy attempt is n. But this does not represent the quantity of information we have received to learn. A better measure is Σ_{i < n} |f(i)|.
cdlh 68 Quantities associated with learner a The update runtime: time needed to update hypothesis h_{n−1} into h_n when presented with f(n). The complete runtime: time needed to build hypothesis h_n from f_n; also the sum of all update runtimes.
cdlh 69 Definition 1 (total time) G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), Cp(f, a) ≤ p(‖G‖) (or global-runtime(a) ≤ p(‖G‖)).
cdlh 70 Impossible Just take some presentation that stays useless until the bound is reached and then starts helping.
cdlh 71 Definition 2 (update polynomial time) G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), update-runtime(a) ≤ p(‖G‖).
cdlh 72 Doesn’t work We can just defer identification: here we are only measuring the time it takes to build the next hypothesis.
cdlh 73 Definition 4: polynomial number of mind changes G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), #{i : a(f_i) ≠ a(f_{i+1})} ≤ p(‖G‖).
cdlh 74 Definition 5: polynomial number of implicit prediction errors Say that G errs on x if G is incorrect with respect to an element x of the presentation (i.e. the algorithm producing G has made an implicit prediction error).
cdlh 75 G is polynomially identifiable in the limit from Pres if there exists an identification algorithm a and a polynomial p() such that, given any G in G and any presentation f such that yields(f) = L(G), #{i : a(f_i) errs on f(i+1)} ≤ p(‖G‖).
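Definitions 4 and 5 amount to simple bookkeeping over a finite run. The sketch below is mine, with the learner a and the membership test accepts assumed to be given, and a prefix of an informed presentation represented as a list of (string, label) pairs.

```python
def count_mc_and_ipe(a, accepts, prefix):
    """Count mind changes (Definition 4) and implicit prediction errors
    (Definition 5) of learner `a` along a finite informed-presentation prefix."""
    mind_changes = ipe = 0
    previous = None
    for n, (w, label) in enumerate(prefix):
        hypothesis = a(prefix[:n])          # a(f_n), built before seeing f(n)
        if previous is not None and hypothesis != previous:
            mind_changes += 1               # Definition 4 counts these changes
        if accepts(hypothesis, w) != (label == '+'):
            ipe += 1                        # Definition 5: a(f_n) errs on f(n)
        previous = hypothesis
    return mind_changes, ipe
```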
cdlh 76 Definition 6: polynomial characteristic sample G has polynomial characteristic samples for identification algorithm a if there exists a polynomial p() such that: given any G in G, there exists Y, a correct sample for G, such that whenever Y ⊆ f_n, L(a(f_n)) = L(G), and ‖Y‖ ≤ p(‖G‖).
cdlh 77 3 Probabilistic setting Using the distribution to measure error Identifying the distribution Approximating the distribution
cdlh 78 Probabilistic settings PAC learning Identification with probability 1 PAC learning distributions
cdlh 79 Learning a language from sampling. We have a distribution over Σ*. We sample twice: once to learn, once to see how well we have learned. The PAC setting.
cdlh 80 How do we consider a finite set? Restrict Σ* to Σ^{≤m} and the distribution D to D_{≤m}, choosing m so that Pr_D(|x| > m) < ε. By sampling 1/ε · ln(1/δ) examples we can find a safe m.
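A minimal sketch of this trick, assuming access to a sampling routine draw() for D; the toy distribution at the end is only a placeholder.

```python
import math, random

def safe_max_length(draw, eps, delta):
    """Draw n = ceil((1/eps) ln(1/delta)) strings from D and return the longest
    length seen; with probability at least 1 - delta, Pr_D(|x| > m) < eps."""
    n = math.ceil(math.log(1.0 / delta) / eps)
    return max(len(draw()) for _ in range(n))

# Toy distribution over {a}* with (roughly) geometric lengths, for illustration only.
draw = lambda: "a" * min(int(random.expovariate(0.5)), 50)
print(safe_max_length(draw, eps=0.1, delta=0.05))
```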
cdlh 81 PAC learning (Valiant 84, Pitt 89) L a set of languages, G a set of grammars, ε > 0 and δ > 0, m a maximal length over the strings, n a maximal size of grammars.
cdlh 82 H is ε-AC (approximately correct) if Pr_D[H(x) ≠ G(x)] < ε.
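A small sketch (hypothetical interfaces, not the talk's code) of the second sampling phase mentioned on slides 45 and 79: estimate Pr_D[H(x) ≠ G(x)] on a fresh sample and compare it to ε.

```python
def empirical_error(draw, in_H, in_G, n=10000):
    """draw(): one string sampled from D; in_H, in_G: membership predicates.
    Returns the fraction of the fresh sample on which H and G disagree."""
    return sum(in_H(x) != in_G(x) for x in (draw() for _ in range(n))) / n
```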
cdlh 83 [Figure: the languages L(G) and L(H); the errors are the strings on which they disagree.] Errors: we want L_1(D(G), D(H)) < ε.
cdlh 84–88 [Figures: example automata over {a, b}; one slide computes Pr(abab) on a probabilistic automaton.]