Learning probabilistic finite automata. Colin de la Higuera, University of Nantes. Zadar, August 2010.


Acknowledgements: Laurent Miclet, Jose Oncina, Tim Oates, Rafael Carrasco, Paco Casacuberta, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Thierry Murgue, Franck Thollard, Enrique Vidal, Frédéric Tantini, ... The list is necessarily incomplete; apologies to those who have been forgotten. (Chapters 5 and 16.)

Outline: 1. PFA 2. Distances between distributions 3. FFA 4. Basic elements for learning PFA 5. ALERGIA 6. MDI and DSAI 7. Open questions

PFA: Probabilistic finite (state) automata

Practical motivations (computational biology, speech recognition, web services, automatic translation, image processing…): a lot of positive data; not necessarily any negative data; no ideal target; noise.

The grammar induction problem, revisited. The data consists of positive strings, «generated» following an unknown distribution. The goal is now to find (learn) this distribution, or the grammar/automaton that is used to generate the strings.

Success of the probabilistic models: n-grams, Hidden Markov Models, probabilistic grammars.

[Figure: an example automaton over {a,b}] DPFA: Deterministic Probabilistic Finite Automaton

[Figure: the same DPFA] Pr_A(abab) is computed as the product of the probabilities along the unique path reading abab.


[Figure: an example non-deterministic automaton over {a,b}] PFA: Probabilistic Finite (state) Automaton

[Figure: an automaton with λ-transitions] λ-PFA: Probabilistic Finite (state) Automaton with λ-transitions

How useful are these automata? They can define a distribution over Σ*. They do not tell us if a string belongs to a language. They are good candidates for grammar induction. There is (was?) not that much written theory.

Basic references: the HMM literature; Azaria Paz 1973, Introduction to Probabilistic Automata; Chapter 5 of my book; Probabilistic Finite-State Machines, Vidal, Thollard, cdlh, Casacuberta & Carrasco; Grammatical Inference papers.

Automata, definitions. Let D be a distribution over Σ*: 0 ≤ Pr_D(w) ≤ 1 and Σ_{w∈Σ*} Pr_D(w) = 1.

A Probabilistic Finite (state) Automaton is given by Q, a set of states; I_P : Q → [0;1]; F_P : Q → [0;1]; δ_P : Q × Σ × Q → [0;1].

What does a PFA do? It defines the probability of each string w as the sum (over all paths reading w) of the products of the probabilities: Pr_A(w) = Σ_{π ∈ paths(w)} Pr(π), where π = q_{i0} a_{i1} q_{i1} a_{i2} … a_{in} q_{in} and Pr(π) = I_P(q_{i0}) · F_P(q_{in}) · Π_j δ_P(q_{i(j−1)}, a_{ij}, q_{ij}). Note that if λ-transitions are allowed the sum may be infinite.

Pr(aba) = 0.7*0.4*0.1*1 + 0.7*0.4*0.45*0.2 = 0.028 + 0.0252 = 0.0532. [Figure: the non-deterministic PFA over {a,b} used for this computation]

Non-deterministic PFA: many initial states / only one initial state; λ-PFA: a PFA with λ-transitions and perhaps many initial states; DPFA: a deterministic PFA.

Consistency. A PFA is consistent if Pr_A(Σ*) = 1 and ∀x ∈ Σ*, 0 ≤ Pr_A(x) ≤ 1.

Consistency theorem. A is consistent if every state is useful (accessible and co-accessible) and ∀q ∈ Q, F_P(q) + Σ_{q'∈Q, a∈Σ} δ_P(q, a, q') = 1.
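To make these definitions concrete, here is a minimal Python sketch (not from the slides) of one possible encoding of a PFA, together with a check of the consistency condition above; the dictionary encoding and the toy numbers are illustrative assumptions.

```python
# Hypothetical encoding of a PFA: initial and final probabilities per state,
# transitions stored as (state, symbol, state) -> probability.
pfa = {
    "states": [0, 1],
    "init":  {0: 1.0, 1: 0.0},
    "final": {0: 0.3, 1: 0.8},
    "trans": {(0, "a", 1): 0.5, (0, "b", 0): 0.2,
              (1, "a", 0): 0.1, (1, "b", 1): 0.1},
}

def is_consistent(pfa, eps=1e-9):
    """Check the condition of the theorem: for every state q,
    F_P(q) + sum over a, q' of delta_P(q, a, q') = 1.
    (Usefulness of every state is assumed, not checked.)"""
    for q in pfa["states"]:
        out = sum(p for (src, _, _), p in pfa["trans"].items() if src == q)
        if abs(pfa["final"].get(q, 0.0) + out - 1.0) > eps:
            return False
    return abs(sum(pfa["init"].values()) - 1.0) < eps

print(is_consistent(pfa))  # True for the toy PFA above
```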

Equivalence between models. Equivalence between PFA and HMM… But the HMM usually define distributions over each Σ^n.

[Figure: a football HMM with states win, draw, lose]

Equivalence between PFA with λ-transitions and PFA without λ-transitions (cdlh 2003, Hanneforth & cdlh 2009). Many initial states can be transformed into one initial state with λ-transitions; λ-transitions can be removed in polynomial time. Strategy: number the states, eliminate first the λ-loops, then the λ-transitions with highest-ranking arrival state.

PFA are strictly more powerful than DPFA (folk theorem), and you can't even tell in advance if you are in a good case or not (see Denis & Esposito 2004).

Example. [Figure: a PFA over {a}] This distribution cannot be modelled by a DPFA.

What does a DPFA over Σ = {a} look like? [Figure: a chain-shaped DPFA over {a}] With this architecture you cannot generate the previous distribution.

Parsing issues. Computation of the probability of a string or of a set of strings. Deterministic case: simple, apply the definitions. Technically, rather sum up logs: this is easier, safer and cheaper.

Pr(aba) = 0.7*0.9*0.35*0 = 0; Pr(abb) = 0.7*0.9*0.65*0.3 = 0.12285. [Figure: the DPFA used for these computations]

Non-deterministic case: Pr(aba) = 0.7*0.4*0.1*1 + 0.7*0.4*0.45*0.2 = 0.028 + 0.0252 = 0.0532. [Figure: the non-deterministic PFA used for this computation]

In the literature, the computation of the probability of a string is done by dynamic programming: O(n²m). Two algorithms: Backward and Forward. If we want the most probable derivation to define the probability of a string, then we can use the Viterbi algorithm.

Forward algorithm. A[i,j] = Pr(q_i | a_1..a_j), the probability of being in state q_i after having read a_1..a_j. A[i,0] = I_P(q_i); A[i,j+1] = Σ_{k≤|Q|} A[k,j] · δ_P(q_k, a_{j+1}, q_i); Pr(a_1..a_n) = Σ_{k≤|Q|} A[k,n] · F_P(q_k).
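A small Python sketch of this Forward recurrence, using the hypothetical dictionary encoding of a PFA introduced after the consistency theorem; it is an illustration, not the code behind the slides.

```python
def forward_probability(pfa, w):
    """Forward algorithm: A[q][j] is the probability of being in state q
    after having read the first j symbols of w."""
    states = pfa["states"]
    A = {q: [0.0] * (len(w) + 1) for q in states}
    for q in states:
        A[q][0] = pfa["init"].get(q, 0.0)
    for j, a in enumerate(w):
        for qi in states:
            A[qi][j + 1] = sum(A[qk][j] * pfa["trans"].get((qk, a, qi), 0.0)
                               for qk in states)
    return sum(A[q][len(w)] * pfa["final"].get(q, 0.0) for q in states)

print(forward_probability(pfa, "aba"))  # probability of aba under the toy PFA
```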

Distances. What for? Estimate the quality of a language model; have an indicator of the convergence of learning algorithms; construct kernels.

Entropy. How many bits do we need to correct our model? Two distributions over Σ*: D and D'. Kullback-Leibler divergence (or relative entropy) between D and D': Σ_{w∈Σ*} Pr_D(w) · (log Pr_D(w) − log Pr_D'(w)).

Perplexity. The idea is to allow the computation of the divergence, but relative to a test set S. An approximation (sic) is perplexity: the inverse of the geometric mean of the probabilities of the elements of the test set.

Perplexity(S) = (Π_{w∈S} Pr_D(w))^(−1/|S|) = 1 / (Π_{w∈S} Pr_D(w))^(1/|S|). Problem if some probability is null...
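A short sketch of the perplexity computation on a test set, working in log space as suggested earlier; the function name and interface are assumptions for illustration.

```python
import math

def perplexity(probabilities):
    """Inverse of the geometric mean of the probabilities of the test strings.
    Undefined (here: an error) as soon as one probability is null."""
    if any(p <= 0.0 for p in probabilities):
        raise ValueError("null probability: perplexity is undefined")
    return math.exp(-sum(math.log(p) for p in probabilities) / len(probabilities))

# e.g. perplexity([forward_probability(pfa, w) for w in test_set])
```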

Why multiply? (1) We are trying to compute the probability of independently drawing the different strings of the set S.

Why multiply? (2) Suppose we have two predictors for a coin toss. Predictor 1: heads 60%, tails 40%. Predictor 2: heads 100%. The tests are H: 6, T: 4. Arithmetic mean: P1: 36% + 16% = 0.52; P2: 0.6. Predictor 2 is the better predictor ;-)

Distance d₂: d₂(D, D') = √(Σ_{w∈Σ*} (Pr_D(w) − Pr_D'(w))²). Can be computed in polynomial time if D and D' are given by PFA (Carrasco & cdlh 2002). This also means that equivalence of PFA is in P.

FFA: Frequency Finite (state) Automata

A learning sample is a multiset. Strings appear with a frequency (or multiplicity): S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}.

DFFA. A deterministic frequency finite automaton is a DFA with a frequency function returning a positive integer for every state, every transition, and for entering the initial state, such that in each state the sum of what enters is equal to the sum of what exits, and the sum of what halts is equal to what starts.

Example. [Figure: a DFFA whose states and transitions carry frequencies]

From a DFFA to a DPFA. Frequencies become relative frequencies by dividing by the sum of exiting frequencies. [Figure: the same automaton with each frequency divided by the total leaving its state, e.g. a:5/13, b:5/13, 3/13]
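The same conversion as a Python sketch, on a DFFA stored with the same dictionary layout as before but with integer frequencies; the encoding is an illustrative assumption, not the slides' code.

```python
def dffa_to_dpfa(dffa):
    """Frequencies become relative frequencies: everything leaving a state
    (its transitions and its halting frequency) is divided by the total
    frequency leaving that state."""
    leaving = {q: dffa["final"].get(q, 0) for q in dffa["states"]}
    for (q, _, _), f in dffa["trans"].items():
        leaving[q] += f
    total_init = sum(dffa["init"].values())
    return {
        "states": dffa["states"],
        "init":  {q: f / total_init for q, f in dffa["init"].items()},
        "final": {q: dffa["final"].get(q, 0) / leaving[q]
                  for q in dffa["states"] if leaving[q]},
        "trans": {(q, a, q2): f / leaving[q]
                  for (q, a, q2), f in dffa["trans"].items()},
    }
```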

From a DFA and a sample to a DFFA. S = {λ, aaaa, ab, babb, bbbb, bbbbaa}. [Figure: the DFFA obtained by parsing the sample on the DFA]

Note: another sample may lead to the same DFFA. Doing the same with an NFA is a much harder problem; it is typically what the Baum-Welch (EM) algorithm has been invented for…

The frequency prefix tree acceptor. The data is a multiset. The FTA is the smallest tree-like FFA consistent with the data. It can be transformed into a PFA if needed.

From the sample to the FTA. S = {λ (3), aaa (4), aaba (2), ababa (1), bb (3), bbaaa (1)}. [Figure: FTA(S), the frequency prefix tree acceptor built from S; the root has frequency 14]
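A sketch of how the frequency prefix tree acceptor can be built from a multiset; states are named by prefixes, the multiset is a plain dictionary, and the encoding matches the earlier toy examples (an assumption).

```python
def build_fta(sample):
    """Build FTA(S) from a multiset given as {string: multiplicity}.
    Each prefix of the sample becomes a state of the tree."""
    fta = {"states": [""], "init": {"": sum(sample.values())},
           "final": {}, "trans": {}}
    for w, count in sample.items():
        state = ""
        for a in w:
            nxt = state + a
            if nxt not in fta["states"]:
                fta["states"].append(nxt)
            fta["trans"][(state, a, nxt)] = fta["trans"].get((state, a, nxt), 0) + count
            state = nxt
        fta["final"][state] = fta["final"].get(state, 0) + count
    return fta

sample = {"": 3, "aaa": 4, "aaba": 2, "ababa": 1, "bb": 3, "bbaaa": 1}
fta = build_fta(sample)  # root frequency 14, a-transition of frequency 7 from the root
```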

Red, Blue and White states. [Figure: a partially built automaton with coloured states] Red states are confirmed states; Blue states are the (non-Red) successors of the Red states; White states are the others. Same as with DFA and what RPNI does.

Merge and fold. Suppose we decide to merge a Blue candidate state with a Red state. [Figure: the DFFA before merging]

Merge and fold. First disconnect the Blue state and reconnect its incoming transition to the Red state. [Figure: the automaton after redirection]

Merge and fold. Then fold the subtree rooted in the Blue state into the rest of the automaton. [Figure: the folding step]

Merge and fold: after folding. [Figure: the resulting DFFA, with the frequencies of the folded states summed]
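A rough Python sketch of merge-and-fold on the toy encoding above: redirect the transition entering the Blue state towards the Red state, then recursively fold the Blue state's subtree, summing frequencies; it assumes (as the algorithm maintains) that the Blue state roots a tree.

```python
def merge_and_fold(ffa, red_state, blue_state):
    """Merge blue_state into red_state: reconnect, then fold."""
    for (src, a, dst), f in list(ffa["trans"].items()):
        if dst == blue_state:                     # disconnect and reconnect
            del ffa["trans"][(src, a, dst)]
            ffa["trans"][(src, a, red_state)] = ffa["trans"].get((src, a, red_state), 0) + f
    fold(ffa, red_state, blue_state)

def fold(ffa, p, q):
    """Fold the tree rooted in q onto p, adding up frequencies."""
    ffa["final"][p] = ffa["final"].get(p, 0) + ffa["final"].pop(q, 0)
    for (src, a, dst), f in list(ffa["trans"].items()):
        if src != q:
            continue
        del ffa["trans"][(src, a, dst)]
        target = next((d for (s, b, d) in ffa["trans"] if s == p and b == a), None)
        if target is None:                        # no a-transition from p: attach the subtree
            ffa["trans"][(p, a, dst)] = f
        else:                                     # otherwise add frequencies and keep folding
            ffa["trans"][(p, a, target)] += f
            fold(ffa, target, dst)
    ffa["states"].remove(q)
```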

State merging algorithm:
A = FTA(S); Blue = {δ(q_I, a) : a ∈ Σ}; Red = {q_I}
while Blue ≠ ∅ do
  choose q from Blue such that Freq(q) ≥ t₀
  if ∃p ∈ Red such that d(A_p, A_q) is small
    then A = merge_and_fold(A, p, q)
    else Red = Red ∪ {q}
  Blue = {δ(q, a) : q ∈ Red} − Red
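The same loop as a Python skeleton, reusing build_fta and merge_and_fold from the sketches above; the compatibility test is the statistical part (for instance the Hoeffding-based test sketched after the slides below). Low-frequency Blue states are simply left aside here, whereas the full algorithm treats them specially (see the note on the threshold later).

```python
def state_merging(sample, t0, compatible):
    """Red-blue state-merging skeleton: start from FTA(S), then repeatedly
    try to merge a frequent Blue state with some Red state, or promote it."""
    ffa = build_fta(sample)
    red = [""]                                   # q_I: the root of the FTA
    blue = successors(ffa, red)
    while blue:
        q = max(blue, key=lambda s: freq(ffa, s))
        if freq(ffa, q) < t0:
            break                                # low-frequency states handled specially in the real algorithm
        p = next((r for r in red if compatible(ffa, r, q)), None)
        if p is not None:
            merge_and_fold(ffa, p, q)
        else:
            red.append(q)                        # promotion
        blue = successors(ffa, red)
    return ffa

def successors(ffa, red):
    return [d for (s, _, d) in ffa["trans"] if s in red and d not in red]

def freq(ffa, q):
    return ffa["final"].get(q, 0) + sum(f for (s, _, _), f in ffa["trans"].items() if s == q)
```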

The real question: how do we decide if d(A_p, A_q) is small? Use a distance… Be able to compute this distance; if possible, update the computation easily; have properties related to this distance.

Deciding if two distributions are similar. If the two distributions are known, equality can be tested; the distance (L₂ norm) between distributions can be exactly computed. But what if the two distributions are unknown?

Taking decisions. Suppose we want to merge a Blue state with a Red state. [Figure: the candidate merge in the DFFA]

Taking decisions. Yes, if the two distributions induced by the two states are similar. [Figure: the two states and the sub-automata they root]

Alergia

Alergia's test: D₁ ≡ D₂ if ∀x, Pr_{D₁}(x) = Pr_{D₂}(x). Easier to test: Pr_{D₁}(λ) = Pr_{D₂}(λ) and ∀a ∈ Σ, Pr_{D₁}(aΣ*) = Pr_{D₂}(aΣ*). And do this recursively! Of course, do it on frequencies.

Hoeffding bounds: a bound indicates if the relative frequencies f₁/n₁ and f₂/n₂ are sufficiently close; they are considered compatible when |f₁/n₁ − f₂/n₂| < √(½ ln(2/α)) · (1/√n₁ + 1/√n₂).
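A sketch of the corresponding test in Python, using the standard Hoeffding-style bound of Alergia (assumed here to be the formula above) and the freq helper from the skeleton earlier; the recursion on the successors is only indicated in a comment.

```python
import math

def hoeffding_different(f1, n1, f2, n2, alpha):
    """Declare f1/n1 and f2/n2 different when their gap exceeds
    sqrt(ln(2/alpha) / 2) * (1/sqrt(n1) + 1/sqrt(n2))."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (1 / math.sqrt(n1) + 1 / math.sqrt(n2))
    return abs(f1 / n1 - f2 / n2) > bound

def alergia_compatible(ffa, p, q, alpha=0.05):
    """Compare the halting frequencies and, for every symbol a, the frequencies
    of leaving by a; the full test then recurses on the a-successors of p and q."""
    n1, n2 = freq(ffa, p), freq(ffa, q)
    if hoeffding_different(ffa["final"].get(p, 0), n1, ffa["final"].get(q, 0), n2, alpha):
        return False
    for a in {sym for (_, sym, _) in ffa["trans"]}:
        fp = sum(f for (s, b, _), f in ffa["trans"].items() if s == p and b == a)
        fq = sum(f for (s, b, _), f in ffa["trans"].items() if s == q and b == a)
        if hoeffding_different(fp, n1, fq, n2, alpha):
            return False
    return True
```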

A run of Alergia. Our learning multisample: S = {λ (490), a (128), b (170), aa (31), ab (42), ba (38), bb (14), aaa (8), aab (10), aba (10), abb (4), baa (9), bab (4), bba (3), bbb (6), aaaa (2), aaab (2), aaba (3), aabb (2), abaa (2), abab (2), abba (2), abbb (1), baaa (2), baab (2), baba (1), babb (1), bbaa (1), bbab (1), bbba (1), aaaaa (1), aaaab (1), aaaba (1), aabaa (1), aabab (1), aabba (1), abbaa (1), abbab (1)}

Parameter α is set arbitrarily. We choose 30 as the value of the threshold t₀. Note that for the Blue states which have a frequency less than the threshold, a special merging operation takes place.

[Figure: FTA(S), the frequency prefix tree acceptor of the sample, with frequencies on the transitions; 1000 strings in total]

Can we merge λ and a? Compare λ with a, aΣ* with aaΣ*, bΣ* with abΣ*: 490/1000 with 128/257, 257/1000 with 64/257, 253/1000 with 65/257, … All tests return true.

Merge… [Figure: merging λ and a in the FTA]

And fold. [Figure: the automaton after folding]

Next merge: λ with b? [Figure: the current automaton]

Can we merge λ and b? Compare λ with b, aΣ* with baΣ*, bΣ* with bbΣ*. 660/1341 and 225/340 are different (giving 0.162). On the other hand…

Promotion. [Figure: the Blue state b is promoted to Red]

Merge… [Figure: the next candidate merge]

And fold. [Figure: the automaton after folding]

Merge… [Figure: another merge]

And fold. [Figure: the final DFFA (a: 354, a: 96, b: 351, b: 49) and, as a PFA, the corresponding relative frequencies (a: 0.354, a: 0.096, b: 0.351, b: 0.049)]

Conclusion and logic: Alergia builds a DFFA in polynomial time; Alergia can identify DPFA in the limit with probability 1; there is no good definition of Alergia's properties.

DSAI and MDI: why not change the criterion?

Criterion for DSAI: using a distinguishable string. Use the L∞ norm. Two distributions are different if there is a string with a very different probability; such a string is called μ-distinguishable. The question becomes: is there a string x such that |Pr_{A,q}(x) − Pr_{A,q'}(x)| > μ?
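A tiny sketch of this L∞ check, under the simplifying assumption that the candidate strings form a finite set and that the two state-induced distributions are given as dictionaries.

```python
def mu_distinguishable(pr_q, pr_q2, mu):
    """Is there a string whose probabilities under the two states differ by more than mu?"""
    return any(abs(pr_q.get(x, 0.0) - pr_q2.get(x, 0.0)) > mu
               for x in set(pr_q) | set(pr_q2))
```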

(Much more to DSAI.) D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic finite automata. In Proceedings of COLT 1995, pages 31–40. PAC learnability results, in the case where the targets are acyclic graphs.

Criterion for MDI: an MDL-inspired heuristic. The criterion is: does the reduction of the size of the automaton compensate for the increase in perplexity? F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proceedings of the 17th International Conference on Machine Learning, pages 975–982. Morgan Kaufmann, San Francisco, CA, 2000.

Conclusion and open questions

A good candidate to learn NFA is DEES. There has never been a challenge, so the state of the art is still unclear. There is a lot of room for improvement towards probabilistic transducers and probabilistic context-free grammars.

Appendix: Stern-Brocot trees and the identification of probabilities. If we were able to discover the structure, how do we identify the probabilities?

By estimation: the edge is used 1501 times out of 3000 passages through the state, so its probability is estimated as 1501/3000. [Figure: a state with frequency 3000 and an outgoing a-edge]

Stern-Brocot trees (Stern 1858, Brocot 1860). They can be constructed from two simple adjacent fractions by the «mean» (mediant) operation: a/b and c/d give m = (a+c)/(b+d).

[Figure: the Stern-Brocot tree of fractions]

Zadar, August Idea: Instead of returning c(x)/n, search the Stern-Brocot tree to find a good simple approximation of this value.

Iterated logarithm: with probability 1, for a co-finite number of values of n we have |c(x)/n − a| < √(b · log log n / n), for b > 1.