Incrementally Learning Parameters of Stochastic CFGs using Summary Statistics
Written by: Brent Heeringa and Tim Oates
Goals: to learn the syntax of utterances
Approach: SCFG (Stochastic Context-Free Grammar) M = (V, E, R, S), where:
- V: finite set of non-terminals
- E: finite set of terminals
- R: finite set of rules, each rule r carrying a probability p(r); the probabilities of all rules sharing the same left-hand side sum to 1
- S: start symbol
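To make the definition concrete, here is a minimal sketch of how such a grammar might be represented in Python. The SCFG class, its field names, and the toy rules are illustrative assumptions, not from the paper; the check only enforces the stated constraint that rule probabilities with a shared left-hand side sum to 1.

```python
from collections import defaultdict

class SCFG:
    """Minimal stochastic CFG: rules grouped by left-hand side,
    each rule carrying a probability p(r). Illustrative only."""

    def __init__(self, start):
        self.start = start              # start symbol S
        self.rules = defaultdict(list)  # lhs -> [(rhs, prob), ...]

    def add_rule(self, lhs, rhs, prob):
        self.rules[lhs].append((tuple(rhs), prob))

    def check_normalized(self, tol=1e-9):
        # Probabilities of rules sharing a left-hand side must sum to 1.
        for lhs, prods in self.rules.items():
            total = sum(p for _, p in prods)
            assert abs(total - 1.0) < tol, f"{lhs} sums to {total}"

# Toy example: S -> A B (0.7) | A A (0.3); A -> 'a'; B -> 'b'
g = SCFG("S")
g.add_rule("S", ["A", "B"], 0.7)
g.add_rule("S", ["A", "A"], 0.3)
g.add_rule("A", ["a"], 1.0)
g.add_rule("B", ["b"], 1.0)
g.check_normalized()
```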
Problems with most SCFG learning algorithms:
1) Expensive storage: they need to store a corpus of complete sentences
2) Time-consuming: they need repeated passes over all of the data
Learning an SCFG involves two subproblems:
- Inducing the context-free structure from a corpus of sentences
- Learning the production (rule) probabilities
As the title indicates, this work addresses the second.
Learning SCFG (cont.)
General method: the Inside/Outside algorithm, an instance of Expectation-Maximization (EM):
- E-step: find the expected usage counts of the rules
- M-step: maximize the likelihood given both the expectations and the corpus
Disadvantage of the Inside/Outside algorithm:
- The entire sentence corpus must be stored under some representation (e.g., a chart parse)
- Expensive storage (unrealistic for a human agent!)
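For concreteness, below is a hedged sketch of the maximization step: given expected rule counts (as an Inside/Outside E-step over the stored corpus would produce), each rule probability is re-estimated as its expected count normalized over all rules sharing its left-hand side. The function name and data layout are assumptions; the E-step itself is not shown.

```python
from collections import defaultdict

def m_step(expected_counts):
    """Re-estimate rule probabilities from expected counts.

    expected_counts: dict mapping (lhs, rhs) -> E[count of rule in corpus],
    as an Inside/Outside E-step would produce (not shown here).
    Returns dict mapping (lhs, rhs) -> new probability.
    """
    totals = defaultdict(float)
    for (lhs, _), c in expected_counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]]
            for rule, c in expected_counts.items()
            if totals[rule[0]] > 0}

# Example: two S-rules with expected counts 7 and 3 -> probs 0.7 and 0.3
counts = {("S", ("A", "B")): 7.0, ("S", ("A", "A")): 3.0}
print(m_step(counts))
```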
Proposed Algorithm
Use Unique Normal Form (UNF):
- Replace each terminal rule A -> z with two new rules:
  A -> D, with p[A -> D] = p[A -> z]
  D -> z, with p[D -> z] = 1
- As a result, no two productions have the same right-hand side
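A hedged sketch of the UNF rewrite, assuming one fresh non-terminal D_z per terminal z; the naming scheme and the per-terminal sharing of D_z are illustrative choices, not specified in the slide.

```python
def to_unf(rules, is_terminal):
    """Rewrite each rule of the form A -> z (z a terminal) into
    A -> D_z (same probability) plus D_z -> z with probability 1.

    rules: list of (lhs, rhs_tuple, prob).  The D_z naming is an
    illustrative assumption; UNF only requires fresh symbols.
    """
    new_rules, fresh = [], {}
    for lhs, rhs, p in rules:
        if len(rhs) == 1 and is_terminal(rhs[0]):
            z = rhs[0]
            if z not in fresh:
                fresh[z] = f"D_{z}"
                new_rules.append((fresh[z], (z,), 1.0))  # D_z -> z, p = 1
            new_rules.append((lhs, (fresh[z],), p))      # A -> D_z, p = p[A -> z]
        else:
            new_rules.append((lhs, rhs, p))
    return new_rules

# Example: A -> 'z' with p = 0.4 becomes A -> D_z (0.4) and D_z -> 'z' (1.0)
print(to_unf([("A", ("z",), 0.4)], is_terminal=str.islower))
```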
Learning SCFG: Proposed Algorithm (cont.)
Use histograms:
- Each rule r has two histograms of usage counts, H_O^r and H_L^r
- H_O^r is constructed when parsing the sentences in the observed corpus O
- H_L^r continues to be updated throughout the learning process, and is rescaled to a fixed size h
- Why? So that recently used rules have more impact on the histogram
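Below is a speculative sketch of the fixed-size idea: H_L^r is updated after each parse and, once its total mass exceeds h, all bins are scaled down proportionally, so older observations decay and recent rule usage dominates. The proportional-rescaling rule itself is an assumption; the slide only states that H_L^r is rescaled to a fixed size h.

```python
def update_histogram(hist, count, h=100.0):
    """Add one observation (rule used `count` times in the last parse)
    to H_L^r, then rescale so the histogram's total mass stays at h.

    hist: dict mapping usage-count -> (possibly fractional) frequency.
    The proportional rescaling is an illustrative assumption.
    """
    hist[count] = hist.get(count, 0.0) + 1.0
    total = sum(hist.values())
    if total > h:
        scale = h / total
        for k in hist:
            hist[k] *= scale  # old observations shrink; recent ones dominate
    return hist

# Once the total mass reaches h, each new parse contributes a fixed
# fraction (roughly 1/h) of the histogram's mass.
H_L = {}
for c in [2, 2, 3, 2, 1]:
    update_histogram(H_L, c, h=4.0)
print(H_L)
```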
Comparing H_L^r and H_O^r
Compare the two histograms using their relative entropy T:
- If T decreases, increase the probabilities of the rules used (if the decrease s is large, give a larger increase to the rules used when parsing the last sentence)
- If T increases, decrease the probabilities of the rules used (e.g., p_{t+1}(r) = 0.01 * p_t(r))
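A hedged sketch of this comparison step: compute the relative entropy T between the normalized histograms, then nudge the probabilities of the rules used in the last parse up or down according to whether T fell or rose. The smoothing constant, the boost formula, and the omission of per-left-hand-side renormalization are assumptions; the 0.01 shrink factor comes from the slide's example.

```python
import math

def relative_entropy(h_o, h_l, eps=1e-12):
    """KL divergence D(H_O^r || H_L^r) between two usage histograms,
    each normalized to a distribution over usage counts."""
    keys = set(h_o) | set(h_l)
    zo, zl = sum(h_o.values()), sum(h_l.values())
    t = 0.0
    for k in keys:
        p = h_o.get(k, 0.0) / zo + eps
        q = h_l.get(k, 0.0) / zl + eps
        t += p * math.log(p / q)
    return t

def adjust(probs, used_rules, t_prev, t_curr, down=0.01):
    """If T decreased, boost the rules used in the last parse (the
    larger the drop s, the larger the boost); if T increased, shrink
    them, e.g. p_{t+1}(r) = 0.01 * p_t(r).  The boost formula is an
    illustrative assumption; renormalizing probabilities per
    left-hand side afterwards is omitted for brevity.
    """
    s = t_prev - t_curr
    for r in used_rules:
        probs[r] *= (1.0 + s) if s > 0 else down
    return probs

# Identical histograms give T = 0, i.e. the learner matches the observations.
h = {1: 2.0, 2: 3.0}
print(relative_entropy(h, h))  # 0.0
```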
Comparing the Inside/Outside algorithm with the proposed algorithm
Inside/Outside:
- O(n^3)
- Good: converges in 3-5 iterations
- Bad: needs to store the complete sentence corpus
Proposed algorithm:
- O(n^3)
- Bad: needs more iterations
- Good: memory requirement is constant!