1
Inside-Outside Reestimation from Partially Bracketed Corpora
F. Pereira and Y. Schabes, ACL 30, 1992
CS730b, 김병창, NLP Lab.
1998. 10. 29
2
Contents
- Motivation
- Partially Bracketed Text
- Grammar Reestimation
  - The Inside-Outside Algorithm
  - The Extended Algorithm
  - Complexity
- Experimental Evaluation
  - Inferring the Palindrome Language
  - Experiments on the ATIS Corpus
- Conclusions and Further Work
3
Motivation I
- A very simple method for learning SCFGs [Charniak]:
  - Generate all possible SCFG rules
  - Assign some initial probabilities
  - Run the training algorithm on raw text
  - Remove the rules that end up with zero probability
- Difficulties in using SCFGs:
  - Time complexity: O(n^3 |w|^3), where n is the number of nonterminals and w the training sentence
    (cf. O(s^2 |w|) for training an HMM with s states)
  - Bad convergence properties: the larger the number of nonterminals, the worse
  - Linguistically reasonable constituent structure is inferred only by chance
4
Motivation II
- Extension of the Inside-Outside algorithm:
  - Infers grammars from a partially bracketed corpus
  - Advantages:
    - Constituent boundary information is built into the grammar
    - Fewer training iterations are needed
    - Better time complexity
5
Partially Bracketed Text
- Example
  - (((VB(DT NNS(IN((NN)(NN CD)))))).)
  - (((List (the fares(for((flight)(number 891)))))).)
- Notation
  - Corpus C = { c | c = (w, B) }, where w is a string and B is a bracketing of w
  - w = w_1 w_2 ... w_i w_{i+1} ... w_j ... w_{|w|}
  - A span (i, j) delimits the substring w_{i+1} ... w_j
  - consistent: no two spans in a bracketing overlap (cross)
  - compatible: the union of two bracketings is consistent
  - valid: a span is valid for c = (w, B) if it is compatible with B
  - Spans in a derivation α_0 ⇒ α_1 ⇒ ... ⇒ α_m = w:
    - if j = m, the span of w_i in α_j is (i-1, i)
    - if j < m and α_j = βAγ rewrites to α_{j+1} = βX_1 ... X_kγ, the span of A in α_j is (i_1, j_k), where (i_1, j_1), ..., (i_k, j_k) are the spans of X_1, ..., X_k in α_{j+1}
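A minimal sketch (not from the slides; the names overlap, consistent, and valid are illustrative) of these definitions, representing a bracketing as a set of (i, j) spans over word boundaries:

```python
def overlap(s1, s2):
    """Two spans cross (overlap without nesting), e.g. (1, 3) and (2, 5)."""
    (i, j), (k, l) = s1, s2
    return (i < k < j < l) or (k < i < l < j)

def consistent(bracketing):
    """A bracketing is consistent if no two of its spans cross."""
    spans = list(bracketing)
    return all(not overlap(spans[a], spans[b])
               for a in range(len(spans)) for b in range(a + 1, len(spans)))

def valid(span, bracketing):
    """A span is valid for (w, B) if adding it to B keeps the bracketing consistent."""
    return all(not overlap(span, b) for b in bracketing)

# Brackets of "((w1 w2) w3)" over word boundaries 0..3
B = {(0, 2), (0, 3)}
print(consistent(B))        # True
print(valid((1, 3), B))     # False: (1, 3) crosses (0, 2)
print(valid((2, 3), B))     # True
```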
6
Grammar Reestimation
- Two uses of the reestimation algorithm:
  - Refining parameter estimates for an SCFG derived by other means
  - Inferring a grammar from scratch
- Grammar inference
  - Given a set N of nonterminals and a set Σ of terminals, with n = |N|, t = |Σ|, N = {A_1, ..., A_n}, Σ = {b_1, ..., b_t}
  - A CNF SCFG over N, Σ has n^3 + nt probabilities:
    - B_{p,q,r} on binary rules A_p -> A_q A_r : n^3 of them
    - U_{p,m} on unary rules A_p -> b_m : nt of them
- Meaning of the rule probabilities: follows the intuition of context-freeness
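A small sketch, assuming NumPy, of how the n^3 + nt parameters could be stored and randomly initialized so that each nonterminal's outgoing rule probabilities sum to 1 (init_scfg is an illustrative name, not from the paper):

```python
import numpy as np

def init_scfg(n, t, seed=0):
    """Randomly initialize B[p,q,r] for A_p -> A_q A_r and U[p,m] for A_p -> b_m,
    normalized so that sum_{q,r} B[p,q,r] + sum_m U[p,m] = 1 for every p."""
    rng = np.random.default_rng(seed)
    B = rng.random((n, n, n))          # binary-rule scores: n^3 parameters
    U = rng.random((n, t))             # unary-rule scores: nt parameters
    for p in range(n):
        z = B[p].sum() + U[p].sum()
        B[p] /= z
        U[p] /= z
    return B, U

B, U = init_scfg(n=5, t=2)             # the palindrome experiment's sizes
print(B.shape, U.shape)                # (5, 5, 5) (5, 2)
```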
7
The Inside-Outside Algorithm
- Definition of the inner (e) and outer (f) probabilities
  [Figure: the inner probability covers the span w_s ... w_t dominated by nonterminal A_i; the outer probability covers the rest of the sentence, w_1 ... w_{s-1} and w_{t+1} ... w_T, around that span]
  (Special thanks to ohwoog)
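The slide gives these only as a picture; the standard definitions and the CNF inside recurrence, written in generic notation (not necessarily the paper's exact symbols) and using the rule probabilities B and U above, are roughly:

```latex
% Inner (inside) and outer (outside) probabilities -- a sketch in generic notation.
\[
  I(s,t,i) \;=\; P\bigl(A_i \Rightarrow^{*} w_s \cdots w_t\bigr)
  \qquad
  O(s,t,i) \;=\; P\bigl(S \Rightarrow^{*} w_1 \cdots w_{s-1}\, A_i\, w_{t+1} \cdots w_T\bigr)
\]
% CNF recurrence for the inner probability:
\[
  I(s,t,i) \;=\;
  \begin{cases}
    U_{i,m} & \text{if } s = t \text{ and } w_s = b_m,\\[4pt]
    \displaystyle \sum_{q,r} B_{i,q,r} \sum_{k=s}^{t-1} I(s,k,q)\, I(k+1,t,r) & \text{if } s < t.
  \end{cases}
\]
```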
8
The Extended Algorithm
- Compatibility function: 1 if a span is valid for the bracketed sentence, 0 otherwise
- Extended algorithm (see Table 1 of the paper)
  - Inside probabilities: equations (1), (2); the compatibility function is used in (2)
  - Outside probabilities: equations (3), (4); the compatibility function is used in (4)
  - Parameter reestimation: equations (5), (6); same as in the original algorithm
- Stopping criterion
  - Stop when the decrease in the cross-entropy estimate becomes negligible
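A rough sketch of the extended inside pass: the only change from the original recurrence is that spans incompatible with the bracketing are assigned probability zero. B, U and the span conventions follow the earlier sketches; all names are illustrative:

```python
import numpy as np

def inside_bracketed(words, bracketing, B, U, term_index):
    """Extended inside pass: I[s, t, p] = P(A_p =>* w_s .. w_t), with spans that
    cross the bracketing forced to zero probability (0-based, inclusive indices)."""
    L, n = len(words), B.shape[0]
    I = np.zeros((L, L, n))

    def ok(s, t):
        # span (s, t + 1) in between-word coordinates must not cross any bracket
        return all(not ((s < a < t + 1 < b) or (a < s < b < t + 1))
                   for (a, b) in bracketing)

    for s, w in enumerate(words):                    # unary rules A_p -> w_s
        I[s, s, :] = U[:, term_index[w]]
    for length in range(2, L + 1):                   # binary rules A_p -> A_q A_r
        for s in range(L - length + 1):
            t = s + length - 1
            if not ok(s, t):
                continue                             # compatibility function = 0
            for k in range(s, t):
                I[s, t, :] += np.einsum('pqr,q,r->p', B, I[s, k, :], I[k + 1, t, :])
    return I
```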
9
Complexity
- Complexity of the original algorithm: O(|w|^3) per sentence
  - Computing the inside probabilities, computing the outside probabilities, and reestimating the rule probabilities each take O(|w|^3) per sentence
- Complexity of the extended algorithm: O(|w|) in the best case
  - With a full binary bracketing B of a string w:
    - There are O(|w|) spans in B
    - There is only one split point for each span (i, k)
    - Every valid span must be a member of B
  - Preprocessing: enumerate the valid spans and their split points (see the sketch below)
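A sketch of that preprocessing step (illustrative names): for each span compatible with the bracketing, list the split points whose two halves are also compatible. With a full binary bracketing of a 4-word sentence, every multi-word valid span ends up with exactly one split point:

```python
def valid_spans_and_splits(length, bracketing):
    """Map every span (i, j) compatible with the bracketing to the split points k
    such that both (i, k) and (k, j) are also compatible."""
    def compatible(i, j):
        return all(not ((i < a < j < b) or (a < i < b < j)) for (a, b) in bracketing)

    table = {}
    for i in range(length):
        for j in range(i + 1, length + 1):
            if compatible(i, j):
                table[(i, j)] = [k for k in range(i + 1, j)
                                 if compatible(i, k) and compatible(k, j)]
    return table

# Full binary bracketing of a 4-word sentence: ((w1 w2)(w3 w4))
splits = valid_spans_and_splits(4, {(0, 2), (2, 4), (0, 4)})
print(splits)   # each multi-word valid span has exactly one split point
```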
10
Experimental Evaluation
- Two experiments
  - Artificial language: palindromes
  - Natural language: Penn Treebank (ATIS corpus)
- Evaluation metric
  - Bracketing accuracy: the proportion of phrases in the inferred parses that are compatible with the reference bracketing
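A minimal sketch of this metric as stated, counting the predicted phrase spans that do not cross any reference bracket (the crossing test is the same as in the earlier sketch; names are illustrative):

```python
def bracketing_accuracy(predicted_spans, reference_bracketing):
    """Fraction of predicted phrase spans that do not cross any reference bracket."""
    def compatible(span):
        i, j = span
        return all(not ((i < a < j < b) or (a < i < b < j))
                   for (a, b) in reference_bracketing)
    preds = list(predicted_spans)
    return sum(compatible(s) for s in preds) / len(preds) if preds else 1.0

# Example: one predicted span crosses the reference bracket (1, 3)
print(bracketing_accuracy([(0, 2), (1, 3), (0, 3)], {(0, 3), (1, 3)}))  # 2/3
```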
11
Inferring the Palindrome Language
- L = { w w^R | w ∈ {a, b}* }
- Initial grammar: 135 rules (= 5^3 + 5*2, i.e. 5 nonterminals and 2 terminals)
- Training with 100 sentences
- Inferred grammar: a correct grammar for the palindrome language
- Bracketing accuracy: above 90% (100% in several cases)
  - With unbracketed training: 15% - 69%
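Purely as an illustration (not the paper's actual training data), bracketed palindrome training examples could be generated like this, with the nested center-outward brackets that a palindrome grammar would assign:

```python
import random

def bracketed_palindrome(max_half_len=5, seed=None):
    """Return a palindrome w w^R over {a, b} and its nested bracketing."""
    rng = random.Random(seed)
    half = [rng.choice("ab") for _ in range(rng.randint(1, max_half_len))]
    w = half + half[::-1]
    n = len(w)
    brackets = {(i, n - i) for i in range(n // 2)}   # (0, n), (1, n-1), ...
    return "".join(w), brackets

corpus = [bracketed_palindrome(seed=s) for s in range(100)]   # 100 training sentences
print(corpus[0])   # a palindrome string together with its nested brackets
```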
12
Experiments on the ATIS Corpus
- ATIS (Air Travel Information System) corpus: 770 sentences (7812 words)
  - 700 training sentences, 70 test sentences (901 words)
- Initial grammar: 4095 rules (= 15^3 + 15*48)
  - 15 nonterminals, 48 terminal symbols for POS tags
- Bracketing accuracy: 90.36% after 75 iterations
  - With unbracketed training: 37.35%
- In case (A)
  - (Delta flight number): not compatible
  - (the cheapest): linguistically wrong; lack of information
  - 16 incompatible constituents in G_R
- In case (B)
  - fully compatible
  - 9 incompatible constituents in G_R
13
Conclusions and Further Work
- Using a partially bracketed corpus can
  - reduce the number of iterations needed for convergence
  - find better solutions
  - infer grammars with linguistically reasonable constituent boundaries
  - reduce time complexity (linear in the best case)
- Further work
  - determine sensitivity to the initial probability assignments, the training corpus, and missing or misplaced brackets
  - larger terminal vocabularies