1 CS546: Machine Learning and Natural Language
Latent-Variable Models for Structured Prediction Problems: Syntactic Parsing
Slides / Figures from Slav Petrov’s talk at COLING-ACL 06 are used in this lecture
4 Parsing Problem
Annotation refines base treebank symbols to improve statistical fit of the grammar:
– Parent annotation [Johnson 98]
– Head lexicalization [Collins 99,...]
– Automatic Annotation [Matsuzaki et al, 05;...]
– Manual Annotation [Klein and Manning 03]
5 Manual Annotation
Manually split categories:
– NP: subject vs object
– DT: determiners vs demonstratives
– IN: sentential vs prepositional
Advantages:
– Fairly compact grammar
– Linguistic motivations
Disadvantages:
– Performance leveled out
– Manually annotated

Model                  F1
Naïve Treebank PCFG    72.6
Klein & Manning ’03    86.3
6 Automatic Annotation
Use Latent Variable Models:
– Split (“annotate”) each node: e.g., NP -> (NP[1], NP[2],..., NP[T])
– Each node in the tree is annotated with a latent sub-category
– Latent Annotated Probabilistic CFG: to obtain the probability of a tree you need to sum over all the latent variables
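The sum over latent variables does not require enumerating every annotation of the tree: for a fixed tree it can be computed bottom-up. The sketch below is only an illustration under stated assumptions. The toy grammar, the numbers, and the names `inside`, `tree_probability`, `root`, `binary` and `lexical` are all hypothetical and not taken from the lecture; trees are nested tuples and every symbol has T = 2 subcategories.

```python
# Hypothetical toy LA-PCFG with T = 2 latent subcategories per symbol.
T = 2

# root[A][x]             = P(root symbol is A[x])
# binary[(A,B,C)][x][y][z] = P(A[x] -> B[y] C[z])
# lexical[(A,w)][x]      = P(A[x] -> w)
root = {"S": [0.6, 0.4]}
binary = {("S", "NP", "VP"): [[[0.5, 0.1], [0.2, 0.2]],
                              [[0.25, 0.25], [0.25, 0.25]]]}
lexical = {("NP", "she"): [0.3, 0.1], ("VP", "sleeps"): [0.2, 0.4]}


def inside(tree, binary, lexical, T):
    """Return v where v[x] is the probability of the subtree below
    this node, given that the node carries latent subcategory x."""
    if len(tree) == 2:  # preterminal node: (label, word)
        label, word = tree
        return [lexical[(label, word)][x] for x in range(T)]
    label, left, right = tree  # binary node: (label, left, right)
    left_scores = inside(left, binary, lexical, T)
    right_scores = inside(right, binary, lexical, T)
    rule = binary[(label, left[0], right[0])]
    # Sum out the children's latent subcategories y and z.
    return [sum(rule[x][y][z] * left_scores[y] * right_scores[z]
                for y in range(T) for z in range(T))
            for x in range(T)]


def tree_probability(tree, root, binary, lexical, T):
    """Probability of the unannotated tree: sum over the root's subcategory."""
    scores = inside(tree, binary, lexical, T)
    return sum(root[tree[0]][x] * scores[x] for x in range(T))


tree = ("S", ("NP", "she"), ("VP", "sleeps"))
p = tree_probability(tree, root, binary, lexical, T)
print(p)
```

The cost is linear in the number of nodes and cubic in T for binary rules, whereas explicit enumeration grows exponentially with the number of nodes.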
7 How to perform this clustering?
Estimating model parameters (and model structure):
– Decide how to split each nonterminal (what is T in, e.g., NP -> (NP[1], NP[2],..., NP[T]))
– Estimate probabilities for all rules of the resulting annotated grammar
Parsing:
– Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
– Usually (2), but the inferred latent variables can be useful for other tasks
8 Estimating the model
Estimating parameters:
– If we decide on the structure of the model (how we split) we can use EM (Matsuzaki et al, 05; Petrov and Klein, 06;...):
  E-step: estimate the posterior over the latent annotations of the training trees, i.e., obtain fractional counts of rules
  M-step: re-estimate the rule probabilities from the fractional counts
– Variational methods (mean-field) can also be used [Titov and Henderson, 07; Liang et al, 07]
Recall: we considered variational methods in the context of LDA
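For a PCFG-style model the M-step is a per-symbol normalization of the fractional counts. Below is a minimal sketch of that step only, assuming the E-step has already produced expected counts; the function name `m_step`, the rule encoding and the numbers are hypothetical.

```python
from collections import defaultdict


def m_step(fractional_counts):
    """Turn expected rule counts into rule probabilities by
    normalizing over all rules that share a left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), count in fractional_counts.items():
        totals[lhs] += count
    return {
        (lhs, rhs): count / totals[lhs]
        for (lhs, rhs), count in fractional_counts.items()
        if totals[lhs] > 0
    }


# Hypothetical fractional counts, as an E-step might produce them.
counts = {
    ("NP[1]", ("DT[1]", "NN[2]")): 1.5,
    ("NP[1]", ("NNP[1]",)): 0.5,
    ("NP[2]", ("PRP[1]",)): 2.0,
}
probs = m_step(counts)
print(probs)
```

After this step the probabilities of all rules with the same annotated left-hand side sum to one, which is what the next E-step expects.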
9 Estimating the model
How to decide on how many nodes to split?
– Early models split all the nodes equally [Kurihara and Sato, 04; Matsuzaki et al, 05; Prescher 05,...] with T selected by hand
– The resulting models are sparse (parameter estimates are not reliable) and parsing is slow
10 Estimating the model
How to decide on how many nodes to split?
– Later, different approaches were considered:
  (Petrov and Klein 06): Split-and-merge approach: recursively split each node in 2; if the likelihood is (significantly) improved, keep the split, otherwise merge back; continue until no improvement
  (Liang et al 07): Use Dirichlet Processes to automatically infer the appropriate size of the grammar
– The larger the training set, the more fine-grained the annotation
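The split-and-merge idea as summarized on this slide can be sketched as a greedy loop. This follows the simplified description above, not necessarily the exact procedure of Petrov and Klein. The `log_likelihood` callback is a hypothetical stand-in for "retrain with EM and score the training data", and `toy_log_likelihood` with its `ideal` table is an invented scoring function used only to make the example runnable.

```python
import math


def split_and_merge(symbols, log_likelihood, min_gain=1e-3, max_rounds=10):
    """Greedily double the number of subcategories of each symbol,
    keeping a split only if it improves the likelihood enough."""
    splits = dict(symbols)
    for _ in range(max_rounds):
        improved = False
        for sym in list(splits):
            base = log_likelihood(splits)
            trial = dict(splits)
            trial[sym] *= 2  # split every subcategory of sym in two
            if log_likelihood(trial) - base > min_gain:
                splits = trial  # keep the split
                improved = True
            # otherwise the trial is dropped, i.e. merged back
        if not improved:
            break
    return splits


# Invented scoring function: pretends each symbol has an ideal
# number of subcategories and penalizes the distance in log2 space.
ideal = {"NP": 4, "VP": 2, "TO": 1}


def toy_log_likelihood(splits):
    return -sum(abs(math.log2(n) - math.log2(ideal[s]))
                for s, n in splits.items())


result = split_and_merge({"NP": 1, "VP": 1, "TO": 1}, toy_log_likelihood)
print(result)
```

With the invented scoring function, a closed-class tag such as TO stays unsplit while NP is split twice, which mirrors the qualitative behavior reported for adaptive splitting.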
11 Estimating the model
How to decide on how many nodes to split?
(Titov and Henderson 07; current work):
– Instead of annotating with a single label, annotate with a binary vector
– Log-linear models for production probabilities instead of counts of productions
– The vector can be large: standard Gaussian regularization to avoid overtraining
– Efficient approximate parsing algorithms
12 How to parse?
Do you need the most likely ‘annotated’ parse tree (1) or the most likely tree with non-annotated nodes (2)?
How to parse:
– (1): easy, just usual parsing with the extended grammar (if all nodes are split in T)
– (2): not tractable (NP-hard [Matsuzaki et al, 2005])
– Instead you can do Minimum Bayes Risk decoding [Goodman 96; Titov and Henderson, 06; Petrov and Klein 07]: instead of predicting the most likely tree, you output the tree with the minimal expected loss
(Not always a great idea because we often do not know good loss measures: e.g., optimizing the Hamming loss for sequence labeling can lead to linguistically implausible structures)
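Minimum Bayes Risk decoding over a finite candidate list can be sketched as follows. This is an illustration under assumptions, not the decoder of any of the cited papers: the candidate list stands in for the posterior over unannotated trees (e.g., a k-best list), trees are represented as sets of labeled spans, and `bracket_loss` (the number of brackets in which two trees differ) is a hypothetical choice of loss.

```python
def bracket_loss(hyp, ref):
    """Hypothetical loss: number of labeled spans not shared by the two trees."""
    return len(hyp ^ ref)


def mbr_decode(candidates, loss):
    """candidates: list of (tree, probability) pairs.
    Return the candidate tree with the minimal expected loss."""
    total = sum(p for _, p in candidates)

    def expected_loss(hyp):
        return sum(p / total * loss(hyp, ref) for ref, p in candidates)

    return min((tree for tree, _ in candidates), key=expected_loss)


# Three hypothetical parses of a 5-word sentence as sets of (label, start, end).
tree_a = frozenset({("NP", 0, 2), ("VP", 2, 5)})
tree_b = frozenset({("NP", 0, 1), ("VP", 1, 5), ("PP", 3, 5)})
tree_c = frozenset({("NP", 0, 1), ("VP", 1, 5), ("NP", 3, 5)})
candidates = [(tree_a, 0.4), (tree_b, 0.35), (tree_c, 0.25)]

map_tree = max(candidates, key=lambda pair: pair[1])[0]
mbr_tree = mbr_decode(candidates, bracket_loss)
print(map_tree == tree_a, mbr_tree == tree_b)
```

In this toy setting the single most probable tree is `tree_a`, but `tree_b` has the lower expected loss because it shares most of its brackets with `tree_c`, so the two decoding criteria disagree.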
13 Adaptive splitting
(Petrov and Klein, 06): Split and Merge
[Figure: number of induced constituent labels; labels marked on the chart: PP, VP, NP]
15 Adaptive splitting
(Petrov and Klein, 06): Split and Merge
[Figure: number of induced POS tags; tags marked on the chart: TO, POS, NN, NNS, NNP, JJ]
16 Induced POS tags
Proper Nouns (NNP):
NNP-14    Oct.    Nov.         Sept.
NNP-12    John    Robert       James
NNP-2     J.      E.           L.
NNP-1     Bush    Noriega      Peters
NNP-15    New     San          Wall
NNP-3     York    Francisco    Street
Personal pronouns (PRP):
PRP-0     It      He           I
PRP-1     it      he           they
PRP-2     it      them         him
17 Induced POS tags
Comparative adverbs (RBR):
RBR-0     further    lower      higher
RBR-1     more       less       More
RBR-2     earlier    Earlier    later
Cardinal Numbers (CD):
CD-7      one        two        Three
CD        …
CD-11     million    billion    trillion
CD        …
CD        …
CD        …
18 Results for this model
Columns: Parser | F1 (≤ 40 words) | F1 (all words)
Rows:
Klein & Manning ’
Matsuzaki et al. ’
Collins ’
Charniak & Johnson ’
Petrov & Klein,
19 LVs in Parsing
– In standard models for parsing (and other structured prediction problems) you need to decide how the structure decomposes into parts (e.g., weighted CFGs / PCFGs)
– In latent variable models you relax this assumption: you only assume how the structure annotated with latent variables decomposes
– In other words, you learn to construct composite features from the elementary features (parts), which reduces feature engineering effort
Latent variable models have become popular in many applications:
– syntactic dependency parsing [Titov and Henderson, 07]: best single-model system in the parsing competition (overall 3rd result out of 22 systems) (CoNLL-2007)
– joint semantic role labeling and parsing [Henderson et al, 09]: again the best single model (1st result in parsing, 3rd result in SRL) (CoNLL-2009)
– hidden (dynamic) CRFs [Quattoni, 09]
–...
20 Hidden CRFs
– CRF (Lafferty et al, 2001): no long-distance statistical dependencies between y
– Latent Dynamic CRF: long-distance dependencies can be encoded using latent vectors
21 Latent Variables
Drawbacks:
– Learning LV models usually involves slower iterative algorithms (EM, variational methods, sampling...)
– The optimization problem is often non-convex: many local minima
– Inference (decoding) can be more expensive
Advantages:
– Reduces feature engineering effort
– Especially preferable if little domain knowledge is available and complex features are needed
– The induced representation can be used for other tasks (e.g., the fine-grained grammar induced by LA-PCFGs can be useful for SRL)
– Latent variables (= hidden representations) can be useful in multi-task learning: a hidden representation is induced simultaneously for several tasks [Collobert and Weston, 2008; Titov et al, 2009]
22 Conclusions
We considered latent variable models in different contexts:
– Topic modeling
– Structured prediction models
We demonstrated where and why they are useful
Reviewed basic inference/learning techniques:
– EM-type algorithms
– Variational approximations
– Sampling
Only a very basic review
Next time: a guest lecture by Ming-Wei Chang on Domain Adaptation (a really hot and important topic in NLP!)