Optimizing Local Probability Models for Statistical Parsing Kristina Toutanova, Mark Mitchell, Christopher Manning Computer Science Department Stanford University

Highlights
Choosing a local probability model P(expansion(n)|history(n)) for statistical parsing – a comparison of commonly used models
A new player – memory-based models and their relation to interpolated models
Joint likelihood, conditional likelihood, and classification accuracy for models of this form

Motivation
Many problems in natural language processing are disambiguation problems:
word senses: jaguar – a big cat, a car, the name of a Java package; line – a phone line, a queue, a line in mathematics, an air line, etc.
part-of-speech tags (noun, verb, proper noun, etc.): in “Joy makes progress every day.”, each word must be assigned one of several candidate tags (NNP, NN, VBZ, VB, NNS, DT, …)

Parsing as Classification
“I would like to meet with you again on Monday”
Input: a sentence
Classify it into one of its possible parses

Motivation – Classification Problems
There are two major differences from typical machine learning domains:
The number of classes can be very large or even infinite; the set of available classes for an input varies (and depends on a grammar)
Data is usually very sparse, and the number of possible features is large (e.g. words)

Solutions
The possible parse trees are broken down into small pieces that define features; features are now functions of the input and the class, not of the input only
Discriminative or generative models are built using these features; we concentrate on generative models here – when a huge number of analyses is possible, they are the only practical choice
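
As a quick notational recap of the two model families (standard definitions, not taken from the slides): a generative model estimates the joint distribution and classifies via Bayes rule, while a discriminative (conditional) model estimates the class posterior directly:

$$\hat{y}_{\mathrm{gen}} = \arg\max_{y} P(x, y) = \arg\max_{y} P(y)\,P(x \mid y), \qquad \hat{y}_{\mathrm{disc}} = \arg\max_{y} P(y \mid x)$$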

History-Based Generative Parsing Models
Example: “Tuesday Marks bought Brooks.”
[Figure: partial parse tree for the example, with nodes such as TOP, S, NP, NP-C, VP, NNP/Tuesday]
The generative models learn a distribution P(S,T) on (sentence, parse tree) pairs and select the single most likely parse: T* = argmax_T P(T|S) = argmax_T P(S,T)

Factors in the Performance of Generative History-Based Models
The chosen decomposition of parse tree generation, including the representation of parse tree nodes and the independence assumptions
The model family chosen for representing local probability distributions: Decision Trees, Naïve Bayes, Log-linear Models
The optimization method for fitting major and smoothing parameters: maximum likelihood, maximum conditional likelihood, minimum error rate, etc.
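
As a reminder of the form such history-based models take (a standard formulation matching the P(expansion(n)|history(n)) notation used in these slides):

$$P(S, T) \;=\; \prod_{n \in T} P\big(\mathrm{expansion}(n) \mid \mathrm{history}(n)\big)$$

where history(n) is a fixed set of features of the parse generated so far, determined by the chosen decomposition.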

Previous Studies and This Work
The influence of the previous three factors has not been isolated in previous work: authors presented specific choices for all components, and the importance of each was unclear
We assume the generative history-based model and the set of features (the representation of parse tree nodes) are fixed, and we carefully study the other two factors

Deleted Interpolation
Estimating the probability P(y|X) by interpolating relative-frequency estimates for lower-order distributions
Most commonly used: a linear feature-subsets order – Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Decision Trees with path interpolation, Memory-Based Learning
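
Written out, the linear (feature-subsets-ordered) form of deleted interpolation looks as follows (standard formulation; the individual methods differ only in how the weights λ are set):

$$\tilde{P}(y \mid x_1,\dots,x_k) \;=\; \lambda_k\,\hat{P}(y \mid x_1,\dots,x_k) \;+\; (1-\lambda_k)\,\tilde{P}(y \mid x_1,\dots,x_{k-1})$$

where the hatted terms are relative-frequency estimates, the recursion bottoms out at the marginal estimate of P(y) (or a uniform distribution), and the λ_k may be fixed (Jelinek-Mercer) or depend on context counts (Witten-Bell).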

Memory-Based Learning as Deleted Interpolation
In k-NN, the probability of a class given the features is estimated from the distance-weighted votes of the nearest neighbors (see the formula below)
If the distance function depends only on the positions of the matching features*, this is a case of deleted interpolation
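
A standard way to write the distance-weighted k-NN class-probability estimate referred to above (a reconstruction in common memory-based-learning notation; w is the distance-weighting function and N_K(x) the set of K nearest stored instances):

$$\hat{P}_{\mathrm{kNN}}(y \mid x) \;=\; \frac{\sum_{x' \in \mathcal{N}_K(x)} w\big(d(x, x')\big)\,\mathbb{1}\big[\mathrm{class}(x') = y\big]}{\sum_{x' \in \mathcal{N}_K(x)} w\big(d(x, x')\big)}$$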

Memory-Based Learning as Deleted Interpolation
Toy example: estimate P(eye-color=blue | hair-color=blond) from N=12 samples of people
With a single feature, the distance is d=0 (hair color matches) or d=1 (it does not); the weights are w(0)=w0 and w(1)=w1, and K=12, so every sample is a neighbor
The result is a deleted interpolation in which the interpolation weights depend on the counts and weights of the nearest neighbors at all accepted distances
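
A minimal Python sketch of this toy calculation, assuming hypothetical per-person counts and hypothetical weight values w0 and w1, with K equal to the full sample size; it only illustrates that the weighted k-NN estimate can be rewritten as an interpolation of the match-level and overall relative frequencies:

```python
# Toy check: a distance-weighted k-NN estimate of
# P(eye-color = blue | hair-color = blond) over N = 12 samples equals a
# linear interpolation of two relative-frequency estimates.
# The counts and the weights w0, w1 below are hypothetical illustration values.

samples = (
    [("blond", "blue")] * 4 + [("blond", "brown")] * 2 +   # 6 blond people
    [("dark",  "blue")] * 2 + [("dark",  "brown")] * 4     # 6 dark-haired people
)
w0, w1 = 1.0, 0.2   # weight at distance 0 (hair matches) and distance 1 (mismatch)

def weight(hair):
    return w0 if hair == "blond" else w1

# 1) Distance-weighted k-NN estimate with K = N (every sample is a neighbor).
num = sum(weight(h) for h, e in samples if e == "blue")
den = sum(weight(h) for h, e in samples)
p_knn = num / den

# 2) The same value as a deleted interpolation, with an interpolation weight
#    that depends on the neighbor counts at each distance and on w0, w1.
n_match = sum(1 for h, _ in samples if h == "blond")    # neighbors at distance 0
n_other = len(samples) - n_match                        # neighbors at distance 1
p_cond  = sum(1 for h, e in samples if h == "blond" and e == "blue") / n_match
p_prior = sum(1 for _, e in samples if e == "blue") / len(samples)
lam = (w0 - w1) * n_match / (w0 * n_match + w1 * n_other)
p_interp = lam * p_cond + (1 - lam) * p_prior

print(round(p_knn, 4), round(p_interp, 4))   # both 0.6111 for these numbers
```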

The Task and Features Used
Features of a parse tree node (No / Name / Example):
1  Node label                      HCOMP
2  Parent node label               HCOMP
3  Node direction                  left
4  Parent node direction           none
5  Grandparent node label          IMPER
6  Great grandparent node label    TOP
7  Left sister node label          none
8  Category of node                verb
[Figure: an example derivation tree – nodes include TOP, IMPER, HCOMP, LET_V1, US, SEE_V3 – and a chart of accuracy by sentence length, structural ambiguity, and a random baseline]
Maximum ambiguity – 507, minimum – 2

Experiments
Linear Feature Subsets Order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Linear Memory-Based Learning
Arbitrary Feature Subsets Order: Decision Trees, Memory-Based Learning, Log-linear Models
Experiments on the connection among likelihoods and accuracy

Experiments – Linear Sequence
The features {1,2,…,8} ordered by gain ratio: {1,8,2,3,5,4,7,6}
[Figure: accuracy curves for Jelinek-Mercer with fixed weight and Witten-Bell with varying d]
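
For reference, the two smoothing schemes in their common textbook form (the slides give only names, so the exact parameterization used in the experiments may differ slightly). Jelinek-Mercer with a single fixed weight λ:

$$P_{\mathrm{JM}}(y \mid x_1,\dots,x_k) = \lambda\,\hat{P}(y \mid x_1,\dots,x_k) + (1-\lambda)\,P_{\mathrm{JM}}(y \mid x_1,\dots,x_{k-1})$$

and Witten-Bell smoothing with a multiplier d, where the weight given to a context grows with its count and shrinks with the number of distinct outcomes seen in it:

$$\lambda(x_1,\dots,x_k) = \frac{c(x_1,\dots,x_k)}{c(x_1,\dots,x_k) + d\cdot\big|\{\,y : c(y, x_1,\dots,x_k) > 0\,\}\big|}$$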

Experiments – Linear Sequence
[Figure: test-set accuracy versus the amount of smoothing – heavy smoothing is needed for the best results]

MBL Linear Subsets Sequence
Restrict MBL to be an instance of the same linear subsets sequence deleted interpolation, as follows (see the sketch below)
Weighting functions INV3 and INV4 performed best: LKNN3 best at K=3, …%; LKNN4 best at K=15, …%
LKNN4 is best of all
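
A minimal Python sketch of one plausible reading of this restriction, assuming the distance between a query and a stored instance is determined by the longest matching prefix of the gain-ratio feature order {1,8,2,3,5,4,7,6}, and using a generic inverse-style weighting as a hypothetical stand-in for INV3/INV4 (whose exact forms are defined in the paper):

```python
from collections import Counter

# Hypothetical gain-ratio feature order from the slides (features 1..8).
FEATURE_ORDER = [1, 8, 2, 3, 5, 4, 7, 6]

def prefix_distance(query, instance, order=FEATURE_ORDER):
    """Distance = number of order positions beyond the longest matching prefix.

    Making the distance depend only on which prefix of the feature order
    matches is what ties the estimator to a linear subsets-sequence
    deleted interpolation.
    """
    match = 0
    for f in order:
        if query.get(f) is not None and query.get(f) == instance["features"].get(f):
            match += 1
        else:
            break
    return len(order) - match

def weight(d, alpha=3):
    # Generic inverse-style weighting; a hypothetical stand-in for INV3/INV4.
    return 1.0 / (1 + d) ** alpha

def lknn_distribution(query, memory, k):
    """Distance-weighted class distribution over the k nearest stored instances.

    memory: list of dicts like {"features": {1: "HCOMP", 8: "verb", ...},
                                "label": "HCOMP"}.
    """
    nearest = sorted(memory, key=lambda inst: prefix_distance(query, inst))[:k]
    votes = Counter()
    for inst in nearest:
        votes[inst["label"]] += weight(prefix_distance(query, inst))
    total = sum(votes.values())
    return {label: v / total for label, v in votes.items()}
```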

Experiments
Linear Subsets Feature Order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Linear Memory-Based Learning
Arbitrary Subsets Feature Order: Decision Trees, Memory-Based Learning, Log-linear Models
Experiments on the connection among likelihoods and accuracy

Model Implementations – Decision Trees (DecTreeWBd)
n-ary decision trees: if we choose a feature f to split on, all its values form subtrees
Splitting criterion: gain ratio
The final probability estimates at the leaves are Witten-Bell d interpolations of the estimates on the path to the root – so these are instances of deleted interpolation models as well
[Figure: a small decision tree splitting on feat:1 and then feat:2, with leaves such as HCOMP and NOPTCOMP]
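
A minimal sketch of the leaf-probability computation described above, assuming each node of the tree stores the outcome counts of the training instances that reach it, and using a Witten-Bell style weight with multiplier d (the exact variant used in DecTreeWBd may differ):

```python
def node_lambda(counts, d):
    """Witten-Bell style interpolation weight for one node.

    counts: dict mapping outcome -> count of training instances at this node.
    d: smoothing multiplier (larger d => heavier smoothing toward the parent).
    """
    total = sum(counts.values())
    distinct = len(counts)
    return total / (total + d * distinct) if total > 0 else 0.0

def leaf_probability(path_counts, outcome, d, prior=None):
    """Interpolate relative-frequency estimates along the root-to-leaf path.

    path_counts: list of count dicts, ordered root -> ... -> leaf.
    prior: fallback distribution used below the root (e.g. uniform); optional.
    """
    # Start from the fallback estimate and interpolate downward toward the leaf.
    p = (prior or {}).get(outcome, 1.0 / max(len(path_counts[0]), 1))
    for counts in path_counts:
        lam = node_lambda(counts, d)
        total = sum(counts.values())
        rel_freq = counts.get(outcome, 0) / total if total > 0 else 0.0
        p = lam * rel_freq + (1 - lam) * p
    return p
```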

Model Implementations – Log-linear Models
Binary features formed by instantiating templates
Three models with different allowable features:
LogLinSingle – single attributes only
LogLinPairs – pairs of attributes, but only pairs involving the most important feature (node label)
LogLinBackoff – linear feature subsets, comparable to the previous models
Gaussian smoothing was used; trained by Conjugate Gradient (Stanford Classify Package)
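
For reference, the conditional log-linear (maximum entropy) model these implementations instantiate, with a Gaussian prior on the weights corresponding to the Gaussian smoothing mentioned above (the variance σ² is a tuning parameter not given on the slide):

$$P(y \mid x) = \frac{\exp\big(\textstyle\sum_i w_i f_i(x, y)\big)}{\sum_{y'} \exp\big(\textstyle\sum_i w_i f_i(x, y')\big)}, \qquad
\max_{w}\ \sum_{j} \log P(y_j \mid x_j) \;-\; \frac{\lVert w\rVert^2}{2\sigma^2}$$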

Model Implementations – Memory-Based Learning
Weighting functions INV3 and INV4
KNN4 is better than DecTreeWBd and the log-linear models
KNN4 gives a 5.8% error reduction over WBd (significant at the 0.01 level)
Model      KNN4     DecTreeWBd   LogLinSingle   LogLinPairs   LogLinBackoff
Accuracy   80.79%   79.66%       78.65%         …%            …%

Accuracy Curves for MBL and Decision Trees

Experiments
Linear Subsets Feature Order: Jelinek-Mercer with fixed weight, Witten-Bell with varying d, Linear Memory-Based Learning
Arbitrary Subsets Feature Order: Decision Trees, Memory-Based Learning, Log-linear Models
Experiments on the connection among likelihoods and accuracy

Joint Likelihood, Conditional Likelihood, and Classification Accuracy
Our aim is to maximize parsing accuracy, but:
Smoothing parameters are usually fit on held-out data to maximize joint likelihood
Sometimes conditional likelihood is optimized instead
We look at the relationship among the maxima of these three scoring functions, depending on the amount of smoothing, and find that:
Much heavier smoothing is needed to maximize accuracy than joint likelihood
Conditional likelihood also increases with smoothing, even long after the maximum for joint likelihood
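
In symbols, the three criteria being compared, over held-out pairs (x_j, y_j) (standard definitions, stated here for clarity):

$$\mathrm{JL} = \sum_j \log P(x_j, y_j), \qquad
\mathrm{CL} = \sum_j \log P(y_j \mid x_j), \qquad
\mathrm{Acc} = \frac{1}{m}\sum_{j=1}^{m} \mathbb{1}\Big[\,y_j = \arg\max_{y} P(y \mid x_j)\Big]$$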

Test Set Performance versus Amount of Smoothing - I

Test Set Performance versus Amount of Smoothing

Test Set Performance versus Amount of Smoothing – PP Attachment, Witten-Bell Varying d

Summary
The problem of effectively estimating local probability distributions for compound decision models used for classification is under-explored
We showed that the chosen local distribution model matters
We showed the relationship between MBL and deleted interpolation models
MBL with large numbers of neighbors and appropriate weighting outperformed more expensive and popular algorithms – Decision Trees and Log-linear Models
Fitting a small number of smoothing parameters to maximize classification accuracy is promising for improving performance

Future Work
Compare MBL to other state-of-the-art smoothing methods
Better ways of fitting MBL weight functions
Theoretical investigation of bias-variance tradeoffs for compound decision systems with strong independence assumptions