Re-ranking for NP-Chunking: Maximum-Entropy Framework By: Mona Vajihollahi.

Presentation transcript:

Slide 1: Re-ranking for NP-Chunking: Maximum-Entropy Framework
By: Mona Vajihollahi, 12/4/2002

Slide 2: Agenda
- Background
- Training Approach
- Reranking
- Results
- Conclusion
- Future Directions
- Comparison: VP, MaxEnt, and Baseline
- Applications

Slide 3: Background
- The MRF framework was previously used in re-ranking for natural language parsing.
- MRF can be viewed in terms of the principle of maximum entropy.
- It was found to be "too inefficient to run on the full data set":
  - The experiment was not completed.
  - No final results on its performance were provided.

Slide 4: Training Approach (1)
- Goal: learn a ranking function F over chunking candidates (a sketch of its assumed form follows this slide).
  - x_{i,j}: the j'th chunking candidate for the i'th sentence
  - L(x_{i,j}): the log-probability that the base chunking model assigns to x_{i,j}
  - h_k(x_{i,j}): a function indicating the presence of feature f_k in x_{i,j}
  - w_k: a parameter giving the weight of feature f_k
  - x_{i,1}: the candidate with the highest golden (gold-standard) score
- We need to find the parameters of the model, the w_k's, such that the ranking function gives good scores on test data.
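The equation defining the ranking function was an image on the slide and is not in the transcript; given the definitions above, it presumably takes the standard Collins-style reranking form sketched here (an assumed reconstruction, with w_0 multiplying the base log-probability, consistent with the GIS initialization on slide 7):

```latex
% Assumed form of the ranking function F (the slide's own equation is missing):
% base-model log-probability plus a weighted sum of feature indicator functions.
F(x_{i,j}) = w_0 \, L(x_{i,j}) + \sum_{k=1}^{m} w_k \, h_k(x_{i,j})
```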

Slide 5: Training Approach (2)
- How do we find a good parameter setting?
  - Try to minimize the number of ranking errors F makes on the training data.
    - Ranking error: a candidate with a lower golden score is ranked above the best candidate.
  - Maximize the likelihood of the golden candidates.
- Log-Linear Model:
  - The probability of x_{i,j} being the correct chunking for the i'th sentence is defined as a log-linear function of F (the formula is sketched below).
  - Use the Maximum Entropy framework to estimate the probability distribution.
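The probability formula itself is likewise missing from the transcript; in the standard log-linear (maximum-entropy) formulation over the n-best candidates it would read as follows (an assumed reconstruction):

```latex
% Assumed log-linear model over the candidates x_{i,1}, ..., x_{i,n} of sentence i:
P(x_{i,j} \mid \bar{w}) = \frac{\exp\big(F(x_{i,j})\big)}{\sum_{j'} \exp\big(F(x_{i,j'})\big)}

% Training then maximizes the log-likelihood of the golden candidates x_{i,1}:
\mathrm{LogLik}(\bar{w}) = \sum_{i} \log P(x_{i,1} \mid \bar{w})
```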

Slide 6: Training Approach (3)
- First approach: Feature Selection (a sketch follows this slide).
  - Goal: find a small subset of features that contributes most to maximizing the likelihood of the training data.
  - Greedily pick the feature, with additive weight update δ, that has the most impact on the likelihood.
- The complexity is O(TNFC), where
  - T: number of iterations (number of selected features)
  - N: number of sentences in the training set
  - F: number of features
  - C: number of iterations needed for convergence of the weight of each feature
- Finding the feature/weight pair with the highest gain is too expensive.
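A minimal Python sketch of the greedy feature-selection idea, assuming each candidate is stored as a (base-model log-probability, feature-count dict) pair with the golden candidate at index 0; these names and the data layout are illustrative assumptions, and for simplicity the additive weight δ is searched over a small grid instead of being iterated to convergence:

```python
import math

def log_likelihood(sentences, w0, weights):
    """Log-likelihood of the golden candidates under the log-linear model.
    Each sentence is a list of candidates; each candidate is (logprob, feature_counts),
    and the golden candidate is assumed to be at index 0."""
    total = 0.0
    for cands in sentences:
        scores = [w0 * lp + sum(weights.get(k, 0.0) * v for k, v in feats.items())
                  for lp, feats in cands]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[0] - log_z          # golden candidate is index 0
    return total

def greedy_feature_selection(sentences, all_features, n_select=10,
                             deltas=(-1.0, -0.5, 0.5, 1.0), w0=1.0):
    """Greedily add the (feature, delta) pair that most increases the likelihood."""
    weights = {}
    base = log_likelihood(sentences, w0, weights)
    for _ in range(n_select):                      # T iterations
        best = None
        for f in all_features:                     # F candidate features
            for d in deltas:                       # stand-in for the C-step weight search
                trial = dict(weights)
                trial[f] = trial.get(f, 0.0) + d
                gain = log_likelihood(sentences, w0, trial) - base   # scores all N sentences
                if best is None or gain > best[0]:
                    best = (gain, f, d)
        gain, f, d = best
        if gain <= 0:
            break
        weights[f] = weights.get(f, 0.0) + d
        base += gain
    return weights
```

Even in this simplified form, the cost structure matches the slide's O(TNFC) analysis: each of the T selections scans all F features and, for every trial δ, re-scores all N sentences.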

Slide 7: Training Approach (4)
- Second approach: forget about gain, just use GIS (Generalized Iterative Scaling); a code sketch follows this slide.
  1. Initialize w_0 = 1 and w_1 ... w_m = 0.
  2. For each feature f_k, observed[k] is the number of times feature f_k is seen in the best (golden) chunkings.
  3. For each feature f_k, expected[k] is the expected number of times feature f_k is seen under the model.
  4. For each feature f_k: w_k = w_k + log(observed[k] / expected[k]).
  5. Repeat steps 2-4 until convergence.
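A minimal sketch of this GIS-style training loop under the same assumed candidate layout (golden candidate at index 0); the slide's exact counting formulas were images, so the observed/expected computations below are an assumption. As slide 8 notes, every weight is updated in each pass over the data:

```python
import math

def gis_train(sentences, n_rounds=100, w0=1.0, eps=1e-10):
    """GIS-style training for the reranking model.
    sentences: list of candidate lists; each candidate is (logprob, feature_counts),
    with the golden candidate at index 0. Returns the feature weight dict."""
    weights = {}

    # observed[k]: how often feature k fires in the golden candidates (fixed counts).
    observed = {}
    for cands in sentences:
        for k, v in cands[0][1].items():
            observed[k] = observed.get(k, 0.0) + v

    for _ in range(n_rounds):
        # expected[k]: expected count of feature k under the current model distribution.
        expected = {}
        for cands in sentences:
            scores = [w0 * lp + sum(weights.get(k, 0.0) * v for k, v in feats.items())
                      for lp, feats in cands]
            m = max(scores)
            z = sum(math.exp(s - m) for s in scores)
            for (lp, feats), s in zip(cands, scores):
                p = math.exp(s - m) / z
                for k, v in feats.items():
                    expected[k] = expected.get(k, 0.0) + p * v

        # Log-space update for each feature seen in the golden candidates.
        for k, obs in observed.items():
            exp_k = expected.get(k, eps)
            weights[k] = weights.get(k, 0.0) + math.log((obs + eps) / (exp_k + eps))

    return weights
```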

Slide 8: Training Approach (5)
- Instead of updating just one weight in each pass over the training data, all the weights are updated.
- The procedure can be repeated for a fixed number of iterations, or until no significant change in log-likelihood occurs.
- Experiments showed that convergence is reached after about 100 rounds.
- The first method (feature selection) might lead to better performance, but it was too inefficient to apply!

Slide 9: Reranking
- The output of the training phase is a weight vector.
- For each sentence in the test set:
  - The function F specifies a score for each of its candidates.
  - The candidate with the highest score is chosen as the best one (a code sketch follows this slide).
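A minimal reranking sketch using the trained weights, with the same assumed candidate representation and hypothetical feature names:

```python
def rerank(cands, weights, w0=1.0):
    """Return the candidate with the highest score F under the trained weights.
    cands: list of (logprob, feature_counts) pairs for one test sentence."""
    def score(cand):
        lp, feats = cand
        return w0 * lp + sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return max(cands, key=score)

# Example with toy candidates (hypothetical feature names):
candidates = [(-2.1, {"NP_starts_with_DT": 1}), (-2.3, {"NP_ends_with_NN": 2})]
best = rerank(candidates, weights={"NP_ends_with_NN": 0.4})
```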

Slide 10: Results (1)
- Initial experiment:
  - Cut-off: 10 (features with fewer than 10 counts were omitted)
- Training is making it WORSE?!

Slide 11: Results (2)
- Try other cut-offs.
- Convergence occurred by round 100.
- Cut-off 50 is worse than cut-off 45.

Slide 12: Results (3)

Slide 13: Results (4)
- Why does cut-off 45 perform better than cut-off 10?
  - The feature set is extracted from the training data set.
  - Features with low counts are probably dataset-specific.
  - As training proceeds, rare features become more important!
- Label-Bias Problem: the problem occurs when a decision is made locally, regardless of the global history.

Slide 14: Results (5)
- The training process is supposed to increase the likelihood of the training data.
- Recall is always increasing, but precision is not!
- Why does the precision decrease? Overfitting!

Slide 15: Conclusion
- Considering the trade-off between precision and recall, cut-off 45 has the best performance.
- [Table comparing cut-offs by Precision, Recall, and Number of Rounds; values not reproduced in the transcript]

Slide 16: Future Directions
- Expand the template set:
  - Find more useful feature templates.
- Try to solve the Label-Bias problem:
  - Apply a smoothing method (such as a discount factor or a Gaussian prior, as sketched below).
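For reference, a Gaussian prior would penalize large weights by modifying the training objective roughly as follows (a standard formulation, assumed here since the slides give no formula):

```latex
% Log-likelihood with a zero-mean Gaussian prior (variance sigma^2) on the weights:
\mathrm{LogLik}_{\mathrm{prior}}(\bar{w}) = \sum_{i} \log P(x_{i,1} \mid \bar{w}) - \sum_{k} \frac{w_k^2}{2\sigma^2}
```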

Slide 17: Comparison: VP, MaxEnt, Baseline
- Both re-ranking methods perform better than the baseline.
- MaxEnt:
  - is more complex
  - should solve the Label-Bias problem
- Voted Perceptron (VP):
  - is a simple algorithm
  - achieves better results

              Precision    Recall
  VP          99.65%       99.98%
  MaxEnt      99.25%       99.87%
  Baseline    97.71%       99.32%
  Max.        99.95%       100.0%

Slide 18: Applications
- Both methods can be applied to any probabilistic baseline chunker (e.g., an HMM chunker).
  - The only restriction: the baseline has to produce n-best candidates for each sentence.
- The same framework can be used for VP-chunking:
  - The same feature templates are used to extract features for VP-chunking.
- Higher accuracy in text chunking leads to higher accuracy in related tasks, such as larger-scale grouping and subunit extraction.
