MAXIMUM ENTROPY MARKOV MODEL
Adapted from: Heshaam Faili, University of Tehran – Dikkala Sai Nishanth – Ashwin P. Paranjape – Vipul Singh
– Introduction
– Need for MEMM in NLP
– MEMM and the feature and weight vectors
– Linear and Logistic Regression (MEMM)
– Learning in logistic regression
– Why call it Maximum Entropy?
Need for MEMM in NLP
HMM – the tag depends only on the previous tag, and the observed word depends only on the current tag
– We need to account for the dependency of the tag on the observed word
– We need to extract "features" from the word and use them (the two factorizations are contrasted below)
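As a reminder of the contrast, these are the standard factorizations (following Jurafsky and Martin): the HMM is a generative model of the joint probability of tags and words, while the MEMM directly models the conditional probability of the tags, letting each tag depend on the observed word.

```latex
% HMM: generative joint model; each word depends only on its own tag
P(T, W) = \prod_{i} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)

% MEMM: discriminative conditional model; each tag is conditioned
% directly on the observed word (and, via features, on its context)
P(T \mid W) = \prod_{i} P(t_i \mid t_{i-1}, w_i)
```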
MAXIMUM ENTROPY MODELS
A machine learning framework called Maximum Entropy modeling (MaxEnt)
Used for classification:
– The task of classification is to take a single observation, extract some useful features describing the observation, and then, based on these features, to classify the observation into one of a set of discrete classes
– A probabilistic classifier gives the probability of the observation being in each class
Non-sequential classification:
– In text classification we might need to decide whether a particular email should be classified as spam or not
– In sentiment analysis we have to determine whether a particular sentence or document expresses a positive or negative opinion
– We'll need to classify a period character ('.') as either a sentence boundary or not
MaxEnt
MaxEnt belongs to the family of classifiers known as exponential or log-linear classifiers
MaxEnt works by extracting a set of features from the input, combining them linearly (meaning that we multiply each by a weight and then add them up), and then using this sum as an exponent
Example: tagging
– A feature for tagging might be "this word ends in -ing" or "the previous word was 'the'" (see the sketch below)
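A minimal sketch of this "weight, sum, exponentiate" recipe in Python; the feature names and weight values here are hypothetical, chosen only to illustrate the mechanics:

```python
import math

# Hypothetical binary features for tagging the word "racing"
# in the context "the racing ..." -- illustrative values only.
features = {"ends_in_ing": 1, "prev_word_the": 1, "is_capitalized": 0}
weights  = {"ends_in_ing": 1.2, "prev_word_the": 0.8, "is_capitalized": -0.5}

# Combine the features linearly, then use the sum as an exponent.
score = sum(weights[f] * v for f, v in features.items())
unnormalized = math.exp(score)  # exp(1.2*1 + 0.8*1 - 0.5*0) = exp(2.0)
print(unnormalized)  # ~7.389; normalizing over all classes gives a probability
```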
Linear Regression
Two different names for tasks that map some input features into some output value: regression when the output is real-valued, and classification when the output is one of a discrete set of classes
Predicting a real-valued output (e.g., a house price) from a single feature:
\[ \text{price} = w_0 + w_1 \cdot \text{Num Adjectives} \]
Multiple linear regression
\[ \text{price} = w_0 + w_1 \cdot \text{Num Adjectives} + w_2 \cdot \text{Mortgage Rate} + w_3 \cdot \text{Num Unsold Houses} \]
Learning in linear regression
Choose the weights that minimize the sum-squared error over the training examples; this has the closed-form least-squares solution
\[ W = (X^T X)^{-1} X^T \vec{y} \]
where X is the matrix of feature vectors (one row per example) and \( \vec{y} \) is the vector of observed costs (prices)
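A minimal numpy sketch of the least-squares fit; the training values are hypothetical toy data, used only to show the shape of X and y:

```python
import numpy as np

# Toy training data (hypothetical values, for illustration only):
# columns = [bias, Num_Adjectives, Mortgage_Rate, Num_Unsold_Houses]
X = np.array([[1, 4, 6.5, 10],
              [1, 2, 6.5, 12],
              [1, 7, 7.0,  9],
              [1, 1, 7.0, 15]], dtype=float)
y = np.array([320_000, 280_000, 350_000, 250_000], dtype=float)

# Least-squares solution W = (X^T X)^{-1} X^T y;
# np.linalg.lstsq computes it in a numerically stable way.
W, *_ = np.linalg.lstsq(X, y, rcond=None)
print(W)      # learned weights w0..w3
print(X @ W)  # predicted prices for the training examples
```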
Logistic regression
Classification in which the output y we are trying to predict takes on one of a small set of discrete values
Binary classification: y ∈ {0, 1}
We cannot model a probability with a linear function directly, so we model the log of the odds instead:
– odds: \( \dfrac{p(y=1|x)}{1 - p(y=1|x)} \)
– logit function: \( \operatorname{logit}(p) = \ln \dfrac{p}{1-p} \)
Logistic regression sets the logit equal to the linear combination of features:
\[ \ln \frac{p(y=1|x)}{1 - p(y=1|x)} = w \cdot f \]
Solving for the probability gives the logistic (sigmoid) function:
\[ p(y=1|x) = \frac{e^{w \cdot f}}{1 + e^{w \cdot f}} = \frac{1}{1 + e^{-w \cdot f}} \]
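A one-line check of the sigmoid form, with a hypothetical weight vector and feature vector:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps the linear score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

w = [0.5, -1.2, 2.0]   # hypothetical learned weights
f = [1.0,  0.0, 1.0]   # hypothetical feature values for one observation

z = sum(wi * fi for wi, fi in zip(w, f))  # w . f = 2.5
print(sigmoid(z))  # p(y=1|x) ~ 0.924; classify y=1 since p > 0.5
```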
Logistic regression: Classification
Classify y = 1 if p(y=1|x) > p(y=0|x), i.e. if the odds are greater than 1, which holds exactly when w · f > 0
The equation w · f = 0 defines a hyperplane: points on one side are classified y = 1, points on the other side y = 0
Learning in logistic regression
Weights are learned by conditional maximum likelihood estimation: choose the weights that maximize the log probability of the observed y values given the observed x values in the training data
\[ \hat{w} = \arg\max_w \sum_j \log P\!\left(y^{(j)} \mid x^{(j)}\right) \]
Learning in logistic regression
The conditional log-likelihood is convex, so the optimum can be found with standard convex optimization methods (e.g., gradient ascent or quasi-Newton methods such as L-BFGS); Generalized Iterative Scaling (GIS) is a classical alternative (a sketch of gradient ascent follows)
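Because the objective is convex, even plain batch gradient ascent converges to the global optimum. A minimal numpy sketch with hypothetical toy data; a production system would use L-BFGS and regularization:

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=1000):
    """Batch gradient ascent on the conditional log-likelihood
    (a minimal sketch, not a production implementation)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # p(y=1|x) for every example
        # Gradient of the log-likelihood: sum_j (y_j - p_j) * x_j
        w += lr * X.T @ (y - p)
    return w

# Hypothetical toy data: a bias column plus 2 features.
X = np.array([[1, 0.2, 1.0], [1, 0.9, 0.1], [1, 0.8, 0.3], [1, 0.1, 0.9]])
y = np.array([0, 1, 1, 0])
w = train_logreg(X, y)
print(1.0 / (1.0 + np.exp(-(X @ w))))  # fitted probabilities, close to y
```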
MAXIMUM ENTROPY MODELING
Multinomial logistic regression (MaxEnt)
– Most of the time, classification problems that come up in language processing involve larger numbers of classes (e.g., part-of-speech classes)
– y is a value that takes on C different values, corresponding to the classes c_1, ..., c_C
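The standard form of the multinomial model (as in Jurafsky and Martin): the probability of a class c is the exponentiated weighted feature sum, normalized over all classes:

```latex
P(c \mid x) = \frac{\exp\!\left( \sum_{i} w_{c,i}\, f_i(c, x) \right)}
                   {\sum_{c' \in C} \exp\!\left( \sum_{i} w_{c',i}\, f_i(c', x) \right)}
```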
Maximum Entropy Modeling
Indicator function: a feature that takes on only the values 0 and 1, jointly indicating a property of the input and a candidate class
Maximum Entropy Modeling
Example
– Secretariat/NNP is/BEZ expected/VBN to/TO race/?? tomorrow/
– The task is to decide whether race should be tagged VB or NN (a feature sketch follows)
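A sketch of how indicator features over (class, context) pairs feed the multinomial model for the ambiguous word race; the feature definitions and weights here are hypothetical, purely to show the shape of the computation:

```python
import math

CLASSES = ["VB", "NN"]

def features(c: str, ctx: dict) -> dict:
    """Indicator features: each fires (value 1) only for a specific
    combination of an input property and a candidate class."""
    return {
        "prev_tag_TO_and_VB": 1 if ctx["prev_tag"] == "TO" and c == "VB" else 0,
        "word_race_and_NN":   1 if ctx["word"] == "race" and c == "NN" else 0,
    }

# Hypothetical weights, one per feature name.
weights = {"prev_tag_TO_and_VB": 1.5, "word_race_and_NN": 0.3}

ctx = {"word": "race", "prev_tag": "TO"}
scores = {c: math.exp(sum(weights[f] * v for f, v in features(c, ctx).items()))
          for c in CLASSES}
Z = sum(scores.values())  # normalization over all classes
print({c: s / Z for c, s in scores.items()})  # P(VB|x) ~ 0.77, P(NN|x) ~ 0.23
```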
Occam's Razor
Adopting the least complex hypothesis possible is embodied in Occam's razor ("Nunquam ponenda est pluralitas sine necessitate": plurality should never be posited without necessity)
Why do we call it Maximum Entropy?
Of all possible distributions, the equiprobable (uniform) distribution has the maximum entropy
\[ H(p) = -\sum_{x} p(x) \log p(x) \]
Among all distributions that satisfy the constraints imposed by the training data (the empirical feature expectations), we choose the one that makes the fewest additional assumptions, i.e. the one closest to uniform: the one with maximum entropy (see the demo below)
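A quick numeric check that the uniform distribution maximizes entropy, using the definition above:

```python
import math

def entropy(p):
    """H(p) = -sum p(x) log2 p(x); the term for p(x) = 0 is taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes -> 2.0 bits
print(entropy([0.70, 0.10, 0.10, 0.10]))  # skewed -> ~1.357 bits, lower
print(entropy([1.00, 0.00, 0.00, 0.00]))  # deterministic -> 0.0 bits
```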
Why do we call it Maximum Entropy?
Subject to these constraints, the maximum entropy distribution turns out to be exactly the probability distribution of a multinomial logistic regression model whose weights W maximize the likelihood of the training data!
Thus the exponential (log-linear) model is also the maximum entropy model
References
– Jurafsky, Daniel and Martin, James H. (2006). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice-Hall.
– Berger, Adam L., Della Pietra, Vincent J., and Della Pietra, Stephen A. (1996). A maximum entropy approach to natural language processing. Computational Linguistics 22(1).
– Ratnaparkhi, Adwait (1996). A Maximum Entropy Model for Part-of-Speech Tagging. Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).
THANK YOU