Part-of-Speech Tagging CSCI-GA.2590 – Lecture 4 Ralph Grishman NYU.

Parts of Speech
Grammar is stated in terms of parts of speech ('preterminals'):
– classes of words sharing syntactic properties: noun, verb, adjective, …
1/16/14 NYU

POS Tag Sets
The most influential tag sets were those defined for projects to produce large POS-annotated corpora:
Brown corpus
– 1 million words from a variety of genres
– 87 tags
UPenn Tree Bank
– initially 1 million words of Wall Street Journal
– later retagged Brown: first POS tags, then full parses
– 45 tags (some distinctions captured in parses)

The Penn POS Tag Set
Noun categories:
– NN (common singular)
– NNS (common plural)
– NNP (proper singular)
– NNPS (proper plural)
Verb categories:
– VB (base form)
– VBZ (3rd person singular present tense)
– VBP (present tense, other than 3rd person singular)
– VBD (past tense)
– VBG (present participle)
– VBN (past participle)

Some tricky cases
Present participles which act as prepositions:
– according/JJ to
Nationalities:
– English/JJ cuisine
– an English/NNP sentence
Adjective vs. participle:
– the striking/VBG teachers
– a striking/JJ hat
– he was very surprised/JJ
– he was surprised/VBN by his wife

Tokenization
Any annotated corpus assumes some tokenization. It is relatively straightforward for English:
– generally defined by whitespace and punctuation
– treat a negative contraction as a separate token: do | n't
– treat a possessive as a separate token: cat | 's
– do not split hyphenated terms: Chicago-based
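The conventions above can be sketched with a few regular-expression substitutions. This is an illustrative toy tokenizer, not the one actually used to prepare the Penn Treebank:

```python
import re

# A minimal sketch of the tokenization conventions described above
# (hypothetical helper; real Treebank tokenization handles many more cases).
def tokenize(text):
    # split negative contractions: "doesn't" -> "does", "n't"
    text = re.sub(r"n't\b", " n't", text)
    # split possessives: "cat's" -> "cat", "'s"
    text = re.sub(r"'s\b", " 's", text)
    # separate sentence punctuation, but leave hyphenated terms intact
    text = re.sub(r"([.,!?;])", r" \1", text)
    return text.split()

tokenize("The cat's owner doesn't like Chicago-based firms.")
# -> ['The', 'cat', "'s", 'owner', 'does', "n't", 'like', 'Chicago-based', 'firms', '.']
```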

The Tagging Task
Task: assign a POS to each word.
This is not trivial: many words have several tags, and a dictionary only lists the possible POS, independent of context.
How about using a parser to determine tags?
– some analyzers (e.g., partial parsers) assume the input is already tagged

Why tag?
– POS tagging can help parsing by reducing ambiguity.
– It can resolve some pronunciation ambiguities for text-to-speech ("desert" as noun vs. verb).
– It can resolve some semantic ambiguities.

Simple Models
Natural language is very complex. We don't know how to model it fully, so we build simplified models which provide some approximation to natural language.

Corpus-Based Methods
How can we measure how good these models are? We:
– assemble a text corpus
– annotate it by hand with respect to the phenomenon we are interested in
– compare it with the predictions of our model (for example, how well the model predicts part-of-speech or syntactic structure)

Preparing a Good Corpus
To build a good corpus:
– we must define a task people can do reliably (choose a suitable POS set, for example)
– we must provide good documentation for the task so annotation can be done consistently
– we must measure human performance (through dual annotation and inter-annotator agreement)
This often requires several iterations of refinement.

Training the Model
How do we build a model?
– We need a goodness metric.
– Train by hand, by adjusting rules and analyzing errors (example: Constraint Grammar).
– Or train automatically:
– develop new rules
– build a probabilistic model (generally very hard to do by hand)
The choice of model is affected by our ability to train it (NN).

The Simplest Model
The simplest POS model considers each word separately: we tag each word with its most likely part of speech.
– This works quite well: about 90% accuracy when trained and tested on similar texts.
– Although many words have multiple parts of speech, one POS typically dominates within a single text type.
How can we take advantage of context to do better?
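The unigram baseline can be sketched in a few lines. The training pairs below are a toy stand-in for a real annotated corpus, and the NN fallback for unseen words is an assumption (motivated by the unknown-word discussion later in the lecture):

```python
from collections import Counter, defaultdict

# Toy training data: (word, tag) pairs; "can" is ambiguous (NN vs. MD).
train = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("can", "MD"),
         ("rusts", "VBZ"), ("the", "DT")]

# Count how often each word receives each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_likely_tag(word):
    # Pick the tag most frequent for this word; guess NN for unseen words.
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return "NN"

most_likely_tag("can")   # -> 'MD' (2 of its 3 training occurrences)
```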

A Language Model
To see how we might do better, let us consider a related problem: building a language model.
– A language model can generate sentences following some probability distribution.

Markov Model
In principle, each word we select depends on all the decisions which came before (all preceding words in the sentence). But we'll make life simple by assuming that each decision depends only on the immediately preceding decision: a [first-order] Markov Model.
It is representable by a finite-state transition network, where T_ij = probability of a transition from state i to state j.

Finite State Network
(diagram: a transition network with states start, cat, dog, and end; the cat state emits "meow" and the dog state emits "woof")

Our Bilingual Pets
Suppose our cat learned to say "woof" and our dog "meow" … they started chatting in the next room … and we wanted to know who said what.

Hidden State Network
(diagram: the same network with hidden states start, cat, dog, and end; either state may emit "woof" or "meow", so the output no longer identifies the state)

How Do We Predict?
When the cat is talking: t_i = cat. When the dog is talking: t_i = dog.
We construct a probabilistic model of the phenomenon, and then seek the most likely state sequence S.

Hidden Markov Model
Assume the current word depends only on the current tag.
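Combining this emission assumption with the first-order Markov assumption on tags, the joint probability of a tag sequence and a word sequence factors as:

```latex
P(t_{1,n}, w_{1,n}) \;\approx\; \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```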

HMM for POS Tagging
We can use the same formulas for POS tagging:
– states ↔ POS tags

Training an HMM
Training an HMM is simple if we have a completely labeled corpus, i.e., one where the POS of each word has been marked:
– we can directly estimate both P(t_i | t_i-1) and P(w_i | t_i) from corpus counts using the Maximum Likelihood Estimator.
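The maximum-likelihood estimates are just ratios of counts. A minimal sketch, using two toy tagged sentences in place of a real corpus (and no smoothing):

```python
from collections import Counter

# Toy labeled corpus: each sentence is a list of (word, tag) pairs.
tagged = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]

trans, emit, tag_counts, prev_counts = Counter(), Counter(), Counter(), Counter()
for sent in tagged:
    prev = "<s>"                    # sentence-start pseudo-tag
    for word, tag in sent:
        trans[(prev, tag)] += 1     # count(t_i-1, t_i)
        prev_counts[prev] += 1      # count(t_i-1)
        emit[(tag, word)] += 1      # count(t_i, w_i)
        tag_counts[tag] += 1        # count(t_i)
        prev = tag

def p_trans(t, prev):               # MLE of P(t_i | t_i-1)
    return trans[(prev, t)] / prev_counts[prev]

def p_emit(w, t):                   # MLE of P(w_i | t_i)
    return emit[(t, w)] / tag_counts[t]

p_trans("NN", "DT")   # -> 1.0 (every DT is followed by NN in the toy data)
p_emit("dog", "NN")   # -> 0.5
```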

Greedy Decoder
The simplest decoder (tagger) assigns tags deterministically from left to right:
– it selects t_i to maximize P(w_i | t_i) * P(t_i | t_i-1)
– it does not take advantage of right context
Can we do better?
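The greedy decoder can be sketched as follows. The probability tables here are invented toy values, not estimates from any real corpus:

```python
# Toy emission probabilities P(word | tag) and transitions P(tag | prev_tag).
P_EMIT = {("NN", "can"): 0.1, ("MD", "can"): 0.3, ("VB", "can"): 0.05,
          ("DT", "the"): 0.5, ("NN", "dog"): 0.2}
P_TRANS = {("<s>", "DT"): 0.6, ("<s>", "MD"): 0.1, ("DT", "NN"): 0.7,
           ("DT", "MD"): 0.05, ("NN", "MD"): 0.3, ("NN", "NN"): 0.2}
TAGS = ["DT", "NN", "MD", "VB"]

def greedy_tag(words):
    # Left to right: commit to the tag maximizing P(w|t) * P(t|prev);
    # never revisit a decision, and never look at the words to the right.
    prev, out = "<s>", []
    for w in words:
        best = max(TAGS, key=lambda t: P_EMIT.get((t, w), 0.0)
                                       * P_TRANS.get((prev, t), 0.0))
        out.append(best)
        prev = best
    return out

greedy_tag(["the", "dog", "can"])   # -> ['DT', 'NN', 'MD']
```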


Performance
Accuracy with a good unknown-word model, trained and tested on WSJ, is 96.5% to 96.8%.

Unknown Words
As with Naive Bayes, there is a problem of zero counts: words not in the training corpus.
– simplest: assume all POS are equally likely for unknown words
– we can make a better estimate by observing that unknown words are very likely open-class words, and most likely nouns
– base P(t | w) for an unknown word on the probability distribution of words which occur once in the corpus

Unknown Words, cont'd
– we can do even better by taking into account the form of a word:
– whether it is capitalized
– whether it is hyphenated
– its last few letters
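The hapax-based estimate from the previous slide can be sketched directly: take the words that occur exactly once in training as a proxy for unknown words, and use their tag distribution (optionally sharpened further by surface features such as capitalization or suffix). The training data below is a toy stand-in:

```python
from collections import Counter

# Toy tagged training data.
train = [("the", "DT"), ("dog", "NN"), ("dogs", "NNS"), ("barks", "VBZ"),
         ("the", "DT"), ("runs", "VBZ"), ("cat", "NN")]

# Words occurring exactly once (hapax legomena) behave like unknown words.
word_freq = Counter(w for w, _ in train)
hapax_tags = Counter(t for w, t in train if word_freq[w] == 1)
total = sum(hapax_tags.values())

def p_tag_given_unknown(tag):
    # P(t | unknown word), estimated from the hapax tag distribution.
    return hapax_tags[tag] / total

p_tag_given_unknown("NN")   # open-class tags dominate among once-seen words
```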

Trigram Models
In some cases we need to look two tags back to find an informative context:
– e.g., conjunction (N and N, V and V, …)
But there is not enough data for a pure trigram model, so we combine unigram, bigram, and trigram probabilities:
– linear interpolation
– backoff
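Linear interpolation is a weighted sum of the three estimates, with weights summing to 1. The weights below are illustrative; in practice they are tuned on held-out data:

```python
# Interpolation weights for unigram, bigram, and trigram estimates
# (illustrative values; lambda1 + lambda2 + lambda3 must equal 1).
L1, L2, L3 = 0.1, 0.3, 0.6

def p_interp(p_uni, p_bi, p_tri):
    # P(t_i | t_i-2, t_i-1) ~ L1*P(t_i) + L2*P(t_i|t_i-1) + L3*P(t_i|t_i-2,t_i-1)
    return L1 * p_uni + L2 * p_bi + L3 * p_tri

# Even when the trigram was never seen (p_tri = 0), the interpolated
# estimate stays nonzero thanks to the lower-order terms.
p_interp(0.2, 0.4, 0.0)
```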

Domain Adaptation
There is a substantial loss in accuracy when shifting to a new domain:
– 8-10% loss in the shift from WSJ to the biology domain
– adding a small annotated sample ( sentences) in the new domain greatly reduces the error
– some reduction is possible without annotated target data (Blitzer, Structural Correspondence Learning)

Jet Tagger
– HMM-based
– trained on WSJ
– file pos_hmm.txt

Transformation-Based Learning
TBL provides a very different corpus-based approach to part-of-speech tagging. It learns a set of rules for tagging:
– the result is inspectable

TBL Model
TBL starts by assigning each word its most likely part of speech. Then it applies a series of transformations to the corpus:
– each transformation states some condition and some change to be made to the assigned POS if the condition is met
– for example: Change NN to VB if the preceding tag is TO. Change VBP to VB if one of the previous 3 tags is MD.

Transformation Templates
Each transformation is based on one of a small number of templates, such as:
– Change tag x to y if the preceding tag is z.
– Change tag x to y if one of the previous 2 tags is z.
– Change tag x to y if one of the previous 3 tags is z.
– Change tag x to y if the next tag is z.
– Change tag x to y if one of the next 2 tags is z.
– Change tag x to y if one of the next 3 tags is z.

Training the TBL Model
To train the tagger, using a hand-tagged corpus, we begin by assigning each word its most common POS. We then try all possible rules (all instantiations of one of the templates) and keep the best rule: the one which corrects the most errors. We repeat this until we can no longer find a rule which corrects some minimum number of errors.
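One round of this rule search can be sketched for a single template, "change x to y if the previous tag is z". This is a toy implementation over hypothetical tag sequences, scoring each candidate rule by errors corrected minus errors introduced:

```python
from itertools import product

def best_rule(current, gold, tagset):
    # Try every instantiation (x, y, z) of the template
    # "change x to y if the previous tag is z" and keep the rule
    # with the largest net gain against the gold tags.
    best, best_gain = None, 0
    for x, y, z in product(tagset, repeat=3):
        if x == y:
            continue
        gain = 0
        for i in range(1, len(current)):
            if current[i] == x and current[i - 1] == z:
                # +1 if the change fixes this tag, -1 if it breaks it.
                gain += (gold[i] == y) - (gold[i] == x)
        if gain > best_gain:
            best, best_gain = (x, y, z), gain
    return best

# "to walk and to run": the unigram baseline mis-tags the verbs as NN.
current = ["TO", "NN", "CC", "TO", "NN"]
gold    = ["TO", "VB", "CC", "TO", "VB"]
best_rule(current, gold, ["TO", "NN", "VB", "CC"])
# finds: change NN to VB if the previous tag is TO
```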

Some Transformations
The first 9 transformations found for the WSJ corpus:

Change | to  | if
NN     | VB  | previous tag is TO
VBP    | VB  | one of previous 3 tags is MD
NN     | VB  | one of previous 2 tags is MD
VB     | NN  | one of previous 2 tags is DT
VBD    | VBN | one of previous 3 tags is VBZ
VBN    | VBD | previous tag is PRP
VBN    | VBD | previous tag is NNP
VBD    | VBN | previous tag is VBD
VBP    | VB  | previous tag is TO

TBL Performance
Performance is competitive with a good HMM: accuracy of 96.6% on WSJ.
Compared to an HMM, TBL is much slower to train, but faster to apply.