School of Computing, Faculty of Engineering. Machine Learning PoS-Taggers. COMP3310 Natural Language Processing. Eric Atwell, Language Research Group (with thanks to Katja Markert, Marti Hearst, and other contributors)

Reminder: Puns play on our assumptions about the next word, e.g. by presenting us with an unexpected homonym (bends). ConditionalFreqDist() counts word pairs, i.e. word bigrams, used for story generation, speech recognition, … Parts of Speech group words into grammatical categories and separate the different functions of a word. In English many words are ambiguous, with 2 or more possible PoS-tags. A very simple tagger: tag each word with its likeliest tag. Better PoS-taggers: to come…

Taking context into account: the theory behind some example Machine Learning PoS-taggers, with example implementations in NLTK. Machine Learning from a PoS-tagged training corpus: statistical (N-gram/Markov) taggers learn a table of 1/2/3/N-tag sequence frequencies; the Brill (transformation-based) tagger learns the likeliest tag for each word ignoring context, then learns rules to change tags to fit the context. NB you don't have to use NLTK – it is just a useful illustration.

Training and Testing of Machine Learning Algorithms. Algorithms that learn from data see a set of examples and try to generalize from them. Training set: the examples the algorithm is trained on. Test set (also called held-out or unseen data): used for evaluating your algorithm; it must be kept separate from the training set – otherwise you cheated! Gold-standard evaluation corpus: an evaluation set that a community has agreed on and uses as a common benchmark; it is not seen until development is finished and is used ONLY for evaluation.
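
A minimal sketch of such a split in NLTK (not from the original slides; the news category and the 90/10 proportion are illustrative assumptions):

import nltk
from nltk.corpus import brown

# Tagged sentences from one section of the Brown corpus
tagged_sents = list(brown.tagged_sents(categories='news'))

# Hold out the last 10% as unseen test data; train on the rest
cut = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:cut], tagged_sents[cut:]
print(len(train_sents), "training sentences,", len(test_sents), "test sentences")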

Cross-Validation of Learning Algorithms. A cross-validation set is part of the training set, used for tuning the parameters of the algorithm without polluting (tuning to) the test data. You can train on x% of the training data and cross-validate on the remaining (100-x)%: e.g. train on 90% of the training data and cross-validate (test) on the remaining 10%, then repeat several times with different splits. This allows you to choose the best settings to then use on the real test set. You should only evaluate on the test set at the very end, after you have got your algorithm as good as possible on the cross-validation set.
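
A rough sketch of this kind of repeated split (the 10 folds and the choice of UnigramTagger as the algorithm being tuned are my assumptions, not the lecture's):

import nltk

def cross_validate(tagged_sents, folds=10):
    """Average tagging accuracy over several train/validate splits."""
    fold_size = len(tagged_sents) // folds
    scores = []
    for i in range(folds):
        # Fold i is the validation slice; the rest is training data
        validate = tagged_sents[i * fold_size:(i + 1) * fold_size]
        train = tagged_sents[:i * fold_size] + tagged_sents[(i + 1) * fold_size:]
        tagger = nltk.UnigramTagger(train, backoff=nltk.DefaultTagger('NN'))
        scores.append(tagger.evaluate(validate))  # .accuracy() in newer NLTK
    return sum(scores) / len(scores)

# e.g. cross_validate(train_sents) on the training portion from the split above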

Strong Baselines. When designing NLP algorithms, you need to evaluate them by comparing to others. A baseline algorithm is one that is relatively simple but can be expected to do OK; it should get the best score possible by doing the obvious thing.

A Tagging Baseline: find the most likely tag for the most frequent words. Frequent words are ambiguous, and you are likely to see frequent words in any collection – you will always see "to" but might not see "armadillo". How to do this? First find the most frequent words and their most likely tags in the training data, then train a tagger that looks up these results in a table.

Find the most frequent words and the most likely tag of each
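
A sketch of this step, along the lines of the lookup-tagger example in the NLTK book (the news category and the cut-off of 100 most frequent words are illustrative assumptions):

import nltk
from nltk.corpus import brown

# Most frequent words in the corpus
fd = nltk.FreqDist(brown.words(categories='news'))
# For each word, how often it occurs with each tag
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

# Table mapping each of the 100 most frequent words to its most likely tag
most_freq_words = [w for (w, _) in fd.most_common(100)]
likely_tags = dict((w, cfd[w].max()) for w in most_freq_words)
print(likely_tags['to'], likely_tags['the'])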

Use our own tagger class
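
The class from the original slide is not reproduced here; a minimal stand-in that does the same table-lookup-with-default job, built on NLTK's TaggerI interface (the class name and the 'NN' default are my own):

from nltk.tag import TaggerI

class LookupTagger(TaggerI):
    """Tag words found in a lookup table; fall back to a default tag."""
    def __init__(self, table, default='NN'):
        self._table = table
        self._default = default

    def tag(self, tokens):
        return [(t, self._table.get(t, self._default)) for t in tokens]

# e.g. LookupTagger(likely_tags) with the table from the previous sketch
baseline = LookupTagger({'to': 'TO', 'the': 'AT'})
print(baseline.tag("the armadillo wants to race".split()))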

N-Grams. The N stands for how many terms are used: unigram = 1 term (0th order), bigram = 2 terms (1st order), trigram = 3 terms (2nd order); we usually don't go beyond this. You can use different kinds of terms, e.g. character-based n-grams, word-based n-grams, or PoS-based n-grams. Ordering: the terms are often adjacent, but this is not required. We use n-grams to help determine the context in which some linguistic phenomenon happens, e.g. look at the words before and after a period to see whether or not it marks the end of a sentence.
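
NLTK can extract word or character n-grams directly; a small illustrative snippet (not from the slides):

import nltk

words = "we use n-grams to help determine the context".split()
print(list(nltk.bigrams(words)))        # word bigrams: adjacent pairs
print(list(nltk.ngrams(words, 3)))      # word trigrams
print(list(nltk.ngrams("context", 3)))  # character trigrams of a single word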

Modified from Massimo Poesio's lecture 11. Tagging with lexical frequencies: Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN; People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN. Problem: assign a tag to "race" given its lexical frequency. Solution: choose the tag that has the greater probability, comparing P(race|VB) with P(race|NN).
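
These lexical frequencies can be read straight off a tagged corpus with a ConditionalFreqDist; a sketch using the Brown corpus (the counts printed are whatever the corpus gives, not figures quoted from the slides):

import nltk
from nltk.corpus import brown

# Count how often each word occurs with each tag
cfd = nltk.ConditionalFreqDist(brown.tagged_words())

print(cfd['race']['NN'], cfd['race']['VB'])  # frequency of race/NN vs race/VB
print(cfd['race'].max())                     # the single most likely tag for "race"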

Unigram Tagger. Train on a set of sentences, keeping track of how many times each word is seen with each tag. After training, associate with each word its most likely tag. Problem: many words are never seen in the training data. Solution: have a default tag to back off to.

Unigram tagger with Backoff
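
In NLTK this is one line of training plus a backoff tagger; a sketch reusing the 90/10 train/test split from the earlier sketch (the 'NN' default follows the slides):

import nltk

default_tagger = nltk.DefaultTagger('NN')   # tag everything 'NN' if all else fails
unigram_tagger = nltk.UnigramTagger(train_sents, backoff=default_tagger)

print(unigram_tagger.tag("the race for outer space".split()))
print(unigram_tagger.evaluate(test_sents))  # .accuracy(test_sents) in newer NLTK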

What's wrong with unigram? The most frequent tag isn't always right! We need to take the context into account: which sense of "to" is being used? Which sense of "like" is being used?

N-gram tagger: uses the preceding N-1 predicted tags, and also uses the unigram estimate for the current word.

Modified from Diane Litman's version of Steve Bird's notes. Bigram Tagging: for tagging, in addition to considering the token's type, the context also considers the tags of the n preceding tokens. What is the most likely tag for word n, given word n-1 and tag n-1? The tagger picks the tag which is most likely for that context.
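
A bigram tagger in NLTK, again sketched on the same train_sents split (note the None tags for contexts it never saw in training, which the backoff combination on the next slide fixes):

import nltk

bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.tag("the race for outer space".split()))
# (word, previous-tag) contexts unseen in training come back as None
# unless we give the tagger a backoff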

Modified from Diane Litman's version of Steve Bird's notes. Combining Taggers using Backoff: use the more accurate algorithms when we can, and back off to wider coverage when needed. Try tagging the token with the 1st-order (bigram) tagger; if the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order (unigram) tagger; if the 0th-order tagger is also unable to find a tag, use the default tagger. Important point: bigram and trigram taggers use the previous tag context to assign new tags; if they see a tag of None in the previous context, they will output None too.

Demonstrating the n-gram taggers: trained on brown.tagged('a'), tested on brown.tagged('b'); backs off to a default of 'nn'.
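
The brown.tagged('a') call is from an earlier NLTK release; with the current corpus API the same demonstration looks roughly like this (the news and editorial categories are stand-ins for sections a and b):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')       # stand-in for section a
test_sents = brown.tagged_sents(categories='editorial')   # stand-in for section b

t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

for tagger in (t0, t1, t2, t3):
    print(tagger, tagger.evaluate(test_sents))  # compare accuracies up the backoff chain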

Demonstrating the n-gram taggers

Combining Taggers: the bigram backoff tagger did worse than the unigram! Why? Why does it get better again with trigrams? How can we improve these scores?

Modified from Diane Litman's version of Steve Bird's notes. Rule-Based Tagger – the Linguistic Complaint: where is the linguistic knowledge of a tagger? It is just a massive table of numbers; aren't there any linguistic insights that could emerge from the data? We could instead use handcrafted sets of rules to tag input sentences, for example: if a word follows a determiner, tag it as a noun. Constraint Grammar (CG) tagger: a PhD student spends 3+ years coding a large set of these rules (for English, Finnish, …). Machine Learning researchers would prefer to use ML to extract a large set of such rules from a PoS-tagged training corpus.

Slide modified from Massimo Poesio's. The Brill tagger: an example of Transformation-Based Learning. Basic idea: do a quick job first (using frequency), then revise it using contextual rules. Very popular (freely available, works fairly well). A supervised method: it requires a tagged corpus.

Brill Tagging in more detail. Start with simple (less accurate) rules, then learn better ones from a tagged corpus: 1. Tag each word initially with its most likely PoS. 2. Examine a set of transformations to see which most improves tagging decisions compared with the tagged corpus. 3. Re-tag the corpus using the best transformation. 4. Repeat until, e.g., performance doesn't improve. Result: a tagging procedure (an ordered list of transformations) which can be applied to new, untagged text.
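
NLTK ships a transformation-based (Brill) trainer; a hedged sketch of driving it (the fntbl37 template set, the unigram initial tagger and max_rules=20 are my choices, not the lecture's):

import nltk
from nltk.corpus import brown
from nltk.tag.brill import fntbl37
from nltk.tag.brill_trainer import BrillTaggerTrainer

tagged = list(brown.tagged_sents(categories='news'))
cut = int(len(tagged) * 0.9)
train_sents, test_sents = tagged[:cut], tagged[cut:]

# Step 1: a quick initial tagger based on frequency alone
initial = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

# Steps 2-4: learn an ordered list of contextual transformation rules
trainer = BrillTaggerTrainer(initial, fntbl37(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=20)

print(brill_tagger.evaluate(test_sents))  # .accuracy(test_sents) in newer NLTK
for rule in brill_tagger.rules()[:5]:     # the first few learned transformations
    print(rule)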

Slide modified from Massimo Poesio's. An example. Sentences: "They are expected to race tomorrow." and "The race for outer space." Tagging algorithm: 1. Tag all uses of "race" as NN (the most likely tag in the Brown corpus): They are expected to race/NN tomorrow; the race/NN for outer space. 2. Use a transformation rule to replace the tag NN with VB for all uses of "race" preceded by the tag TO: They are expected to race/VB tomorrow; the race/NN for outer space.

Example Rule Transformations

Sample Final Rules

Summary: N-gram/Markov and Transformation/Brill PoS-Taggers. Theory behind some example Machine Learning PoS-taggers, with example implementations in NLTK. Machine Learning from a PoS-tagged training corpus: statistical (N-gram/Markov) taggers learn a table of 1/2/3/N-tag sequence frequencies, backing off to N-1 patterns when there is not enough data for N; the Brill (transformation-based) tagger learns the likeliest tag for each word ignoring context, then learns rules to change tags to fit the context. NB you don't have to use NLTK – it is just a useful illustration.