Ch 9 Part of Speech Tagging (slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, Bonnie Dorr, and Mitch Marcus)

Parts of Speech
• 8 (ish) traditional parts of speech
  - Noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
• This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.)
• Called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS
• We'll use POS most frequently

POS examples for English
Tag   Class        Examples
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those

Open Class Words
• Every known human language has nouns and verbs
• Nouns: people, places, things
  - Classes of nouns: proper vs. common, count vs. mass
• Verbs: actions and processes
• Adjectives: properties, qualities
• Adverbs: hodgepodge!
  - Unfortunately, John walked home extremely slowly yesterday

Definition: An adverb is a part of speech. It is any word that modifies any other part of language: verbs, adjectives (including numbers), clauses, sentences, and other adverbs, but not nouns; modifiers of nouns are primarily determiners and adjectives.

Closed Class Words
• Differ more from language to language than open class words
• Examples:
  - prepositions: on, under, over, …
  - particles: up, down, on, off, …
  - determiners: a, an, the, …
  - pronouns: she, who, I, …
  - conjunctions: and, but, or, …
  - auxiliary verbs: can, may, should, …
  - numerals: one, two, three, third, …

Prepositions from CELEX

Pronouns in CELEX

Conjunctions

Auxiliaries

NLP Task I – Determining Part of Speech Tags
• The Problem:
Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun

POS Tagging: Definition
• The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
  - WORDS: the koala put the keys on the table
  - TAGS: N, V, P, DET

POS Tagging example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

What is POS tagging good for?
• Speech synthesis: how to pronounce "lead"?
  - INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
• Stemming for information retrieval
  - Knowing a word is an N tells you it takes plurals
  - A search for "aardvarks" can also return "aardvark"
• Parsing, speech recognition, etc.
  - Possessive pronouns (my, your, her) are likely to be followed by nouns
  - Personal pronouns (I, you, he) are likely to be followed by verbs
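The speech-synthesis use can be sketched as a lookup keyed on both the word and its tag. This is only an illustration: the `STRESS` table and the `pronounce` function are hypothetical names, and the assignment of each stress pattern to a tag follows the usual English pattern (first-syllable stress for the noun) rather than anything stated on the slide.

```python
# Hypothetical pronunciation lexicon keyed by (word, POS tag).
# Stress patterns are the slide's examples; the tag assignment is assumed.
STRESS = {
    ("insult", "N"): "INsult",
    ("insult", "V"): "inSULT",
    ("object", "N"): "OBject",
    ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow",
    ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount",
    ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent",
    ("content", "ADJ"): "conTENT",
}


def pronounce(tagged_words):
    """Return one stress-marked form per (word, tag) pair.

    Words without an entry for their tag are passed through unchanged.
    """
    return [STRESS.get((word.lower(), tag), word) for word, tag in tagged_words]


print(pronounce([("I", "PRO"), ("object", "V")]))
print(pronounce([("the", "DET"), ("object", "N")]))
```

Without the tag, the synthesizer would have to guess between the two stress patterns; with it, the choice is a dictionary lookup.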

Related Problem in Bioinformatics
• Durbin et al., Biological Sequence Analysis, Cambridge University Press.
• Several applications, e.g. proteins
• From primary structure ATCPLELLLD
• Infer secondary structure HHHBBBBBC..

History (from Yair Halevi, Bar-Ilan U.)
• Brown Corpus created (EN-US): 1 million words
• Brown Corpus tagged
• HMM tagging (CLAWS): 93%-95%
• Greene and Rubin: rule based, 70%
• LOB Corpus created (EN-UK): 1 million words
• DeRose/Church: efficient HMM, sparse data, 95%+
• British National Corpus (tagged by CLAWS)
• POS tagging separated from other NLP
• Transformation-based tagging (Eric Brill): rule based, 95%+
• Tree-based statistics (Helmut Schmid): rule based, 96%+
• Neural network: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+
• Penn Treebank Corpus (WSJ, 4.5M)
• LOB Corpus tagged

British National Corpus
What is it used for? Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus.
• The main uses of the corpus are as follows:
• Reference Book Publishing: dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.
• Linguistic Research: raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...
• Artificial Intelligence: extensive data test bed for program development.
• Natural Language Processing: taggers, parsers, natural language understanding programs, spell checking word lists...
• English Language Teaching: syllabus and materials design, classroom reference, independent learner research.

Penn Treebank Tagset

A Simplified Tagset for English
• Tagsets for English have grown progressively larger since the Brown Corpus until the Penn Treebank project.
Brown Corpus:            87 tags
LOB Corpus:              135 tags
Lancaster UCREL group:   166 tags
London-Lund Corpus:      197 tags
UPenn Treebank:          34 tags + punctuation

Rationale behind British & European tag sets
• To provide "distinct codings for all classes of words having distinct grammatical behaviour" (Garside et al.)
• The Lund tagset for adverbs distinguishes between:
  - Adjunct: Process, Space, Time
  - Wh-type: Manner, Reason, Space, Time, Wh-type + 'S
  - Conjunct: Appositional, Contrastive, Inferential, Listing, …
  - Disjunct: Content, Style
  - Postmodifier: "else"
  - Negative: "not"
  - Discourse Item: Appositional, Expletive, Greeting, Hesitator, …

Reasons for a Smaller Tagset
• Many tags are unique to particular lexical items, and can be recovered automatically if desired.

Brown Tags For Verbs
sing/VB       have/HV       be/BE
sings/VBZ     has/HVZ       is/BEZ
sang/VBD      had/HVD       was/BED
singing/VBG   having/HVG    being/BEG
sung/VBN      had/HVN       been/BEN

Penn Treebank Tags For Verbs
sing/VB       have/VB       be/VB
sings/VBZ     has/VBZ       is/VBZ
sang/VBD      had/VBD       was/VBD
singing/VBG   having/VBG    being/VBG
sung/VBN      had/VBN       been/VBN
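The claim that lexically specific tags "can be recovered automatically" can be illustrated with the Brown and Penn Treebank verb tags listed on this slide. The sketch below is an assumption-laden toy: `brown_to_penn`, `penn_to_brown`, and the two word lists are hypothetical names, and only the tags and word forms shown on the slide are covered, not the full Brown tagset.

```python
import re

# Word forms taken from the slide's tables; not a complete inventory.
HAVE_FORMS = {"have", "has", "had", "having"}
BE_FORMS = {"be", "is", "was", "being", "been"}


def brown_to_penn(tag):
    """Collapse the HV*/BE* verb tags into the shared VB* tags."""
    return re.sub(r"^(HV|BE)", "VB", tag)


def penn_to_brown(word, tag):
    """Recover the lexically specific tag from the word itself."""
    if tag.startswith("VB"):
        lowered = word.lower()
        if lowered in HAVE_FORMS:
            return "HV" + tag[2:]
        if lowered in BE_FORMS:
            return "BE" + tag[2:]
    return tag


for word, brown_tag in [("had", "HVN"), ("was", "BED"), ("sang", "VBD")]:
    penn_tag = brown_to_penn(brown_tag)
    print(word, brown_tag, penn_tag, penn_to_brown(word, penn_tag))
```

Because the distinction between HV, BE, and VB is fully determined by the word, the smaller tagset loses no information for these forms.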

Task I – Determining Part of Speech Tags
• The Problem:
Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
• The Old Solution: combinatorial search. If each of n words has k tags on average, try the $k^n$ combinations until one works.
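To make the size of the combinatorial search concrete, the sketch below enumerates every candidate tagging of "heat oil in a large pot" using the Brown Corpus listing from this slide. The names `LEXICON` and `candidate_taggings` are illustrative only, and the step that decides which combination "works" (a grammar check in the old approach) is left out.

```python
from itertools import product

# Possible tags per word, as listed for the Brown Corpus on the slide.
LEXICON = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}


def candidate_taggings(words):
    """Yield every tag sequence licensed by the lexicon, one tag per word."""
    return product(*(LEXICON[word] for word in words))


sentence = "heat oil in a large pot".split()
candidates = list(candidate_taggings(sentence))
print(len(candidates))  # 2 * 1 * 3 * 3 * 3 * 1 = 54
```

Even this six-word sentence yields 54 candidates, and the count grows multiplicatively with sentence length, which is why exhaustive search does not scale.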

NLP Task I – Determining Part of Speech Tags
• Machine Learning Solutions: automatically learn Part of Speech (POS) assignment.
  - The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.

Simple Statistical Approaches: Idea 1

Simple Statistical Approaches: Idea 2
For a string of words $W = w_1 w_2 w_3 \dots w_n$, find the string of POS tags $T = t_1 t_2 t_3 \dots t_n$ which maximizes $P(T \mid W)$, i.e., the probability of tag string T given that the word string was W (that W was tagged T).

Again, The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W): count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.

A Practical Statistical Tagger

A Practical Statistical Tagger II But we can't accurately estimate more than tag bigrams or so… We change to a model that we CAN estimate:

A Practical Statistical Tagger III
So, for a given string $W = w_1 w_2 w_3 \dots w_n$, the tagger needs to find the string of tags T which maximizes
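The slide's own formula is not reproduced in the text. As a sketch, and assuming the standard bigram (hidden Markov model) formulation that the mention of tag bigrams points to, the quantity being maximized would be derived as follows. Here $t_0$ is an assumed start-of-sentence marker, and the two independence assumptions are that each tag depends only on the previous tag and each word depends only on its own tag.

```latex
\begin{align*}
\hat{T} &= \arg\max_{T} P(T \mid W)
         = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
         = \arg\max_{T} P(W \mid T)\,P(T) \\
        &\approx \arg\max_{t_1 \dots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\,P(w_i \mid t_i)
\end{align*}
```

The denominator $P(W)$ drops out because it is the same for every candidate T, so it cannot change which tag string wins.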

Training and Performance
• To estimate the parameters of this model, given an annotated training corpus:
• Because many of these counts are small, smoothing is necessary for best results…
• Such taggers typically achieve about 95-96% correct tagging, for tag sets of … tags.
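Training from an annotated corpus can be sketched as counting followed by smoothing. The code below is a minimal illustration under stated assumptions, not the lecture's exact estimator: it assumes a bigram tag model, relative-frequency estimates with add-one smoothing (one of several possible smoothing schemes), and Viterbi decoding to find the best tag string. The names `train`, `viterbi`, and `START` are hypothetical, and the one-sentence corpus is a toy built from the koala example earlier in the slides.

```python
import math
from collections import Counter

START = "<s>"  # assumed start-of-sentence marker


def train(tagged_sentences):
    """Count tag bigrams and word/tag pairs; return smoothed log-probabilities."""
    trans = Counter()      # (previous tag, tag) counts
    context = Counter()    # how often each tag appears as the previous tag
    emit = Counter()       # (tag, word) counts
    tag_count = Counter()  # how often each tag appears
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            trans[prev, tag] += 1
            context[prev] += 1
            emit[tag, word] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(tag_count)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_trans(prev, tag):
        # Add-one smoothed estimate of P(tag | prev)
        return math.log((trans[prev, tag] + 1) / (context[prev] + len(tags)))

    def log_emit(tag, word):
        # Add-one smoothed estimate of P(word | tag)
        return math.log((emit[tag, word] + 1) / (tag_count[tag] + vocab_size))

    return tags, log_trans, log_emit


def viterbi(words, tags, log_trans, log_emit):
    """Return the highest-scoring tag sequence under the bigram model."""
    # best[t] = (log score of the best path ending in tag t, that path)
    best = {t: (log_trans(START, t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        step = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_trans(p, t), best[p][1]) for p in tags
            )
            step[t] = (score + log_emit(t, word), path + [t])
        best = step
    return max(best.values())[1]


corpus = [[("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
           ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")]]
tags, log_trans, log_emit = train(corpus)
print(viterbi("the koala put the keys on the table".split(),
              tags, log_trans, log_emit))
```

Working in log space avoids numeric underflow on long sentences, and the add-one terms keep unseen tag bigrams and unseen words from receiving zero probability, which is the practical reason smoothing is needed when counts are small.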