Ch. 9: Part-of-Speech Tagging (slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, Bonnie Dorr, and Mitch Marcus)

Parts of Speech
8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.).
Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS.
We'll use POS most frequently.

POS examples for English
Tag   Category      Examples
N     noun          chair, bandwidth, pacing
V     verb          study, debate, munch
ADJ   adjective     purple, tall, ridiculous
ADV   adverb        unfortunately, slowly
P     preposition   of, by, to
PRO   pronoun       I, me, mine
DET   determiner    the, a, that, those

Open Class Words
Every known human language has nouns and verbs.
Nouns: people, places, things. Classes of nouns: proper vs. common, count vs. mass.
Verbs: actions and processes.
Adjectives: properties, qualities.
Adverbs: hodgepodge! Example: "Unfortunately, John walked home extremely slowly yesterday."

Definition: An adverb is a part of speech. It is any word that modifies any part of language other than a noun: verbs, adjectives (including numbers), clauses, sentences, and other adverbs. Modifiers of nouns are primarily determiners and adjectives.

Closed Class Words
Differ more from language to language than open class words.
Examples:
  prepositions: on, under, over, …
  particles: up, down, on, off, …
  determiners: a, an, the, …
  pronouns: she, who, I, …
  conjunctions: and, but, or, …
  auxiliary verbs: can, may, should, …
  numerals: one, two, three, third, …

Prepositions from CELEX

Pronouns in CELEX

Conjunctions

Auxiliaries

NLP Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun

POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus.
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET

POS Tagging example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N

What is POS tagging good for?
Speech synthesis: how to pronounce "lead"? Stress can also depend on the part of speech (noun vs. verb): INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT.
Stemming for information retrieval: knowing a word is a noun tells you it takes plurals, so a search for "aardvarks" can also retrieve "aardvark".
Parsing, speech recognition, etc.: possessive pronouns (my, your, her) are likely to be followed by nouns; personal pronouns (I, you, he) are likely to be followed by verbs.

Related Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins:
  From primary structure: ATCPLELLLD
  Infer secondary structure: HHHBBBBBC..

History (from Yair Halevi, Bar-Ilan U.)
Timeline spanning 1960 to 2000; events as listed on the slide:
  Brown Corpus created (EN-US), 1 million words
  Brown Corpus tagged
  HMM tagging (CLAWS): 93%-95%
  Greene and Rubin, rule based: 70%
  LOB Corpus created (EN-UK), 1 million words
  DeRose/Church, efficient HMM, sparse data: 95%+
  British National Corpus (tagged by CLAWS)
  POS tagging separated from other NLP
  Transformation-based tagging (Eric Brill), rule based: 95%+
  Tree-based statistics (Helmut Schmid), rule based: 96%+
  Neural network: 96%+
  Trigram tagger (Kempe): 96%+
  Combined methods: 98%+
  Penn Treebank Corpus (WSJ, 4.5M)
  LOB Corpus tagged

British National Corpus
What is it used for? Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus.
The main uses of the corpus are as follows:
  Reference book publishing: dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.
  Linguistic research: raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...
  Artificial intelligence: extensive data test bed for program development.
  Natural language processing: taggers, parsers, natural language understanding programs, spell checking word lists...
  English language teaching: syllabus and materials design, classroom reference, independent learner research.

Penn Treebank Tagset

A Simplified Tagset for English
Tagsets for English grew progressively larger from the Brown Corpus onward, until the Penn Treebank project:
  Brown Corpus: 87 tags
  LOB Corpus: 135 tags
  Lancaster UCREL group: 166 tags
  London-Lund Corpus: 197 tags
  UPenn Treebank: 34 tags + punctuation

Rationale behind British & European tag sets
To provide "distinct codings for all classes of words having distinct grammatical behaviour" (Garside et al. 1987).
The Lund tagset for adverbs distinguishes between:
  Adjunct: Process, Space, Time
  Wh-type: Manner, Reason, Space, Time, Wh-type + 'S
  Conjunct: Appositional, Contrastive, Inferential, Listing, …
  Disjunct: Content, Style
  Postmodifier: "else"
  Negative: "not"
  Discourse Item: Appositional, Expletive, Greeting, Hesitator, …

Reasons for a Smaller Tagset
Many tags are unique to particular lexical items, and can be recovered automatically if desired.
Brown tags for verbs:
  sing/VB       have/HV       be/BE
  sings/VBZ     has/HVZ       is/BEZ
  sang/VBD      had/HVD       was/BED
  singing/VBG   having/HVG    being/BEG
  sung/VBN      had/HVN       been/BEN
Penn Treebank tags for verbs:
  sing/VB       have/VB       be/VB
  sings/VBZ     has/VBZ       is/VBZ
  sang/VBD      had/VBD       was/VBD
  singing/VBG   having/VBG    being/VBG
  sung/VBN      had/VBN       been/VBN
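As a rough illustration of "can be recovered automatically", the sketch below maps a word plus its Penn-style verb tag back to the finer Brown-style tag, using only the forms of "have" and "be" shown on this slide. The function name `brown_verb_tag` and the two word sets are hypothetical, and only the five verb tags listed on the slide are handled.

```python
# Hypothetical helper: recover a Brown-style verb tag from a Penn-style tag
# plus the word itself. Covers only the forms listed on the slide.
HAVE_FORMS = {"have", "has", "had", "having"}
BE_FORMS = {"be", "is", "was", "being", "been"}
SLIDE_VERB_TAGS = {"VB", "VBZ", "VBD", "VBG", "VBN"}

def brown_verb_tag(word: str, penn_tag: str) -> str:
    """Map (word, Penn verb tag) to the corresponding Brown verb tag."""
    if penn_tag not in SLIDE_VERB_TAGS:
        return penn_tag          # not one of the slide's verb tags: leave as-is
    suffix = penn_tag[2:]        # "", "Z", "D", "G" or "N"
    lower = word.lower()
    if lower in HAVE_FORMS:
        return "HV" + suffix     # e.g. had/VBN -> had/HVN
    if lower in BE_FORMS:
        return "BE" + suffix     # e.g. was/VBD -> was/BED
    return penn_tag              # ordinary verbs keep the VB* tag

print(brown_verb_tag("had", "VBN"))
print(brown_verb_tag("sang", "VBD"))
```

The word itself carries the information the larger tagset encoded in the tag, which is the argument for the smaller tagset.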

Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
The Old Solution: combinatorial search. If each of n words has k tags on average, try the $k^n$ combinations until one works.
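To make the size of that search concrete, here is a minimal sketch that enumerates every candidate tag sequence for the example sentence. The tag lists are copied from the Brown listing on the slide; the names `LEXICON` and `candidate_taggings` are made up for this illustration. The number of candidates is the product of the per-word tag counts.

```python
from itertools import product
from math import prod

# Candidate tags per word, copied from the slide's Brown Corpus listing.
LEXICON = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}

def candidate_taggings(words):
    """Return an iterator over every possible tag sequence for `words`."""
    return product(*(LEXICON[w] for w in words))

sentence = "heat oil in a large pot".split()
n_candidates = prod(len(LEXICON[w]) for w in sentence)
all_taggings = list(candidate_taggings(sentence))

print(n_candidates)       # 2 * 1 * 3 * 3 * 3 * 1 = 54
print(all_taggings[0])    # ('noun', 'noun', 'prep', 'det', 'adj', 'noun')
```

Six words already give 54 candidates, and the count multiplies with every additional ambiguous word, which is why exhaustive search does not scale.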

NLP Task I – Determining Part of Speech Tags
Machine Learning Solutions: automatically learn Part of Speech (POS) assignment. The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.

Simple Statistical Approaches: Idea 1

Simple Statistical Approaches: Idea 2
For a string of words $W = w_1 w_2 w_3 \ldots w_n$, find the string of POS tags $T = t_1 t_2 t_3 \ldots t_n$ which maximizes $P(T \mid W)$, i.e., the probability of tag string T given that the word string was W (that is, that W was tagged T).

Again, The Sparse Data Problem …
A simple, impossible approach to compute $P(T \mid W)$: count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string. It is impossible in practice because most word strings never occur in the training corpus at all.
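A small sketch of why whole-string counting fails: the lookup works only for sentences seen verbatim in training. The one-sentence toy corpus reuses the koala example from the earlier slide; the function names are invented for this illustration.

```python
from collections import Counter, defaultdict

# Toy tagged corpus: a single sentence, taken from the tagging example slide.
TOY_CORPUS = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
]

def whole_string_table(corpus):
    """Map each full word string to a Counter of the tag strings seen for it."""
    table = defaultdict(Counter)
    for sent in corpus:
        words = tuple(w for w, _ in sent)
        tags = tuple(t for _, t in sent)
        table[words][tags] += 1
    return table

def tag_whole_string(table, words):
    """Return the most common tag string for `words`, or None if unseen."""
    counts = table.get(tuple(words))
    if not counts:
        return None
    return counts.most_common(1)[0][0]

table = whole_string_table(TOY_CORPUS)
seen = tag_whole_string(table, "the koala put the keys on the table".split())
unseen = tag_whole_string(table, "heat oil in a large pot".split())
print(seen)
print(unseen)   # None: the string never occurred, so there is nothing to count
```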

A Practical Statistical Tagger

A Practical Statistical Tagger II
But we can't accurately estimate more than tag bigrams or so… We change to a model that we CAN estimate:

A Practical Statistical Tagger III
So, for a given string $W = w_1 w_2 w_3 \ldots w_n$, the tagger needs to find the string of tags T which maximizes
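The formula itself did not survive on this slide. A sketch of the standard bigram formulation that the preceding slides point toward (Bayes' rule, dropping the constant $P(W)$, then the tag-bigram approximation) is given below. The start symbol $t_0$ and the exact factorization are assumptions, not recovered from the slides.

```latex
\hat{T} = \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\, P(T)
        \approx \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

Here $P(t_i \mid t_{i-1})$ is the tag-bigram term that can be estimated from a tagged corpus, and $P(w_i \mid t_i)$ assumes each word depends only on its own tag.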

Training and Performance
To estimate the parameters of this model, given an annotated training corpus:
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
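The estimation formulas are missing from the slide. The sketch below assumes the usual relative-frequency estimates for a bigram tagger, count(previous tag, tag) / count(previous tag) and count(tag, word) / count(tag), with add-k smoothing and Viterbi decoding. The smoothing method, the `<s>` start symbol, and all function names are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter

START = "<s>"  # assumed start-of-sentence pseudo-tag

def train(corpus, k=1.0):
    """Estimate add-k smoothed log-probabilities from (word, tag) sentences."""
    trans, emit = Counter(), Counter()
    prev_count, tag_count = Counter(), Counter()
    vocab = set()
    for sent in corpus:
        prev = START
        for word, tag in sent:
            trans[(prev, tag)] += 1   # count(previous tag, tag)
            prev_count[prev] += 1     # count(previous tag)
            emit[(tag, word)] += 1    # count(tag, word)
            tag_count[tag] += 1       # count(tag)
            vocab.add(word)
            prev = tag
    tags = sorted(tag_count)
    n_tags, n_words = len(tags), len(vocab) + 1  # +1 slot for unseen words

    def log_trans(prev, tag):
        return math.log((trans[(prev, tag)] + k) / (prev_count[prev] + k * n_tags))

    def log_emit(tag, word):
        return math.log((emit[(tag, word)] + k) / (tag_count[tag] + k * n_words))

    return tags, log_trans, log_emit

def viterbi(words, tags, log_trans, log_emit):
    """Return the tag sequence with the highest bigram-model score."""
    if not words:
        return []
    # best[t] = (log-probability of the best path ending in tag t, that path)
    best = {t: (log_trans(START, t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_trans(p, t), best[p][1]) for p in tags
            )
            new_best[t] = (score + log_emit(t, word), path + [t])
        best = new_best
    return max(best.values())[1]

# Toy training data: the koala sentence from the tagging example slide.
TOY_CORPUS = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
]
tags, log_trans, log_emit = train(TOY_CORPUS)
predicted = viterbi("the koala put the keys on the table".split(),
                    tags, log_trans, log_emit)
print(predicted)
```

Dynamic programming keeps only the best path per tag at each position, so decoding costs on the order of n times the square of the number of tags rather than enumerating every tag sequence.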
