Published by Barrie Perkins. Modified over 9 years ago.
1
Ch 9 Part of Speech Tagging (slides adapted from Dan Jurafsky, Jim Martin, Dekang Lin, Rada Mihalcea, and Bonnie Dorr and Mitch Marcus.)
2
Parts of Speech
8 (ish) traditional parts of speech: noun, verb, adjective, preposition, adverb, article, interjection, pronoun, conjunction, etc.
This idea has been around for over 2000 years (Dionysius Thrax of Alexandria, c. 100 B.C.).
Also called: parts of speech, lexical categories, word classes, morphological classes, lexical tags, POS.
We’ll use POS most frequently.
3
POS examples for English
N     noun         chair, bandwidth, pacing
V     verb         study, debate, munch
ADJ   adjective    purple, tall, ridiculous
ADV   adverb       unfortunately, slowly
P     preposition  of, by, to
PRO   pronoun      I, me, mine
DET   determiner   the, a, that, those
4
Open Class Words
Every known human language has nouns and verbs.
Nouns: people, places, things
Classes of nouns: proper vs. common, count vs. mass
Verbs: actions and processes
Adjectives: properties, qualities
Adverbs: hodgepodge!
Unfortunately, John walked home extremely slowly yesterday
5
Definition: An adverb is a part of speech. It is any word that modifies any part of language other than a noun: verbs, adjectives (including numbers), clauses, sentences, and other adverbs. Modifiers of nouns are primarily determiners and adjectives.
6
Closed Class Words
Differ more from language to language than open class words.
Examples:
prepositions: on, under, over, …
particles: up, down, on, off, …
determiners: a, an, the, …
pronouns: she, who, I, …
conjunctions: and, but, or, …
auxiliary verbs: can, may, should, …
numerals: one, two, three, third, …
7
Prepositions from CELEX
8
Pronouns in CELEX
9
Conjunctions
10
Auxiliaries
11
NLP Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown Corpus
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
12
POS Tagging: Definition
The process of assigning a part-of-speech or lexical class marker to each word in a corpus.
WORDS: the koala put the keys on the table
TAGS: N, V, P, DET
13
POS Tagging example
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
table   N
14
What is POS tagging good for?
Speech synthesis: how to pronounce “lead”?
INsult / inSULT, OBject / obJECT, OVERflow / overFLOW, DIScount / disCOUNT, CONtent / conTENT
Stemming for information retrieval
Knowing a word is a N tells you it gets plurals
Can search for “aardvarks”, get “aardvark”
Parsing, speech recognition, etc.
Possessive pronouns (my, your, her) are likely to be followed by nouns
Personal pronouns (I, you, he) are likely to be followed by verbs
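The speech-synthesis point can be illustrated with a minimal sketch: once a word's tag is known, a lookup keyed on (word, tag) picks the stress pattern. The table below is built from the slide's word pairs; the `STRESS` table and the `pronounce` helper are hypothetical names introduced only for this illustration, and the tag assigned to each stress pattern follows the usual English noun/verb (or noun/adjective) alternation.

```python
# Hypothetical stress lexicon keyed on (word, POS tag); the capitalised
# syllable carries the stress, as on the slide.
STRESS = {
    ("insult", "N"): "INsult", ("insult", "V"): "inSULT",
    ("object", "N"): "OBject", ("object", "V"): "obJECT",
    ("overflow", "N"): "OVERflow", ("overflow", "V"): "overFLOW",
    ("discount", "N"): "DIScount", ("discount", "V"): "disCOUNT",
    ("content", "N"): "CONtent", ("content", "ADJ"): "conTENT",
}


def pronounce(word, tag):
    """Return the stress-marked form for a tagged word, or the word unchanged."""
    return STRESS.get((word.lower(), tag), word)


print(pronounce("object", "N"))  # OBject
print(pronounce("object", "V"))  # obJECT
```

Without the tag, the lookup has no way to choose between the two entries for the same spelling, which is why a tagger is useful as a front end to synthesis.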
15
Related Problem in Bioinformatics
Durbin et al., Biological Sequence Analysis, Cambridge University Press.
Several applications, e.g. proteins:
From primary structure ATCPLELLLD
Infer secondary structure HHHBBBBBC..
16
History (from Yair Halevi, Bar-Ilan U.)
Timeline axis: 1960, 1970, 1980, 1990, 2000
Brown Corpus created (EN-US), 1 million words
Brown Corpus tagged
HMM tagging (CLAWS): 93%-95%
Greene and Rubin, rule based: 70%
LOB Corpus created (EN-UK), 1 million words
DeRose/Church, efficient HMM, sparse data: 95%+
British National Corpus (tagged by CLAWS)
POS tagging separated from other NLP
Transformation-based tagging (Eric Brill), rule based: 95%+
Tree-based statistics (Helmut Schmid), rule based: 96%+
Neural network: 96%+
Trigram tagger (Kempe): 96%+
Combined methods: 98%+
Penn Treebank Corpus (WSJ, 4.5M)
LOB Corpus tagged
17
British National Corpus
What is it used for?
Ultimately, its use is limited only by our imagination; if you have any need for up to 100 million words of modern British English, you can make use of the British National Corpus. The main uses of the corpus are as follows:
Reference Book Publishing: dictionaries, grammar books, teaching materials, usage guides, thesauri. Increasingly, publishers are referring to the use they make of corpus facilities: it's important to know how well their corpora are planned and constructed.
Linguistic Research: raw data for studying lexis, syntax, morphology, semantics, discourse analysis, stylistics, sociolinguistics...
Artificial Intelligence: extensive data test bed for program development.
Natural Language Processing: taggers, parsers, natural language understanding programs, spell checking word lists...
English Language Teaching: syllabus and materials design, classroom reference, independent learner research.
18
Penn Treebank Tagset
19
A Simplified Tagset for English
Tagsets for English have grown progressively larger since the Brown Corpus until the Penn Treebank project.
Brown Corpus:            87 tags
LOB Corpus:              135 tags
Lancaster UCREL group:   166 tags
London-Lund Corpus:      197 tags
UPenn Treebank:          34 tags + punctuation
20
Rationale behind British & European tag sets
To provide “distinct codings for all classes of words having distinct grammatical behaviour” – Garside et al. 1987
The Lund tagset for adverbs distinguishes between:
Adjunct: Process, Space, Time
Wh-type: Manner, Reason, Space, Time, Wh-type + ‘S
Conjunct: Appositional, Contrastive, Inferential, Listing, …
Disjunct: Content, Style
Postmodifier: “else”
Negative: “not”
Discourse Item: Appositional, Expletive, Greeting, Hesitator, …
21
Reasons for a Smaller Tagset
Many tags are unique to particular lexical items, and can be recovered automatically if desired.
Brown Tags For Verbs
sing/VB       have/HV      be/BE
sings/VBZ     has/HVZ      is/BEZ
sang/VBD      had/HVD      was/BED
singing/VBG   having/HVG   being/BEG
sung/VBN      had/HVN      been/BEN
Penn Treebank Tags For Verbs
sing/VB       have/VB      be/VB
sings/VBZ     has/VBZ      is/VBZ
sang/VBD      had/VBD      was/VBD
singing/VBG   having/VBG   being/VBG
sung/VBN      had/VBN      been/VBN
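The claim that the finer Brown-style tags "can be recovered automatically" can be sketched as follows: given a Penn-style verb tag and the word itself, the lexical item determines the Brown-style prefix. This is a minimal sketch covering only the forms in the slide's table; `LEMMA`, `BROWN_PREFIX`, and `penn_to_brown` are hypothetical names, and the tag inventory is the slide's simplified one rather than either corpus's full tagset.

```python
# Forms of "have" and "be" from the slide's table, mapped to their lemma.
LEMMA = {
    "have": "have", "has": "have", "had": "have", "having": "have",
    "be": "be", "is": "be", "was": "be", "been": "be", "being": "be",
}

# Brown-style tags replace the "VB" prefix with a lexeme-specific prefix.
BROWN_PREFIX = {"have": "HV", "be": "BE"}


def penn_to_brown(word, penn_tag):
    """Recover the slide's Brown-style verb tag from a word and its Penn tag."""
    lemma = LEMMA.get(word.lower())
    if lemma is None or not penn_tag.startswith("VB"):
        return penn_tag  # ordinary verbs and non-verbs keep their tag
    return BROWN_PREFIX[lemma] + penn_tag[2:]


print(penn_to_brown("had", "VBN"))   # HVN
print(penn_to_brown("is", "VBZ"))    # BEZ
print(penn_to_brown("sang", "VBD"))  # VBD
```

Because the mapping needs only the word and its coarse tag, nothing is lost by training the tagger on the smaller tagset.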
22
Task I – Determining Part of Speech Tags
The Problem:
Word    POS listing in Brown
heat    noun, verb
oil     noun
in      prep, noun, adv
a       det, noun, noun-proper
large   adj, noun, adv
pot     noun
The Old Solution: Combinatorial search. If each of n words has k tags on average, try the $k^n$ combinations until one works.
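The size of the combinatorial search can be made concrete with a short sketch that enumerates every tag assignment for the slide's example phrase. The lexicon is taken from the table on this slide; the function name `candidate_taggings` is hypothetical, and the step of checking whether a candidate "works" (for example, against a grammar) is left out because the slides do not specify it.

```python
from itertools import product
from math import prod

# Possible tags per word, from the slide's Brown Corpus listing.
LEXICON = {
    "heat": ["noun", "verb"],
    "oil": ["noun"],
    "in": ["prep", "noun", "adv"],
    "a": ["det", "noun", "noun-proper"],
    "large": ["adj", "noun", "adv"],
    "pot": ["noun"],
}


def candidate_taggings(words):
    """Yield every possible tag sequence for the given words."""
    return product(*(LEXICON[w] for w in words))


words = "heat oil in a large pot".split()
n_candidates = prod(len(LEXICON[w]) for w in words)
all_taggings = list(candidate_taggings(words))

print(n_candidates)     # 54
print(all_taggings[0])  # ('noun', 'noun', 'prep', 'det', 'adj', 'noun')
```

Even this six-word phrase yields 2 × 1 × 3 × 3 × 3 × 1 = 54 candidates, and the count grows exponentially with sentence length, which is what motivates the statistical approaches that follow.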
23
NLP Task I – Determining Part of Speech Tags Machine Learning Solutions: Automatically learn Part of Speech (POS) assignment. The best techniques achieve 96-97% accuracy per word on new materials, given large training corpora.
24
Simple Statistical Approaches: Idea 1
25
Simple Statistical Approaches: Idea 2
For a string of words $W = w_1 w_2 w_3 \dots w_n$, find the string of POS tags $T = t_1 t_2 t_3 \dots t_n$ which maximizes $P(T \mid W)$, i.e., the probability of tag string T given that the word string was W (i.e., that W was tagged T).
26
Again, The Sparse Data Problem …
A Simple, Impossible Approach to Compute P(T|W): Count up instances of the string "heat oil in a large pot" in the training corpus, and pick the most common tag assignment to the string.
27
A Practical Statistical Tagger
28
A Practical Statistical Tagger II But we can't accurately estimate more than tag bigrams or so… We change to a model that we CAN estimate:
29
A Practical Statistical Tagger III
So, for a given string $W = w_1 w_2 w_3 \dots w_n$, the tagger needs to find the string of tags T which maximizes
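The formulas on these slides did not survive in this transcript. As an assumption, the standard bigram (HMM-style) formulation, which is consistent with the slides' remark that only tag bigrams can be estimated reliably, would read as follows. Here $t_0$ is an assumed sentence-start symbol, and the last step makes the two usual independence assumptions: each word depends only on its own tag, and each tag depends only on the previous tag.

```latex
\hat{T}
  = \arg\max_{T} P(T \mid W)
  = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
  = \arg\max_{T} P(W \mid T)\, P(T)
  \approx \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

The denominator $P(W)$ can be dropped because it is the same for every candidate tag string T.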
30
Training and Performance
To estimate the parameters of this model, given an annotated training corpus:
Because many of these counts are small, smoothing is necessary for best results…
Such taggers typically achieve about 95-96% correct tagging, for tag sets of 40-80 tags.
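Training and decoding can be sketched end to end on a toy corpus. The sketch below assumes the standard bigram formulation (tag-transition probabilities $P(t_i \mid t_{i-1})$ and word-emission probabilities $P(w_i \mid t_i)$), estimates both from counts with add-one smoothing, and decodes with the Viterbi algorithm. The slides do not say which smoothing method or decoder was used, so both are assumptions, and the two-sentence corpus and all function names are invented for illustration.

```python
import math
from collections import Counter

START = "<s>"  # assumed sentence-start symbol

# Toy annotated corpus (invented for illustration).
CORPUS = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("table", "N"), ("on", "P"), ("the", "DET"), ("keys", "N")],
]


def train(corpus):
    """Count tag bigrams and (tag, word) pairs; return smoothed log-prob functions."""
    trans, emit, context, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        prev = START
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            context[prev] += 1
            emit[(tag, word)] += 1
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(tag_count)
    vocab_size = len(vocab) + 1  # one extra slot for unseen words

    def log_p_trans(prev, tag):
        # Add-one smoothed estimate of P(tag | prev)
        return math.log((trans[(prev, tag)] + 1) / (context[prev] + len(tags)))

    def log_p_emit(tag, word):
        # Add-one smoothed estimate of P(word | tag)
        return math.log((emit[(tag, word)] + 1) / (tag_count[tag] + vocab_size))

    return tags, log_p_trans, log_p_emit


def viterbi(words, tags, log_p_trans, log_p_emit):
    """Return the tag sequence maximizing the product of emissions and transitions."""
    # best[t] = (log-probability, path) of the best tag path ending in tag t
    best = {t: (log_p_trans(START, t) + log_p_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                ((best[p][0] + log_p_trans(p, t), best[p][1]) for p in tags),
                key=lambda pair: pair[0],
            )
            new_best[t] = (score + log_p_emit(t, word), path + [t])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])[1]


tags, log_p_trans, log_p_emit = train(CORPUS)
sentence = "the koala put the keys on the table".split()
print(viterbi(sentence, tags, log_p_trans, log_p_emit))
# ['DET', 'N', 'V', 'DET', 'N', 'P', 'DET', 'N']
```

Log probabilities are used so that long sentences do not underflow, and Viterbi keeps only the best path into each tag at each position, so decoding costs on the order of n·k² steps instead of the k^n of exhaustive search. Add-one smoothing is the simplest choice; a real tagger would likely use a better-behaved method.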