Presentation is loading. Please wait.

Presentation is loading. Please wait.

Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches.

Similar presentations


Presentation on theme: "Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches."— Presentation transcript:

1 Word classes and part of speech tagging

2 Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

3 Slide 2 Definition “The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” the girl kissed the baby on the cheek WORDS TAGS N V P DET

4 Slide 3 An Example the girl kiss the baby on the cheek LEMMATAG +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN the girl kissed the baby on the cheek WORD

5 Slide 4 Motivation Speech synthesis — pronunciation Speech recognition — class-based N-grams Information retrieval — stemming, selection high- content words Word-sense disambiguation Corpus analysis of language & lexicography

6 Slide 5 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

7 Slide 6 Word Classes Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, … Open vs. Closed classes Open: Nouns, Verbs, Adjectives, Adverbs. Why “open”? (that readily accept new members) Closed: Definition: The category of function words--that is, parts of speech (or word classes) that do not readily accept new members. determiners: a, an, the pronouns: she, he, I prepositions: on, under, over, near, by, …

8 Slide 7 Open Class Words Every known human language has nouns and verbs Nouns: people, places, things Classes of nouns proper vs. common count vs. mass Verbs: actions and processes Adjectives: properties, qualities Adverbs: hodgepodge! Unfortunately, John walked home extremely slowly yesterday Numerals: one, two, three, third, …

9 Slide 8 Closed Class Words Differ more from language to language than open class words Examples: prepositions: on, under, over, … particles: up, down, on, off, … determiners: a, an, the, … pronouns: she, who, I,.. conjunctions: and, but, or, … auxiliary verbs: can, may should, …

10 Slide 9 Prepositions from CELEX

11 Slide 10 English Single-Word Particles

12 Slide 11 Pronouns in CELEX

13 Slide 12 Conjunctions

14 Slide 13 Auxiliaries

15 Slide 14 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

16 Slide 15 Word Classes: Tag Sets Vary in number of tags: a dozen to over 200 Size of tag sets depends on language, objectives and purpose –Some tagging approaches (e.g., constraint grammar based) make fewer distinctions e.g., conflating prepositions, conjunctions, particles –Simple morphology = more ambiguity = fewer tags

17 Slide 16 Word Classes: Tag set example PRP PRP$

18 Slide 17 Example of Penn Treebank Tagging of Brown Corpus Sentence The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS./. VB DT NN. Book that flight. VBZ DT NN VB NN ? Does that flight serve dinner ? See http://www.infogistics.com/posdemo.htmhttp://www.infogistics.com/posdemo.htm

19 Slide 18 The Problem Words often have more than one word class: this This is a nice day = PRP This day is nice = DT You can go this far = RB

20 Slide 19 Word Class Ambiguity (in the Brown Corpus) Unambiguous (1 tag): 35,340 Ambiguous (2-7 tags): 4,100 2 tags3,760 3 tags264 4 tags61 5 tags12 6 tags2 7 tags1

21 Slide 20 Part-of-Speech Tagging Rule-Based Tagger: ENGTWOL (ENGlish TWO Level analysis) Stochastic Tagger: HMM-based Transformation-Based Tagger (Brill)

22 Slide 21 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

23 Slide 22 Rule-Based Tagging Basic Idea: –Assign all possible tags to words –Remove tags according to set of rules of type: if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like “consider” then eliminate non-adv else eliminate adv. –Typically more than 1000 hand-written rules, but may be machine-learned.

24 Slide 23 Sample ENGTWOL Lexicon Demo: http://www2.lingsoft.fi/cgi-bin/engtwolhttp://www2.lingsoft.fi/cgi-bin/engtwol

25 Slide 24 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

26 Slide 25 Stochastic Tagging Based on probability of certain tag occurring given various possibilities Requires a training corpus No probabilities for words not in corpus. Training corpus may be different from test corpus.

27 Slide 26 Stochastic Tagging (cont.) Simple Method: Choose most frequent tag in training text for each word! –Result: 90% accuracy –Baseline –Others will do better

28 Slide 27 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

29 Slide 28 Transformation-Based Tagging (Brill Tagging) Combination of Rule-based and stochastic tagging methodologies –Like rule-based because rules are used to specify tags in a certain environment –Like stochastic approach because machine learning is used—with tagged corpus as input Input: –tagged corpus –dictionary (with most frequent tags) Usually constructed from the tagged corpus

30 Slide 29 Transformation-Based Tagging (cont.) Basic Idea: –Set the most probable tag for each word as a start value –Change tags according to rules of type “if word-1 is a determiner and word is a verb then change the tag to noun” in a specific order Training is done on tagged corpus: –Write a set of rule templates –Among the set of rules, find one with highest score –Continue from 2 until lowest score threshold is passed –Keep the ordered set of rules Rules make errors that are corrected by later rules

31 Slide 30 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches 1: rule-based tagging Automatic approaches 2: stochastic tagging Automatic approaches 3: transformation-based tagging Other issues: tagging unknown words, evaluation

32 Slide 31 Tagging Unknown Words New words added to (newspaper) language 20+ per month Plus many proper names … Increases error rates by 1-2% Method 1: assume they are nouns Method 2: assume the unknown words have a probability distribution similar to words only occurring once in the training set. Method 3: Use morphological information, e.g., words ending with –ed tend to be tagged VBN.

33 Slide 32 Evaluation The result is compared with a manually coded “Gold Standard” –Typically accuracy reaches 96-97% –This may be compared with result for a baseline tagger (one that uses no context). Important: 100% is impossible even for human annotators. Factors that affects the performance –The amount of training data available –The tag set –The difference between training corpus and test corpus –Dictionary –Unknown words


Download ppt "Word classes and part of speech tagging. Slide 1 Outline Why part of speech tagging? Word classes Tag sets and problem definition Automatic approaches."

Similar presentations


Ads by Google