1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst Sept 8, 2004
2
Today
  Tokenizing using Regular Expressions
  Elementary Morphology
  Frequency Distributions in NLTK
3
Tokenizing in NLTK
  The Whitespace Tokenizer doesn’t work very well
    What are some of the problems?
  NLTK provides an easy way to incorporate regex’s into your tokenizer
    Uses Python’s regex package (re)
Modified from Dorr and Habash (after Jurafsky and Martin)
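To make the whitespace problems concrete, here is a minimal sketch that uses plain Python `str.split()` as a stand-in for a whitespace tokenizer (an assumption; it is not the NLTK class itself). The sample sentence and the function name `whitespace_tokenize` are made up for illustration.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace only (stand-in for a whitespace tokenizer)."""
    return text.split()


sample = "This is the girl's depart- ment store. It costs $12.50 (approx.)."
tokens = whitespace_tokenize(sample)

# Punctuation stays glued to words ("store.", "(approx.)."), and the
# end-of-line hyphenation "depart- ment" is split into two tokens.
for tok in tokens:
    print(tok)
```

Sentence-final periods, parentheses, and hyphenated line breaks all end up inside or across tokens, which is what the regex-based tokenizer is meant to address.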
4
Regex’s for Tokenizing
Build up your recognizer piece by piece
  Make a string of regex’s combined with OR’s
  Put each one in a group (surrounded by parens)
Things to recognize:
  urls
  words with hyphens in them
  words in which hyphens should be removed (end of line hyphens)
  Numerical terms
  Words with apostrophes
5
Regex’s for Tokenizing
Here are some I put together:
  url = r'((
    Allows port number but no argument variables.
  hyphen = r'(\w+\-\s?\w+)'
    Allows for a space after the hyphen
  apostro = r'(\w+\'\w+)'
  numbers = r'((\$|#)?\d+(\.)?\d+%?)'
    Needs to handle large numbers with commas
  punct = r'([^\w\s]+)'
  wordr = r'(\w+)'
A nice python trick:
  regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
    Makes one string in which a “|” goes in between each substring
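The same idea can be tried with nothing but Python's standard `re` module. This is a sketch under assumptions, not the NLTK tokenizer: the `url` pattern is left out because it is incomplete on the slide, `"|".join(...)` replaces `string.join`, and the helper name `tokenize` is hypothetical.

```python
import re

# Patterns copied from the slide (url omitted: it is truncated in the source).
hyphen = r"(\w+\-\s?\w+)"           # hyphenated word, optional space after the hyphen
apostro = r"(\w+\'\w+)"             # word with an internal apostrophe
numbers = r"((\$|#)?\d+(\.)?\d+%?)"  # simple numeric terms (no comma handling)
punct = r"([^\w\s]+)"               # runs of punctuation
wordr = r"(\w+)"                    # plain words

# One alternation; earlier alternatives win when several match at a position.
regexp = "|".join([hyphen, apostro, numbers, wordr, punct])


def tokenize(text):
    """Return every non-overlapping match of the combined pattern, left to right."""
    return [m.group(0) for m in re.finditer(regexp, text)]


print(tokenize("This is the girl's depart- ment store."))
```

Order matters in the alternation: `wordr` has to come after `hyphen` and `apostro`, otherwise it would match "girl" and "depart" before the more specific patterns get a chance.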
6
Regex’s for Tokenizing
More code:
  import string
  from nltk.token import *
  from nltk.tokenizer import *
  t = Token(TEXT='This is the girl\'s depart- ment store.')
  regexp = string.join([url, hyphen, apostro, numbers, wordr, punct],"|")
  RegexpTokenizer(regexp,SUBTOKENS='WORDS').tokenize(t)
  print t['WORDS']
  [<This>, <is>, <the>, <girl's>, <depart- ment>, <store>, <.>]
7
Tokenization Issues
  Sentence Boundaries
    Include parens around sentences?
    What about quotation marks around sentences?
    Periods: end of line or not?
    We’ll study this in detail in a couple of weeks.
  Proper Names
    What to do about “New York-New Jersey train”?
    “California Governor Arnold Schwarzenegger”?
  Clitics and Contractions
8
Morphology
  Morphology: The study of the way words are built up from smaller meaning units.
  Morphemes: The smallest meaningful unit in the grammar of a language.
  Contrasts:
    Derivational vs. Inflectional
    Regular vs. Irregular
    Concatenative vs. Templatic (root-and-pattern)
  A useful resource: Glossary of linguistic terms by Eugene Loos
9
Examples (English)
  “unladylike”
    3 morphemes, 4 syllables
    un- ‘not’
    lady ‘(well behaved) female adult human’
    -like ‘having the characteristics of’
    Can’t break any of these down further without distorting the meaning of the units
  “technique”
    1 morpheme, 2 syllables
  “dogs”
    2 morphemes, 1 syllable
    -s, a plural marker on nouns
10
Morpheme Definitions
  Root
    The portion of the word that:
      is common to a set of derived or inflected forms, if any, when all affixes are removed
      is not further analyzable into meaningful elements
      carries the principal portion of meaning of the words
  Stem
    The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added.
  Affix
    A bound morpheme that is joined before, after, or within a root or stem.
  Clitic
    A morpheme that functions syntactically like a word, but does not appear as an independent phonological word
    Spanish: un beso, las aguas
    English: Hal’s (genitive marker)
11
Inflectional vs. Derivational
Word Classes
  Parts of speech: noun, verb, adjectives, etc.
  Word class dictates how a word combines with morphemes to form new words
Inflection:
  Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast.
  Doesn’t change the word class
  Usually produces a predictable, nonidiosyncratic change of meaning.
Derivation:
  The formation of a new word or inflectable stem from another word or stem.
12
Inflectional Morphology
Adds: tense, number, person, mood, aspect
Word class doesn’t change
Word serves new grammatical role
Examples
  come is inflected for person and number:
    The pizza guy comes at noon.
  las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s
    las manzanas rojas (‘the red apples’)
13
Derivational Morphology
Nominalization (formation of nouns from other parts of speech, primarily verbs in English):
  computerization
  appointee
  killer
  fuzziness
Formation of adjectives (primarily from nouns)
  computational
  clueless
  embraceable
Difficult cases:
  building: from which sense of “build”?
A resource:
  CatVar: Categorial Variation Database
14
Concatenative Morphology
Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
  hope+ing -> hoping
  hop+ing -> hopping
Affixes
  Prefixes: Antidisestablishmentarianism
  Suffixes: Antidisestablishmentarianism
  Infixes: hingi (borrow), humingi (borrower) in Tagalog
  Circumfixes: sagen (say), gesagt (said) in German
Agglutinative Languages
  uygarlaştıramadıklarımızdanmışsınızcasına
  uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
  Behaving as if you are among those whom we could not cause to become civilized
15
Templatic Morphology
  Roots and Patterns
  Example: Hebrew verbs
  Root:
    Consists of 3 consonants CCC
    Carries basic meaning
  Template:
    Gives the ordering of consonants and vowels
    Specifies semantic information about the verb
      Active, passive, middle voice
  Example:
    lmd (to learn or study)
    CaCaC -> lamad (he studied)
    CiCeC -> limed (he taught)
    CuCaC -> lumad (he was taught)
  Psycholinguistic reality
    format فرمت farmat
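The root-and-pattern idea can be sketched in a few lines of Python: each C slot in the template is filled with the next consonant of the root, and the template's vowels are copied through. The function name `apply_template` is hypothetical, and this toy version ignores any phonological adjustments a real analyzer would need.

```python
def apply_template(root, template):
    """Interleave root consonants into the C slots of a template.

    root:     string of consonants, e.g. "lmd"
    template: pattern such as "CaCaC", where "C" marks a consonant slot
    """
    if template.count("C") != len(root):
        raise ValueError("template must have one C slot per root consonant")
    consonants = iter(root)
    # Replace each C with the next root consonant; keep vowels as written.
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


for template in ("CaCaC", "CiCeC", "CuCaC"):
    print(template, "->", apply_template("lmd", template))
```

With the root lmd this reproduces the three forms on the slide: lamad, limed, lumad.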
16
Nouns and Verbs (in English)
Nouns have simple inflectional morphology
  cat
  cat+s, cat+’s
Verbs have more complex morphology
17
Nouns and Verbs (in English)
Nouns
  Have simple inflectional morphology
  Cat/Cats
  Mouse/Mice, Ox/Oxen, Goose/Geese
Verbs
  More complex morphology
  Walk/Walked
  Go/Went, Fly/Flew
18
Regular (English) Verbs
Morphological Form Classes     Regularly Inflected Verbs
Stem                           walk      merge     try      map
-s form                        walks     merges    tries    maps
-ing form                      walking   merging   trying   mapping
Past form or -ed participle    walked    merged    tried    mapped
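The regular forms in the table follow a handful of spelling rules (drop a final e, change consonant+y to i, double a final consonant). Below is a rough sketch of those rules; the function names are hypothetical, and the doubling test is a deliberate simplification that only handles one-vowel stems such as map.

```python
VOWELS = "aeiou"


def _ends_consonant_y(stem):
    """True for stems like 'try' (consonant followed by final y)."""
    return len(stem) >= 2 and stem[-1] == "y" and stem[-2] not in VOWELS


def _doubles_final(stem):
    """Simplified doubling test: one-vowel stem ending consonant-vowel-consonant."""
    if len(stem) < 3:
        return False
    c1, v, c2 = stem[-3], stem[-2], stem[-1]
    one_vowel = sum(ch in VOWELS for ch in stem) == 1
    return (one_vowel and c1 not in VOWELS and v in VOWELS
            and c2 not in VOWELS and c2 not in "wxy")


def s_form(stem):
    if _ends_consonant_y(stem):
        return stem[:-1] + "ies"          # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    return stem + "s"                     # walk -> walks


def ing_form(stem):
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"          # merge -> merging
    if _doubles_final(stem):
        return stem + stem[-1] + "ing"    # map -> mapping
    return stem + "ing"


def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if _ends_consonant_y(stem):
        return stem[:-1] + "ied"          # try -> tried
    if _doubles_final(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"


for verb in ("walk", "merge", "try", "map"):
    print(verb, s_form(verb), ing_form(verb), ed_form(verb))
```

Irregular verbs such as eat or catch fall outside these rules entirely and would need a lexicon of exceptions.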
19
Irregular (English) Verbs
Morphological Form Classes     Irregularly Inflected Verbs
Stem                           eat       catch      cut
-s form                        eats      catches    cuts
-ing form                      eating    catching   cutting
Past form                      ate       caught     cut
-ed participle                 eaten     caught     cut
20
“To love” in Spanish
21
Syntax and Morphology
  Phrase-level agreement
    Subject-Verb
      John studies hard (STUDY+3SG)
    Noun-Adjective
      Las vacas hermosas
  Sub-word phrasal structures
    שבספרינו
    ש+ב+ספר+ים+נו
    That+in+book+PL+Poss:1PL
    Which are in our books
22
Phonology and Morphology
Script Limitations
  Spoken English has 14 vowels
    heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough
  English Alphabet has 5
    Use vowel combinations: far fair fare
    Consonantal doubling (hopping vs. hoping)
23
Computational Morphology
Approaches
  Lexicon only
  Rules only
  Lexicon and Rules
    Finite-state Automata
    Finite-state Transducers
Systems
  WordNet’s morphy
  PCKimmo
    Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay
    Accurate but complex
  Two-level morphology
    Commercial version available from InXight Corp.
Background
  Chapter 3 of Jurafsky and Martin
  A short history of Two-Level Morphology
24
Porter Stemmer
  Discount morphology
    So not all that accurate
  Uses a series of cascaded rewrite rules
    ATIONAL -> ATE (relational -> relate)
    ING -> ε if stem contains vowel (motoring -> motor)
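A cascade of rewrite rules can be sketched as a list of functions applied one after another, each rewriting a suffix when its condition holds. Only the two rules named on the slide are included; the function names are hypothetical, and this toy version leaves out the other conditions the full stemmer attaches to its rules.

```python
def contains_vowel(stem):
    """Rough vowel test; treats only a, e, i, o, u as vowels."""
    return any(ch in "aeiou" for ch in stem)


def rewrite_ational(word):
    # ATIONAL -> ATE (no extra condition checked in this sketch)
    if word.endswith("ational"):
        return word[: -len("ational")] + "ate"
    return word


def strip_ing(word):
    # ING -> (nothing), but only if the remaining stem contains a vowel
    if word.endswith("ing") and contains_vowel(word[:-3]):
        return word[:-3]
    return word


def stem(word):
    """Apply each rule in turn; the output of one rule feeds the next."""
    for rule in (rewrite_ational, strip_ing):
        word = rule(word)
    return word


for w in ("relational", "motoring", "sing"):
    print(w, "->", stem(w))
```

The vowel condition is what keeps "sing" intact: stripping ING would leave the vowel-less stem "s", so the rule does not fire.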
25
Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE     relational -> relate
(m>0) TIONAL -> TION     conditional -> condition
                         rational -> rational
(m>0) ENCI -> ENCE       valenci -> valence
(m>0) ANCI -> ANCE       hesitanci -> hesitance
(m>0) IZER -> IZE        digitizer -> digitize
(m>0) ABLI -> ABLE       conformabli -> conformable
(m>0) ALLI -> AL         radicalli -> radical
(m>0) ENTLI -> ENT       differentli -> different
(m>0) ELI -> E           vileli -> vile
(m>0) OUSLI -> OUS       analogousli -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION -> ATE       predication -> predicate
(m>0) ATOR -> ATE        operator -> operate
(m>0) ALISM -> AL        feudalism -> feudal
(m>0) IVENESS -> IVE     decisiveness -> decisive
(m>0) FULNESS -> FUL     hopefulness -> hopeful
(m>0) OUSNESS -> OUS     callousness -> callous
(m>0) ALITI -> AL        formaliti -> formal
(m>0) IVITI -> IVE       sensitiviti -> sensitive
(m>0) BILITI -> BLE      sensibiliti -> sensible
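The (m>0) condition refers to the measure of the stem: the number of vowel-consonant sequences it contains. The sketch below computes m and applies the rules listed on this slide, trying the longest matching suffix first. It covers this one step only and is not a full Porter stemmer; the names `measure` and `apply_step` are hypothetical.

```python
import re

# (suffix, replacement) pairs taken from the slide.
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def is_consonant(word, i):
    """A letter is a consonant unless it is a, e, i, o, u, or a y after a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count VC sequences: the m in the form [C](VC)^m[V]."""
    pattern = "".join("C" if is_consonant(stem, i) else "V" for i in range(len(stem)))
    collapsed = re.sub(r"(.)\1+", r"\1", pattern)  # CCVVC -> CVC
    return collapsed.count("VC")


def apply_step(word):
    """Rewrite the longest matching suffix, but only if the stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda r: len(r[0]), reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


for w in ("relational", "conditional", "rational", "vietnamization"):
    print(w, "->", apply_step(w))
```

"rational" is left alone because its stem before ATIONAL is just "r", which has m = 0.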
26
Porter Stemmer
Errors of Omission           Errors of Commission
European   Europe            organization     organ
analysis   analyzes          doing            doe
matrices   matrix            generalization   generic
noise      noisy             numerical        numerous
explain    explanation       university       universe
27
Computational Morphology
WORD      STEM (+FEATURES)*
cats      cat +N +PL
cat       cat +N +SG
cities    city +N +PL
geese     goose +N +PL
ducks     (duck +N +PL) or (duck +V +3SG)
merging   merge +V +PRES-PART
caught    (catch +V +PAST-PART) or (catch +V +PAST)
28
Lexicon-only Morphology
The lexicon lists all surface level and lexical level pairs
No rules …
Analysis/Generation is easy
Very large for English
  What about Arabic or Turkish or Chinese?

acclaim         acclaim $N$
acclaim         acclaim $V+0$
acclaimed       acclaim $V+ed$
acclaimed       acclaim $V+en$
acclaiming      acclaim $V+ing$
acclaims        acclaim $N+s$
acclaims        acclaim $V+s$
acclamation     acclamation $N$
acclamations    acclamation $N+s$
acclimate       acclimate $V+0$
acclimated      acclimate $V+ed$
acclimated      acclimate $V+en$
acclimates      acclimate $V+s$
acclimating     acclimate $V+ing$
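A lexicon-only analyzer is essentially a lookup table. The sketch below stores a few of the pairs from the slide and shows why both analysis (surface form to lexical form) and generation (lexical form to surface form) are easy. The data structure and the function names `analyze` and `generate` are illustrative assumptions.

```python
# (surface form, stem, features) triples, copied from part of the slide's listing.
LEXICON = [
    ("acclaim", "acclaim", "$N$"),
    ("acclaim", "acclaim", "$V+0$"),
    ("acclaimed", "acclaim", "$V+ed$"),
    ("acclaimed", "acclaim", "$V+en$"),
    ("acclaiming", "acclaim", "$V+ing$"),
    ("acclaims", "acclaim", "$N+s$"),
    ("acclaims", "acclaim", "$V+s$"),
]


def analyze(surface):
    """Analysis: return every (stem, features) pair listed for a surface form."""
    return [(stem, feats) for form, stem, feats in LEXICON if form == surface]


def generate(stem, features):
    """Generation: return the surface forms listed for a (stem, features) pair."""
    return [form for form, s, f in LEXICON if s == stem and f == features]


print(analyze("acclaimed"))
print(generate("acclaim", "$V+ing$"))
print(analyze("acclaimer"))  # not in the lexicon, so no analysis at all
```

The last call shows the cost of having no rules: any form missing from the list gets no analysis, which is why the approach scales poorly to morphologically rich languages.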
29
For Next Week
  Software status:
    Software on 3 lab machines, more coming
  Lecture on Monday Sept 13:
    Part of speech tagging
  For Wed Sept 15
    Do exercises 1-3 in Tutorial 2 (Tokenizing)
    Do the following exercises from Tutorial 3 (Tagging)
      1a-h
      2, 3, 4, 5a-b
    Turn them in online (I’ll have something available for this by then)