Fall 2005 Lecture Notes #4 EECS 595 / LING 541 / SI 661 Natural Language Processing
Features and unification
Introduction
– Grammatical categories have properties
– Constraint-based formalisms
– Example: “this flights”: agreement is difficult to handle at the level of grammatical categories
– Example: “many water”: count/mass nouns
– Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)
Feature structures
[CAT NP, NUMBER SINGULAR, PERSON 3]
[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]
Feature paths: {x agreement number}
Unification
[NUMBER SG] ⊔ [NUMBER SG]: succeeds (+)
[NUMBER SG] ⊔ [NUMBER PL]: fails (-)
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
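The unification examples on this slide can be tried out with a small sketch. It assumes feature structures are nested Python dicts with string atoms, that an empty dict stands for the unspecified value [], and that None signals failure. The function name unify is illustrative, and reentrancy (shared substructures) is not handled.

```python
FAIL = None  # assumed sentinel for a failed unification


def unify(f, g):
    """Unify two feature structures (nested dicts with string atoms).

    An empty dict plays the role of the unspecified value [].
    Returns a new structure, or FAIL if the two are incompatible.
    """
    if f == {}:
        return g
    if g == {}:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)  # copy so the inputs are not modified
        for feature, g_value in g.items():
            if feature in result:
                sub = unify(result[feature], g_value)
                if sub is FAIL:
                    return FAIL
                result[feature] = sub
            else:
                result[feature] = g_value
        return result
    # Atoms (or an atom against a complex value) unify only if identical.
    return f if f == g else FAIL


print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))
print(unify({"NUMBER": "SG"}, {"PERSON": "3"}))
```

The last call merges two compatible structures that mention different features, which is the case the slide leaves as a question.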
Agreement
S → NP VP
  {NP AGREEMENT} = {VP AGREEMENT}
Does this flight serve breakfast?
Do these flights serve breakfast?
S → Aux NP VP
  {Aux AGREEMENT} = {NP AGREEMENT}
Agreement
These flights / This flight
NP → Det Nominal
  {Det AGREEMENT} = {Nominal AGREEMENT}
Verb → serve
  {Verb AGREEMENT NUMBER} = PL
Verb → serves
  {Verb AGREEMENT NUMBER} = SG
Subcategorization
VP → Verb
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = INTRANS
VP → Verb NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = TRANS
VP → Verb NP NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = DITRANS
Regular Expressions and Automata
Regular expressions
– Searching for “woodchuck”
– Searching for “woodchucks” with an optional final “s”
– Regular expressions
– Finite-state automata (singular: automaton)
Regular expressions
– Basic regular expression patterns
– Perl-based syntax (slightly different from other notations for regular expressions)
– Disjunctions: [abc]
– Ranges: [A-Z]
– Negations: [^Ss]
– Optional and repeated characters: ? and *
– Wildcard: .
– Anchors: ^ and $, also \b and \B
– Disjunction, grouping, and precedence: |
Writing correct expressions
Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
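To see how each refinement changes what is matched, the five patterns can be run over one sample sentence. This sketch uses Python's re module in place of Perl (these patterns behave the same in both). The sample sentence and the helper name count_matches are made up for illustration.

```python
import re

# The five attempts from the slide, in order.
PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

SAMPLE = "The other day the theology student bathed the cat."


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


for pattern in PATTERNS:
    print(f"/{pattern}/ matches {count_matches(pattern, SAMPLE)} time(s)")
```

The first two patterns also fire inside “other”, “theology”, and “bathed” (false positives). The fourth misses the sentence-initial “The” because nothing precedes it (a false negative), which the fifth pattern repairs.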
A more complex example
Exercise: write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
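A minimal sketch of the price, speed, and disk patterns in Python. The dollar sign is escaped, and the price pattern has no leading \b, because \b before $ would only match when a word character directly precedes the price. The advertisement text and the helper first_match are invented for illustration. The patterns only locate the fields; checking “more than 500” or “less than $1000” would need a numeric comparison afterwards.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

# Hypothetical advertisement text.
AD = "Fast PC: 600 MHz, 40 Gb disk, only $999.99"


def first_match(pattern, text):
    """Return the first substring matched by pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


print(first_match(PRICE, AD))
print(first_match(SPEED, AD))
print(first_match(DISK, AD))
```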
Advanced operators
Substitutions and memory
– Substitutions: s/colour/color/
– Memory (\1, \2, etc. refer back to matches): s/([0-9]+)/<\1>/
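Python's re.sub gives a rough equivalent of the Perl s/// operator, with \1 referring back to the first parenthesized group. A small sketch, with made-up function names and example strings:

```python
import re


def americanize(text):
    """Plain substitution, like s/colour/color/ (first occurrence only)."""
    return re.sub(r"colour", "color", text, count=1)


def bracket_numbers(text):
    """Substitution with memory: wrap every run of digits in angle brackets."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


print(americanize("my favourite colour"))
print(bracket_numbers("the 35 boxes"))
```

Without a /g flag, Perl's s/// replaces only the first match, hence count=1 in americanize. bracket_numbers replaces every match, which corresponds to s/([0-9]+)/<\1>/g.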
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
s/.* all.*/IN WHAT WAY/
s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
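Steps 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not Weizenbaum's program. The input is uppercased first, the person-swap list is a small assumed sample, the fallback reply is invented, and Step 3 (scoring) is replaced by simply taking the first rule that matches.

```python
import re

# Step 1 (assumed sample): first person -> second person.
PERSON_SWAPS = [
    (r"\bI'?M\b", "YOU ARE"),
    (r"\bI AM\b", "YOU ARE"),
    (r"\bMY\b", "YOUR"),
    (r"\bME\b", "YOU"),
]

# Step 2: reply rules from the slide, uppercased; the first match wins.
REPLY_RULES = [
    (r".* YOU ARE (DEPRESSED|SAD).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* ALL.*", "IN WHAT WAY"),
    (r".* ALWAYS.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance):
    """Return an Eliza-style reply for one user utterance."""
    text = utterance.upper()
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)  # fills in \1 from the match
    return "PLEASE GO ON"  # hypothetical fallback, not from the slides


print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```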
Finite-state automata Finite-state automata (FSA) Regular languages Regular expressions
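Finite-state automata (machines)
Strings accepted: baa! baaa! baaaa! baaaaa! ... (/baa+!/)
[Figure: FSA with states q0, q1, q2, q3, q4. Transitions labeled b, a, a, and ! into q4, with a loop on a. Callouts: state, transition, final state.]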
Input tape
[Figure: input tape containing a b a ! b, with the machine in state q0]
Finite-state automata
– Q: a finite set of N states q0, q1, …, qN−1
– Σ: a finite input alphabet of symbols
– q0: the start state
– F: the set of final states, F ⊆ Q
– δ(q, i): the transition function
State-transition tables
[Table: one row per state, one input column for each of b, a, !]
The FSM toolkit and friends
– Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
– Download; tutorial available
– 4 useful parts: FSM, Lextools, GRM, Dot (separate)
  –/data2/tools/fsm-3.6/bin
  –/data2/tools/lextools/bin
  –/data2/tools/dot/bin
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
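A direct Python rendering of D-RECOGNIZE. It assumes the transition table is a dict keyed by (state, symbol) pairs, with a missing key meaning an empty cell. The SHEEPTALK table encodes the /baa+!/ automaton from the earlier slide, with states numbered 0 to 4. All names are illustrative.

```python
def d_recognize(tape, transitions, start, accept_states):
    """Deterministic FSA recognition: return True to accept, False to reject."""
    current_state = start
    for symbol in tape:
        next_state = transitions.get((current_state, symbol))
        if next_state is None:  # empty cell in the transition table
            return False
        current_state = next_state
    # End of input reached: accept only in a final state.
    return current_state in accept_states


# Transition table for /baa+!/ (assumed state numbering 0..4).
SHEEPTALK = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

for tape in ["baa!", "baaaa!", "ba!", "aba!b"]:
    verdict = "accept" if d_recognize(tape, SHEEPTALK, 0, {4}) else "reject"
    print(tape, verdict)
```

The for loop replaces the explicit index variable of the pseudocode. Leaving the loop normally corresponds to “End of input has been reached”.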
Adding a failing state
[Figure: the /baa+!/ automaton with an extra fail state qF; the arcs for inputs with no legal transition (labeled a, b, !) lead to qF]
Languages and automata
– Formal languages: regular languages, non-regular languages
– Deterministic vs. non-deterministic FSAs
– Epsilon (ε) transitions
Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit underexplored markers Look-ahead: look ahead in input Parallelism: look at alternatives in parallel
Using NFSAs
[Table: NFSA state-transition table with input columns b, a, !, ε; a cell may list more than one next state]
More about FSAs
– Transducers
– Equivalence of DFSAs and NFSAs
– Recognition as search: depth-first, breadth-first search
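Recognition as search can be sketched with an agenda of (state, tape position) pairs. Popping from the end of the agenda gives depth-first search, and popping from the front gives breadth-first search. The representation is an assumption: transitions map (state, symbol) to a set of next states, and the empty string "" stands for ε. The example machine is a non-deterministic variant of /baa+!/.

```python
from collections import deque


def nd_recognize(tape, transitions, start, accept_states, depth_first=True):
    """Agenda-based NFSA recognition; returns True if some path accepts."""
    agenda = deque([(start, 0)])
    seen = {(start, 0)}  # avoids revisiting search states (and epsilon loops)

    def push(search_state):
        if search_state not in seen:
            seen.add(search_state)
            agenda.append(search_state)

    while agenda:
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if index == len(tape) and state in accept_states:
            return True
        if index < len(tape):
            for next_state in transitions.get((state, tape[index]), ()):
                push((next_state, index + 1))
        for next_state in transitions.get((state, ""), ()):  # epsilon moves
            push((next_state, index))
    return False


# Non-deterministic sheeptalk: state 2 has two choices on input "a".
ND_SHEEPTALK = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}

print(nd_recognize("baaa!", ND_SHEEPTALK, 0, {4}))
print(nd_recognize("ba!", ND_SHEEPTALK, 0, {4}, depth_first=False))
```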
Recognition using NFSAs
Regular languages
– Operations on regular languages and FSAs: concatenation, closure, union
– Properties of regular languages: closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure
An exercise J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
Morphology and Finite-State Transducers
Morphemes
– Stems, affixes
– Affixes: prefixes, suffixes, infixes (hingi (borrow) – humingi (agent) in Tagalog), circumfixes (sagen – gesagt in German)
– Concatenative morphology
– Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
Morphological analysis
Examples: rewrites, unbelievably
Inflectional morphology
– Tense, number, person, mood, aspect
– Five verb forms in English
– 40+ forms in French
– Six cases in Russian
– Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, embraceable, clueless
Finite-state morphological parsing
Cats: cat +N +PL
Cat: cat +N +SG
Cities: city +N +PL
Geese: goose +N +PL
Ducks: (duck +N +PL) or (duck +V +3SG)
Merging: merge +V +PRES-PART
Caught: (catch +V +PAST-PART) or (catch +V +PAST)
Principles of morphological parsing
– Lexicon
– Morphotactics (e.g., plural follows noun)
– Orthography (easy → easier)
– Irregular nouns: e.g., geese, sheep, mice
– Irregular verbs: e.g., caught, ate, eaten
FSA for adjectives
Big, bigger, biggest
Cool, cooler, coolest, coolly
Red, redder, reddest
Clear, clearer, clearest, clearly, unclear, unclearly
Happy, happier, happiest, happily
Unhappy, unhappier, unhappiest, unhappily
What about: unbig, redly, and realest?
Using FSA for recognition
– Is a string a legitimate word or not?
– Two-level morphology: lexical level + surface level (Koskenniemi 1983)
– Finite-state transducers (FST): used for regular relations
– Inversion and composition of FSTs
Orthographic rules
Beg/begging, Make/making, Watch/watches, Try/tries, Panic/panicked
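The five spelling changes can be approximated by ordered rewrite rules, shown here as regular-expression substitutions rather than real transducers. The intermediate notation is an assumption (^ marks a morpheme boundary, # marks the end of the word), and the rule set is deliberately crude. The consonant-doubling rule in particular overgenerates on longer stems such as “visit”.

```python
import re

# Ordered rewrite rules over intermediate forms such as "watch^s#".
RULES = [
    (r"(x|s|z|ch|sh)\^s#", r"\1es#"),    # e-insertion: watch^s -> watches
    (r"([^aeiou])y\^s#", r"\1ies#"),     # y -> ie: try^s -> tries
    (r"e\^(ing|ed)#", r"\1#"),           # e-deletion: make^ing -> making
    (r"c\^(ing|ed)#", r"ck\1#"),         # k-insertion: panic^ed -> panicked
    (r"([^aeiou][aeiou])([bdgmnprt])\^(ing|ed)#", r"\1\2\2\3#"),  # doubling
]


def to_surface(intermediate):
    """Apply each rule in order, then erase the boundary symbols."""
    form = intermediate
    for pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form.replace("^", "").replace("#", "")


for word in ["beg^ing#", "make^ing#", "watch^s#", "try^s#", "panic^ed#"]:
    print(word, "->", to_surface(word))
```

Feeding the output of one rule into the next mirrors a cascade of transducers, although each regular-expression step here rewrites in one direction only.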
Combining FST lexicon and rules Cascades of transducers: the output of one becomes the input of another
Weighted Automata
Phonetic symbols IPA Arpabet Examples
Using WFST for language modeling Phonetic representation Part-of-speech tagging
Word Classes and Part Of Speech Tagging
Some POS statistics Preposition list from COBUILD Single-word particles Conjunctions Pronouns Modal verbs
Tagsets for English Penn Treebank Other tagsets (see Week 1 slides)
POS ambiguity
– Degrees of ambiguity (DeRose 1988)
– Rule-based POS tagging
  –ENGTWOL (Voutilainen et al.)
  –Sample rule: Adverbial-That rule (“it isn’t that odd”)
      Given input: “that”
      if
        (+1 A/ADV/QUANT);
        (+2 SENT-LIM);
        (NOT –1 SVOC/A);   (the previous word is not a verb like “consider”)
      then eliminate non-ADV tags
      else eliminate ADV tag
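The Adverbial-That rule can be read as a constraint that prunes candidate tags. The sketch below is not ENGTWOL's implementation. It assumes each word carries a set of candidate tags, reuses the tag names A, ADV, QUANT, SENT-LIM and SVOC/A from the rule, and invents the remaining tags (PRON, DET, CS, V) and the function name.

```python
def adverbial_that(readings, i):
    """Apply the Adverbial-That rule to the word at position i.

    readings is a list with one set of candidate tags per word.
    Returns the pruned candidate set for position i.
    """
    def tags_at(j):
        return readings[j] if 0 <= j < len(readings) else set()

    next_is_modifiable = bool(tags_at(i + 1) & {"A", "ADV", "QUANT"})
    then_sentence_end = "SENT-LIM" in tags_at(i + 2)
    prev_not_svoc = "SVOC/A" not in tags_at(i - 1)

    if next_is_modifiable and then_sentence_end and prev_not_svoc:
        return readings[i] & {"ADV"}   # eliminate non-ADV tags
    return readings[i] - {"ADV"}       # eliminate ADV tag


THAT = {"ADV", "PRON", "DET", "CS"}  # assumed candidate tags for "that"

# "it isn't that odd ."
odd = [{"PRON"}, {"V"}, set(THAT), {"A"}, {"SENT-LIM"}]
# "I consider that odd ."
consider = [{"PRON"}, {"V", "SVOC/A"}, set(THAT), {"A"}, {"SENT-LIM"}]

print(adverbial_that(odd, 2))
print(adverbial_that(consider, 2))
```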
Evaluating POS taggers Percent correct What is the lower bound on a system’s performance? What about the upper bound?
Kappa
– N: number of items (index i)
– n: number of categories (index j)
– k: number of annotators
– κ > 0.8: agreement is considered high
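The formula itself did not survive on this slide. Kappa statistics share the form κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement and P(E) the agreement expected by chance. The variables N, n and k suggest a multi-annotator version, so the sketch below assumes Fleiss' kappa. That choice, the function name, and the input layout are assumptions rather than the slide's content.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for N items, n categories, k annotators per item.

    counts[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same k (k >= 2).
    Undefined (division by zero) if all annotations fall in one category.
    """
    N = len(counts)
    n = len(counts[0])
    k = sum(counts[0])

    # Observed agreement: per-item share of agreeing annotator pairs.
    per_item = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts]
    p_observed = sum(per_item) / N

    # Expected agreement from the overall category proportions.
    proportions = [sum(row[j] for row in counts) / (N * k) for j in range(n)]
    p_expected = sum(p * p for p in proportions)

    return (p_observed - p_expected) / (1 - p_expected)


# Three annotators, two categories, four items, full agreement on each item.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```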
Midterm reading list
– Chapter 1: Introduction
– Chapter 2: Regular expressions and automata
– Chapter 3: Morphology and finite-state transducers + FSM tutorial
– Chapter 8: Word classes and POS tagging
– Chapter 9: Context-free grammars for English
– Chapter 10: Parsing with context-free grammars
– Chapter 11: Features and unification
Syntaxscape Written by Juno Suk of Lucent
Read by yourselves
– 9.9. Spoken language syntax
– Grammar equivalence
– Finite-state and context-free grammars