1
Fall 2005 Lecture Notes #4 EECS 595 / LING 541 / SI 661 Natural Language Processing
2
Features and unification
3
Introduction
Grammatical categories have properties
Constraint-based formalisms
Example: *this flights: agreement is difficult to handle at the level of grammatical categories
Example: *many water: count/mass nouns
Sample rule that takes into account features: S → NP VP (but only if the number of the NP is equal to the number of the VP)
4
Feature structures
[CAT NP, NUMBER SINGULAR, PERSON 3]
[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]
Feature paths: {x agreement number}
5
Unification
[NUMBER SG] ⊔ [NUMBER SG]: + (unification succeeds)
[NUMBER SG] ⊔ [NUMBER PL]: - (unification fails)
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
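The unification cases on this slide can be tried out with a small sketch. It assumes feature structures are modelled as nested Python dicts, atomic values as strings, and the unspecified value [] as an empty dict; the function name `unify` is hypothetical, and reentrancy (shared substructures) is deliberately not handled.

```python
def unify(fs1, fs2):
    """Return the unification of two feature structures, or None on failure.

    Assumption: feature structures are nested dicts, atoms are strings,
    and {} stands for the unspecified value [].
    """
    if not isinstance(fs1, dict) or not isinstance(fs2, dict):
        # At least one side is atomic.
        if fs1 == fs2:
            return fs1          # identical atoms unify
        if fs1 == {}:
            return fs2          # [] unifies with anything
        if fs2 == {}:
            return fs1
        return None             # clash, e.g. SG vs. PL
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result:
            merged = unify(result[feature], value)
            if merged is None:
                return None     # a clash anywhere makes the whole unification fail
            result[feature] = merged
        else:
            result[feature] = value   # feature only present in fs2
    return result


print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))
print(unify({"NUMBER": "SG"}, {"PERSON": "3"}))
```

Failure is signalled with None rather than an exception so that a caller can test several candidate rules in a loop; a real implementation would also need to represent shared values.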
6
Agreement
S → NP VP
{NP AGREEMENT} = {VP AGREEMENT}
Does this flight serve breakfast?
Do these flights serve breakfast?
S → Aux NP VP
{Aux AGREEMENT} = {NP AGREEMENT}
7
Agreement
These flights
This flight
NP → Det Nominal
{Det AGREEMENT} = {Nominal AGREEMENT}
Verb → serve
{Verb AGREEMENT NUMBER} = PL
Verb → serves
{Verb AGREEMENT NUMBER} = SG
8
Subcategorization
VP → Verb
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = INTRANS
VP → Verb NP
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = TRANS
VP → Verb NP NP
{VP HEAD} = {Verb HEAD}
{VP HEAD SUBCAT} = DITRANS
9
Regular Expressions and Automata
10
Regular expressions Searching for “woodchuck” Searching for “woodchucks” with an optional final “s” Regular expressions Finite-state automata (singular: automaton)
11
Regular expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunctions: [abc]
Ranges: [A-Z]
Negations: [^Ss]
Optional characters: ? and *
Wild card: .
Anchors: ^ and $, also \b and \B
Disjunction, grouping, and precedence: |
12
Writing correct expressions
Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
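The effect of each refinement can be checked with a short sketch. It uses Python's `re` module, whose syntax agrees with Perl for these patterns; the sample sentence and the helper name `count_matches` are made up for illustration.

```python
import re

# Hypothetical test sentence: contains "The", "the", and "other"
# (which hides the substring "the").
TEXT = "The other one said the answer."

PATTERNS = [
    r"the",                              # misses "The", wrongly hits "other"
    r"[tT]he",                           # finds "The", still hits "other"
    r"\b[tT]he\b",                       # word boundaries rule out "other"
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",     # same idea without \b
]


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


for pattern in PATTERNS:
    print(pattern, count_matches(pattern, TEXT))
```

Note that the last pattern consumes the character after “the”, so two articles separated by a single space could not both be matched; the `\b` version does not have that problem.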
13
A more complex example
Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
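A minimal sketch of the price, speed, and disk patterns in Python follows. Two points are worth noticing: the dollar sign is escaped, because an unescaped `$` is the end-of-line anchor, and no `\b` precedes it, because there is no word boundary between a space and `$`. The advertisement text is invented for the example, and these patterns only find the quantities; checking “more than 500” or “less than $1000” would need numeric comparison on the matched text.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")

# Hypothetical ad text, not taken from the lecture.
AD = "Desktop PC, 600 MHz, 40 Gb disk, only $999.99"

for name, pattern in [("price", PRICE), ("speed", SPEED), ("disk", DISK)]:
    match = pattern.search(AD)
    print(name, match.group(0) if match else None)
```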
14
Advanced operators
15
Substitutions and memory
Substitutions: s/colour/color/
Memory (\1, \2, etc. refer back to matches): s/([0-9]+)/<\1>/
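In Python the same two operations are written with `re.sub`. This is only a sketch: the input strings are made up, and the replacement `<\1>` (wrapping the remembered number in angle brackets) is an assumption chosen for illustration.

```python
import re

# Plain substitution: the counterpart of s/colour/color/
print(re.sub(r"colour", "color", "my favourite colour"))

# Substitution with memory: \1 refers back to the first parenthesised group.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))
```

Unlike Perl's `s///` without the `g` flag, `re.sub` replaces every match by default; pass `count=1` to replace only the first.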
16
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
17
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
s/.* all.*/IN WHAT WAY/
s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
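Steps 1 and 2 can be sketched in a few lines of Python. Everything beyond the slide's patterns is an assumption: the pronoun list, the `\b` word boundaries, the function name `respond`, and the fallback of echoing the transformed input. Step 3 (scoring) is not implemented; the first matching rule simply wins.

```python
import re

# Step 1: first-person references become second-person references.
PRONOUN_SWAPS = [
    (r"\bI['’]m\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Step 2: reply rules, tried in order (a stand-in for ranking by score).
RULES = [
    (r".*\bYOU ARE (depressed|sad)\b.*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance):
    """Return an ELIZA-style reply for a single user utterance."""
    text = utterance
    for pattern, replacement in PRONOUN_SWAPS:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for pattern, template in RULES:
        match = re.fullmatch(pattern, text, flags=re.IGNORECASE)
        if match:
            return match.expand(template).upper()
    # Assumed fallback: echo the transformed utterance.
    return text.upper()


print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```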
18
Finite-state automata Finite-state automata (FSA) Regular languages Regular expressions
19
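Finite-state automata (machines)
baa! baaa! baaaa! baaaaa! ... (the language of /baa+!/)
[Figure: automaton with states q0 to q4 and transitions b, a, a, ! plus an a-loop; labels point out a state, a transition, and the final state.]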
20
Input tape
[Figure: tape holding the symbols a b a ! b, with the machine in state q0.]
21
Finite-state automata
Q: a finite set of N states q0, q1, …, qN-1
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states
δ(q,i): transition function
22
State-transition tables
        Input
State   b   a   !
0       1   ∅   ∅
1       ∅   2   ∅
2       ∅   3   ∅
3       ∅   3   4
4       ∅   ∅   ∅
(∅ = no transition)
23
The FSM toolkit and friends
Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
Download:
http://www.research.att.com/sw/tools/fsm/tech.html
http://www.research.att.com/sw/tools/lextools/
Tutorial available
4 useful parts: FSM, Lextools, GRM, Dot (separate)
–/data2/tools/fsm-3.6/bin
–/data2/tools/lextools/bin
–/data2/tools/dot/bin
24
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table [current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table [current-state, tape[index]]
      index ← index + 1
  end
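A direct Python rendering of D-RECOGNIZE might look as follows. It is a sketch under a few assumptions: the transition table for the sheep language is stored as a dict of dicts (a missing key plays the role of an empty cell), state 4 is the only accept state, and the names `SHEEP_TABLE` and `d_recognize` are invented here.

```python
# Transition table for /baa+!/; absent entries mean "no transition".
SHEEP_TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
ACCEPT_STATES = {4}


def d_recognize(tape, table=SHEEP_TABLE, start=0, accept=ACCEPT_STATES):
    """Return True if the deterministic automaton accepts the tape."""
    state = start
    for symbol in tape:
        next_state = table[state].get(symbol)
        if next_state is None:
            return False        # empty table cell: reject
        state = next_state
    # End of input: accept only if we stopped in an accept state.
    return state in accept


for tape in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(tape, d_recognize(tape))
```

The explicit index of the pseudocode disappears because the `for` loop advances through the tape one symbol at a time.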
25
Adding a failing state
[Figure: the automaton with states q0 to q4 extended with a fail state qF; input symbols that have no transition from a state lead to qF.]
26
Languages and automata
Formal languages: regular languages, non-regular languages
Deterministic vs. non-deterministic FSAs
Epsilon (ε) transitions
27
Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit underexplored markers Look-ahead: look ahead in input Parallelism: look at alternatives in parallel
28
Using NFSAs
        Input
State   b   a     !   ε
0       1   ∅     ∅   ∅
1       ∅   2     ∅   ∅
2       ∅   2,3   ∅   ∅
3       ∅   ∅     4   ∅
4       ∅   ∅     ∅   ∅
29
More about FSAs
Transducers
Equivalence of DFSAs and NFSAs
Recognition as search: depth-first, breadth-first search
30
Recognition using NFSAs
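Recognition with a non-deterministic automaton can be sketched as a search over (state, tape position) pairs kept on an agenda. The code below is an illustration under assumptions, not the textbook's ND-RECOGNIZE verbatim: the table encodes the sheep-language NFSA with two choices for a in state 2, the empty string is used as the key for ε-moves, and all names are hypothetical.

```python
from collections import deque

EPSILON = ""  # assumed key for epsilon transitions (none in this table)

NFSA_TABLE = {
    0: {"b": [1]},
    1: {"a": [2]},
    2: {"a": [2, 3]},   # the non-deterministic choice point
    3: {"!": [4]},
    4: {},
}


def nd_recognize(tape, table=NFSA_TABLE, start=0, accept=frozenset({4}),
                 depth_first=True):
    """Search the space of (state, index) pairs; True if any path accepts."""
    agenda = deque([(start, 0)])
    seen = set()                      # avoids revisiting (and epsilon loops)
    while agenda:
        # Stack discipline gives depth-first search, queue gives breadth-first.
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True
        for nxt in table[state].get(EPSILON, []):
            agenda.append((nxt, index))          # epsilon move: no input consumed
        if index < len(tape):
            for nxt in table[state].get(tape[index], []):
                agenda.append((nxt, index + 1))
    return False


print(nd_recognize("baaa!"), nd_recognize("baaa!", depth_first=False))
```

Switching `depth_first` changes only the order in which alternatives are explored, not the answer.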
31
Regular languages Operations on regular languages and FSAs: concatenation, closure, union Properties of regular languages (closed under concatenation, union, disjunction, intersection, difference, complementation, reversal, Kleene closure)
32
An exercise J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
33
Morphology and Finite-State Transducers
34
Morphemes
Stems, affixes
Affixes:
–prefixes, suffixes
–infixes: hingi (borrow) – humingi (agent) in Tagalog
–circumfixes: sagen – gesagt in German
Concatenative morphology
Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
35
Morphological analysis
Examples: rewrites, unbelievably
36
Inflectional morphology
Tense, number, person, mood, aspect
Five verb forms in English
40+ forms in French
Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html
Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
37
Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, embraceable, clueless
38
Finite-state morphological parsing
Cats: cat +N +PL
Cat: cat +N +SG
Cities: city +N +PL
Geese: goose +N +PL
Ducks: (duck +N +PL) or (duck +V +3SG)
Merging: merge +V +PRES-PART
Caught: (catch +V +PAST-PART) or (catch +V +PAST)
39
Principles of morphological parsing
Lexicon
Morphotactics (e.g., plural follows noun)
Orthography (easy → easier)
Irregular nouns: e.g., geese, sheep, mice
Irregular verbs: e.g., caught, ate, eaten
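How the three knowledge sources interact can be illustrated with a toy analyzer. This is a lookup-based sketch, not a finite-state transducer: the tiny lexicon, the hand-written suffix-stripping rules, and the function names are all assumptions made for the example.

```python
# Lexicon: stems by part of speech, plus a list of irregular forms.
NOUN_STEMS = {"cat", "city", "duck", "goose"}
VERB_STEMS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}


def undo_s(word):
    """Orthography: candidate stems for a word ending in -s."""
    stems = []
    if word.endswith("ies"):
        stems.append(word[:-3] + "y")    # cities -> city
    if word.endswith("es"):
        stems.append(word[:-2])          # e-insertion, e.g. foxes -> fox
    if word.endswith("s"):
        stems.append(word[:-1])          # cats -> cat
    return stems


def undo_ing(word):
    """Orthography: candidate stems for a word ending in -ing."""
    if not word.endswith("ing"):
        return []
    base = word[:-3]
    return [base, base + "e"]            # merging -> merg, merge


def parse(word):
    """Return all analyses licensed by the lexicon and morphotactics."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))
    if word in NOUN_STEMS:
        analyses.append(f"{word} +N +SG")
    for stem in undo_s(word):            # morphotactics: -s follows N or V
        if stem in NOUN_STEMS:
            analyses.append(f"{stem} +N +PL")
        if stem in VERB_STEMS:
            analyses.append(f"{stem} +V +3SG")
    for stem in undo_ing(word):          # morphotactics: -ing follows V
        if stem in VERB_STEMS:
            analyses.append(f"{stem} +V +PRES-PART")
    return analyses


for w in ["Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"]:
    print(w, parse(w))
```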
40
FSA for adjectives Big, bigger, biggest Cool, cooler, coolest, coolly Red, redder, reddest Clear, clearer, clearest, clearly, unclear, unclearly Happy, happier, happiest, happily Unhappy, unhappier, unhappiest, unhappily What about: unbig, redly, and realest?
41
Using FSA for recognition
Is a string a legitimate word or not?
Two-level morphology: lexical level + surface level (Koskenniemi 1983)
Finite-state transducers (FST) – used for regular relations
Inversion and composition of FST
42
Orthographic rules
Beg/begging
Make/making
Watch/watches
Try/tries
Panic/panicked
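The five spelling changes on this slide (consonant doubling, e-deletion, e-insertion, y-replacement, k-insertion) can be approximated with a few regular-expression tests. The sketch below works in the generation direction, stem plus suffix to surface form; the rule conditions are rough heuristics and the function names are hypothetical. In particular, the doubling rule ignores stress, so it overgenerates for stems such as "visit".

```python
import re


def add_s(stem):
    """Attach -s, applying y-replacement and e-insertion."""
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"          # try -> tries
    if re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"                # watch -> watches
    return stem + "s"


def add_suffix(stem, suffix):
    """Attach a vowel-initial suffix such as 'ing' or 'ed'."""
    if stem.endswith("c"):
        return stem + "k" + suffix        # panic -> panicked (k-insertion)
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + suffix         # make -> making (e-deletion)
    if re.search(r"[^aeiou][aeiou][^aeiouwxy]$", stem):
        return stem + stem[-1] + suffix   # beg -> begging (doubling)
    return stem + suffix


print(add_suffix("beg", "ing"), add_suffix("make", "ing"),
      add_s("watch"), add_s("try"), add_suffix("panic", "ed"))
```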
43
Combining FST lexicon and rules Cascades of transducers: the output of one becomes the input of another
44
Weighted Automata
45
Phonetic symbols IPA Arpabet Examples
46
Using WFST for language modeling Phonetic representation Part-of-speech tagging
47
Word Classes and Part Of Speech Tagging
48
Some POS statistics Preposition list from COBUILD Single-word particles Conjunctions Pronouns Modal verbs
49
Tagsets for English Penn Treebank Other tagsets (see Week 1 slides)
50
POS ambiguity
Degrees of ambiguity (DeRose 1988)
Rule-based POS tagging
–ENGTWOL (Voutilainen et al. )
–Sample rule: Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if
  (+1 A/ADV/QUANT);
  (+2 SENT-LIM);
  (NOT –1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag
51
Evaluating POS taggers Percent correct What is the lower bound on a system’s performance? What about the upper bound?
52
Kappa
N: number of items (index i)
n: number of categories (index j)
k: number of annotators
When κ > .8, agreement is considered high
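The formula itself does not appear in the slide text. With N items, n categories, and k annotators, the notation is consistent with the multi-annotator (Fleiss-style) kappa, so the sketch below assumes that variant; whether the lecture used exactly this definition is an assumption. The input format and the function name are also invented for the example.

```python
def fleiss_kappa(counts):
    """Multi-annotator kappa (assumed Fleiss-style definition).

    counts[i][j] = number of annotators who assigned item i to category j.
    Every row must sum to k, the number of annotators.
    """
    N = len(counts)            # number of items
    n = len(counts[0])         # number of categories
    k = sum(counts[0])         # number of annotators

    # Proportion of all assignments that went to each category j.
    p_j = [sum(row[j] for row in counts) / (N * k) for j in range(n)]

    # Observed agreement on item i: fraction of annotator pairs that agree.
    P_i = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts]

    P_bar = sum(P_i) / N                  # mean observed agreement
    P_e = sum(p * p for p in p_j)         # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)


# Three annotators, two categories, four items, perfect agreement.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```

The function divides by zero when every assignment falls into a single category (P_e = 1); a production version would need to handle that case explicitly.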
53
Midterm reading list
Chapter 1 – Introduction
Chapter 2 – Regular expressions and automata
Chapter 3 – Morphology and finite-state transducers + FSM tutorial
Chapter 8 – Word classes and POS tagging
Chapter 9 – Context-free grammars for English
Chapter 10 – Parsing with context-free grammars
Chapter 11 – Features and unification
54
Syntaxscape Written by Juno Suk of Lucent http://www.cs.columbia.edu/~radev/syntaxscape/
56
Read by yourselves 9.9. Spoken language syntax 9.10. Grammar equivalence 9.11. Finite-state and context-free grammars