Published by Leo Jonah White. Modified over 9 years ago.
1
Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing
2
Course logistics
Instructor: Prof. Dragomir Radev (radev@umich.edu)
Class times: Tu 1:10-3:55 PM, in 412, WH
Office hours: M 10-11, Tu 11-12 in 3080, WH
Home page: http://www.si.umich.edu/~radev/NLP-fall2004
3
Regular Expressions and Automata
4
Regular expressions Searching for “woodchuck” Searching for “woodchucks” with an optional final “s” Regular expressions Finite-state automata (singular: automaton)
5
Regular expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunctions: [abc]
Ranges: [A-Z]
Negations: [^Ss]
Optional and repeated characters: ? and *
Wild cards: .
Anchors: ^ and $, also \b and \B
Disjunction, grouping, and precedence: | and ( )
6
Writing correct expressions
Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
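The progression in this exercise can be checked mechanically. The sketch below is an illustration in Python's `re` module rather than Perl (the syntax of these five patterns is the same in both); the sample sentences and the helper name `count_matches` are assumptions made for the example, not part of the lecture.

```python
import re

# The five attempts from the exercise, written as Python raw strings.
ATTEMPTS = [
    r"the",                           # misses "The", matches inside "other"
    r"[tT]he",                        # handles capitalization, still matches inside words
    r"\b[tT]he\b",                    # word boundaries; "_" and digits count as word characters
    r"[^a-zA-Z][tT]he[^a-zA-Z]",      # needs a non-letter on both sides, so misses line-initial "The"
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",  # also allows the start of the line on the left
]


def count_matches(pattern, text):
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


if __name__ == "__main__":
    for sample in ("The other theory: the cat", "the_end the25"):
        print(sample, [count_matches(p, sample) for p in ATTEMPTS])
```

The second sample shows why the last two patterns differ from `\b`: an underscore or digit next to "the" blocks a word boundary but still counts as a non-letter.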
7
A more complex example
Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/\b\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
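As a rough illustration of how such patterns might be used together, the sketch below pulls a price, a clock speed, and a disk size out of an advertisement string. It is written in Python rather than Perl, and it makes two assumptions that go beyond the slide: the dollar sign is escaped as `\$`, and the leading `\b` before the dollar sign is dropped, since there is no word boundary between a space and `$`. The function name `extract_specs` and the sample text are hypothetical.

```python
import re

# Assumed Python adaptations of the slide's patterns (see the notes above).
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")
CLOCK = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")


def extract_specs(ad):
    """Return the first price, clock speed and disk size found in ad (None if absent)."""
    found = {}
    for name, pattern in (("price", PRICE), ("clock", CLOCK), ("disk", DISK)):
        match = pattern.search(ad)
        found[name] = match.group(0) if match else None
    return found


if __name__ == "__main__":
    print(extract_specs("Fast PC: 600 MHz, 40 Gb disk, only $999.99"))
```

The regular expressions only locate candidate strings; checking "more than 500" or "less than $1000" would still require converting the matched digits to numbers.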
8
Advanced operators
9
Substitutions and memory
Substitutions: s/colour/color/
Memory (\1, \2, etc. refer back to matches): s/([0-9]+)/<\1>/
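A minimal sketch of the same two ideas in Python, where `re.sub` plays the role of Perl's `s/.../.../` operator. The second function assumes that the intended replacement wraps each number in angle brackets, as in the corresponding Jurafsky and Martin textbook example; the function names are made up for illustration.

```python
import re


def americanize(text):
    """Plain substitution: the equivalent of s/colour/color/ applied globally."""
    return re.sub(r"colour", "color", text)


def bracket_numbers(text):
    """Substitution with memory: \\1 refers back to what the first group matched."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


if __name__ == "__main__":
    print(americanize("the colour red"))
    print(bracket_numbers("the 35 boxes and 7 bags"))
```

Note that `re.sub` replaces every match by default, whereas Perl's `s///` replaces only the first one unless the `g` flag is given.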
10
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
11
Eliza-style regular expressions
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
s/.* all.*/IN WHAT WAY/
s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
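The steps above can be sketched as a tiny Python program. This is only an illustration under several assumptions: input is upper-cased before matching, the person-swap list is a small made-up subset, rule order stands in for the score-based ranking of step 3 (the first matching rule wins), and the fallback reply `PLEASE GO ON` is hypothetical.

```python
import re

# Step 1: first-person -> second-person rewrites (assumed subset).
PERSON_SWAPS = [
    (r"\bI'M\b", "YOU ARE"),
    (r"\bI AM\b", "YOU ARE"),
    (r"\bMY\b", "YOUR"),
    (r"\bME\b", "YOU"),
]

# Step 2: reply rules adapted from the slide, tried in order.
REPLY_RULES = [
    (r".* YOU ARE (DEPRESSED|SAD).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance):
    """Apply the person swaps, then return the reply of the first rule that matches."""
    text = utterance.upper()
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)  # fills in \1 from the matched group
    return "PLEASE GO ON"  # hypothetical default when no rule applies


if __name__ == "__main__":
    for line in ("Men are all alike",
                 "They're always bugging us about something or other",
                 "He says I'm depressed much of the time"):
        print(respond(line))
```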
12
Finite-state automata Finite-state automata (FSA) Regular languages Regular expressions
13
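Finite-state automata (machines)
Language: baa! baaa! baaaa! baaaaa! ...
Regular expression: /baa+!/
[Figure: FSA with states q0, q1, q2, q3, q4. Arcs: q0 -b-> q1, q1 -a-> q2, q2 -a-> q3, q3 -a-> q3 (self-loop), q3 -!-> q4. q4 is the final state. Labels: state, transition, final state.]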
14
Input tape
[Figure: input tape holding the symbols a b a ! b, with the machine in state q0]
15
Finite-state automata
Q: a finite set of N states q0, q1, …, qN-1
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states
δ(q,i): transition function
16
State-transition tables
          Input
State     b     a     !
0         1     ∅     ∅
1         ∅     2     ∅
2         ∅     3     ∅
3         ∅     3     4
4         ∅     ∅     ∅
∅: no transition. State 4 is the final state.
17
The FSM toolkit and friends
Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
Download:
http://www.research.att.com/sw/tools/fsm/tech.html
http://www.research.att.com/sw/tools/lextools/
Tutorial available
4 useful parts: FSM, Lextools, GRM, Dot (separate)
–/clair3/tools/fsm-3.6/bin
–/clair3/tools/lextools/bin
–/clair3/tools/dot/bin
18
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
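The D-RECOGNIZE procedure translates almost line for line into Python. The sketch below assumes the sheep-language machine (baa+!) from the earlier slides, encoded as a dictionary in which a missing key plays the role of an empty table cell; the names `SHEEP_TABLE` and `d_recognize` are chosen for this example.

```python
# Transition table for the sheep language baa+!.
# A missing (state, symbol) key means "no transition" (an empty cell).
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = frozenset({4})


def d_recognize(tape, table=SHEEP_TABLE, start=0, accept=ACCEPT_STATES):
    """Return True (accept) or False (reject) for a deterministic FSA."""
    state = start
    for symbol in tape:                 # index advances one cell per iteration
        nxt = table.get((state, symbol))
        if nxt is None:                 # empty table entry: reject
            return False
        state = nxt
    return state in accept              # end of input: accept only in a final state


if __name__ == "__main__":
    for tape in ("baa!", "baaaa!", "ba!", "aba!b"):
        print(tape, d_recognize(tape))
```

Rejecting on a missing key is equivalent to sending the machine to an explicit fail state that it can never leave.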
19
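Adding a failing state
[Figure: the sheep-language FSA (states q0 to q4) extended with a fail state qF; every transition on b, a, or ! that is not defined in the original machine leads to qF]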
20
Languages and automata
Formal languages: regular languages, non-regular languages
Deterministic vs. non-deterministic FSAs
Epsilon (ε) transitions
21
Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit unexplored markers Look-ahead: look ahead in input Parallelism: look at alternatives in parallel
22
Using NFSAs
          Input
State     b     a       !     ε
0         1     ∅       ∅     ∅
1         ∅     2       ∅     ∅
2         ∅     2,3     ∅     ∅
3         ∅     ∅       4     ∅
4         ∅     ∅       ∅     ∅
∅: no transition. State 4 is the final state.
23
More about FSAs
Transducers
Equivalence of DFSAs and NFSAs
Recognition as search: depth-first, breadth-first search
24
Recognition using NFSAs
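One way to picture recognition as search is an agenda of (state, tape position) pairs. The sketch below is an illustration, not the lecture's own code: it assumes the non-deterministic sheep-language table shown earlier (state 2 goes to both 2 and 3 on "a"), represents ε by the empty string, and switches between depth-first and breadth-first order by treating the agenda as a stack or a queue.

```python
from collections import deque

EPSILON = ""  # assumed encoding of an epsilon transition

# Non-deterministic table: each entry maps to a set of possible next states.
NFSA_TABLE = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def nd_recognize(tape, table=NFSA_TABLE, start=0,
                 accept=frozenset({4}), depth_first=True):
    """Search the space of (state, index) pairs; True if any path accepts."""
    agenda = deque([(start, 0)])
    seen = set()  # avoids re-exploring a search state (also guards against epsilon loops)
    while agenda:
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True
        for nxt in table.get((state, EPSILON), ()):   # moves that consume no input
            agenda.append((nxt, index))
        if index < len(tape):
            for nxt in table.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))
    return False


if __name__ == "__main__":
    for tape in ("baa!", "baaa!", "ba!"):
        print(tape, nd_recognize(tape), nd_recognize(tape, depth_first=False))
```

Popping from the right end of the agenda gives the backup (depth-first) strategy; popping from the left explores alternatives level by level, which approximates the parallel strategy.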
25
Regular languages Operations on regular languages and FSAs: concatenation, closure, union Properties of regular languages (closed under concatenation, union, disjunction, intersection, difference, complementation, reversal, Kleene closure)
26
An exercise J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
27
Morphology and Finite-State Transducers
28
Morphemes
Stems, affixes
Affixes: prefixes, suffixes, infixes (hingi “borrow” – humingi “agent” in Tagalog), circumfixes (sagen – gesagt in German)
Concatenative morphology
Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
29
Morphological analysis
Examples: rewrites, unbelievably
30
Inflectional morphology
Tense, number, person, mood, aspect
Five verb forms in English
40+ forms in French
Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html
Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
31
Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, embraceable, clueless
32
Finite-state morphological parsing
Cats: cat +N +PL
Cat: cat +N +SG
Cities: city +N +PL
Geese: goose +N +PL
Ducks: (duck +N +PL) or (duck +V +3SG)
Merging: merge +V +PRES-PART
Caught: (catch +V +PAST-PART) or (catch +V +PAST)
33
Principles of morphological parsing
Lexicon
Morphotactics (e.g., plural follows noun)
Orthography (easy → easier)
Irregular nouns: e.g., geese, sheep, mice
Irregular verbs: e.g., caught, ate, eaten
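To make the three knowledge sources concrete, here is a hand-coded toy parser in Python. It is not a finite-state transducer, only a sketch of the same division of labour: a lexicon of stems, morphotactic checks (which suffix may follow which class), a few orthographic reversals, and a list of irregular forms. The stem lists and function names are assumptions; the example words come from the parsing slide.

```python
# Lexicon (assumed toy contents).
NOUN_STEMS = {"cat", "city", "duck", "goose"}
VERB_STEMS = {"duck", "merge", "catch"}

# Irregular forms are simply listed.
IRREGULAR = {
    "geese": ["goose +N +PL"],
    "caught": ["catch +V +PAST-PART", "catch +V +PAST"],
}


def undo_spelling(base):
    """Orthography: yield candidate lexical stems for a surface base."""
    yield base                     # no spelling change: cat-s
    yield base + "e"               # e-deletion undone: merg-ing -> merge
    if base.endswith("i"):
        yield base[:-1] + "y"      # y -> ie undone: citi-es -> city


def parse(word):
    """Return all analyses of word licensed by the toy lexicon."""
    word = word.lower()
    analyses = list(IRREGULAR.get(word, []))

    def add(analysis):
        if analysis not in analyses:
            analyses.append(analysis)

    if word in NOUN_STEMS:
        add(f"{word} +N +SG")
    # Morphotactics: -s/-es follows a noun (plural) or a verb (3rd person singular).
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            for stem in undo_spelling(word[: -len(suffix)]):
                if stem in NOUN_STEMS:
                    add(f"{stem} +N +PL")
                if stem in VERB_STEMS:
                    add(f"{stem} +V +3SG")
    # Morphotactics: -ing follows a verb stem.
    if word.endswith("ing"):
        for stem in undo_spelling(word[:-3]):
            if stem in VERB_STEMS:
                add(f"{stem} +V +PRES-PART")
    return analyses


if __name__ == "__main__":
    for w in ("Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"):
        print(w, parse(w))
```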
34
FSA for adjectives Big, bigger, biggest Cool, cooler, coolest, coolly Red, redder, reddest Clear, clearer, clearest, clearly, unclear, unclearly Happy, happier, happiest, happily Unhappy, unhappier, unhappiest, unhappily What about: unbig, redly, and realest?
35
Using FSA for recognition
Is a string a legitimate word or not?
Two-level morphology: lexical level + surface level (Koskenniemi 1983)
Finite-state transducers (FST): used for regular relations
Inversion and composition of FST
36
Orthographic rules Beg/begging Make/making Watch/watches Try/tries Panic/panicked
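Each pair on this slide illustrates one spelling rule that applies at a morpheme boundary. The Python sketch below encodes the five rules as simple heuristics; the rule names in the comments follow common textbook usage, and the conditions (for example the one-syllable consonant-vowel-consonant test for doubling) are simplifying assumptions, not complete rules of English.

```python
import re

VOWELS = "aeiou"


def add_suffix(stem, suffix):
    """Attach suffix to stem, applying one of five heuristic spelling rules."""
    # k-insertion: panic + ed -> panicked
    if stem.endswith("c") and suffix[0] in "ei":
        return stem + "k" + suffix
    # e-insertion: watch + s -> watches
    if suffix == "s" and re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"
    # y-replacement: try + s -> tries
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # e-deletion: make + ing -> making
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # consonant doubling: beg + ing -> begging (one-syllable CVC stems only)
    if suffix[0] in VOWELS and re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


if __name__ == "__main__":
    for stem, suffix in (("beg", "ing"), ("make", "ing"), ("watch", "s"),
                         ("try", "s"), ("panic", "ed")):
        print(stem, "+", suffix, "->", add_suffix(stem, suffix))
```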
37
Combining FST lexicon and rules Cascades of transducers: the output of one becomes the input of another
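A cascade can be imitated with ordinary function composition. In the sketch below each stage is a plain Python function standing in for a transducer; the intermediate notation (`^` for a morpheme boundary, `#` for a word boundary) follows the convention used in the Jurafsky and Martin textbook, and all function names are assumptions made for this illustration.

```python
import re


def lexicon_stage(lexical):
    """Lexical level -> intermediate level, e.g. 'watch +N +PL' -> 'watch^s#'."""
    stem, *features = lexical.split()
    if features == ["+N", "+PL"]:
        return stem + "^s#"
    if features == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported features: {features}")


def e_insertion_stage(intermediate):
    """Spelling rule: insert e between a sibilant-final stem and the suffix s."""
    return re.sub(r"(x|s|z|ch|sh)\^s#", r"\1^es#", intermediate)


def cleanup_stage(intermediate):
    """Intermediate level -> surface level: drop the boundary symbols."""
    return intermediate.replace("^", "").replace("#", "")


def cascade(*stages):
    """Compose stages so that the output of one becomes the input of the next."""
    def run(value):
        for stage in stages:
            value = stage(value)
        return value
    return run


generate = cascade(lexicon_stage, e_insertion_stage, cleanup_stage)

if __name__ == "__main__":
    for lexical in ("cat +N +PL", "watch +N +PL", "cat +N +SG"):
        print(lexical, "->", generate(lexical))
```

Real transducers have an advantage this sketch lacks: they can be inverted, so the same cascade serves for both generation and parsing.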
38
Weighted Automata
39
Phonetic symbols IPA Arpabet Examples
40
Using WFST for language modeling Phonetic representation Part-of-speech tagging
41
Word Classes and Part Of Speech Tagging
42
Some POS statistics Preposition list from COBUILD Single-word particles Conjunctions Pronouns Modal verbs
43
Tagsets for English Penn Treebank Other tagsets (see Week 1 slides)
44
POS ambiguity
Degrees of ambiguity (DeRose 1988)
Rule-based POS tagging
–ENGTWOL (Voutilainen et al. )
–Sample rule: Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if
  (+1 A/ADV/QUANT);
  (+2 SENT-LIM);
  (NOT –1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag
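The sample rule can be read as a constraint that prunes the candidate tags of "that". The sketch below is one possible Python reading of it: the token representation (a dict with a word and a set of candidate tags) and the candidate tags given to "that" are assumptions for the example, not ENGTWOL's actual data format.

```python
def has_tag(tokens, i, wanted):
    """True if position i exists and carries at least one of the wanted tags."""
    return 0 <= i < len(tokens) and bool(tokens[i]["tags"] & wanted)


def adverbial_that_rule(tokens, i):
    """Prune the candidate tags of 'that' at position i, in place."""
    if tokens[i]["word"].lower() != "that":
        return
    fires = (
        has_tag(tokens, i + 1, {"A", "ADV", "QUANT"})   # next word: adjective, adverb or quantifier
        and has_tag(tokens, i + 2, {"SENT-LIM"})        # then a sentence boundary
        and not has_tag(tokens, i - 1, {"SVOC/A"})      # previous word is not a verb like "consider"
    )
    if fires:
        tokens[i]["tags"] &= {"ADV"}   # eliminate non-ADV tags
    else:
        tokens[i]["tags"] -= {"ADV"}   # eliminate ADV tag


def sentence(*pairs):
    """Helper: build a token list from (word, tags) pairs."""
    return [{"word": word, "tags": set(tags)} for word, tags in pairs]


if __name__ == "__main__":
    that_tags = {"ADV", "DET", "PRON", "CS"}  # hypothetical candidate tags
    odd = sentence(("it", {"PRON"}), ("isn't", {"V"}), ("that", that_tags),
                   ("odd", {"A"}), (".", {"SENT-LIM"}))
    adverbial_that_rule(odd, 2)
    print(odd[2]["tags"])
```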
45
Evaluating POS taggers Percent correct What is the lower bound on a system’s performance? What about the upper bound?
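A common choice of lower bound is the most-frequent-tag baseline: give every word the tag it carried most often in training data. The upper bound is usually taken to be the agreement rate between human annotators. The Python sketch below computes percent correct for such a baseline; the tiny tagged corpus and the default tag for unseen words are made up for illustration.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_words):
    """Map each word to the tag it received most often in the training data."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default="NN"):
    """Tag each word with its most frequent tag; unseen words get the default."""
    return [model.get(word.lower(), default) for word in words]


def percent_correct(predicted, gold):
    """Percentage of positions where the predicted tag equals the gold tag."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


if __name__ == "__main__":
    # Hypothetical toy training data.
    train = [("the", "DT"), ("duck", "NN"), ("they", "PRP"),
             ("duck", "VB"), ("a", "DT"), ("duck", "NN")]
    model = train_baseline(train)
    predicted = tag_words(["they", "duck"], model)
    print(predicted, percent_correct(predicted, ["PRP", "VB"]))
```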
46
Kappa
N: number of items (index i)
n: number of categories (index j)
k: number of annotators
When kappa > 0.8, agreement is considered high
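The formula itself did not survive on this slide. Given the variables listed (items, categories, and more than two annotators), the sketch below assumes the multi-annotator formulation usually called Fleiss' kappa: observed agreement is averaged over items, expected agreement comes from the overall category proportions, and kappa = (P - P_e) / (1 - P_e). The input layout (one row per item, one count per category) is a choice made for this example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of annotators assigning item i to category j.

    Every row must sum to the same number of annotators k (k >= 2).
    """
    N = len(counts)          # number of items
    n = len(counts[0])       # number of categories
    k = sum(counts[0])       # number of annotators
    if any(sum(row) != k for row in counts):
        raise ValueError("every item must be rated by the same number of annotators")

    # Proportion of all assignments that went to each category j.
    p_j = [sum(row[j] for row in counts) / (N * k) for j in range(n)]
    # Agreement on item i: fraction of annotator pairs that agree.
    P_i = [(sum(c * c for c in row) - k) / (k * (k - 1)) for row in counts]

    P_bar = sum(P_i) / N                 # mean observed agreement
    P_e = sum(p * p for p in p_j)        # agreement expected by chance
    if P_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (P_bar - P_e) / (1.0 - P_e)


if __name__ == "__main__":
    # Three annotators, two categories, four items, perfect agreement.
    print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```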
47
Readings for next time J&M Chapters 5.9, 8, 9 Lecture notes #2