Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing.


1 Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing

2 Course logistics Instructor: Prof. Dragomir Radev (radev@umich.edu) Class times: Tu 1:10-3:55 PM, in 412, WH Office hours: M 10-11, Tu 11-12 in 3080, WH Home page: http://www.si.umich.edu/~radev/NLP-fall2004

3 Regular Expressions and Automata

4 Regular expressions Searching for “woodchuck” Searching for “woodchucks” with an optional final “s” Regular expressions Finite-state automata (singular: automaton)

5 Regular expressions Basic regular expression patterns Perl-based syntax (slightly different from other notations for regular expressions) Disjunctions [abc] Ranges [A-Z] Negations [^Ss] Optional characters ? and * Wildcard . (matches any single character) Anchors ^ and $, also \b and \B Disjunction, grouping, and precedence | ( )

6 Writing correct expressions Exercise: write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ /\b[tT]he\b/ /[^a-zA-Z][tT]he[^a-zA-Z]/ /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
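The progression of patterns on this slide can be tried out directly. The sketch below uses Python's `re` module, whose syntax agrees with Perl for these patterns; the sample sentence and the helper name `count_matches` are made up for illustration.

```python
import re

# The candidate patterns from the exercise, from least to most precise.
PATTERNS = {
    "plain": r"the",
    "case": r"[tT]he",
    "word_boundary": r"\b[tT]he\b",
    "non_letter_context": r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
}


def count_matches(pattern: str, text: str) -> int:
    """Count non-overlapping matches of pattern in text."""
    return sum(1 for _ in re.finditer(pattern, text))


# Hypothetical test sentence: contains "the" inside other words and next to "_".
text = "The other day the_theory bothered them; then the cat left."

for name, pattern in PATTERNS.items():
    print(name, count_matches(pattern, text))
```

On this sentence the plain pattern also fires inside "other", "bothered", "them" and "then". The `\b` version rejects "the_theory" because `\b` treats the underscore as a word character, while the explicit `[^a-zA-Z]` version accepts the "the" before the underscore.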

7 A more complex example Exercise: Write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”: /\$[0-9]+/ /\$[0-9]+\.[0-9][0-9]/ /\$[0-9]+(\.[0-9][0-9])?\b/ /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
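A regular expression can find the price, clock speed and disk size, but the "more than" and "less than" comparisons need arithmetic on the captured numbers. The following is a minimal sketch under some assumptions: the dollar sign is escaped so it is read literally, only MHz (not GHz) is handled to avoid unit conversion, and the names `first_number` and `qualifies` and the sample ads are hypothetical.

```python
import re

# Patterns adapted from the slide; each captures the numeric part in group 1.
PRICE = re.compile(r"\$([0-9]+(?:\.[0-9][0-9])?)\b")
SPEED = re.compile(r"\b([0-9]+) *(?:MHz|[Mm]egahertz)\b")
DISK = re.compile(r"\b([0-9]+(?:\.[0-9]+)?) *(?:Gb|[Gg]igabytes?)\b")


def first_number(pattern, text):
    """Return the first captured number as a float, or None if no match."""
    m = pattern.search(text)
    return float(m.group(1)) if m else None


def qualifies(ad: str) -> bool:
    """More than 500 MHz, at least 32 Gb of disk, and under $1000."""
    price, speed, disk = (first_number(p, ad) for p in (PRICE, SPEED, DISK))
    if price is None or speed is None or disk is None:
        return False
    return speed > 500 and disk >= 32 and price < 1000


print(qualifies("Desktop PC, 600 MHz, 40.5 Gb disk, only $899.99"))
print(qualifies("PC with 400 MHz and 32 Gb for $799"))
```

The first ad satisfies all three conditions; the second fails on clock speed.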

8 Advanced operators

9 Substitutions and memory Substitutions Memory ( \1, \2, etc. refer back to matches) s/colour/color/ s/([0-9]+)/<\1>/
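Substitution with memory can be illustrated with Python's `re.sub`, which plays the role of Perl's `s/.../.../` operator. This sketch assumes a replacement that wraps each number in angle brackets; the function names and sample strings are hypothetical.

```python
import re


def americanize(text: str) -> str:
    """Plain substitution, like s/colour/color/ applied to every occurrence."""
    return re.sub(r"colour", "color", text)


def bracket_numbers(text: str) -> str:
    """Substitution with memory: \\1 refers back to the first parenthesized group."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


print(americanize("my favourite colour"))
print(bracket_numbers("the 35 boxes"))  # the <35> boxes
```

Note that `re.sub` replaces every match by default, which corresponds to Perl's `s///g`; pass `count=1` to replace only the first match.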

10 Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

11 Eliza-style regular expressions s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/ s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/ s/.* all.*/IN WHAT WAY/ s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/ Step 1: replace first person references with second person references Step 2: use additional regular expressions to generate replies Step 3: use scores to rank possible transformations
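Steps 1 and 2 can be sketched as a small rule cascade. This is an illustration, not Weizenbaum's program: the person-swapping rules, the fallback reply `PLEASE GO ON`, and the function name `respond` are assumptions, and the first matching rule wins instead of the score-based ranking of Step 3.

```python
import re

# Step 1: first person -> second person (a small, hypothetical subset).
PERSON_SWAPS = [
    (r"\bI'M\b", "YOU ARE"),
    (r"\bI AM\b", "YOU ARE"),
    (r"\bMY\b", "YOUR"),
    (r"\bME\b", "YOU"),
]

# Step 2: reply rules, tried in order; \1 refers back to the captured group.
REPLY_RULES = [
    (r".* YOU ARE (DEPRESSED|SAD).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* ALL.*", "IN WHAT WAY"),
    (r".* ALWAYS.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance: str) -> str:
    # Normalize: straight apostrophes and upper case, as in the slide's replies.
    text = utterance.replace("\u2019", "'").upper()
    for pattern, replacement in PERSON_SWAPS:
        text = re.sub(pattern, replacement, text)
    for pattern, template in REPLY_RULES:
        match = re.fullmatch(pattern, text)
        if match:
            return match.expand(template)
    return "PLEASE GO ON"  # hypothetical fallback when no rule applies


print(respond("Men are all alike"))
print(respond("He says I'm depressed much of the time"))
```

The second call first rewrites "I'M" to "YOU ARE", after which the depression rule matches and copies the captured adjective into the reply.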

12 Finite-state automata Finite-state automata (FSA) Regular languages Regular expressions

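13 Finite-state automata (machines) Strings: baa! baaa! baaaa! baaaaa! ... Pattern: /baa+!/ [Figure: FSA with states q0, q1, q2, q3, q4; transitions b, a, a, ! in sequence, plus a self-loop on a at q3; q4 is the final state. Labels: state, transition, final state]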

14 Input tape [Figure: input tape containing a b a ! b, with the machine in start state q0]

15 Finite-state automata Q: a finite set of N states q0, q1, …, qN-1 Σ: a finite input alphabet of symbols q0: the start state F: the set of final states (a subset of Q) δ(q,i): the transition function, giving the next state for state q and input symbol i

16 State-transition tables (∅ = no transition; state 4 is the final state)
         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4        ∅    ∅    ∅

17 The FSM toolkit and friends Developed at AT&T Research (Riley, Pereira, Mohri, Sproat) Download: http://www.research.att.com/sw/tools/fsm/tech.html http://www.research.att.com/sw/tools/lextools/ Tutorial available 4 useful parts: FSM, Lextools, GRM, Dot (separate) –/clair3/tools/fsm-3.6/bin –/clair3/tools/lextools/bin –/clair3/tools/dot/bin

18 D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table [current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table [current-state, tape[index]]
      index ← index + 1
  end
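D-RECOGNIZE translates almost line for line into a table-driven loop. The sketch below encodes the transition table for /baa+!/ as a Python dictionary in which a missing key stands for an empty table cell; the names `TRANSITIONS`, `ACCEPT` and `d_recognize` are chosen here for illustration.

```python
from typing import Dict, FrozenSet, Tuple

# Transition table for /baa+!/: (state, symbol) -> next state.
# Missing keys correspond to empty cells in the state-transition table.
TRANSITIONS: Dict[Tuple[int, str], int] = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT: FrozenSet[int] = frozenset({4})


def d_recognize(tape: str,
                transitions: Dict[Tuple[int, str], int] = TRANSITIONS,
                start: int = 0,
                accept: FrozenSet[int] = ACCEPT) -> bool:
    """Return True (accept) or False (reject) for a deterministic FSA."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False  # empty table cell: reject immediately
        state = transitions[(state, symbol)]
    # End of input reached: accept only in a final state.
    return state in accept


for s in ["baa!", "baaaa!", "ba!", "aba!b"]:
    print(s, d_recognize(s))
```

Because the machine is deterministic, the loop never backtracks and runs in time linear in the length of the tape.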

19 Adding a failing state [Figure: the baa+! automaton (states q0 to q4) extended with a fail state qF; every transition that is undefined in the original machine leads to qF]

20 Languages and automata Formal languages: regular languages, non-regular languages deterministic vs. non-deterministic FSAs Epsilon (ε) transitions

21 Using NFSAs to accept strings Backup: add markers at choice points, then possibly revisit unexplored markers Look-ahead: look ahead in input Parallelism: look at alternatives in parallel

22 Using NFSAs (∅ = no transition; state 4 is the final state)
         Input
State    b    a      !    ε
0        1    ∅      ∅    ∅
1        ∅    2      ∅    ∅
2        ∅    2,3    ∅    ∅
3        ∅    ∅      4    ∅
4        ∅    ∅      ∅    ∅
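The backup strategy can be sketched as a search over (state, tape position) pairs kept on an agenda. The code below is an illustration under stated assumptions: the table is the NFSA for /baa+!/ with two choices in state 2 on a, the empty string is used as the key for ε moves, and the names `NFSA`, `EPSILON` and `nd_recognize` are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

EPSILON = ""  # key used for epsilon moves (none in this particular table)

# (state, symbol) -> set of possible next states.
NFSA: Dict[Tuple[int, str], Set[int]] = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # the non-deterministic choice point
    (3, "!"): {4},
}


def nd_recognize(tape: str,
                 transitions: Dict[Tuple[int, str], Set[int]] = NFSA,
                 start: int = 0,
                 accept: FrozenSet[int] = frozenset({4}),
                 depth_first: bool = True) -> bool:
    """Search the space of (state, index) pairs; True if any path accepts."""
    agenda = deque([(start, 0)])
    seen = set()
    while agenda:
        # A stack gives depth-first search, a queue gives breadth-first search.
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True
        # Epsilon moves change state without consuming input.
        for nxt in transitions.get((state, EPSILON), ()):
            agenda.append((nxt, index))
        if index < len(tape):
            for nxt in transitions.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))
    return False


print(nd_recognize("baaa!"), nd_recognize("ba!"))
```

Switching `depth_first` changes only the order in which search states are explored, not the answer; the `seen` set keeps ε cycles from looping forever.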

23 More about FSAs Transducers Equivalence of DFSAs and NFSAs Recognition as search: depth-first, breadth-first search

24 Recognition using NFSAs

25 Regular languages Operations on regular languages and FSAs: concatenation, closure, union Properties of regular languages (closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure)

26 An exercise J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.

27 Morphology and Finite-State Transducers

28 Morphemes Stems, affixes Affixes: prefixes, suffixes, infixes: hingi (borrow) – humingi (agent) in Tagalog, circumfixes: sagen – gesagt in German Concatenative morphology Templatic morphology (Semitic languages) : lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)

29 Morphological analysis Examples: rewrites, unbelievably

30 Inflectional morphology Tense, number, person, mood, aspect Five verb forms in English 40+ forms in French Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)

31 Derivational morphology Nominalization: computerization, appointee, killer, fuzziness Formation of adjectives: computational, embraceable, clueless

32 Finite-state morphological parsing Cats: cat +N +PL Cat: cat +N +SG Cities: city +N +PL Geese: goose +N +PL Ducks: (duck +N +PL) or (duck +V +3SG) Merging: merge +V +PRES-PART Caught: (catch +V +PAST-PART) or (catch +V +PAST)

33 Principles of morphological parsing Lexicon Morphotactics (e.g., plural follows noun) Orthography (easy → easier) Irregular nouns: e.g., geese, sheep, mice Irregular verbs: e.g., caught, ate, eaten

34 FSA for adjectives Big, bigger, biggest Cool, cooler, coolest, coolly Red, redder, reddest Clear, clearer, clearest, clearly, unclear, unclearly Happy, happier, happiest, happily Unhappy, unhappier, unhappiest, unhappily What about: unbig, redly, and realest?

35 Using FSA for recognition Is a string a legitimate word or not? Two-level morphology: lexical level + surface level (Koskenniemi 1983) Finite-state transducers (FST) – used for regular relations Inversion and composition of FST

36 Orthographic rules Beg/begging Make/making Watch/watches Try/tries Panic/panicked
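Each of the five pairs above corresponds to one spelling rule: consonant doubling, e-deletion, e-insertion, y-replacement and k-insertion. The sketch below approximates them with regular-expression rewrites standing in for transducers. The intermediate form `stem^suffix`, the rule conditions and the name `surface` are simplifying assumptions that cover these examples, not general English spelling.

```python
import re

# Each rule rewrites an intermediate form "stem^suffix" into a surface form.
RULES = [
    # E-insertion: watch^s -> watches (after s, z, x, ch, sh).
    (re.compile(r"(s|z|x|ch|sh)\^s$"), r"\1es"),
    # Y-replacement: try^s -> tries (consonant + y before s).
    (re.compile(r"([^aeiou])y\^s$"), r"\1ies"),
    # K-insertion: panic^ed -> panicked (vowel + c before ed/ing).
    (re.compile(r"([aeiou])c\^(ed|ing)$"), r"\1ck\2"),
    # E-deletion: make^ing -> making (drop final e before ed/ing).
    (re.compile(r"e\^(ed|ing)$"), r"\1"),
    # Consonant doubling: beg^ing -> begging (one-vowel stems only).
    (re.compile(r"^([^aeiou]*[aeiou])([bdgmnprt])\^(ed|ing)$"), r"\1\2\2\3"),
]


def surface(lexical: str) -> str:
    """Apply the first matching spelling rule, else just drop the boundary."""
    for pattern, replacement in RULES:
        result, count = pattern.subn(replacement, lexical)
        if count:
            return result
    return lexical.replace("^", "")


for form in ["beg^ing", "make^ing", "watch^s", "try^s", "panic^ed", "cat^s"]:
    print(form, "->", surface(form))
```

A real two-level system would compile such rules into transducers and run them in parallel or in a cascade; the ordered list here only mimics that behavior for forms where at most one rule applies.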

37 Combining FST lexicon and rules Cascades of transducers: the output of one becomes the input of another

38 Weighted Automata

39 Phonetic symbols IPA Arpabet Examples

40 Using WFST for language modeling Phonetic representation Part-of-speech tagging

41 Word Classes and Part Of Speech Tagging

42 Some POS statistics Preposition list from COBUILD Single-word particles Conjunctions Pronouns Modal verbs

43 Tagsets for English Penn Treebank Other tagsets (see Week 1 slides)

44 POS ambiguity Degrees of ambiguity (DeRose 1988) Rule-based POS tagging
–ENGTWOL (Voutilainen et al. )
–Sample rule: Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if (+1 A/ADV/QUANT); (+2 SENT-LIM); (NOT –1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag

45 Evaluating POS taggers Percent correct What is the lower bound on a system’s performance? What about the upper bound?

46 Kappa N: number of items (index i) n: number of categories (index j) k: number of annotators when κ > 0.8, agreement is considered high
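The slide's formula did not survive in this transcript. The sketch below assumes the common multi-annotator formulation (often called Fleiss' kappa), κ = (P(A) − P(E)) / (1 − P(E)), which uses the same N, n and k; the function name `kappa` and the input layout are hypothetical.

```python
from typing import Sequence


def kappa(counts: Sequence[Sequence[int]]) -> float:
    """Multi-annotator kappa.

    counts[i][j] is the number of annotators who assigned item i to
    category j; every row must sum to the same number of annotators k.
    """
    N = len(counts)        # number of items
    n = len(counts[0])     # number of categories
    k = sum(counts[0])     # number of annotators
    # Observed agreement: proportion of agreeing annotator pairs per item.
    p_a = sum(c * (c - 1) for row in counts for c in row) / (N * k * (k - 1))
    # Chance agreement: sum of squared overall category proportions.
    p_e = sum((sum(row[j] for row in counts) / (N * k)) ** 2 for j in range(n))
    if p_e == 1.0:
        return 1.0  # every annotation is in a single category
    return (p_a - p_e) / (1 - p_e)


# Three annotators, two categories, four items, perfect agreement.
print(kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))  # 1.0
```

Values near 0 mean agreement is no better than chance, and negative values mean it is worse than chance.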

47 Readings for next time J&M Chapters 5.9, 8, 9 Lecture notes #2

