Fall 2005 Lecture Notes #4 EECS 595 / LING 541 / SI 661 Natural Language Processing.

Features and unification

Introduction
Grammatical categories have properties
Constraint-based formalisms
Example: "this flights": agreement is difficult to handle at the level of grammatical categories
Example: "many water": count/mass nouns
Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)

Feature structures
[CAT NP, NUMBER SINGULAR, PERSON 3]
[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]
Feature paths: {x agreement number}

Unification
[NUMBER SG] ⊔ [NUMBER SG]: succeeds (+)
[NUMBER SG] ⊔ [NUMBER PL]: fails (-)
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
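The unification cases on this slide can be tried out with a minimal sketch, assuming feature structures are represented as nested Python dicts, atomic values as strings, and the unspecified value [] as an empty dict. The function name `unify` and the `FAIL` marker are illustrative choices, and the sketch does not model reentrancy (shared substructures).

```python
FAIL = None  # illustrative marker for a failed unification


def unify(f, g):
    """Unify two feature structures (nested dicts with string atoms).

    An empty dict stands for the unspecified value [] and unifies with anything.
    Returns a new structure, or FAIL if the two inputs conflict.
    """
    if isinstance(f, dict) and not f:
        return g
    if isinstance(g, dict) and not g:
        return f
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)
        for feature, g_value in g.items():
            if feature in result:
                sub = unify(result[feature], g_value)
                if sub is FAIL:
                    return FAIL
                result[feature] = sub
            else:
                # Feature present on one side only: copy it over.
                result[feature] = g_value
        return result
    # Two atoms (or an atom and a non-empty structure): must be identical.
    return f if f == g else FAIL


print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))
print(unify({"NUMBER": "SG"}, {"PERSON": "3"}))
```

Under these assumptions the last call merges the two compatible structures into one that carries both NUMBER and PERSON.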

Agreement
S → NP VP
  {NP AGREEMENT} = {VP AGREEMENT}
Does this flight serve breakfast?
Do these flights serve breakfast?
S → Aux NP VP
  {Aux AGREEMENT} = {NP AGREEMENT}

Agreement
These flights
This flight
NP → Det Nominal
  {Det AGREEMENT} = {Nominal AGREEMENT}
Verb → serve
  {Verb AGREEMENT NUMBER} = PL
Verb → serves
  {Verb AGREEMENT NUMBER} = SG

Subcategorization
VP → Verb
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = INTRANS
VP → Verb NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = TRANS
VP → Verb NP NP
  {VP HEAD} = {Verb HEAD}
  {VP HEAD SUBCAT} = DITRANS

Regular Expressions and Automata

Regular expressions
Searching for “woodchuck”
Searching for “woodchucks” with an optional final “s”
Regular expressions
Finite-state automata (singular: automaton)

Regular expressions
Basic regular expression patterns
Perl-based syntax (slightly different from other notations for regular expressions)
Disjunctions: [abc]
Ranges: [A-Z]
Negations: [^Ss]
Optional characters: ? and *
Wildcards: .
Anchors: ^ and $, also \b and \B
Disjunction, grouping, and precedence: | and ( )

Writing correct expressions
Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
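To see why each successive pattern is needed, the five expressions can be tried on a few strings. The following is a small sketch using Python's `re` module rather than Perl; the helper name `matches` and the sample strings are made up for illustration.

```python
import re

# The five attempts from the exercise, in order.
PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


SAMPLES = ["The cat", "see the cat", "other", "the_end"]

for pattern in PATTERNS:
    found = [s for s in SAMPLES if matches(pattern, s)]
    print(pattern, "->", found)
```

The first pattern misses the capitalized "The", the second also fires inside "other", and the fourth cannot match at the start of a line, which is what the final alternation with ^ repairs.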

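A more complex example
Exercise: write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
Note: the literal dollar sign is escaped as \$ because a bare $ is the end-of-line anchor, and the price pattern starts with (^|\W) rather than \b because there is no word boundary between a space and $.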
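A regular expression can find the price and the hardware figures, but the numeric comparison ("less than $1000") has to happen outside the pattern. The sketch below uses Python's `re` module; the names `PRICE`, `SPEED`, `DISK`, `prices`, and `cheaper_than` are hypothetical, and the dollar sign is escaped so that it is treated as a literal character.

```python
import re

# Price: a literal dollar sign, digits, optional cents.
PRICE = re.compile(r"(^|\W)\$([0-9]+(\.[0-9][0-9])?)\b")
# Processor speed and disk size, following the patterns in the exercise.
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")


def prices(text):
    """Return every dollar amount found in the text as a float."""
    return [float(m.group(2)) for m in PRICE.finditer(text)]


def cheaper_than(text, limit):
    """True if at least one price in the text is below the limit."""
    return any(p < limit for p in prices(text))


ad = "PC with 500 MHz and 32 Gb of disk, was $1200, now $999.99"
print(prices(ad))
print(SPEED.search(ad).group(0), "|", DISK.search(ad).group(0))
print(cheaper_than(ad, 1000))
```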

Advanced operators

Substitutions and memory
Substitutions: s/colour/color/
Memory (\1, \2, etc. refer back to matches): s/([0-9]+)/<\1>/
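Substitution and back-references work the same way in Python's `re.sub` as in Perl's s/// operator. A brief sketch follows; the sample sentences and the angle-bracket replacement are illustrative.

```python
import re

# Plain substitution: replace every occurrence of one string by another.
print(re.sub(r"colour", "color", "the colour of the sky"))

# Memory: \1 in the replacement refers back to the first parenthesized group.
print(re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes"))

# Two groups: \1 and \2 can also be reordered in the replacement.
print(re.sub(r"(\w+) and (\w+)", r"\2 and \1", "salt and pepper"))
```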

Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
s/.* all.*/IN WHAT WAY/
s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
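Steps 1 and 2 can be sketched in a few lines of Python. This is a toy illustration and not Weizenbaum's program: the pronoun list, the fallback reply "PLEASE GO ON", and the function name `respond` are assumptions, and rule order stands in for the scoring of Step 3.

```python
import re

# Step 1: first person -> second person (a small, hypothetical list).
FIRST_TO_SECOND = [
    (r"\bI['’]m\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Step 2: reply rules taken from the slide; the first matching rule wins.
REPLY_RULES = [
    (r".* YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* all.*", "IN WHAT WAY"),
    (r".* always.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(utterance):
    """Return an Eliza-style reply in upper case."""
    text = utterance
    for pattern, replacement in FIRST_TO_SECOND:
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    for pattern, template in REPLY_RULES:
        match = re.match(pattern, text)
        if match:
            # expand() fills \1 with the captured word.
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical fallback, not from the slide


print(respond("Men are all alike"))
print(respond("They're always bugging us about something or other"))
print(respond("He says I'm depressed much of the time"))
```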

Finite-state automata
Finite-state automata (FSA)
Regular languages
Regular expressions

Finite-state automata (machines)
baa! baaa! baaaa! baaaaa! ...
baa+!
[Figure: automaton with states q0, q1, q2, q3, q4 and arcs labeled b, a, a, !, plus a loop labeled a; annotations: state, transition, final state]

Input tape
[Figure: tape containing the symbols a b a ! b, with the machine in state q0]

Finite-state automata
Q: a finite set of N states q0, q1, …, qN−1
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states (a subset of Q)
δ(q,i): transition function

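State-transition tables
Transition table for the baa+! automaton (rows: states; columns: input symbols; "-" means no transition; state 4 is final):
State   b   a   !
0       1   -   -
1       -   2   -
2       -   3   -
3       -   3   4
4       -   -   -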

The FSM toolkit and friends
Developed at AT&T Research (Riley, Pereira, Mohri, Sproat)
Download:
Tutorial available
4 useful parts: FSM, Lextools, GRM, Dot (separate)
–/data2/tools/fsm-3.6/bin
–/data2/tools/lextools/bin
–/data2/tools/dot/bin

D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
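The pseudocode translates almost line for line into Python. The sketch below assumes the transition table is a dict keyed by (state, symbol) pairs and uses the baa+! automaton with states 0 through 4 as its example machine; the names `SHEEP_TABLE`, `SHEEP_ACCEPT`, and `d_recognize` are illustrative.

```python
# Transition table for baa+!; a missing key plays the role of an empty cell.
SHEEP_TABLE = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
SHEEP_ACCEPT = {4}


def d_recognize(tape, table=SHEEP_TABLE, start=0, accept=SHEEP_ACCEPT):
    """Return True (accept) or False (reject) for a deterministic FSA."""
    state = start
    for symbol in tape:
        next_state = table.get((state, symbol))
        if next_state is None:
            # Empty table cell: no legal transition, so reject.
            return False
        state = next_state
    # End of input: accept only if we stopped in an accept state.
    return state in accept


for s in ["baa!", "baaaa!", "ba!", "baa", "aba!b"]:
    print(s, d_recognize(s))
```

Rejecting on a missing table entry has the same effect as routing every undefined transition to an explicit fail state.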

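Adding a failing state
[Figure: the automaton with states q0, q1, q2, q3, q4 and an added fail state qF; the extra arcs into qF are labeled with the symbols (b, a, !) that have no other transition from each state]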

Languages and automata
Formal languages: regular languages, non-regular languages
Deterministic vs. non-deterministic FSAs
Epsilon (ε) transitions

Using NFSAs to accept strings
Backup: add markers at choice points, then possibly revisit unexplored markers
Look-ahead: look ahead in the input
Parallelism: look at alternatives in parallel

Using NFSAs
[State-transition table for the NFSA; input columns: b, a, !, ε]

More about FSAs
Transducers
Equivalence of DFSAs and NFSAs
Recognition as search: depth-first, breadth-first search

Recognition using NFSAs
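Recognition with a non-deterministic machine can be treated as a search over (state, tape position) pairs, with an agenda holding the choice points that remain to be explored. The sketch below is an assumption-laden illustration: the table maps (state, symbol) to a set of states, ε-arcs live in a separate dict, and `SHEEP_NFSA` is a hypothetical non-deterministic version of the baa+! machine. Popping from the end of the agenda gives depth-first search (backup); popping from the front gives breadth-first search.

```python
from collections import deque

# Hypothetical NFSA for baa+!: state 2 has two possible moves on "a".
SHEEP_NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},
    (3, "!"): {4},
}


def nd_recognize(tape, table, eps, start, accept, depth_first=True):
    """Agenda-based NFSA recognition; eps maps a state to its epsilon successors."""
    agenda = deque([(start, 0)])
    seen = set()  # search states already explored (also guards epsilon loops)
    while agenda:
        state, index = agenda.pop() if depth_first else agenda.popleft()
        if (state, index) in seen:
            continue
        seen.add((state, index))
        if index == len(tape) and state in accept:
            return True
        # Epsilon moves consume no input.
        for nxt in eps.get(state, ()):
            agenda.append((nxt, index))
        # Ordinary moves consume one symbol.
        if index < len(tape):
            for nxt in table.get((state, tape[index]), ()):
                agenda.append((nxt, index + 1))
    return False


print(nd_recognize("baaa!", SHEEP_NFSA, {}, 0, {4}))
print(nd_recognize("ba!", SHEEP_NFSA, {}, 0, {4}, depth_first=False))
```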

Regular languages
Operations on regular languages and FSAs: concatenation, closure, union
Properties of regular languages: closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, and Kleene closure

An exercise J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.

Morphology and Finite-State Transducers

Morphemes
Stems, affixes
Affixes: prefixes, suffixes, infixes (hingi (borrow) – humingi (agent) in Tagalog), circumfixes (sagen – gesagt in German)
Concatenative morphology
Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)

Morphological analysis
Examples: rewrites, unbelievably

Inflectional morphology
Tense, number, person, mood, aspect
Five verb forms in English
40+ forms in French
Six cases in Russian:
Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)

Derivational morphology
Nominalization: computerization, appointee, killer, fuzziness
Formation of adjectives: computational, embraceable, clueless

Finite-state morphological parsing
Cats: cat +N +PL
Cat: cat +N +SG
Cities: city +N +PL
Geese: goose +N +PL
Ducks: (duck +N +PL) or (duck +V +3SG)
Merging: merge +V +PRES-PART
Caught: (catch +V +PAST-PART) or (catch +V +PAST)

Principles of morphological parsing
Lexicon
Morphotactics (e.g., plural follows noun)
Orthography (easy → easier)
Irregular nouns: e.g., geese, sheep, mice
Irregular verbs: e.g., caught, ate, eaten
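The three ingredients (lexicon, morphotactics, orthography) can be illustrated with a toy parser for nouns and third-person verbs. This is a lookup-based approximation and not a finite-state transducer; the tiny lexicon and the names `parse` and `plural_stems` are made up for the example. It covers only the -s suffix, so forms such as "merging" or "caught" are out of scope, and it still overgenerates (it accepts "citys", for instance).

```python
# Lexicon (hypothetical, deliberately tiny).
REG_NOUNS = {"cat", "city", "duck", "fox"}
IRREG_SG_NOUNS = {"goose", "mouse", "sheep"}
IRREG_PL_NOUNS = {"geese": "goose", "mice": "mouse", "sheep": "sheep"}
REG_VERBS = {"duck"}

SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def plural_stems(word):
    """Orthography: undo the spelling changes triggered by the -s suffix."""
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")       # cities -> city
    if word.endswith("es") and word[:-2].endswith(SIBILANT_ENDINGS):
        candidates.append(word[:-2])             # foxes -> fox
    if word.endswith("s") and not word[:-1].endswith(SIBILANT_ENDINGS):
        candidates.append(word[:-1])             # cats -> cat
    return candidates


def parse(word):
    """Return all analyses of a surface form as 'stem +TAG +TAG' strings."""
    word = word.lower()
    analyses = []
    if word in REG_NOUNS or word in IRREG_SG_NOUNS:
        analyses.append(f"{word} +N +SG")
    if word in IRREG_PL_NOUNS:
        analyses.append(f"{IRREG_PL_NOUNS[word]} +N +PL")
    # Morphotactics: the suffix follows a stem listed in the lexicon.
    for stem in plural_stems(word):
        if stem in REG_NOUNS:
            analyses.append(f"{stem} +N +PL")
        if stem in REG_VERBS:
            analyses.append(f"{stem} +V +3SG")
    return analyses


for w in ["cats", "Cat", "cities", "geese", "ducks"]:
    print(w, parse(w))
```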

FSA for adjectives
Big, bigger, biggest
Cool, cooler, coolest, coolly
Red, redder, reddest
Clear, clearer, clearest, clearly, unclear, unclearly
Happy, happier, happiest, happily
Unhappy, unhappier, unhappiest, unhappily
What about: unbig, redly, and realest?

Using FSA for recognition
Is a string a legitimate word or not?
Two-level morphology: lexical level + surface level (Koskenniemi 1983)
Finite-state transducers (FST): used for regular relations
Inversion and composition of FST

Orthographic rules
Beg/begging
Make/making
Watch/watches
Try/tries
Panic/panicked
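Each pair on this slide corresponds to one spelling rule that applies when a suffix is attached to a stem. The sketch below is a rough, hypothetical approximation written as ordinary Python conditions rather than as transducers; the function name `attach` and the exact rule conditions are assumptions, and the rules are only tuned to the examples shown (they would mishandle stems such as "see").

```python
import re

VOWELS = "aeiou"


def attach(stem, suffix):
    """Attach a suffix to a stem, applying one of five spelling rules."""
    # K-insertion: panic + ed -> panicked
    if stem.endswith("c") and suffix in ("ed", "ing"):
        return stem + "k" + suffix
    # E-insertion: watch + s -> watches
    if suffix == "s" and stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"
    # Y-replacement: try + s -> tries (only after a consonant)
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # E-deletion: make + ing -> making
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # Consonant doubling: beg + ing -> begging
    # (stem = consonants + one vowel + one consonant other than w, x, y)
    if suffix[0] in VOWELS and re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("beg", "ing"), ("make", "ing"), ("watch", "s"),
                     ("try", "s"), ("panic", "ed"), ("cool", "er")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```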

Combining FST lexicon and rules
Cascades of transducers: the output of one becomes the input of another

Weighted Automata

Phonetic symbols
IPA
Arpabet
Examples

Using WFST for language modeling
Phonetic representation
Part-of-speech tagging

Word Classes and Part Of Speech Tagging

Some POS statistics
Preposition list from COBUILD
Single-word particles
Conjunctions
Pronouns
Modal verbs

Tagsets for English
Penn Treebank
Other tagsets (see Week 1 slides)

POS ambiguity
Degrees of ambiguity (DeRose 1988)
Rule-based POS tagging
–ENGTWOL (Voutilainen et al.)
–Sample rule: Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if
  (+1 A/ADV/QUANT);
  (+2 SENT-LIM);
  (NOT -1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag
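The Adverbial-That rule can be written as a constraint that prunes the set of candidate tags for "that". The sketch below is a hypothetical encoding, not ENGTWOL itself: a sentence is a list of (word, candidate tag set) pairs, the candidate tags given to "that" (ADV, DET, PRON, CS) are made up for illustration, and the end of the list is treated as a sentence limit.

```python
def adverbial_that(sentence, i):
    """Apply the Adverbial-That rule to the token at position i.

    sentence: list of (word, set_of_candidate_tags) pairs.
    Returns the pruned tag set for the token at position i.
    """
    _, tags = sentence[i]
    next_tags = sentence[i + 1][1] if i + 1 < len(sentence) else set()
    # Assumption: running off the end of the list counts as a sentence limit.
    after_next = sentence[i + 2][1] if i + 2 < len(sentence) else {"SENT-LIM"}
    prev_tags = sentence[i - 1][1] if i > 0 else set()

    if (next_tags & {"A", "ADV", "QUANT"}      # (+1 A/ADV/QUANT)
            and "SENT-LIM" in after_next       # (+2 SENT-LIM)
            and "SVOC/A" not in prev_tags):    # (NOT -1 SVOC/A)
        return tags & {"ADV"}                  # eliminate non-ADV tags
    return tags - {"ADV"}                      # eliminate ADV tag


THAT_TAGS = {"ADV", "DET", "PRON", "CS"}  # hypothetical candidate tags

odd = [("it", {"PRON"}), ("isn't", {"V"}), ("that", set(THAT_TAGS)),
       ("odd", {"A"}), (".", {"SENT-LIM"})]
consider = [("I", {"PRON"}), ("consider", {"V", "SVOC/A"}),
            ("that", set(THAT_TAGS)), ("odd", {"A"}), (".", {"SENT-LIM"})]

print(adverbial_that(odd, 2))
print(adverbial_that(consider, 2))
```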

Evaluating POS taggers
Percent correct
What is the lower bound on a system’s performance?
What about the upper bound?

Kappa
N: number of items (index i)
n: number of categories (index j)
k: number of annotators
When κ > .8, agreement is considered high
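The formula itself is not preserved on this slide. One common multi-annotator definition that uses exactly these symbols is κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed agreement and P(E) the agreement expected by chance (Fleiss-style kappa). The sketch below assumes that definition; the function name `kappa` and the input layout (one row per item, one count per category) are illustrative.

```python
def kappa(counts):
    """Multi-annotator kappa under the assumed Fleiss-style definition.

    counts[i][j] = number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators k (k >= 2).
    """
    N = len(counts)        # number of items
    n = len(counts[0])     # number of categories
    k = sum(counts[0])     # number of annotators

    # Observed agreement: fraction of agreeing annotator pairs, averaged over items.
    p_a = sum(
        sum(m * (m - 1) for m in row) / (k * (k - 1)) for row in counts
    ) / N

    # Chance agreement: sum of squared overall category proportions.
    p_j = [sum(row[j] for row in counts) / (N * k) for j in range(n)]
    p_e = sum(p * p for p in p_j)

    # Undefined (division by zero) when every judgment falls in one category.
    return (p_a - p_e) / (1 - p_e)


# Three annotators, two categories, four items, perfect agreement.
print(kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))
```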

Midterm reading list
Chapter 1 – Introduction
Chapter 2 – Regular expressions and automata
Chapter 3 – Morphology and finite-state transducers + FSM tutorial
Chapter 8 – Word classes and POS tagging
Chapter 9 – Context-free grammars for English
Chapter 10 – Parsing with context-free grammars
Chapter 11 – Features and unification

Syntaxscape
Written by Juno Suk of Lucent

Read by yourselves
9.9. Spoken language syntax
Grammar equivalence
Finite-state and context-free grammars