Topics Non-Determinism (NFSAs) Recognition of NFSAs Proof that regular expressions = FSAs Very brief sketch: Morphology, FSAs, FSTs Very brief sketch: Tokenization and Segmentation Very brief sketch: Minimum Edit Distance

Substitutions and Memory
Substitution: s/colour/color/
Global substitution (substitute as many times as possible): s/colour/color/g
Case-insensitive matching: s/colour/color/i
Memory ($1, $2, etc. refer back to matches):
/the (.*)er they were, the $1er they will be/
/the (.*)er they (.*), the $1er they $2/
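A minimal sketch of the same operations using Python's re module (Python writes group references as \1 rather than $1; the example strings are only illustrative):

import re

text = "the colour chart shows every colour"

# s/colour/color/      (replace the first occurrence only)
print(re.sub("colour", "color", text, count=1))

# s/colour/color/g     (replace every occurrence)
print(re.sub("colour", "color", text))

# s/colour/color/i     (case-insensitive matching)
print(re.sub("colour", "color", "COLOUR me surprised", flags=re.IGNORECASE))

# Memory: \1 must match whatever the first group matched
m = re.search(r"the (.*)er they were, the \1er they will be",
              "the bigger they were, the bigger they will be")
print(m.group(1))   # bigg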

Eliza [Weizenbaum, 1966] User: Men are all alike ELIZA: IN WHAT WAY User: They’re always bugging us about something or other ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE? User: Well, my boyfriend made me come here ELIZA: YOUR BOYFRIEND MADE YOU COME HERE User: He says I’m depressed much of the time ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED

Eliza-style regular expressions
s/.* YOU ARE (depressed|sad).*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad).*/WHY DO YOU THINK YOU ARE \1/
s/.* all.*/IN WHAT WAY/
s/.* always.*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
Step 1: replace first-person references with second-person references
s/\bI('m| am)\b/YOU ARE/g
s/\bmy\b/YOUR/g
s/\bmine\b/YOURS/g
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
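A toy sketch of that three-step cascade in Python; the rule lists are illustrative stand-ins, not Weizenbaum's actual ranked keyword rules, and scoring (step 3) is omitted:

import re

PRONOUN_RULES = [               # Step 1: flip first person to second person
    (r"\bI'm\b|\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bmine\b", "YOURS"),
]
REPLY_RULES = [                 # Step 2: patterns that trigger canned replies
    (r".*YOU ARE (depressed|sad).*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".*\ball\b.*", "IN WHAT WAY"),
    (r".*\balways\b.*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def eliza_reply(utterance):
    for pattern, repl in PRONOUN_RULES:
        utterance = re.sub(pattern, repl, utterance, flags=re.IGNORECASE)
    # the first reply rule that matches wins (no scoring in this sketch)
    for pattern, repl in REPLY_RULES:
        if re.fullmatch(pattern, utterance, flags=re.IGNORECASE):
            return re.sub(pattern, repl, utterance, flags=re.IGNORECASE)
    return "PLEASE GO ON"

print(eliza_reply("He says I'm depressed much of the time"))
# I AM SORRY TO HEAR YOU ARE depressed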

Summary on REs so far Regular expressions are perhaps the single most useful tool for text manipulation: compilers, text editing, sequence analysis in bioinformatics, etc. Eliza shows you can do a lot with simple regular-expression substitutions

Three Views Three equivalent formal ways to characterize the regular languages: Regular Expressions, Finite State Automata, Regular Grammars

Finite State Automata Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata Regular expressions are one way of specifying the structure of finite-state automata. FSAs and their close relatives are at the core of most algorithms for speech and language processing.

Finite-state Automata (Machines) The sheep language /^baa+!$/: [state diagram: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with a self-loop labeled a on q3; the circles are states, the arcs are transitions, and q4 is the final state] Accepts: baa!, baaa!, baaaa!, baaaaa!, ...

Sheep FSA We can say the following things about this machine It has 5 states At least b, a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions

But note There are other machines that correspond to this language; the machine shown here is an NFA.

More Formally: Defining an FSA You can specify an FSA by enumerating the following things:
The set of states Q
A finite alphabet Σ
A start state q0
A set F of accepting/final states, F ⊆ Q
A transition function δ(q, i) that maps Q × Σ to Q
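As a concrete sketch, here is the sheep FSA written out as that five-part specification in Python (the dict-of-pairs encoding of δ is just one convenient choice):

# The sheep FSA /baa+!/ written out as its five parts.
Q = {"q0", "q1", "q2", "q3", "q4"}       # states
SIGMA = {"b", "a", "!"}                  # alphabet
START = "q0"                             # start state
FINAL = {"q4"}                           # accepting states, a subset of Q
DELTA = {                                # transition function: (state, symbol) -> state
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",                   # the self-loop that handles extra a's
    ("q3", "!"): "q4",
}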

Yet Another View The state-transition table for the sheep machine (rows are states, columns are input symbols; an empty cell means no transition; state 4 is the accept state):
State   b   a   !
0       1   -   -
1       -   2   -
2       -   3   -
3       -   3   4
4       -   -   -

Recognition Recognition is the process of determining whether a string is accepted by a machine; equivalently, the process of determining whether the string is in the language we're defining with the machine, or whether a regular expression matches a string

Recognition Think of the input as being stored on a tape. The read head will read the input from left to right, one symbol at a time.

Recognition Start in the start state. Examine the current input symbol. Consult the table. Go to a new state and update the tape pointer. Repeat until you run out of tape.

Input Tape example: reading a b a ! b, the machine starts in q0 and finds no legal transition on the first a, so the string is rejected: REJECT

Input Tape example: reading b a a a !, the machine moves through q0 q1 q2 q3 q3 q4 and ends in the accept state: ACCEPT

Adding a failing state [diagram: the sheep FSA with an explicit fail state qF; from each state, every input symbol that has no legal transition (the stray a, b, and ! arcs) leads to qF]

D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then return accept
      else return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
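A runnable Python translation of D-RECOGNIZE, as a sketch (the table encoding and the names DELTA, START, FINAL are this sketch's own, carried over from the FSA specification above and repeated here so the snippet runs on its own):

# The sheep transition table again.
DELTA = {("q0", "b"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q3",
         ("q3", "a"): "q3", ("q3", "!"): "q4"}
START, FINAL = "q0", {"q4"}

def d_recognize(tape, delta=DELTA, start=START, final=FINAL):
    """Deterministic, table-driven recognition in the style of D-RECOGNIZE."""
    state = start
    for symbol in tape:
        if (state, symbol) not in delta:   # empty table cell: reject
            return False
        state = delta[(state, symbol)]
    return state in final                  # out of tape: accept iff in a final state

print(d_recognize("baaa!"))   # True
print(d_recognize("aba!b"))   # False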

Tracing D-Recognize

Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter. The algorithm is universal for all regular languages. To change the machine, you change the table.

Key Points To perform regular expression matching, translate the expression into a machine (a table) and pass the table to an interpreter

Generative Formalisms Formal languages are sets of strings composed of symbols from a finite alphabet. Finite-state automata define formal languages. Generative vs. accepting models: some models (e.g. grammars) generate strings, while other models (e.g. automata) accept them.

Non-determinism A deterministic automaton is one whose behavior during recognition is fully determined by the state it is in and the symbol it is looking at. Non-determinism: more than one choice. If one of the paths leads to acceptance, we say the input is accepted. The rules of a solitaire game can be viewed as non-deterministic (choice is what makes the game interesting).

Non-Determinism

Non-Determinism cont. Yet another technique: epsilon (ε) transitions. These transitions do not examine or advance the tape during recognition.

NFSA = FSA Non-deterministic machines can be converted to deterministic ones with a fairly simple construction. That means that they have the same power: non-deterministic machines are not more powerful than deterministic ones. It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one.
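The "fairly simple construction" is the subset construction: each state of the deterministic machine stands for the set of NFSA states the machine could be in. A compact sketch, assuming the NFSA is encoded as a dict from (state, symbol) to a set of successor states and has no epsilon arcs:

from collections import deque

def determinize(nfa_delta, start, finals, alphabet):
    """Subset construction: DFA states are frozensets of NFA states."""
    start_set = frozenset([start])
    dfa_delta, dfa_finals = {}, set()
    queue, seen = deque([start_set]), {start_set}
    while queue:
        state_set = queue.popleft()
        if state_set & finals:
            dfa_finals.add(state_set)
        for sym in alphabet:
            # union of all NFA moves from any state in the set
            target = frozenset(s2 for s in state_set
                               for s2 in nfa_delta.get((s, sym), set()))
            dfa_delta[(state_set, sym)] = target
            if target and target not in seen:
                seen.add(target)
                queue.append(target)
    return dfa_delta, start_set, dfa_finals

# The non-deterministic sheep machine (the example below): q2 loops on a.
NFA = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
       ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
dfa_delta, dfa_start, dfa_finals = determinize(NFA, "q0", {"q4"}, {"b", "a", "!"})
print(dfa_finals)   # {frozenset({'q4'})}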

Non-Deterministic Recognition In an ND FSA there exists at least one path through the machine for any string that is in the language defined by the machine. But not all paths through the machine for an accepted string lead to an accept state. No path through the machine leads to an accept state for a string not in the language.

Non-Deterministic Recognition So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. Failure occurs when none of the possible paths lead to an accept state.

Example The NFSA version of the sheep machine: reading b a a a !, one successful path is q0 q1 q2 q2 q3 q4 (on an a in q2 the machine can either stay in q2 or move on to q3)

Using NFSA to accept strings In general, the solutions to the problem of choice in non-deterministic models are: Backup: when we come to a choice point, put down a marker indicating where we are in the tape and what the state is. Lookahead. Parallelism.

Key AI idea: Search We model problem-solving as a search for a solution Through a space of possible solutions. The space consists of states States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input.

Two kinds of search Depth-first search Explore one path all the way to the end Then backup And try other paths Breadth-first search Explore all the paths simultaneously Incrementally extending each tier of the paths
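A runnable sketch of non-deterministic recognition as search: the agenda holds (machine state, tape position) pairs, and popping from the end versus the front of the agenda gives depth-first versus breadth-first exploration (the NFSA encoding is the same illustrative dict-of-sets used in the subset-construction sketch above):

def nd_recognize(tape, nfa_delta, start, finals, depth_first=True):
    """Accept if any path consumes the whole tape and ends in a final state."""
    agenda = [(start, 0)]
    while agenda:
        # stack behaviour = depth-first, queue behaviour = breadth-first
        state, pos = agenda.pop() if depth_first else agenda.pop(0)
        if pos == len(tape):
            if state in finals:
                return True
            continue
        for nxt in nfa_delta.get((state, tape[pos]), set()):
            agenda.append((nxt, pos + 1))
    return False

NFA = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
       ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
print(nd_recognize("baaa!", NFA, "q0", {"q4"}))                      # True
print(nd_recognize("baaa!", NFA, "q0", {"q4"}, depth_first=False))   # True
print(nd_recognize("ba!", NFA, "q0", {"q4"}))                        # False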

Depth-first search example

NFSA Recognition of “baaa!”

Breadth-first Recognition of "baaa!" (figure; a note on the original slide points out that one state label should read q2)

Three Views Three equivalent formal ways to characterize the regular languages: Regular Expressions, Finite State Automata, Regular Grammars

Regular languages Regular languages are characterized by FSAs For every NFSA, there is an equivalent DFSA. Regular languages are closed under concatenation, Kleene closure, union.

Regular languages The class of languages characterizable by regular expressions. Given an alphabet Σ, the regular languages over Σ are defined as follows:
The empty set ∅ is a regular language
For each a ∈ Σ, {a} is a regular language
If L1 and L2 are regular languages, then so are:
L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
L1 ∪ L2, the union of L1 and L2
L1*, the Kleene closure of L1

Going from regular expression to FSA Since all regular languages have the above properties, and the regular languages are exactly the languages characterized by regular expressions, all regular expression operators can be implemented by combinations of concatenation, union (disjunction), and Kleene closure

from reg exp to FSA So if we can show how to turn closure, union, and concatenation from regexps into FSAs, this gives an idea of how FSA compilation works.
The actual proof that regular languages = FSAs has 2 parts:
An FSA can be built for each regular language
A regular language can be built for each automaton
Here is the intuition for the first part: take any regular expression and build an automaton.
Intuition: induction
Base case: build an automaton for a single symbol (say 'a'), as well as for epsilon and the empty language
Inductive step: show how to imitate the 3 regexp operations in automata (union, concatenation, Kleene closure; see the next slides and the code sketch after them)

Union Accept a string in either of two languages

Concatenation Accept a string consisting of a string from language L1 followed by a string from language L2.

Kleene Closure Accept a string consisting of a string from language L1 repeated zero or more times.
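A sketch of those three constructions on epsilon-NFAs, in the spirit of Thompson's construction. The representation (a dict of transitions plus a single start and a single final state, with "" standing for epsilon) is this sketch's own simplification:

import itertools

_counter = itertools.count()
def _new_state():
    return f"s{next(_counter)}"

def symbol_nfa(a):
    """Base case: an NFA accepting exactly the one-symbol string a."""
    s, f = _new_state(), _new_state()
    return {"delta": {(s, a): {f}}, "start": s, "final": f}

def _merged(*nfas):
    """Copy and union the transition dicts of several NFAs."""
    d = {}
    for n in nfas:
        for k, v in n["delta"].items():
            d.setdefault(k, set()).update(v)
    return d

def concat(n1, n2):
    """L1 . L2: an epsilon arc from n1's final state into n2's start state."""
    d = _merged(n1, n2)
    d.setdefault((n1["final"], ""), set()).add(n2["start"])
    return {"delta": d, "start": n1["start"], "final": n2["final"]}

def union(n1, n2):
    """L1 | L2: a new start state with epsilon arcs into both machines."""
    s, f = _new_state(), _new_state()
    d = _merged(n1, n2)
    d[(s, "")] = {n1["start"], n2["start"]}
    d.setdefault((n1["final"], ""), set()).add(f)
    d.setdefault((n2["final"], ""), set()).add(f)
    return {"delta": d, "start": s, "final": f}

def star(n):
    """L*: allow skipping the machine entirely or looping back to its start."""
    s, f = _new_state(), _new_state()
    d = _merged(n)
    d[(s, "")] = {n["start"], f}
    d.setdefault((n["final"], ""), set()).update({n["start"], f})
    return {"delta": d, "start": s, "final": f}

# Another way to build the sheep language: b a a a* !
sheep = concat(symbol_nfa("b"),
               concat(symbol_nfa("a"),
                      concat(symbol_nfa("a"),
                             concat(star(symbol_nfa("a")),
                                    symbol_nfa("!")))))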

Summary so far Finite State Automata Deterministic Recognition of FSAs Non-Determinism (NFSAs) Recognition of NFSAs (sketch of) Proof that regular expressions = FSAs

FSAs and Computational Morphology An important use of FSAs is for morphology, the study of word parts

English Morphology Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes We can usefully divide morphemes into two classes Stems: The core meaning bearing units Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions

Nouns and Verbs (English) Nouns are simple (not really): markers for plural and possessive. Verbs are only slightly more complex: markers appropriate to the tense of the verb.

Regulars and Irregulars Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) Mouse/mice, goose/geese, ox/oxen Go/went, fly/flew The terms regular and irregular will be used to refer to words that follow the rules and those that don’t.

Regular and Irregular Nouns and Verbs Regulars… Walk, walks, walking, walked, walked Table, tables Irregulars Eat, eats, eating, ate, eaten Catch, catches, catching, caught, caught Cut, cuts, cutting, cut, cut Goose, geese

Compute Many paths are possible... Start with compute
computer -> computerize -> computerization
computation -> computational
computer -> computerize -> computerizable
compute -> computee

Why care about morphology? `Stemming’ in information retrieval Might want to search for “going home” and find pages with both “went home” and “will go home” Morphology in machine translation Need to know that the Spanish words quiero and quieres are both related to querer ‘want’ Morphology in spell checking Need to know that misclam and antiundoggingly are not words despite being made up of word parts

Can't just list all words Agglutinative languages (e.g. Turkish):
Uygarlastiramadiklarimizdanmissinizcasina
`(behaving) as if you are among those whom we could not civilize'
Uygar `civilized' + las `become' + tir `cause' + ama `not able' + dik `past' + lar `plural' + imiz `p1pl' + dan `abl' + mis `past' + siniz `2pl' + casina `as if'

What we want Something to automatically do the following kinds of mappings:
cats -> cat +N +PL
cat -> cat +N +SG
cities -> city +N +PL
merging -> merge +V +Present-participle
caught -> catch +V +Past-participle

Morphological Parsing: Goal

FSAs and the Lexicon This will actually require a kind of FSA: the Finite State Transducer (FST). First we'll capture the morphotactics, the rules governing the ordering of affixes in a language. Then we'll add in the actual words.

Building a Morphological Parser Three components: Lexicon Morphotactics Orthographic or Phonological Rules

Lexicon: FSA Inflectional Noun Morphology English noun lexicon:
reg-noun: fox, cat, dog
irreg-pl-noun: geese, sheep, mice
irreg-sg-noun: goose, sheep, mouse
plural: -s
English noun rule: [FSA over these classes: a reg-noun may be followed by the plural -s, while irregular singular and plural nouns are complete words on their own]
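A small sketch that checks the same noun morphotactics directly in Python rather than as an explicit state machine (word lists abbreviated to the ones on the slide):

REG_NOUN = {"fox", "cat", "dog"}
IRREG_SG = {"goose", "sheep", "mouse"}
IRREG_PL = {"geese", "sheep", "mice"}

def accepts_noun(word):
    """Accept reg-noun, reg-noun + -s, irreg-sg-noun, or irreg-pl-noun."""
    if word in REG_NOUN or word in IRREG_SG or word in IRREG_PL:
        return True
    # reg-noun followed by the plural morpheme -s
    return word.endswith("s") and word[:-1] in REG_NOUN

for w in ["cat", "cats", "geese", "gooses", "foxs"]:
    print(w, accepts_noun(w))
# Note: like the morphotactic FSA itself, this accepts "foxs"; turning
# fox + -s into "foxes" is the job of the orthographic rules (component 3).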

Lexicon and Rules: FSA English Verb Inflectional Morphology
reg-verb-stem: walk, fry, talk, impeach
irreg-verb-stem: cut, speak, sing
irreg-past-verb: caught, ate, eaten, sang, spoken
past: -ed
past-part: -ed
pres-part: -ing
3sg: -s

More Complex Derivational Morphology

Using FSAs for Recognition: English Nouns and Inflection

Parsing/Generation vs. Recognition With FSAs we can only recognize words, but this isn't the same as parsing. Parsing: building structure. Usually if we find some string in the language we need to find the structure in it (parsing), or we have some structure and we want to produce a surface form (production/generation). Example: from "cats" to "cat +N +PL"

Finite State Transducers The simple story: add another tape, and add extra symbols to the transitions. On one tape we read "cats"; on the other we write "cat +N +PL".
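A toy sketch of the two-tape idea: arcs are labeled with an input symbol and an output string, and following them maps a surface form to a lexical form. The three-arc lexicon and state names are invented for the example; a real nominal-inflection FST is much larger:

# Each arc: (state, input-symbol) -> (next-state, output-string).
# "" as the input symbol means an epsilon arc (write output without reading).
FST = {
    ("start", "c"): ("c1", "c"),
    ("c1", "a"):    ("c2", "a"),
    ("c2", "t"):    ("stem", "t"),
    ("stem", ""):   ("sg", " +N +SG"),     # bare stem: singular reading
    ("stem", "s"):  ("pl", " +N +PL"),     # plural -s on the surface tape
}
FINALS = {"sg", "pl"}

def transduce(surface):
    """Follow surface symbols, preferring real arcs over epsilon arcs."""
    state, out, i = "start", "", 0
    while True:
        if i < len(surface) and (state, surface[i]) in FST:
            state, piece = FST[(state, surface[i])]
            out += piece
            i += 1
        elif (state, "") in FST:
            state, piece = FST[(state, "")]
            out += piece
        else:
            break
    return out if state in FINALS and i == len(surface) else None

print(transduce("cats"))   # cat +N +PL
print(transduce("cat"))    # cat +N +SG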

Nominal Inflection FST

Some on-line demos Finite state automata demos ent-analysis/fsCompiler/fsinput.html Finite state morphology ent-analysis/demos/english

4. Tokenization Segmenting words in running text; segmenting sentences in running text. Why not just periods and white-space?
Mr. Sherwood said reaction to Sea Containers' proposal has been "very positive." In New York Stock Exchange composite trading yesterday, Sea Containers closed at $62.625, up 62.5 cents.
"I said, 'what're you? Crazy?'" said Sadowsky. "I can't afford to do that."
Splitting on whitespace alone would give "words" like: "cents.", "said,", "positive.", "Crazy?"

Can't just segment on punctuation Word-internal punctuation: m.p.h., Ph.D., AT&T, 01/02/06, Google.com, 555,
Expanding clitics: what're -> what are; I'm -> I am
Multi-token words: New York; rock 'n' roll

Sentence Segmentation ! and ? are relatively unambiguous. The period "." is quite ambiguous: it can be a sentence boundary or part of an abbreviation like Inc. or Dr. General idea: build a binary classifier that looks at each "." and decides EndOfSentence/NotEOS; it could use hand-written rules or machine learning (a hand-written-rules sketch follows below).
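A hand-written-rules version of such a classifier, as a sketch; the abbreviation list and the capital-letter heuristic are illustrative assumptions, not a complete rule set:

import re

ABBREVIATIONS = {"mr.", "dr.", "inc.", "st.", "vs.", "e.g.", "i.e."}

def is_eos(text, i):
    """Decide whether the '.' at position i of text ends a sentence."""
    assert text[i] == "."
    # the token the period is attached to
    start = text.rfind(" ", 0, i) + 1
    token = text[start:i + 1].lower()
    if token in ABBREVIATIONS:
        return False
    if re.match(r"^\d+\.$", token):          # a period inside a number
        return False
    # next non-space character: a capital letter (or end of text) suggests EOS
    rest = text[i + 1:].lstrip()
    return rest == "" or rest[0].isupper()

s = "Dr. Smith closed at $62.625, up 62.5 cents. He was pleased."
for i, ch in enumerate(s):
    if ch == ".":
        print(i, is_eos(s, i))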

Word Segmentation in Chinese Some languages don't have spaces: Chinese, Japanese, Thai, Khmer. In Chinese, words are composed of characters; characters are generally 1 syllable and 1 morpheme, and the average word is 2.4 characters long. Standard segmentation algorithm: Maximum Matching (also called Greedy)

Maximum Matching Word Segmentation Given a wordlist of Chinese and a string:
1) Start a pointer at the beginning of the string
2) Find the longest word in the dictionary that matches the string starting at the pointer
3) Move the pointer over the word in the string
4) Go to 2
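A direct Python sketch of the algorithm; the toy wordlist and the fall-back of emitting a single character when nothing matches are this sketch's assumptions:

def max_match(text, wordlist):
    """Greedy left-to-right segmentation: always take the longest dictionary match."""
    words, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):   # try the longest candidate first
            if text[i:j] in wordlist:
                match = text[i:j]
                break
        if match is None:
            match = text[i]                 # no dictionary word: emit one character
        words.append(match)
        i += len(match)
    return words

# Palmer's English illustration of how greedy matching can go wrong
wordlist = {"the", "theta", "table", "bled", "down", "own", "there"}
print(max_match("thetabledownthere", wordlist))
# ['theta', 'bled', 'own', 'there']   (intended: the table down there)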

English example (Palmer 00) Input: the table down there. Greedy matching yields: theta bled own there. Maximum matching works astonishingly well in Chinese, far better than this English example suggests. Modern algorithms are better still: probabilistic segmentation.

5. Spell-checking and Edit Distance Non-word error detection: detecting "graffe" Non-word error correction: figuring out that "graffe" should be "giraffe" Context-dependent error detection and correction: figuring out that "war and piece" should be "war and peace"

Non-word error detection Any word not in a dictionary is assumed to be a spelling error. Need a big dictionary! What to use? An FST dictionary!

Isolated word error correction How do I fix "graffe"? Search through all words: graf, craft, grail, giraffe. Pick the one that's closest to graffe. What does "closest" mean? We need a distance metric. The simplest one: edit distance. (More sophisticated probabilistic ones: noisy channel.)

Edit Distance The minimum edit distance between two strings is the minimum number of editing operations (insertion, deletion, substitution) needed to transform one into the other

Minimum Edit Distance If each operation has a cost of 1, the distance between intention and execution is 5. If substitutions cost 2 (Levenshtein), the distance between them is 8.

[DP chart for the minimum edit distance between intention (rows, reading up from #: i n t e n t i o n) and execution (columns: # e x e c u t i o n), filled in cell by cell over several slides]

Suppose we want the alignment too We can keep a “backtrace” Every time we enter a cell, remember where we came from Then when we reach the end, we can trace back from the upper right corner to get an alignment

[The same DP chart with backpointers added; tracing them back from the upper-right corner recovers the alignment between intention and execution]
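A runnable sketch that fills the chart and keeps a backtrace, with insertions and deletions at cost 1 and a configurable substitution cost (1 for the plain count of operations, 2 for the Levenshtein weighting mentioned above):

def min_edit_distance(source, target, sub_cost=2):
    """DP over a (len(source)+1) x (len(target)+1) chart, with backpointers."""
    n, m = len(source), len(target)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "del"
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = D[i-1][j-1] + (0 if source[i-1] == target[j-1] else sub_cost)
            choices = [(sub, "sub"), (D[i-1][j] + 1, "del"), (D[i][j-1] + 1, "ins")]
            D[i][j], ptr[i][j] = min(choices)
    # backtrace from the upper-right corner to recover one alignment
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        op = ptr[i][j]
        ops.append(op)
        if op == "sub":
            i, j = i - 1, j - 1
        elif op == "del":
            i -= 1
        else:
            j -= 1
    return D[n][m], list(reversed(ops))

print(min_edit_distance("intention", "execution", sub_cost=1))   # (5, [...])
print(min_edit_distance("intention", "execution")[0])            # 8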

Summary Minimum Edit Distance A “dynamic programming” algorithm We will see a probabilistic version of this called “Viterbi”

Summary Finite State Automata Deterministic Recognition of FSAs Non-Determinism (NFSAs) Recognition of NFSAs Proof that regular expressions = FSAs Very brief sketch: Morphology, FSAs, FSTs Very brief sketch: Tokenization Minimum Edit Distance