LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Dan Jurafsky 11/24/2018 LING 138/238 Autumn 2004.

Slides:



Advertisements
Similar presentations
Finite State Automata. A very simple and intuitive formalism suitable for certain tasks A bit like a flow chart, but can be used for both recognition.
Advertisements

CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Natural Language Processing Lecture 3—9/3/2013 Jim Martin.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.
Finite-state automata 2 Day 13 LING Computational Linguistics Harry Howard Tulane University.
Regular Expressions (RE) Used for specifying text search strings. Standarized and used widely (UNIX: vi, perl, grep. Microsoft Word and other text editors…)
6/2/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
1 Regular Expressions and Automata September Lecture #2-2.
Finite Automata Great Theoretical Ideas In Computer Science Anupam Gupta Danny Sleator CS Fall 2010 Lecture 20Oct 28, 2010Carnegie Mellon University.
Lecture 3: Closure Properties & Regular Expressions Jim Hook Tim Sheard Portland State University.
Computational language: week 9 Finish finite state machines FSA’s for modelling word structure Declarative language models knowledge representation and.
Aho-Corasick String Matching An Efficient String Matching.
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Topics Non-Determinism (NFSAs) Recognition of NFSAs Proof that regular expressions = FSAs Very brief sketch: Morphology, FSAs, FSTs Very brief sketch:
Introduction to English Morphology Finite State Transducers
Finite-state automata 3 Morphology Day 14 LING Computational Linguistics Harry Howard Tulane University.
Lecture 3, 7/27/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 4 28 July 2005.
October 2006Advanced Topics in NLP1 CSA3050: NLP Algorithms Finite State Transducers for Morphological Parsing.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
10/8/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
Session 11 Morphology and Finite State Transducers Introduction to Speech Natural and Language Processing (KOM422 ) Credits: 3(3-0)
Chapter 2. Regular Expressions and Automata From: Chapter 2 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition,
Lexical Analysis Constructing a Scanner from Regular Expressions.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
Natural Language Processing Lecture 2—1/15/2015 Susan W. Brown.
Lexical Analysis: Finite Automata CS 471 September 5, 2007.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
CSA3050: Natural Language Algorithms Finite State Devices.
October 2007Natural Language Processing1 CSA3050: Natural Language Algorithms Words and Finite State Machinery.
1/11/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
BİL711 Natural Language Processing1 Regular Expressions & FSAs Any regular expression can be realized as a finite state automaton (FSA) There are two kinds.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
June 13, 2016 Prof. Abdelaziz Khamis 1 Chapter 2 Scanning – Part 2.
1/29/02CSE460 - MSU1 Nondeterminism-NFA Section 4.1 of Martin Textbook CSE460 – Computability & Formal Language Theory Comp. Science & Engineering Michigan.
WELCOME TO A JOURNEY TO CS419 Dr. Hussien Sharaf Dr. Mohammad Nassef Department of Computer Science, Faculty of Computers and Information, Cairo University.
CIS Automata and Formal Languages – Pei Wang
CIS, Ludwig-Maximilians-Universität München Computational Morphology
CPSC 503 Computational Linguistics
Speech and Language Processing
Non Deterministic Automata
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
PDAs Accept Context-Free Languages
Pushdown Automata.
Two issues in lexical analysis
CPSC 503 Computational Linguistics
Natural Language Processing
CSCI 5832 Natural Language Processing
REGULAR LANGUAGES AND REGULAR GRAMMARS
Speech and Language Processing
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
Non-Deterministic Finite Automata
CSC NLP - Regex, Finite State Automata
CSCI 5832 Natural Language Processing
Lexical Analysis — Part II: Constructing a Scanner from Regular Expressions Copyright 2003, Keith D. Cooper, Ken Kennedy & Linda Torczon, all rights reserved.
CPSC 503 Computational Linguistics
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
Chapter 1 Regular Language
CPSC 503 Computational Linguistics
Lecture 5 Scanning.
Morphological Parsing
CHAPTER 1 Regular Languages
Regular grammars Programming Language Design and Implementation (4th Edition) by T. Pratt and M. Zelkowitz Prentice Hall, 2001 Section
Presentation transcript:

LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Dan Jurafsky 11/24/2018 LING 138/238 Autumn 2004

Today 9/30 Week 1 Finite State Automata Deterministic Recognition of FSAs Non-Determinism (NFSAs) Recognition of NFSAs Proof that regular expressions = FSAs Very brief sketch: Morphology, FSAs, FSTs Why finite-state machines are so great. 11/24/2018 LING 138/238 Autumn 2004

Three Views Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay) Regular Expressions Finite State Automata Regular Languages 11/24/2018 LING 138/238 Autumn 2004

Finite State Automata Terminology: Finite State Automata, Finite State Machines, FSA, Finite Automata Regular expressions are one way of specifying the structure of finite-state automata. FSAs and their close relatives are at the core of most algorithms for speech and language processing. 11/24/2018 LING 138/238 Autumn 2004

Intuition: FSAs as Graphs Let’s start with the sheep language from the text /baa+!/ 11/24/2018 LING 138/238 Autumn 2004

Sheep FSA We can say the following things about this machine It has 5 states At least b,a, and ! are in its alphabet q0 is the start state q4 is an accept state It has 5 transitions 11/24/2018 LING 138/238 Autumn 2004

But note There are other machines that correspond to this language More on this one later 11/24/2018 LING 138/238 Autumn 2004

More Formally: Defining an FSA You can specify an FSA by enumerating the following things. The set of states: Q A finite alphabet: Σ A start state q0 A set F of accepting/final states FQ A transition function (q,i) that maps QxΣ to Q 11/24/2018 LING 138/238 Autumn 2004

Yet Another View State-transition table 11/24/2018 LING 138/238 Autumn 2004

Recognition Recognition is the process of determining if a string should be accepted by a machine Or… it’s the process of determining if a string is in the language we’re defining with the machine Or… it’s the process of determining if a regular expression matches a string 11/24/2018 LING 138/238 Autumn 2004

Recognition Traditionally, (Turing’s idea) this process is depicted with a tape. 11/24/2018 LING 138/238 Autumn 2004

Recognition Start in the start state Examine the current input Consult the table Go to a new state and update the tape pointer. Until you run out of tape. 11/24/2018 LING 138/238 Autumn 2004

D-Recognize 11/24/2018 LING 138/238 Autumn 2004

Tracing D-Recognize 11/24/2018 LING 138/238 Autumn 2004

Key Points Deterministic means that at each point in processing there is always one unique thing to do (no choices). D-recognize is a simple table-driven interpreter The algorithm is universal for all unambiguous languages. To change the machine, you change the table. 11/24/2018 LING 138/238 Autumn 2004

Key Points Crudely therefore… matching strings with regular expressions (ala Perl) is a matter of translating the expression into a machine (table) and passing the table to an interpreter 11/24/2018 LING 138/238 Autumn 2004

Recognition as Search You can view this algorithm as state-space search. States are pairings of tape positions and state numbers. Operators are compiled into the table Goal state is a pairing with the end of tape position and a final accept state 11/24/2018 LING 138/238 Autumn 2004

Generative Formalisms Formal Languages are sets of strings composed of symbols from a finite set of symbols. Finite-state automata define formal languages (without having to enumerate all the strings in the language) The term Generative is based on the view that you can run the machine as a generator to get strings from the language. 11/24/2018 LING 138/238 Autumn 2004

Generative Formalisms FSAs can be viewed from two perspectives: Acceptors that can tell you if a string is in the language Generators to produce all and only the strings in the language 11/24/2018 LING 138/238 Autumn 2004

Dollars and Cents 11/24/2018 LING 138/238 Autumn 2004

Summary Regular expressions are just a compact textual representation of FSAs Recognition is the process of determining if a string/input is in the language defined by some machine. Recognition is straightforward with deterministic machines. 11/24/2018 LING 138/238 Autumn 2004

Three Views Three equivalent formal ways to look at what we’re up to (thanks to Martin Kay) Regular Expressions Finite State Automata Regular Languages 11/24/2018 LING 138/238 Autumn 2004

Regular Languages More on these in a couple of weeks S → b a a A 11/24/2018 LING 138/238 Autumn 2004

Non-Determinism 11/24/2018 LING 138/238 Autumn 2004

Non-Determinism cont. Yet another technique Epsilon transitions Key point: these transitions do not examine or advance the tape during recognition ε 11/24/2018 LING 138/238 Autumn 2004

Equivalence Non-deterministic machines can be converted to deterministic ones with a fairly simple construction That means that they have the same power; non-deterministic machines are not more powerful than deterministic ones It also means that one way to do recognition with a non-deterministic machine is to turn it into a deterministic one. 11/24/2018 LING 138/238 Autumn 2004

Non-Deterministic Recognition In a ND FSA there exists at least one path through the machine for a string that is in the language defined by the machine. But not all paths directed through the machine for an accept string lead to an accept state. No paths through the machine lead to an accept state for a string not in the language. 11/24/2018 LING 138/238 Autumn 2004

Non-Deterministic Recognition So success in a non-deterministic recognition occurs when a path is found through the machine that ends in an accept. Failure occurs when none of the possible paths lead to an accept state. 11/24/2018 LING 138/238 Autumn 2004

Example b a a a ! \ q0 q1 q2 q2 q3 q4 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Example 11/24/2018 LING 138/238 Autumn 2004

Key Points States in the search space are pairings of tape positions and states in the machine. By keeping track of as yet unexplored states, a recognizer can systematically explore all the paths through the machine given an input. 11/24/2018 LING 138/238 Autumn 2004

ND-Recognize Code 11/24/2018 LING 138/238 Autumn 2004

Infinite Search If you’re not careful such searches can go into an infinite loop. How? 11/24/2018 LING 138/238 Autumn 2004

Why Bother? Non-determinism doesn’t get us more formal power and it causes headaches so why bother? More natural solutions Machines based on construction are too big 11/24/2018 LING 138/238 Autumn 2004

Regular languages The class of languages characterizable by regular expressions Given alphabet , the reg. lgs. over  is: The empty set  is a regular language a    , {a} is a regular language If L1 and L2 are regular lgs, then so are: L1 · L2 = {xy|x  L1,y  L2}, concatenation of L1 & L2 L1  L2, the union of L1 and L2 L1*, the Kleene closure of L1 11/24/2018 LING 138/238 Autumn 2004

Going from regexp to FSA Since all regular lgs meet above properties And reg lgs are the lgs characterizable by regular expressions All regular expression operators can be implemented by combinations of union, disjunction, closure Counters (*,+) are repetition plus closure Anchors are individual symbols [] and () and . are kinds of disjunction 11/24/2018 LING 138/238 Autumn 2004

Going from regexp to FSA So if we could just show how to turn closure/union/concat from regexps to FSAs, this would give an idea of how FSA compilation works. The actual proof that reg lgs = FSAs has 2 parts An FSA can be built for each regular lg A regular lg can be built for each automaton So I’ll give the intuition of the first part: Take any regular expression and build an automaton Intuition: induction Base case: build an automaton for single symbol (say ‘a’) Inductive step: Show how to imitate the 3 regexp operations in automata 11/24/2018 LING 138/238 Autumn 2004

Union Accept a string in either of two languages 11/24/2018 LING 138/238 Autumn 2004

Concatenation Accept a string consisting of a string from language L1 followed by a string from language L2. 11/24/2018 LING 138/238 Autumn 2004

FSAs and Computational Morphology An important use of FSAs is for morphology, the study of word parts We’ll just have time for a quick overview today This is the exact topic of LING 239F, being offered this quarter! 11/24/2018 LING 138/238 Autumn 2004

English Morphology Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes We can usefully divide morphemes into two classes Stems: The core meaning bearing units Affixes: Bits and pieces that adhere to stems to change their meanings and grammatical functions 11/24/2018 LING 138/238 Autumn 2004

Nouns and Verbs (English) Nouns are simple (not really) Markers for plural and possessive Verbs are only slightly more complex Markers appropriate to the tense of the verb 11/24/2018 LING 138/238 Autumn 2004

Regulars and Irregulars Ok so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules) Mouse/mice, goose/geese, ox/oxen Go/went, fly/flew The terms regular and irregular will be used to refer to words that follow the rules and those that don’t. 11/24/2018 LING 138/238 Autumn 2004

Regular and Irregular Nouns and Verbs Regulars… Walk, walks, walking, walked, walked Table, tables Irregulars Eat, eats, eating, ate, eaten Catch, catches, catching, caught, caught Cut, cuts, cutting, cut, cut Goose, geese 11/24/2018 LING 138/238 Autumn 2004

Compute Many paths are possible… Start with compute Computer -> computerize -> computerization Computation -> computational Computer -> computerize -> computerizable Compute -> computee 11/24/2018 LING 138/238 Autumn 2004

Why care about morphology? `Stemming’ in information retrieval Might want to search for “aardvark” and find pages with both “aardvark” and “aardvarks” Morphology in machine translation Need to know that the Spanish words quiero and quieres are both related to querer ‘want’ Morphology in spell checking Need to know that misclam and antiundoggingly are not words despite being made up of word parts 11/24/2018 LING 138/238 Autumn 2004

Can’t just list all words Turkish Uygarlastiramadiklarimizdanmissinizcasina `(behaving) as if you are among those whom we could not civilize’ Uygar `civilized’ + las `become’ + tir `cause’ + ama `not able’ + dik `past’ + lar ‘plural’+ imiz ‘p1pl’ + dan ‘abl’ + mis ‘past’ + siniz ‘2pl’ + casina ‘as if’ 11/24/2018 LING 138/238 Autumn 2004

What we want Something to automatically do the following kinds of mappings: Cats cat +N +PL Cat cat +N +SG Cities city +N +PL Merging merge +V +Present-participle Caught catch +V +past-participle 11/24/2018 LING 138/238 Autumn 2004

FSAs and the Lexicon This will actual require a kind of FSA we won’t be studying: the Finite State Transducer (FST) But we’ll give a quick overview anyhow First we’ll capture the morphotactics The rules governing the ordering of affixes in a language. Then we’ll add in the actual words 11/24/2018 LING 138/238 Autumn 2004

Simple Rules 11/24/2018 LING 138/238 Autumn 2004

Adding the Words 11/24/2018 LING 138/238 Autumn 2004

Derivational Rules 11/24/2018 LING 138/238 Autumn 2004

Parsing/Generation vs. Recognition Recognition is usually not quite what we need. Usually if we find some string in the language we need to find the structure in it (parsing) Or we have some structure and we want to produce a surface form (production/generation) Example From “cats” to “cat +N +PL” 11/24/2018 LING 138/238 Autumn 2004

Finite State Transducers The simple story Add another tape Add extra symbols to the transitions On one tape we read “cats”, on the other we write “cat +N +PL” 11/24/2018 LING 138/238 Autumn 2004

Transitions c:c a:a t:t +N:ε +PL:s c:c means read a c on one tape and write a c on the other +N:ε means read a +N symbol on one tape and write nothing on the other +PL:s means read +PL and write an s 11/24/2018 LING 138/238 Autumn 2004

Lexical to Intermediate Level 11/24/2018 LING 138/238 Autumn 2004

Some on-line demos Finite state automata demos http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html Finite state morphology http://www.xrce.xerox.com/competencies/content-analysis/demos/english Some other downloadable FSA tools: http://www.research.att.com/sw/tools/fsm/ 11/24/2018 LING 138/238 Autumn 2004