FST Morphology Miriam Butt October 2003 Based on Beesley and Karttunen 2003

Recap Last Time: Finite State Automata can model almost anything that involves a finite number of states. We modeled a Coke machine and saw that it could also be thought of as defining a language. We will now look at the extension to natural language more closely.

A One-Word Language: {“canto”} [network diagram: a chain of arcs spelling c a n t o]

A Three-Word Language: {“canto”, “tigre”, “mesa”} [network diagram: three paths from a shared start state to a shared final state]
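These networks can be built directly in xfst with regular expressions. A minimal sketch in standard xfst notation, where {canto} abbreviates the concatenation c a n t o and | is union (the one-word language is just read regex {canto} ;):

    xfst[0]: read regex {canto} | {tigre} | {mesa} ;
    xfst[1]: print words
    canto
    mesa
    tigre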

Analysis: A Successful Match. Input: mesa. The input is matched arc by arc against the three-word network, ending in a final state. [network diagram]

Rejects: The analysis of libro, tigra, cant and mesas will fail. Why? Use: for example, as a spell checker.
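At the xfst prompt, this accept/reject behaviour looks as follows (a sketch: apply up echoes the string on a successful match and returns nothing on a reject):

    xfst[1]: apply up mesa
    mesa
    xfst[1]: apply up tigra
    xfst[1]: apply up mesas
    xfst[1]: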

Transducers: Beyond Accept and Reject. [network diagram: the three-word network with paired upper-side and lower-side symbols on each arc]

Analysis Process:
1. Start at the start state.
2. Match the input symbols of the string against the lower-side symbols on the arcs, consuming the input symbols and finding a path to a final state.
3. If successful, return the string of upper-side symbols on the path as the result. If unsuccessful, return nothing.

A Two-Level Transducer. Input: mesa. Output: mesa. [network diagram: each arc carries identical upper-side and lower-side symbols]

A Lexical Transducer. Input: canto. Output: cantar+Verb+PresInd+1P+Sg. One possible path through the network pairs the upper side c a n t a r +Verb +PresInd +1P +Sg with the lower side c a n t 0 0 0 0 0 o (0 = epsilon). [network diagram]

A Lexical Transducer. Input: canto. Output: canto+Noun+Masc+Sg. Another possible path through the network (found through backtracking) pairs the upper side c a n t o +Noun +Masc +Sg with the lower side c a n t o 0 0 0. [network diagram]
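A minimal sketch of such a lexical transducer in xfst notation. The exact pairing of each tag with the epsilon symbol 0 is an illustrative assumption; upper:lower symbol pairs use the : operator, and multicharacter symbols such as +Verb are written in double quotes:

    xfst[0]: read regex [{cant} a:o r:0 "+Verb":0 "+PresInd":0 "+1P":0 "+Sg":0]
                      | [{canto} "+Noun":0 "+Masc":0 "+Sg":0] ;
    xfst[1]: apply up canto
    cantar+Verb+PresInd+1P+Sg
    canto+Noun+Masc+Sg

apply up returns every analysis whose lower side matches the input, which is exactly the backtracking behaviour described above.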

Why Transducer? General Definition: a device that converts energy from one form to another. In this Context: a device that converts one string of symbols into another string of symbols.

The Tags Tags or symbols like +Noun or +Verb are arbitrary: the naming convention is determined by the (computational) linguist and depends on the larger picture (type of theory/type of application). One very successful tagging/naming convention is the Penn Treebank Tag Set.

The Tags What kind of Tags might be useful?

Generation vs. Analysis The same finite-state transducers we have been using for the analysis of a given surface string can also be used in reverse: for generation. The XRCE people think of analysis as lookup and of generation as lookdown.

Generation --- Lookdown. Input: cantar+Verb+PresInd+1P+Sg. Output: canto. [network diagram: the same path as above, now traversed from the upper side]

Generation --- Lookdown Generation Process:
1. Start at the start state and the beginning of the input string.
2. Match the input symbols of the string against the upper-side symbols on the arcs, consuming the input symbols and finding a path to a final state.
3. If successful, return the string of lower-side symbols on the path as the result. If generation is unsuccessful, return nothing.
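Generation uses the very same network as analysis, now driven from the upper side (a sketch at the xfst prompt, with the lexical transducer defined above):

    xfst[1]: apply down cantar+Verb+PresInd+1P+Sg
    canto
    xfst[1]: apply down canto+Noun+Masc+Sg
    canto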

Tokenization General String Conversion: Tokenizer. Task: divide up running text into individual tokens, e.g.:
couldn’t -> could not
to and fro -> to-and-fro
The farmer walked -> The^TBfarmer^TBwalked
(^TB marks a token boundary.)
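The last conversion can be sketched as an xfst replace rule that rewrites each space as a token boundary (an assumption for illustration: "^TB" in double quotes is treated as one multicharacter symbol, and the quoted blank " " is the space character):

    xfst[0]: read regex " " -> "^TB" ;
    xfst[1]: apply down The farmer walked
    The^TBfarmer^TBwalked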

Sharing Structure and Sets Networks can be compressed quite cleverly (p. ). FST networks are based on formal language theory. This includes the basics of set theory: membership, union, intersection, subtraction, complementation.

Some Basic Concepts/Representations: the Empty Language; the Empty-String Language. [network diagrams]

Some Basic Concepts/Representations: A Network for the Universal Language [diagram: a single final state with a ? loop]; A Network for an Infinite Language [diagram: a single final state with an a loop].
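In xfst regular-expression notation these concepts come out as follows (a sketch: [] is the empty-string language, ? matches any symbol, * is Kleene star, ~ is complement):

    read regex ~[?*] ;   ! the empty language: no strings at all
    read regex [] ;      ! the empty-string language: just the empty string
    read regex ?* ;      ! the universal language: all strings
    read regex a* ;      ! an infinite language: "", "a", "aa", "aaa", ...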

Relations Ordered Set: members are ordered. Ordered Pair: <a,b> vs. <b,a>. Relation: a set whose members are ordered pairs, e.g. Family Trees (p. 21): { <x,y>, <y,z>, ... }

Relations An Infinite Relation: e.g. the relation that pairs each lower-case string with its upper-case counterpart. Identity Relation: pairs every string with itself. [network diagram: a single looping state with the arcs a:A, b:B, c:C, ..., z:Z]
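In xfst, relations are written with the crossproduct operator .x. or with symbol pairs; a sketch:

    read regex {cat} .x. {chat} ;     ! the one-pair relation <cat, chat>
    read regex [a:A | b:B | c:C]* ;   ! a fragment of the infinite case-mapping relation
    read regex ?* ;                   ! read as a transducer: the identity relation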

Basic Set Operations union (p. 24) intersection (p. 25) subtraction (p. 25)
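In xfst these operators are | (union), & (intersection) and - (subtraction); a sketch over two toy languages:

    read regex [{cat} | {dog}] | [{dog} | {fish}] ;   ! union: {cat, dog, fish}
    read regex [{cat} | {dog}] & [{dog} | {fish}] ;   ! intersection: {dog}
    read regex [{cat} | {dog}] - [{dog} | {fish}] ;   ! subtraction: {cat}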

Concatenation One can also concatenate two existing languages (finite-state networks) with one another to build up new words productively/dynamically. This works nicely, but one has to write extra rules to avoid things like *trys and *tryed, though trying is okay.

Concatenation: a Network for the Language {“work”} and a Network for the Language {“s”, “ed”, “ing”}. [network diagrams]

Concatenation of the two networks. [network diagram] What strings/language does this result in?
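In xfst, concatenation is plain juxtaposition; a sketch of the two networks above and their concatenation:

    xfst[0]: read regex {work} [{s} | {ed} | {ing}] ;
    xfst[1]: print words
    worked
    working
    works

That is, the resulting language is {“works”, “worked”, “working”}.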

Composition Composition is an operation on two relations. Composition of the two relations <x,y> and <y,z> yields <x,z>. Example: <“cat”,“chat”> composed with <“chat”,“Katze”> gives <“cat”,“Katze”>.

Composition: [network diagrams: one network relating cat and chat, and one relating chat and Katze]

Merging the two networks. [network diagram: the shared middle level chat aligned between cat and Katze]

Composition: The Composition of the Networks relates cat directly to Katze; the intermediate level chat disappears. [network diagram] What is this reminiscent of?
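A sketch in xfst, putting cat on the upper side and Katze on the lower (this orientation is an assumption; the slides only show the two outer levels):

    xfst[0]: read regex [{cat} .x. {chat}] .o. [{chat} .x. {Katze}] ;
    xfst[1]: apply down cat
    Katze
    xfst[1]: apply up Katze
    cat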

Composition + Rule Application Composition can be used to encode phonology-style rules. E.g., Lexical Phonology assumes several iterations/levels at which different kinds of rules apply. This can be modeled in terms of cascades of finite-state transducers (p. 35). These can then be composed together into one single transducer.
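A sketch of such a cascade as an xfst script, using the classic kaNpat example from the two-level-morphology literature; the toy lexicon and the two replace rules are illustrative assumptions:

    read regex {kaNpat}
           .o. [N -> m || _ p]     ! rule 1: N becomes m before p
           .o. [p -> m || m _ ] ;  ! rule 2: p becomes m after m
    apply down kaNpat
    ! output: kammat

Each rule is a transducer; composing the lexicon with the whole cascade yields a single transducer that maps kaNpat directly to kammat.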

Next Time Begin working with xfst Now: Exercises , , Optional: