November 2003. CSA4050: Advanced Topics in NLP. Computational Morphology IV: xfst.

What is xfst?
xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
xfst and the other Xerox tools employ a notation very close to the one we have been using so far.
For full documentation on the syntax and semantics of Xerox REs, see –

Simple Commands
Start xfst from the command line (via babe):
    > xfst
define: give a name to an RE
print: print information
read: read information
various stack operations
file interaction

define command
define name regexp ;
    xfst[0]: define foo [d o g] | [c a t];
    xfst[0]: define R1 [a | b | c | d];
    xfst[0]: define R2 [d | e | f | g];
    xfst[0]: define R3 [f | g | h | i | j];
    xfst[0]: define baz R1 & R2;
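As a rough illustration outside xfst, the languages defined above can be modelled as Python sets of strings, with `&` playing the role of intersection. This is only a sketch of what the definitions denote, not of how xfst stores networks.

```python
# Each defined language is modelled as a plain set of strings (an assumption
# made for illustration; xfst compiles these into finite-state networks).
foo = {"dog", "cat"}
R1 = {"a", "b", "c", "d"}
R2 = {"d", "e", "f", "g"}
R3 = {"f", "g", "h", "i", "j"}

# "define baz R1 & R2;" names the intersection of the two languages.
baz = R1 & R2

print(sorted(baz))
```

Only the symbol shared by R1 and R2 survives in baz.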

print words
print words name - see the words in the language called name.
    xfst[0]: print words R1
    d
    c
    b
    a
    xfst[0]:

print net
print net name - see detailed information about the network name.
    xfst[0]: define z R1 & R2;
    xfst[0]: define baz R1 & R2;
    xfst[0]: print net z
    Sigma: a b c d e f g
    Size: 7
    Net: FC370
    Flags: deterministic, pruned, minimized, epsilon_free, loop_free
    Arity: 1
    s0: d -> fs1.
    fs1: (no arcs)
    xfst[0]:

Some Properties of Networks
epsilon free: there are no arcs labeled with the epsilon symbol.
deterministic: no state has more than one outgoing arc with the same label.
minimised: there is no other network with exactly the same paths that has fewer states.
These make sense for FSAs, but not necessarily for FSTs.
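The first two properties can be checked mechanically on a small automaton. The sketch below assumes a made-up representation (a list of `(source, label, target)` arcs, with epsilon written as `"0"`) and tests determinism only in the simplified sense of no repeated label leaving a state.

```python
from collections import Counter

EPSILON = "0"  # assumption: epsilon is written "0", as in the Xerox notation

def is_epsilon_free(arcs):
    """True if no arc is labelled with the epsilon symbol."""
    return all(label != EPSILON for _, label, _ in arcs)

def is_deterministic(arcs):
    """True if no state has two outgoing arcs carrying the same label."""
    counts = Counter((src, label) for src, label, _ in arcs)
    return all(n == 1 for n in counts.values())

# The one-arc network printed by "print net z": s0 --d--> fs1
net_z = [("s0", "d", "fs1")]

# A hypothetical counter-example: two arcs labelled "a" leave s0.
ambiguous = [("s0", "a", "s1"), ("s0", "a", "s2")]

print(is_epsilon_free(net_z), is_deterministic(net_z), is_deterministic(ambiguous))
```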

Equivalent?
[Figure: two transducers, A and B; the arc labels include a:0, a and b]
For each of A and B: no. states? no. paths? relation encoded?

Remarks
A and B encode the same relation {, }.
They are both deterministic and minimal.
They have different numbers of states.
Arcs labeled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.
This situation does not occur with FSAs.

FST Determinism: Sequential vs. Unambiguous
Unambiguous: for any input there is at most one output.
– Transducer A is unambiguous in either direction.
Sequential: no state has more than one arc with the same symbol on the input side.
– Transducer A is not sequential in one direction.
A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.

Basic Stack Operations
read regex : push network onto stack
print stack : list items on stack
print net : detailed info on top stack item
pop stack : remove top item from stack
define name : set name to value of top stack item

Stack Operations: intersect net, union net, etc.
Load the stack with N suitable arguments, making sure that they are pushed onto the stack in the correct (reverse) order.
Issue the command, e.g. intersect net.
The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack.

Stack Example 1
    xfst[0]: clear stack;
    xfst[0]: read regex [d |c |e | b | w]
    xfst[1]: read regex [b | s | h | w]
    xfst[2]: read regex [s | d | c | f | w]
    xfst[3]: print stack
    xfst[3]: intersect net
    xfst[1]: print stack
    xfst[1]: print net
    xfst[1]: print words
x1
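The session in Stack Example 1 can be mimicked with a Python list standing in for the xfst stack. The helper names are invented for this sketch, and it assumes, as the prompt change from xfst[3] to xfst[1] suggests, that intersect net consumes every network on the stack.

```python
from functools import reduce

stack = []  # hypothetical stand-in for the xfst stack; the last element is the top

def read_regex(symbols):
    """Push a language of one-symbol words, e.g. [d | c | e | b | w]."""
    stack.append(set(symbols))

def intersect_net():
    """Pop all operands, intersect them, and push the single result."""
    operands = [stack.pop() for _ in range(len(stack))]
    stack.append(reduce(lambda a, b: a & b, operands))

read_regex("dcebw")
read_regex("bshw")
read_regex("sdcfw")
depth_before = len(stack)
intersect_net()

print(depth_before, len(stack), sorted(stack[-1]))
```

Only w occurs in all three languages, so it is the only word left on the stack.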

Stack Example 2
    xfst[0]: clear stack;
    xfst[0]: read regex [e d | i n g | s |[]]
    xfst[1]: read regex [t a l k | k i c k]
    xfst[2]: print stack
    xfst[2]: print net
    xfst[2]: print words
    xfst[2]: concatenate net
    xfst[1]: print words
x2/a
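Stack Example 2 shows why the push order matters for a non-commutative operation. The sketch below assumes that the network on top of the stack becomes the first operand of concatenate net, which is why the suffixes are pushed before the stems; the function name is made up for illustration.

```python
suffixes = {"ed", "ing", "s", ""}   # [e d | i n g | s | []], pushed first
stems = {"talk", "kick"}            # [t a l k | k i c k], pushed second

stack = [suffixes, stems]           # the last element is the top of the stack

def concatenate_net(stack):
    """Pop two languages and push every first+second concatenation.

    Assumption: the top of the stack is taken as the first operand.
    """
    first = stack.pop()
    second = stack.pop()
    stack.append({x + y for x in first for y in second})

concatenate_net(stack)
print(sorted(stack[-1]))
```

Pushing the two networks in the opposite order would, under the same assumption, produce strings such as edtalk instead.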

Creating Relations
A simple example of a transducer can be shown using the crossproduct operator:
    xfst[0]: clear stack
    xfst[0]: define Y [d o g | c a t];
    xfst[0]: define Z [c h i e n | c h a t];
    xfst[0]: read regex Y .x. Z
We can now use apply up and apply down to test the transducer’s behaviour.
x3ab

apply up; apply down
applyup(arg, R) = {x | <x, arg> in R}
applydown(arg, R) = {x | <arg, x> in R}
    xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];
    xfst[1]: apply up chien
    dog
    cat
    xfst[1]: apply down cat
    chien
    chat
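The behaviour of apply up and apply down on a crossproduct can be reproduced with a relation stored as a set of (upper, lower) pairs. This is a minimal sketch with invented function names; it says nothing about how xfst performs the lookup on the compiled network.

```python
from itertools import product

def crossproduct(upper, lower):
    """Relate every string of the upper language to every string of the lower one."""
    return set(product(upper, lower))

def apply_down(word, relation):
    """All lower-side strings paired with the given upper-side string."""
    return {lo for up, lo in relation if up == word}

def apply_up(word, relation):
    """All upper-side strings paired with the given lower-side string."""
    return {up for up, lo in relation if lo == word}

R = crossproduct({"dog", "cat"}, {"chien", "chat"})

print(sorted(apply_up("chien", R)))
print(sorted(apply_down("cat", R)))
```

Because the crossproduct pairs everything with everything, cat is related to chien as well as to chat, which is exactly what the session above shows.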

Exercise for .x.
What RE would perform the correct translations? Define it in xfst.
Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".
x3c

Replace Rules
Xerox RE notation includes replace rules.
Replace rules do not increase the descriptive power of REs; however, they do provide a powerful abbreviated rule-like notation.
There are two main types of replace rules: unconditional and conditional.

Unconditional Replace Rules
The most straightforward kind of unconditional replace rule is:
    a -> b
This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.
Checkpoint: how does this differ from a:b? What is the FST that computes this relation?

Unconditional Replace
e.g.
    xfst[0]: read regex c -> r
    xfst[1]: apply down cat
    xfst[1]: apply down dog
Where there is no match, the string is identity mapped.
The general pattern for simple replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
x4ab
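For the single-symbol rule c -> r, the downward mapping behaves like an ordinary string substitution. The sketch below covers only that simple case; the general A -> B, where A and B are languages, needs more care than `str.replace` provides.

```python
def replace_c_with_r(word):
    """Downward application of c -> r: every c becomes r, everything else is copied."""
    return word.replace("c", "r")

for word in ("cat", "dog"):
    print(word, "->", replace_c_with_r(word))
```

A word containing no c, such as dog, comes out unchanged, which is the identity mapping mentioned above.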

Definition of A -> B
    A -> B = [ no_A [A .x. B] ]* no_A
    where no_A = ~$[A - 0]
N.B. If A does not contain the empty string, ~$[A - 0] = ~$[A]. Otherwise ~$[A] is null, whereas ~$[A - 0] contains at least the empty string.

Conditional Replace Rules
More complex replace rules can also specify left and right context, as in
    A -> B || L _ R
Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.
A, B, L and R are REs denoting languages, not relations.
x4c

Special Cases
The symbol .#. refers to the absolute beginning or end of the string in left and right contexts. For example:
    e -> i || .#. p _ r
Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".
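The downward effect of the example rule e -> i || .#. p _ r can be approximated with a Python regular expression, where `^` stands in for the .#. boundary and a lookahead for the right context. This is an analogy for this one rule only, not a general translation of the replace-rule notation.

```python
import re

def apply_rule(word):
    """Replace e by i only after a word-initial p and before r."""
    # "^p" plays the role of the left context ".#. p"; "(?=r)" is the right context.
    return re.sub(r"^pe(?=r)", "pi", word)

for word in ("per", "super", "pet"):
    print(word, "->", apply_rule(word))
```

In super the sequence per is not at the start of the string, and in pet the right context is missing, so both are left untouched.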

The kaNpat exercise
Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.
N just before p is realised as m.
p occurring just after an m is realised as m.

kaNpat rules
We can write the following two rules to account for this behaviour:
Rule 1. [N -> m || _ p]
Notice that the left-hand context is empty, meaning that any context will do.
Rule 2. [p -> m || m _]
Note that the linguist must keep track of the order in which rules are applied.

Derivation of kammat
Lexical: kaNpat
    apply [N -> m || _ p]
Intermediate: kampat
    apply [p -> m || m _]
Surface: kammat
The first rule feeds the second.
Checkpoint: what happens if rules are applied in reverse order?
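The derivation can be traced step by step with two small Python functions, one per rule, using lookahead and lookbehind for the contexts. The function names are invented, and the regular expressions only approximate the two rules for strings like these.

```python
import re

def rule1(s):
    """N -> m || _ p : N becomes m when immediately followed by p."""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    """p -> m || m _ : p becomes m when immediately preceded by m."""
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpat"
intermediate = rule1(lexical)
surface = rule2(intermediate)
print(lexical, "->", intermediate, "->", surface)

# Applying the rules the other way round: rule 2 finds no m before p yet.
reversed_order = rule1(rule2(lexical))
print(lexical, "->", reversed_order)
```

In the reversed order rule 2 has nothing to apply to, so the feeding relationship is lost and the result stops at the intermediate form.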

Composing the Relations
Each rule describes a certain relation: call these R1 and R2.
If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.
Mathematically, that relation is the composition of R1 and R2.
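For finite relations stored as sets of pairs, composition is a one-line join on the shared middle element. The sketch below uses tiny hand-written fragments of R1 and R2, chosen for illustration, rather than the full infinite relations the rules denote.

```python
def compose(r1, r2):
    """Pairs (x, z) such that r1 relates x to some y and r2 relates that y to z."""
    return {(x, z) for x, y1 in r1 for y2, z in r2 if y1 == y2}

# Hand-picked fragments of the two rule relations (assumed for illustration).
R1 = {("kaNpat", "kampat"), ("kampat", "kampat")}
R2 = {("kampat", "kammat")}

print(sorted(compose(R1, R2)))
```

The intermediate string kampat no longer appears in the composed relation.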

Composing the Rules
Each rule is compiled into an FST.
If Rule 1 compiles to F1, and Rule 2 to F2, then there must be an F3 which computes the composition of F1 and F2.
Checkpoint: write the RE corresponding to the composition of the original 2 rules.

Testing the kaNpat grammar
First get the rules onto the stack:
    xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];
Try the following and explain:
– apply down (kaNpat; kampat; kammat)
– apply up kammat
– Try the above but with the rules in reverse order
X5ab
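The expected results can be previewed outside xfst. The sketch below models the composed rules as a Python function and implements apply up by brute force: it tries every string of the same length over a small assumed alphabet and keeps those that map down to the target. This only works because these two rules never change the length of a string, and it is not how xfst computes apply up.

```python
import re
from itertools import product

def cascade(s):
    """[N -> m || _ p] .o. [p -> m || m _], applied downward."""
    s = re.sub(r"N(?=p)", "m", s)
    return re.sub(r"(?<=m)p", "m", s)

def apply_up(surface, alphabet="kaNpmt"):
    """Brute-force inverse: every same-length string that maps down to surface."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(surface)))
    return {c for c in candidates if cascade(c) == surface}

for word in ("kaNpat", "kampat", "kammat"):
    print("down:", word, "->", cascade(word))

print("up:", sorted(apply_up("kammat")))
```

All three lexical candidates collapse onto the same surface form, so apply up returns more than one answer.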

Practical use of xfst
Regular expression files (text):
    xfst[0]: read regex < regexpfile
Binary files (compiled networks):
    xfst[1]: save stack binfile
    xfst[0]: load stack binfile
Scripts (xfst commands):
    xfst[0]: source scriptfile
    % xfst -f myscript
    % xfst -l myscript

A’ is the sequentiable
[Figure: transducers A and A’; the arc labels include a:0, a, b, 0:b and b:a]
no. states? no. paths? relation encoded?