Finite State Transducers for Morphological Parsing
CSA3050: NLP Algorithms
5.11.2002
Resumé
FSAs are equivalent to regular languages.
FSTs are equivalent to regular relations (over pairs of regular languages).
FSTs are like FSAs but with complex labels.
We can use FSTs to transduce between surface and lexical levels.
Dotted Pair Notation
1) FSA recogniser for "fox": arcs  f  o  x
2) FST transducers for fox/fox and goose/geese: arcs  f:f o:o x:x  and  g:g o:e o:e s:s e:e
Dotted Pair Notation (2)
By convention, x:y pairs lexical symbol x with surface symbol y.
Within the context of FSTs we often encounter "default pairs" of the form x:x; these are often written simply as "x".
Example (goose/geese with default pairs):  g  o:e  o:e  s  e
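As a minimal sketch (not from the slides; the class name PairFST and the method transduce are illustrative, not a standard API), pair labels can be encoded directly on the arcs of a finite-state machine: transduction matches the lexical side of each arc and emits the surface side.

    class PairFST:
        """Toy transducer whose arcs carry (lexical, surface) symbol pairs."""
        def __init__(self, arcs, start, finals):
            self.arcs = arcs          # list of (state, lexical_sym, surface_sym, next_state)
            self.start = start
            self.finals = finals

        def transduce(self, lexical):
            """Return every surface string the FST relates to the lexical string."""
            results = []
            def walk(state, i, out):
                if i == len(lexical) and state in self.finals:
                    results.append("".join(out))
                for (s, lex, surf, nxt) in self.arcs:
                    if s == state and i < len(lexical) and lexical[i] == lex:
                        walk(nxt, i + 1, out + [surf])
            walk(self.start, 0, [])
            return results

    # goose -> geese, using the pair labels g:g  o:e  o:e  s:s  e:e
    goose_fst = PairFST(
        arcs=[(0, "g", "g", 1), (1, "o", "e", 2), (2, "o", "e", 3),
              (3, "s", "s", 4), (4, "e", "e", 5)],
        start=0, finals={5})
    print(goose_fst.transduce("goose"))   # ['geese']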
FSA for Number Inflection
How can we augment this to produce an analysis?
3 Steps
1. Create a transducer Tnum for noun number inflection. This will add number and category information, given word classes as input.
2. Create a transducer Tstems mapping words to word classes.
3. Hook the two together.
Tnum example
lexical:       reg-noun-stem  +N  +PL
intermediate:  reg-noun-stem  ^  s  #
1. Tnum: Noun Number Inflection
Multi-character symbols (e.g. +N, +PL)
Morpheme boundary: ^
Word boundary: #
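A string-level sketch of what Tnum does for regular nouns (the function name tnum_regular is hypothetical; irregular stems follow their own arcs in the full transducer). The point is that +N and +PL are treated as single, multi-character symbols.

    def tnum_regular(lexical_tokens):
        """Sketch of Tnum restricted to regular noun stems.
        +N and +PL are single multi-character symbols, not letter sequences."""
        stem_class, cat, num = lexical_tokens
        assert stem_class == "reg-noun-stem" and cat == "+N"
        if num == "+SG":
            return [stem_class, "#"]               # singular: just close the word
        if num == "+PL":
            return [stem_class, "^", "s", "#"]     # plural: boundary, s, word boundary
        raise ValueError(f"unexpected number feature {num!r}")

    print(tnum_regular(["reg-noun-stem", "+N", "+PL"]))   # ['reg-noun-stem', '^', 's', '#']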
Tstems example
intermediate:  reg-noun-stem  #
surface:       d:d o:o g:g  /  f:f o:o x:x   (dog / fox), then #
Tstems example (2)
intermediate:  irreg-pl-noun-form  #
surface:       m o:i u:ε s e  (mouse/mice)  /  s h e e p  (sheep), then #
2. Tstems Lexicon
Hooking Together
There are two ways to hook the two transducers together:
Cascading: feeding the output of one transducer into the input of the other and running them in series (see the sketch below).
Composition: composing the two transducers together mathematically to create a third, equivalent transducer.
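A toy sketch of the cascade (hypothetical function names; tnum here is a compressed version of the tnum_regular sketch above, and tstems is reduced to a lookup): the token list produced by the first transducer is simply fed to the second.

    def tnum(lexical_tokens):
        # reg-noun-stem +N +PL  ->  reg-noun-stem ^ s #
        stem_class, _cat, num = lexical_tokens
        return [stem_class] + (["^", "s", "#"] if num == "+PL" else ["#"])

    def tstems(intermediate_tokens, stem):
        # replace the stem-class symbol with an actual stem's letters
        return [c for tok in intermediate_tokens
                  for c in (list(stem) if tok == "reg-noun-stem" else [tok])]

    def cascade(lexical_tokens, stem):
        return tstems(tnum(lexical_tokens), stem)

    print(cascade(["reg-noun-stem", "+N", "+PL"], stem="fox"))
    # ['f', 'o', 'x', '^', 's', '#']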
Hooking Together: cascading
lexical:       reg-noun-stem  +N  +PL
                 | Tnum
intermediate:  reg-noun-stem  ^  s  #
                 | Tstems
surface:       dog / fox  s  #
Composition of Relations
Let R and S be binary relations. The composition of R and S, written R ∘ S, is defined by:
(a, c) ∈ R ∘ S if and only if there is some b such that (a, b) ∈ R and (b, c) ∈ S.
Transducers can also be composed.
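For finite relations represented as sets of pairs, the definition translates directly into code (a sketch with made-up example strings):

    def compose(R, S):
        """(a, c) is in R ∘ S iff there is some b with (a, b) in R and (b, c) in S."""
        return {(a, c) for (a, b) in R for (b2, c) in S if b == b2}

    # R relates lexical forms to intermediate forms, S relates intermediate to surface.
    R = {("cat +N +PL", "cat^s#")}
    S = {("cat^s#", "cats")}
    print(compose(R, S))   # {('cat +N +PL', 'cats')}

Composing two FSTs yields a single FST that computes exactly this composed relation.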
Tnum ∘ Tstems
English Spelling Rules
consonant doubling: beg / begging
y replacement: try / tries
k insertion: panic / panicked
e deletion: make / making
e insertion: watch / watches
Each rule can be stated in more detail ...
e Insertion Rule
Insert an e on the surface tape just when the lexical tape has a morpheme ending in x, s, z, or ch, and the next (and final) morpheme is -s.
Stated formally:  ε → e / [x | s | z | ch] ^ __ s #
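A rough illustration of the rule's effect on an intermediate-level string (a regex sketch, purely for exposition; the real system implements the rule as an FST):

    import re

    def e_insertion(intermediate):
        """Apply e insertion to an intermediate string, where ^ is a morpheme
        boundary and # the word boundary, then erase the boundaries."""
        surface = re.sub(r"(x|s|z|ch)\^(s)#", r"\1e\2", intermediate)
        return surface.replace("^", "").replace("#", "")

    print(e_insertion("fox^s#"))    # foxes
    print(e_insertion("watch^s#"))  # watches
    print(e_insertion("dog^s#"))    # dogs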
e insertion over 3 levels
The rule corresponds to the mapping between the surface and intermediate levels.
e insertion as an FST
Incorporating Spelling Rules
Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
The set of spelling rules is positioned between the surface level and the intermediate level.
Parallel execution of FSTs can be carried out:
by simulation: in this case the FSTs must first be aligned (see the sketch below);
by first constructing a single FST corresponding to their intersection.
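A minimal sketch of the simulation view (an assumed representation, not from the slides): each rule is reduced to a predicate over one shared, aligned sequence of intermediate:surface symbol pairs, and the sequence is accepted only if every rule accepts it.

    def accepted_by_all(rules, aligned_pairs):
        """Parallel execution by simulation: all rule FSTs must license the
        same aligned sequence of (intermediate, surface) symbol pairs."""
        return all(rule(aligned_pairs) for rule in rules)

    def e_insertion_ok(pairs):
        # Toy constraint: an inserted e must sit right after a morpheme boundary
        # that itself follows x, s, or z ("ch" is omitted in this toy version).
        for i, (lex, surf) in enumerate(pairs):
            if lex == "" and surf == "e":
                if i < 2 or pairs[i - 1][0] != "^" or pairs[i - 2][0] not in ("x", "s", "z"):
                    return False
        return True

    def boundaries_deleted(pairs):
        # Toy constraint: ^ and # must be realised as nothing on the surface.
        return all(surf == "" for (lex, surf) in pairs if lex in ("^", "#"))

    fox_es = [("f", "f"), ("o", "o"), ("x", "x"), ("^", ""), ("", "e"), ("s", "s"), ("#", "")]
    print(accepted_by_all([e_insertion_ok, boundaries_deleted], fox_es))   # True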
Putting it all together
Execution of the FSTi (the spelling-rule FSTs) takes place in parallel.
Kaplan and Kay vs. the Xerox View
Kaplan and Kay: the FSTi are aligned but kept separate.
Xerox view: the FSTi are intersected together into a single FST.
Operations over FSTs
We can perform operations over FSTs which yield other FSTs:
Inversion
Union
Composition
The inversion of T, written T⁻¹, simply computes the inverse mapping to T.
Inversion
T:    lexical  c a t ^ PL  →  surface
T⁻¹:  surface  →  lexical  c a t ^ PL
Inversion
To invert a transducer, either:
we switch the order of the complex symbols, i.e. every i:o becomes o:i (sketched below), or
we leave the transducer alone and slightly change the parsing algorithm.
Practical consequences:
The transducer is reversible.
We can use exactly the same transducer to perform either analysis or generation.
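A sketch of the first option, mirroring the arc representation used in the PairFST sketch earlier (hypothetical, not the slides' own code): inversion just swaps the lexical and surface sides of every arc, so the same machine serves for generation in one direction and analysis in the other.

    def invert(arcs):
        """Swap the lexical and surface symbol of every arc (state, lex, surf, next)."""
        return [(s, surf, lex, nxt) for (s, lex, surf, nxt) in arcs]

    generation_arcs = [(0, "g", "g", 1), (1, "o", "e", 2), (2, "o", "e", 3),
                       (3, "s", "s", 4), (4, "e", "e", 5)]     # goose -> geese
    analysis_arcs = invert(generation_arcs)                    # geese -> goose
    print(analysis_arcs[1])   # (1, 'e', 'o', 2)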
Closure Properties of FSTs
Relations computed by FSTs are closed under:
inversion
union
composition
but are not closed (in general) under:
intersection
complementation
subtraction
However, intersection is possible provided that we restrict the class of transducers.