Published by Stephany Gordon. Modified over 9 years ago.
1
May 2007 CLINT/LIN xfst
Introduction to the xfst Interface
  Review
  Introduction to Morphology
  Relations and Transducers
  Introduction to xfst
2
Basic Formal-Language Review
  What is a Symbol?
  What is an Alphabet?
  What is a String (= word)?
  What is a Language?
  What basic operations can be performed on Sets?
  What basic operations can be performed on Languages?
3
Formal Languages and Natural Languages
Any set of strings is a formal language:
  L1 = { “a”, “aa”, “aaa”, “aaaa”, “aaaaa”, … }
  L2 = { “zzmy”, “niwhiuhew”, “sjehuiwheu” }
  L3 = { “dog”, “cat”, “elephant” }
The systems that we write will “accept” or “map” words in a formal language.
In practical natural-language processing, we try to make these formal languages as close as possible to a natural language, e.g. Swahili. That is, we try to model a natural language, as perfectly as possible, in our grammars.
We write our grammars using xfst and lexc.
4
Concatenation can form “Real” Words
  Root Language:   work, talk, walk
  Suffix Language: ing, ed, s
The concatenation of the Suffix language after the Root language:
  working  worked  works
  talking  talked  talks
  walking  walked  walks
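The concatenation of two finite languages can be sketched in a few lines of Python. This is only an illustration with ordinary sets of strings, not xfst itself; the function name `concatenate` is an assumption made for this example.

```python
def concatenate(first, second):
    """Return every string made of a string from `first` followed by one from `second`."""
    return {x + y for x in first for y in second}

roots = {"work", "talk", "walk"}
suffixes = {"ing", "ed", "s"}

# Suffix language concatenated after the Root language: 3 x 3 = 9 words.
words = concatenate(roots, suffixes)
print(sorted(words))
```

Swapping the arguments gives a different language (“ingwork”, “edtalk”, …), which is why the order of the operands matters.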
5
Concatenation can also form Bad Words
  Root Language:   try, plot, wiggle
  Suffix Language: ing, ed, s
Raw Concatenation Result/Level/Language:
  *trys     *tryed      trying
  plots     *ploted     *ploting
  wiggles   *wiggleed   *wiggleing
Desired Final Result/Level/Language:
  tries     tried       trying
  plots     plotted     plotting
  wiggles   wiggled     wiggling
6
Inuktitut
  Paris+mut+nngau+juma+niraq+lauq+si+ma+nngit+junga
  Pari mu nngau juma nira lauq si ma nngit tunga
  Paris   ‘Paris’
  mut     terminalis-case
  nngau   direction-to
  juma    want
  niraq   declare that
  lauq    past
  si      perfective
  ma      resulting state
  nngit   negative
  junga   1P pres. indic
“I never said that I wanted to go to Paris”
7
Morphology
In most languages, morphemes are just concatenations of symbols from the alphabet of the language.
In most languages, words are just concatenations of morphemes.
But raw concatenation often gives us abstract, morphophonemic, not-yet-correct words. There are alternations between the raw concatenations and the desired final words.
There are two challenges in natural-language Morphology:
  Morphotactics: describe word-formation
  Alternation: describe mappings between raw concatenations and final forms
Both can be modeled and computed using finite-state methods.
8
Transducers
Recall that finite-state transducers can “map” from one string of symbols to a different string of symbols.
  c a n t a r +Verb +PInd +1P +Sg
  c a n t ε ε ε     o     ε   ε
We can also use transducers to map between abstract, not-yet-correct forms (usually built by simple concatenation) and correct forms.
  w i g g l e i n g
  w i g g l ε i n g
9
Regular Relations
A Regular Language is a set of strings, e.g. { “cat”, “fly”, “big” }.
An ordered pair of strings relates two strings: an upper-side string and a lower-side string.
A Regular Relation is a set of ordered pairs of strings.
The set of upper-side strings in a relation is a Regular Language.
The set of lower-side strings in a relation is a Regular Language.
A Regular Relation is a “mapping” between two Regular Languages. Each string in one of the languages is “related” to one or more strings of the other language.
A Regular Relation can be encoded in a Finite-State Transducer (FST).
10
Relations, Analysis and Generation
Given a transducer (relation) and a string, we can see the mappings of the relation via Analysis and Generation:
  Upper-side string: c a n t a r +Verb +PInd +1P +Sg
  Lower-side string: c a n t o
Apply the transducer in a downward direction to the upper-side string to perform Generation.
Apply the transducer in an upward direction to the lower-side string to perform Analysis.
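Analysis and generation can be sketched by treating a small relation as a Python set of (upper, lower) pairs. This is a simplification: a real transducer works symbol by symbol and can encode infinite relations. The first pair comes from the slide; the second pair (a noun reading of “canto”) and the function names are assumptions added for illustration.

```python
RELATION = {
    ("cantar+Verb+PInd+1P+Sg", "canto"),  # pair shown on the slide
    ("canto+Noun+Masc+Sg", "canto"),      # hypothetical second analysis, for illustration only
}

def apply_down(relation, upper):
    """Generation: return all lower-side strings related to `upper`."""
    return {low for up, low in relation if up == upper}

def apply_up(relation, lower):
    """Analysis: return all upper-side strings related to `lower`."""
    return {up for up, low in relation if low == lower}

upper_language = {up for up, _ in RELATION}
lower_language = {low for _, low in RELATION}

print(apply_down(RELATION, "cantar+Verb+PInd+1P+Sg"))
print(apply_up(RELATION, "canto"))
```

With the hypothetical second pair, analysis of “canto” returns two upper-side strings, which shows how one string can be related to more than one string on the other side.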
11
Transducers encode Finite-State Relations
Let a Relation X include the ordered string pairs { …, … }
  What is the upper-side Language of this Relation?
  What is the lower-side Language of this Relation?
  How can such a relation be encoded?
  What do you get when you analyze the string “canto”?
  What do you get when you generate from the string “cantar+Verb+PInd+1P+Sg”?
12
Rules and Infinite Relations
One or both of the Languages related by a Relation can be infinite, e.g. the relation that relates lower-case words to their upper-case versions: { … }
  a:A  b:B  c:C  d:D  etc. (assume arcs for all other symbols in the alphabet)
Apply this network in a downward direction to the input string “cad”. What is the output?
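The upper-casing network can be imitated in Python with one dictionary entry per arc. This is only a sketch: it assumes the alphabet is the 26 ASCII lower-case letters, and the names `ARCS` and `apply_down` are made up for this example.

```python
import string

# One entry per arc: a:A, b:B, c:C, ... (ASCII letters only, by assumption).
ARCS = dict(zip(string.ascii_lowercase, string.ascii_uppercase))

def apply_down(word):
    """Follow one arc per input symbol; return None if some symbol has no arc."""
    if any(symbol not in ARCS for symbol in word):
        return None
    return "".join(ARCS[symbol] for symbol in word)

print(apply_down("cad"))
```

Because the mapping is defined per symbol and the network loops, it covers words of any length, which is what makes the relation infinite.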
13
Alternation Rules
We will write finite-state rules to describe alternations between abstract morphophonemic words and well-formed surface words.
These rules compile into finite-state transducers (relations) that can be used to compute these mappings.
Typically the upper language of a rule FST is the Universal Language, the set of all possible strings.
Typically the lower language is like the upper language, except for the alternations controlled by the rule.
Strings that don’t match the rule are mapped unchanged.
14
Rule Application = Composition
Composition is an operation that merges two transducers “vertically”.
Let X be a transducer that contains a single ordered pair, and let Y be a transducer that contains a single ordered pair whose upper string is the lower string of X’s pair.
The composition of X over Y, notated X .o. Y, is the relation that contains the ordered pair made of X’s upper string and Y’s lower string.
Composition merges any two transducers. If the shared middle level has a non-empty intersection, then the result will be a non-empty relation.
Rule application is done via composition.
Composition is a difficult topic that we will return to many times. Read pp 28-34 and do exercise 1.10.3 on page 37.
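Composition of two finite relations can be sketched in Python over sets of (upper, lower) pairs. The pairs used here are illustrative choices, not necessarily the ones on the original slide, and `compose` is a hypothetical helper name.

```python
def compose(x, y):
    """Pair x's upper string with y's lower string wherever the middle strings match."""
    return {(a, c) for a, b in x for b2, c in y if b == b2}

X = {("kaNpat", "kampat")}   # illustrative pair
Y = {("kampat", "kammat")}   # illustrative pair; its upper string matches X's lower string
Z = {("dog", "cat")}         # shares no middle string with X

print(compose(X, Y))
print(compose(X, Z))
```

The second call returns the empty relation because the lower language of X and the upper language of Z have nothing in common.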
15
Review: Basic Concepts
  Language = a set of strings/words
  Regular Language = a set of strings/words that can be generated using concatenation, union, iteration and similar operations
  Simple Finite-State Automaton (“Acceptor”) = a finite-state machine that accepts/recognizes a regular language
  Regular Relation = a mapping between two regular languages
  Finite-State Transducer (FST) = a two-level finite-state automaton that maps between two regular languages (performs look-up and generation)
16
Regular Expressions
A regular expression is a compact formula for describing a regular language or regular relation.
The regular-expression language is a metalanguage.
Think of regular expressions as the “programming language” of xfst.
Each implementation of regular expressions is slightly different (Python, Perl, emacs, …).
We will have to learn the Xerox flavor of regular expressions as used in xfst.
17
Regular Expressions Denoting a Language
  A Regular Expression describes a Regular Language.
  A Regular Expression compiles into a Finite-State Automaton (“acceptor”).
  The Finite-State Automaton accepts/recognizes the Regular Language.
18
Regular Expression Denoting a Relation
  A Regular Expression describes a Regular Relation.
  A Regular Expression compiles into a Finite-State Transducer.
  The Finite-State Transducer maps the Regular Relation.
19
Introduction to xfst
xfst is an interface giving access to the finite-state operations (algorithms such as union, concatenation, iteration, intersection).
xfst includes a powerful and efficient regular-expression compiler.
xfst includes the lookup operation (‘apply up’) and the generation operation (‘apply down’) so that we can test our networks.
For small examples, we can also print out all the words in the language using the command ‘print words’.
We have to learn the Xerox regular-expression metalanguage.
20
Xerox Regular-Expression Operators I
  a            a simple symbol
  c a t        a CONCATENATION of three symbols
  [ c a t ]    grouping brackets
  ?            denotes any single symbol
  %+Noun or “+Noun”
  %+Verb or “+Verb”
  %+Adj or “+Adj”
               single symbols with multicharacter print names (aka “multicharacter symbols”)
  cat          Beware: this will be compiled by xfst as a single multicharacter symbol
  {cat}        explosion brackets: equivalent to [ c a t ]
21
Xerox Regular Expression Operators II
  []  0        two ways to denote the empty (zero-length) string
Now, where A and B are arbitrarily complex regular expressions:
  [A]          bracketing; equivalent to A
  A | B        union
  (A)          optional; equivalent to [ A | 0 ]
  A & B        intersection
  A B          concatenation (N.B. the space between A and B)
  A - B        subtraction
22
Xerox Regular-Expression Operators III
  A*           Kleene star; zero or more iterations
  A+           Kleene plus; one or more iterations
  ?*           the Universal Language
  ~A           the complement of language A; equivalent to [ ?* - A ]
  ~[?*]        the empty language (i.e. it contains no strings at all, not even the zero-length string)
  %+           the literal plus-sign symbol
  %*           the literal asterisk symbol
               and similarly for %?, %(, %), %~, etc.
23
Denoting Relations
  A .x. B        the “cross-product”; relates every string in A to every string in B, and vice versa; e.g.
                 [ g o .x. w e n t ] relates “go” and “went”
  a:b            shorthand for [ a .x. b ]
  %+Pl:s         shorthand for [ %+Pl .x. s ]
  %+Past:{ed}    shorthand for [ %+Past .x. e d ]
  %+Prog:{ing}   shorthand for [ %+Prog .x. i n g ]
24
Useful Abbreviations
  $A     denotes the language of all strings that contain A; equivalent to [ ?* A ?* ], e.g.
         $b denotes the language of all strings that contain a ‘b’ anywhere
  A/B    denotes the language of all strings in A, ignoring any strings from B that are interspersed in them, e.g.
         a*/b contains “a”, “aa”, “aaa”, … “ba”, “ab”, “aba”, ...
  \A     any single symbol, minus strings in A; i.e. [ ? - A ], e.g.
         \b denotes any single symbol, except a ‘b’
  Beware: \A is NOT to be confused with
  ~A     the complement of A; i.e. [ ?* - A ]
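The difference between $b, \b and ~b is easy to check with three small Python predicates. This is only an analogy: it assumes every character is one symbol (no multicharacter symbols), and the function names are invented for this sketch.

```python
def in_contains_b(s):
    """$b: any string that contains a 'b' somewhere."""
    return "b" in s

def in_term_complement_b(s):
    """\\b: exactly one symbol, and that symbol is not 'b'."""
    return len(s) == 1 and s != "b"

def in_complement_b(s):
    """~b: any string at all (including the empty string) except the string 'b'."""
    return s != "b"

for candidate in ["", "a", "b", "ab"]:
    print(repr(candidate), in_contains_b(candidate),
          in_term_complement_b(candidate), in_complement_b(candidate))
```

The string “ab” is in ~b but not in \b, because \b only ever denotes strings of length one.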
25
Basic xfst interface commands
  UnixPrompt% xfst
  xfst> help
  xfst> help union net
  xfst> exit
  xfst> read regex [ d o g | c a t ] ;
  xfst> read regex < myfile.regex
  xfst> apply up dog
  xfst> apply down dog
  xfst> pop stack
  xfst> clear stack
  xfst> save stack myfile.fsm
26
xfst saves networks in a LIFO stack
  xfst> read regex [ d o g | c a t ] ;
or
  xfst> read regex < myfile.regex
causes the compiled network to be “pushed” onto the stack.
When you type
  xfst> pop stack
the top network is popped off the stack and discarded.
When you type
  xfst> apply up dog
the top network on the stack is applied in an upward direction (lookup) on the string “dog”, and the related string or strings are printed.
When you type
  xfst> clear stack
the entire stack is popped and left empty.
When you type
  xfst> save stack myfile.fsm
the contents of the stack are written in binary (compiled) form to the indicated file.
27
Setting Variables
  xfst> define Myvar
pops the top network off of the stack and saves it as the value of Myvar, which can be used in subsequent regular expressions.
  xfst> define Myvar2 [ d o g | c a t ] ;
assigns a value to Myvar2 without modifying the stack. It is equivalent to the two commands
  xfst> read regex [ d o g | c a t ] ;
  xfst> define Myvar2
The command
  xfst> undefine Myvar
undefines Myvar and recycles the memory.
28
Using Variables in Regular Expressions
  xfst> define var1 [ b i r d | f r o g | d o g ] ;
  xfst> define var2 [ d o g | c a t ] ;
You can now use var1 and var2 in subsequent regular expressions:
  xfst> define var3 var1 | var2 ;
  xfst> define var4 var1 var2 ;
  xfst> define var5 var1 & var2 ;
  xfst> define var6 var1 - var2 ;
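For finite languages like these, the four definitions correspond to ordinary set operations. The Python sketch below mirrors them with sets of strings; it is an analogy for checking your expectations, not a replacement for running the commands in xfst.

```python
var1 = {"bird", "frog", "dog"}
var2 = {"dog", "cat"}

var3 = var1 | var2                             # union: var1 | var2
var4 = {x + y for x in var1 for y in var2}     # concatenation: var1 var2
var5 = var1 & var2                             # intersection: var1 & var2
var6 = var1 - var2                             # subtraction: var1 - var2

for name, language in [("var3", var3), ("var4", var4), ("var5", var5), ("var6", var6)]:
    print(name, sorted(language))
```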
29
Performing network operations on the stack
  xfst> read regex [ d o g | c a t ] ;
  xfst> read regex [ m o u s e | r a t ] ;
  xfst> read regex [ d e e r | s q u i r r e l ] ;
  xfst> union net
‘union net’ will pop its arguments off of the stack one at a time, perform the union operation, and push the result back onto the stack, leaving just one network on the stack.
Enter the command ‘words’ to see the resulting language.
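The stack behaviour described here can be imitated with a Python list used as a LIFO stack, where each “network” is simply a set of words. The helper name `union_net` is borrowed from the xfst command for readability; the implementation is a simplified assumption, not how xfst works internally.

```python
def union_net(stack):
    """Pop every language off the stack, union them, and push the single result."""
    result = set()
    while stack:
        result |= stack.pop()
    stack.append(result)

stack = []
stack.append({"dog", "cat"})          # read regex [ d o g | c a t ] ;
stack.append({"mouse", "rat"})        # read regex [ m o u s e | r a t ] ;
stack.append({"deer", "squirrel"})    # read regex [ d e e r | s q u i r r e l ] ;

union_net(stack)
print(len(stack), sorted(stack[-1]))
```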
30
A little concatenation example
  xfst> define Root [ w a l k | t a l k | w o r k ] ;
  xfst> define Prefix [ 0 | r e ] ;
  xfst> define Suffix [ 0 | s | e d | i n g ] ;
  xfst> read regex Prefix Root Suffix ;
  xfst> words
  xfst> apply up walking
Try to get the same result by starting with the same three definitions and then pushing them on the stack, invoking ‘concatenate net’ to perform the concatenation. Remember that concatenation is an ordered operation.
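To predict what ‘words’ should print, the same three languages can be enumerated in Python, with the empty string standing in for 0. This is a sketch over plain string sets and makes no claim about the order in which xfst prints the words.

```python
from itertools import product

prefix = {"", "re"}               # [ 0 | r e ]
root = {"walk", "talk", "work"}   # [ w a l k | t a l k | w o r k ]
suffix = {"", "s", "ed", "ing"}   # [ 0 | s | e d | i n g ]

# Prefix Root Suffix, in that order: 2 x 3 x 4 = 24 words.
words = {p + r + s for p, r, s in product(prefix, root, suffix)}

print(len(words))
print("walking" in words, "reworked" in words, "ingwalk" in words)
```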
31
The Simplest Replace Rules
Replace rules are a very powerful extension to the regular-expression metalanguage. Here is the simplest kind, needed for the kaNpat and Portuguese-pronunciation exercises.
The arrow -> is typed as a hyphen followed by a right angle-bracket. The || operator consists of two vertical bars typed together. The _ is the underscore.
Rule Schema:
  (a) upper -> lower
  (b) upper -> lower || leftcontext _ rightcontext
e.g.
  xfst> read regex s -> z || [ a | e | i | o | u ] _ [ a | e | i | o | u ] ;
  xfst> apply down casa
What is this rule intended to do? What comes out?
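The downward (generation) direction of this rule can be approximated in Python with a regular-expression substitution whose lookbehind and lookahead play the role of the left and right contexts. This is a rough analogy under the assumption that one character equals one symbol; a compiled xfst rule is a transducer that also works in the upward direction, which `re.sub` does not.

```python
import re

VOWEL = "[aeiou]"
# s -> z || VOWEL _ VOWEL : both contexts are checked against the input string.
RULE = re.compile(f"(?<={VOWEL})s(?={VOWEL})")

def apply_down(word):
    """Rewrite every 's' that stands between two vowels as 'z'."""
    return RULE.sub("z", word)

for word in ["casa", "sal", "casas"]:
    print(word, "->", apply_down(word))
```

Words without an intervocalic s pass through unchanged, matching the earlier point that strings not matching a rule are mapped to themselves.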
32
kaNpat example
Assume a language that joins morpheme kaN (with an underspecified nasal N) and morpheme pat into the underlying or morphophonemic form kaNpat.
This language then has “alternation” rules that dictate that N, when followed by p, gets realized as m. And p, when preceded by m, gets realized as m.
The derivation looks like:
  Underlying input:   kaNpat
  Rule1:              N -> m || _ p
  Output of Rule1:    kampat
  Rule2:              p -> m || m _
  Output of Rule2:    kammat
The composition operation (.o.) reduces the derivational cascade of transducer networks into a single transducer network.
33
Your first cascade of rules
  xfst> define Rule1 N -> m || _ p ;
  xfst> define Rule2 p -> m || m _ ;
  xfst> read regex Rule1 .o. Rule2 ;
  xfst> apply down kaNpat
What is the output?
Now restart (with ‘clear stack’), define the two Rules as shown above, push them on the stack in the right order, and perform the composition on the stack using ‘compose net’. What is your result? (Remember that the networks must be pushed in the right order.)
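The effect of the cascade in the downward direction can be traced in Python by chaining two substitutions, feeding the output of the first rule into the second. This is a sketch of the derivation only: xfst composes the two rule transducers into one network, whereas this code simply runs two functions in sequence, and the function names are assumptions.

```python
import re

def rule1(word):
    """N -> m || _ p : realize N as m when followed by p."""
    return re.sub(r"N(?=p)", "m", word)

def rule2(word):
    """p -> m || m _ : realize p as m when preceded by m."""
    return re.sub(r"(?<=m)p", "m", word)

def cascade(word):
    """Apply Rule1 first, then Rule2, like applying Rule1 .o. Rule2 downward."""
    return rule2(rule1(word))

print(rule1("kaNpat"), cascade("kaNpat"))
```

Reversing the order (Rule2 first, then Rule1) stops at kampat, because Rule2 finds no m before the p until Rule1 has applied; this is why the order of composition matters.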
34
Rule Abbreviations
Multiple left-hand sides (several replacements sharing one context), separated by commas:
  b -> p, d -> t, g -> k || _ .#.
Multiple right-hand sides (several contexts for one replacement), separated by commas:
  e -> i || _ (s) .#. , .#. p _ r
Use .#. to refer to either the very beginning or the very end of a word.
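The first abbreviated rule (b -> p, d -> t, g -> k at the end of a word) can be approximated in Python, with the end-of-string anchor standing in for .#.. As with the other sketches, this covers only the downward direction and assumes one character per symbol; the test words are arbitrary strings chosen for illustration.

```python
import re

DEVOICE = {"b": "p", "d": "t", "g": "k"}

def apply_down(word):
    """b -> p, d -> t, g -> k || _ .#. : devoice a word-final b, d or g."""
    return re.sub(r"[bdg]\Z", lambda match: DEVOICE[match.group(0)], word)

for word in ["bad", "dog", "grab", "dogs"]:
    print(word, "->", apply_down(word))
```

Only the final symbol is affected; the initial b, d and g in the test words are left alone because they are not followed by the word boundary.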