October 2006, Advanced Topics in NLP
Finite State Machinery: Xerox Tools

Finite State Methods
Many Domains of Application
– Tokenization
– Sentence breaking
– Spelling correction
– Morphology (analysis/generation)
– Phonological disambiguation (Speech Recognition)
– Morphological disambiguation ("Tagging")
– Pattern matching ("Named Entity Recognition")
– Shallow Parsing

The Xerox Approach
Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
– Meta-languages for describing regular languages and regular relations
– Compiler for mapping meta-language "programs" into efficient FS machinery
– Several tools and applications

Xerox Tools
– xfst: Xerox Finite-State Tool
– lexc: Finite-State Lexicon Compiler
– twolc: Two-Level Rule Compiler

Xerox Tools
All of these applications are built around a central library, now written in C, called c-fsm. The library defines the data structures, provides the input/output routines, and implements the fundamental operations on finite-state networks. All of this is based on long-term Xerox research, originated by Ronald M. Kaplan and Martin Kay at PARC in the early 1980s.

Textbook
CSLI Publications, Studies in Computational Linguistics series. See also website.

xfst
xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers. xfst and the other Xerox tools employ a special regular-expression notation, the "xfst notation", which is more powerful than the notations used in Unix, Perl, C#, etc.

Simple Regular Expressions
– Atomic Expressions
– Complex Expressions

Atomic Expressions
The simplest kind of RE is a symbol. Typically, a symbol is the sort of item that can appear on the arc of a network. For example, the symbol a is an RE that designates the language containing the string "a" and nothing else. Multicharacter symbols such as Plur are also symbols; they just happen to have multicharacter print names.

Special Atomic Expressions
– The epsilon symbol 0 denotes the empty-string language {""}.
– The ANY symbol ? denotes the language of all single-symbol strings. The empty string is not included in ?.

Complex REs: Union
If A and B are arbitrary REs, [A | B] is the union of A and B. It denotes the union of the languages denoted by A and B.
If A is an arbitrarily complex RE, [A] is equivalent to A.
Checkpoint: Write down the strings in the language denoted by [a | b | ab].

Complex REs: Intersection
If A and B are arbitrary REs, [A & B] is the intersection of A and B. It denotes the intersection of the languages denoted by A and B.
Checkpoint: Write down the strings in the language denoted by [a | b | c | d | e] & [d | e | f | g].
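
The set semantics of union and intersection can be sketched in Python by modelling a finite language as a set of strings. This is only an illustration of what the operators denote, not of how xfst computes them on networks; the operand languages below are made up for the example.

```python
# Model a finite regular language as a Python set of strings.
# Assumption: toy operands chosen for illustration, not taken from the slides.
A = {"a", "b", "c"}        # language denoted by [a | b | c]
B = {"b", "c", "d"}        # language denoted by [b | c | d]

union = A | B              # language denoted by [A | B]
intersection = A & B       # language denoted by [A & B]

print(sorted(union))
print(sorted(intersection))
```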

Complex REs: Concatenation
If A and B are arbitrary REs, [A B] is the concatenation of A and B.
Checkpoint: note the difference between
– [d o g]
– dog
– [d og]
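
One way to picture the distinction is to treat a string as a sequence of symbols rather than of characters. The sketch below is a hypothetical Python model (the names are invented for illustration): the three expressions yield different symbol sequences even though their print names coincide.

```python
# Each string is modelled as a tuple of symbols; a symbol may have a
# multicharacter print name. Variable names here are illustrative only.
d_o_g = ("d", "o", "g")   # [d o g]: three single-character symbols
dog = ("dog",)            # dog: one multicharacter symbol
d_og = ("d", "og")        # [d og]: symbol d followed by multicharacter symbol og

def print_name(symbols):
    """Join the print names of the symbols in one string."""
    return "".join(symbols)

for seq in (d_o_g, dog, d_og):
    print(len(seq), "symbol(s) ->", print_name(seq))
```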

Concatenation over Regular Expressions and Languages
Regular Expression
  E1 = [a|b]
  E2 = [c|d]
  E1 E2 = [a|b] [c|d]
Language
  L1 = {"a", "b"}
  L2 = {"c", "d"}
  L1 L2 = {"ac", "ad", "bc", "bd"}
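
The language-level definition can be checked with a few lines of Python, again treating a finite language as a set of strings. The helper name concat is an assumption of this sketch.

```python
from itertools import product

L1 = {"a", "b"}   # language of E1 = [a|b]
L2 = {"c", "d"}   # language of E2 = [c|d]

def concat(x, y):
    """Every string made of one string from x followed by one string from y."""
    return {u + v for u, v in product(x, y)}

print(sorted(concat(L1, L2)))   # ['ac', 'ad', 'bc', 'bd']
```

Concatenation is not commutative: concat(L2, L1) gives the strings with c or d first.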

Concatenation over FS Automata
[Figure: automata with arcs labeled a, b and c, d, and their concatenation.]

Complex REs: Closures
A+ denotes the concatenation of A with itself one or more times.
A* (Kleene star) denotes [A+ | 0], that is, the concatenation of A with itself zero or more times.
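
Because A+ and A* are infinite whenever A contains a non-empty string, a sketch can only enumerate them up to a length bound. The Python below does that under the assumption that a language is a finite set of strings; plus, star and max_len are names invented for this illustration.

```python
def concat(x, y):
    """Every string made of one string from x followed by one string from y."""
    return {u + v for u in x for v in y}

def plus(lang, max_len):
    """Strings of A+ (one or more copies of A) no longer than max_len."""
    result = set()
    layer = {s for s in lang if len(s) <= max_len}
    while layer:
        result |= layer
        # Extend only the newly found strings by one more copy of A.
        layer = {s for s in concat(layer, lang) if len(s) <= max_len} - result
    return result

def star(lang, max_len):
    """A* = [A+ | 0]: the same strings plus the empty string."""
    return plus(lang, max_len) | {""}

A = {"ab"}
print(sorted(plus(A, 6)))
print(sorted(star(A, 6)))
```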

Other Operations
Minus: [A - B] denotes the set difference of the languages denoted by A and B, so [A - B] = [A & ~B].
Checkpoint: What is the language denoted by [dog | cat | elephant] - [elephant | horse | cow]?

Some Other Conventions
  A*    Closure (Kleene star)
  (A)   Optional element
  ?     Any symbol
  \b    Any symbol other than b
  ~A    Complement (= [?* - A])
  0     Empty-string language
  $A    [?* A ?*]
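
These conventions can be explored over a small bounded universe. The sketch below assumes a toy three-symbol alphabet and a maximum string length of 3, so that ?* becomes a finite set; in xfst the ANY symbol also covers symbols outside any fixed alphabet, which this model ignores. All function names are hypothetical.

```python
from itertools import product

SIGMA = ("a", "b", "c")   # assumed toy alphabet
MAX_LEN = 3               # assumed length bound that makes ?* finite

# Bounded stand-in for ?*: every string over SIGMA up to MAX_LEN symbols.
UNIVERSE = {"".join(p) for n in range(MAX_LEN + 1)
            for p in product(SIGMA, repeat=n)}

def optional(lang):
    """(A): A or the empty string."""
    return set(lang) | {""}

def term_complement(sym):
    """\\b: any single symbol other than the given one."""
    return {s for s in SIGMA if s != sym}

def complement(lang):
    """~A = [?* - A], restricted to the bounded universe."""
    return UNIVERSE - set(lang)

def contains(lang):
    """$A = [?* A ?*]: strings with some string of A as a substring."""
    return {w for w in UNIVERSE if any(x in w for x in lang)}

A, B = {"a", "ab", "c"}, {"ab", "b"}
print(sorted(contains({"ab"})))
print(A - B == A & complement(B))   # the identity [A - B] = [A & ~B]
```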

Simple Commands
In addition to the regular-expression language, there are also commands:
– define: give a name to an RE
– print: print information
– read: read information
– various stack operations
– file interaction
– various command line options

define command
define name regexp
  xfst[0]: define foo [d o g] | [c a t];
  xfst[0]: define R1 [a | b | c | d];
  xfst[0]: define R2 [d | e | f | g];
  xfst[0]: define R3 [f | g | h | i | j];

print command
– print words name: see the words in the language called name
– print net name: see detailed information about the network name
  xfst[0]: print words foo;
  xfst[0]: define baz R1 & R2;
  xfst[0]: print net baz;

Exercise
Compute the words in
– R1 minus R2
– R2 intersect R1
Define a network that contains the words "eeny", "meeny", "miny", "mo".
Determine how many states there are in each result.

Basic Stack Operations
  read regex  : push network onto stack
  print stack : list items on stack
  print net   : detailed info on top stack item
  pop stack   : remove top item from stack
  define name : set name to value of top stack item

Stack Operations
Normally the stack is first loaded with suitable arguments, and then a command requiring N arguments is issued. The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack.
For correct results, items should be pushed onto the stack in reverse order.

Stack Demo 1
  xfst[0]: clear stack;
  xfst[0]: read regex [d | c | e | b | w];
  xfst[1]: read regex [b | s | h | w];
  xfst[2]: read regex [s | d | c | f | w];
  xfst[3]: print stack
  xfst[3]: intersect net
  xfst[1]: print stack
  xfst[1]: print net
  xfst[1]: print words

Stack Exercise 2
  xfst[0]: clear stack;
  xfst[0]: read regex [e d | i n g | s | []];
  xfst[1]: read regex [t a l k | k i c k];
  xfst[2]: print stack
  xfst[2]: print net
  xfst[2]: print words
  xfst[2]: concatenate net
  xfst[1]: print words
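
The stack discipline in the two sessions can be imitated with a toy Python model in which each "network" is just a finite set of strings. The class NetStack and its methods are hypothetical stand-ins, not part of any Xerox tool. The sketch assumes, as the prompts going from xfst[3] and xfst[2] down to xfst[1] suggest, that the command consumes every network on the stack, and that concatenation takes the top item first.

```python
from functools import reduce

class NetStack:
    """Toy stand-in for the xfst stack; a 'network' is a finite set of strings."""

    def __init__(self):
        self.items = []               # the last element is the top of the stack

    def push(self, lang):             # plays the role of "read regex"
        self.items.append(set(lang))

    def intersect_net(self):
        """Replace all items by their intersection (order does not matter)."""
        self.items = [reduce(lambda x, y: x & y, self.items)]

    def concatenate_net(self):
        """Replace all items by their concatenation, top item first (assumed)."""
        top_first = list(reversed(self.items))
        self.items = [reduce(lambda x, y: {u + v for u in x for v in y},
                             top_first)]

demo = NetStack()
for lang in ("dcebw", "bshw", "sdcfw"):      # the three unions from Stack Demo 1
    demo.push(set(lang))
demo.intersect_net()
print(sorted(demo.items[0]))

exercise = NetStack()
exercise.push({"ed", "ing", "s", ""})        # suffixes pushed first
exercise.push({"talk", "kick"})              # stems pushed last, so they are on top
exercise.concatenate_net()
print(sorted(exercise.items[0]))
```

Pushing the suffixes before the stems is what "reverse order" amounts to in this model: the item that should come first in the result has to end up on top.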

lexc
[Figure: a source file is passed through lexc to produce a compiled network.]
lexc is a high-level programming language and compiler that is well suited for defining natural-language lexicons. Its output is a compiled finite-state network in the same format as that of the other Xerox tools (xfst, twolc).

lexc source file
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  ! ex0-lex.txt
  LEXICON Root
  dine #;
  dines #;
  dined #;
  line #;
  lines #;
  lined #;
  END
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

lexc
  ! ex1-lex.txt
  LEXICON Root
  Noun;
  Verb;
  LEXICON Noun
  line NounSuffix;
  LEXICON Verb
  dine VerbSuffix;
  line VerbSuffix;
  LEXICON NounSuffix
  s #;
  #;
  LEXICON VerbSuffix
  s #;
  d #;
  #;
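
The way continuation classes chain sublexicons together can be sketched in Python by transcribing ex1-lex.txt into a dictionary and walking it recursively. The dictionary layout and the function expand are assumptions of this illustration; lexc itself builds networks rather than enumerating strings.

```python
# ex1-lex.txt transcribed as: lexicon name -> list of (form, continuation).
# An empty form stands for an entry that only names a continuation class.
LEXICON = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}

def expand(name="Root", prefix=""):
    """Yield one word per path from lexicon `name` to the end marker #."""
    if name == "#":
        yield prefix
        return
    for form, continuation in LEXICON[name]:
        yield from expand(continuation, prefix + form)

paths = list(expand())
words = sorted(set(paths))
print(len(paths), "paths,", len(words), "distinct words")
print(words)
```

There are more paths than distinct words because some forms are reachable both through Noun and through Verb.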

Running lexc
  lexc> compile-source ex1-lex.txt
  Opening 'ex1-lex.txt'...
  Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
  Building lexicon...Minimizing...Done!
  SOURCE: 6 states, 7 arcs, 6 words
  lexc>

lexc
The resulting lexicon contains the same six words.
The form "lines" actually gets constructed twice, once as a verb and once as a noun. After minimization, only one of them remains.
The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures into a single network, which is determinized and minimized.
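
The effect of determinizing and minimizing can be reproduced on this small example with a sketch that builds a trie (which is already deterministic and absorbs the duplicate forms) and then merges states that accept the same set of suffixes. This is a standard technique for acyclic automata and is not claimed to be the algorithm lexc uses; the function names are invented for the illustration.

```python
def build_trie(words):
    """Deterministic trie: trans[s] maps a symbol to the next state id."""
    trans, final = [{}], set()        # state 0 is the start state
    for word in words:
        state = 0
        for ch in word:
            if ch not in trans[state]:
                trans.append({})
                trans[state][ch] = len(trans) - 1
            state = trans[state][ch]
        final.add(state)
    return trans, final

def minimize(trans, final):
    """Merge states with identical suffix languages; return (states, arcs)."""
    registry = {}                     # state signature -> merged state id

    def canon(state):
        arcs = tuple(sorted((ch, canon(nxt))
                            for ch, nxt in trans[state].items()))
        signature = (state in final, arcs)
        return registry.setdefault(signature, len(registry))

    canon(0)
    n_arcs = sum(len(arcs) for _, arcs in registry)
    return len(registry), n_arcs

words = ["dine", "dines", "dined", "line", "lines", "lined"]
trans, final = build_trie(words)
print(len(trans), "trie states")
print(minimize(trans, final))         # compare with "6 states, 7 arcs" above
```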

Resulting FSA
[Figure: the minimized network, with arcs labeled d, l, i, n, e, s, d.]