Finite State Machinery: Xerox Tools (Advanced Topics in NLP, October 2006)


Slide 1. Finite State Machinery: Xerox Tools

Slide 2. Finite State Methods
Many domains of application:
– Tokenization
– Sentence breaking
– Spelling correction
– Morphology (analysis/generation)
– Phonological disambiguation (speech recognition)
– Morphological disambiguation ("tagging")
– Pattern matching ("named entity recognition")
– Shallow parsing

Slide 3. The Xerox Approach
Lauri Karttunen, Martin Kay, Ronald Kaplan, Kimmo Koskenniemi.
– Meta-languages for describing regular languages and regular relations
– A compiler for mapping meta-language "programs" into efficient FS machinery
– Several tools and applications

Slide 4. Xerox Tools
– xfst: Xerox Finite-State Tool
– lexc: Finite-State Lexicon Compiler
– twolc: Two-Level Rule Compiler

Slide 5. Xerox Tools
All of these applications are built around a central library, now written in C, called c-fsm. The library defines the data structures, provides the input/output routines, and implements the fundamental operations on finite-state networks. All of this is based on long-term Xerox research, originated by Ronald M. Kaplan and Martin Kay at PARC in the early 1980s.

Slide 6. Textbook
CSLI Publications, Studies in Computational Linguistics series. See also website.

Slide 7. xfst
xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers. xfst and the other Xerox tools employ a special "xfst notation" for regular expressions, which is more powerful than the notation used in Unix, Perl, C# etc.

Slide 8. Simple Regular Expressions
– Atomic expressions
– Complex expressions

Slide 9. Atomic Expressions
The simplest kind of RE is a symbol. Typically, a symbol is the sort of item that can appear on the arc of a network. For example, the symbol a is an RE that designates the language containing the string "a" and nothing else. Multicharacter symbols such as Plur are also symbols; they just happen to have multicharacter print names.

Slide 10. Special Atomic Expressions
The epsilon (ε) symbol 0 denotes the empty-string language {""}. The ANY symbol ? denotes the language of all single-symbol strings. The empty string is not included in ?.

Slide 11. Complex REs: Union
If A and B are arbitrary REs, [A | B] is the union of A and B; it denotes the union of the languages denoted by A and B respectively. If A is an arbitrarily complex RE, [A] is equivalent to A.
Checkpoint: write down the strings in the language denoted by [a | b | ab].
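As a rough illustration (in Python, not in xfst), a finite language can be modelled as a set of strings, so that union of REs corresponds to set union. The variable names are illustrative only, and the set model is an assumption made for this sketch: xfst compiles REs into networks rather than enumerating strings.

```python
# Each finite language is modelled as a Python set of strings.
lang_a = {"a"}    # language denoted by the symbol a
lang_b = {"b"}    # language denoted by the symbol b
lang_ab = {"ab"}  # language denoted by the multicharacter symbol ab

# [a | b | ab] denotes the union of the three languages.
union = lang_a | lang_b | lang_ab
print(sorted(union))
```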

Slide 12. Complex REs: Intersection
If A and B are arbitrary REs, [A & B] is the intersection of A and B; it denotes the intersection of the languages denoted by A and B respectively.
Checkpoint: write down the strings in the language denoted by [a | b | c | d | e] & [d | e | f | g].
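The checkpoint can be checked with a small Python sketch that again treats each language as a set of strings (an assumption of the sketch, not how xfst represents networks). The names `left`, `right` and `both` are hypothetical.

```python
# Languages denoted by the two operands of the checkpoint expression.
left = {"a", "b", "c", "d", "e"}   # [a | b | c | d | e]
right = {"d", "e", "f", "g"}       # [d | e | f | g]

# [A & B] keeps only the strings that belong to both languages.
both = left & right
print(sorted(both))
```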

Slide 13. Complex REs: Concatenation
If A and B are arbitrary REs, [A B] is the concatenation of A and B.
Checkpoint: note the difference between
– [d o g]
– dog
– [d og]

Slide 14. Concatenation over Regular Expressions and Languages
Regular expressions:
  E1 = [a|b]
  E2 = [c|d]
  E1 E2 = [a|b] [c|d]
Languages:
  L1 = {"a", "b"}
  L2 = {"c", "d"}
  L1 L2 = {"ac", "ad", "bc", "bd"}
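The language-level view of concatenation can be sketched in Python as every string of L1 followed by every string of L2. The helper name `concat` is hypothetical, and the set-of-strings model is an assumption of the sketch.

```python
from itertools import product

def concat(l1, l2):
    """Concatenate two finite languages: each x in l1 followed by each y in l2."""
    return {x + y for x, y in product(l1, l2)}

L1 = {"a", "b"}   # language of E1 = [a|b]
L2 = {"c", "d"}   # language of E2 = [c|d]

L1L2 = concat(L1, L2)
print(sorted(L1L2))
```

Note that concatenation is not commutative: `concat(L2, L1)` yields "ca", "cb", "da", "db" instead.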

Slide 15. Concatenation over FS Automata
[Figure: the automata for [a|b] and [c|d], and the automaton for their concatenation.]

Slide 16. Complex REs: Closures
A+ denotes the concatenation of A with itself one or more times. A* (Kleene star) denotes [A+ | 0], that is, zero or more repetitions of A.
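Both closures denote infinite languages whenever A contains a non-empty string, so the Python sketch below only enumerates them up to a fixed number of repetitions. The functions `plus` and `star` and the bound `max_repeats` are hypothetical names introduced for illustration; the point is that A* is A+ with the empty string added.

```python
def concat(l1, l2):
    """Concatenate two finite languages."""
    return {x + y for x in l1 for y in l2}

def plus(lang, max_repeats):
    """Approximate A+: lang repeated 1..max_repeats times."""
    result = set()
    power = {""}
    for _ in range(max_repeats):
        power = concat(power, lang)   # one more copy of lang
        result |= power
    return result

def star(lang, max_repeats):
    """Approximate A*: A+ together with the empty string, i.e. [A+ | 0]."""
    return plus(lang, max_repeats) | {""}

A = {"ab"}
print(sorted(plus(A, 3)))
print(sorted(star(A, 3)))
```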

Slide 17. Other Operations
Minus: [A - B] denotes the set difference of the languages denoted by A and B ([A - B] = [A & ~B]).
Checkpoint: what is the language denoted by [dog | cat | elephant] - [elephant | horse | cow]?
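A short Python sketch of the checkpoint, again modelling languages as sets of strings. The true complement of B is infinite, so the sketch takes the complement relative to a small finite universe; that universe is an assumption made purely so the identity with intersection can be shown.

```python
animals = {"dog", "cat", "elephant"}    # [dog | cat | elephant]
others = {"elephant", "horse", "cow"}   # [elephant | horse | cow]

# [A - B]: strings in A that are not in B.
difference = animals - others
print(sorted(difference))

# The same result via intersection with a complement, with the complement
# taken relative to a finite universe (an assumption of this sketch).
universe = animals | others
complement_of_others = universe - others
print(sorted(animals & complement_of_others))
```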

Slide 18. Some Other Conventions
  A*    Closure (Kleene star)
  (A)   Optional element
  ?     Any symbol
  \b    Any symbol other than b
  ~A    Complement (= [?* - A])
  0     Empty-string language
  $A    [?* A ?*]
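Three of these conventions can be approximated in Python over a small, hypothetical alphabet, with ?* truncated to strings of bounded length. `ALPHABET`, `MAX_LEN` and the function names are assumptions of this sketch; in xfst the ANY symbol ranges over all symbols, and none of these languages is enumerated explicitly.

```python
from itertools import product

ALPHABET = ["a", "b", "c"]   # hypothetical stand-in for the symbols matched by ?
MAX_LEN = 3                  # truncation bound for ?*

# ?* truncated to strings of length 0..MAX_LEN.
SIGMA_STAR = {"".join(p)
              for n in range(MAX_LEN + 1)
              for p in product(ALPHABET, repeat=n)}

def term_complement(symbol):
    """\\b: any single symbol other than the given one."""
    return {s for s in ALPHABET if s != symbol}

def complement(lang):
    """~A = [?* - A], within the truncated universe."""
    return SIGMA_STAR - lang

def contains(lang):
    """$A = [?* A ?*]: strings with some member of lang as a substring."""
    return {s for s in SIGMA_STAR if any(w in s for w in lang)}

print(sorted(term_complement("b")))
print(sorted(contains({"ab"})))
```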

Slide 19. Simple Commands
In addition to the RE language there are also commands:
– define: give a name to an RE
– print: print information
– read: read information
– various stack operations
– file interaction
– various command-line options

Slide 20. define command
define name regexp
  xfst[0]: define foo [d o g] | [c a t];
  xfst[0]: define R1 [a | b | c | d];
  xfst[0]: define R2 [d | e | f | g];
  xfst[0]: define R3 [f | g | h | i | j];

Slide 21. print command
print words name: see the words in the language called name.
print net name: see detailed information about the network called name.
  xfst[0]: print words foo;
  xfst[0]: define baz R1 & R2;
  xfst[0]: print net baz;

Slide 22. Exercise
Compute the words in
– R1 minus R2
– R2 intersect R1
Define a network that contains the words "eeny", "meeny", "miny", "mo".
Determine how many states there are in each result.

Slide 23. Basic Stack Operations
  read regex    push a network onto the stack
  print stack   list the items on the stack
  print net     detailed info on the top stack item
  pop stack     remove the top item from the stack
  define name   set name to the value of the top stack item

Slide 24. Stack Operations
Normally the stack is first loaded with suitable arguments, and then a command requiring N arguments is issued. The arguments are popped from the stack, the operation is performed, and the result is written back onto the stack. For correct results, items should be pushed onto the stack in reverse order.

Slide 25. Stack Demo 1
  xfst[0]: clear stack;
  xfst[0]: read regex [d | c | e | b | w];
  xfst[1]: read regex [b | s | h | w];
  xfst[2]: read regex [s | d | c | f | w];
  xfst[3]: print stack
  xfst[3]: intersect net
  xfst[1]: print stack
  xfst[1]: print net
  xfst[1]: print words
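The stack discipline in this demo can be mimicked with a Python list of sets. This is only a sketch: `read_regex` and `intersect_net` are hypothetical helpers named after the xfst commands, and the assumption that the intersection consumes every network on the stack is read off the prompt counter dropping from 3 to 1 in the demo.

```python
stack = []   # the last element is the top of the stack

def read_regex(words):
    """Push a finite language (a set of strings) onto the stack."""
    stack.append(set(words))

def intersect_net():
    """Pop all networks, intersect them, and push the single result."""
    result = stack.pop()
    while stack:
        result &= stack.pop()
    stack.append(result)

read_regex(["d", "c", "e", "b", "w"])
read_regex(["b", "s", "h", "w"])
read_regex(["s", "d", "c", "f", "w"])
print(len(stack))        # three networks on the stack

intersect_net()
print(len(stack))        # one network left
print(sorted(stack[-1]))
```

Only "w" occurs in all three languages, so it is the only word left in the result.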

Slide 26. Stack Exercise 2
  xfst[0]: clear stack;
  xfst[0]: read regex [e d | i n g | s | []];
  xfst[1]: read regex [t a l k | k i c k];
  xfst[2]: print stack
  xfst[2]: print net
  xfst[2]: print words
  xfst[2]: concatenate net
  xfst[1]: print words
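For a non-commutative operation such as concatenation, the order of the stack matters. The Python sketch below assumes that the top item becomes the left operand, which is consistent with the advice to push arguments in reverse order: the suffixes go on first and the stems last. The helper names are hypothetical and mirror the xfst commands only loosely.

```python
stack = []   # the last element is the top of the stack

def read_regex(words):
    """Push a finite language (a set of strings) onto the stack."""
    stack.append(set(words))

def concatenate_net():
    """Pop all networks and concatenate them from the top downward."""
    result = stack.pop()                 # top item: leftmost operand
    while stack:
        lower = stack.pop()
        result = {x + y for x in result for y in lower}
    stack.append(result)

read_regex(["ed", "ing", "s", ""])   # suffixes, pushed first ("" plays the role of [])
read_regex(["talk", "kick"])         # stems, pushed last, so on top

concatenate_net()
print(sorted(stack[-1]))
```

Pushing the stems first and the suffixes last would, under the same assumption, produce strings such as "edtalk" instead.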

Slide 27. lexc
[Figure: source file → lexc → compiled network.]
lexc is a high-level programming language and compiler that is well suited for defining NL lexicons. The output is a compiled FS network in a format identical to that of the other Xerox tools (xfst, twolc).

Slide 28. lexc source file
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
  ! ex0-lex.txt
  LEXICON Root
  dine #;
  dines #;
  dined #;
  line #;
  lines #;
  lined #;
  END
  !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Slide 29. lexc
  ! ex1-lex.txt
  LEXICON Root
  Noun;
  Verb;
  LEXICON Noun
  line NounSuffix;
  LEXICON Verb
  dine VerbSuffix;
  line VerbSuffix;
  LEXICON NounSuffix
  s #;
  #;
  LEXICON VerbSuffix
  s #;
  d #;
  #;
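The way sublexicons chain through continuation classes can be sketched in Python by transcribing ex1-lex.txt into a dictionary and following the continuations recursively. The data layout and the `expand` function are assumptions made for illustration; they say nothing about how lexc is actually implemented.

```python
# Each entry is (form, continuation class); "#" marks the end of a word.
LEXICONS = {
    "Root": [("", "Noun"), ("", "Verb")],
    "Noun": [("line", "NounSuffix")],
    "Verb": [("dine", "VerbSuffix"), ("line", "VerbSuffix")],
    "NounSuffix": [("s", "#"), ("", "#")],
    "VerbSuffix": [("s", "#"), ("d", "#"), ("", "#")],
}

def expand(lexicon="Root", prefix=""):
    """Yield every word reachable from `lexicon` by following continuations."""
    if lexicon == "#":
        yield prefix
        return
    for form, continuation in LEXICONS[lexicon]:
        yield from expand(continuation, prefix + form)

paths = list(expand())
words = sorted(set(paths))
print(len(paths), "paths,", len(words), "distinct words")
print(words)
```

The lexicon has more paths than distinct words because "line" and "lines" are reachable both as a noun and as a verb.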

Slide 30. Running lexc
  lexc> compile-source ex1-lex.txt
  Opening 'ex1-lex.txt'...
  Root...2, Noun...1, Verb...2, NounSuffix...2, VerbSuffix...3
  Building lexicon...Minimizing...Done!
  SOURCE: 6 states, 7 arcs, 6 words
  lexc>

Slide 31. lexc
The resulting lexicon contains the same six words. The form lines actually gets constructed twice, once as a verb and once as a noun. After minimization, only one of them remains. The compiler first processes each sublexicon separately, keeping track of continuation pointers, and then joins the structures into a single network, which is determinized and minimized.
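The effect of minimization on this small lexicon can be reproduced with a Python sketch that builds a trie from the six words and then merges states that accept the same set of suffixes. This is a generic technique for acyclic automata, used here as a stand-in; the slides do not describe the algorithm lexc itself uses, and all function names are hypothetical.

```python
def build_trie(words):
    """Return (transitions, finals); state 0 is the start state."""
    transitions = [{}]   # transitions[state][symbol] -> next state
    finals = set()
    for word in words:
        state = 0
        for symbol in word:
            if symbol not in transitions[state]:
                transitions.append({})
                transitions[state][symbol] = len(transitions) - 1
            state = transitions[state][symbol]
        finals.add(state)
    return transitions, finals

def minimize(transitions, finals):
    """Merge states with identical suffix languages; return (states, arcs)."""
    registry = {}   # signature -> canonical state id

    def canonical(state):
        # Children are canonicalized first, so two states get the same
        # signature exactly when they accept the same suffixes.
        signature = (
            state in finals,
            tuple(sorted((sym, canonical(nxt))
                         for sym, nxt in transitions[state].items())),
        )
        return registry.setdefault(signature, len(registry))

    canonical(0)
    n_arcs = sum(len(arcs) for _, arcs in registry)
    return len(registry), n_arcs

words = ["dine", "dines", "dined", "line", "lines", "lined"]
trie = build_trie(words)
print(len(trie[0]), "states before minimization")
print(minimize(*trie), "(states, arcs) after minimization")
```

For these six words the sketch arrives at the same size that lexc reports for ex1-lex.txt: 6 states and 7 arcs.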

Slide 32. Resulting FSA
[Figure: the minimized network, with arcs labelled d, l, i, n, e, s, d.]