Finite State Morphology

Slides:



Advertisements
Similar presentations
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Advertisements

LING/C SC/PSYC 438/538 Lecture 11 Sandiway Fong. Administrivia Homework 3 graded.
Finite-State Transducers: Applications in Natural Language Processing Heli Uibo Institute of Computer Science University of Tartu
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 2 Mälardalen University 2005.
1 1 CDT314 FABER Formal Languages, Automata and Models of Computation Lecture 3 School of Innovation, Design and Engineering Mälardalen University 2012.
Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.
October 2006Advanced Topics in NLP1 Finite State Machinery Xerox Tools.
Regular Languages Sequential Machine Theory Prof. K. J. Hintz Department of Electrical and Computer Engineering Lecture 3 Comments, additions and modifications.
Lecture 3: Closure Properties & Regular Expressions Jim Hook Tim Sheard Portland State University.
CSC 3130: Automata theory and formal languages Andrej Bogdanov The Chinese University of Hong Kong Regular.
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
Finite state automaton (FSA)
1 Finite state automaton (FSA) LING 570 Fei Xia Week 2: 10/07/09 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA.
1 Foundations of Software Design Lecture 23: Finite Automata and Context-Free Grammars Marti Hearst Fall 2002.
Normal forms for Context-Free Grammars
Regular Expressions and Automata Chapter 2. Regular Expressions Standard notation for characterizing text sequences Used in all kinds of text processing.
Introduction to English Morphology Finite State Transducers
October 2006Advanced Topics in NLP1 CSA3050: NLP Algorithms Finite State Transducers for Morphological Parsing.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
October 2004CSA3050 NL Algorithms1 CSA3050: Natural Language Algorithms Words, Strings and Regular Expressions Finite State Automota.
Introduction to CS Theory Lecture 3 – Regular Languages Piotr Faliszewski
1 Regular Expressions. 2 Regular expressions describe regular languages Example: describes the language.
Lecture # 3 Chapter #3: Lexical Analysis. Role of Lexical Analyzer It is the first phase of compiler Its main task is to read the input characters and.
Finite State Transducers for Morphological Parsing
Human Language Technology Finite State Transducers.
1 CD5560 FABER Formal Languages, Automata and Models of Computation Lecture 3 Mälardalen University 2010.
2. Regular Expressions and Automata 2007 년 3 월 31 일 인공지능 연구실 이경택 Text: Speech and Language Processing Page.33 ~ 56.
An Ambiguity-Controlled Morphological Analyzer for Modern Standard Arabic By: Mohammed A. Attia Abbas Al-Julaih Natural Language Processing ICS.
November 2003CSA4050: Computational Morphology IV 1 CSA405: Advanced Topics in NLP Computational Morphology IV: xfst.
Closure Properties Lemma: Let A 1 and A 2 be two CF languages, then the union A 1  A 2 is context free as well. Proof: Assume that the two grammars are.
Natural Language Processing Chapter 2 : Morphology.
October 2007Natural Language Processing1 CSA3050: Natural Language Algorithms Words and Finite State Machinery.
CSA4050: Advanced Topics in NLP Computational Morphology II Introduction 2 Level Morphology.
UNIT - I Formal Language and Regular Expressions: Languages Definition regular expressions Regular sets identity rules. Finite Automata: DFA NFA NFA with.
November 2003Computational Morphology III1 CSA405: Advanced Topics in NLP Xerox Notation.
October 2004CSA3050 NLP Algorithms1 CSA3050: Natural Language Algorithms Morphological Parsing.
November 2003Computational Morphology VI1 CSA4050 Advanced Topics in NLP Non-Concatenative Morphology – Reduplication – Interdigitation.
CSCI 4325 / 6339 Theory of Computation Zhixiang Chen Department of Computer Science University of Texas-Pan American.
LECTURE 5 Scanning. SYNTAX ANALYSIS We know from our previous lectures that the process of verifying the syntax of the program is performed in two stages:
CS412/413 Introduction to Compilers Radu Rugina Lecture 3: Finite Automata 25 Jan 02.
Lecture 2 Compiler Design Lexical Analysis By lecturer Noor Dhia
Finite State Morphology Alexander Fraser & Luisa Berlanda CIS, Ludwig-Maximilians-Universität München Computational Morphology.
Finite State Morphology Alexander Fraser & Luisa Berlanda CIS, Ludwig-Maximilians-Universität München Computational Morphology.
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
Finite State Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
CIS Automata and Formal Languages – Pei Wang
BİL711 Natural Language Processing
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Pushdown Automata.
Languages.
Speech and Language Processing
Composition is Our Friend
PDAs Accept Context-Free Languages
Two issues in lexical analysis
CSCI 5832 Natural Language Processing
CSCI 5832 Natural Language Processing
LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Dan Jurafsky 11/24/2018 LING 138/238 Autumn 2004.
4. Properties of Regular Languages
CSC NLP - Regex, Finite State Automata
Chapter Nine: Advanced Topics in Regular Languages
4b Lexical analysis Finite Automata
Building Finite-State Machines
4b Lexical analysis Finite Automata
CPSC 503 Computational Linguistics
Instructor: Aaron Roth
Lecture 5 Scanning.
Morphological Parsing
What is it? The term "Automata" is derived from the Greek word "αὐτόματα" which means "self-acting". An automaton (Automata in plural) is an abstract self-propelled.
Statistical NLP Winter 2009
Presentation transcript:

Finite State Morphology Alexander Fraser fraser@cis.uni-muenchen.de CIS, Ludwig-Maximilians-Universität München Computational Morphology and Electronic Dictionaries SoSe 2017 2017-05-15

Outline Today we will cover finite state morphology more formally We'll review basic concepts from the first lecture and from the exercises And define operations in finite state more formally We will then show how to convert regular expressions to finite state automata

Credits Credits: Slides mostly adapted from: Finite State Morphology Helmut Schmid U. Tübingen - Summer Semester 2015 Thanks also to Kemal Oflazer and Lauri Kartunnen

Review: Computational Morphology examines word formations processes provides analyses of word forms such as Tarifverhandlungen: Tarif<NN>verhandeln<V>ung<SUFF><+NN><Fem><Nom><Pl> splits word forms into roots and affixes provides information on part-of-speech such as NN, V canonical forms such as „verhandeln“ morphosyntactic properties such as Fem, Nom, Pl

Terminology word form word as it appears in a running text: weitergehst lemma citation as listed in a dictionary: weitergehen stem part of a word to which derivational of inflectional affixes are attached: weitergeh root stem which cannot be further analysed: geh morpheme smallest morphological units (stems, affixes): weiter, geh, en

Word Formation Processes Inflection Derivation Compounding

Inflection modifies a word in order to express different grammatical categories such as tense, mood, voice, aspect, person, number, gender, case verbal inflection: conjugation walks, walked, walking nominal inflection: declension computers usually realised by prefixation suffixation circumfixation ge+hab+t infixation auf+zu+machen (not a perfect example) reduplication: orang+orang (plural of „man“ in Indonesian)

Derivation creates new words Examples: un+translat+abil+ity piti+less-ness changes the part-of-speech and/or meaning of the word adds prefixes, suffixes, circumfixes conversion: changes the part-of-speech without modifying the word book (N) → book (V) leid(en) (V) → Leid (N) templatic morphology in Arabic ktb + CVCCVC + (a,a) -> kattab (write)

Compounding creates new words by combining several stems example: Donau-dampf-schiff-fahrts-gesellschaft very productive in German affixoid compounding process that turns into a derivation process Gas+werk, Stück+werk, Laub+werk schul+frei, schulter+frei, schulden+frei → no absolute boundary between compounding and derivation

Classification of Languages isolating: Chinese, Vietnamese little or no derivation and inflection analytic: Chinese, English little or no inflection synthetic agglutinative: Finnish, Turkish, Hungarian, Swahili morphemes are concatenated with little modification each affix usually encodes a single feature fusional (inflecting): Sanskrit, Latin, Russian, German inflectional affixes often encode a feature bundle: les+e (1 sg pres)

Productivity productive process new word forms can easily be created use+less, hope+less, point+less, beard+less unproductive process: morphological process which is no longer active streng+th, warm+th, dep+th

Morphotactics Which morphemes can be arranged in which order? translat+abil+ity *translat+ity+abil translat+able *translat+able+ity (Allomorphs able-abil)

Orthographic/Phonological Rules How is a morpheme realised in a certain context? city+s → cities bake+ing → baking (e-elision) crash+s → crashes (e-epenthesis) beg+ing → begging (gemination) ad+simil+ate → assimilate (assimilation) ip+lEr → ipler kız+lEr → kızlar (vowel harmony)

Morphological Ambiguity leaves hanged hung leaf+N+pl leave+N+pl leave+V+3+sg hang+V+past

Ingredients of a Morph. Analyser List of roots with part-of-speech List of derivational affixes morphotactic rules orthographic (phonological) rules

Computational Morphology analyses and/or generates word forms analysis Abteilungen → Abteilung<NN><Fem><Nom><Pl> Abteilung<NN><Fem><Acc><Pl> … ab<VPART>teilen<V>ung<NNSuff><Fem><Acc><Pl> … Abtei<NN> Lunge<NN><Fem><Nom><Pl> … Abt<NN> Ei<NN> Lunge<NN><Fem><Nom><Pl> … Abt<NN> eilen<V> ung<NNSuff><Fem><Nom><Pl> … generation sichern<+V><1><Sg><Pres><Ind> → sichere, sichre

Implementation using a mapping table works reasonably well for languages such as English, Chinese algorithmic more suitable for languages with complex morphology such as Turkish or Czech finite state transducers simple, well understood, efficient, bidirectional (analysis & generation)

Short History 1968 Chomsky & Halle propose ordered context-sensitive rewrite rules x → y / w _ z (replace x by y in the context w … z) 1972 C. Douglas Johnson discovers that ordered rewrite rules can be implemented with a cascade of FSTs if the rules are never applied to their own output 1961 Schützenberger proved that 2 sequential transducers (where the output of the first forms the input of the second) can be replaced by a single transducer. 1980 Kaplan & Kay rediscover the findings of Johnson and Schützenberger 1983 Kimmo Koskenniemi invents 2-level-morphology 1987 Karttunen & Koskenniemi implement the first FST compiler based on Kaplan’s implementation of the finite-state calculus

Finite State Automaton directed graph with labelled transitions, a start state and a set of final states w a l k i n g recognises walk, walks, walked, walking, talk, talks, talked, talking s e d t

Finite State Automaton FSAs are isomorphic to regular expressions and regular grammars. All of them define a regular language. regular expression: (w|t)alk(s|ed|ing)? regular grammar: S → w A B → s B → S → t A B → e d A → a l k B B → i n g both equivalent to the automaton on the previous slide

Finite State Automaton FSAs are isomorphic to regular expressions and regular grammars. All of them define a regular language. regular expression: (w|t)alk(s|ed|ing)? regular grammar: S → w A B → s S → t A B → e d A → a l k B B → i n g B → Both are equivalent to the automaton on the previous slide context-free regular type 0 context- sensitive

Operations on FSAs Concatenation A B Optionality A? = (|A) Kleene‘s star A* = (|A|AA|AAA|…) Disjunction A | B Conjunction A & B Complement !A Subtraction A – B = A & !B Reversal

From Regular Expressions to FSAs single symbol a Create a new start state and a new end state Add a transition from the start to the end state labelled „a“ a 1 2

From Regular Expressions to FSAs Concatenation A B add epsilon transition from final state of A to start state of B make final state of B the new final state 1 2 ε 1 2 3 4 3 4

From Regular Expressions to FSAs Optionality A? add an epsilon transition from start to end state ε 1 2 1 2

From Regular Expressions to FSAs Kleene‘ star A* add an epsilon transition from end to start state make start state the new end state ε 1 2 1 2

From Regular Expressions to FSAs Disjunction A B new start state with epsilon transitions to the old start states new final state with epsilon transitions from the old final states 1 2 ε 1 2 ε 5 6 ε ε 3 4 3 4

From Regular Expressions to FSAs Reversal reverse all transitions swap start and end state 1 2 1 2

From Regular Expressions to FSAs Conjunction A & B I'm skipping the details of conjunction (see the Appendix for the algorithm) Basically, we can automatically create a new FSA that essentially runs both acceptors in parallel Our new FSA only accepts if both FSAs are in the accept state Clearly the FSA A&B then only accepts strings that are in the regular languages accepted by both FSAs (FSA A and FSA B)

Properties of FSAs epsilon-free no transition is labelled with the empty string epsilon deterministic epsilon-free and no two transitions originating in the same state have the same label minimal no other automaton has a smaller number of states

Properties of FSAs II We can algorithmically construct a new FSA from the old FSA such that it is: epsilon-free deterministic minimal See the Appendix for the algorithms

Conclusion: Finite State Acceptors Any regular expression can be mapped to a finite state acceptor However, "regexes" in Python are misnamed! "Regexes" contain more powerful constructs than mathematical regular expressions For instance /(.+)\1/ However, these constructs are not used much See EN Wikipedia page on regular expressions, subsection "Regular expressions in programming languages" for details We will now move on to finite state transducers

Finite State Transducers FSTs are FSAs whose transitions are labelled with symbol pairs They map strings to (sets of) other strings maps walk, walks, walked, walking to walk and talk, talks, talked, talking to talk (in generation mode) can also map walk to walk, walks, walked, walking in analysis mode s: w:w a:a l:l k:k i: n: g: t:t e: d:

FSTs and Regular Expressions Single symbol mapping a:b Operations on FSTs Concatenation, Kleene's star, disjunction, conjunction, complement (from FSAs) composition A || B The output of transducer A is the input of transducer B. projection upper language replaces transition label a:b by b:b lower language replaces transition label a:b by a:a The result corresponds to an automaton a:b

Weighted Transducers A weighted FST assigns a numerical weight to each transition The total weight of a string-to-string mapping is the sum of the weights on the corresponding path from start to end state. Weighted FSTs allow disambiguation between different analyses by choosing the one with the smallest (or largest) weight

Working with FSTs FSTs can be specified by means of regular expressions (like FSAs). The translation is performed by a compiler. Using the same algorithms as for FSA FSTs can be made epsilon-free in the sense that no transition is labelled with ε:ε (a pair of empty string symbols) FSTs can be made deterministic in the sense that no two transitions originating in the same state have the same label pair FSTs can be minimised in the sense that no other FST which produces the same regular relation with the same input-output alignment is smaller. (There might be a smaller transducer producing the same relation with a different alignment.) FSTs can be used in both directions (generation and analysis)

FST Toolkits Some FST toolkits Xerox finite-state tools xfst and lexc well-suited for building morphological analysers foma (Mans Hulden) open-source alternative to xfst/lexc AT&T tools weighted transducers for tasks such as speech recognition little support for building morphological analysers openFST (Google, NYU) open-source alternative to the AT&T tools SFST open-source alternative to xfst/lexc but using a more general and flexible programming language

SFST programming language for developing finite-state transducers compiler which translates programs to transducers tools for applying transducers printing transducers comparing transducers

SFST Example Session > echo "Hello\ World\!" > test.fst storing a small test program > fst-compiler test.fst test.a calling the compiler test.fst: 2 > fst-mor test.a interactive transducer usage reading transducer... transducer is loaded finished. analyze> Hello World! input Hello World! recognised analyze> Hello World another input no result for Hello World not recognised analyze> q terminate program

SFST Programming Language Colon operator a:b empty string symbol <> Example: m:m o:i u:<> s:c e:e identity mapping a (an abbreviation for a:a) Example: m o:i u:<> s:c e {abc}:{AB} is expanded to a:A b:B c:<> Example: {mouse}:{mice}

Disjunction John | Mary | James accepts these three strings and maps them onto themselves mouse | {mouse}:{mice} analyses mouse and mice as mouse note that analysis here maps lower language (mice) to upper language (mouse), i.e., implements lemmatization Generation goes in the opposite direction

Multi-Character Symbols strings enclosed in <…> are treated as a single unit. {mouse<N><pl>}:{mice} analyzes mice as mouse<N><pl>

Multi-Character Symbols A more complex example: schreib {<V><pres>}:{} (\ {<1><sg>}:{e} |\ {<2><sg>}:{st} |\ {<3><sg>}:{t} |\ {<1><pl>}:{en} |\ {<2><pl>}:{t} |\ {<3><pl>}:{en}) The backslashes (\) indicate that the expression continues in the next line What is the analysis of schreibst and schreiben?

Conclusion: Finite State Morphology Talked about finite state morphology in a more formal way Showed how to convert regular expressions to finite state automata Talked about finite state transducers for computational morphology Morphological analysis and generation

Thank you for your attention

Appendix Details of Conjunction of FSAs Algorithms for Determinisation, Composition and Minimisation of FSAs

From Regular Expressions to FSAs Conjunction A & B The new state space Q is the Kartesian product of the old state spaces Q1 and Q2, i.e. Q = {(a,b)| a Q1 &bQ2} The new start state is the pair of the old start states. The new final state is the pair of the old final states A transition labelled a exists from new state (a,b) to new state (c,d) iff a transition labelled a exists from a to c in A and from b to d in B, i.e. (a,b) → (c,d) iff a → c and b → d

Determinisation of FSAs The new state set is the powerset of the old state set (set of all subsets). The new start state is the epsilon-closure of the old start state (i.e. the start state + all states reachable from it via epsilon transitions) There is a transition from state q to r labelled a iff there is a transition labelled a from some old state a in q to some old state b in r. The set of final states comprises all states q which contain an old final state a.

Composition of FSAs First, make the two FSAs deterministic. The new state set is then the Kartesian product of the two old state sets The new start state is the pair consisting of the two old start states There is a transition from state (a,b) to state (c,d) labelled x:z iff there is some transition labelled x:y from state a to state c and a transition labelled y:z from state b to state d The final state set comprises all state pairs (a,b) where both a and b are old final states.

Minimisation of FSAs Minimisation of A a simple (but inefficient) minimisation algorithm determinise reverse