Morphology and Finite-State Transducers. Why this chapter? Hunting for singular or plural of the word ‘woodchunks’ was easy, isn’t it? Lets consider words.

Slides:



Advertisements
Similar presentations
Finite-state automata and Morphology
Advertisements

Jing-Shin Chang1 Morphology & Finite-State Transducers Morphology: the study of constituents of words Word = {a set of morphemes, combined in language-dependent.
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Natural Language Processing Lecture 3—9/3/2013 Jim Martin.
Computational Morphology. Morphology S.Ananiadou2 Outline What is morphology? –Word structure –Types of morphological operation – Levels of affixation.
1 Morphology September 2009 Lecture #4. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
1 Morphology September 4, 2012 Lecture #3. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units.
Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.
Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011.
5/16/ ICS 482 Natural Language Processing Words & Transducers-Morphology - 1 Muhammed Al-Mulhem March 1, 2009.
Brief introduction to morphology
BİL711 Natural Language Processing1 Morphology Morphology is the study of the way words are built from smaller meaningful units called morphemes. We can.
6/2/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
6/10/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini.
LIN3022 Natural Language Processing Lecture 3 Albert Gatt 1LIN3022 Natural Language Processing.
Computational language: week 9 Finish finite state machines FSA’s for modelling word structure Declarative language models knowledge representation and.
Fall 2005 Lecture Notes #4 EECS 595 / LING 541 / SI 661 Natural Language Processing.
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
Learning Bit by Bit Class 3 – Stemming and Tokenization.
Morphology See Harald Trost “Morphology”. Chapter 2 of R Mitkov (ed.) The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP D Jurafsky &
Morphological analysis
Finite-State Morphology CMSC 723: Computational Linguistics I ― Session #3 Jimmy Lin The iSchool University of Maryland Wednesday, September 16, 2009.
CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg.
CS 4705 Lecture 3 Morphology: Parsing Words. What is morphology? The study of how words are composed from smaller, meaning-bearing units (morphemes) –Stems:
LING 438/538 Computational Linguistics Sandiway Fong Lecture 14: 10/12.
Finite State Transducers The machine model we will study for morphological parsing is called the finite state transducer (FST) An FST has two tapes –input.
CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg.
Introduction to English Morphology Finite State Transducers
Morphology 1 Introduction Morphology Morphological Analysis (MA)
Chapter 3. Morphology and Finite-State Transducers From: Chapter 3 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech.
Morphology (CS ) By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya.
Lecture 3, 7/27/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 4 28 July 2005.
October 2006Advanced Topics in NLP1 CSA3050: NLP Algorithms Finite State Transducers for Morphological Parsing.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
Introduction Morphology is the study of the way words are built from smaller units: morphemes un-believe-able-ly Two broad classes of morphemes: stems.
Finite State Automata and Tries Sambhav Jain IIIT Hyderabad.
Fall 2004 Lecture Notes #2 EECS 595 / LING 541 / SI 661 Natural Language Processing.
10/8/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
Session 11 Morphology and Finite State Transducers Introduction to Speech Natural and Language Processing (KOM422 ) Credits: 3(3-0)
Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model.
Lecture 3, 7/27/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 3 27 July 2005.
Finite State Transducers
Chapter 3: Morphology and Finite State Transducer Heshaam Faili University of Tehran.
Finite State Transducers for Morphological Parsing
Words: Surface Variation and Automata CMSC Natural Language Processing April 3, 2003.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 3 27 July 2007.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
CS 4705 Lecture 3 Morphology. What is morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)
CSA3050: Natural Language Algorithms Finite State Devices.
Natural Language Processing Chapter 2 : Morphology.
October 2007Natural Language Processing1 CSA3050: Natural Language Algorithms Words and Finite State Machinery.
1/11/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
CSA4050: Advanced Topics in NLP Computational Morphology II Introduction 2 Level Morphology.
October 2004CSA3050 NLP Algorithms1 CSA3050: Natural Language Algorithms Morphological Parsing.
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Lecture 7 Summary Survey of English morphology
Speech and Language Processing
Morphology: Parsing Words
CSCI 5832 Natural Language Processing
Speech and Language Processing
CSCI 5832 Natural Language Processing
LING 138/238 SYMBSYS 138 Intro to Computer Speech and Language Processing Dan Jurafsky 11/24/2018 LING 138/238 Autumn 2004.
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
CPSC 503 Computational Linguistics
CPSC 503 Computational Linguistics
CPSC 503 Computational Linguistics
Morphological Parsing
Presentation transcript:

Morphology and Finite-State Transducers

Why this chapter? Hunting for singular or plural of the word ‘woodchunks’ was easy, isn’t it? Lets consider words like Fox Goose Fish etc What is their plural form?

Knowledge required Two kinds of knowledge is required to search for singulars and plurals of these forms Spelling rules Words that end with ‘y’ changes to ‘i+es’ in the plural form Morphological rules Fish has null plural and the plural of goose is formed by changing the vowel

Morphological parsing. The problem of recognizing that foxes breaks down into the two morphemes fox and -es is called morphological parsing. Parsing means taking an input and producing some sort of structure for it Similar problem of mapping foxes into fox in the information retrieval domain: stemming

Morphological parsing contd… Applied to many affixes other than plurals Takes verb form ending in –ing( going,talking…) Parse it into verbal stem + -ing morpheme Given the surface or input form going, we might want to produce the parsed form: VERB-go + GERUND-ing

In this chapter Morphology Finite-State Transducers Finite-state transducers? The main component of an important algorithm for morphological parsing

Morphological parsing contd… It is quite inefficient to list all forms of noun and verb in the dictionary because the productivity of the forms. Productive suffix Applies to every verb Example –ing Morphological parsing is necessary more than just IR, but also Machine translation Spelling checking

Survey of English Morphology Morphology is the study of the way words are built up from smaller meaning-bearing units, morphemes. Two broad classes of morphemes: The stems: the “main” morpheme of the word, supplying the main meaning The affixes: add “additional” meaning of various kinds.

Affixes Affixes are further divided into prefixes, suffixes, infixes, and circumfixes. Suffix: eat-s Prefix: un-buckle Circumfix: ge-sag-t (said) (in German) Infix: hingi (borrow) + affix ‘um’ = h um ingi (in Philippine language )

Another classification of morphology Prefixes and suffixes are often called concatenative morphology. A word is composed of a number of morphemes concatenated together A number of languages have extensive non-concatenative morphology Morphemes are combined in a more complex way The Tagalog infixation example Another kind of non-concatenative morphology Templatic morphology or root-and-pattern morphology Common in Arabic, Hebrew, and other Semitic languages

Ways to form words from morphemes Two broad classes Inflection: the combination of a word stem with a grammatical morpheme usually resulting in a word of the same class as the original stem Derivation : the combination of a word stem with a grammatical morpheme usually resulting in a word of a different class, often with a meaning hard to predict exactly

Inflectional Morphology -NOUN In English, only nouns, verbs, and sometimes adjectives can be inflected the number of affixes is quite small. Inflections of nouns in English: An affix marking plural, cat(-s) ibis(-es), thrush(-es) waltz(-es), finch(-es), box(-es) butterfly(-lies) ox (oxen), mouse (mice) [irregular nouns]

An affix marking possessive Regular singular noun- llama’s Plural noun not ending in ‘s ‘-children’s Regular plural noun –llamas’ Names ending in ‘s’ or ‘z’ - Euripides’ comedies

Inflectional Morphology- VERB Verbal inflection is more complicated than nominal inflection. English has three kinds of verbs: Main verbs, eat, sleep, impeach Modal verbs, can will, should Primary verbs, be, have, do Of these verbs a large class are regular All verbs of this class have the same endings marking the same functions.

Morphological forms of regular verbs Have four morphological form Just by knowing the stem we can predict the other forms By adding one of the three predictable endings Making some regular spelling changes These regular verbs and forms are significant in the morphology of English because of their majority and being productive. stemwalkmergetrymap -s formwalksmergestriesmaps -ing principlewalkingmergingtryingmapping Past form or – ed participle walkedmergedtriedmapped

Morphological forms of irregular verbs Have five different forms But can have as many as eight (verb ‘be’) Or as few as three (verb ‘cut’ ) stemeatcatchcut -s formeatscatchescuts -ing principleeatingcatchingcutting Past formatecaughtcut – ed participle eatencaughtcut the simple form: be the -ing participle form: being the past participle: been the first person singular present tense form: am the third person present tense (-s) form: is the plural present tense form: are the singular past tense form: was the plural past tense form: were

Derivational Morphology Nominalization in English: The formation of new nouns, often from verbs or adjectives SuffixBase Verb/AdjectiveDerived Noun -ationcomputerize (V)computerization -eeappoint (V)appointee -erkill (V)killer -nessfuzzy (A)fuzziness SuffixBase Noun/VerbDerived Adjective -alcomputation (N)computational -ableembrace (V)embraceable -lessclue (A)clueless –Adjectives derived from nouns or verbs

Derivational Morphology Derivation in English is more complex than inflection because It is generally less productive A nominalizing affix like –ation can not be added to absolutely every verb. eatation(*) There are subtle and complex meaning differences among nominalizing suffixes. For example, sincerity has a subtle difference in meaning from sincereness.

Finite-State Morphological Parsing The problem of parsing English morphology Aim is to take input forms in the first column and produce output forms in the second column InputMorphological parsed output cats cat cities geese goose gooses merging caught cat +N +PL cat +N +SG city +N +PL goose +N +PL ( goose +N +SG ) or ( goose +V ) goose +V +3SG merge +V +PRES-PART ( caught +V +PAST-PART ) or ( catch +V +PAST ) Stems and morphological features Input form

To build a morphological parser: 1. Lexicon : the list of stems and affixes, together with basic information about them Eg., Noun stem or Verb stem, etc. 2. Morphotactics : the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. E.g., the rule that English plural morpheme follows the noun rather than preceding it. 3. Orthographic rules : these spelling rules are used to model the changes that occur in a word, usually when two morphemes combine E.g., the y→ie spelling rule changes city + -s to cities.

The Lexicon and Morphotactics A lexicon is a repository for words. The simplest one would consist of an explicit list of every word of the language. Incovenient or impossible! Computational lexicons are usually structured with a list of each of the stems and Affixes of the language together with a representation of morphotactics telling us how they can fit together. The most common way of modeling morphotactics is the finite-state automaton.

An FSA for English nominal inflection Reg-nounIrreg-pl-nounIrreg-sg-nounplural fox fat fog fardvark geese sheep Mice goose sheep mouse -s

3.2 Finite-State Morphological Parsing The Lexicon and Morphotactics Reg-verb-stemIrreg-verb-stemIrreg-past-verbpastPast-partPres-part3sg walk fry talk impeach cut speak sing sang spoken caught ate eaten -ed -ing-s An FSA for English verbal inflection

The Lexicon and Morphotactics English derivational morphology is more complex than English inflectional morphology, and so automata of modeling English derivation tends to be quite complex. Some even based on CFG A small part of morphosyntactics of English adjectives big, bigger, biggest cool, cooler, coolest, coolly red, redder, reddest clear, clearer, clearest, clearly, unclear, unclearly happy, happier, happiest, happily unhappy, unhappier, unhappiest, unhappily real, unreal, really An FSA for a fragment of English adjective Morphology #1

Finite-State Morphological Parsing An FSA for a fragment of English adjective Morphology #2 The FSA#1 recognizes all the listed adjectives, and ungrammatical forms like unbig, redly, and realest. Thus #1 is revised to become #2. The complexity is expected from English derivation.

Finite-State Morphological Parsing An FSA for another fragment of English derivational morphology

Finite-State Morphological Parsing We can now use these FSAs to solve the problem of morphological recognition : Determining whether an input string of letters makes up a legitimate English word or not We do this by taking the morphotactic FSAs, and plugging in each “sub-lexicon” into the FSA. The resulting FSA can then be defined as the level of the individual letter.

Morphological Parsing with FST Given the input, for example, cats, we would like to produce cat +N +PL. Two-level morphology, by Koskenniemi (1983) Representing a word as a correspondence between a lexical level Representing a simple concatenation of morphemes making up a word, and The surface level Representing the actual spelling of the final word. Morphological parsing is implemented by building mapping rules that maps letter sequences like cats on the surface level into morpheme and features sequence like cat +N +PL on the lexical level.

Morphological Parsing with FST The automaton we use for performing the mapping between these two levels is the finite-state transducer or FST. A transducer maps between one set of symbols and another; An FST does this via a finite automaton. Thus an FST can be seen as a two-tape automaton which recognizes or generates pairs of strings. The FST has a more general function than an FSA: An FSA defines a formal language An FST defines a relation between sets of strings. Another view of an FST: A machine reads one string and generates another.

Morphological Parsing with FST FST as recognizer: a transducer that takes a pair of strings as input and output accept if the string-pair is in the string-pair language, and a reject if it is not. FST as generator: a machine that outputs pairs of strings of the language. Thus the output is a yes or no, and a pair of output strings. FST as transducer: A machine that reads a string and outputs another string. FST as set relater: A machine that computes relation between sets.

Morphological Parsing with FST A formal definition of FST (based on the Mealy machine extension to a simple FSA): Q: a finite set of N states q 0, q 1,…, q N  : a finite alphabet of complex symbols. Each complex symbol is composed of an input-output pair i : o; one symbol I from an input alphabet I, and one symbol o from an output alphabet O, thus   I  O. I and O may each also include the epsilon symbol ε. q 0 : the start state F: the set of final states, F  Q  (q, i:o): the transition function or transition matrix between states. Given a state q  Q and complex symbol i:o  ,  (q, i:o) returns a new state q’  Q.  is thus a relation from Q   to Q.

Morphological Parsing with FST FSAs are isomorphic to regular languages, FSTs are isomorphic to regular relations. Regular relations are sets of pairs of strings, a natural extension of the regular language, which are sets of strings. FSTs are closed under union, but generally they are not closed under difference, complementation, and intersection. Two useful closure properties of FSTs: Inversion: If T maps from I to O, then the inverse of T, T -1 maps from O to I. Composition: If T 1 is a transducer from I 1 to O 1 and T 2 a transducer from I 2 to O 2, then T 1 。 T 2 maps from I 1 to O 2

Morphological Parsing with FST Inversion is useful because it makes it easy to convert a FST-as-parser into an FST- as-generator. Composition is useful because it allows us to take two transducers than run in series and replace them with one complex transducer. T 1 。 T 2 (S) = T 2 (T 1 (S) ) Reg-nounIrreg-pl-nounIrreg-sg-noun fox fat fog aardvark g o:e o:e s e sheep m o:i u:εs:c e goose sheep mouse

A transducer for English nominal number inflection T num

Morphological Parsing with FST The transducer T stems, which maps roots to their root-class

Morphological Parsing with FST A fleshed-out English nominal inflection FST T lex = T num 。 T stems ^: morpheme boundary #: word boundary

Orthographic Rules and FSTs Spelling rules (or orthographic rules ) NameDescription of RuleExample Consonant doubling E deletion E insertion Y replacement K insertion 1-letter consonant doubled before -ing/-ed Silent e dropped before -ing and -ed e added after -s, -z, -x, -ch, -sh, before -s -y changes to -ie before -s, -i before -ed Verb ending with vowel + -c add -k beg/begging make/making watch/watches try/tries panic/panicked –These spelling changes can be thought as taking as input a simple concatenation of morphemes and producing as output a slightly-modified concatenation of morphemes.

Orthographic Rules and FSTs “insert an e on the surface tape just when the lexical tape has a morpheme ending in x (or z, etc) and the next morphemes is -s” x ε  e/ s ^ s# z a  b / c d “ rewrite a as b when it occurs between c and d ”

Orthographic Rules and FSTs The transducer for the E-insertion rule

Combining FST Lexicon and Rules

The power of FSTs is that the exact same cascade with the same state sequences is used when machine is generating the surface form from the lexical tape, or When it is parsing the lexical tape from the surface tape. Parsing can be slightly more complicated than generation, because of the problem of ambiguity. For example, foxes could be fox +V +3SG as well as fox +N +PL

Lexicon-Free FSTs: the Porter Stemmer Information retrieval One of the mostly widely used stemmming algorithms is the simple and efficient Porter (1980) algorithm, which is based on a series of simple cascaded rewrite rules. ATIONAL  ATE (e.g., relational  relate) ING  εif stem contains vowel (e.g., motoring  motor) Problem: Not perfect: error of commision, omission Experiments have been made Some improvement with smaller documents Any improvement is quite small