Morphological Parsing CS 4705
Parsing
  Taking a surface input and analyzing its components and underlying structure
  Morphological parsing: taking a word or string of words as input and identifying their stems and affixes (and sometimes interpreting these)
  E.g.:
    goose → goose +N +SG or goose +V
    geese → goose +N +PL
    gooses → goose +V +3SG
  Bracketing: indecipherable → [in [[de [cipher]] able]]
    cipher: from Middle French cifre, from Medieval Latin cifra, from Arabic ṣifr (zero, nothing)
Why ‘parse’ words?
  To find stems
    Simple key to word similarity
    yellow, yellowish, yellows, yellowed, yellowing…
  To find affixes and the information they convey
    ‘ed’ signals a verb
    ‘ish’ an adjective
    ‘s’?
  Morphological parsing provides information about a word’s semantics and the syntactic role it plays in a sentence
Some Practical Applications
  For spell-checking
    Is muncheble a legal word?
  To identify a word’s part-of-speech (POS)
    For sentence parsing, for machine translation, …
  To identify a word’s stem
    For information retrieval
  Why not just list all word forms in a lexicon?
What do we need to build a morphological parser?
  Lexicon: list of stems and affixes (with their corresponding parts of speech)
  Morphotactics of the language: model of how and which morphemes can be affixed to a stem
  Orthographic rules: spelling modifications that may occur when affixation occurs
    in- → il- in the context of l (in- + legal → illegal)
  Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes
Using FSAs to Represent English Plural Nouns
  English nominal inflection
  [Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-sg-n, irreg-pl-n]
  Inputs: cats, geese, goose
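The plural-noun FSA can be sketched in a few lines of Python. This is a minimal sketch, not the course's implementation: the arc topology (q0 to q1 on reg-n, q1 to q2 on plural -s, q0 to q2 on either irregular class, with q1 and q2 final) is an assumption about the figure, and the three word lists are tiny hypothetical samples.

```python
# Sub-lexicons: tiny illustrative samples, not a real lexicon.
CLASSES = {
    "reg-n": {"cat", "dog"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}

# (state, arc label) -> next state. The topology is an assumption.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q1", "plural-s"): "q2",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
}
FINAL_STATES = {"q1", "q2"}


def label_sequences(word):
    """Yield the arc-label sequences that could spell out `word`."""
    for label, stems in CLASSES.items():
        if word in stems:
            yield [label]
    # Regular stem followed by the plural morpheme -s (no spelling rules here).
    if word.endswith("s") and word[:-1] in CLASSES["reg-n"]:
        yield ["reg-n", "plural-s"]


def run(labels):
    """Follow the arcs from q0 and report whether we stop in a final state."""
    state = "q0"
    for label in labels:
        state = TRANSITIONS.get((state, label))
        if state is None:
            return False
    return state in FINAL_STATES


def accepts(word):
    return any(run(labels) for labels in label_sequences(word))
```

Under these assumptions `accepts` is true for cats, geese and goose, and false for gooses, which is not a noun form in this fragment.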
Derivational morphology: adjective fragment
  [Figure: FSA with states q0–q5 and arcs labeled un-, adj-root1 (twice), adj-root2, -er, -ly, -est, and -er, -est]
  What will happen if we use only the FSA defined by the purple nodes?
    Will allow unbig, unred, …
  Solution: define classes of adjective stems
    Adj-root1: clear, happy, real (clearly)
    Adj-root2: big, red (*bigly)
  NFSA: easier and more intuitive to define
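A sketch of the two-class solution, working on sequences of morphemes rather than on spelled-out words (so spelling changes such as big + -er → bigger are out of scope). The state names and arc layout are assumptions about the figure; only the two root classes and their example members come from the slide.

```python
ADJ_ROOT1 = {"clear", "happy", "real"}  # take un- and -ly
ADJ_ROOT2 = {"big", "red"}              # take neither

# (state, arc label) -> next state. Assumed layout of the six-state FSA.
TRANSITIONS = {
    ("q0", "un-"): "q1",
    ("q0", "adj-root1"): "q2",
    ("q1", "adj-root1"): "q2",
    ("q2", "-er"): "q5",
    ("q2", "-est"): "q5",
    ("q2", "-ly"): "q5",
    ("q0", "adj-root2"): "q3",
    ("q3", "-er"): "q4",
    ("q3", "-est"): "q4",
}
FINAL_STATES = {"q2", "q3", "q4", "q5"}


def arc_label(morpheme):
    """Map a stem to its class label; affixes label their own arcs."""
    if morpheme in ADJ_ROOT1:
        return "adj-root1"
    if morpheme in ADJ_ROOT2:
        return "adj-root2"
    return morpheme


def accepts(morphemes):
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, arc_label(morpheme)))
        if state is None:
            return False
    return state in FINAL_STATES
```

With separate classes, `accepts(["un-", "clear", "-ly"])` holds while `accepts(["un-", "big"])` and `accepts(["big", "-ly"])` do not.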
FSAs can also represent the Lexicon
  Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}), then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives
  [Figure: letter-level FSA with states q0–q7 and arcs un-, b i g, r e d, and -er, -est]
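Expanding a sub-lexicon into letters amounts to building a trie, which is a deterministic letter-level FSA. The sketch below does this for adj_root2 only; the function names and integer state numbering are hypothetical and do not match the figure's q0–q7.

```python
def build_letter_fsa(stems):
    """Expand a set of stems into a letter-by-letter FSA (a trie)."""
    transitions = {}  # (state, letter) -> next state
    finals = set()
    next_state = 1    # state 0 is the start state
    for stem in sorted(stems):
        state = 0
        for letter in stem:
            key = (state, letter)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        finals.add(state)  # the end of a stem is an accepting state
    return transitions, finals


def accepts(fsa, word):
    transitions, finals = fsa
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in finals


adj_root2 = build_letter_fsa({"big", "red"})
```

Here `accepts(adj_root2, "red")` is true and `accepts(adj_root2, "bed")` is false. Adding a stem means rebuilding the structure.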
But…
  Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
    Adding new items to the lexicon means recomputing the whole FSA: it must be determinized and minimized each time
    Non-determinism
  FSAs tell us whether a word is in the language or not, but usually we want to know more:
    What is the stem?
    What are the affixes, and what sort are they?
    We used this information to recognize the word: why can’t we store it?
Parsing with Finite State Transducers
  cats → cat +N +PL (a plural noun)
  Kimmo Koskenniemi’s two-level morphology
    Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
    Morphological parsing: find the mapping (transduction) between the lexical and surface levels
  Lexical: c a t +N +PL
  Surface: c a t s
Finite State Transducers can represent this mapping
  FSTs map between one set of symbols and another using an FSA whose alphabet is composed of pairs of symbols from input and output alphabets
  In general, FSTs can be used for
    Translators (Hello:Ciao)
    Parser/generators (Hello:How may I help you?)
    As well as Kimmo-style morphological parsing
An FST is a 5-tuple consisting of
  Q: a set of states {q0,q1,q2,q3,q4}
  Σ: an alphabet of complex symbols, each an i/o pair i:o such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet); Σ ⊆ I × O
  q0: a start state
  F: a set of final states in Q {q4}
  δ(q, i:o): a transition function mapping Q × Σ to Q
  Example: Emphatic Sheep → Quizzical Cow
    [Figure: FST over states q0–q4 with arcs b:m, a:o, a:o, a:o, !:?]
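The sheep-to-cow transducer is small enough to write out directly. This sketch assumes the usual sheep-language layout (b, then at least two a's with a self-loop on q3, then !); the arc labels come from the slide, but their placement on states is an assumption.

```python
# delta: (state, input symbol) -> (output symbol, next state)
ARCS = {
    ("q0", "b"): ("m", "q1"),
    ("q1", "a"): ("o", "q2"),
    ("q2", "a"): ("o", "q3"),
    ("q3", "a"): ("o", "q3"),  # self-loop: any number of extra a's
    ("q3", "!"): ("?", "q4"),
}
START = "q0"
FINALS = {"q4"}


def transduce(text):
    """Return the output string, or None if the input is not accepted."""
    state, output = START, []
    for symbol in text:
        arc = ARCS.get((state, symbol))
        if arc is None:
            return None
        out_symbol, state = arc
        output.append(out_symbol)
    return "".join(output) if state in FINALS else None
```

Under these assumptions `transduce("baa!")` gives `"moo?"`, and inputs outside the sheep language, such as `"ba!"`, give `None`.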
FST for a 2-level Lexicon
  E.g.
    Reg-n: c a t
    Irreg-pl-n: g o:e o:e s e
    Irreg-sg-n: g o o s e
  [Figure: letter-level FST over states q0–q7 spelling out these entries, e.g. c:c a:a t:t]
  NB: by convention, a:a is written just a
FST for English Nominal Inflection
  [Figure: FST with states q0–q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:ε, +SG:-#, +PL:^s#, +PL:-s#]
  Example tape: c a t +N +PL
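A table-driven sketch of the lexical-to-intermediate mapping. It is an illustration under assumptions, not the transducer in the figure: the state layout and the output labels (`#` for word boundary, `^s#` for the regular plural, nothing but `#` after an irregular stem) follow a common textbook presentation, and the tiny lexicon is hypothetical.

```python
REG_N = {"cat", "fox"}
IRREG_SG_N = {"goose": "goose"}  # lexical stem -> surface stem
IRREG_PL_N = {"goose": "geese"}  # i.e. g o:e o:e s e

# (state, input tag) -> (output string, next state). Assumed layout.
ARCS = {
    ("q1", "+N"): ("", "q4"),
    ("q4", "+SG"): ("#", "q7"),
    ("q4", "+PL"): ("^s#", "q7"),
    ("q2", "+N"): ("", "q5"),
    ("q5", "+SG"): ("#", "q7"),
    ("q3", "+N"): ("", "q6"),
    ("q6", "+PL"): ("#", "q7"),
}
FINAL = "q7"


def stem_arcs(stem):
    """Yield (output, next state) for every stem arc leaving q0."""
    if stem in REG_N:
        yield stem, "q1"
    if stem in IRREG_SG_N:
        yield IRREG_SG_N[stem], "q2"
    if stem in IRREG_PL_N:
        yield IRREG_PL_N[stem], "q3"


def to_intermediate(lexical):
    """Map e.g. 'cat+N+PL' to 'cat^s#'; return None if no path accepts."""
    stem, *tags = lexical.split("+")
    tags = ["+" + tag for tag in tags]
    for output, state in stem_arcs(stem):  # non-deterministic choice of path
        for tag in tags:
            arc = ARCS.get((state, tag))
            if arc is None:
                break
            output, state = output + arc[0], arc[1]
        else:
            if state == FINAL:
                return output
    return None
```

For goose+N+PL the singular path dead-ends at +PL, so the search falls through to the plural path and returns `geese#`.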
Useful Operations on Transducers
  Cascade: running 2+ FSTs in sequence
  Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
  Composition: apply FST2 transition function to result of FST1 transition function
  Inversion: exchanging the input and output alphabets (recognize and generate with same FST)
  cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
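Inversion, composition and cascading can be sketched for the simplest case: deterministic, epsilon-free transducers that emit one output symbol per input symbol. The `FST` class and both example machines are hypothetical illustrations and are unrelated to the AT&T toolkit's API; general transducers need more machinery than this.

```python
class FST:
    """Deterministic, epsilon-free transducer: one output symbol per input."""

    def __init__(self, start, finals, arcs):
        self.start = start
        self.finals = set(finals)
        self.arcs = dict(arcs)  # (state, in_symbol) -> (out_symbol, next_state)

    def transduce(self, text):
        state, output = self.start, []
        for symbol in text:
            if (state, symbol) not in self.arcs:
                return None
            out_symbol, state = self.arcs[(state, symbol)]
            output.append(out_symbol)
        return "".join(output) if state in self.finals else None


def invert(t):
    """Swap input and output on every arc (assumes the result stays deterministic)."""
    return FST(t.start, t.finals,
               {(q, o): (i, r) for (q, i), (o, r) in t.arcs.items()})


def compose(t1, t2):
    """Product construction: feed each output symbol of t1 into t2."""
    arcs = {}
    for (q1, a), (b, r1) in t1.arcs.items():
        for (q2, b2), (c, r2) in t2.arcs.items():
            if b == b2:
                arcs[((q1, q2), a)] = (c, (r1, r2))
    finals = {(f1, f2) for f1 in t1.finals for f2 in t2.finals}
    return FST((t1.start, t2.start), finals, arcs)


sheep_to_cow = FST(0, {4}, {
    (0, "b"): ("m", 1), (1, "a"): ("o", 2), (2, "a"): ("o", 3),
    (3, "a"): ("o", 3), (3, "!"): ("?", 4),
})
shout = FST(0, {0}, {(0, "m"): ("M", 0), (0, "o"): ("O", 0), (0, "?"): ("?", 0)})
```

Running the two machines in sequence (a cascade) and running the single composed machine give the same result here, and the inverted machine maps cow back to sheep.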
Orthographic Rules and FSTs
  Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
  Lexical:      f o x +N +PL
  Intermediate: f o x ^ s #
  Surface:      f o x e s
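The ‘e’ insertion rule can be approximated with a regular-expression rewrite on the intermediate tape. This is a sketch, not a full rule transducer: the triggering contexts (x, s, z, ch, sh before ^s#) are an assumption, and the function names are hypothetical.

```python
import re


def e_insertion(intermediate):
    """Insert e between a sibilant-final stem and the plural s: fox^s# -> fox^es#."""
    return re.sub(r"(x|s|z|ch|sh)\^s#", r"\1^es#", intermediate)


def to_surface(intermediate):
    """Apply the spelling rule, then drop the boundary symbols ^ and #."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")
```

For example, `to_surface("fox^s#")` yields `foxes`, while `to_surface("cat^s#")` is left as `cats` because the rule's context does not match.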
Porter Stemmer (1980)
  Used for tasks in which you only care about the stem
    IR, modeling given/new distinction, topic detection, document similarity
  Lexicon-free morphological analysis
  Cascades rewrite rules (e.g. misunderstanding → misunderstand → understand → …)
  Easily implemented as an FST with rules, e.g.
    ATIONAL → ATE
    ING → ε
  Not perfect…
    doing → doe
    policy → police
  Does stemming help?
    IR: a little
    Topic detection: more
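A toy cascade of the two rewrite rules named above shows the lexicon-free idea. It is deliberately not the Porter algorithm: it has only these two rules, and the condition that ING is stripped only when the remaining stem contains a vowel is an assumption borrowed from common descriptions of the stemmer.

```python
import re


def rule_ational(word):
    """ATIONAL -> ATE, e.g. relational -> relate."""
    return re.sub(r"ational$", "ate", word)


def rule_ing(word):
    """ING -> (nothing), only if the remaining stem contains a vowel."""
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]
    return word


RULES = [rule_ational, rule_ing]


def toy_stem(word):
    """Apply each rewrite rule in order, feeding the output to the next rule."""
    for rule in RULES:
        word = rule(word)
    return word
```

With these two rules, relational becomes relate and motoring becomes motor, while sing is left alone because stripping ING would leave no vowel.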
Summing Up
  FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
  But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
  Next time: Read Ch 3.10-11, 3.13 (new version)