Morphology
See:
Harald Trost, “Morphology”, Chapter 2 of R Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP
D Jurafsky & JH Martin, Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 3 [quite technical]
2 Morphology - reminder
Internal analysis of word forms
morpheme – allomorphic variation
Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
lexeme – abstract notion of group of word forms that ‘belong’ together
  – lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form
3 Role of morphology
Commonly made distinction: inflectional vs derivational
Inflectional morphology is grammatical
  – number, tense, case, gender
Derivational morphology concerns word building
  – part-of-speech derivation
  – words with related meaning
4 Inflectional morphology
Grammatical in nature
Does not carry meaning, other than grammatical meaning
Highly systematic, though there may be irregularities and exceptions
  – Simplifies lexicon, only exceptions need to be listed
  – Unknown words may be guessable
Language-specific and sometimes idiosyncratic
(Mostly) helpful in parsing
5 Derivational morphology
Lexical in nature
Can carry meaning
Fairly systematic, and predictable up to a point
  – Simplifies description of lexicon: regularly derived words need not be listed
  – Unknown words may be guessable
But …
  – Apparent derivations may have specialised meanings
  – Some derivations are missing
Languages often have parallel derivations which may be translatable
6 Morphological processes
Affixes: prefix, suffix, infix, circumfix
Vowel change (umlaut, ablaut)
Gemination, (partial) reduplication
Root and pattern
Stress (or tone) change
Sandhi
7 Morphophonemics
Morphemes and allomorphs
  – eg {plur}: +(e)s, vowel change, y → ies, f → ves, um → a, ∅, ...
Morphophonemic variation
  – Affixes and stems may have variants which are conditioned by context, eg +ing in lifting, swimming, boxing, raining, hoping, hopping
  – Rules may be generalisable across morphemes, eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses: this applies to both {plur} (nouns) and {3rd sing pres} (verbs)
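The +(e)s variation above can be sketched as a small spelling rule. This is only an illustration: the function name `add_s` and the choice of conditioning contexts (s, x, z, ch, sh, and consonant + y) are assumptions, and cases such as tomatoes or f → ves are deliberately not covered.

```python
import re

def add_s(stem: str) -> str:
    """Attach the +(e)s affix, choosing the variant from the end of the stem.

    Hypothetical helper: covers only e-insertion and y -> ies.
    """
    # e-insertion after sibilant spellings: s, x, z, ch, sh
    if re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"
    # consonant + y becomes ies
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # default: plain s
    return stem + "s"

for stem in ["cat", "box", "match", "dish", "bus", "spy", "toy"]:
    print(stem, "->", add_s(stem))
```

Because the rule looks only at the spelling of the stem, the same function serves for noun plurals (boxes) and third person singular verbs (watches).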
8 Morphology in NLP
Analysis vs synthesis
  – what does dogs mean? vs what is the plural of dog?
Analysis
  – Need to identify lexeme
      Tokenization
      To access lexical information
  – Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
  – Morphology can be ambiguous
      May need other process to disambiguate (eg German -en)
Synthesis
  – Need to generate appropriate inflections from underlying representation
9 Morphology in NLP
String-handling programs can be written
More general approach
  – Formalism to write rules which express correspondence between surface and underlying form (eg dogs = dog +{plur})
  – Computational algorithm (program) which can apply those rules to actual instances
  – Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis
10 Role of lexicon in morphology
Rules interact with the lexicon
  – Obviously category information, eg rules that apply to nouns
  – Note also morphology-related subcategories, eg “er” verbs in French, rules for gender agreement
  – Other lexical information can impact on morphology (eg all fish have two forms of the plural, +s and ∅; in Slavic languages case inflections differ for inanimate and animate nouns)
11 Problems with rules
Exceptions have to be covered
  – Including systematic irregularities
  – May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f → ves)
Rules must not over/under-generate
  – Must cover all and only the correct cases
  – May depend on what order the rules are applied in
12 Tokenization
The simplest form of analysis is to reduce different word forms into tokens
Also called “normalization”
For example, if you want to count how many times a given ‘word’ occurs in a text
Or you want to search for texts containing certain ‘words’ (e.g. Google)
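A minimal sketch of the counting use case, assuming that normalization here means no more than case folding and stripping punctuation (the function name `normalize` and the sample sentence are made up for illustration). Note that dog and dogs still count as different tokens; merging them needs the stemming discussed on later slides.

```python
import re
from collections import Counter

def normalize(text: str) -> list[str]:
    """Lower-case the text and keep only runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count how often each normalized token occurs in a (made-up) text
counts = Counter(normalize("The dog saw the dogs. THE END"))
print(counts["the"])   # The, the and THE are counted together
print(counts["dog"], counts["dogs"])
```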
13 Morphological processing
Stemming
String-handling approaches
  – Regular expressions
  – Mapping onto finite-state automata
2-level morphology
  – Mapping between surface form and lexical representation
14 Stemming
Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
(Recall our discussion of stem ~ base form ~ dictionary form ~ citation form)
Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped
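A toy suffix stripper shows the idea. The suffix list, the minimum stem length and the function name `stem` are all assumptions made for this sketch; it is not any published stemming algorithm, and it makes the kind of mistakes (hopping → hopp) that real stemmers add further rules to repair.

```python
# Hypothetical rule list: suffixes are tried in this order, longest first
SUFFIXES = ["ing", "es", "ed", "s"]

def stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no rule applied

for w in ["lifting", "boxes", "dogs", "hoping", "hopping", "bus"]:
    print(w, "->", stem(w))
```

The minimum stem length stops short words such as bus from losing their final s.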
15 Finite state automata
A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
A bit like a flow chart, but can be used for both recognition (analysis) and generation
FSAs have a close relationship with “regular expressions”, a formalism for expressing strings, mainly used for searching texts, or stipulating patterns of strings
16 Finite state automata
A bit like a flow chart, but can be used for both recognition and generation
“Transition network”
Unique start point
Series of states linked by transitions
Transitions represent input to be accounted for, or output to be generated
Legal exit-point(s) explicitly identified
17 Example
Jurafsky & Martin, Figure 2.10
[Figure: FSA with states q0 to q4; arcs q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a loop labelled a on q3]
Loop on q3 means that it can account for infinite length strings
“Deterministic” because in any state, its behaviour is fully predictable
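A deterministic FSA can be run from a plain transition table. The sketch below assumes the automaton in the figure is the usual one for strings like baa!, baaa!, ...: arcs b, a, a leading from q0 to q3, a loop on q3 for further a's, and ! leading to the exit state q4. The names `TRANSITIONS`, `FINAL` and `accepts` are illustrative.

```python
# (state, input symbol) -> next state; anything not listed is a dead end
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # the loop on q3
    ("q3", "!"): "q4",
}
FINAL = {"q4"}           # legal exit point

def accepts(string: str) -> bool:
    """Follow exactly one transition per symbol; succeed only if we end in a final state."""
    state = "q0"
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in FINAL

print(accepts("baa!"), accepts("baaaaa!"), accepts("ba!"))
```

Each (state, symbol) pair has at most one entry in the table, which is what makes the machine deterministic.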
18 Non-deterministic FSA
Jurafsky & Martin, Figure 2.18
[Figure: FSA with states q0 to q4 and arcs labelled b, a, a, !, with a further a-arc giving two choices at q2]
At state q2 with input “a” there is a choice of transitions
We can also have “jump” arcs (or empty transitions), which also introduce non-determinism
[Figure 2.19: a variant of the network with an ε arc]
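One common way to run a non-deterministic FSA is to track the set of all states the machine could be in, following jump arcs for free. The sketch below is a generic simulation under two assumptions about the figures: that the first network has both a loop and an onward arc for a at q2, and that the ε variant jumps from q3 back to q2. All names are illustrative.

```python
EPS = None  # label used for "jump" (empty) arcs

def closure(states, arcs):
    """Add every state reachable from `states` through jump arcs alone."""
    stack, seen = list(states), set(states)
    while stack:
        q = stack.pop()
        for nxt in arcs.get((q, EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def accepts(string, arcs, start="q0", final=frozenset({"q4"})):
    """Track all states the machine could be in after each input symbol."""
    current = closure({start}, arcs)
    for symbol in string:
        moved = set()
        for q in current:
            moved |= arcs.get((q, symbol), set())
        current = closure(moved, arcs)
    return bool(current & final)

# Assumed shape of the first network: two choices for "a" at q2
LOOP_NFSA = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"},
             ("q2", "a"): {"q2", "q3"}, ("q3", "!"): {"q4"}}
# Assumed shape of the jump-arc variant: ε from q3 back to q2
EPS_NFSA = {("q0", "b"): {"q1"}, ("q1", "a"): {"q2"}, ("q2", "a"): {"q3"},
            ("q3", EPS): {"q2"}, ("q3", "!"): {"q4"}}

print(accepts("baaa!", LOOP_NFSA), accepts("baaa!", EPS_NFSA))
```

Both networks accept the same strings as a deterministic version would; the non-determinism changes how the search proceeds, not what is recognised.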
19 An FSA to handle morphology
[Figure: FSA with states q0 to q7 and arcs labelled f, o, x, e, c, s, r, y, i]
Spot the deliberate mistake: overgeneration
20 Finite State Transducers
A “transducer” defines a relationship (a mapping) between two things
Typically used for “two-level morphology”, but can be used for other things
Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping
21 Finite State Transducers
Three functions:
  – Recognizer (verification): takes a pair of strings and verifies if the FST is able to map them onto each other
  – Generator (synthesis): can generate a legal pair of strings
  – Translator (transduction): given one string, can generate the corresponding string
Mapping usually between levels of representation
  – spy+s : spies
  – Lexical:intermediate foxNPs : fox^s
  – Intermediate:surface fox^s : foxes
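The three functions can all be read off one and the same set of arcs. The sketch below is a toy, acyclic FST for spy+s : spies and toy+s : toys only; the state numbers, the omission of the # boundary, and the function names are assumptions made for illustration. Because the network is acyclic, every function simply enumerates its paths, which would not work for an FST with loops.

```python
# Arcs: (source state, lexical symbol, surface symbol, target state); "" is the empty string
ARCS = [
    (0, "s", "s", 1), (1, "p", "p", 2), (2, "y", "i", 3),
    (3, "+", "e", 4), (4, "s", "s", 5),                 # spy+s : spies
    (0, "t", "t", 6), (6, "o", "o", 7), (7, "y", "y", 8),
    (8, "+", "", 4),                                     # toy+s : toys
]
FINAL = {5}

def paths(state=0, lex="", surf=""):
    """Generator: enumerate every legal (lexical, surface) pair."""
    if state in FINAL:
        yield lex, surf
    for src, a, b, dst in ARCS:
        if src == state:
            yield from paths(dst, lex + a, surf + b)

def recognize(lex, surf):
    """Recognizer: does the FST map these two strings onto each other?"""
    return (lex, surf) in set(paths())

def translate(lex):
    """Translator, lexical to surface (synthesis direction)."""
    return [s for l, s in paths() if l == lex]

def analyse(surf):
    """Translator, surface to lexical (analysis direction)."""
    return [l for l, s in paths() if s == surf]

print(list(paths()))
print(translate("toy+s"), analyse("spies"))
```

The arcs say nothing about direction: `translate` and `analyse` differ only in which side of each pair they match against.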
22 Some conventions
Transitions are marked by “:”
A non-changing transition “x:x” can be shown simply as “x”
Wild-cards are shown as [symbol missing in source]
Empty string shown as “ε”
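23 An example based on Trost p.42
[Figure: FST with one path per word; changing arcs are written lexical:surface]
s p y:i +:e s #:ε
t o y +:0 s #:ε
Further paths for shelf and wife use the arc f:v
#spy+s# : spies
#toy+s# : toys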
24 Using wild cards and loops
[Figure: FST fragments with arcs s p y:i +:e s #:0, y:i +:e, y +:0, s, #:0]
Can be collapsed into a single FST:
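25 Another example (J&M Fig. 3.9, p.74)
[Figure: lexical:intermediate FST with states q0 to q7]
Regular noun stems: f o x, c a t, d o g
Irregular singular stems: g o o s e, s h e e p, m o u s e
Irregular plural stems: g o:e o:e s e, s h e e p, m o:i u:ε s:c e
Then N:ε, followed by P:^ s # (regular plural), S:# (singular) or P:# (irregular plural)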
26 [Figure: the single arc from q0 to q1 labelled f o x, c a t, d o g is expanded into letter-by-letter arcs through intermediate states s1 to s6: f-o-x, c-a-t, d-o-g]
27 [Figure: the lexical:intermediate FST of slide 25, states q0 to q7, with the regular stems f o x, c a t, d o g on the arc from q0 to q1]
Paths through the FST (state numbers in brackets):
[0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]
[0] f:f o:o x:x [1] N:ε [4] S:# [7]
[0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]
[0] s:s h:h e:e e:e p:p [2] N:ε [5] S:# [7]
[0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]
Resulting mappings:
f o x N P s # : f o x ^ s #
f o x N S : f o x #
c a t N P s # : c a t ^ s #
s h e e p N S : s h e e p #
g o o s e N P : g e e s e #
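The mappings listed on this slide can be reproduced with a small lookup, shown here as a stand-in for the transducer rather than an implementation of it. The word lists come from the slide; the function name `to_intermediate` and the use of "S"/"P" as a separate argument are assumptions.

```python
REGULAR = {"fox", "cat", "dog"}
# Irregular nouns: singular stem -> plural stem (goose and mouse change vowels, sheep does not)
IRREGULAR = {"goose": "geese", "sheep": "sheep", "mouse": "mice"}

def to_intermediate(stem: str, number: str) -> str:
    """Map stem + N + S/P to the intermediate level.

    '^' marks a morpheme boundary and '#' a word boundary, as on the slide.
    """
    if stem in REGULAR:
        return stem + ("#" if number == "S" else "^s#")
    if stem in IRREGULAR:
        return (stem if number == "S" else IRREGULAR[stem]) + "#"
    raise KeyError(f"unknown noun: {stem}")

for stem, number in [("fox", "P"), ("fox", "S"), ("sheep", "S"), ("goose", "P")]:
    print(stem, "N", number, ":", to_intermediate(stem, number))
```

Unlike the FST, this lookup only runs in the synthesis direction; going from geese# back to goose N P would need a second table.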
28 Lexical:surface mapping
J&M Fig. 3.14, p.78
ε → e / {x, s, z} ^ __ s #
f o x N P s # : f o x ^ s #
c a t N P s # : c a t ^ s #
[Figure: FST with states q0 to q5; arc labels include ^:ε, ε:e, s, #, z, s, x and other]
29 f o x ^ s # : f o x e s #
c a t ^ s # : c a t s #
[Figure: the same FST as on slide 28, states q0 to q5]
[0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]
[0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]
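The effect of the e-insertion rule (insert e after x, s or z at a morpheme boundary, when s and the word boundary follow) can be sketched with a regular expression. This is a substitute technique, not the state-by-state transducer traced above, and the function name `e_insertion` is an assumption.

```python
import re

def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to its surface form 'foxes#'."""
    # Insert e where the boundary '^' follows x, s or z and is followed by s#
    with_e = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    # Everywhere else the morpheme boundary simply maps to the empty string
    return with_e.replace("^", "")

for form in ["fox^s#", "cat^s#", "geese#"]:
    print(form, ":", e_insertion(form))
```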
30 FST
But you don’t have to draw all these FSTs
They map neatly onto rule formalisms
What is more, these can be generated automatically
Therefore, slightly different formalism
31 FST compiler
[d o g N P .x. d o g s ] |
[c a t N P .x. c a t s ] |
[f o x N P .x. f o x e s ] |
[g o o s e N P .x. g e e s e]
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: -> s11.
s9: -> s12.
s10: -> s13.
s11: s -> s14.
s12: -> fs15.
s13: -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: -> s12.
[Figure: network fragment with arcs c, d, f, g leaving s0 towards s1, s2, s3, s4]
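To give a feel for how a list of words turns into a table of states and arcs, the sketch below builds a letter tree from the lexical stems only. It is a simplified assumption of what a compiler does: it ignores the surface side of each pair and does not merge shared endings the way the listing above does (where, for example, both cat and dog lead to s9). The names `compile_trie` and `accepts` are illustrative.

```python
def compile_trie(words):
    """Build a letter-tree automaton: {state: {symbol: next_state}} plus the set of final states."""
    arcs, finals = {0: {}}, set()
    for word in words:
        state = 0
        for ch in word:
            if ch not in arcs[state]:
                new_state = len(arcs)      # next unused state number
                arcs[state][ch] = new_state
                arcs[new_state] = {}
            state = arcs[state][ch]
        finals.add(state)
    return arcs, finals

def accepts(word, arcs, finals):
    """Walk the tree letter by letter; succeed only in a final state."""
    state = 0
    for ch in word:
        state = arcs[state].get(ch)
        if state is None:
            return False
    return state in finals

arcs, finals = compile_trie(["dog", "cat", "fox", "goose"])
for state, out in arcs.items():
    print(f"s{state}:", ", ".join(f"{ch} -> s{dst}" for ch, dst in out.items()) or "(no arcs)")
```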