Morphological Parsing


1 Morphological Parsing
CS 4705

2 Parsing
Taking a surface input and analyzing its components and underlying structure
Morphological parsing: taking a word or string of words as input and identifying their stems and affixes (and sometimes interpreting these)
E.g.: goose → goose +N +SG or goose +V
geese → goose +N +PL
gooses → goose +V +3SG
Bracketing: indecipherable → [in [[de [cipher]] able]]
(Cipher: from Middle French cifre, from Arabic ṣifr 'zero, nothing')

3 Why ‘parse’ words?
To find stems: a simple key to word similarity (yellow, yellowish, yellows, yellowed, yellowing, …)
To find affixes and the information they convey: ‘-ed’ signals a verb, ‘-ish’ an adjective, and ‘-s’?
Morphological parsing provides information about a word’s semantics and the syntactic role it plays in a sentence

4 Some Practical Applications
For spell-checking: is muncheble a legal word?
To identify a word’s part-of-speech (POS): for sentence parsing, for machine translation, …
To identify a word’s stem: for information retrieval
Why not just list all word forms in a lexicon?

5 What do we need to build a morphological parser?
Lexicon: list of stems and affixes (with corresponding parts of speech)
Morphotactics of the language: a model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs, e.g. in- → il- in the context of a following l (in- + legal → illegal)
Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes
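The last point, that simple affixation is regular, can be illustrated with an ordinary regular expression. A toy sketch (the stem/affix split and the tiny suffix inventory are illustrative, not from the slides, and it has no lexicon or spelling rules, so it overgenerates):

```python
import re

# A non-greedy stem followed by an optional plural suffix.
# Named groups recover the stem/affix segmentation.
PLURAL = re.compile(r"(?P<stem>[a-z]+?)(?P<affix>es|s)?")

m = PLURAL.fullmatch("cats")
# m.group("stem") == "cat", m.group("affix") == "s"
```

Because the stem is non-greedy, the engine prefers the shortest stem that still lets the whole word match, so "foxes" segments as fox + es.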

6 Using FSAs to Represent English Plural Nouns
English nominal inflection: plural (-s)
[FSA diagram: q0 --reg-n--> q1 --plural (-s)--> q2; q0 --irreg-sg-n--> q2; q0 --irreg-pl-n--> q2]
Inputs: cats, geese, goose
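One way to make the machine on this slide concrete is a minimal Python recognizer, with toy word sets standing in for the reg-n, irreg-sg-n, and irreg-pl-n arc labels (the word lists are illustrative, not from the slides):

```python
# Toy sub-lexicons standing in for the FSA's arc labels.
REG_N = {"cat", "fox", "dog"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

def accepts(word: str) -> bool:
    """q0 --reg-n--> q1 --(-s)--> q2; q0 --irreg-sg/pl-n--> q2."""
    if word in IRREG_SG_N or word in IRREG_PL_N:
        return True   # irregular forms are listed whole in the lexicon
    if word in REG_N:
        return True   # regular singular: accept at q1
    if word.endswith("s") and word[:-1] in REG_N:
        return True   # regular plural: traverse the -s arc to q2
    return False
```

Note that this recognizer knows nothing about spelling rules, so it would expect *foxs rather than foxes; that gap is exactly what the orthographic-rule FSTs on a later slide address.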

7 Derivational morphology: adjective fragment
[FSA diagram: a purple path q0 --un- (optional)--> q1 --adj-root1--> q2 --(-er, -ly, -est)--> q5, plus a separate path through q3 and q4 for adj-root2 with only -er, -est]
What will happen if we use only the FSA defined by the purple nodes? It will allow unbig, unred, …
Solution: define classes of adjective stems
Adj-root1: clear, happy, real (clearly)
Adj-root2: big, red (*bigly)
An NFSA is easier and more intuitive to define

8 FSAs can also represent the Lexicon
Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj-root2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives
[FSA diagram: an un- arc from q0, then letter-by-letter paths b-i-g and r-e-d through q1–q6, followed by -er, -est into q7]

9 But…
Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
Adding new items to the lexicon means recomputing (re-determinizing and re-minimizing) the whole FSA
Non-determinism
FSAs tell us whether a word is in the language or not, but usually we want to know more: What is the stem? What are the affixes, and of what sort are they? We used this information to recognize the word: why can’t we store it?

10 Parsing with Finite State Transducers
cats → cat +N +PL (a plural noun)
Kimmo Koskenniemi’s two-level morphology
Idea: a word is a relationship between a lexical level (its morphemes) and a surface level (its orthography)
Morphological parsing: find the mapping (transduction) between lexical and surface levels
Lexical: c a t +N +PL
Surface: c a t s

11 Finite State Transducers can represent this mapping
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from the input and output alphabets
In general, FSTs can be used as
Translators (Hello:Ciao)
Parsers/generators (Hello:How may I help you?)
As well as for Kimmo-style morphological parsing

12 FST is a 5-tuple consisting of
Q: a set of states, e.g. {q0, q1, q2, q3, q4}
Σ: an alphabet of complex symbols, each an i:o pair such that i ∈ I (an input alphabet), o ∈ O (an output alphabet), and i:o ∈ I × O
q0: a start state
F: a set of final states in Q, e.g. {q4}
δ(q, i:o): a transition function mapping Q × Σ to Q
Emphatic Sheep → Quizzical Cow:
[FST diagram: q0 --b:m--> q1 --a:o--> q2 --a:o--> q3 --!:?--> q4, with an a:o self-loop on q3]
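The sheep-to-cow transducer on this slide can be written down directly as a transition table over i:o pairs; the Python encoding below is one possible sketch (state names follow the slide):

```python
# DELTA maps (state, input symbol) -> (next state, output symbol).
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),   # self-loop: any number of extra a's
    ("q3", "!"): ("q4", "?"),
}
FINALS = {"q4"}

def transduce(word):
    """Run the FST; return the output string, or None if rejected."""
    state, out = "q0", []
    for sym in word:
        if (state, sym) not in DELTA:
            return None         # no transition: reject
        state, o = DELTA[(state, sym)]
        out.append(o)
    return "".join(out) if state in FINALS else None
```

So transduce("baa!") yields "moo?", extra a's map to extra o's, and strings the sheep would not say (e.g. "ba!") are rejected.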

13 FST for a 2-level Lexicon
[FST diagram with three paths from q0:
Reg-n: c:c a:a t:t (cat)
Irreg-pl-n: g o:e o:e s e (lexical goose, surface geese)
Irreg-sg-n: g o o s e (goose)]
NB: by convention, a:a is written just a

14 FST for English Nominal Inflection
[FST diagram: q0 --reg-n--> q1 --+N:ε--> q4, then --+SG:#--> or --+PL:^s#--> q7; q0 --irreg-n-sg--> q2 --+N:ε--> q5 --+SG:#--> q7; q0 --irreg-n-pl--> q3 --+N:ε--> q6 --+PL:#--> q7]
E.g. cats ↔ cat +N +PL
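A sketch of what this FST computes in the generation direction for regular nouns: lexical tags become orthographic material plus boundary symbols (^ for a morpheme boundary, # for a word boundary). The function below is a hand-rolled stand-in for the regular-noun path, not an actual FST:

```python
# Map lexical-level tags to intermediate-level strings for regular nouns.
TAG_OUT = {
    "+N": "",      # +N is realized as the empty string
    "+SG": "#",    # singular: just close the word
    "+PL": "^s#",  # plural: morpheme boundary, s, word boundary
}

def to_intermediate(stem, tags):
    return stem + "".join(TAG_OUT[t] for t in tags)
```

For example, to_intermediate("cat", ["+N", "+PL"]) gives "cat^s#", the intermediate form that the orthographic-rule FSTs on a later slide turn into surface spelling.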

15 Useful Operations on Transducers
Cascade: running two or more FSTs in sequence
Intersection: represent the common transitions in FST1 and FST2 (used in ASR for finding pronunciations)
Composition: apply FST2’s transition function to the result of FST1’s transition function
Inversion: exchanging the input and output alphabets (recognize and generate with the same FST)
Cf. the AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
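Inversion is the simplest of these to see concretely: swapping the input and output side of every transition turns a parser into a generator. Encoding an FST as a table of (state, input) → (next state, output), as elsewhere on these slides (the tiny table here is hypothetical), it is one line:

```python
# A fragment of an FST: (state, in-symbol) -> (next state, out-symbol).
DELTA = {("q0", "c"): ("q1", "c"), ("q1", "+PL"): ("q2", "s")}

def invert(delta):
    """Swap the input/output roles of every transition."""
    return {(q, o): (q2, i) for (q, i), (q2, o) in delta.items()}
```

After inversion, the machine that mapped lexical +PL to surface s maps surface s back to lexical +PL, which is why one FST suffices for both recognition and generation.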

16 Orthographic Rules and FSTs
Define additional FSTs to implement spelling rules such as consonant doubling (beg → begging), e-deletion (make → making), e-insertion (watch → watches), etc.
Lexical: f o x +N +PL
Intermediate: f o x ^ s #
Surface: f o x e s
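The e-insertion rule in the fox example can be sketched as a rewrite over the intermediate tape (the sibilant class [sxz] here is a simplification; the full rule also covers ch and sh):

```python
import re

def e_insertion(intermediate):
    """fox^s# -> foxes: insert e between a sibilant stem and plural s,
    then erase the boundary symbols ^ and #."""
    s = re.sub(r"([sxz])\^s#", r"\1es#", intermediate)
    return s.replace("^", "").replace("#", "")
```

The rule fires only in its licensing context: e_insertion("fox^s#") gives "foxes", while e_insertion("cat^s#") gives plain "cats".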

17 Porter Stemmer (1980)
Used for tasks in which you only care about the stem: IR, modeling the given/new distinction, topic detection, document similarity
Lexicon-free morphological analysis
Cascaded rewrite rules (e.g. misunderstanding → misunderstand → understand → …)
Easily implemented as an FST, with rules such as
ATIONAL → ATE
ING → ε
Not perfect: doing → doe

18 Policy  police Does stemming help? IR, little Topic detection, more

19 Summing Up
FSTs provide a useful tool for implementing a standard model of morphological analysis, Koskenniemi’s two-level morphology
But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter stemmer
Next time: Read Ch , 3.13 (new version)

