

Morphology
See Harald Trost, “Morphology”, Chapter 2 of R Mitkov (ed.) The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP
D Jurafsky & JH Martin: Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 3 [quite technical]

2 Morphology - reminder
Internal analysis of word forms
morpheme – allomorphic variation
Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
lexeme – abstract notion of group of word forms that ‘belong’ together
–lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form

3 Role of morphology
Commonly made distinction: inflectional vs derivational
Inflectional morphology is grammatical
–number, tense, case, gender
Derivational morphology concerns word building
–part-of-speech derivation
–words with related meaning

4 Inflectional morphology
Grammatical in nature
Does not carry meaning, other than grammatical meaning
Highly systematic, though there may be irregularities and exceptions
–Simplifies lexicon, only exceptions need to be listed
–Unknown words may be guessable
Language-specific and sometimes idiosyncratic
(Mostly) helpful in parsing

5 Derivational morphology
Lexical in nature
Can carry meaning
Fairly systematic, and predictable up to a point
–Simplifies description of lexicon: regularly derived words need not be listed
–Unknown words may be guessable
But …
–Apparent derivations have specialised meaning
–Some derivations missing
Languages often have parallel derivations which may be translatable

6 Morphological processes
Affixes: prefix, suffix, infix, circumfix
Vowel change (umlaut, ablaut)
Gemination, (partial) reduplication
Root and pattern
Stress (or tone) change
Sandhi

7 Morphophonemics
Morphemes and allomorphs
–eg {plur}: +(e)s, vowel change, y → ies, f → ves, um → a, ∅, ...
Morphophonemic variation
–Affixes and stems may have variants which are conditioned by context
eg +ing in lifting, swimming, boxing, raining, hoping, hopping
–Rules may be generalisable across morphemes
eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses
Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
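The +(e)s alternation on this slide can be sketched in a few lines of Python. This is only an illustration under a spelling-based assumption: the function name `add_s` and the list of endings are hypothetical, and o-final words such as tomatoes are deliberately not handled.

```python
# Hypothetical spelling-based approximation of the +(e)s rule:
# insert "e" when the stem ends in a sibilant spelling.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def add_s(stem):
    """Attach +(e)s to a noun or verb stem (same rule for {plur} and {3rd sing pres})."""
    if stem.endswith(SIBILANT_ENDINGS):
        return stem + "es"
    return stem + "s"


for stem in ["cat", "box", "match", "dish", "bus"]:
    print(stem, "->", add_s(stem))
```

Because the rule looks only at the end of the stem, the same function serves nouns (boxes) and verbs (watches), which is the sense in which the rule generalises across morphemes.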

8 Morphology in NLP
Analysis vs synthesis
–what does dogs mean? vs what is the plural of dog?
Analysis
–Need to identify lexeme
Tokenization
To access lexical information
–Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
–Morphology can be ambiguous
May need other process to disambiguate (eg German -en)
Synthesis
–Need to generate appropriate inflections from underlying representation

9 Morphology in NLP
String-handling programs can be written
More general approach
–formalism to write rules which express correspondence between surface and underlying form (eg dogs = dog +{plur})
–Computational algorithm (program) which can apply those rules to actual instances
–Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis
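A minimal sketch of direction-independent rules: one rule table, read by two small procedures. The table `RULES`, the `{past}` feature and both function names are hypothetical illustrations, not part of any formalism named in the slides.

```python
# One declarative table of (feature, surface suffix) correspondences.
# {plur} comes from the slide; {past} is an added hypothetical entry.
RULES = [("{plur}", "s"), ("{past}", "ed")]


def synthesize(stem, feature):
    """Underlying form -> surface form, eg dog +{plur} -> dogs."""
    for feat, suffix in RULES:
        if feat == feature:
            return stem + suffix
    raise ValueError("no rule for " + feature)


def analyze(word):
    """Surface form -> all (stem, feature) readings licensed by the table."""
    return [
        (word[: -len(suffix)], feat)
        for feat, suffix in RULES
        if word.endswith(suffix) and len(word) > len(suffix)
    ]


print(synthesize("dog", "{plur}"))
print(analyze("dogs"))
```

The rules are stated once and neither procedure changes when a rule is added, which is the point of keeping rules separate from the program.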

10 Role of lexicon in morphology
Rules interact with the lexicon
–Obviously category information
eg rules that apply to nouns
–Note also morphology-related subcategories
eg “er” verbs in French, rules for gender agreement
–Other lexical information can impact on morphology
eg all fish have two forms of the plural (+s and ∅)
in Slavic languages case inflections differ for inanimate and animate nouns

11 Problems with rules
Exceptions have to be covered
–Including systematic irregularities
–May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f → ves)
Rules must not over/under-generate
–Must cover all and only the correct cases
–May depend on what order the rules are applied in

12 Tokenization
The simplest form of analysis is to reduce different word forms into tokens
Also called “normalization”
For example, if you want to count how many times a given ‘word’ occurs in a text
Or you want to search for texts containing certain ‘words’ (e.g. Google)
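As a rough illustration of the word-counting use case, the sketch below normalizes a text by lowercasing it and keeping only runs of letters, then counts the resulting tokens. The function name `normalize`, the sample sentence and the choice of normalization are assumptions made for the example.

```python
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and return its alphabetic tokens (a crude normalization)."""
    return re.findall(r"[a-z]+", text.lower())


counts = Counter(normalize("The dog saw the Dog. Dogs bark!"))
print(counts["the"], counts["dog"], counts["dogs"])
```

Note that dog and dogs are still counted separately here; merging them is the job of stemming.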

13 Morphological processing
Stemming
String-handling approaches
–Regular expressions
–Mapping onto finite-state automata
2-level morphology
–Mapping between surface form and lexical representation

14 Stemming
Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
(Recall our discussion of stem ~ base form ~ dictionary form ~ citation form)
Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped
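A toy suffix-stripping stemmer along these lines might look as follows. The rule list, the minimum stem length and the name `stem` are all assumptions for illustration; this is not the Porter stemmer or any other published algorithm.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# The list is a hypothetical toy, far smaller than a real stemmer's rule set.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("ing", ""), ("ed", ""), ("s", "")]


def stem(word, min_stem=2):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)] + replacement
    return word


for word in ["dogs", "spies", "boxes", "lifting", "horses", "is"]:
    print(word, "->", stem(word))
```

The output for horses (hors) shows the typical weakness of pure string handling: the rules have no access to a lexicon, so they over-strip.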

15 Finite state automata
A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
A bit like a flow chart, but can be used for both recognition (analysis) and generation
FSAs have a close relationship with “regular expressions”, a formalism for expressing strings, mainly used for searching texts, or stipulating patterns of strings

16 Finite state automata
A bit like a flow chart, but can be used for both recognition and generation
“Transition network”
Unique start point
Series of states linked by transitions
Transitions represent input to be accounted for, or output to be generated
Legal exit-point(s) explicitly identified

17 Example
Jurafsky & Martin, Figure 2.10
Loop on q3 means that it can account for arbitrarily long strings
“Deterministic” because in any state, its behaviour is fully predictable
[Figure: FSA with states q0 to q4 and arcs labelled b, a, a, ! in sequence, plus an a loop on q3]
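The automaton on this slide can be written as a transition table and run with a short loop. The sketch assumes the arcs are b, a, a, ! from q0 to q4 with an a loop on q3, as in the figure; the names `TRANSITIONS` and `accepts` are illustrative.

```python
# (state, input symbol) -> next state; at most one entry per pair,
# which is what makes the automaton deterministic.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # the loop on q3
    (3, "!"): 4,
}
FINAL_STATES = {4}


def accepts(string):
    """Return True if the automaton ends in a final state after reading string."""
    state = 0
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no legal transition: reject
            return False
    return state in FINAL_STATES


for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, accepts(s))
```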

18 Non-deterministic FSA
Jurafsky & Martin, Figure 2.18
At state q2 with input “a” there is a choice of transitions
We can also have “jump” arcs (or empty transitions), which also introduce non-determinism (Figure 2.19)
[Figure: two FSAs over states q0 to q4 with arcs labelled b, a, a, !; the first has a choice of a arcs at q2, the second has an ε arc]
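One common way to run a non-deterministic FSA is to track the set of all states it could be in. The sketch below does that, including empty transitions; the two arc tables are assumptions reconstructed from the slide (in particular, the ε arc is assumed to run from q3 back to q2), and all names are illustrative.

```python
# (state, symbol) -> set of next states; "" stands for an empty (ε) transition.
FIG_2_18 = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {2, 3}, (3, "!"): {4}}
FIG_2_19 = {(0, "b"): {1}, (1, "a"): {2}, (2, "a"): {3}, (3, ""): {2}, (3, "!"): {4}}


def epsilon_closure(states, arcs):
    """Add every state reachable from states through ε arcs alone."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in arcs.get((state, ""), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(string, arcs, start=0, finals=frozenset({4})):
    """Simulate all choices in parallel by keeping a set of current states."""
    current = epsilon_closure({start}, arcs)
    for symbol in string:
        moved = set()
        for state in current:
            moved |= arcs.get((state, symbol), set())
        current = epsilon_closure(moved, arcs)
    return bool(current & finals)


for s in ["baa!", "baaa!", "ba!"]:
    print(s, accepts(s, FIG_2_18), accepts(s, FIG_2_19))
```

Under these assumptions both machines accept the same strings as the deterministic version, which illustrates that non-determinism adds convenience rather than recognising power.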

19 An FSA to handle morphology
[Figure: FSA with states q0 to q7; arc labels in extraction order: f, x, o, e, c, s, r, y, i]
Spot the deliberate mistake: overgeneration

20 Finite State Transducers
A “transducer” defines a relationship (a mapping) between two things
Typically used for “two-level morphology”, but can be used for other things
Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping

21 Finite State Transducers
Three functions:
–Recognizer (verification): takes a pair of strings and verifies if the FST is able to map them onto each other
–Generator (synthesis): can generate a legal pair of strings
–Translator (transduction): given one string, can generate the corresponding string
Mapping usually between levels of representation
–spy+s : spies
–Lexical:intermediate foxNPs : fox^s
–Intermediate:surface fox^s : foxes

22 Some conventions Transitions are marked by “:” A non-changing transition “x:x” can be shown simply as “x” Wild-cards are shown as Empty string shown as “ε”

23 An example based on Trost p.42
[Figure: FST diagram. Recoverable paths: s p y:i +:e s #:ε and t o y +:0 s #:ε; the remaining arcs include f:v and +:e s #:ε]
#spy+s# : spies
#toy+s# : toys
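The label conventions of slide 22 and the three functions of slide 21 can be tried out on the two recoverable paths of this example. The sketch below is a hypothetical toy, not Trost's formalism: it only handles acyclic transducers, treats both ε and 0 as the empty string, and all function names are invented for the illustration.

```python
def parse_path(text):
    """Turn 's p y:i +:e s #:ε' into a list of (upper, lower) symbol pairs."""
    pairs = []
    for label in text.split():
        upper, sep, lower = label.partition(":")
        if not sep:
            lower = upper  # "x" abbreviates "x:x"
        if lower in ("ε", "0"):
            lower = ""  # both spellings of the empty string
        pairs.append((upper, lower))
    return pairs


def build_fst(paths):
    """Build an acyclic FST that shares common prefixes; state 0 is the start."""
    transitions, finals, next_id = {0: []}, set(), 1
    for path in paths:
        state = 0
        for pair in path:
            for upper, lower, target in transitions[state]:
                if (upper, lower) == pair:  # reuse an existing arc
                    state = target
                    break
            else:
                transitions[state].append((pair[0], pair[1], next_id))
                transitions[next_id] = []
                state, next_id = next_id, next_id + 1
        finals.add(state)
    return transitions, finals


def generate(fst):
    """Generator: list every (lexical, surface) pair the FST licenses."""
    transitions, finals = fst
    results, stack = [], [(0, "", "")]
    while stack:
        state, lexical, surface = stack.pop()
        if state in finals:
            results.append((lexical, surface))
        for upper, lower, target in transitions[state]:
            stack.append((target, lexical + upper, surface + lower))
    return sorted(results)


def translate(fst, lexical):
    """Translator: surface strings corresponding to one lexical string."""
    return [surf for lex, surf in generate(fst) if lex == lexical]


def recognize(fst, lexical, surface):
    """Recognizer: does the FST map this pair of strings onto each other?"""
    return (lexical, surface) in generate(fst)


FST = build_fst([parse_path("s p y:i +:e s #:ε"), parse_path("t o y +:0 s #:ε")])
print(generate(FST))
print(translate(FST, "spy+s#"))
```

Enumerating all pairs only works because this toy network has no loops; a transducer with loops would need a step-by-step traversal of the input instead.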

24 Using wild cards and loops
[Figure: the path s p y:i +:e s #:0]
Can be collapsed into a single FST:
[Figure: FST with arcs y:i +:e and y +:0, followed by s #:0]

25 Another example (J&M Fig. 3.9, p.74)
[Figure: lexical:intermediate FST with states q0 to q7. Stem arcs from q0: f o x, c a t, d o g; g o o s e, s h e e p, m o u s e; g o:e o:e s e, s h e e p, m o:i u:ε s:c e. Suffix arcs: N:ε, P:^ s #, S:#, P:#]

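26 [Figure: the single arc from q0 to q1 labelled with the stems f o x, c a t, d o g, expanded into letter-by-letter arcs through intermediate states s1 to s6 (labels f, c, d, o, a, o, x, t, g)]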

27 [Figure: the FST of slide 25, with the irregular stem arcs g o o s e, s h e e p, m o u s e, g o:e o:e s e, m o:i u:ε s:c e, the regular stems f o x, c a t, d o g, and the suffix arcs N:ε, P:^ s #, S:#, P:#]
[0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]
[0] f:f o:o x:x [1] N:ε [4] S:# [7]
[0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]
[0] s:s h:h e:e e:e p:p [2] N:ε [5] S:# [7]
[0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]
f o x N P s # : f o x ^ s #
f o x N S : f o x #
c a t N P s # : c a t ^ s #
s h e e p N S : s h e e p #
g o o s e N P : g e e s e #
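The traces above can be reproduced by walking the transducer one input symbol at a time. The following is a sketch under assumptions: the state numbering (in particular, sending irregular plurals through a state 6 of their own) is a guess at the figure, the ids from 100 upwards for unnamed intermediate states are arbitrary, and the function names are hypothetical.

```python
from itertools import count


def build_fst():
    """Build the lexical:intermediate FST as a list of (source, upper, lower, target) arcs."""
    arcs = []
    fresh = count(100)  # ids for unnamed intermediate states

    def add_path(source, labels, target):
        state = source
        for k, label in enumerate(labels):
            upper, sep, lower = label.partition(":")
            if not sep:
                lower = upper  # "x" abbreviates "x:x"
            if lower == "ε":
                lower = ""
            nxt = target if k == len(labels) - 1 else next(fresh)
            arcs.append((state, upper, lower, nxt))
            state = nxt

    for stem in ("f o x", "c a t", "d o g"):
        add_path(0, stem.split(), 1)  # regular nouns
    for stem in ("g o o s e", "s h e e p", "m o u s e"):
        add_path(0, stem.split(), 2)  # irregular singulars
    for stem in ("g o:e o:e s e", "s h e e p", "m o:i u:ε s:c e"):
        add_path(0, stem.split(), 3)  # irregular plurals
    add_path(1, ["N:ε"], 4)
    add_path(2, ["N:ε"], 5)
    add_path(3, ["N:ε"], 6)
    add_path(4, ["P:^", "s", "#"], 7)
    add_path(4, ["S:#"], 7)
    add_path(5, ["S:#"], 7)
    add_path(6, ["P:#"], 7)
    return arcs, 0, {7}


def transduce(fst, symbols):
    """Return every intermediate string for a list of lexical symbols."""
    arcs, start, finals = fst
    results, stack = [], [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(symbols):
            if state in finals:
                results.append(out)
            continue
        for source, upper, lower, target in arcs:
            if source == state and upper == symbols[pos]:
                stack.append((target, pos + 1, out + lower))
    return sorted(results)


FST = build_fst()
for lexical in ["f o x N P s #", "f o x N S", "g o o s e N P", "g o o s e N P s #"]:
    print(lexical, "->", transduce(FST, lexical.split()))
```

The stack holds several candidate paths at once because g o o s e matches both the singular and the plural stem arcs on the lexical side; only the path that also matches the suffix survives.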

28 Lexical:surface mapping
J&M Fig. 3.14, p.78
ε → e / {x s z} ^ __ s #
f o x N P s # : f o x ^ s #
c a t N P s # : c a t ^ s #
[Figure: FST with states q0 to q5; arc labels: ^:ε, #, other, z, s, x, ε:e, s]

29 f o x ^ s # : f o x e s #
c a t ^ s # : c a t s #
[Figure: the FST of slide 28]
[0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]
[0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]
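The effect of the e-insertion rule ε → e / {x s z} ^ __ s # can be imitated with a regular-expression substitution. This is a stand-in for the transducer, not an implementation of it: the function name `e_insertion` is hypothetical, and the word-boundary symbol # is kept in the output as in the pairs above.

```python
import re


def e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s#' to its surface form 'foxes#'."""
    # Insert e after the morpheme boundary when x, s or z precedes it
    # and the remainder of the word is exactly s#.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    # The boundary symbol ^ itself never surfaces (^:ε).
    return with_e.replace("^", "")


for form in ["fox^s#", "cat^s#", "fox#"]:
    print(form, "->", e_insertion(form))
```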

30 FST
But you don’t have to draw all these FSTs
They map neatly onto rule formalisms
What is more, these can be generated automatically
Therefore, slightly different formalism

31 FST compiler
[d o g N P .x. d o g s] |
[c a t N P .x. c a t s] |
[f o x N P .x. f o x e s] |
[g o o s e N P .x. g e e s e]
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: -> s11.
s9: -> s12.
s10: -> s13.
s11: s -> s14.
s12: -> fs15.
s13: -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: -> s12.
[Figure: start of the compiled network, s0 with arcs c, d, f, g to s1 to s4]
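To give a feel for what such a compiler produces, the sketch below turns the lexical sides of the four entries into a numbered state table. It is a hypothetical simplification: it builds a plain prefix tree over the lexical symbols only, ignores the surface side, and does no minimisation, so its numbering agrees with the table above only for the first few states.

```python
from collections import deque


def compile_trie(words):
    """Compile symbol sequences into {state: {symbol: next state}}, numbered breadth-first."""
    root = {}
    for symbols in words:
        node = root
        for symbol in symbols:
            node = node.setdefault(symbol, {})

    table, ids, next_id = {}, {id(root): 0}, 1
    queue = deque([root])
    while queue:
        node = queue.popleft()
        row = {}
        for symbol in sorted(node):
            child = node[symbol]
            ids[id(child)] = next_id
            row[symbol] = next_id
            next_id += 1
            queue.append(child)
        table[ids[id(node)]] = row
    return table


LEXICAL = ["d o g N P", "c a t N P", "f o x N P", "g o o s e N P"]
TABLE = compile_trie([entry.split() for entry in LEXICAL])
for state in range(5):
    arcs = ", ".join(f"{sym} -> s{target}" for sym, target in TABLE[state].items())
    print(f"s{state}: {arcs}.")
```

A real compiler would also merge equivalent states, which is why s5 and s6 in the table above both lead to s9 whereas this sketch keeps them apart.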