CS 4705 Lecture 3 Morphology: Parsing Words

What is morphology? The study of how words are composed from smaller, meaning-bearing units (morphemes)
–Stems: children, undoubtedly
–Affixes (prefixes, suffixes, circumfixes, infixes): immaterial, trying, gesagt, absobl**dylutely
–Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems

Morphology Helps Define Word Classes AKA morphological classes, parts-of-speech
Closed vs. open (function vs. content) class words
–Pronoun, preposition, conjunction, determiner, …
–Noun, verb, adverb, adjective, …

(English) Inflectional Morphology Word stem + grammatical morpheme
–Usually produces a word of the same class
–Usually serves a syntactic function (e.g. agreement)
like → likes or liked
bird → birds
Nominal morphology
–Plural forms -s or -es
Irregular forms (goose/geese)
Mass vs. count nouns (fish/fish, or fish+s?)
–Possessives (cat’s, cats’)

Verbal inflection
–Main verbs (sleep, like, fear) are relatively regular: -s, -ing, -ed
And productive: -ed in instant-messaged, faxed, homered
But some are not regular: eat/ate/eaten, catch/caught/caught
–Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
Be: am/is/are/were/was/been/being
–Irregular verbs are few (~250) but frequently occurring
–So… English inflectional morphology is fairly easy to model, with some special cases...
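The pattern above (regular suffix rules plus a small exception list) can be sketched directly. This is a hypothetical helper, not from the lecture; the irregular table here lists only the slide's own examples.

```python
# Regular -ed inflection with an irregular-verb lookup table.
# Table and function names are illustrative, not a real library API.
IRREGULAR_PAST = {"eat": "ate", "catch": "caught"}

def past_tense(verb: str) -> str:
    """Return the simple past: irregular table first, then regular -ed rules."""
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):       # like -> liked
        return verb + "d"
    return verb + "ed"           # fax -> faxed, fear -> feared
```

For example, `past_tense("like")` gives `"liked"` while `past_tense("eat")` consults the exception table and gives `"ate"`.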

(English) Derivational Morphology Word stem + grammatical morpheme
–Usually produces a word of a different class
–More complicated than inflectional
E.g. verbs → nouns
–-ize verbs → -ation nouns
–generalize, realize → generalization, realization
E.g. verbs, nouns → adjectives
–embrace, pity → embraceable, pitiable
–care, wit → careless, witless

E.g. adjective → adverb
–happy → happily
But “rules” have many exceptions
–Less productive: *evidence-less, *concern-less, *go-able, *sleep-able
–Meanings of derived terms are harder to predict by rule
clueless, careless, nerveless

Parsing Taking a surface input and identifying its components and underlying structure
Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships
–Stem and features:
goose → goose +N +SG or goose +V
geese → goose +N +PL
gooses → goose +V +3SG
–Bracketing: indecipherable → [in [[de [cipher]] able]]

Why parse words? For spell-checking
–Is muncheble a legal word?
To identify a word’s part-of-speech (POS)
–For sentence parsing, for machine translation, …
To identify a word’s stem
–For information retrieval
Why not just list all word forms in a lexicon?

How do people represent words? Hypotheses:
–Full listing hypothesis: words listed
–Minimum redundancy hypothesis: morphemes listed
Experimental evidence:
–Priming experiments (does seeing/hearing one word facilitate recognition of another?) suggest neither
–Regularly inflected forms prime their stem, but derived forms do not
–But spoken derived words can prime their stems if they are semantically close (e.g. government/govern but not department/depart)

Speech errors suggest affixes must be represented separately in the mental lexicon
–easy enoughly

What do we need to build a morphological parser?
Lexicon: list of stems and affixes (w/ corresponding POS)
Morphotactics of the language: model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs
–in → il in the context of l (in- + legal → illegal)
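The three ingredients can be sketched as plain data plus a rewrite function. All names here are hypothetical illustrations, and only the slide's in-/il- assimilation rule is shown; a real system would encode these as finite-state machinery.

```python
# Lexicon: stems and affixes with their classes (illustrative entries only).
LEXICON = {"cat": "noun-stem", "legal": "adj-stem", "s": "plural-suffix"}

# Morphotactics: which morpheme classes may follow which stem classes.
MORPHOTACTICS = {"noun-stem": {"plural-suffix"}, "adj-stem": set()}

def attach_in_prefix(stem: str) -> str:
    """Orthographic rule from the slide: in- assimilates to il- before l."""
    prefix = "il" if stem.startswith("l") else "in"
    return prefix + stem
```

So `attach_in_prefix("legal")` yields `"illegal"`, while `attach_in_prefix("decipherable")` keeps the plain prefix and yields `"indecipherable"`.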

Using FSAs to Represent English Plural Nouns
English nominal inflection: an FSA over states q0, q1, q2 in which a reg-n arc goes from q0 to q1, a plural (-s) arc goes from q1 to the final state q2, and irreg-sg-n and irreg-pl-n arcs go from q0 directly to q2
Inputs: cats, geese, goose
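A minimal sketch of this recognizer, with hypothetical word lists standing in for the sub-lexicons; irregular nouns jump straight to the final state, while regular nouns must take the plural -s arc.

```python
# Sub-lexicons (illustrative contents; the slide names only the classes).
REG_NOUNS = {"cat", "fox", "bird"}
IRREG_SG_NOUNS = {"goose", "mouse"}
IRREG_PL_NOUNS = {"geese", "mice"}

def accepts(word: str) -> bool:
    """Simulate the plural-noun FSA's accepting paths."""
    if word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True    # irreg arc: q0 straight to the final state
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True    # q0 -reg-n-> q1 -plural (-s)-> q2
    return False
```

This accepts cats, geese, and goose (the slide's inputs) and rejects strings like gooses, since goose is not in the regular sub-lexicon.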

Derivational morphology: adjective fragment
An FSA over states q0–q5, roughly: an optional un- prefix, then an adj-root1 stem, then an optional -er, -ly, or -est suffix; a separate path takes an adj-root2 stem followed only by -er or -est
Adj-root1: clear, happy, real (clearly)
Adj-root2: big, red (~bigly)

FSAs can also represent the Lexicon
Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj-root2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a letter-by-letter recognizer for adjectives

But…
Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
–Adding new items to the lexicon means recomputing the whole FSA
–Non-determinism
FSAs tell us whether a word is in the language or not, but usually we want to know more:
–What is the stem?
–What are the affixes, and of what sort are they?
–We used this information to recognize the word: can we get it back?

Parsing with Finite State Transducers
cats → cat +N +PL (a plural NP)
Koskenniemi’s two-level morphology
–Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
–Morphological parsing: find the mapping (transduction) between lexical and surface levels
Lexical: cat+N+PL
Surface: cats

Finite State Transducers can represent this mapping
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from the input and output alphabets
In general, FSTs can be used as
–Translators (Hello:Ciao)
–Parsers/generators (Hello:How may I help you?)
–As well as for Kimmo-style morphological parsing

An FST is a 5-tuple consisting of
–Q: a set of states {q0, q1, q2, q3, q4}
–Σ: an alphabet of complex symbols, each an i/o pair i:o s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), with Σ ⊆ I × O
–q0: a start state
–F: a set of final states in Q {q4}
–δ(q, i:o): a transition function mapping Q × Σ to Q
Emphatic sheep → quizzical cow: a transducer from q0 through q1, q2, q3 to the final state q4, with arcs b:m, a:o, a:o, !:? mapping baa! to moo?
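The sheep-to-cow transducer is small enough to simulate directly. This sketch collapses the chain of states into a per-symbol pair table, which is equivalent for this machine since every arc carries the same pairs.

```python
# Symbol pairs from the transducer's arcs: b:m, a:o, !:?
ARCS = {"b": "m", "a": "o", "!": "?"}

def transduce(word):
    """Consume one input symbol per arc, emitting its paired output symbol."""
    out = []
    for ch in word:
        if ch not in ARCS:
            return None    # no matching arc: the transducer rejects
        out.append(ARCS[ch])
    return "".join(out)

transduce("baa!")   # -> "moo?"
```

Inputs containing symbols with no arc (anything outside {b, a, !}) are rejected, mirroring an FST that simply has no transition to follow.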

FST for a 2-level Lexicon
E.g.:
–Reg-n: c a t
–Irreg-sg-n: g o o s e
–Irreg-pl-n: g o:e o:e s e
(Arcs carry identity pairs like c:c, a:a, t:t for cat; the geese branch pairs surface e with lexical o twice: g o:e o:e s e)

FST for English Nominal Inflection
An FST over states q0–q7: reg-n, irreg-n-sg, and irreg-n-pl arcs lead from q0 to q1, q2, q3; +N:ε arcs lead on to q4, q5, q6; and the feature arcs +PL:^s#, +SG:-#, and +PL:-s# write the endings, reaching the final state q7
E.g. lexical cat+N+PL maps to surface cat^s
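The lexical-to-intermediate step this FST computes can be sketched as a lookup plus concatenation. The stem table below is a hypothetical stand-in for the three sub-lexicons; '^' marks a morpheme boundary and '#' the end of the word, as in the slides.

```python
# Stem classes (illustrative entries for the three arcs of the FST).
STEM_CLASS = {"cat": "reg-n", "goose": "irreg-n-sg", "geese": "irreg-n-pl"}

def to_intermediate(lexical):
    """Map e.g. 'cat+N+PL' to the intermediate form 'cat^s#'."""
    stem, *features = lexical.split("+")     # "cat+N+PL" -> "cat", ["N", "PL"]
    if "PL" in features and STEM_CLASS.get(stem) == "reg-n":
        return stem + "^s#"                  # regular plural: +PL realized as ^s
    return stem + "#"                        # irregulars carry number in the stem
```

So `to_intermediate("cat+N+PL")` yields `"cat^s#"`, while `to_intermediate("geese+N+PL")` yields `"geese#"` because the irregular plural needs no suffix.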

Useful Operations on Transducers
Cascade: running 2+ FSTs in sequence
Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
Composition: apply FST2’s transition function to the result of FST1’s transition function
Inversion: exchanging the input and output alphabets (recognize and generate with the same FST)
cf. the AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
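Inversion in particular is trivial to state in code: swap the input and output symbol on every transition, turning a parser into a generator. The flat transition representation below is a hypothetical encoding, not a real toolkit's API.

```python
def invert(transitions):
    """Each transition is (state, in_sym, out_sym, next_state);
    inversion swaps the in/out symbols on every arc."""
    return [(q, o, i, r) for (q, i, o, r) in transitions]

# Two arcs of the sheep->cow machine, inverted to map moo back to baa.
fst = [(0, "b", "m", 1), (1, "a", "o", 2)]
invert(fst)   # -> [(0, "m", "b", 1), (1, "o", "a", 2)]
```

Toolkits such as the AT&T FSM tools provide this (along with composition and intersection) as built-in operations on weighted transducers.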

Orthographic Rules and FSTs
Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
Lexical: fox+N+PL
Intermediate: fox^s#
Surface: foxes
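One such rule, e-insertion, can be sketched as a rewrite over the intermediate form. Only the x/s/z case is shown (the slide's watch → watches would need ch/sh handling too), and the rule is expressed as a regex rather than as the FST a full system would use.

```python
import re

def e_insertion(intermediate):
    """Intermediate -> surface: insert e between x/s/z and the plural ^s,
    then strip the ^ boundary and # end-of-word markers."""
    surface = re.sub(r"([xsz])\^s#", r"\1es", intermediate)   # fox^s# -> foxes
    return surface.replace("^", "").replace("#", "")

e_insertion("fox^s#")   # -> "foxes"
e_insertion("cat^s#")   # -> "cats"
```

Cascading this after the lexical-to-intermediate FST completes the lexical → intermediate → surface pipeline shown on the slide.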

Porter Stemmer
Used for tasks in which you only care about the stem
–IR, modeling the given/new distinction, topic detection, document similarity
Rewrite rules (e.g. misunderstanding → misunderstand → understand → …)
Not perfect… but sometimes it doesn’t matter too much
Fast and easy
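The flavor of these rewrite rules can be seen in a sketch of step 1a of the Porter algorithm; the real stemmer has many more rules, ordered steps, and conditions on the measure of the stem.

```python
# Step 1a of the Porter stemmer: longest-matching suffix rule wins.
# The identity rule ("ss" -> "ss") blocks the bare -s rule for words like caress.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def stem_step1a(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

stem_step1a("caresses")   # -> "caress"
stem_step1a("ponies")     # -> "poni"
stem_step1a("cats")       # -> "cat"
```

Outputs like "poni" show why stemmers are judged by usefulness for IR rather than by producing real words.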

Summing Up
FSTs provide a useful tool for implementing a standard model of morphological analysis, Koskenniemi’s two-level morphology
But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
Next time:
–Read Ch 4
–Read over HW1 and ask questions now