CS 4705 Morphology: Words and their Parts. Julia Hirschberg.

Words
In formal languages, words are arbitrary strings. In natural languages, words are made up of meaningful subunits called morphemes:
– Morphemes are abstract concepts denoting entities or relationships
– Morphemes may be:
Stems: the main morpheme of the word
Affixes: convey the word’s role, number, gender, etc.
cats == cat [stem] + s [suffix]
undo == un [prefix] + do [stem]

Why do we need to do Morphological Analysis?
Morphological analysis studies how words are composed from smaller, meaning-bearing units (morphemes).
Applications:
– Spelling correction: referece
– Hyphenation algorithms: refer-ence
– Part-of-speech analysis: googler [N], googling [V]
– Text-to-speech: grapheme-to-phoneme conversion, e.g. hothouse (/T/ or /D/)

– Lets us guess the meaning of unknown words:
‘Twas brillig and the slithy toves…
Muggles moogled migwiches

Morphotactics
What are the ‘rules’ for constructing a word in a given language?
– Pseudo-intellectual vs. *intellectual-pseudo
– Rational-ize vs. *ize-rational
– Cretin-ous vs. *cretin-ly vs. *cretin-acious
Possible ‘rules’:
– Suffixes are suffixes and prefixes are prefixes
– Certain affixes attach to certain types of stems (nouns, verbs, etc.)
– Certain stems can/cannot take certain affixes
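These morphotactic constraints can be sketched as a table-driven check. The stem classes and the suffix table below are illustrative stand-ins, not data from the lecture:

```python
# A minimal sketch of a morphotactic check: each (hypothetical) suffix
# lists the stem categories it may attach to.
STEMS = {"rational": "adj", "cretin": "noun", "walk": "verb"}
SUFFIXES = {"ize": {"adj"}, "ous": {"noun"}, "ing": {"verb"}}

def well_formed(stem, suffix):
    """Accept stem+suffix only if this suffix attaches to that stem class."""
    return stem in STEMS and suffix in SUFFIXES and STEMS[stem] in SUFFIXES[suffix]
```

Here well_formed("rational", "ize") accepts rational-ize, while well_formed("cretin", "ize") rejects *cretin-ize.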

Semantics: In English, un- cannot attach to adjectives that already have a negative connotation:
– Unhappy vs. *unsad
– Unhealthy vs. *unsick
– Unclean vs. *undirty
Phonology: In English, -er cannot attach to words of more than two syllables:
– Great, greater
– Happy, happier
– Competent, *competenter
– Elegant, *eleganter
– Unruly, ?unrulier

Regular and Irregular Morphology
Regular:
– Walk, walks, walking, walked, walked
– Table, tables
Irregular:
– Eat, eats, eating, ate, eaten
– Catch, catches, catching, caught, caught
– Cut, cuts, cutting, cut, cut
– Goose, geese

Morphological Parsing
Algorithms to use these regularities and known irregularities to parse words into their morphemes:
– Cats → cat +N +PL
– Cat → cat +N +SG
– Cities → city +N +PL
– Merging → merge +V +Present-participle
– Caught → catch +V +Past-participle
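A toy lookup-based parser in this spirit; the lexicon, the irregular table, and the tag spellings are illustrative, not a real analyzer's data:

```python
# Toy morphological parser: irregulars are listed explicitly, regular
# plurals are handled by two suffix rules (-ies and -s).
IRREGULAR = {
    "caught": "catch +V +Past-participle",
    "geese": "goose +N +PL",
}
NOUN_STEMS = {"cat", "city", "dog"}

def parse(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and word[:-3] + "y" in NOUN_STEMS:
        return word[:-3] + "y +N +PL"   # cities -> city +N +PL
    if word.endswith("s") and word[:-1] in NOUN_STEMS:
        return word[:-1] + " +N +PL"    # cats -> cat +N +PL
    if word in NOUN_STEMS:
        return word + " +N +SG"         # cat -> cat +N +SG
    return None
```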

Morphology and Finite State Automata
We can use the machinery provided by FSAs to capture facts about morphology:
– Accept strings that are in the language
– Reject strings that are not
– Do this in a way that doesn’t require us, in effect, to list all the words in the language

How do we build a Morphological Analyzer?
– Lexicon: a list of stems and affixes (with corresponding part of speech (p.o.s.))
– Morphotactics of the language: a model of how and which morphemes can be affixed to a stem
– Orthographic rules: spelling modifications that may occur when affixation occurs, e.g. in- → il- in the context of l (in- + legal → illegal)
Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes.
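An orthographic rule like in- → il- can be sketched as a small rewrite function. This is a simplification; the ir- and im- variants below are additional English assimilation facts included for completeness:

```python
# Sketch of English in- prefix assimilation.
def attach_in(stem):
    if stem.startswith("l"):
        return "il" + stem   # in- + legal    -> illegal
    if stem.startswith("r"):
        return "ir" + stem   # in- + regular  -> irregular
    if stem[0] in "bmp":
        return "im" + stem   # in- + possible -> impossible
    return "in" + stem       # in- + sane     -> insane
```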

A Simple Example
– Regular singular nouns are ok as is
– Regular plural nouns have an -s on the end
– Irregulars are ok as is

Simple English NP FSA

Expand the Arcs with Stems and Affixes (e.g. cat, dog, child)
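One way to sketch the noun FSA is as a transition table over arc labels. The arc names follow the textbook's reg-noun / irreg-sg-noun / irreg-pl-noun convention; the state names q0, q1, q2 are my own:

```python
# Noun FSA: q1 accepts a bare regular noun, q2 accepts after an
# irregular form or after the plural -s arc.
ARCS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPTING = {"q1", "q2"}

def accepts(labels, start="q0"):
    """Run a sequence of arc labels through the machine."""
    state = start
    for label in labels:
        state = ARCS.get((state, label))
        if state is None:   # no arc: reject
            return False
    return state in ACCEPTING
```

So ["reg-noun", "plural-s"] (cats) and ["irreg-pl-noun"] (geese) are accepted, while ["irreg-pl-noun", "plural-s"] (*geeses) is rejected.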

We can now run strings through these machines to recognize strings in the language:
– Accept words that are ok
– Reject words that are not
But is this enough? We often want to know the structure of a word (parsing), or we may have a stem and want to produce a surface form (production/generation).
Example:
– From “cats” to “cat +N +PL”
– From “cat +N +PL” to “cats”

Finite State Transducers (FSTs)
The simple story:
– Add another tape
– Add extra symbols to the transitions
– On one tape we read “cats”; on the other we write “cat +N +PL”
– Or vice versa…

Koskenniemi Two-Level Morphology
Kimmo Koskenniemi’s idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography).

– c:c means read a c on one tape and write a c on the other
– +N:ε means read a +N symbol on one tape and write nothing on the other
– +PL:s means read +PL and write an s
So “cats” corresponds to the pair sequence c:c a:a t:t +N:ε +PL:s
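A minimal sketch of the two-tape idea, representing the transduction for "cats" as an explicit list of (lexical, surface) symbol pairs, with the empty string standing in for ε:

```python
# The cat +N +PL <-> cats correspondence as (lexical, surface) pairs.
PAIRS = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", ""), ("+PL", "s")]

def surface(pairs):
    """Read the lexical tape, write the surface tape."""
    return "".join(s for _, s in pairs)

def lexical(pairs):
    """Read the surface tape, write the lexical tape."""
    return "".join(l for l, _ in pairs)
```

Running the pairs one way yields "cats"; the other way yields "cat+N+PL", illustrating that the same machine does both parsing and generation.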

Not So Simple
Of course, it’s not as easy as “cat +N +PL” ↔ “cats”:
– What do we do about geese, mice, oxen?
– Many spelling/pronunciation changes go along with inflectional changes, e.g. fox and foxes

Multi-Tape Machines
Solution for complex changes:
– Add more tapes
– Use the output of one tape machine as input to the next
So to handle irregular spelling changes, add intermediate tapes with intermediate symbols.

Example of a Multi-Tape Machine We use one machine to transduce between the lexical and the intermediate level, and another to transduce between the intermediate and the surface tapes

FST Fragment: Lexical to Intermediate

FST Fragment: Intermediate to Surface
The add-an-“e” rule, as in fox^s# → foxes#

Putting Them Together
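The two-stage cascade can be sketched with plain string rewrites standing in for the two FSTs. The tag-to-boundary mapping in stage 1 is a simplification that handles only +N and +PL:

```python
import re

def lexical_to_intermediate(lex):
    """Stage 1: replace feature tags with boundary symbols ^ and #."""
    return lex.replace(" +N", "").replace(" +PL", "^s") + "#"

def intermediate_to_surface(inter):
    """Stage 2: insert e between x/s/z and the plural s, drop boundaries."""
    out = re.sub(r"(?<=[xsz])\^(?=s#)", "e", inter)
    return out.replace("^", "").replace("#", "")
```

Composing the stages maps "fox +N +PL" through the intermediate form "fox^s#" to the surface form "foxes", while "cat +N +PL" surfaces as "cats" with no e-insertion.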

Practical Uses
The kind of parsing we’re talking about is normally called morphological analysis. It can be:
– An important stand-alone component of an application (spelling correction, information retrieval, part-of-speech tagging, …)
– Or simply a link in a chain of processing (machine translation, parsing, …)

Porter Stemmer
Standard, very popular and usable stemmer (IR, IE)
A sequence of cascaded rewrite rules, e.g.:
– IZE → ε (e.g. unionize → union)
– CY → T (e.g. frequency → frequent)
– ING → ε, if the stem contains a vowel (motoring → motor)
Can be implemented as a lexicon-free FST (many implementations available on the web)
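A drastically simplified sketch of such a rule cascade; the real Porter stemmer has many more rules and "measure" conditions on the stem, and the vowel check here is only an approximation:

```python
import re

# Three Porter-style rewrite rules, tried in order; the first that
# changes the word wins.
RULES = [
    (re.compile(r"ize$"), ""),               # IZE -> ε: unionize -> union
    (re.compile(r"cy$"), "t"),               # CY -> T: frequency -> frequent
    (re.compile(r"(?<=[aeiou].)ing$"), ""),  # ING -> ε if a vowel precedes
]

def stem(word):
    for pattern, replacement in RULES:
        new = pattern.sub(replacement, word)
        if new != word:
            return new
    return word
```

Note how the vowel condition keeps "sing" intact while stripping "motoring" down to "motor".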

Note: Morphology Differs by Language
Languages differ in how they encode morphological information:
– Isolating languages (e.g. Cantonese) have no affixes: each word usually has one morpheme
– Agglutinative languages (e.g. Finnish, Turkish) build words from prefixes and suffixes added to a stem (like beads on a string), each feature realized by a single affix, e.g. Finnish epäjärjestelmällistyttämättömyydellänsäkäänköhän ‘Wonder if he can also... with his capability of not causing things to be unsystematic’

– Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb’s arguments into the verb, e.g. Western Greenlandic:
Aliikusersuillammassuaanerartassagaluarpaalli.
aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
'However, they will say that he is a great entertainer, but...'
– So… different languages may require very different morphological analyzers

Summing Up
Regular expressions and FSAs can represent subsets of natural language as well as regular languages:
– Both representations may be difficult for humans to understand for any real subset of a language
– Can be hard to scale up, e.g. when there are many choices at any point (e.g. surnames)
– But quick, powerful and easy to use for small problems
– The AT&T Finite State Toolkit does scale
Next class: read Ch 4

Morphological Representations: Evidence from Human Performance
Hypotheses:
– Full listing hypothesis: whole words are listed in the mental lexicon
– Minimum redundancy hypothesis: morphemes are listed
Experimental evidence:
– Priming experiments (does seeing/hearing one word facilitate recognition of another?) suggest neither hypothesis alone
– Regularly inflected forms (e.g. cars) prime the stem (car), but derived forms (e.g. management, manage) do not

– But spoken derived words can prime their stems if they are semantically close (e.g. government/govern but not department/depart)
Speech errors suggest affixes must be represented separately in the mental lexicon:
– ‘easy enoughly’ for ‘easily enough’

Concatenative vs. Non-concatenative Morphology
Semitic root-and-pattern morphology:
– The root (2-4 consonants) conveys the basic semantics (e.g. Arabic /ktb/)
– The vowel pattern conveys voice and aspect
– The derivational template (binyan) identifies the word class

Template    Active     Passive    Gloss
CVCVC       katab      kutib      write
CVCCVC      kattab     kuttib     cause to write
CVVCVC      ka:tab     ku:tib     correspond
tVCVVCVC    taka:tab   tuku:tib   write each other
nCVVCVC     nka:tab    nku:tib    subscribe
CtVCVC      ktatab     ktutib     write
stVCCVC     staktab    stuktib    dictate
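Interdigitation of a root with a template can be sketched for the simple (non-geminating) templates: C slots take root consonants, V slots take vowels from the pattern, and literal template characters such as t, n, st pass through. Templates with gemination like CVCCVC (kattab) would need extra machinery not shown here:

```python
# Fill a template's C and V slots from the root and the vowel pattern.
def interdigitate(root, template, vowels):
    consonants, vowel_seq = iter(root), iter(vowels)
    out = []
    for slot in template:
        if slot == "C":
            out.append(next(consonants))
        elif slot == "V":
            out.append(next(vowel_seq))
        else:
            out.append(slot)   # literal template material (t, n, st, ...)
    return "".join(out)
```

With root ktb, the a-a pattern yields the active forms (katab, staktab) and the u-i pattern the passives (kutib, stuktib), matching the table above.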