Morphological analysis

Slides:



Advertisements
Similar presentations
Finite-state automata and Morphology
Advertisements

Jing-Shin Chang1 Morphology & Finite-State Transducers Morphology: the study of constituents of words Word = {a set of morphemes, combined in language-dependent.
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Computational Morphology. Morphology S.Ananiadou2 Outline What is morphology? –Word structure –Types of morphological operation – Levels of affixation.
Morphology.
1 Morphology September 2009 Lecture #4. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
1 Morphology September 4, 2012 Lecture #3. 2 What is Morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units.
Finite-State Transducers Shallow Processing Techniques for NLP Ling570 October 10, 2011.
Morphology, Phonology & FSTs Shallow Processing Techniques for NLP Ling570 October 12, 2011.
5/16/ ICS 482 Natural Language Processing Words & Transducers-Morphology - 1 Muhammed Al-Mulhem March 1, 2009.
Brief introduction to morphology
BİL711 Natural Language Processing1 Morphology Morphology is the study of the way words are built from smaller meaningful units called morphemes. We can.
6/2/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 2 Giuseppe Carenini.
6/10/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 3 Giuseppe Carenini.
LIN3022 Natural Language Processing Lecture 3 Albert Gatt 1LIN3022 Natural Language Processing.
Announcements  Revised Final Exam date:  THURSDAY 03/15/ :30-10:20 BAG 131.
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg.
CS 4705 Lecture 3 Morphology: Parsing Words. What is morphology? The study of how words are composed from smaller, meaning-bearing units (morphemes) –Stems:
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Some Basic Concepts: Morphology.
Finite State Transducers The machine model we will study for morphological parsing is called the finite state transducer (FST) An FST has two tapes –input.
CS 4705 Morphology: Words and their Parts CS 4705 Julia Hirschberg.
Introduction to English Morphology Finite State Transducers
Morphology 1 Introduction Morphology Morphological Analysis (MA)
Chapter 3. Morphology and Finite-State Transducers From: Chapter 3 of An Introduction to Natural Language Processing, Computational Linguistics, and Speech.
Morphology and Finite-State Transducers. Why this chapter? Hunting for singular or plural of the word ‘woodchunks’ was easy, isn’t it? Lets consider words.
Morphology (CS ) By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya.
Finite-state automata 3 Morphology Day 14 LING Computational Linguistics Harry Howard Tulane University.
Morphology: Words and their Parts CS 4705 Slides adapted from Jurafsky, Martin Hirschberg and Dorr.
Lecture 3, 7/27/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 4 28 July 2005.
October 2006Advanced Topics in NLP1 CSA3050: NLP Algorithms Finite State Transducers for Morphological Parsing.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Morphological Recognition We take each sub-lexicon of each stem class and we expand each arc (e.g. the reg-noun arc) with all the morphemes that make up.
Introduction Morphology is the study of the way words are built from smaller units: morphemes un-believe-able-ly Two broad classes of morphemes: stems.
LING 388: Language and Computers Sandiway Fong Lecture 22: 11/10.
Session 11 Morphology and Finite State Transducers Introduction to Speech Natural and Language Processing (KOM422 ) Credits: 3(3-0)
Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model.
Lecture 3, 7/27/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2005 Lecture 3 27 July 2005.
Reasons to Study Lexicography  You love words  It can help you evaluate dictionaries  It might make you more sensitive to what dictionaries have in.
Finite State Transducers
Finite State Transducers for Morphological Parsing
Words: Surface Variation and Automata CMSC Natural Language Processing April 3, 2003.
Morphological Processing & Stemming Using FSAs/FSTs.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 3 27 July 2007.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
Chapter III morphology by WJQ. Morphology Morphology refers to the study of the internal structure of words, and the rules by which words are formed.
CS 4705 Lecture 3 Morphology. What is morphology? The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)
Linguistics The ninth week. Chapter 3 Morphology  3.1 Introduction  3.2 Morphemes.
Morphological typology
Natural Language Processing Chapter 2 : Morphology.
III. MORPHOLOGY. III. Morphology 1. Morphology The study of the internal structure of words and the rules by which words are formed. 1.1 Open classes.
October 2004CSA3050 NLP Algorithms1 CSA3050: Natural Language Algorithms Morphological Parsing.
Tasneem Ghnaimat. Language Model An abstract representation of a (natural) language. An approximation to real language Assume we have a set of sentences,
Introduction to Linguistics Unit Four Morphology, Part One Dr. Judith Yoel.
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
Basic Text Processing: Morphology Word Stemming
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Lecture 7 Summary Survey of English morphology
Speech and Language Processing
Chapter 6 Morphology.
Morphology: Parsing Words
Morphology.
Speech and Language Processing
CSCI 5832 Natural Language Processing
Token generation - stemming
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
CPSC 503 Computational Linguistics
Morphological Parsing
Basic Text Processing: Morphology Word Stemming
Presentation transcript:

Morphological analysis LING 570 Fei Xia Week 3: 10/14/09 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAA

Outline The task Porter stemmer FST morphological analyzer: J&M 3.1-3.8

The task To break word down into component morphemes and build a structured representation A morpheme is the minimal meaning-bearing unit in a language. Stem: the morpheme that forms the central meaning unit in a word Affix: prefix, suffix, infix, circumfix Prefix: e.g., possible  impossible Suffix: e.g., walk  walking Infix: e.g., hingi  humingi (Tagalog) Circumfix: e.g., sagen  gesagt (German)

Two slightly different tasks Stemming: Ex: writing  writ + ing (or write + ing) Lemmatization: Ex1: writing  write +V +Prog Ex2: books  book +N +Pl Ex3: writes  write +V +3Per +Sg

Ambiguity in morphology flies  fly +N +PL flies  fly +V +3rd +Sg

Language variation Isolated languages: e.g., Chinese Morphologically poor languages: e.g., English Morphologically complex languages: e.g., Turkish

Ways to combine morphemes to form words Inflection: stem + gram. morpheme  same class Ex: help + ed  helped Derivation: stem + gram. morpheme  different class Ex: civil + -zation  civilization Compounding: multiple stems Ex: cabdriver, doghouse Cliticization: stem + clitic Ex: they’ll, she’s (*I don’t know who she is)

Porter stemmer

Porter stemmer The algorithm was introduced in 1980 by Martin Porter. http://www.tartarus.org/~martin/PorterStemmer/def.txt Purpose: to improve IR. It removes suffixes only. Ex: civilization  civil It is rule-based, and does not require a lexicon.

How does it work? The format of rules: (condition) S1  S2 Ex: (m>1) ZATION  ² Rules are partially ordered: Step 1a: -s Step 1b: -ed, -ing Step 2-4: derivational suffixes Step 5: some final fixes How well does it work? What are the main problems with this kind of approach?

FST morphological analyzer

FST morphological analysis Read J&M Chapter 3 English morphology: FSA acceptor: Ex: cats  yes/no, foxs  yes/no FSTs for morphological analysis: Ex: fox +N +PL  fox^s# Adding orthographic rules: Ex: fox^s#  foxes#

English morphology  For now, we will focus on inflection only. Affixes: prefixes, suffixes; no infixes, circumfixes. Inflectional: Noun: -s Verbs: -s, -ing, -ed, -ed Adjectives: -er, -est Derivational: Ex: V + suf  N computerize + -ation  computerization kill + er  killer Compound: pickup, database, heartbroken, etc. Cliticization: ‘m, ‘ve, ‘re, etc.  For now, we will focus on inflection only.

Three components Lexicon: the list of stems and affixes, with associated features. Ex: book: N -s: +PL Morphotactics: Ex: +PL follows a noun Orthographic rules (spelling rules): to handle exceptions that can be dealt with by rules. Ex1: y  ie: fly + -s  flies Ex2: ²  e: fox + -s  foxes Ex2’: ²  e / x^_s#

An example Task: foxes  fox +N +PL Surface: foxes Intermediate: fox s Lexical: fox +N +pl Orthographic rules Lexicon + morphotactics

Three levels

The lexicon (in general) The role of the lexicon is to associate linguistic information with words of the language. Many words are ambiguous: with more than one entry in the lexicon. Information associated with a word in a lexicon is called a lexical entry.

The lexicon (cont) fly: v, +base fly: n, +sg fox: n, +sg fly: (NP, V) fly: (NP, V, NP) Should the following be included in the lexicon? flies: v, +sg +3rd flies: n, +pl foxes: n, +pl flew: v, +past

The lexicon for English noun inflection fox: n, +sg, +reg  reg-noun goose: n, +sg, -reg  irreg-sg-noun geese: n, +pl, -reg  irreg-pl-noun

An acceptor

Expanded FSA q1 q0 q2

Lexicon for English verbs fly: v, +base, +irreg  irreg-verb-stem flew: v, +past, +irreg  irreg-past-verb walk: v, +base, +reg  reg-verb-stem

An FSA for the English verb

An FSA for English derivational morphology

So far Ex: cats Ex: civilize Remaining issues: Have the entry “cat: reg-noun” in the lexicon A path: q0  q1  q2 Result: cats  cat s  cat^s# Ex: civilize Have the entry “civil: noun1” in the lexicon Result: civilize  civil^ize# Remaining issues: cat^s#  cat +N +PL spelling changes: foxes  fox^s#

FST morphological analysis English morphology: J&M 3.1 FSA acceptor: J&M 3.3 Ex: cats  yes/no, foxs  yes/no FSTs for morphological analysis: J&M 3.5 Ex: fox +N +PL  fox^s# Adding orthographic rules: J&M 3.6-3.7 Ex: fox^s#  foxes#

Three levels Lexical level: Intermediate level: Surface level:

An acceptor

An FST cat +N +PL  cat^s# cat +N +Sg  cat#

The lexicon for FST reg-non Irreg-pl-noun Irreg-sg-noun fox g o:e o:e s e goose cat sheep aardvark m o:i u:² s:c e mouse goose  geese mouse  mice

Expanding FST fox +N + Pl  fox^s# cat +N +Pl  cat^s# goose +N +Sg  goose# goose +N +Pl  geese#

FST morphological analysis English morphology: J&M 3.1 FSA acceptor: J&M 3.3 Ex: cats  yes/no, foxs  yes/no FSTs for morphological analysis: J&M 3.5 Ex: fox +N +PL  fox^s# Adding orthographic rules: J&M 3.6-3.7 Ex: fox^s#  foxes#

Orthographic rules E insertion: fox  foxes 1st try: ²  e “e” is added after -s, -x, -z, etc. before -s 2nd try: ²  e / (s|x|z|) _ s Problem? Ex: glass  glases 3rd try: ²  e / (s|x|z)^_ s#

Rewrite rules Format: Rewrite rules can be optional or obligatory Rewrite rules can be ordered to reduce ambiguity. Under some conditions, these rewrite rules are equivalent to FSTs. ® is not allowed to match something introduced in the previous rule application

Representing orthographic rules as FSTs ²  e / (s|x|z)^_ s# Input: …(s|x|z)^s# immediate level Output: …(s|x|z)es# surface level To reject (fox^s, foxs)

(fox, fox): q0, q0, q0, q1 (fox#, fox#): q0, q0, q0, q1, q0 (fox^z#, foxz#), q0, q0, q0, q1, q2, q1, q0 (fox^s#, foxes#): q0, q0, q0, q1, q2, q3, q4, q0 (fox^s, foxs): q0, q0, q0, q1, q2, q5

What would the FST accept? (f, f) (fox, fox) (fox#, fox#) (fox^z#, foxz#) (fox^s#, foxes#) It will reject: (fox^s, foxs)

Combining lexicon and rules Lexical level: Intermediate level: Surface level:

Summary of FST morphological analyzer Three components: Lexicon Morphotactics Orthographic rules Representing morphotactics as FST and expand it with the lexicon entries. Representing orthographic rules as FSTs. Combining all FSTs with operations such as composition. Giving the three components, creating and combining FSTs can be done automatically.

Hw4: Q1-Q3

Q1-Q3 (cont) Q1: expand_fsm1.sh Q2: morph_acceptor1.sh cuts => yes cuted => no Q3: expand_fsm2.sh and morph_acceptor2.sh cuts => cut/irreg_verb_stem s/3sg cuted => *NONE*

Hw4: Q4

Hw5: Q5 Compare Porter’s stemmer with FST morphological analyzer

Remaining issues Creating the three components by hand is time consuming.  unsupervised morphological induction How would a morphological analyzer help a particular application (e.g., IR, MT)?

How does the induction work? Start from a simple list of words and their frequencies: Ex: play 27 played 100 walked 40 Try to find the most efficient way to encode the wordlist: Ex: minimum description length (MDL)

General approach Initialize: start from an initial set of “words” and find the description length of this set Repeat until convergence Generate a candidate set of new “words” that will each enable a reduction in the description length Ex: walk, walked, play, played four words two words (walk and play) and a suffix (-ed)