Morphological Parsing

Morphological Parsing (CS 4705)

Parsing
Taking a surface input and analyzing its components and underlying structure.
Morphological parsing: taking a word or string of words as input and identifying their stems and affixes (and sometimes interpreting these).
E.g.:
  goose → goose +N +SG or goose +V
  geese → goose +N +PL
  gooses → goose +V +3SG
Bracketing: indecipherable → [in [[de [cipher]] able]]
  cipher: from Middle French cifre, from Medieval Latin cifra, from Arabic ṣifr (zero, nothing)

Why ‘parse’ words?
To find stems
  A simple key to word similarity: yellow, yellowish, yellows, yellowed, yellowing…
To find affixes and the information they convey
  ‘ed’ signals a verb, ‘ish’ an adjective, ‘s’?
Morphological parsing provides information about a word’s semantics and the syntactic role it plays in a sentence.

Some Practical Applications
For spell-checking
  Is muncheble a legal word?
To identify a word’s part of speech (POS)
  For sentence parsing, for machine translation, …
To identify a word’s stem
  For information retrieval
Why not just list all word forms in a lexicon?

What do we need to build a morphological parser?
Lexicon: list of stems and affixes (with corresponding parts of speech)
Morphotactics of the language: model of how and which morphemes can be affixed to a stem
Orthographic rules: spelling modifications that may occur when affixation occurs
  in- → il- in the context of l (in- + legal → illegal)
Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes.

Using FSAs to Represent English Plural Nouns
English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-pl-n, irreg-sg-n]
Inputs: cats, geese, goose
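
The plural-noun FSA can be sketched as a small recognizer. This is a minimal illustration, not the lecture's own code: the word lists are made up, and the arc layout (q0 to q1 on reg-n, q1 to q2 on -s, q0 to q2 on the irregular classes, with q1 and q2 both final) is an assumption about the figure.

```python
# Toy sub-lexicons; these word lists are illustrative assumptions.
REG_N = {"cat", "dog", "fox"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese", "mice"}

def accepts(word: str) -> bool:
    """Simulate the three-state FSA (assumed: q1 and q2 are final)."""
    # q0 --irreg-sg-n--> q2 and q0 --irreg-pl-n--> q2
    if word in IRREG_SG_N or word in IRREG_PL_N:
        return True
    # q0 --reg-n--> q1 (bare singular)
    if word in REG_N:
        return True
    # q0 --reg-n--> q1 --plural -s--> q2
    return word.endswith("s") and word[:-1] in REG_N

for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts(w))
```

Because no spelling rules are applied, this sketch would also accept "foxs"; handling that is the job of the orthographic rules.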

Derivational morphology: adjective fragment
[Figure: FSA with states q0 to q5 and arcs labeled un-, ε, adj-root1, adj-root2, "-er, -ly, -est" and "-er, -est"; a subset of the nodes is drawn in purple]
What will happen if we use only the FSA defined by the purple nodes?
  It will allow unbig, unred, …
Solution: define classes of adjective stems
  Adj-root1: clear, happy, real (clearly)
  Adj-root2: big, red (*bigly)
NFSA: easier and more intuitive to define
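
A minimal sketch of the two stem classes, assuming the input is already segmented into morphemes (the function name and that input format are illustrative choices, not part of the slides):

```python
ADJ_ROOT1 = {"clear", "happy", "real"}  # may take un- and -ly
ADJ_ROOT2 = {"big", "red"}              # no un-, no -ly

def accepts_adj(morphemes) -> bool:
    """morphemes: a pre-segmented list such as ["un", "clear", "ly"]."""
    ms = list(morphemes)
    has_un = bool(ms) and ms[0] == "un"
    if has_un:
        ms = ms[1:]
    if not ms or len(ms) > 2:
        return False
    root = ms[0]
    suffix = ms[1] if len(ms) == 2 else None
    if root in ADJ_ROOT1:
        return suffix in (None, "er", "ly", "est")
    if root in ADJ_ROOT2 and not has_un:
        return suffix in (None, "er", "est")
    return False

print(accepts_adj(["un", "clear", "ly"]))  # accepted
print(accepts_adj(["un", "big"]))          # rejected: *unbig
```

Splitting the roots into classes is what blocks *unbig and *bigly while still allowing unclear and clearly.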

FSAs can also represent the Lexicon
Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2 = {big, red}) and then expand each of these stems into its letters (e.g. red → r e d) to get a recognizer for adjectives.
[Figure: letter-level FSA with states q0 to q7 and arcs un-, b, i, g, r, e, d, and "-er, -est"]

But…
Covering the whole lexicon this way will require very large FSAs, with consequent search and maintenance problems
  Adding new items to the lexicon means recomputing the whole FSA: it must be determinized and minimized each time
  Non-determinism
FSAs tell us whether a word is in the language or not, but usually we want to know more:
  What is the stem?
  What are the affixes and what sort are they?
  We used this information to recognize the word: why can’t we store it?

Parsing with Finite State Transducers
cats → cat +N +PL (a plural NP)
Kimmo Koskenniemi’s two-level morphology
  Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography)
  Morphological parsing: find the mapping (transduction) between the lexical and surface levels
  lexical: c a t +N +PL
  surface: c a t s

Finite State Transducers can represent this mapping
FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets.
In general, FSTs can be used for
  Translators (Hello:Ciao)
  Parser/generators (Hello:How may I help you?)
  As well as Kimmo-style morphological parsing

An FST is a 5-tuple consisting of
  Q: a set of states {q0, q1, q2, q3, q4}
  Σ: an alphabet of complex symbols, each an i/o pair i:o such that i ∈ I (an input alphabet) and o ∈ O (an output alphabet); Σ ⊆ I × O
  q0: a start state
  F: a set of final states in Q, {q4}
  δ(q, i:o): a transition function mapping Q × Σ to Q
Emphatic Sheep → Quizzical Cow
[Figure: FST with states q0 to q4 and arcs b:m, a:o, a:o, a:o, !:?]
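
The 5-tuple can be written down directly as data. The sketch below is a simplified, deterministic rendering of the sheep-to-cow example; the exact arc layout (in particular the a:o self-loop on q3) is an assumption about the figure, and the names START, FINALS, DELTA and transduce are illustrative.

```python
from typing import Optional

START = "q0"
FINALS = {"q4"}
# delta, stored as (state, input symbol) -> (next state, output symbol)
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),  # assumed self-loop for extra a's
    ("q3", "!"): ("q4", "?"),
}

def transduce(s: str) -> Optional[str]:
    """Return the output string, or None if the input is rejected."""
    state, out = START, []
    for ch in s:
        step = DELTA.get((state, ch))
        if step is None:
            return None
        state, symbol = step
        out.append(symbol)
    return "".join(out) if state in FINALS else None

print(transduce("baaa!"))
```

Storing each arc as an input:output pair is what distinguishes this from a plain recognizer: the same walk through the states both accepts the input and builds the output.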

FST for a 2-level Lexicon
[Figure: FST with states q0 to q7; e.g. the path c:c a:a t:t, and a goose/geese path containing the pairs e:o e:o]
  Reg-n: c a t
  Irreg-pl-n: g o:e o:e s e
  Irreg-sg-n: g o o s e
NB: by convention, a:a is written just a

FST for English Nominal Inflection
[Figure: FST with states q0 to q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:ε, +SG:-#, +PL:^s#, +PL:-s#; example input c a t s with tags +N +PL]
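
One way to see what this transducer encodes is to enumerate the (lexical, intermediate) pairs it licenses and look a form up in that relation. This is only a sketch of the relation, not a state-by-state simulation; the toy lexicon is assumed, and it uses ^ for a morpheme boundary and # for a word boundary.

```python
# Toy lexicon (assumed): lexical stem -> surface stem.
REG_N = {"cat": "cat", "fox": "fox"}
IRREG_SG_N = {"goose": "goose"}
IRREG_PL_N = {"goose": "geese"}  # from the pairs g o:e o:e s e

def generate():
    """Yield (lexical, intermediate) pairs licensed by the toy lexicon."""
    for lex, surf in REG_N.items():
        yield f"{lex} +N +SG", f"{surf}#"
        yield f"{lex} +N +PL", f"{surf}^s#"
    for lex, surf in IRREG_SG_N.items():
        yield f"{lex} +N +SG", f"{surf}#"
    for lex, surf in IRREG_PL_N.items():
        yield f"{lex} +N +PL", f"{surf}#"

def parse(intermediate: str):
    """Run the relation 'backwards': intermediate form -> lexical forms."""
    return [lex for lex, inter in generate() if inter == intermediate]

print(parse("cat^s#"))
print(parse("geese#"))
```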

Useful Operations on Transducers
Cascade: running 2+ FSTs in sequence
Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
Composition: apply FST2 transition function to the result of FST1 transition function
Inversion: exchanging the input and output alphabets (recognize and generate with the same FST)
cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
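
Composition and inversion are easiest to see on finite relations, i.e. sets of (input, output) string pairs. The sketch below uses such sets as a stand-in for transducers; real toolkits perform these operations on the state graphs themselves, and the example pairs are assumed.

```python
def invert(rel):
    """Swap input and output sides of a relation."""
    return {(o, i) for i, o in rel}

def compose(r1, r2):
    """Feed every output of r1 into r2."""
    return {(a, c) for a, b in r1 for b2, c in r2 if b == b2}

# Illustrative relations: lexical -> intermediate -> surface.
LEX2INT = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INT2SURF = {("fox^s#", "foxes"), ("cat^s#", "cats")}

LEX2SURF = compose(LEX2INT, INT2SURF)  # generator: lexical -> surface
SURF2LEX = invert(LEX2SURF)            # parser: surface -> lexical
print(sorted(SURF2LEX))
```

Inverting the composed relation gives the parser for free, which is the sense in which one FST can both recognize and generate.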

Orthographic Rules and FSTs
Define additional FSTs to implement rules such as consonant doubling (beg → begging), ‘e’ deletion (make → making), ‘e’ insertion (watch → watches), etc.
  Lexical:      f o x +N +PL
  Intermediate: f o x ^ s #
  Surface:      f o x e s
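
The ‘e’ insertion rule can be sketched as a rewrite from the intermediate level to the surface level. A regular-expression substitution stands in for the rule FST here; the set of triggering endings (x, s, z, ch, sh) and the function name are assumptions made for illustration.

```python
import re

def to_surface(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    # e-insertion: after x, s, z, ch or sh, a boundary before 's#' becomes 'e'
    s = re.sub(r"(x|s|z|ch|sh)\^(?=s#)", r"\1e", intermediate)
    # drop any remaining morpheme (^) and word (#) boundary symbols
    return s.replace("^", "").replace("#", "")

for form in ["fox^s#", "cat^s#", "watch^s#"]:
    print(form, "->", to_surface(form))
```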

Porter Stemmer (1980)
Used for tasks in which you only care about the stem
  IR, modeling given/new distinction, topic detection, document similarity
Lexicon-free morphological analysis
Cascades rewrite rules (e.g. misunderstanding → misunderstand → understand → …)
Easily implemented as an FST with rules, e.g.
  ATIONAL → ATE
  ING → ε
Not perfect…
  Doing → doe
  Policy → police
Does stemming help?
  IR, little
  Topic detection, more
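
A cascade of suffix rewrite rules can be sketched in a few lines. This is not the Porter algorithm itself: it uses only the two rules named above and omits Porter's conditions on the remaining stem, so it over-strips freely.

```python
# Two rules from the slide, applied in order; everything else is omitted.
RULES = [("ational", "ate"), ("ing", "")]

def strip_suffixes(word: str) -> str:
    """Apply each rewrite rule once, in sequence (a lexicon-free cascade)."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: len(word) - len(suffix)] + replacement
    return word

for w in ["relational", "walking", "misunderstanding"]:
    print(w, "->", strip_suffixes(w))
```

Because no lexicon is consulted, nothing stops a rule from firing on a word where the suffix is part of the stem, which is the source of errors like the ones listed above.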

Summing Up
FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo’s two-level morphology
But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
Next time: Read Ch 3.10-11, 3.13 (new version)