CS60057 Speech & Natural Language Processing, Autumn 2005. Lecture 4, 28 July 2005.

Finite-state automata and Morphology


Morphology
Study of the rules that govern the combination of morphemes.
 Inflection: same word, different syntactic information (run/runs/running, book/books)
 Derivation: new word, different meaning; often a different part of speech, but not always (possible/possibly/impossible, happy/happiness)
 Compounding: new word in which each part is itself a word (blackboard, bookshelf; lAlakamala, banabAsa)

Morphology Level: The Mapping
Formally: A+ -> 2^(L, C1, C2, ..., Cn)
 A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes)
 L is the set of possible lemmas, uniquely identified
 Ci are morphological categories, such as:
   grammatical number, gender, case
   person, tense, negation, degree of comparison, voice, aspect, ...
   tone, politeness, ...
   part of speech (not quite a morphological category, but ...)
 A, L and the Ci are language-dependent

Bengali/Hindi Inflectional Morphology
Some languages encode in morphology what other languages encode in syntax.
Some of the inflectional suffixes that nouns can take:
 singular/plural
 gender
 possessive markers
 case markers (the different karakas)
Inflectional suffixes that verbs can take:
 Hindi: tense, aspect, modality, person, gender, number
 Bengali: tense, aspect, modality, person
There is a fixed order among inflectional suffixes (morphotactics):
 Chhelederke
 baigulokei

Bengali/Hindi Derivational Morphology
Derivational morphology is very rich.

English Inflectional Morphology
Nouns have simple inflectional morphology:
 plural: cat / cats
 possessive: John / John's
Verbs have slightly more complex, but still relatively simple, inflectional morphology:
 past form: walk / walked
 past participle form: walk / walked
 gerund: walk / walking
 third person singular: walk / walks
Verbs can be categorized as:
 main verbs
 modal verbs: can, will, should
 primary verbs: be, have, do
Regular and irregular verbs: walk / walked vs. go / went

Regulars and Irregulars
Some words misbehave (refuse to follow the rules):
 mouse/mice, goose/geese, ox/oxen
 go/went, fly/flew
The terms regular and irregular refer to words that follow the rules and those that do not.

Regular and Irregular Verbs
Regulars:
 walk, walks, walking, walked, walked
Irregulars:
 eat, eats, eating, ate, eaten
 catch, catches, catching, caught, caught
 cut, cuts, cutting, cut, cut

Derivational Morphology
 Quasi-systematicity
 Irregular meaning change
 Changes of word class
Some English derivational affixes:
 -ation: transport / transportation
 -er: kill / killer
 -ness: fuzzy / fuzziness
 -al: computation / computational
 -able: break / breakable
 -less: help / helpless
 un-: do / undo
 re-: try / retry
These can stack, as in renationalizationability.

Derivational Examples
Verb/Adj to Noun:
 -ation: computerize -> computerization
 -ee: appoint -> appointee
 -er: kill -> killer
 -ness: fuzzy -> fuzziness

Derivational Examples
Noun/Verb to Adj:
 -al: computation -> computational
 -able: embrace -> embraceable
 -less: clue -> clueless

Compute
Many derivational paths are possible. Start with compute:
 compute -> computer -> computerize -> computerization
 compute -> computation -> computational
 compute -> computer -> computerize -> computerizable
 compute -> computee

Parts of a Morphological Processor
A morphological processor needs at least the following:
 Lexicon: the list of stems and affixes, together with basic information about them, such as their main categories (noun, verb, adjective, ...) and their sub-categories (regular noun, irregular noun, ...).
 Morphotactics: the model of morpheme ordering that specifies which classes of morphemes can follow which other classes of morphemes inside a word.
 Orthographic rules (spelling rules): rules used to model the changes that occur in a word, normally when two morphemes combine.

Lexicon
A lexicon is a repository for words (stems). They are grouped according to their main categories:
 noun, verb, adjective, adverb, ...
They may also be divided into sub-categories:
 regular nouns, irregular-singular nouns, irregular-plural nouns, ...
The simplest way to create a morphological parser would be to put every possible word (together with all its inflections) into the lexicon.
 We do not do this because the number of forms is huge (for Turkish it is, in theory, infinite).

Morphotactics
Morphotactics specifies which morphemes can follow which morphemes.
Lexicon:
 reg-noun: fox, cat, dog
 irreg-pl-noun: geese, sheep, mice
 irreg-sg-noun: goose, sheep, mouse
 plural: -s
Simple English nominal inflection (morphotactic rules):
 reg-noun + plural (-s)
 irreg-sg-noun
 irreg-pl-noun
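The nominal-inflection morphotactics can be sketched as a small finite-state acceptor over morpheme classes. This is an illustrative Python sketch, not from the lecture: the state names (q0, q1, q2) and the accepts helper are invented; the lexicon is the one in the slide.

```python
# Lexicon: surface forms grouped by morpheme class (from the slide).
LEXICON = {
    "reg-noun": {"fox", "cat", "dog"},
    "irreg-sg-noun": {"goose", "sheep", "mouse"},
    "irreg-pl-noun": {"geese", "sheep", "mice"},
    "plural": {"s"},
}

# Morphotactics over classes: q0 --reg-noun--> q1 --plural--> q2,
# q0 --irreg-sg-noun--> q2, q0 --irreg-pl-noun--> q2.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",
}
ACCEPT = {"q1", "q2"}  # a bare regular noun ("cat") is also a word

def accepts(morphemes):
    """True if the sequence of (class, form) pairs is a valid word."""
    state = "q0"
    for cls, form in morphemes:
        if form not in LEXICON[cls]:
            return False
        nxt = TRANSITIONS.get((state, cls))
        if nxt is None:
            return False
        state = nxt
    return state in ACCEPT

print(accepts([("reg-noun", "cat"), ("plural", "s")]))           # cats -> True
print(accepts([("irreg-pl-noun", "geese"), ("plural", "s")]))    # *geeses -> False
```

Because the machine works over morpheme classes rather than letters, it rejects *geeses while still accepting both bare nouns and regular plurals.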

Combine Lexicon and Morphotactics
[FSA diagram: the lexicon entries (fox, cat, dog, sheep, goose/geese, mouse/mice) expanded letter by letter along the arcs of the nominal-inflection FSA, with a final -s arc for regular plurals]
This FSA only says yes or no; it does not give a lexical representation. It also accepts a wrong word (*foxs).

FSAs and the Lexicon
This will actually require a new kind of FSA: the finite state transducer (FST).
We will give a quick overview.
First we will capture the morphotactics:
 the rules governing the ordering of affixes in a language.
Then we will add in the actual words.

Simple Rules

Adding the Words

Derivational Rules

Parsing/Generation vs. Recognition
Recognition is usually not quite what we need:
 usually, if we find some string in the language, we also need to find the structure in it (parsing);
 or we have some structure and want to produce a surface form (production/generation).
Example: from "cats" to "cat +N +PL".

Why care about morphology?
Stemming in information retrieval:
 we might want to search for "aardvark" and find pages with both "aardvark" and "aardvarks".
Morphology in machine translation:
 we need to know that the Spanish words quiero and quieres are both related to querer 'want'.
Morphology in spell checking:
 we need to know that misclam and antiundoggingly are not words, despite being made up of word parts.

Can't Just List All Words
Turkish word: Uygarlastiramadiklarimizdanmissinizcasina
'(behaving) as if you are among those whom we could not civilize'
 uygar 'civilized' +
 las 'become' +
 tir 'cause' +
 ama 'not able' +
 dik 'past' +
 lar 'plural' +
 imiz 'p1pl' +
 dan 'abl' +
 mis 'past' +
 siniz '2pl' +
 casina 'as if'

Finite State Transducers
The simple story:
 add another tape;
 add extra symbols to the transitions;
 on one tape we read "cats", on the other we write "cat +N +PL".

Transitions
 c:c means read a c on one tape and write a c on the other
 +N:ε means read a +N symbol on one tape and write nothing on the other
 +PL:s means read +PL and write an s
Arc labels for "cats": c:c a:a t:t +N:ε +PL:s
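To make the tape-pair notation concrete, here is a toy Python sketch of this single path (c:c a:a t:t +N:ε +PL:s), run in both directions. The generate/parse names and the list-of-pairs encoding are illustrative assumptions, not the lecture's notation.

```python
EPSILON = ""
# One linear path of the transducer: (lexical symbol, surface symbol) pairs.
PATH = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPSILON), ("+PL", "s")]

def generate(lexical, path=PATH):
    """Map a lexical-tape symbol sequence to a surface string."""
    out, i = [], 0
    for inp, outp in path:
        if i >= len(lexical) or lexical[i] != inp:
            return None          # the path does not match the input tape
        out.append(outp)
        i += 1
    return "".join(out) if i == len(lexical) else None

def parse(surface, path=PATH):
    """Run the same transducer in the other direction (inversion)."""
    inverted = [(o, i) for i, o in path]   # swap the tape labels
    lex, j = [], 0
    for inp, outp in inverted:
        if inp == EPSILON:                 # ε on the surface side: read nothing
            lex.append(outp)
        elif j < len(surface) and surface[j] == inp:
            lex.append(outp)
            j += 1
        else:
            return None
    return lex if j == len(surface) else None

print(generate(["c", "a", "t", "+N", "+PL"]))  # -> "cats"
print(parse("cats"))  # -> ['c', 'a', 't', '+N', '+PL']
```

The same transition list serves both directions; only the roles of the two tapes are swapped.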

Lexical to Intermediate Level

FST Properties
FSTs are closed under union, inversion, and composition.
 Union: the union of two regular relations is also a regular relation.
 Inversion: the inversion of an FST simply switches the input and output labels. This means that the same FST can be used in both directions of a morphological processor.
 Composition: if T1 is an FST from I1 to O1 and T2 is an FST from O1 to O2, then the composition of T1 and T2 (T1 o T2) maps from I1 to O2.
We use these properties of FSTs in the construction of the FST for a morphological processor.
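Inversion in particular is trivial to state in code: swap the labels on every arc. A minimal sketch, assuming (as an illustration) that transitions are stored as a dict keyed by (state, input symbol, output symbol):

```python
def invert(transitions):
    """Swap input and output labels on every arc of an FST.

    transitions: {(state, in_sym, out_sym): next_state}
    """
    return {(q, o, i): nxt for (q, i, o), nxt in transitions.items()}

# The +PL:s arc from the "cats" example, inverted to s:+PL.
T = {("q0", "+PL", "s"): "q1"}
print(invert(T))  # {('q0', 's', '+PL'): 'q1'}
```

This is why one machine suffices for both parsing and generation.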

An FST for Simple English Nominals
[FST diagram: stem arcs for reg-noun, irreg-sg-noun and irreg-pl-noun, followed by +N:ε; regular nouns then take +SG:# or +PL:^s#, irregular singulars take +SG:#, and irregular plurals take +PL:#]

FST for Stems
An FST for stems maps roots to their root class:
 reg-noun      irreg-pl-noun       irreg-sg-noun
 fox           g o:e o:e s e       goose
 cat           sheep               sheep
 dog           m o:i u:ε s:c e     mouse
Here "fox" stands for f:f o:o x:x.
When these two transducers are composed, we have an FST that maps lexical forms to intermediate forms of words for simple English noun inflection. The next step is to design the FSTs for the orthographic rules and combine all of these transducers.

Multi-Level Multi-Tape Machines
A frequently used FST idiom, called a cascade, is to have the output of one FST read in as the input to a subsequent machine.
So, to handle spelling we use three tapes:
 lexical, intermediate and surface
We need one transducer to work between the lexical and intermediate levels, and a second (a bunch of FSTs) to work between the intermediate and surface levels to patch up the spelling.
Example:
 lexical:      dog +N +PL
 intermediate: dog^s#
 surface:      dogs

Lexical to Intermediate FST

Orthographic Rules
We need FSTs to map the intermediate level to the surface level. For each spelling rule we have one FST, and these FSTs run in parallel.
Some English spelling rules:
 consonant doubling: a single-letter consonant is doubled before -ing/-ed (beg/begging)
 e deletion: silent e is dropped before -ing and -ed (make/making)
 e insertion: e is added after s, z, x, ch, sh before -s (watch/watches)
 y replacement: y changes to ie before -s, and to i before -ed (try/tries)
 k insertion: for verbs ending in vowel + c, we add k (panic/panicked)
We represent these rules using two-level morphology rules:
 a => b / c __ d  (rewrite a as b when it occurs between c and d)
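As an illustration of the a => b / c __ d format, the e-insertion rule can be approximated with a regular-expression rewrite over the intermediate tape. This is a sketch only; a real system would compile the rule into an FST as on the next slide. The function names are invented, and the ^ and # boundary symbols follow the slides.

```python
import re

def e_insertion(intermediate):
    """Insert e between a morpheme boundary after x/s/z and a final -s.

    Approximates the two-level rule: ε => e / {x,s,z} ^ __ s #
    """
    return re.sub(r"([xsz])\^(s#)", r"\1^e\2", intermediate)

def to_surface(intermediate):
    """Apply the spelling rule, then erase the boundary symbols ^ and #."""
    s = e_insertion(intermediate)
    return s.replace("^", "").replace("#", "")

print(to_surface("fox^s#"))   # -> "foxes"
print(to_surface("cat^s#"))   # -> "cats" (rule does not apply)
```

The rule fires only in its stated context: "cat^s#" passes through unchanged.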

FST for the E-Insertion Rule
E-insertion rule: ε => e / {x,s,z} ^ __ s #

Generating or Parsing with FST Lexicon and Rules

Accepting Foxes

Intersection
We can intersect all the rule FSTs to create a single FST.
The intersection algorithm takes the Cartesian product of the states:
 for each state qi of the first machine and qj of the second machine, we create a new state qij;
 for input symbol a, if the first machine would transition to state qn and the second machine would transition to qm, the new machine transitions to qnm.
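A minimal sketch of this product construction for two deterministic acceptors, with an invented machine encoding (start state, accepting set, transition dict):

```python
def intersect(dfa1, dfa2):
    """Product construction: state (q1, q2) for every pair of states."""
    start1, acc1, delta1 = dfa1
    start2, acc2, delta2 = dfa2
    delta = {}
    for (q1, a), qn in delta1.items():
        for (q2, b), qm in delta2.items():
            if a == b:                       # same input symbol
                delta[((q1, q2), a)] = (qn, qm)
    accept = {(x, y) for x in acc1 for y in acc2}
    return ((start1, start2), accept, delta)

def run(dfa, s):
    start, accept, delta = dfa
    state = start
    for ch in s:
        state = delta.get((state, ch))
        if state is None:
            return False
    return state in accept

# Machine A: strings over {a,b} with an even number of a's.
A = ("e", {"e"}, {("e", "a"): "o", ("o", "a"): "e",
                  ("e", "b"): "e", ("o", "b"): "o"})
# Machine B: strings ending in b.
B = ("0", {"1"}, {("0", "a"): "0", ("1", "a"): "0",
                  ("0", "b"): "1", ("1", "b"): "1"})

AB = intersect(A, B)
print(run(AB, "aab"))  # even a's AND ends in b -> True
print(run(AB, "ab"))   # odd a's -> False
```

The product machine accepts exactly the strings accepted by both inputs, which is how the parallel rule FSTs become one machine.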

Composition
A cascade can turn out to be somewhat painful:
 it is hard to manage all the tapes;
 it fails to take advantage of the restricting power of the machines.
So it is better to compile the cascade into a single large machine.
Create a new state (x,y) for every pair of states x ∈ Q1 and y ∈ Q2. The transition function of the composition is defined as follows:
 δ((x,y), i:o) = (v,z) if there exists c such that δ1(x, i:c) = v and δ2(y, c:o) = z
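The transition rule above translates almost line by line into code for ε-free transducers; the dict encoding {(state, in_sym, out_sym): next_state} is an illustrative assumption.

```python
def compose(t1, t2):
    """delta((x,y), i:o) = (v,z) iff there exists c with
    delta1(x, i:c) = v and delta2(y, c:o) = z.
    Assumes both machines are epsilon-free."""
    delta = {}
    for (x, i, c), v in t1.items():
        for (y, c2, o), z in t2.items():
            if c == c2:                       # middle symbol c must match
                delta[((x, y), i, o)] = (v, z)
    return delta

# T1 maps a -> b; T2 maps b -> c (single-state machines).
T1 = {("p", "a", "b"): "p"}
T2 = {("q", "b", "c"): "q"}
T = compose(T1, T2)
print(T)  # {(('p', 'q'), 'a', 'c'): ('p', 'q')} -- T1 o T2 maps a -> c
```

The intermediate symbol c disappears from the composed machine, just as the intermediate tape disappears from the compiled cascade.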

Intersect Rule FSTs
[Diagram: the lexical tape feeds LEXICON-FST, producing the intermediate tape; the parallel rule FSTs FST1 ... FSTn map the intermediate tape to the surface tape, and are intersected into a single machine FSTR = FST1 ^ ... ^ FSTn]

Compose Lexicon and Rule FSTs
[Diagram: LEXICON-FST (lexical tape to intermediate tape) is composed with FSTR = FST1 ^ ... ^ FSTn (intermediate tape to surface tape), yielding a single machine LEXICON-FST o FSTR that maps the lexical tape directly to the surface level]

Porter Stemming
Some applications (e.g., some information retrieval applications) do not need a whole morphological processor; they only need the stem of the word.
A stemming algorithm such as the Porter stemmer is a lexicon-free FST: it is just a cascade of rewrite rules.
Stemming algorithms are efficient, but they may introduce errors because they do not use a lexicon.
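In the same spirit, a lexicon-free stemmer can be sketched as a cascade of suffix rewrite rules. The rules below are a toy subset for illustration, not the actual Porter rule set, and the last example shows the kind of error a lexicon-free approach makes.

```python
import re

# A small ordered cascade of suffix rewrite rules (toy subset).
RULES = [
    (r"sses$", "ss"),   # caresses -> caress
    (r"ies$",  "i"),    # ponies   -> poni
    (r"ing$",  ""),     # running  -> runn
    (r"ed$",   ""),     # plastered -> plaster
    (r"s$",    ""),     # cats     -> cat
]

def stem(word):
    """Apply the first matching rule in the cascade; no lexicon involved."""
    for pattern, replacement in RULES:
        new = re.sub(pattern, replacement, word)
        if new != word:
            return new
    return word

print(stem("caresses"))  # -> "caress"
print(stem("cats"))      # -> "cat"
print(stem("running"))   # -> "runn" (a typical lexicon-free error)
```

Without a lexicon, nothing tells the stemmer that "runn" should be "run"; that trade-off is exactly the efficiency-vs-accuracy point made above.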