Morphology Reading: Chap 3, Jurafsky & Martin Instructor: Paul Tarau, based on Rada Mihalcea’s original slides Note: Some of the material in this slide.

Slides:



Advertisements
Similar presentations
1 Minimally Supervised Morphological Analysis by Multimodal Alignment David Yarowsky and Richard Wicentowski.
Advertisements

Text Processing. Slide 1 Simple Tokenization Analyze text into a sequence of discrete tokens (words). Sometimes punctuation ( ), numbers (1999),
Information Retrieval and Web Search Text processing Instructor: Rada Mihalcea Class web page:
Jing-Shin Chang1 Morphology & Finite-State Transducers Morphology: the study of constituents of words Word = {a set of morphemes, combined in language-dependent.
CS Morphological Parsing CS Parsing Taking a surface input and analyzing its components and underlying structure Morphological parsing:
Natural Language Processing Lecture 3—9/3/2013 Jim Martin.
Computational Morphology. Morphology S.Ananiadou2 Outline What is morphology? –Word structure –Types of morphological operation – Levels of affixation.
Morphology.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
 Christel Kemke 1 Morphology COMP 4060 Natural Language Processing Morphology, Word Classes, POS Tagging.
Morphology Nuha Alwadaani.
Morphology Chapter 7 Prepared by Alaa Al Mohammadi.
Brief introduction to morphology
Natural Language Processing - English Grammar -
Lecture -3 Week 3 Introduction to Linguistics – Level-5 MORPHOLOGY
LIN3022 Natural Language Processing Lecture 3 Albert Gatt 1LIN3022 Natural Language Processing.
Stemming, tagging and chunking Text analysis short of parsing.
1 Morphological analysis LING 570 Fei Xia Week 4: 10/15/07 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A.
Learning Bit by Bit Class 3 – Stemming and Tokenization.
Morphological analysis
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
Introduction to English Morphology Finite State Transducers
Morphology and Finite-State Transducers. Why this chapter? Hunting for singular or plural of the word ‘woodchunks’ was easy, isn’t it? Lets consider words.
Chapter 4 Morphology. Morphology. This term, which literally means ‘the study of forms’ refers to the linguistic study of the different forms of a word,
Morphology (CS ) By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya.
English Lexicology Morphological Structure of English Words Week 3: Mar. 10, 2009 Instructor: Liu Hongyong.
Lemmatization Tagging LELA /20 Lemmatization Basic form of annotation involving identification of underlying lemmas (lexemes) of the words in.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture4 1 August 2007.
Introduction Morphology is the study of the way words are built from smaller units: morphemes un-believe-able-ly Two broad classes of morphemes: stems.
LING 388: Language and Computers Sandiway Fong Lecture 22: 11/10.
Phonemes A phoneme is the smallest phonetic unit in a language that is capable of conveying a distinction in meaning. These units are identified within.
Session 11 Morphology and Finite State Transducers Introduction to Speech Natural and Language Processing (KOM422 ) Credits: 3(3-0)
Reasons to Study Lexicography  You love words  It can help you evaluate dictionaries  It might make you more sensitive to what dictionaries have in.
Morphology A Closer Look at Words By: Shaswar Kamal Mahmud.
Morphological Analysis Chapter 3. Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using.
Chapter III morphology by WJQ. Morphology Morphology refers to the study of the internal structure of words, and the rules by which words are formed.
M ORPHOLOGY Lecturer/ Najla AlQahtani. W HAT IS MORPHOLOGY ? It is the study of the basic forms in a language. A morpheme is “a minimal unit of meaning.
Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
Natural Language Processing Chapter 2 : Morphology.
MORPHOLOGY. Morphology The study of internal structure of words, and of the rules by which words are formed.
III. MORPHOLOGY. III. Morphology 1. Morphology The study of the internal structure of words and the rules by which words are formed. 1.1 Open classes.
October 2004CSA3050 NLP Algorithms1 CSA3050: Natural Language Algorithms Morphological Parsing.
MORPHOLOGY : THE STRUCTURE OF WORDS. MORPHOLOGY Morphology deals with the syntax of complex words and parts of words, also called morphemes, as well as.
Chapter 3 Word Formation I This chapter aims to analyze the morphological structures of words and gain a working knowledge of the different word forming.
Tasneem Ghnaimat. Language Model An abstract representation of a (natural) language. An approximation to real language Assume we have a set of sentences,
Morphology 1 : the Morpheme
King Faisal University [ ] 1 E-learning and Distance Education Deanship Department of English Language College of Arts King Faisal University Introduction.
Two Level Morphology Alexander Fraser & Liane Guillou CIS, Ludwig-Maximilians-Universität München Computational Morphology.
Basic Text Processing: Morphology Word Stemming
CIS, Ludwig-Maximilians-Universität München Computational Morphology
Lecture 7 Summary Survey of English morphology
Morphology Morphology Morphology Dr. Amal AlSaikhan Morphology.
ENGLISH MORPHOLOGY Week 4.
Lecture -3 Week 3 Introduction to Linguistics – Level-5 MORPHOLOGY
Introduction to Linguistics
Speech and Language Processing
عمادة التعلم الإلكتروني والتعليم عن بعد
Chapter 3 Morphology Without grammar, little can be conveyed. Without vocabulary, nothing can be conveyed. (David Wilkins ,1972) Morphology refers to.
Chapter 6 Morphology.
Morphology.
Speech and Language Processing
CSCI 5832 Natural Language Processing
Token generation - stemming
By Mugdha Bapat Under the guidance of Prof. Pushpak Bhattacharyya
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Basic Text Processing Word tokenization.
Morphological Parsing
Introduction to English morphology
Basic Text Processing: Morphology Word Stemming
Presentation transcript:

Morphology Reading: Chap 3, Jurafsky & Martin Instructor: Paul Tarau, based on Rada Mihalcea’s original slides Note: Some of the material in this slide set was adapted from Christel Kemke (U. Manitoba) slides on morphology

Slide 1 Morphology Morpheme = "minimal meaning-bearing unit in a language" Morphology handles the formation of words by using morphemes –base form (stem), e.g., believe –affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly Morphological parsing = the task of recognizing the morphemes inside a word –e.g., hands, foxes, children Important for many tasks –machine translation –information retrieval –lexicography –any further processing (e.g., part-of-speech tagging)

Slide 2 Morphemes and Words Combine morphemes to create words Inflection combination of a word stem with a grammatical morpheme same word class, e.g. clean (verb), clean-ing (verb) Derivation combination of a word stem with a grammatical morpheme Yields different word class, e.g. clean (verb), clean-ing (noun) Compounding combination of multiple word stems Cliticization combination of a word stem with a clitic different words from different syntactic categories, e.g. I’ve = I + have

Slide 3 Inflectional Morphology word stem + grammatical morphemecat + s only for nouns, verbs, and some adjectives Nouns plural: regular: +s, +es irregular: mouse - mice; ox - oxen rules for exceptions: e.g. -y -> -ieslike: butterfly - butterflies possessive: +'s, +' Verbs main verbs (sleep, eat, walk) modal verbs (can, will, should) primary verbs (be, have, do)

Slide 4 Inflectional Morphology (verbs) Verb Inflections for: main verbs (sleep, eat, walk); primary verbs (be, have, do) Morpholog. FormRegularly Inflected Form stemwalkmerge trymap -s formwalksmerges triesmaps -ing participlewalkingmerging tryingmapping past; -ed participlewalkedmerged triedmapped Morph. FormIrregularly Inflected Form stemeatcatch cut -s formeatscatches cuts -ing participleeatingcatching cutting -ed pastatecaught cut -ed participle eatencaught cut

Slide 5 Noun Inflections for: regular nouns (cat, hand); irregular nouns(child, ox) Morpholog. FormRegularly Inflected Form stemcathand plural formcatshands Morph. FormIrregularly Inflected Form stemchildox plural formchildrenoxen Inflectional Morphology (nouns)

Slide 6 Inflectional and Derivational Morphology (adjectives) Adjective Inflections and Derivations: prefixun-unhappyadjective, negation suffix-lyhappilyadverb, mode -erhappieradjective, comparative 1 -esthappiestadjective, comparative 2 suffix-nesshappinessnoun plus combinations, like unhappiest, unhappiness. Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.

Slide 7 Derivational Morphology (nouns)

Slide 8 Derivational Morphology (adjectives)

Slide 9 Verb Clitics

Methods, Algorithms

Slide 11 Stemming Stemming algorithms strip off word affixes yield stem only, no additional information (like plural, 3 rd person etc.) used, e.g. in web search engines famous stemming algorithm: the Porter stemmer

Slide 12 Stemming Reduce tokens to “root” form of words to recognize morphological variation. “computer”, “computational”, “computation” all reduced to same token “compute” Correct morphological analysis is language specific and can be complex. Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion. for example compressed and compression are both accepted as equivalent to compress. for exampl compres and compres are both accept as equival to compres.

Slide 13 Porter Stemmer Simple procedure for removing known affixes in English without using a dictionary. Can produce unusual stems that are not English words: “computer”, “computational”, “computation” all reduced to same token “comput” May conflate (reduce to the same token) words that are actually distinct. Does not recognize all morphological derivations Typical rules in Porter stemmer sses  ss ies  i ational  ate tional  tion ing → 

Slide 14 Stemming Problems Errors of ComissionErrors of Omission organizationorganEuropeanEurope doingdoeanalysisanalyzes GeneralizationGenericMatricesmatrix NumericalnumerousNoisenoisy Policypolicesparsesparsity

Slide 15 Tokenization, Word Segmentation Tokenization or word segmentation separate out “words” (lexical entries) from running text expand abbreviated terms E.g. I’m into I am, it’s into it is collect tokens forming single lexical entry E.g. New York marked as one single entry More of an issue in languages like Chinese

Slide 16 Simple Tokenization Analyze text into a sequence of discrete tokens (words). Sometimes punctuation ( ), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token. However, frequently they are not. Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens. More careful approach: Separate ? ! ; : “ ‘ [ ] ( ) Care with. - why? when? Care with … ??

Slide 17 Punctuation Children’s: use language-specific mappings to normalize (e.g. Anglo- Saxon genitive of nouns, verb contractions: won’t -> wo ‘nt) State-of-the-art: break up hyphenated sequence. U.S.A. vs. USA a.out

Slide 18 Numbers 3/12/91 Mar. 12, B.C. B Generally, don’t index as text Creation dates for docs

Slide 19 Lemmatization Reduce inflectional/derivational forms to base form Direct impact on vocabulary size E.g., am, are, is  be car, cars, car's, cars'  car the boy's cars are different colors  the boy car be different color How to do this? Need a list of grammatical rules + a list of irregular words Children  child, spoken  speak … Practical implementation: use WordNet’s morphstr function Perl: WordNet::QueryData (first returned value from validForms function)

Slide 20 Morphological Processing Knowledge lexical entry: stem plus possible prefixes, suffixes plus word classes, e.g. endings for verb forms (see tables above) rules: how to combine stem and affixes, e.g. add s to form plural of noun as in dogs orthographic rules: spelling, e.g. double consonant as in mapping Processing: Finite State Transducers take information above and analyze word token / generate word form

Slide 21 Fig. 3.3FSA for verb inflection.

Slide 22 Fig. 3.5More detailed FSA for adjective inflection. Fig. 3.4Simple FSA for adjective inflection.

Slide 23 Fig. 3.7 Compiled FSA for noun inflection.