CSC 594 Topics in AI – Natural Language Processing


CSC 594 Topics in AI – Natural Language Processing Spring 2016/17 2. Linguistic Essentials (Some slides adapted from Ralph Grishman at NYU, Joyce Choi at Michigan State and Andrew McCallum, UMass Amherst)

Levels of Language Analysis
Phonology: study of the sound systems of languages
Morphology: study of the structure of words in a language, including patterns of inflection and derivation
Syntax: study of the organization of words in sentences, i.e. the ordering of and relationships between the words in phrases and sentences
Semantics: study of meaning in language and of how that meaning is created
Pragmatics: study of language in use, i.e. the branch of linguistics that studies language use rather than language structure
Discourse: study of language, especially the type of language used in a particular context or subject
World/Common-sense Knowledge

Parts of Speech
There are eight major parts of speech for words in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction and interjection. The part of speech indicates how a word functions in meaning as well as grammatically within a sentence.
Noun: names people, animals, concepts, things (e.g. “birds”)
Pronoun: a word used in place of a noun (e.g. “it”, “they”, “I”, “she”)
Verb: expresses action in the sentence (e.g. “sing”)
Adjective: describes properties of nouns (e.g. “yellow”)
Adverb: modifies or describes a verb, an adjective, or another adverb (e.g. “extremely”, “slowly”)
Preposition: a word placed before a noun/pronoun to form a phrase modifying another word/phrase (e.g. “in”, “for”, “without”)
Conjunction: a word that connects/conjoins sentences or phrases (e.g. “and”, “or”, “but”, “so”, “since”)
Interjection: a word that expresses spontaneous feeling (e.g. “uh”)

Nouns
Nouns can form a plural and/or a possessive: cat -> cats, cat’s
Countable nouns vs. mass nouns
Singular countable nouns must appear with a determiner (mainly articles (“a”, “the”) and possessive pronouns (“my”, “his”)):
Cats sleep.
* Cat sleeps. (* indicates this is not a grammatical sentence)
The cat sleeps.
Ralph Grishman at NYU

Verbs
Most verbs can appear in “They must _____ (it).”
Verbs can occur in different (inflected) forms:
base or infinitive ("be", "eat", "sleep")
present tense ("is", "am", "are"; "eats", "eat"; "sleeps", "sleep")
past tense ("was", "were"; "ate"; "slept")
present participle ("being"; "eating"; "sleeping")
past participle ("been"; "eaten"; "slept")
Ralph Grishman at NYU

Adjectives
Adjectives can appear in comparative or superlative forms: happy -> happier, happiest
and with an intensifier: happy -> very happy
Ralph Grishman at NYU

Adjectives vs. nouns
We will not consider a word an adjective just because it appears as a modifier to the left of a noun, as in “the brick wall”: most nouns can appear in this position.
Ralph Grishman at NYU

Adverbs Can move within sentence: He ate the brownie quickly. He quickly ate the brownie. Quickly, he ate the brownie. Ralph Grishman at NYU

Personal Pronouns
Personal pronouns occur in nominative (“I”, “he”) and accusative (“me”, “him”) forms. This is the last remaining evidence of case in English.
Ralph Grishman at NYU

Quiz!
Tag every word in the following sentences with one of the eight basic parts of speech.
“The student put books on the table.”
“We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”

Morphology
The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)
Two broad classes of morphemes:
Stems: the “main” morpheme of the word, supplying its meaning
Affixes: bits and pieces that combine with stems to modify their meanings and grammatical functions (prefixes, suffixes, circumfixes, infixes)
Examples: unlike (prefix), trying (suffix); multiple affixes: unreadable
Source: Joyce Choi, CSE 842, Michigan State University

Ways to Form Words
Inflection: new forms of the same word (usually in the same class)
Tense, number, mood, voice marking in verbs
Number, gender marking in nominals
Comparison of adjectives
Derivation: yields different words in a different class
Deverbal nominals
Denominal adjectives and verbs
Compounding: new words out of two or more other words
Noun-noun compounding (e.g., doghouse)
Cliticization: combines a word with a clitic (which acts syntactically like a word but in a reduced form, e.g., I’ve)
Source: Joyce Choi, CSE 842, Michigan State University

English Inflectional Morphology Word stem combines with grammatical morpheme Usually produces word of same class Usually serves a grammatical role that the stem could not (e.g. agreement) like -> likes or liked bird -> birds Nouns have a simple inflectional morphology: markers for plural and markers for possessives Verbs are slightly more complex: Source: Joyce Choi, CSE 842, Michigan State University

Nominal Inflection
Nominal morphology
Plural forms: -s or -es
Irregular forms, e.g., goose/geese, mouse/mice
Possessives: children’s
Source: Joyce Choi, CSE 842, Michigan State University
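The plural and possessive patterns above can be sketched in a few lines of Python. This is only an illustration: the function names, the small irregular table, and the spelling rules for -es and -ies are assumptions added here, not part of the original slides, and they cover far fewer cases than real English.

```python
# Toy sketch of English nominal inflection (illustrative rules only).
# The irregular table is a hypothetical, hand-picked sample.
IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "child": "children"}

def pluralize(noun: str) -> str:
    """Return a plural form: irregular lookup first, then -es / -ies / -s."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def possessive(form: str, is_plural: bool = False) -> str:
    """Plurals already ending in -s take a bare apostrophe; everything else takes 's."""
    if is_plural and form.endswith("s"):
        return form + "'"
    return form + "'s"

print(pluralize("cat"), pluralize("goose"), possessive("children", is_plural=True))
```

Checking the irregular table before applying the regular rules mirrors the usual design: exceptions are listed, and everything else is handled productively.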

Verbal Inflection
Main verbs (walk, like) are relatively regular: -s, -ing, -ed
And productive: emailed, instant-messaged, faxed
But: eat/ate/eaten, catch/caught/caught
Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
Be: am/is/are/were/was/been/being
Irregular verbs are few (~250) but frequently occurring
English verbal inflection is much simpler than e.g. Latin
Source: Joyce Choi, CSE 842, Michigan State University


English Derivational Morphology
Word stem combines with grammatical morpheme
Usually produces a word of a different class
More complicated than inflectional morphology
Example: nominalization, -ize verbs -> -ation nouns
generalize, realize -> generalization, realization
Example: verbs, nouns -> adjectives
embrace, pity -> embraceable, pitiable
care, wit -> careless, witless
Source: Joyce Choi, CSE 842, Michigan State University

Example: adjective -> adverb
happy -> happily
More complicated to model than inflection
Less productive: *science-less, *concern-less, *go-able, *sleep-able
Meanings of derived terms are harder to predict by rule
Source: Joyce Choi, CSE 842, Michigan State University

Morphological Analysis Tools
E.g. Porter Stemmer
A simple approach: just hack off the end of the word!
Does NOT convert a word to its base form!
Frequently used in Information Retrieval, but results are pretty ugly!
Original:
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate . A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago ,
Results:
Rudolph Agnew , 55 year old and former chairman of Consolid Gold Field PLC , wa name a nonexecut director of thi British industri conglomer . A form of asbesto onc use to make Kent cigarett filter ha caus a high percentag of cancer death among a group of worker expos to it more than 30 year ago ,
Source: Marti Hearst, i256, at UC Berkeley
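To see where truncations such as “wa”, “thi” and “ha” come from, here is a minimal sketch of only the first rule group of the Porter algorithm (step 1a, plural removal). The function name is an assumption made for this example, and the full stemmer has several further steps that are not shown.

```python
def porter_step_1a(word: str) -> str:
    """Apply only Porter step 1a: the first matching suffix rule wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress (unchanged)
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

for w in ["years", "was", "this", "has", "asbestos"]:
    print(w, "->", porter_step_1a(w))
# years -> year
# was -> wa
# this -> thi
# has -> ha
# asbestos -> asbesto
```

The rule has no notion of word class or meaning, so a final “s” is stripped from “was” just as readily as from “years”.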

Stemming vs. Lemmatization
The purpose of both stemming and lemmatization is to reduce morphological variation.
Stemming reduces word-forms to (pseudo)stems, whereas lemmatization reduces the word-forms to linguistically valid lemmas (morphological stems).
Stemming: car, cars, car's, cars' => car
Lemmatizing: am, are, is => be ; drive, drives, drove, driven => drive
In a way, lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance.
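The contrast can be illustrated with two toy functions: a suffix-stripping stemmer and a dictionary-lookup lemmatizer. Both are sketches written for this example; the lookup table and the function names are hypothetical, and real lemmatizers rely on much larger lexicons and usually on POS information.

```python
# Hypothetical hand-built table: inflected form -> lemma.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "drives": "drive", "drove": "drive", "driven": "drive",
}

def crude_stem(word: str) -> str:
    """Strip a plural/possessive ending without checking the result is a word."""
    w = word.lower()
    for suffix in ("s'", "'s", "s"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w

def lemmatize(word: str) -> str:
    """Look the form up in the table; unknown forms are returned unchanged."""
    w = word.lower()
    return LEMMA_TABLE.get(w, w)

print([crude_stem(w) for w in ["car", "cars", "car's", "cars'"]])
print(crude_stem("is"), lemmatize("is"))
```

The stemmer maps all four forms of “car” to “car” but turns “is” into the pseudo-stem “i”, while the lemmatizer returns the valid lemma “be”.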

Is Stemming/Lemmatization Useful?
Both help reduce the size of the vocabulary. However, there are problems:
Stemming can conflate semantically different words, e.g. “gallery” and “gall” may both be stemmed to “gall”
Truncated stems can also be unintelligible to users
Lemmatization is better, but it only deals with inflectional variance (e.g. “go”, “went”, “gone” => “go”, but not “attend”/verb, “attendance”/noun)
Despite the problems, stemming is often done in Information Retrieval (IR) and Text Mining.

Quiz!
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs, would you agree, should NOT be conflated? Give your reasoning.
abandon / abandonment
marketing / markets
university / universe
volume / volumes
FYI: Porter Stemmer Online (http://9ol.es/porter_js_demo.html)
Introduction to Information Retrieval, C. Manning, P. Raghavan and H. Schutze, 2008

POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a sentence (and all sentences in a collection).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Source: Jurafsky & Martin “Speech and Language Processing”

Why is POS Tagging Useful?
First step of a vast number of practical tasks
Helps in stemming/lemmatization
Parsing: need to know if a word is an N or V before you can parse; parsers can build trees directly on the POS tags instead of maintaining a lexicon
Information Extraction: finding names, relations, etc.
Machine Translation
Selecting words of specific parts of speech (e.g. nouns) in pre-processing documents (for IR etc.)
Source: Jurafsky & Martin “Speech and Language Processing”

POS Tagging: Choosing a Tagset
To do POS tagging, we need to choose a standard set of tags to work with
Could pick a very coarse tagset: N, V, Adj, Adv
A more commonly used set is finer grained: the “Penn TreeBank tagset”, with 45 tags (e.g. PRP$, WRB, WP$, VBG)
Even more fine-grained tagsets exist
Source: Jurafsky & Martin “Speech and Language Processing”
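One way to relate the two granularities is to collapse Penn Treebank tags onto the coarse N/V/Adj/Adv set by tag prefix. The sketch below is an illustration only: the function name and the catch-all "Other" label are assumptions, and sending wh-adverbs (WRB) to "Other" rather than "Adv" is a simplification chosen for this example.

```python
def coarse_tag(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a coarse class using its prefix."""
    if penn_tag.startswith("NN"):   # NN, NNS, NNP, NNPS
        return "N"
    if penn_tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ
        return "V"
    if penn_tag.startswith("JJ"):   # JJ, JJR, JJS
        return "Adj"
    if penn_tag.startswith("RB"):   # RB, RBR, RBS
        return "Adv"
    return "Other"                  # everything else, e.g. PRP$, WP$, DT

print([coarse_tag(t) for t in ["VBG", "NNS", "JJ", "RB", "PRP$"]])
```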

Penn TreeBank POS Tagset

Difficulties with POS Tagging
Words often have more than one POS (ambiguity):
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
Another example of part-of-speech ambiguity: “Fed raises interest rates 0.5 % in effort to control inflation” (candidate tags shown on the slide: NNP NNS NNS NNS CD NN VBZ VBZ VBZ VB)
Source: Jurafsky & Martin “Speech and Language Processing”, Andrew McCallum, UMass Amherst

POS Tagging Techniques
Rule-based: hand-coded rules
Probabilistic/Stochastic: sequence (n-gram) models; machine learning
HMM (Hidden Markov Model)
MEMMs (Maximum Entropy Markov Models)
Transformation-based: rules + n-gram machine learning
Brill tagger
Source: Jurafsky & Martin “Speech and Language Processing”, Andrew McCallum, UMass Amherst
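As an illustration of the HMM approach listed above, the following is a minimal sketch of Viterbi decoding for a bigram HMM tagger. All probabilities are toy values invented for this example (they are not estimated from any corpus), and the function and variable names are assumptions; zero probabilities are replaced by a small floor so that logarithms are defined.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM (log space)."""
    def lp(p):
        return math.log(max(p, floor))

    # best[i][t]: best log-probability of any tag path ending in tag t at word i
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy parameters (hypothetical values chosen for illustration).
TAGS = ["Det", "N", "V", "Adj"]
START = {"Det": 0.6, "N": 0.3, "V": 0.05, "Adj": 0.05}
TRANS = {
    "Det": {"N": 0.7, "Adj": 0.3},
    "N": {"N": 0.3, "V": 0.6, "Adj": 0.1},
    "V": {"Det": 0.4, "N": 0.2, "Adj": 0.3, "V": 0.1},
    "Adj": {"N": 0.8, "Adj": 0.2},
}
EMIT = {
    "Det": {"the": 1.0},
    "N": {"lead": 0.4, "paint": 0.6},
    "V": {"lead": 0.3, "paint": 0.2, "is": 0.5},
    "Adj": {"unsafe": 1.0},
}

print(viterbi("the lead paint is unsafe".split(), TAGS, START, TRANS, EMIT))
# ['Det', 'N', 'N', 'V', 'Adj']
```

With these toy numbers, “lead” and “paint” could each be N or V, and the transition probabilities are what settle the choice in context.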

Current Performance
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Using a state-of-the-art automated method, how many tags are correct?
About 97% currently
But the baseline is already 90%
The baseline is the performance of the simplest possible method: tag every word with its most frequent tag, and tag unknown words as nouns
Source: Andrew McCallum, UMass Amherst
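The baseline described above is simple enough to sketch directly. The three-sentence training corpus below is a made-up toy example and the function names are assumptions; a real baseline would be trained on a tagged corpus such as the Penn Treebank.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Record, for each word, the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_baseline(words, model, unknown_tag="N"):
    """Tag known words with their most frequent tag and unknown words as nouns."""
    return [(w, model.get(w.lower(), unknown_tag)) for w in words]

# Hypothetical toy training data.
toy_corpus = [
    [("the", "Det"), ("lead", "N"), ("is", "V"), ("heavy", "Adj")],
    [("they", "Pro"), ("lead", "V"), ("the", "Det"), ("team", "N")],
    [("the", "Det"), ("lead", "N"), ("pipe", "N"), ("is", "V"), ("unsafe", "Adj")],
]
model = train_baseline(toy_corpus)
print(tag_baseline("the lead paint is unsafe".split(), model))
# [('the', 'Det'), ('lead', 'N'), ('paint', 'N'), ('is', 'V'), ('unsafe', 'Adj')]
```

Here “lead” is tagged N because that is its majority tag in the toy data, and “paint” is tagged N only because it is unknown; the method never looks at context.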

Quiz!
Find one tagging error in each of the following sentences, which are tagged with the Penn Treebank tagset.
I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN.
Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS ?
Tag each word in the following sentence with the Penn Treebank tagset.
“We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”