CSC 594 Topics in AI – Text Mining and Analytics, Fall 2015/16
2. Linguistic Essentials and Text Mining Preliminaries
The Description of Language
Language = Words and Rules: Dictionary (vocabulary) + Grammar
Dictionary
–the set of words defined in the language
–open (dynamic)
Grammar
–the set of rules that describe what is allowable in a language
Classic/empirical grammars
–definitions and rules are mainly supported by examples
–no (or almost no) formal description tools
Explicit/formal grammars (CFG, dependency grammars, etc.)
–formal description
–can be programmed & tested on data (texts)
Levels of Language Analysis
1. Phonology – the study of the sound systems of languages
2. Morphology – the study of the structure of words, including patterns of inflection and derivation
3. Syntax – the study of the organization of words in sentences: the ordering of, and relationships between, the words in phrases and sentences
4. Semantics – the study of meaning in language: how meaning in language is created
5. Pragmatics – the study of language in use rather than language structure
6. Discourse – the study of language beyond the sentence, especially the type of language used in a particular context or subject
7. World knowledge
Parts of Speech
There are eight basic parts of speech for words in the English language: noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. The part of speech indicates how the word functions in meaning as well as grammatically within a sentence.
1. Noun: people, animals, concepts, things (e.g. “birds”)
2. Pronoun: a word used in place of a noun (e.g. “it”, “they”, “I”, “she”)
3. Verb: expresses action in the sentence (e.g. “sing”)
4. Adjective: describes properties of nouns (e.g. “yellow”)
5. Adverb: modifies or describes a verb, an adjective, or another adverb (e.g. “extremely”, “slowly”)
6. Preposition: a word placed before a noun/pronoun to form a phrase modifying another word/phrase (e.g. “in”, “for”, “without”)
7. Conjunction: joins words, phrases, or clauses (e.g. “and”, “but”)
8. Interjection: a word that expresses emotion (e.g. “oh”, “wow”)
Quiz!
Identify the basic part of speech of each word in the following sentences.
–“The student put books on the table.”
–“We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”
Morphology
The study of how words are composed of morphemes (the smallest meaning-bearing units of a language)
Two broad classes of morphemes:
–Stems: the “main” morpheme of the word, supplying the meaning
–Affixes: bits and pieces that combine with stems to modify their meanings and grammatical functions (prefixes, suffixes, circumfixes, infixes)
Examples: un-like, try-ing; multiple affixes: un-read-able
Source: Joyce Choi, CSE 842, Michigan State University
Ways to Form Words
Inflection: new forms of the same word (usually in the same class)
–tense, number, mood, voice marking in verbs
–number, gender marking in nominals
–comparison of adjectives
Derivation: yields different words in a different class
–deverbal nominals
–denominal adjectives and verbs
Compounding: new words out of two or more other words
–noun-noun compounding (e.g., doghouse)
Cliticization: combining a word with a clitic (which acts syntactically like a word but in a reduced form, e.g., I’ve)
Source: Joyce Choi, CSE 842, Michigan State University
English Inflectional Morphology
A word stem combines with a grammatical morpheme
–usually produces a word of the same class
–usually serves a grammatical role that the stem could not (e.g. agreement)
like -> likes or liked
bird -> birds
Nouns have a simple inflectional morphology: markers for plurals and markers for possessives
Verbs are slightly more complex.
Source: Joyce Choi, CSE 842, Michigan State University
Nominal Inflection
Nominal morphology
–Plural forms: -s or -es
–Irregular forms, e.g., goose/geese, mouse/mice
–Possessives: children’s
Source: Joyce Choi, CSE 842, Michigan State University
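As a toy illustration of these rules, regular plural inflection can be sketched as a few suffix rules plus a lookup table for irregular forms. This is only an illustrative sketch: the rules and the irregular-form table below are hand-picked, not a complete analyzer.

```python
# Minimal sketch of English plural inflection: irregular forms need a
# lookup table, while regular forms follow suffix rules (-es after
# sibilants, -ies after consonant + y, otherwise -s).
IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "child": "children"}

def pluralize(noun):
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

print(pluralize("bird"))    # birds
print(pluralize("church"))  # churches
print(pluralize("goose"))   # geese
```

Note how quickly the irregular table becomes the hard part: the regular rules fit in three lines, but every irregular noun must be listed explicitly.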
Verbal Inflection
Main verbs (walk, like) are relatively regular
–-s, -ing, -ed
–and productive: instant-messaged, faxed
–but eat/ate/eaten, catch/caught/caught
Primary verbs (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
–be: am/is/are/was/were/been/being
Irregular verbs are few (~250) but occur frequently
English verbal inflection is much simpler than that of e.g. Latin
Source: Joyce Choi, CSE 842, Michigan State University
English Derivational Morphology
A word stem combines with a grammatical morpheme
–usually produces a word of a different class
–more complicated than inflection
Example: nominalization
–-ize verbs -> -ation nouns
–generalize, realize -> generalization, realization
Example: verbs, nouns -> adjectives
–embrace, pity -> embraceable, pitiable
–care, wit -> careless, witless
Source: Joyce Choi, CSE 842, Michigan State University
Example: adjective -> adverb
–happy -> happily
More complicated to model than inflection
–less productive: *science-less, *concern-less, *go-able, *sleep-able
–meanings of derived terms are harder to predict by rule
Source: Joyce Choi, CSE 842, Michigan State University
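A single derivational rule, such as the -ize -> -ation nominalization above, can be sketched as a string rewrite. The sketch below is illustrative only; as the slide notes, real derivation is less productive than this, and the meaning of a derived form is not always predictable from the rule.

```python
# Illustrative sketch of one derivational rule: -ize verbs -> -ation
# nouns (generalize -> generalization). Derivation is not fully
# productive, so a real system would also need a lexicon of exceptions.
def nominalize_ize(verb):
    if verb.endswith("ize"):
        return verb[:-3] + "ization"
    raise ValueError("rule applies only to -ize verbs")

print(nominalize_ize("generalize"))  # generalization
print(nominalize_ize("realize"))     # realization
```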
Morphological Analysis Tools
E.g. the Porter stemmer
–A simple approach: just hack off the end of the word!
–Does NOT convert a word to its base form!!!
–Frequently used in Information Retrieval, but the results are pretty ugly!
Original:
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a nonexecutive director of this British industrial conglomerate. A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago,
Results:
Rudolph Agnew, 55 year old and former chairman of Consolid Gold Field PLC, wa name a nonexecut director of thi British industri conglomer. A form of asbesto onc use to make Kent cigarett filter ha caus a high percentag of cancer death among a group of worker expos to it more than 30 year ago,
Source: Marti Hearst, i256, at UC Berkeley
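The "hack off the end of the word" idea can be sketched as a toy suffix stripper. To be clear, this is not the actual Porter algorithm (which applies ordered rule phases with conditions on the remaining stem); it only shows why stripped stems like "caus" and "consolid" come out looking ugly.

```python
# Toy suffix-stripping stemmer: remove the first matching suffix,
# without checking that what remains is a real word. Suffix list is
# illustrative, not Porter's actual rule set.
SUFFIXES = ["ational", "ization", "ations", "ing", "ies", "ed", "es", "s"]

def crude_stem(word):
    for suf in SUFFIXES:
        # keep at least 3 characters of stem to avoid over-stripping
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(crude_stem("filters"))  # filter
print(crude_stem("caused"))   # caus   <- not a word, as in the slide above
```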
Stemming vs. Lemmatization
The purpose of both stemming and lemmatization is to reduce morphological variation. Stemming reduces word forms to (pseudo-)stems, whereas lemmatization reduces word forms to linguistically valid lemmas (morphological stems).
–Stemming: car, cars, car's, cars' => car
–Lemmatizing: am, are, is => be; drive, drives, drove, driven => drive
In a way, lemmatization deals only with inflectional variance, whereas stemming may also deal with derivational variance.
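The contrast with stemming can be seen in a minimal lemmatization sketch: irregular forms require a lookup table, and only regular inflection can be undone by rule. A real lemmatizer (such as the ones in NLTK or spaCy) uses a full lexicon plus POS information; the table and rule below are illustrative placeholders.

```python
# Minimal lemmatizer sketch: table lookup for irregular forms,
# one crude rule for regular inflection. Illustrative only.
LEMMA_TABLE = {"am": "be", "are": "be", "is": "be",
               "drove": "drive", "driven": "drive"}

def lemmatize(word):
    if word in LEMMA_TABLE:
        return LEMMA_TABLE[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]  # crude plural / 3rd-person-singular rule
    return word

print(lemmatize("is"))      # be      (irregular: table lookup)
print(lemmatize("drives"))  # drive   (regular: rule)
```

Unlike the stemmer above, every output here is a valid word, but only because irregular forms were listed explicitly, which is exactly why lemmatizers need dictionaries.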
Is Stemming/Lemmatization Useful?
Both help reduce the size of the vocabulary. However, there are problems:
–Stemming can conflate semantically different words, e.g. “gallery” and “gall” may both be stemmed to “gall”
–Truncated stems can also be unintelligible to users
–Lemmatization is better, but it deals only with inflectional variance (e.g. “go”, “went”, “gone” => “go”, but not “attend”/verb vs. “attendance”/noun)
Despite these problems, stemming is often done in Information Retrieval (IR) and Text Mining.
Quiz!
The following pairs of words are stemmed to the same form by the Porter stemmer. Which pairs, would you agree, should NOT be conflated? Give your reasoning.
–abandon / abandonment
–marketing / markets
–university / universe
–volume / volumes
FYI: Porter Stemmer Online
Source: Introduction to Information Retrieval, C. Manning, P. Raghavan and H. Schütze
POS Tagging
The process of assigning a part-of-speech or lexical class marker to each word in a sentence (and to all sentences in a collection).
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Source: Jurafsky & Martin, “Speech and Language Processing”
Why is POS Tagging Useful?
The first step of a vast number of practical tasks
Helps in stemming/lemmatization
Parsing
–need to know if a word is an N or a V before you can parse
–parsers can build trees directly on the POS tags instead of maintaining a lexicon
Information Extraction
–finding names, relations, etc.
Machine Translation
Selecting words of specific parts of speech (e.g. nouns) when pre-processing documents (for IR etc.)
Source: Jurafsky & Martin, “Speech and Language Processing”
POS Tagging: Choosing a Tagset
To do POS tagging, we need to choose a standard set of tags to work with
Could pick a very coarse tagset
–N, V, Adj, Adv
More commonly used is a finer-grained set, the Penn Treebank tagset (45 tags)
–PRP$, WRB, WP$, VBG
Even more fine-grained tagsets exist
Source: Jurafsky & Martin, “Speech and Language Processing”
Penn Treebank POS Tagset
[table of the 45 Penn Treebank tags omitted]
Difficulties with POS Tagging
Words often have more than one POS – ambiguity:
–The back door = JJ
–On my back = NN
–Win the voters back = RB
–Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for a particular instance of a word.
Another example of part-of-speech ambiguity:
–“Fed raises interest rates 0.5 % in effort to control inflation”
–Fed/NNP; raises, interest, and rates can each be NNS or VBZ; 0.5/CD; %/NN; control/VB
Source: Jurafsky & Martin, “Speech and Language Processing”; Andrew McCallum, UMass Amherst
POS Tagging Techniques
1. Rule-based
–hand-coded rules
2. Probabilistic/stochastic
–sequence (n-gram) models; machine learning
–HMMs (Hidden Markov Models)
–MEMMs (Maximum Entropy Markov Models)
3. Transformation-based
–rules + n-gram machine learning
–Brill tagger
Source: Jurafsky & Martin, “Speech and Language Processing”; Andrew McCallum, UMass Amherst
Current Performance
Input: the lead paint is unsafe
Output: the/Det lead/N paint/N is/V unsafe/Adj
Using state-of-the-art automated methods, how many tags are correct?
–About 97% currently
–But the baseline is already 90%
The baseline is the performance of the simplest possible method:
–tag every word with its most frequent tag, and
–tag unknown words as nouns
Source: Andrew McCallum, UMass Amherst
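The 90% baseline described above is simple enough to sketch directly: count tag frequencies per word in training data, then tag each known word with its most frequent tag and each unknown word as a noun. The tiny training set below is made up for illustration.

```python
# Sketch of the most-frequent-tag baseline: known words get their most
# frequent training tag; unknown words default to NN (noun).
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, most_freq_tag):
    return [(w, most_freq_tag.get(w, "NN")) for w in words]

# Toy training data (invented for illustration)
train = [[("the", "DT"), ("lead", "NN"), ("paint", "NN"), ("is", "VBZ")],
         [("the", "DT"), ("back", "RB"), ("door", "NN")]]
model = train_baseline(train)
print(baseline_tag(["the", "lead", "door", "flight"], model))
# [('the', 'DT'), ('lead', 'NN'), ('door', 'NN'), ('flight', 'NN')]
```

Note that this baseline cannot handle the ambiguity examples above: it always tags "back" the same way regardless of context, which is exactly the gap the probabilistic sequence models close.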
Quiz!
1. Find one tagging error in each of the following sentences, which are tagged with the Penn Treebank tagset.
a. I/PRP need/VBP a/DT flight/NN from/IN Atlanta/NN .
b. Can/VBP you/PRP list/VB the/DT nonstop/JJ afternoon/NN flights/NNS ?
2. Tag each word in the following sentence with the Penn Treebank tagset.
–“We may also collect information you voluntarily add to your profile, such as your mobile phone number and mobile service provider.”
Named Entity Recognition (NER)
Named Entities (NEs) are proper names in texts, i.e. the names of persons, organizations, locations, times, and quantities
NE Recognition (NER) is a sub-task of Information Extraction (IE)
NER processes a text and identifies the named entities in it
–e.g. “U.N. official Ekeus heads for Baghdad.”
NER is also an important task for texts in specific domains, such as biomedical texts
Source: J. Choi, CSE 842, MSU; Marti Hearst, i256, at UC Berkeley
Common Entity Types
NE Type: Examples
–ORGANIZATION: Georgia-Pacific Corp., WHO
–PERSON: Eddy Bonte, President Obama
–LOCATION: Murray River, Mount Everest
–DATE: June
–TIME: two fifty a m, 1:30 p.m.
–MONEY: 175 million Canadian Dollars, GBP
–PERCENT: twenty pct, %
–FACILITY: Washington Monument, Stonehenge
–GPE (geo-political entity): South East Asia, Midlothian
Difficulties with NER
Names are too numerous to include in dictionaries
Variations
–e.g. “John Smith”, “Mr Smith”, “John”
Constantly changing
–new names are invented all the time, producing unknown words
Ambiguities
–the same name can refer to different entities, e.g.
–JFK – the former president
–JFK – his son
–JFK – the airport in NY
Multi-word entities – difficult to find the boundaries
–“DePaul University”
–“Cecil H. Green Library and Escondido Village Conference Service Center”
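The boundary and ambiguity problems above can be made concrete with a deliberately naive recognizer: treat every maximal run of capitalized words as a candidate entity. This regex sketch is not a usable NER system; it is here to show both why multi-word grouping is useful and the kinds of errors a surface-pattern approach makes.

```python
import re

# Naive candidate-entity extractor: maximal runs of capitalized tokens.
# It groups "DePaul University" correctly, but it also grabs
# sentence-initial words, keeps trailing punctuation, and has no way
# to disambiguate names like "JFK".
def candidate_entities(text):
    pattern = r"(?:[A-Z][a-zA-Z.]*)(?:\s+[A-Z][a-zA-Z.]*)*"
    return re.findall(pattern, text)

print(candidate_entities("DePaul University is in Chicago."))
# ['DePaul University', 'Chicago.']  <- note the trailing period:
# a typical boundary error
```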
Landscape of IE/NER Techniques
Any of these models can be used to capture words, formatting, or both (illustrated on the sentence “Abraham Lincoln was born in Kentucky.”):
–Lexicons: is the candidate string a member of a list? (e.g., Alabama, Alaska, …, Wisconsin, Wyoming)
–Classify pre-segmented candidates: run a classifier on each candidate – which class?
–Sliding window: classify the text inside a window, trying alternate window sizes
–Boundary models: classify positions as BEGIN or END of an entity
–Finite state machines: find the most likely state sequence
–Context-free grammars: find the most likely parse
Source: Marti Hearst, i256, at UC Berkeley
NER State-of-the-Art Performance
Named entity recognition
–Person, Location, Organization, …
–F1 score (similar to accuracy) in the high 80’s or low- to mid-90’s
However, performance depends on the entity types.
[Wikipedia] At least two hierarchies of named entity types have been proposed in the literature. The BBN categories [1], proposed in 2002, are used for Question Answering and consist of 29 types and 64 subtypes. Sekine’s extended hierarchy [2], also proposed in 2002, is made up of 200 subtypes.
Also, different domains use different entity types (e.g. concepts in biomedical texts)
Is NER Useful?
Yes, especially when the text uses domain-specific vocabulary (e.g. legal, medical).
Stop Words
Many of the most frequently used words in English are worthless in IR and text mining – these words are called stop words.
–the, of, and, to, …
–typically about 400 to 500 such words
–for an application, an additional domain-specific stop word list may be constructed
Why do we want to remove stop words?
–Reduce the indexing (or data) file size
–stop words account for 20–30% of total word counts
–Improve efficiency
–stop words are not useful for searching or text mining
–stop words always have a large number of hits
Example stopword lists
Source: Bing Liu, CS 594, UIC
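Stop word removal itself is a one-line filter once a list is chosen. The list below is a tiny hand-made stand-in for the 400–500-word lists mentioned above.

```python
# Stop word removal sketch. STOP_WORDS here is a small illustrative
# set; real systems use curated lists of several hundred words.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the lead paint is unsafe".split()
print(remove_stop_words(tokens))  # ['lead', 'paint', 'unsafe']
```

The lowercasing in the filter matters: without it, sentence-initial "The" would slip through a lowercase-only list.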
Difficulties with Stop Words
Though stop words usually refer to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. (Wikipedia)