1 CPE 641 Natural Language Processing. Asst. Prof. Nuttanart Facundes. Text Classification. Adapted from Barbara Rosario's slides, Sept. 27, 2004.


2 Classification
- Text categorization (and other applications)
- Various issues regarding classification: clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification, ...
- Steps necessary for a classification task: define classes, label text, extract features, train and evaluate a classifier

3 Classification
From: Foundations of Statistical Natural Language Processing, Manning and Schutze.
Goal: assign 'objects' from a universe to two or more classes or categories.
Examples (Problem / Object / Categories):
- Tagging: word / POS
- Sense disambiguation: word / the word's senses
- Information retrieval: document / relevant or not relevant
- Sentiment classification: document / positive or negative
- Author identification: document / authors

4 Author identification
They agreed that Mrs. X should only hear of the departure of the family, without being alarmed on the score of the gentleman's conduct; but even this partial communication gave her a great deal of concern, and she bewailed it as exceedingly unlucky that the ladies should happen to go away, just as they were all getting so intimate together.

Gas looming through the fog in divers places in the streets, much as the sun may, from the spongey fields, be seen to loom by husbandman and ploughboy. Most of the shops lighted two hours before their time--as the gas seems to know, for it has a haggard and unwilling look. The raw afternoon is rawest, and the dense fog is densest, and the muddy streets are muddiest near that leaden-headed old obstruction, appropriate ornament for the threshold of a leaden-headed old corporation, Temple Bar.

5 Author identification
Jane Austen, Pride and Prejudice
Charles Dickens, Bleak House

6 Author identification
Mosteller, Frederick and Wallace, David L., Inference and Disputed Authorship: The Federalist.
- Federalist papers: 77 short essays written by Hamilton, Jay and Madison to persuade NY to ratify the US Constitution; published under a pseudonym.
- The authorship of 12 papers was in dispute (the disputed papers).
- In 1964 Mosteller and Wallace solved the problem: they identified 70 function words as good candidates for authorship analysis, and using statistical inference they concluded the author was Madison.
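
The function-word approach can be sketched in a few lines of Python: count how often a handful of function words occur per 1000 tokens and compare the resulting profiles across authors. The word list and the helper below are illustrative assumptions, not Mosteller and Wallace's actual 70-word list or method.

```python
from collections import Counter

# Illustrative subset of function words (NOT the actual 70-word list)
FUNCTION_WORDS = ["upon", "while", "whilst", "by", "to"]

def function_word_profile(text):
    """Relative frequency of each function word, per 1000 tokens."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {w: 1000.0 * counts[w] / len(tokens) for w in FUNCTION_WORDS}
```

Profiles computed from essays of known authorship would then be compared against the profile of a disputed paper.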

7 Function words for Author Identification

8

9 Classification
From: Foundations of Statistical Natural Language Processing, Manning and Schutze.
Goal: assign 'objects' from a universe to two or more classes or categories.
Examples (Problem / Object / Categories):
- Author identification: document / authors
- Language identification: document / language

10 Language identification
Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza. (Italian)

Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen. (German)

From the Universal Declaration of Human Rights, UN, available in 363 languages.

11 Language identification égaux eguali iguales edistämään Ü ¿
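
The intuition on this slide (a few words or characters are highly distinctive of a language) can be sketched as a toy identifier. The cue lists below are illustrative assumptions, not a trained model:

```python
# Toy cue lists: words distinctive of each language (illustrative only)
CUES = {
    "italian": ["gli", "eguali", "diritti"],
    "german":  ["und", "gleich", "Würde"],
    "french":  ["égaux", "les", "droits"],
}

def guess_language(text):
    """Return the language whose cues occur most often as substrings."""
    scores = {lang: sum(text.count(c) for c in cues)
              for lang, cues in CUES.items()}
    return max(scores, key=scores.get)
```

Real systems typically use character n-gram statistics rather than hand-picked cues, but the principle is the same.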

12 Classification
From: Foundations of Statistical Natural Language Processing, Manning and Schutze.
Goal: assign 'objects' from a universe to two or more classes or categories.
Examples (Problem / Object / Categories):
- Author identification: document / authors
- Language identification: document / language
- Text categorization: document / topics

13 Text categorization
Topic categorization: classify the document into semantic topics.

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.

14 Text categorization
Reuters: a collection of 21,578 newswire documents. For research purposes: a standard text collection to compare systems and algorithms.
135 valid topic categories.

15 Reuters
Top topics in Reuters (figure)

16 Reuters 2-MAR :51:43.42 livestock hog AMERICAN PORK CONGRESS KICKS OFF TOMORROW CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

17 Text categorization: examples
- Topic categorization (Reuters)
- Spam filtering: determine if a mail message is spam (or not)
- Customer service message classification

18 Classification vs. Clustering
- Classification assumes labeled data: we know how many classes there are and we have labeled examples for each class. Classification is supervised.
- In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are. Clustering is unsupervised.
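
The contrast can be made concrete with a toy sketch on 1-D data: a supervised nearest-centroid classifier that uses the labels, versus unsupervised 2-means clustering of the same numbers. This is a minimal illustration under simplifying assumptions (1-D points, two clusters, crude min/max initialization), not a production algorithm:

```python
def nearest_centroid(train_points, train_labels, x):
    """Supervised: use the labels to compute per-class centroids, then classify x."""
    centroids = {}
    for lab in set(train_labels):
        pts = [p for p, l in zip(train_points, train_labels) if l == lab]
        centroids[lab] = sum(pts) / len(pts)
    return min(centroids, key=lambda lab: abs(x - centroids[lab]))

def two_means(points, iters=10):
    """Unsupervised: split unlabeled 1-D points into 2 clusters (toy k-means)."""
    c = [min(points), max(points)]              # crude initialization
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            groups[0 if abs(p - c[0]) <= abs(p - c[1]) else 1].append(p)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return groups
```

Note that the classifier needs `train_labels`, while `two_means` discovers the grouping from the points alone.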

19-22 Classification (figure slides: points being separated into Class1 and Class2)

23-27 Clustering (figure slides)

28 Categories (Labels, Classes)
Labeling data involves 2 problems:
- Deciding the possible classes (which ones, how many): domain- and application-dependent.
- Labeling the text: difficult, time consuming; inconsistency between annotators.

29 Reuters 2-MAR :51:43.42 livestock hog AMERICAN PORK CONGRESS KICKS OFF TOMORROW CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter Why not topic = policy ?

30 Binary vs. multi-way classification
- Binary classification: two classes.
- Multi-way classification: more than two classes.
- Sometimes it can be convenient to treat a multi-way problem as a binary one: one class versus all the others, for each class.
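
The one-versus-all reduction can be sketched as: one binary scorer per class (each answering "this class vs. all the others"), then pick the class whose scorer fires most strongly. The keyword-count scorers below are hypothetical stand-ins for real trained binary classifiers:

```python
def one_vs_rest(text, scorers):
    """scorers: dict mapping class label -> binary scoring function.
    Each scorer rates 'this class vs. the rest'; pick the maximum."""
    return max(scorers, key=lambda lab: scorers[lab](text))

# Hypothetical scorers standing in for trained binary classifiers
scorers = {
    "sport":  lambda t: sum(w in t.lower() for w in ["match", "champion", "goal"]),
    "health": lambda t: sum(w in t.lower() for w in ["disease", "treatment", "virus"]),
}
```

With real classifiers, each scorer would be trained on its class's examples as positives and everything else as negatives.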

31 Flat vs. hierarchical classification
- Flat classification: relations between the classes are undetermined.
- Hierarchical classification: a hierarchy where each node is a sub-class of its parent node.

32 Single- vs. multi-category classification
- In single-category text classification, each text belongs to exactly one category.
- In multi-category text classification, each text can have zero or more categories.

33 LabeledText
The LabeledText class in NLTK:

>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
>>> labeled_text.text()
"Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> labeled_text.label()
"sport"

34 NLTK: The Classifier Interface
- classify: determines which label is most appropriate for a given text token, and returns a labeled text token with that label.
- labels: returns the list of category labels that are used by the classifier.

>>> token = Token("The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment.")
>>> my_classifier.classify(token)
"The World Health Organization is recommending more importance be attached to the prevention of heart disease and other cardiovascular ailments rather than focusing on treatment."/health
>>> my_classifier.labels()
("sport", "health", "world", ...)

35 Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)

Here the classification takes as input the whole string. What's the problem with that? What are the features that could be useful for this example?

36 Feature terminology
Feature: an aspect of the text that is relevant to the task.
Some typical features:
- Words present in the text
- Frequency of words
- Capitalization
- Are there named entities (NEs)?
- WordNet
- Others?

37 Feature terminology
Feature: an aspect of the text that is relevant to the task.
Feature value: the realization of the feature in the text.
- Words present in text: Kerry, Schumacher, China, ...
- Frequency of words: Kerry (10), Schumacher (1), ...
- Are there dates? Yes/no
- Are there PERSONs? Yes/no
- Are there ORGANIZATIONs? Yes/no
- WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)

38 Feature Types
Boolean (or binary) features: features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains "Kerry", 0 otherwise
f2(text) = 1 if text contains a PERSON, 0 otherwise

39 Feature Types
Integer features: features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = number of times text contains "Kerry"
f2(text) = number of times text contains a PERSON
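
Both feature types above translate directly into small Python functions. This is a sketch of the slide's "Kerry" examples only; the PERSON features would additionally require a named-entity recognizer:

```python
def f1_boolean(text):
    """Boolean feature: 1 if the text contains "Kerry", else 0."""
    return 1 if "Kerry" in text else 0

def f2_integer(text):
    """Integer feature: number of times the text contains "Kerry"."""
    return text.count("Kerry")
```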

40 Features in NLTK
Feature detectors: features can be defined using feature detector functions, which map LabeledTexts to values.
Method: detect, which takes a labeled text and returns a feature value.

>>> def ball(ltext): return ("ball" in ltext.text())
>>> fdetector = FunctionFeatureDetector(ball)
>>> document1 = "John threw the ball over the fence".split()
>>> fdetector.detect(LabeledText(document1))
1
>>> document2 = "Mary solved the equation".split()
>>> fdetector.detect(LabeledText(document2))
0

41 Features in NLTK
- Feature detector lists: data structures that represent the feature detector functions for a set of features.
- Feature value lists

42 Feature selection How do we choose the “right” features?

43 Classification
1. Define classes
2. Label text
3. Extract features
4. Choose a classifier: the Naive Bayes classifier, NN (perceptron), SVM, ...
5. Train it (and test it)
6. Use it to classify new examples: >>> my_classifier.classify(token)
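
The steps above, with Naive Bayes as the chosen classifier, can be sketched end-to-end in plain Python. This is a toy bag-of-words implementation with add-one (Laplace) smoothing, written for illustration only, not NLTK's actual classifier:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: list of (token_list, label) pairs. Returns a simple model."""
    class_counts = Counter()              # how many training docs per class
    word_counts = defaultdict(Counter)    # per-class word frequencies
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        for t in tokens:
            word_counts[label][t] += 1
            vocab.add(t)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    """Pick the class with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label in class_counts:
        lp = math.log(class_counts[label] / total_docs)          # log prior
        n_label = sum(word_counts[label].values())
        for t in tokens:
            lp += math.log((word_counts[label][t] + 1) / (n_label + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best
```

Training here is just counting; testing means running `classify_nb` on held-out documents and measuring the misclassification rate, as the next slides discuss.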

44 Training
(We'll see what we mean exactly by training when we talk about the algorithms.)
- Training is the adaptation of the classifier to the data.
- Usually the classifier is defined by a set of parameters.
- Training is the procedure for finding a "good" set of parameters.
- Goodness is determined by an optimization criterion, such as misclassification rate.
- Some classifiers are guaranteed to find the optimal set of parameters.

45 Testing: evaluation of the classifier
After choosing the parameters of the classifier (i.e., after training it), we need to test how well it does on a test set (not included in the training set): calculate the misclassification rate on the test set.

46 Evaluating classifiers
Contingency table for the evaluation of a binary classifier:

                     GREEN is correct    RED is correct
GREEN was assigned          a                  b
RED was assigned            c                  d

Accuracy = (a+d)/(a+b+c+d)
Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
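
The accuracy, precision, and recall definitions compute directly from the four cell counts of the contingency table (a minimal helper; the variable names follow the table):

```python
def binary_metrics(a, b, c, d):
    """a: assigned GREEN, GREEN correct;  b: assigned GREEN, RED correct;
    c: assigned RED, GREEN correct;       d: assigned RED, RED correct."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "P_GREEN": a / (a + b), "R_GREEN": a / (a + c),
        "P_RED":   d / (c + d), "R_RED":   d / (b + d),
    }
```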

47 Training size
The more the better! (usually)
Results for text classification (*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang)

48-49 Training size: further results from Shen and Yang (figure slides)

50 Training size: author identification
(From: Authorship Attribution: a Comparison of Three Methods, Matthew Care)