1 I256: Applied Natural Language Processing Preslav Nakov and Marti Hearst October 16, 2006 (Many slides originally by Barbara Rosario, modified here)


2 Today
Classification
- Text categorization (and other applications)
Various issues regarding classification
- Clustering vs. classification, binary vs. multi-way, flat vs. hierarchical classification...
Introduce the steps necessary for a classification task
- Define classes
- Label text
- Features
- Training and evaluation of a classifier

3 Classification (From: Foundations of Statistical Natural Language Processing, Manning and Schütze)
Goal: Assign 'objects' from a universe to two or more classes or categories
Examples:

Problem                    Object      Categories
Tagging                    Word        POS
Sense disambiguation       Word        The word's senses
Information retrieval      Document    Relevant/not relevant
Sentiment classification   Document    Positive/negative
Author identification      Document    Authors

4 Text Categorization Applications (Slide adapted from Paul Bennett)
- Web pages organized into category hierarchies
- Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
- Responses to Census Bureau occupations
- Patents archived using International Patent Classification
- Patient records coded using international insurance categories
- Message filtering
- News events tracked and filtered by topics
- Spam vs. non-spam

5 Yahoo News Categories

6 Why not a semi-automatic text categorization tool? (Slide adapted from Paul Bennett)
Humans can encode knowledge of what constitutes membership in a category. This encoding can then be automatically applied by a machine to categorize new examples. For example...

7 Expert System (late 1980s) (Slide adapted from Paul Bennett)

8 Rule-based Approach to Text Categorization (Slide adapted from Paul Bennett)
Text in a Web Page:
"Saeco revolutionized espresso brewing a decade ago by introducing Saeco SuperAutomatic machines, which go from bean to coffee at the touch of a button. The all-new Saeco Vienna Super-Automatic home coffee and cappucino machine combines top quality with low price!"
Rules:
Rule 1. (espresso or coffee or cappucino) and machine* -> Coffee Maker
Rule 2. automat* and answering and machine* -> Phone
Rule ...
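The two rules on this slide can be sketched as regular-expression patterns. This is only an illustration under assumptions: the `RULES` structure and the `categorize` function are hypothetical names, and reading `machine*` as a prefix wildcard is one plausible interpretation of the slide's notation, not a description of any particular expert system.

```python
import re

# Hypothetical encoding of the slide's rules: every pattern in a rule must
# match somewhere in the text for the rule's category to be assigned.
RULES = [
    ([r"\b(espresso|coffee|cappucino)\b", r"\bmachine\w*"], "Coffee Maker"),
    ([r"\bautomat\w*", r"\banswering\b", r"\bmachine\w*"], "Phone"),
]

def categorize(text):
    """Return the category of every rule whose patterns all match the text."""
    lowered = text.lower()
    return [category for patterns, category in RULES
            if all(re.search(p, lowered) for p in patterns)]

page = ("Saeco revolutionized espresso brewing a decade ago by introducing "
        "Saeco SuperAutomatic machines, which go from bean to coffee at the "
        "touch of a button.")
print(categorize(page))
```

Even this tiny rule set shows why hand-written rules are brittle: whether "SuperAutomatic" should count as a match for `automat*` depends entirely on how the wildcard and word boundaries are defined.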

9 Defining Rules By Hand (Slide adapted from Paul Bennett)
This is fine for low-stakes applications
- Google and Yahoo alerts allow users to automatically receive news articles containing certain keywords
- Called "filtering" or "routing"
- Works fine when it's OK to miss some things
But when high accuracy is required, experience has shown that writing rules by hand is
- too time-consuming
- too difficult
- prone to inconsistency (as the rule set gets large)

10 Replace Knowledge Engineering with a Statistical Learner (Slide adapted from Paul Bennett)

11 Cost of Manual Text Categorization (Slide adapted from Paul Bennett)
Yahoo!
- 200 (?) people for manual labeling of Web pages
- using a hierarchy of 500,000 categories
MEDLINE (National Library of Medicine)
- $2 million/year for manual indexing of journal articles
- using MEdical Subject Headings (18,000 categories)
Mayo Clinic
- $1.4 million annually for coding patient-record events
- using the International Classification of Diseases (ICD) for billing insurance companies
US Census Bureau decennial census (1990: 22 million responses)
- 232 industry categories and 504 occupation categories
- $15 million if fully done by hand

12 Knowledge Engineering vs. Statistical Learning (Slide adapted from Paul Bennett)
For the US Census Bureau decennial census
- 232 industry categories and 504 occupation categories
- $15 million if fully done by hand
Knowledge engineering: define classification rules manually
- Expert system AIOCS
- Development time: 192 person-months (2 people, 8 years)
- Accuracy = 47%
Statistical learning: learn a classification function
- Nearest Neighbor classification (Creecy '92: 1-NN)
- Development time: 4 person-months (Thinking Machines)
- Accuracy = 60%
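To make the 1-NN idea concrete, here is a minimal sketch of nearest-neighbor classification over bag-of-words vectors with cosine similarity. It is not the Census Bureau system: the similarity measure, the function names, and the two toy training examples are all assumptions made for illustration.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_neighbor_label(text, labeled_examples):
    """Return the label of the single most similar training example (1-NN)."""
    query = bag_of_words(text)
    _, best_label = max(
        labeled_examples,
        key=lambda example: cosine(query, bag_of_words(example[0])),
    )
    return best_label

# Made-up training examples, for illustration only.
examples = [
    ("registered nurse in a hospital ward", "health care"),
    ("truck driver for a freight company", "transportation"),
]
print(nearest_neighbor_label("nurse at county hospital", examples))
```

No rules are written by hand here; all the "knowledge" comes from the labeled examples, which is the trade-off the slide describes.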

13 Text: Topic categorization
Topic categorization: classify the document into semantic topics

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.

14 The Reuters collection
A gold standard
- Collection of (21,578) newswire documents
- For research purposes: a standard text collection to compare systems and algorithms
- 135 valid topic categories

15 Reuters Top topics in Reuters

16 Reuters Document Example
2-MAR :51:43.42
livestock hog
AMERICAN PORK CONGRESS KICKS OFF TOMORROW
CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nations pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

17 Classification vs. Clustering
Classification assumes labeled data: we know how many classes there are and we have examples for each class (labeled data).
Classification is supervised.
In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are.
Clustering is unsupervised.

18-21 Classification (figure sequence: Class1, Class2)

22-26 Clustering (figure sequence)

27 Categories (Labels, Classes)
Labeling data: 2 problems
- Decide the possible classes (which ones, how many)
  - Domain and application dependent
- Label text
  - Difficult, time consuming, inconsistency between annotators

28 Reuters Example, revisited
(Same document as on slide 16: AMERICAN PORK CONGRESS KICKS OFF TOMORROW, labeled livestock hog.)
Why not topic = policy?

29 Binary vs. multi-way classification
Binary classification: two classes
Multi-way classification: more than two classes
Sometimes it can be convenient to treat a multi-way problem like a binary one: one class versus all the others, for all classes
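The "one class versus all the others" reduction can be sketched as a relabeling step: for each class, build a binary target list that is True for that class and False for everything else. The function name `one_vs_rest_labels` and the toy labels are assumptions for illustration.

```python
def one_vs_rest_labels(labels):
    """Map each class to a binary labeling: True for that class, False otherwise."""
    classes = sorted(set(labels))
    return {c: [label == c for label in labels] for c in classes}

# Toy multi-way labels for four documents.
labels = ["sport", "weather", "sport", "politics"]
binary_problems = one_vs_rest_labels(labels)
for cls, targets in binary_problems.items():
    print(cls, targets)
```

A separate binary classifier would then be trained on each of these target lists, one per class.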

30 Flat vs. Hierarchical classification
Flat classification: relations between the classes are undetermined
Hierarchical classification: the classes form a hierarchy in which each node is a sub-class of its parent node

31 Single- vs. multi-category classification In single-category text classification each text belongs to exactly one category In multi-category text classification, each text can have zero or more categories

32 Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes as input the whole string
What's the problem with that?
What are the features that could be useful for this example?

33 Feature terminology
Feature: An aspect of the text that is relevant to the task
Some typical features
- Words present in text
- Frequency of words
- Capitalization
- Are there named entities (NE)?
- WordNet
- Others?

34 Feature terminology
Feature: An aspect of the text that is relevant to the task
Feature value: the realization of the feature in the text
- Words present in text: Kerry, Schumacher, China...
- Frequency of word: Kerry(10), Schumacher(1)...
- Are there dates? Yes/no
- Are there PERSONS? Yes/no
- Are there ORGANIZATIONS? Yes/no
- WordNet: Holonyms (China is part of Asia), Synonyms (China, People's Republic of China, mainland China)

35 Feature Types
Boolean (or Binary) Features
Features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.
f_1(text) = 1 if text contains "Kerry", 0 otherwise
f_2(text) = 1 if text contains PERSON, 0 otherwise
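A boolean word feature like f_1 can be sketched as a small function factory. The names `contains_word` and `tokens` are hypothetical, and the tokenizer is a deliberately crude regular expression; f_2 is not shown because detecting PERSON entities needs a named-entity recognizer.

```python
import re

def tokens(text):
    """Lower-cased word tokens (a crude tokenizer, assumed for illustration)."""
    return re.findall(r"[a-z']+", text.lower())

def contains_word(word):
    """Build a boolean feature: 1 if the word occurs in the text, 0 otherwise."""
    def feature(text):
        return 1 if word.lower() in tokens(text) else 0
    return feature

f_kerry = contains_word("Kerry")
f_schumacher = contains_word("Schumacher")

text = ("Seven-time Formula One champion Michael Schumacher took on the "
        "Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix.")
print(f_kerry(text), f_schumacher(text))
```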

36 Feature Types
Integer Features
Features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.
f_1(text) = Number of times text contains "Kerry"
f_2(text) = Number of times text contains PERSON
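An integer feature replaces the yes/no test with a count. The sketch below assumes a simple regular-expression tokenizer; `word_counts` and `count_feature` are hypothetical names, and the sample sentence is made up.

```python
import re
from collections import Counter

def word_counts(text):
    """Count lower-cased word tokens in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def count_feature(word):
    """Build an integer feature: how many times the word occurs in the text."""
    return lambda text: word_counts(text)[word.lower()]

f_kerry_count = count_feature("Kerry")
sample = "Kerry spoke first. Later Kerry left, and Schumacher arrived."
print(f_kerry_count(sample), count_feature("Schumacher")(sample))
```

A word that never occurs simply yields 0, because `Counter` returns 0 for missing keys.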

37 Feature selection How do we choose the “right” features? A future lecture

38 Classification
- Define classes
- Label text
- Extract features
- Choose a classifier
  >>> my_classifier.classify(token)
  - The Naive Bayes Classifier
  - NN (perceptron)
  - SVM
  - ...
- Train it (and test it)
- Use it to classify new examples
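The steps above can be sketched end to end with a tiny multinomial Naive Bayes classifier using add-one smoothing. This is a from-scratch illustration, not the toolkit classifier the slide refers to: the `NaiveBayes` class, its method names, and the four training sentences are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Feature extraction: lower-cased word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def train(self, labeled_texts):
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for text, label in labeled_texts:
            self.class_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[label][word]
                    score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Define classes and label text (toy data, made up for illustration).
training = [
    ("Schumacher wins the Grand Prix qualifying", "sport"),
    ("The team defeated its rival in the cup final", "sport"),
    ("Hurricane prompts evacuation orders in Florida", "weather"),
    ("High wind warnings issued as the storm approaches", "weather"),
]
clf = NaiveBayes()
clf.train(training)                      # train it
print(clf.classify("Grand Prix final"))  # use it on new examples
print(clf.classify("storm and wind warnings"))
```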

39 Training
- Usually the classifier is defined by a set of parameters
- Training is the procedure for finding a "good" set of parameters
- Goodness is determined by an optimization criterion such as misclassification rate
- Some classifiers are guaranteed to find the optimal set of parameters
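As one concrete picture of "finding a good set of parameters", here is a sketch of perceptron training, where the parameters are a weight vector and a bias that are adjusted whenever a training example is misclassified. The function names and the two-feature toy data are assumptions; on linearly separable data such as this, the perceptron is known to reach zero training errors.

```python
def predict(weights, bias, features):
    """Return +1 or -1 depending on the sign of the linear activation."""
    activation = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if activation > 0 else -1

def train_perceptron(examples, n_features, epochs=10):
    """Adjust weights and bias on each misclassified example."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, target in examples:  # target is +1 or -1
            if predict(weights, bias, features) != target:
                weights = [w + target * x for w, x in zip(weights, features)]
                bias += target
                errors += 1
        if errors == 0:  # every training example is classified correctly
            break
    return weights, bias

# Toy features: [contains "prix", contains "hurricane"]; +1 = sport, -1 = weather.
examples = [([1, 0], 1), ([0, 1], -1), ([1, 0], 1), ([0, 1], -1)]
weights, bias = train_perceptron(examples, n_features=2)
print(weights, bias)
```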

40 Testing, evaluation of the classifier
After choosing the parameters of the classifier (i.e. after training it) we need to test how well it's doing on a test set (not included in the training set)
Calculate the misclassification rate on the test set
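The misclassification rate is simply the fraction of held-out test examples the classifier gets wrong. In the sketch below, `keyword_classifier` is a hypothetical stand-in for a trained classifier, and the four test items are made up.

```python
def misclassification_rate(classify, test_set):
    """Fraction of test examples whose predicted label differs from the gold label."""
    errors = sum(1 for text, gold in test_set if classify(text) != gold)
    return errors / len(test_set)

def keyword_classifier(text):
    """Stand-in for a trained classifier (an assumption for illustration)."""
    return "weather" if "hurricane" in text.lower() else "sport"

# Held-out examples that were not used for training.
test_set = [
    ("Hurricane Jeanne nears Florida", "weather"),
    ("Bryan twins win Davis Cup doubles", "sport"),
    ("High wind warnings issued", "weather"),
    ("Schumacher takes pole position", "sport"),
]
rate = misclassification_rate(keyword_classifier, test_set)
print(rate)
```

Here the third item is misclassified because it mentions wind but not a hurricane, so one of four test examples is wrong.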

41 Evaluating classifiers
Contingency table for the evaluation of a binary classifier

                      GREEN is correct   RED is correct
GREEN was assigned           a                 b
RED was assigned             c                 d

Accuracy = (a+d)/(a+b+c+d)
Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
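The formulas above translate directly into code. The function name `evaluate` and the counts in the usage example are assumptions chosen only to make the arithmetic easy to follow.

```python
def evaluate(a, b, c, d):
    """Metrics from the contingency table: rows are assigned, columns are correct.

    a: assigned GREEN, correct GREEN    b: assigned GREEN, correct RED
    c: assigned RED,   correct GREEN    d: assigned RED,   correct RED
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }

# Made-up counts for 100 test items.
metrics = evaluate(a=40, b=10, c=20, d=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 70/100, precision for GREEN is 40/50 and recall for GREEN is 40/60, which shows that the same classifier can score differently on precision and recall.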

42 Training size
The more the better! (usually)
Results for text classification*
*From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang

43 Training size (figure; *From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang)

44 Training size (figure; *From: Improving the Performance of Naive Bayes for Text Classification, Shen and Yang)

45 Training size: author identification (From: Authorship Attribution: a Comparison of Three Methods, Matthew Care)

46 Upcoming
- Classifiers
- Feature selection algorithms