1 Text categorization
Feature selection: the chi-square test

2 Slides adapted from Mary Ellen Califf
Joint Probability Distribution
The joint probability distribution for a set of random variables X_1, ..., X_n gives the probability of every combination of values: P(X_1, ..., X_n).
(Table: the joint distribution over Sneeze / ¬Sneeze and Cold / ¬Cold.)
The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution, so all conditional probabilities can also be calculated, e.g. P(Cold | ¬Sneeze).
BUT it is often very hard to obtain all the probabilities for a joint distribution.
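As a minimal illustration of computing a conditional probability by summing out the joint distribution (the probability values below are invented for illustration, not the slide's):

```python
# Hypothetical joint distribution P(Cold, Sneeze); the values are
# invented for illustration and must sum to 1.
joint = {
    (True,  True):  0.08,   # P(Cold, Sneeze)
    (True,  False): 0.02,   # P(Cold, ¬Sneeze)
    (False, True):  0.10,   # P(¬Cold, Sneeze)
    (False, False): 0.80,   # P(¬Cold, ¬Sneeze)
}

# Marginal P(¬Sneeze): sum the joint over all values of Cold.
p_not_sneeze = sum(p for (cold, sneeze), p in joint.items() if not sneeze)

# Conditional P(Cold | ¬Sneeze) = P(Cold, ¬Sneeze) / P(¬Sneeze).
p_cold_given_not_sneeze = joint[(True, False)] / p_not_sneeze

print(p_cold_given_not_sneeze)  # 0.02 / 0.82 ≈ 0.024
```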

3 Slides adapted from Mary Ellen Califf
Bayes Independence Example
Imagine there are diagnoses ALLERGY, COLD, and WELL and symptoms SNEEZE, COUGH, and FEVER. Can these be correct numbers?
(Table: P(d), P(sneeze|d), P(cough|d), P(fever|d) for each diagnosis d in {Well, Cold, Allergy}.)

4 KL divergence (relative entropy)
The basis for comparing two probability distributions.
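For reference, the textbook definition of relative entropy between discrete distributions p and q:

```latex
D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
```

It is zero exactly when p = q, grows as the two distributions diverge, and is not symmetric in p and q.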

5 Slide adapted from Paul Bennet
Text Categorization Applications
Web pages organized into category hierarchies
Journal articles indexed by subject categories (e.g., the Library of Congress, MEDLINE, etc.)
Responses to Census Bureau occupation questions
Patents archived using the International Patent Classification
Patient records coded using international insurance categories
Email message filtering
News events tracked and filtered by topic
Spam detection

6 Yahoo News Categories

7 Text Topic categorization
Topic categorization: classify the document into semantic topics. For example:
"The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an insurmountable 3-0 lead in the best-of-five semi-final tie."
"One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine."

8 The Reuters collection: a gold standard
A collection of 21,578 newswire documents. For research purposes it is a standard text collection used to compare systems and algorithms.
135 valid topic categories.

9 Reuters: top topics in Reuters

10 Reuters Document Example
2-MAR :51:43.42
Topics: livestock, hog
AMERICAN PORK CONGRESS KICKS OFF TOMORROW
CHICAGO, March 2 - The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three-day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter

11 Classification vs. Clustering
Classification assumes labeled data: we know how many classes there are and we have examples for each class. Classification is supervised.
In clustering we don't have labeled data; we just assume that there is a natural division in the data, and we may not know how many divisions (clusters) there are. Clustering is unsupervised.

12 Categories (Labels, Classes)
Labeling data involves two problems:
Deciding on the possible classes (which ones, how many): domain and application dependent.
Labeling the text: difficult, time consuming, and prone to inconsistency between annotators.

13 Reuters Example, revisited
2-MAR :51:43.42
Topics: livestock, hog
AMERICAN PORK CONGRESS KICKS OFF TOMORROW
(the same document as on slide 10: the American Pork Congress, farm policy and tax law resolutions, the PRV eradication debate, the trade show)
Why not topic = policy? The text explicitly discusses farm policy, yet the assigned topics are only livestock and hog, which illustrates how hard consistent labeling is.

14 Binary vs. multi-way classification
Binary classification: two classes.
Multi-way classification: more than two classes.
Sometimes it can be convenient to treat a multi-way problem as a set of binary ones: one class versus all the others, for each class.
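A minimal sketch of this one-versus-all reduction (the label set and function name are illustrative, not from the slides):

```python
# One-vs-rest: turn one multi-way labeling into one binary labeling per class.
labels = ["sport", "politics", "weather", "sport", "politics"]  # illustrative

def one_vs_rest(labels, positive_class):
    """Relabel a multi-way problem as binary: positive_class vs. all the others."""
    return [1 if y == positive_class else 0 for y in labels]

# One binary problem per class; a binary classifier is then trained on each.
binary_problems = {c: one_vs_rest(labels, c) for c in set(labels)}
print(binary_problems["sport"])  # [1, 0, 0, 1, 0]
```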

15 Features
>>> text = "Seven-time Formula One champion Michael Schumacher took on the Shanghai circuit Saturday in qualifying for the first Chinese Grand Prix."
>>> label = "sport"
>>> labeled_text = LabeledText(text, label)
Here the classification takes as input the whole string.
What's the problem with that?
What are the features that could be useful for this example?

16 Feature terminology
Feature: an aspect of the text that is relevant to the task.
Some typical features:
Words present in the text
Frequency of words
Capitalization
Are there named entities (NEs)?
WordNet relations
Others?

17 Feature terminology
Feature: an aspect of the text that is relevant to the task.
Feature value: the realization of the feature in the text.
Words present in the text
Frequency of a word
Are there dates? Yes/no
Are there PERSONs? Yes/no
Are there ORGANIZATIONs? Yes/no
WordNet: holonyms (China is part of Asia), synonyms (China, People's Republic of China, mainland China)

18 Feature Types: Boolean (or Binary) Features
Features that generate boolean (binary) values. Boolean features are the simplest and the most common type of feature.
f1(text) = 1 if text contains "elections", 0 otherwise
f2(text) = 1 if text contains a PERSON entity, 0 otherwise

19 Feature Types: Integer Features
Features that generate integer values. Integer features can be used to give classifiers access to more precise information about the text.
f1(text) = number of times "elections" occurs
f2(text) = number of times a PERSON entity occurs
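A minimal sketch of the boolean and integer features from slides 18 and 19 (the function names are mine; the PERSON feature is stubbed out, since it would need a real named-entity tagger):

```python
import re

def has_elections(text):
    """Boolean feature f1: 1 if "elections" occurs in the text, 0 otherwise."""
    return 1 if re.search(r"\belections\b", text.lower()) else 0

def count_elections(text):
    """Integer feature f1: number of times "elections" occurs."""
    return len(re.findall(r"\belections\b", text.lower()))

def count_persons(text):
    """Integer feature f2: number of PERSON entities.
    Stub only: a real implementation would call an NER tagger here."""
    raise NotImplementedError("plug in a named-entity tagger")

text = "The elections were held on Sunday. Early elections are unlikely."
print(has_elections(text))    # 1
print(count_elections(text))  # 2
```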

20 The χ² statistic (CHI), pronounced "kai square"
A commonly used method of comparing proportions; it measures the lack of independence between a term and a category.

21 The χ² statistic (CHI)
Is "jaguar" a good predictor for the "auto" class? The observed counts:

                Term = jaguar    Term ≠ jaguar
Class = auto          2               500
Class ≠ auto          3              9500

We want to compare the observed distribution above with the null hypothesis that jaguar and auto are independent.

22 The χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
If independent: Pr(j, a) = Pr(j) × Pr(a), so there would be N × Pr(j, a), i.e. N × Pr(j) × Pr(a), expected co-occurrences of "jaguar" with "auto".
Pr(j) = (2+3)/N;  Pr(a) = (2+500)/N;  N = 10005
N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                Term = jaguar    Term ≠ jaguar
Class = auto          2               500
Class ≠ auto          3              9500

23 The χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
Observed counts f_o, with expected counts f_e in parentheses:

                Term = jaguar    Term ≠ jaguar
Class = auto       2 (0.25)          500
Class ≠ auto          3              9500

24 The χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
Observed counts f_o, with expected counts f_e in parentheses:

                Term = jaguar    Term ≠ jaguar
Class = auto       2 (0.25)        500 (502)
Class ≠ auto       3 (4.75)       9500 (9498)

25 The χ² statistic (CHI)
χ² sums (f_o − f_e)² / f_e over all table entries:
χ² = (2 − 0.25)²/0.25 + (500 − 502)²/502 + (3 − 4.75)²/4.75 + (9500 − 9498)²/9498 ≈ 12.9
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the critical value for .999 confidence with one degree of freedom).

                Term = jaguar    Term ≠ jaguar
Class = auto       2 (0.25)        500 (502)
Class ≠ auto       3 (4.75)       9500 (9498)
(observed: f_o; expected f_e in parentheses)
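The arithmetic can be checked cell by cell; a small sketch (the variable names are mine):

```python
# Observed 2x2 counts: rows = class (auto, not auto), cols = term (jaguar, not jaguar).
observed = [[2, 500],
            [3, 9500]]

N = sum(sum(row) for row in observed)              # 10005
row_totals = [sum(row) for row in observed]        # [502, 9503]
col_totals = [sum(col) for col in zip(*observed)]  # [5, 10000]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        f_e = row_totals[i] * col_totals[j] / N    # expected count under independence
        chi2 += (observed[i][j] - f_e) ** 2 / f_e

print(round(chi2, 2))  # 12.85; the slide's 12.9 comes from rounding the f_e values first
```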

26 The χ² statistic (CHI)
There is a simpler formula for χ² in the 2×2 case:
χ²(t, c) = N × (AD − CB)² / ((A + C) × (B + D) × (A + B) × (C + D))
where N = A + B + C + D and
A = #(t, c)    C = #(¬t, c)
B = #(t, ¬c)   D = #(¬t, ¬c)
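The closed form can be checked against the jaguar/auto table from the earlier slides (the helper name is mine):

```python
def chi2_2x2(A, B, C, D):
    """2x2 chi-square: A = #(t,c), B = #(t,not c), C = #(not t,c), D = #(not t,not c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Same table as before: agrees with the cell-by-cell sum (about 12.85).
print(chi2_2x2(A=2, B=3, C=500, D=9500))
```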