1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Oct 23, 2006 (Slides developed by Preslav Nakov)

2 Today
Feature selection
TF.IDF term weighting
Weka input file format

3 Features for Text Categorization
Linguistic features
 – Words
   – lowercase? (should we convert to lowercase?)
   – normalized? (e.g. "texts" -> "text")
 – Phrases
 – Word-level n-grams
 – Character-level n-grams
 – Punctuation
 – Part of Speech
Non-linguistic features
 – document formatting
 – informative character sequences (e.g. "<")

4 When Do We Need Feature Selection?
If the algorithm cannot handle all possible features
 – e.g. language identification for 100 languages using all words
 – text classification using n-grams
Good features can result in higher accuracy
What if we just keep all features?
 – Even the unreliable features can be helpful.
 – But we need to weight them: in the extreme case, the bad features can have a weight of 0 (or very close), which is... a form of feature selection!

5 Why Feature Selection?
Not all features are equally good!
Bad features: best to remove
 – Infrequent: unlikely to be seen again; co-occurrence with a class can be due to chance
 – Too frequent: mostly function words
 – Uniform across all categories
Good features: should be kept
 – Co-occur with a particular category
 – Do not co-occur with other categories
The rest: good to keep

6 Types Of Feature Selection?
Feature selection reduces the number of features
Usually by:
 – Eliminating features
 – Weighting features
 – Normalizing features
Sometimes by transforming parameters
 – e.g. Latent Semantic Indexing using Singular Value Decomposition (see the sketch below)
Method may depend on problem type
 – For classification and filtering, may want to use information from example documents to guide selection.
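As a rough illustration of the SVD-based reduction mentioned above, here is a minimal Python sketch; the tiny term-document matrix and the choice of k=2 are made up purely for illustration, not taken from the lecture.

    import numpy as np

    # Toy term-document matrix: rows are terms, columns are documents.
    term_doc = np.array([
        [2, 0, 1, 0],   # term "car"
        [1, 0, 2, 0],   # term "auto"
        [0, 3, 0, 1],   # term "jaguar"
        [0, 1, 0, 2],   # term "cat"
    ], dtype=float)

    # Full SVD: term_doc = U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)

    # Keep only the k strongest latent dimensions, replacing many sparse
    # word features with a few dense "latent semantic" ones.
    k = 2
    doc_vectors = np.diag(s[:k]) @ Vt[:k, :]   # each column is a document in LSI space
    print(doc_vectors)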

7 Feature Selection
Task-independent methods
 – Document Frequency (DF)
 – Term Strength (TS)
Task-dependent methods
 – Information Gain (IG)
 – Mutual Information (MI)
 – χ² statistic (CHI)
Empirically compared by Yang & Pedersen (1997)

8 Pedersen & Yang Experiments Compared feature selection methods for text categorization 5 feature selection methods: –DF, MI, CHI, (IG, TS) –Features were just words, not phrases 2 classifiers: –kNN: k-Nearest Neighbor –LLSF: Linear Least Squares Fit 2 data collections: –Reuters –OHSUMED: subset of MEDLINE (1990&1991 used)

9 Document Frequency (DF)
DF: number of documents a term appears in
Based on Zipf's Law
Remove the rare terms (seen 1-2 times):
 – Spurious
 – Unreliable – can be just noise
 – Unlikely to appear in new documents
Plus
 – Easy to compute
 – Task independent: do not need to know the classes
Minus
 – Ad hoc criterion
 – For some applications, rare terms can be good discriminators (e.g., in IR)
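A minimal Python sketch of the DF-based pruning described above, assuming documents arrive as token lists; the toy corpus and the min_df threshold are illustrative only.

    from collections import Counter

    def df_prune(tokenized_docs, min_df=3):
        """Keep only terms whose document frequency is at least min_df.
        min_df=3 mirrors the 'remove terms seen in 1-2 documents' heuristic."""
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))              # count each term once per document
        return {t for t, n in df.items() if n >= min_df}

    # toy corpus, purely for illustration
    docs = [["the", "car", "is", "fast"],
            ["a", "fast", "car"],
            ["the", "cat", "sat"],
            ["fast", "cars", "the"]]
    print(df_prune(docs, min_df=2))   # {'the', 'car', 'fast'}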

10 Stop Word Removal
Common words from a predefined list
Mostly from closed-class categories:
 – unlikely to have a new word added
 – include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles
But also some open-class words like numerals
Bad discriminators
 – uniformly spread across all classes
 – can be safely removed from the vocabulary
 – Is this always a good idea? (e.g. author identification)

11 χ² statistic (CHI)
χ² statistic (pronounced "kai square")
A commonly used method of comparing proportions.
Measures the lack of independence between a term and a category (Yang & Pedersen)

12 χ² statistic (CHI)
Is "jaguar" a good predictor for the "auto" class?

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500

We want to compare:
 – the observed distribution above; and
 – the null hypothesis: that jaguar and auto are independent

13 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?
If independent: Pr(j,a) = Pr(j) × Pr(a)
So, there would be: N × Pr(j,a), i.e. N × Pr(j) × Pr(a)
Pr(j) = (2+3)/N; Pr(a) = (2+500)/N; N = 2+500+3+9500 = 10005
Which = N(5/N)(502/N) = 2510/N = 2510/10005 ≈ 0.25

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500

14 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)          500
  Class ≠ auto         3             9500

observed: f_o    expected: f_e (in parentheses)

15 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)       500 (502)
  Class ≠ auto      3 (4.75)      9500 (9498)

observed: f_o    expected: f_e (in parentheses)

16 χ² statistic (CHI)
χ² sums (f_o – f_e)²/f_e over all table entries:
  χ² = (2–0.25)²/0.25 + (500–502)²/502 + (3–4.75)²/4.75 + (9500–9498)²/9498 ≈ 12.9
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the χ² value for .999 confidence, 1 degree of freedom).

                 Term = jaguar   Term ≠ jaguar
  Class = auto      2 (0.25)       500 (502)
  Class ≠ auto      3 (4.75)      9500 (9498)

observed: f_o    expected: f_e (in parentheses)
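As a sanity check of the worked example, here is a short Python sketch that recomputes χ² cell by cell from the observed counts; the expected counts are derived from row and column totals rather than the rounded values shown on the slide, so it prints 12.8 rather than 12.9.

    # Chi-square from observed counts in the 2x2 jaguar/auto table (values from the slide).
    observed = [[2, 500],     # Class = auto:  [Term = jaguar, Term != jaguar]
                [3, 9500]]    # Class != auto

    N = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[r][c] for r in range(2)) for c in range(2)]

    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = row_totals[r] * col_totals[c] / N   # f_e under independence
            chi2 += (observed[r][c] - expected) ** 2 / expected

    print(round(chi2, 1))   # 12.8 (the slide's 12.9 comes from rounded expected counts)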

17 χ² statistic (CHI)
There is a simpler formula for χ² on a 2x2 table:

  χ²(t,c) = N (AD − CB)² / [ (A+C)(B+D)(A+B)(C+D) ]

  A = #(t,c)      C = #(¬t,c)
  B = #(t,¬c)     D = #(¬t,¬c)
  N = A + B + C + D
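The same value can be obtained from this 2x2 shortcut; a brief sketch using the A, B, C, D counts from the jaguar/auto table:

    A, B, C, D = 2, 500, 3, 9500          # #(t,c), #(t,~c), #(~t,c), #(~t,~c)
    N = A + B + C + D

    chi2 = N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))
    print(round(chi2, 1))                  # 12.8, same value as the cell-by-cell sum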

18 χ² statistic (CHI)
How to use χ² for multiple categories?
Compute χ² for each category and then combine:
 – To require a feature to discriminate well across all categories, take the expected value of χ²:
     χ²_avg(t) = Σ_i P(c_i) χ²(t, c_i)
 – Or, to weight for a single category, take the maximum:
     χ²_max(t) = max_i χ²(t, c_i)
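A small sketch of the averaging and max combinations above; the per-class χ² scores and class priors are made-up numbers for illustration.

    # chi2_per_class: chi-square of one term against each class; class_priors: P(c_i)
    chi2_per_class = {"auto": 12.8, "sports": 0.4, "politics": 1.1}
    class_priors   = {"auto": 0.2,  "sports": 0.5, "politics": 0.3}

    chi2_avg = sum(class_priors[c] * chi2_per_class[c] for c in chi2_per_class)
    chi2_max = max(chi2_per_class.values())

    print(chi2_avg, chi2_max)   # 3.09 and 12.8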

19 χ² statistic (CHI)
Pluses
 – normalized and thus comparable across terms
 – χ²(t,c) is 0 when t and c are independent
 – can be compared to the χ² distribution, 1 degree of freedom
Minuses
 – unreliable for low-frequency terms

20 Information Gain
A measure of the importance of the feature for predicting the presence of the class.
Has an information-theoretic justification
Defined as: the number of "bits of information" gained by knowing the term is present or absent
Based on Information Theory
 – We won't go into this in detail here.

21 Information Gain (IG)
IG: number of bits of information gained by knowing the term is present or absent
t is the term being scored, c_i is a class variable

  IG(t) = H(c) − P(t) H(c|t) − P(¬t) H(c|¬t)

 – entropy: H(c)
 – specific conditional entropy: H(c|t)
 – specific conditional entropy: H(c|¬t)
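A minimal Python sketch of this definition; the helper names and the toy document counts (reusing the jaguar/auto table) are illustrative, not from the lecture.

    import math

    def entropy(probs):
        """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    def information_gain(docs_with_term, docs_without_term):
        """IG(t) = H(c) - P(t) H(c|t) - P(~t) H(c|~t).
        Each argument is a per-class count of documents, e.g. {'auto': 2, 'other': 3}."""
        n_t = sum(docs_with_term.values())
        n_not = sum(docs_without_term.values())
        n = n_t + n_not
        classes = set(docs_with_term) | set(docs_without_term)

        p_c = [(docs_with_term.get(c, 0) + docs_without_term.get(c, 0)) / n for c in classes]
        p_c_given_t = [docs_with_term.get(c, 0) / n_t for c in classes]
        p_c_given_not = [docs_without_term.get(c, 0) / n_not for c in classes]

        return entropy(p_c) - (n_t / n) * entropy(p_c_given_t) - (n_not / n) * entropy(p_c_given_not)

    # jaguar/auto counts from the earlier slides: 2 auto and 3 non-auto documents contain the term
    print(information_gain({"auto": 2, "other": 3}, {"auto": 500, "other": 9500}))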

22 Mutual Information (MI)
The probability of seeing x and y together vs. the probability of seeing x anywhere times the probability of seeing y anywhere (independently).
MI = log( P(x,y) / (P(x)P(y)) ) = log(P(x,y)) – log(P(x)P(y))
From Bayes' law: P(x,y) = P(x|y)P(y)
  = log(P(x|y)P(y)) – log(P(x)P(y))
MI = log(P(x|y)) – log(P(x))

23 Mutual Information (MI)
Approximation:

  I(t,c) ≈ log( A × N / ((A+C)(A+B)) )

  A = #(t,c)      C = #(¬t,c)
  B = #(t,¬c)     D = #(¬t,¬c)
  N = A + B + C + D

 – rare terms get higher scores
 – does not use term absence
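A quick check of this approximation on the jaguar/auto counts; log base 2 is chosen arbitrarily here, since the slide leaves the base unspecified.

    import math

    A, B, C, D = 2, 500, 3, 9500          # #(t,c), #(t,~c), #(~t,c), #(~t,~c)
    N = A + B + C + D

    mi = math.log2(A * N / ((A + C) * (A + B)))
    print(round(mi, 2))    # positive => jaguar and auto co-occur more often than chance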

24 Using Mutual Information
Compute MI for each category and then combine:
 – If we want to discriminate well across all categories, take the expected value of MI:
     I_avg(t) = Σ_i P(c_i) I(t, c_i)
 – To discriminate well for a single category, take the maximum:
     I_max(t) = max_i I(t, c_i)

25 Mutual Information
Pluses
 – I(t,c) is 0 when t and c are independent
 – Has a sound information-theoretic interpretation
Minuses
 – Small numbers produce unreliable results
 – Does not use term absence

26 [Figure from Yang & Pedersen '97: performance curves comparing feature selection methods – CHI max, IG, and DF vs. mutual information and term strength]

27 Feature Comparison
DF, IG and CHI are good and strongly correlated
 – thus using DF is good, cheap, and task independent
 – can be used when IG and CHI are too expensive
MI is bad
 – favors rare terms (which are typically bad)

28 Term Weighting
In the study just shown, terms were (mainly) treated as binary features
 – If a term occurred in a document, it was assigned 1
 – Else 0
Often it is useful to weight the selected features
Standard technique: tf.idf

29 TF.IDF Term Weighting
TF: term frequency
 – definition: TF = t_ij, the frequency of term i in document j
 – purpose: makes the frequent words for the document more important
IDF: inverted document frequency
 – definition: IDF = log(N/n_i)
   – n_i: number of documents containing term i
   – N: total number of documents
 – purpose: makes rare words across documents more important
TF.IDF (for term i in document j)
 – definition: t_ij × log(N/n_i)
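A toy Python implementation of this weighting scheme, with an illustrative three-document corpus; natural log is used since the slide does not fix the log base.

    import math
    from collections import Counter

    def tf_idf(tokenized_docs):
        """Weight each term in each document by tf * log(N / n_i), as defined above.
        Returns one {term: weight} dict per document. Toy implementation; no smoothing."""
        N = len(tokenized_docs)
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))                      # document frequency n_i

        weighted = []
        for doc in tokenized_docs:
            tf = Counter(doc)                        # raw term frequency t_ij
            weighted.append({t: tf[t] * math.log(N / df[t]) for t in tf})
        return weighted

    docs = [["car", "car", "insurance"],
            ["car", "repair"],
            ["insurance", "claim", "claim"]]
    print(tf_idf(docs)[0])   # 'car' gets weight 2*log(3/2); a term in every doc would get 0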

30 Term Normalization
Combine different words into a single representation
Stemming / morphological analysis
 – bought, buy, buys -> buy
General word categories
 – $23.45, 5.30 Yen -> MONEY
 – 1984, 10,000 -> DATE, NUM
 – PERSON
 – ORGANIZATION
 – (Covered in the Information Extraction segment)
Generalize with lexical hierarchies
 – WordNet, MeSH
 – (Covered later in the semester)
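A small normalization sketch, assuming NLTK and its Porter stemmer are available; note that plain suffix stripping handles "buys"/"buying" but not the irregular form "bought" -> "buy" from the slide's example, which needs a morphological analyzer or lemmatizer.

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    for word in ["buys", "buying", "buy", "bought"]:
        print(word, "->", stemmer.stem(word))   # "bought" is left unchanged by suffix stripping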

31 What Do People Do In Practice?
1. Feature selection
 – infrequent term removal
   – infrequent across the whole collection (i.e. DF)
   – seen in a single document
 – most frequent term removal (i.e. stop words)
2. Normalization
 – Stemming (often)
 – Word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (sometimes)
(See the pipeline sketch below.)
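For reference, one way such a pipeline is often assembled outside Weka, sketched with scikit-learn; the documents, labels, and parameter choices are placeholders, not from the lecture.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["cheap car insurance deals", "car repair and service",
            "parliament passes new law", "new law on car insurance"]
    labels = ["auto", "auto", "politics", "politics"]

    # stop-word removal + min_df pruning + tf.idf weighting in one step
    vectorizer = TfidfVectorizer(stop_words="english", min_df=1)
    X = vectorizer.fit_transform(docs)

    # keep the k features with the highest chi-square scores w.r.t. the labels
    selector = SelectKBest(chi2, k=5)
    X_selected = selector.fit_transform(X, labels)
    print(X_selected.shape)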

32 Weka
Java-based tool for large-scale machine-learning problems
Tailored towards text analysis

33 Weka Input Format
Expects a particular input file format
Called ARFF: Attribute-Relation File Format
Consists of a Header and a Data section

34 WEKA File Format: ARFF
(Slide adapted from Eibe Frank)

@relation heart-disease-simplified

@attribute age numeric                       % numerical attribute
@attribute sex { female, male }              % nominal attribute
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present       % "?" marks a missing value
...

Other attribute types: String, Date

35 WEKA Sparse File Format
Value 0 is not represented explicitly
Same header (@relation and @attribute tags)
The @data section is different
Instead of:
  0, X, 0, Y, "class A"
  0, 0, W, 0, "class B"
We write:
  {1 X, 3 Y, 4 "class A"}
  {2 W, 4 "class B"}
This saves LOTS of space for text applications. Why?
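To see why, here is a tiny Python sketch that turns a mostly-zero bag-of-words row into the sparse instance form; the vocabulary size and indices are made up.

    def to_sparse_arff(dense_row):
        """List only the non-zero (index value) pairs, as in the sparse @data section."""
        pairs = [f"{i} {v}" for i, v in enumerate(dense_row) if v != 0]
        return "{" + ", ".join(pairs) + "}"

    dense = [0] * 10000          # pretend vocabulary of 10,000 terms
    dense[42] = 3                # only a handful of terms occur in this document
    dense[1007] = 1

    print(to_sparse_arff(dense))            # {42 3, 1007 1}
    print(len(",".join(map(str, dense))))   # length of the dense representation (~20,000 chars)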

36 Next Time
Wed: Guest lecture by Peter Jackson: Pure and Applied Research in NLP: The Good, the Bad, and the Lucky.
Following week:
 – Text Categorization Algorithms
 – How to use Weka