Feature selection
LING 572, Fei Xia
Week 4: 1/29/08

Creating an attribute-value table

Choose features:
– Define feature templates
– Instantiate the feature templates
– Dimensionality reduction: feature selection

Feature weighting:
– The weight for f_k: the whole column
– The weight for f_k in d_i: a cell

The table has one row per instance x_1, x_2, …, one column per feature f_1, f_2, …, f_K, plus a label column y.

An example: text classification task

Define feature templates:
– One template only: word

Instantiate the feature templates:
– All the words that appear in the training (and test) data

Dimensionality reduction: feature selection
– Remove stop words

Feature weighting:
– Feature value: term frequency (tf) or tf-idf
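A minimal sketch of this pipeline in Python (the tokenizer and stop-word list are placeholders, not the ones used in class):

```python
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and"}  # placeholder list

def extract_features(document):
    """Instantiate the 'word' template and drop stop words."""
    tokens = document.lower().split()                     # naive tokenizer
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(tokens)                                # value = term frequency

print(extract_features("The cat sat on the mat"))
# Counter({'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})
```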

Outline
– Dimensionality reduction
– Some scoring functions **
– Chi-square score and chi-square test
– Hw4

In this lecture, we will use "term" and "feature" interchangeably.

Dimensionality reduction (DR)

What is DR?
– Given a feature set r, create a new set r', such that r' is much smaller than r and classification performance does not suffer too much.

Why DR?
– ML algorithms do not scale well to very high-dimensional feature spaces.
– DR can reduce overfitting.

Types of DR
r is the original feature set; r' is the one after DR.

Local DR vs. global DR:
– Global DR: r' is the same for every category
– Local DR: a different r' for each category

Term extraction vs. term selection

Term selection vs. extraction

Term selection: r' is a subset of r
– Wrapping methods: score terms by training and evaluating classifiers ⇒ expensive and classifier-dependent
– Filtering methods: score terms with a predetermined function ⇒ fast and classifier-independent

Term extraction: terms in r' are obtained by combinations or transformations of the terms in r
– Term clustering
– Latent semantic indexing (LSI)

Term selection by filtering

Main idea: score terms according to predetermined numerical functions that measure the "importance" of the terms. This is fast and classifier-independent.

Scoring functions:
– Information gain
– Mutual information
– Chi square
– …
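A sketch of the filtering approach; `score` stands for any of the functions above, and the function and cutoff names are illustrative:

```python
def select_by_filtering(terms, score, top_n=1000):
    """Rank terms by a scoring function and keep the top N."""
    ranked = sorted(terms, key=score, reverse=True)
    return set(ranked[:top_n])
```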

Quick summary so far

DR: to reduce the number of features
– Local DR vs. global DR
– Term extraction vs. term selection

Term extraction:
– Term clustering
– Latent semantic indexing (LSI)

Term selection:
– Wrapping methods
– Filtering methods: different scoring functions

Some scoring functions

Basic distributions (treating features as binary)

Probability distributions on the event space of documents, for a term t_k and a category c_i:
– P(t_k, c_i): a random document contains t_k and belongs to c_i
– P(~t_k, c_i): a document does not contain t_k but belongs to c_i
– P(t_k, ~c_i) and P(~t_k, ~c_i): defined analogously

Calculating basic distributions

These are estimated from document counts, e.g.
P(t_k, c_i) = (# of documents that contain t_k and belong to c_i) / N,
where N is the total number of documents.
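A sketch of the maximum-likelihood estimates, assuming each document is represented as a (set-of-terms, label) pair and features are binary:

```python
def basic_distribution(docs, term, cls):
    """Estimate the four joint probabilities for one (term, class) pair.

    docs: list of (set_of_terms, label) pairs.
    """
    n = len(docs)
    n_tc = sum(1 for terms, y in docs if term in terms and y == cls)
    n_t = sum(1 for terms, _ in docs if term in terms)
    n_c = sum(1 for _, y in docs if y == cls)
    return {
        "P(t,c)":   n_tc / n,                    # term present, class c
        "P(t,~c)":  (n_t - n_tc) / n,            # term present, other class
        "P(~t,c)":  (n_c - n_tc) / n,            # term absent, class c
        "P(~t,~c)": (n - n_t - n_c + n_tc) / n,  # term absent, other class
    }
```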

Term selection functions

Intuition: for a category c_i, the most valuable terms are those that are distributed most differently in the sets of positive and negative examples of c_i.

Term selection functions
– Document frequency: DF(t_k) = the number of training documents that contain t_k
– Mutual information: MI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]

Information gain

IG(Y|X): we must transmit Y; how many bits on average would it save us if both ends of the line knew X?

Definition: IG(Y, X) = H(Y) − H(Y|X)

Information gain

Expanded:
H(Y) = −Σ_y P(y) log P(y)
H(Y|X) = Σ_x P(x) H(Y | X = x)
So IG(Y, X) = H(Y) − Σ_x P(x) H(Y | X = x)

For feature selection, X is the presence/absence of a term t_k and Y is the category.
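A sketch of the computation, where xs are the observed values of X (e.g., presence/absence of a term) and ys are the class labels:

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) = -sum_y P(y) log2 P(y)."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(xs, ys):
    """IG(Y, X) = H(Y) - H(Y|X), with H(Y|X) = sum_x P(x) H(Y | X = x)."""
    n = len(ys)
    h_y_given_x = 0.0
    for x in set(xs):
        sub = [y for xi, y in zip(xs, ys) if xi == x]
        h_y_given_x += len(sub) / n * entropy(sub)
    return entropy(ys) - h_y_given_x

print(information_gain([1, 1, 0, 0], ["a", "a", "b", "b"]))  # 1.0
```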

More term selection functions**
– Ex: gain ratio, chi square (covered below), and others

Global DR

For local DR, calculate f(t_k, c_i). For global DR, combine the per-class scores with one of the following (|C| is the number of classes):
– f_sum(t_k) = Σ_{i=1..|C|} f(t_k, c_i)
– f_wavg(t_k) = Σ_{i=1..|C|} P(c_i) f(t_k, c_i)
– f_max(t_k) = max_i f(t_k, c_i)
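A sketch of the three combinations; `score(t, c)` stands for any local function f(t_k, c_i), and `prior` maps each class to P(c_i):

```python
def globalize(terms, classes, score, prior):
    """Turn per-class scores f(t, c) into one global score per term."""
    f_sum = {t: sum(score(t, c) for c in classes) for t in terms}
    f_wavg = {t: sum(prior[c] * score(t, c) for c in classes) for t in terms}
    f_max = {t: max(score(t, c) for c in classes) for t in terms}
    return f_sum, f_wavg, f_max
```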

Which function works the best?

It depends on:
– the classifier
– the data
– …

According to (Yang and Pedersen 1997), information gain and chi square were the most effective on their text categorization benchmarks; document frequency performed comparably, while mutual information performed poorly.

Feature weighting

Alternative feature values
– Binary features: 0 or 1
– Term frequency (TF): the number of times that t_k appears in d_i
– Inverse document frequency (IDF): log(|D| / d_k), where d_k is the number of documents that contain t_k
– TF-IDF = TF × IDF
– Normalized TF-IDF: e.g., divide each TF-IDF value by the norm of the document's vector
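A sketch of these weights; the slide's exact normalization formula was not preserved, so L2 (cosine) normalization is assumed here:

```python
import math

def tf_idf(tf, df, n_docs):
    """tf: frequency of the term in the document; df: # docs containing it."""
    return tf * math.log(n_docs / df)

def l2_normalize(weights):
    """Divide each weight by the L2 norm of the document vector (assumed)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights
```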

Feature weights
– Feature weight ∈ {0, 1}: same as DR
– Feature weight ∈ R: learned by an iterative approach
  – Ex: MaxEnt

⇒ Feature selection is a special case of feature weighting.

Summary so far
– Curse of dimensionality ⇒ dimensionality reduction (DR)
– DR:
  – Term extraction
  – Term selection
    – Wrapping methods
    – Filtering methods: different scoring functions

Summary (cont)

Scoring functions:
– Document frequency
– Mutual information
– Information gain
– Gain ratio
– Chi square
– …

Chi square

Chi square

An example: is gender a good feature for predicting footwear preference?
– A: gender
– B: footwear preference

Bivariate tabular analysis:
– Is there a relationship between two random variables A and B in the data?
– How strong is the relationship?
– What is the direction of the relationship?

Raw frequencies

         Sandal  Sneaker  Leather shoe  Boots  Others
Male        6      17          13         9      5
Female     13       5           7        16      9

Feature: male/female
Classes: {sandal, sneaker, …}

Two distributions

Observed distribution (O):

         Sandal  Sneaker  Leather shoe  Boots  Others  Total
Male        6      17          13         9      5       50
Female     13       5           7        16      9       50
Total      19      22          20        25     14      100

Expected distribution (E), to be filled in:

         Sandal  Sneaker  Leather shoe  Boots  Others  Total
Male        ?       ?           ?         ?      ?       50
Female      ?       ?           ?         ?      ?       50
Total                                                   100

Two distributions

Observed distribution (O):

         Sandal  Sneaker  Leather shoe  Boots  Others  Total
Male        6      17          13         9      5       50
Female     13       5           7        16      9       50
Total      19      22          20        25     14      100

Expected distribution (E):

         Sandal  Sneaker  Leather shoe  Boots  Others  Total
Male       9.5     11          10        12.5    7       50
Female     9.5     11          10        12.5    7       50
Total      19      22          20        25     14      100

Chi square

Expected value: E_ij = (row total × column total) / table total

χ² = Σ_ij (O_ij − E_ij)² / E_ij

χ² = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … ≈ 14.03

Calculating χ²
– Fill out a contingency table of the observed values ⇒ O
– Compute the row totals and column totals
– Calculate the expected value for each cell, assuming no association ⇒ E
– Compute chi square: Σ (O − E)²/E
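These steps in code, applied to the footwear table above (a sketch, not the hw4 format):

```python
def chi_square(observed):
    """observed: contingency table as a list of rows of raw frequencies."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count
            chi2 += (o - e) ** 2 / e
    return chi2

table = [[6, 17, 13, 9, 5],    # male
         [13, 5, 7, 16, 9]]    # female
print(round(chi_square(table), 2))                      # 14.03
```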

When r = 2 and c = 2

For a 2×2 table

O = [[a, b],
     [c, d]],  with N = a + b + c + d,

the expected counts are E_ij = (row_i total)(col_j total)/N, and χ² simplifies to:

χ² = N (ad − bc)² / ((a + b)(c + d)(a + c)(b + d))
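The shortcut as a one-liner (a sketch):

```python
def chi2_2x2(a, b, c, d):
    """Chi square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```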

χ² test

Basic idea
– Null hypothesis (the tested hypothesis): no relation exists between the two random variables.
– Calculate the probability of observing a χ² value at least this large, assuming the hypothesis is true.
– If the probability is too small, reject the hypothesis.

Requirements
– The events are assumed to be independent and identically distributed.
– The outcomes of each event must be mutually exclusive.
– At least 5 observations per cell.
– Collect raw frequencies, not percentages.

Degrees of freedom

df = (r − 1)(c − 1), where r is the number of rows and c is the number of columns.
In this example: df = (2 − 1)(5 − 1) = 4

χ² distribution table

For df = 4, the critical value at p = 0.01 is 13.28. Since χ² = 14.03 > 13.28, we have p < 0.01 ⇒ there is a significant relation.

χ²-to-p calculators are available online.

Steps of the χ² test
– Select the significance level p_0
– Calculate χ²
– Compute the degrees of freedom: df = (r − 1)(c − 1)
– Calculate p given the χ² value (or look up the critical value χ²_0 for p_0)
– If p < p_0 (equivalently, χ² > χ²_0), reject the null hypothesis.
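In practice, scipy performs the whole test; a sketch using the footwear data (scipy.stats.chi2_contingency returns χ², p, df, and the expected table):

```python
from scipy import stats

observed = [[6, 17, 13, 9, 5],
            [13, 5, 7, 16, 9]]
chi2, p, df, expected = stats.chi2_contingency(observed)
print(round(chi2, 2), df, round(p, 4))   # 14.03 4 0.0072 -> reject at p_0 = 0.01
```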

Summary of the χ² test
– A very common method for significance testing
– Many good tutorials online
  – Ex: http://en.wikipedia.org/wiki/Chi-square_distribution

Hw4

Hw4
– Q1–Q3: kNN
– Q4: chi-square for feature selection
– Q5–Q6: the effect of feature selection on kNN
– Q7: conclusion

Q1–Q3: kNN
– The choice of k
– The choice of similarity function:
  – Euclidean distance: choose the neighbors with the smallest distances
  – Cosine similarity: choose the neighbors with the largest similarities
– Binary vs. real-valued features
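A sketch of cosine-based kNN over sparse vectors (dicts mapping feature to value); the hw4 data format may differ:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(x, train, k=5):
    """train: list of (vector, label) pairs. With cosine we keep the k
    *largest* similarities; with Euclidean distance, the k *smallest*."""
    neighbors = sorted(train, key=lambda ex: cosine(x, ex[0]), reverse=True)[:k]
    votes = [label for _, label in neighbors]
    return max(set(votes), key=votes.count)             # majority vote
```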

Q4–Q6
– Rank features by chi-square score
– Remove low-ranked (non-relevant) features from the vector files
– Run kNN on the newly processed data
– Compare the results with and without feature selection
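A sketch of the filtering step, assuming `scores` maps each feature to its chi-square score:

```python
def filter_vectors(examples, scores, top_n):
    """Keep only the top-N features by chi-square score in every vector."""
    kept = set(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return [({t: v for t, v in vec.items() if t in kept}, label)
            for vec, label in examples]
```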