Feature Selection: Advanced Statistical Methods in NLP (Ling 572, January 24, 2012)

Roadmap Feature representations: features in attribute-value matrices. Motivation: text classification. Managing features: general approaches. Feature selection techniques. Feature scoring measures. Alternative feature weighting. Chi-squared feature selection.

Representing Input: Attribute-Value Matrix. Each instance x_1 = Doc1, …, x_n = Docn is a row; the columns are features f_1 (Currency), f_2 (Country), …, f_m (Date) plus a final Label column (Spam / NotSpam), and each cell holds that document's value for that feature. Choosing features: define features (e.g. with feature templates), instantiate features, perform dimensionality reduction. Weighting features: increase/decrease feature importance. Global feature weighting: weight a whole column. Local feature weighting: weight an individual cell (conditioned on the instance).
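As a concrete (and entirely invented) illustration of such a matrix, this sketch builds word-count feature vectors plus a label column for a few toy documents in Python; the documents, labels, and names are not from the slides.

```python
from collections import Counter

# Toy training documents with labels (invented for illustration).
docs = [
    ("send us your currency account number today", "Spam"),
    ("claim your prize from our country office", "Spam"),
    ("the meeting date is set for next monday", "NotSpam"),
]

# Feature template: one feature per word type seen in training.
vocab = sorted({w for text, _ in docs for w in text.split()})

# Attribute-value matrix: one row (feature vector + label) per document.
matrix = []
for text, label in docs:
    counts = Counter(text.split())
    row = [counts[w] for w in vocab]   # feature values f_1 ... f_m
    matrix.append((row, label))        # the final "column" is the class label

print(len(vocab), "features")
print(matrix[0])
```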

Feature Selection Example Task: Text classification Feature template definition: Word – just one template Feature instantiation: Words from training (and test?) data Feature selection: Stopword removal: remove top K (~100) highest freq Words like: the, a, have, is, to, for,… Feature weighting: Apply tf*idf feature weighting tf = term frequency; idf = inverse document frequency
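A minimal sketch of the selection step in this example: instantiate word features from tokenized training documents, then drop the top K most frequent words as stopwords. The toy corpus, K value, and function name are illustrative; the tf*idf weighting step is sketched later under Feature Weighting.

```python
from collections import Counter

def select_features(tokenized_docs, top_k_stopwords=100):
    """Instantiate word features, then remove the K highest-frequency words."""
    freq = Counter(w for doc in tokenized_docs for w in doc)
    stopwords = {w for w, _ in freq.most_common(top_k_stopwords)}
    return [w for w in freq if w not in stopwords]

# Toy usage (in practice K ~ 100 and the corpus is the full training set).
docs = [["the", "offer", "is", "free"], ["the", "meeting", "is", "monday"]]
print(select_features(docs, top_k_stopwords=2))
```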

The Curse of Dimensionality Think of the instances as vectors of features: # of features = # of dimensions. The number of features is potentially enormous: # of words in a corpus continues to increase with corpus size. High dimensionality is problematic: it leads to data sparseness (hard to create a valid model; hard to predict and generalize - think kNN), leads to high computational cost, and leads to difficulty with estimation/learning (more dimensions → more samples needed to learn the model).

Breaking the Curse Dimensionality reduction: produce a representation with fewer dimensions, but with comparable performance. More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance. Functionally, many ML algorithms do not scale well: expensive (training and testing cost); poor prediction (overfitting, sparseness).

Dimensionality Reduction Given an initial feature set r, create a feature set r' s.t. |r'| < |r|. Approaches: r' the same for all classes (aka global) vs. r' different for each class (aka local); feature selection/filtering vs. feature mapping (aka extraction).

Feature Selection Feature selection: r’ is a subset of r How can we pick features? Extrinsic ‘wrapper’ approaches: For each subset of features: Build, evaluate classifier for some task Pick subset of features with best performance Intrinsic ‘filtering’ methods: Use some intrinsic (statistical?) measure Pick features with highest scores
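As a sketch of the extrinsic 'wrapper' idea just described (not code from the course), the function below enumerates feature subsets and keeps the best-scoring one; train_and_evaluate is a placeholder for whatever classifier and evaluation metric are in use, and the exhaustive loop is shown only to mirror the definition, since it is exponential in |r|.

```python
from itertools import combinations

def wrapper_select(features, train_and_evaluate):
    """Try every feature subset, keep the one with the best task performance."""
    best_subset, best_score = None, float("-inf")
    for r in range(1, len(features) + 1):
        for subset in combinations(features, r):
            score = train_and_evaluate(subset)   # build + evaluate a classifier
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```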

Feature Selection Wrapper approach: Pros: easy to understand and implement; clear relationship between selected features and task performance. Cons: computationally intractable: 2^|r| subsets * (training + testing); specific to task and classifier; ad hoc. Filtering approach: Pros: theoretical basis; less task- and classifier-specific. Cons: doesn't always boost task performance.

Feature Mapping Feature mapping (extraction) approaches: the features in r' are combinations/transformations of the features in r. Example: many words are near-synonyms but are treated as unrelated; map them to a new concept representing all of them: big, large, huge, gigantic, enormous → concept of 'bigness'. Examples: Term classes (e.g. class-based n-grams), derived from term clusters. Dimensions in Latent Semantic Analysis (LSA/LSI), the result of Singular Value Decomposition (SVD) on the term-document matrix; produces the 'closest' rank-r' approximation of the original.
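As a sketch of the SVD route mentioned above (LSA/LSI-style mapping), the snippet truncates the SVD of a small, made-up term-document matrix to r' = 2 dimensions with NumPy; the Eckart-Young property is what gives the 'closest rank-r' approximation' guarantee.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 3, 1],
              [0, 0, 1, 2]], dtype=float)

r_prime = 2                                   # target number of dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_approx = U[:, :r_prime] @ np.diag(s[:r_prime]) @ Vt[:r_prime, :]

# Documents mapped into the reduced "latent" space (r' features each).
doc_vectors = (np.diag(s[:r_prime]) @ Vt[:r_prime, :]).T
print(A_approx.round(2))
print(doc_vectors.round(2))
```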

Feature Mapping Pros: Data-driven Theoretical basis – guarantees on matrix similarity Not bound by initial feature space Cons: Some ad-hoc factors: e.g. # of dimensions Resulting feature space can be hard to interpret

Feature Filtering Filtering approaches: Applying some scoring methods to features to rank their informativeness or importance w.r.t. some class Fairly fast and classifier-independent Many different measures: Mutual information Information gain Chi-squared etc…
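A filtering pipeline reduces to 'score every feature, keep the top k'; this minimal sketch (names illustrative) takes any of the scoring measures listed above as a plug-in function.

```python
def filter_features(features, score_fn, k):
    """Rank features by an intrinsic score and keep the k best."""
    return sorted(features, key=score_fn, reverse=True)[:k]

# Usage idea: filter_features(vocab, lambda t: chi_square_score(t, target_class), k=1000)
```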

Feature Scoring Measures

Basic Notation, Distributions Assume a binary representation of terms and classes: t_k is a term in T; c_i is a class in C. P(t_k): proportion of documents in which t_k appears. P(c_i): proportion of documents of class c_i. Binary, so P(!t_k) = 1 - P(t_k) and P(!c_i) = 1 - P(c_i).

Setting Up The 2x2 contingency table of document counts for term t_k and class c_i:

          !c_i   c_i
  !t_k     a      b
   t_k     c      d

So, e.g., d = # of documents that contain t_k and belong to c_i, and N = a + b + c + d.
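Assuming documents are represented as (set of terms, class label) pairs, the a/b/c/d cells can be tallied directly; a sketch, with the cell layout matching the table above.

```python
def contingency(docs, term, cls):
    """Return (a, b, c, d): a=#(!t,!c), b=#(!t,c), c=#(t,!c), d=#(t,c)."""
    a = b = c = d = 0
    for terms, label in docs:
        has_t, in_c = term in terms, label == cls
        if not has_t and not in_c:
            a += 1
        elif not has_t and in_c:
            b += 1
        elif has_t and not in_c:
            c += 1
        else:
            d += 1
    return a, b, c, d
```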

Feature Selection Functions Question: What makes a good feature? Perspective: the best features are those that are most DIFFERENTLY distributed across classes, i.e. the features that most effectively differentiate between classes.

Term Selection Functions: DF Document frequency (DF): number of documents in which t_k appears. Applying DF: remove terms with DF below some threshold. Intuition: very rare terms won't help with categorization, or are not useful globally. Pros: easy to implement, scalable. Cons: ad hoc; low-DF terms can still be highly 'topical' (informative for particular classes).
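A DF cutoff is straightforward to implement; a sketch, where the min_df threshold is arbitrary and docs are sets of terms.

```python
from collections import Counter

def df_filter(docs, min_df=3):
    """Keep terms that appear in at least min_df documents (docs = sets of terms)."""
    df = Counter(t for terms in docs for t in set(terms))
    return {t for t, n in df.items() if n >= min_df}
```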

Term Selection Functions: MI Pointwise Mutual Information (MI): PMI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]. MI = 0 if t and c are independent. Issue: can be heavily influenced by the marginals, a problem when comparing terms of differing frequencies.
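Using the 2x2 counts from earlier (d = # of documents containing t_k with class c_i), PMI can be computed as below; a sketch with maximum-likelihood estimates, which returns -inf when the joint count is zero.

```python
import math

def pmi(a, b, c, d):
    """PMI(t, c) = log [ P(t, c) / (P(t) P(c)) ] from 2x2 counts."""
    n = a + b + c + d
    p_tc = d / n              # P(t, c)
    p_t = (c + d) / n         # P(t)
    p_c = (b + d) / n         # P(c)
    return math.log(p_tc / (p_t * p_c)) if p_tc > 0 else float("-inf")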

Term Selection Functions: IG Information Gain: Intuition: transmitting Y, how many bits can we save if we know X? IG(Y, X_i) = H(Y) - H(Y | X_i)
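Spelled out for a binary term and binary class membership, IG(C, t) = H(C) - [P(t) H(C|t) + P(!t) H(C|!t)]; a sketch computed from the same a/b/c/d counts.

```python
import math

def entropy(p):
    """Binary entropy of a Bernoulli(p) variable, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(a, b, c, d):
    """IG(C, t) = H(C) - [P(t) H(C|t) + P(!t) H(C|!t)] from 2x2 counts."""
    n = a + b + c + d
    h_c = entropy((b + d) / n)                        # H(C), C = "in class c_i"
    h_c_given_t = entropy(d / (c + d)) if c + d else 0.0
    h_c_given_not_t = entropy(b / (a + b)) if a + b else 0.0
    p_t = (c + d) / n
    return h_c - (p_t * h_c_given_t + (1 - p_t) * h_c_given_not_t)
```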

Information Gain: Derivation From F. Xia, ‘11

More Feature Selection GSS coefficient: GSS(t, c) = P(t, c) P(!t, !c) - P(t, !c) P(!t, c). NGL coefficient (N: # of docs): NGL(t, c) = sqrt(N) [P(t, c) P(!t, !c) - P(t, !c) P(!t, c)] / sqrt(P(t) P(!t) P(c) P(!c)). Chi-square: X^2(t, c) = N [P(t, c) P(!t, !c) - P(t, !c) P(!t, c)]^2 / [P(t) P(!t) P(c) P(!c)], i.e. N (ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)] in the 2x2 counts. From F. Xia, '11
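For a 2x2 table the chi-square score has a closed form in the raw counts; a sketch (this is the feature-scoring use of chi-square, not yet the hypothesis test discussed later).

```python
def chi_square_score(a, b, c, d):
    """X^2(t, c) = N (ad - bc)^2 / [(a+b)(c+d)(a+c)(b+d)] from 2x2 counts."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
```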

More Term Selection Relevancy score: Odds Ratio: From F. Xia, ‘11

Global Selection Previous measures compute class-specific selection. What if you want to filter across ALL classes? Compute an aggregate measure across classes. Sum: f_sum(t_k) = Σ_i f(t_k, c_i). Average: f_avg(t_k) = Σ_i P(c_i) f(t_k, c_i). Max: f_max(t_k) = max_i f(t_k, c_i). From F. Xia, '11
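A sketch of the three aggregation strategies, where score(t, c) stands for any class-specific measure above and class_priors supplies P(c_i); all names are illustrative.

```python
def f_sum(term, classes, score):
    return sum(score(term, c) for c in classes)

def f_avg(term, classes, score, class_priors):
    return sum(class_priors[c] * score(term, c) for c in classes)

def f_max(term, classes, score):
    return max(score(term, c) for c in classes)
```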

What's the best? Answer: it depends on the classifier, the type of data, … According to (Yang and Pedersen, 1997): {OR, NGL, GSS} > {X^2_max, IG_sum} > {#_avg} >> {MI}, on text classification tasks, using kNN. From F. Xia, '11

Feature Weighting For text classification, typical weights include: Binary: weights in {0,1}. Term frequency (tf): # occurrences of t_k in document d_i. Inverse document frequency (idf): idf_k = log(N / (1 + df_k)), where df_k = # of docs in which t_k appears and N = # of docs. tfidf = tf * idf.
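The weighting schemes above written out directly, using the slide's idf definition log(N / (1 + df_k)); a sketch with term counts as inputs.

```python
import math

def binary_weight(tf):
    return 1 if tf > 0 else 0

def idf(df_k, n_docs):
    """idf_k = log(N / (1 + df_k)), as defined above."""
    return math.log(n_docs / (1 + df_k))

def tfidf(tf, df_k, n_docs):
    return tf * idf(df_k, n_docs)
```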

Chi Square Tests for the presence/absence of a relation between random variables. Bivariate analysis: tests 2 random variables. Can test the strength of the relationship; (strictly speaking) doesn't test direction.

Chi Square Example Can gender predict shoe choice? A: male/female → features. B: shoe choice → classes: {sandal, sneaker, leather shoe, boot, other}. (The observed counts form a 2x5 table: rows male/female; columns sandal, sneaker, leather shoe, boot, other.) Due to F. Xia

Comparing Distributions Observed distribution (O): the 2x5 table of observed counts (rows male/female; columns sandal, sneaker, leather shoe, boot, other). Expected distribution (E): the same table filled in with expected counts computed from the row and column totals (each row total here is 50; the overall total is 100). Due to F. Xia

Computing Chi Square Expected value for a cell = row_total * column_total / table_total. X^2 = (6 - 9.5)^2 / 9.5 + (17 - 11)^2 / 11 + …

Calculating X^2 Tabulate the contingency table of observed values O. Compute row and column totals. Compute the table of expected values E, given the row/column totals (assuming no association). Compute X^2.
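A sketch of that procedure for a general R x C table; only the observed cells 6 and 17 (and the row totals of 50) echo the shoe example, the remaining counts are invented so the totals come out right.

```python
def chi_square_statistic(observed):
    """Chi-square statistic for an R x C table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count
            x2 += (o - e) ** 2 / e
    return x2

# 2x5 shoe-style table: first two male cells (6, 17) are from the worked
# example; the other counts are invented so each row totals 50.
print(chi_square_statistic([[6, 17, 10, 9, 8],
                            [13, 5, 11, 10, 11]]))
```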

For a 2x2 Table O (observed counts):

          !c_i   c_i
  !t_k     a      b
   t_k     c      d

E (expected counts, from row and column totals):

          !c_i            c_i             Total
  !t_k    (a+b)(a+c)/N    (a+b)(b+d)/N    a+b
   t_k    (c+d)(a+c)/N    (c+d)(b+d)/N    c+d
  Total   a+c             b+d             N

X^2 Test Tests whether two random variables are independent. Null hypothesis: the 2 R.V.s are independent. Compute the X^2 statistic: X^2 = Σ (O - E)^2 / E over all cells. Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); for the shoe example, df = (2-1)(5-1) = 4. Look up the probability of that X^2 value in a X^2 table. If the probability is low (below some significance level), we can reject the null hypothesis.
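To turn the statistic into a decision, compare it against the chi-square distribution with the matching degrees of freedom; a sketch assuming SciPy is available (scipy.stats.chi2.sf gives the upper-tail probability), with a placeholder statistic value.

```python
from scipy.stats import chi2

def chi_square_decision(x2_stat, n_rows, n_cols, alpha=0.05):
    """Reject independence if the p-value falls below the significance level."""
    df = (n_rows - 1) * (n_cols - 1)
    p_value = chi2.sf(x2_stat, df)          # P(X^2 >= x2_stat) under H0
    return p_value, p_value < alpha

# Shoe example shape: df = (2-1)(5-1) = 4; 10.2 is a placeholder statistic.
print(chi_square_decision(10.2, n_rows=2, n_cols=5))
```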

Requirements for the X^2 Test Events assumed independent and identically distributed. Outcomes must be mutually exclusive. Use raw frequencies, not percentages. Sufficient values per cell: > 5.