Feature Selection
Advanced Statistical Methods in NLP (Ling 572)
January 24, 2012
Roadmap
- Feature representations: features in attribute-value matrices
- Motivation: text classification
- Managing features:
  - General approaches
  - Feature selection techniques
  - Feature scoring measures
  - Alternative feature weighting
- Chi-squared feature selection
Representing Input: Attribute-Value Matrix
[Attribute-value matrix example: each row is an instance x1 = Doc1, x2 = Doc2, ..., xn; each column is a feature f1 (Currency), f2 (Country), ..., fm (Date); the final column gives the label (Spam / NotSpam).]
Choosing features:
- Define features, e.g. with feature templates
- Instantiate features (a small sketch follows this slide)
- Perform dimensionality reduction
Weighting features: increase/decrease feature importance
- Global feature weighting: weight a whole column
- Local feature weighting: weight individual cells, conditioned on context
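To make the template/instantiation distinction concrete, here is a minimal sketch (not from the original slides) that instantiates a single "word" feature template over a toy labeled corpus and builds the attribute-value rows; the documents and labels are invented for illustration.

```python
from collections import Counter

# Toy labeled corpus (invented for illustration).
docs = [
    ("send us your bank account number", "Spam"),
    ("meeting moved to friday afternoon", "NotSpam"),
    ("claim your free prize account now", "Spam"),
]

# Feature template: one feature per word type.
# Instantiation: collect every word seen in the training data.
vocab = sorted({w for text, _ in docs for w in text.split()})

# Build the attribute-value matrix: one row per document,
# one column per instantiated feature, plus the label.
rows = []
for text, label in docs:
    counts = Counter(text.split())
    row = {f"word={w}": counts[w] for w in vocab}  # 0 if the word is absent
    row["LABEL"] = label
    rows.append(row)

for row in rows:
    print(row)
```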
Feature Selection Example
Task: text classification
- Feature template definition: word (just one template)
- Feature instantiation: words from the training (and test?) data
- Feature selection: stopword removal, i.e. remove the top K (~100) highest-frequency words, such as the, a, have, is, to, for, ... (a small sketch follows)
- Feature weighting: apply tf*idf weighting (tf = term frequency; idf = inverse document frequency)
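A minimal sketch of the frequency-based stopword filtering step described above; the toy corpus, the cutoff K, and the helper name are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

def remove_top_k_words(tokenized_docs, k=100):
    """Drop the k words with the highest document frequency (a crude stopword list)."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))          # document frequency, not raw counts
    stopwords = {w for w, _ in df.most_common(k)}
    return [[t for t in tokens if t not in stopwords] for tokens in tokenized_docs]

# Example on toy data: K is set low because the corpus is tiny.
docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
print(remove_top_k_words(docs, k=1))
```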
The Curse of Dimensionality
Think of the instances as vectors of features: # of features = # of dimensions.
The number of features is potentially enormous: the number of word types keeps growing with corpus size.
High dimensionality is problematic:
- Leads to data sparseness: hard to build a valid model, hard to predict and generalize (think kNN)
- Leads to high computational cost
- Leads to difficulty with estimation/learning: more dimensions means more samples are needed to learn the model
Breaking the Curse
Dimensionality reduction: produce a representation with fewer dimensions, but with comparable performance.
More formally: given an original feature set r, create a new set r' with |r'| < |r| and comparable performance.
Functionally, many ML algorithms do not scale well in the number of features:
- Expensive: training and testing cost
- Poor prediction: overfitting, sparseness
Dimensionality Reduction
Given an initial feature set r, create a feature set r' s.t. |r'| < |r|.
Approaches:
- r' is the same for all classes (aka global), vs. r' is different for each class (aka local)
- Feature selection/filtering, vs. feature mapping (aka extraction)
Feature Selection
Feature selection: r' is a subset of r. How can we pick features?
Extrinsic 'wrapper' approaches:
- For each subset of features: build and evaluate a classifier for the task
- Pick the subset of features with the best performance
Intrinsic 'filtering' methods:
- Use some intrinsic (statistical?) measure
- Pick the features with the highest scores
A sketch of a greedy wrapper search follows.
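A minimal, classifier-agnostic sketch of the wrapper idea (not from the slides): greedy forward selection in place of the intractable search over all 2^|r| subsets, with `evaluate` standing in for whatever train-and-score routine the task uses. The function name and the greedy strategy are illustrative assumptions.

```python
def greedy_forward_selection(features, evaluate, max_features=None):
    """Greedy wrapper: repeatedly add the single feature that most improves
    evaluate(subset); stop when no addition helps (or max_features is reached).
    `evaluate` takes a set of features and returns a task score (higher = better)."""
    selected, best_score = set(), float("-inf")
    limit = max_features or len(features)
    while len(selected) < limit:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        scored = [(evaluate(selected | {f}), f) for f in candidates]
        score, feat = max(scored)
        if score <= best_score:
            break                      # no candidate improves performance
        selected.add(feat)
        best_score = score
    return selected, best_score

# Toy usage: pretend the "classifier" just rewards two specific features.
useful = {"f_currency", "f_country"}
score_fn = lambda subset: len(subset & useful) - 0.01 * len(subset)
print(greedy_forward_selection({"f_currency", "f_country", "f_date", "f_len"}, score_fn))
```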
Feature Selection
Wrapper approach:
- Pros: easy to understand and implement; clear relationship between the selected features and task performance
- Cons: computationally intractable: 2^|r| subsets x (training + testing cost); specific to task and classifier; ad hoc
Filtering approach:
- Pros: theoretical basis; less task- and classifier-specific
- Cons: doesn't always boost task performance
Feature Mapping
Feature mapping (extraction) approaches: the features in the r' representation are combinations/transformations of the features in r.
Motivation: many words are near-synonyms but are treated as unrelated; map them to a new concept representing all of them, e.g. big, large, huge, gigantic, enormous -> the concept of 'bigness'.
Examples:
- Term classes, e.g. class-based n-grams, derived from term clusters
- Dimensions in Latent Semantic Analysis (LSA/LSI): the result of a Singular Value Decomposition (SVD) of the matrix, which produces the 'closest' rank-|r'| approximation of the original (a small SVD sketch follows)
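A minimal numpy sketch of the LSA-style mapping mentioned above (not from the slides); the toy term-document matrix and the choice of k are illustrative assumptions.

```python
import numpy as np

# Toy term-document count matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],   # "big"
    [1, 0, 2, 0],   # "large"
    [0, 3, 0, 1],   # "meeting"
    [0, 1, 0, 2],   # "friday"
], dtype=float)

k = 2                                   # number of latent dimensions to keep
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation of A (the 'closest' rank-k matrix in the least-squares sense).
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]

# Documents mapped into the k-dimensional latent space (one column per document).
doc_vectors = np.diag(S[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))
print(np.round(doc_vectors, 2))
```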
Feature Mapping: Pros and Cons
Pros:
- Data-driven
- Theoretical basis: guarantees on matrix similarity
- Not bound by the initial feature space
Cons:
- Some ad hoc factors, e.g. the number of dimensions
- The resulting feature space can be hard to interpret
Feature Filtering
Filtering approaches apply a scoring method to features to rank their informativeness or importance w.r.t. some class.
Fairly fast and classifier-independent.
Many different measures:
- Mutual information
- Information gain
- Chi-squared
- etc.
Feature Scoring Measures
Basic Notation, Distributions
Assume a binary representation of terms and classes: t_k is a term in T; c_i is a class in C.
P(t_k): proportion of documents in which t_k appears.
P(c_i): proportion of documents of class c_i.
Since the representation is binary, the complementary probabilities follow directly (a reconstruction is given below).
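The equation on the original slide did not carry over; a plausible reconstruction of the "binary" identities it refers to:

```latex
P(\bar{t}_k) = 1 - P(t_k), \qquad P(\bar{c}_i) = 1 - P(c_i)
```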
Setting Up
The measures below are computed from a 2x2 term-class contingency table of document counts (the corresponding probability estimates are reconstructed after this slide):

            !c_i    c_i
   !t_k      a       b
    t_k      c       d

(a = # docs containing neither t_k nor c_i; b = # docs of class c_i without t_k; c = # docs containing t_k but not of class c_i; d = # docs of class c_i containing t_k.)
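The original "Setting Up" slides define the relevant probabilities from these counts, but the equations are images that did not survive; the standard maximum-likelihood estimates, under the cell layout above, are:

```latex
N = a + b + c + d, \qquad
P(t_k) = \frac{c + d}{N}, \qquad
P(c_i) = \frac{b + d}{N}, \qquad
P(t_k, c_i) = \frac{d}{N}, \qquad
P(t_k \mid c_i) = \frac{d}{b + d}
```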
Feature Selection Functions
Question: what makes a good feature?
Perspective: the best features are those that are most DIFFERENTLY distributed across classes, i.e. the features that most effectively differentiate between classes.
Term Selection Functions: DF
Document frequency (DF): the number of documents in which t_k appears.
Applying DF: remove terms with DF below some threshold (a small sketch follows).
Intuition: very rare terms won't help with categorization, or are not useful globally.
Pros: easy to implement, scalable.
Cons: ad hoc; low-DF terms can be highly 'topical'.
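A minimal sketch of DF thresholding (not from the slides); the threshold value and the toy data are illustrative.

```python
from collections import Counter

def df_filter(tokenized_docs, min_df=2):
    """Keep only terms that appear in at least min_df documents."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    return {term for term, count in df.items() if count >= min_df}

docs = [["free", "prize", "now"], ["meeting", "now"], ["free", "meeting"]]
print(df_filter(docs, min_df=2))   # {'free', 'meeting', 'now'}
```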
Term Selection Functions: MI
Pointwise Mutual Information (MI): MI = 0 if t and c are independent.
Issue: MI can be heavily influenced by the marginals, which is a problem when comparing terms of differing frequencies.
A reconstruction of the formula follows.
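The MI formula itself is an image in the original deck; the standard pointwise-MI definition, in the notation of the contingency table above, is:

```latex
\mathrm{MI}(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)}
                      \approx \log \frac{d \cdot N}{(c + d)(b + d)}
```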
Term Selection Functions: IG
Information Gain. Intuition: when transmitting Y, how many bits can we save if we know X?
IG(Y, X_i) = H(Y) - H(Y | X_i)
Information Gain: Derivation (from F. Xia, '11)
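The derivation on the original slide is an equation image that did not survive; the standard form of information gain for a binary term t (e.g. Yang & Pedersen, 1997) is reproduced here as a reconstruction:

```latex
IG(t) = -\sum_{i} P(c_i)\log P(c_i)
        + P(t)\sum_{i} P(c_i \mid t)\log P(c_i \mid t)
        + P(\bar{t})\sum_{i} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})
```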
More Feature Selection
- GSS coefficient
- NGL coefficient (N = # of docs)
- Chi-square
(From F. Xia, '11; reconstructions of the formulas follow.)
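The coefficient definitions are images in the original deck; the standard forms from the text-categorization literature (e.g. Sebastiani, 2002), written in the notation of the 2x2 table above, are given here as a reconstruction:

```latex
GSS(t_k, c_i) = P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)
              = \frac{ad - bc}{N^2}

NGL(t_k, c_i) = \frac{\sqrt{N}\,(ad - bc)}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}

\chi^2(t_k, c_i) = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
```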
More Term Selection
- Relevancy score
- Odds ratio
(From F. Xia, '11; a reconstruction of the odds ratio follows.)
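The formulas here are also images in the original; the standard odds-ratio definition (e.g. Sebastiani, 2002), given as a reconstruction in the notation above, is:

```latex
OR(t_k, c_i) = \frac{P(t_k \mid c_i)\,\bigl(1 - P(t_k \mid \bar{c}_i)\bigr)}
                    {\bigl(1 - P(t_k \mid c_i)\bigr)\,P(t_k \mid \bar{c}_i)}
             = \frac{ad}{bc}
```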
Global Selection
The previous measures compute class-specific scores. What if you want to filter across ALL classes? Compute an aggregate measure across classes: a sum, an average, or a max. (From F. Xia, '11; reconstructions follow.)
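The three combination formulas are images in the original deck; the usual definitions (e.g. Yang & Pedersen, 1997; Sebastiani, 2002) for a class-specific score f(t_k, c_i) are, as a reconstruction:

```latex
f_{\mathrm{sum}}(t_k) = \sum_{i=1}^{|C|} f(t_k, c_i), \qquad
f_{\mathrm{avg}}(t_k) = \sum_{i=1}^{|C|} P(c_i)\, f(t_k, c_i), \qquad
f_{\mathrm{max}}(t_k) = \max_{i=1}^{|C|} f(t_k, c_i)
```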
What's the best?
Answer: it depends on the classifier, the type of data, etc.
According to (Yang and Pedersen, 1997): {OR, NGL, GSS} > {X^2_max, IG_sum} > {#_avg} >> {MI}, on text classification tasks, using kNN. (From F. Xia, '11)
Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0, 1}
- Term frequency (tf): # occurrences of t_k in document d_i
- Inverse document frequency (idf): idf_k = log(N / (1 + df_k)), where df_k = # of docs in which t_k appears and N = # of docs
- tf-idf: tf * idf (a small sketch follows)
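A minimal sketch of the tf*idf weighting above, using the idf variant on the slide, log(N / (1 + df)); the toy documents are illustrative.

```python
import math
from collections import Counter

def tfidf_weights(tokenized_docs):
    """Return per-document {term: tf * idf} dicts, with idf = log(N / (1 + df))."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    idf = {t: math.log(n_docs / (1 + df[t])) for t in df}
    weighted = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        weighted.append({t: tf[t] * idf[t] for t in tf})
    return weighted

docs = [["free", "prize", "prize"], ["meeting", "friday"], ["free", "meeting"]]
for w in tfidf_weights(docs):
    print({t: round(v, 3) for t, v in w.items()})
```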
Chi Square
Tests for the presence/absence of a relation between random variables.
A bivariate analysis: tests 2 random variables.
Can test the strength of the relationship; (strictly speaking) doesn't test direction.
Chi Square Example
Can gender predict shoe choice? A: male/female (features); B: shoe choice (classes: {sandal, sneaker, ...}). (Due to F. Xia)

            sandal  sneaker  leather shoe  boot  other
   Male        6       17         13         9     5
   Female     13        5          7        16     9
Comparing Distributions
Observed distribution (O):

            sandal  sneaker  leather shoe  boot  other
   Male        6       17         13         9     5
   Female     13        5          7        16     9

Expected distribution (E):

            sandal  sneaker  leather shoe  boot  other  Total
   Male       9.5      11         10       12.5    7      50
   Female     9.5      11         10       12.5    7      50
   Total      19       22         20        25    14     100

(Due to F. Xia)
Computing Chi Square
Expected value for a cell = row_total * column_total / table_total.
X^2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026
Calculating X^2
- Tabulate the contingency table of observed values, O
- Compute row and column totals
- Compute the table of expected values given the row/column totals, assuming no association
- Compute X^2 (a sketch follows)
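A minimal sketch of this procedure on the shoe-choice example above; it reproduces the slide's X^2 value (up to rounding).

```python
# Observed counts from the shoe-choice example (rows: male, female).
observed = [
    [6, 17, 13, 9, 5],    # male
    [13, 5, 7, 16, 9],    # female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
table_total = sum(row_totals)

# Expected cell = row_total * column_total / table_total (no-association assumption).
expected = [[r * c / table_total for c in col_totals] for r in row_totals]

# X^2 = sum over cells of (O - E)^2 / E
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
print(round(chi_square, 3))   # 14.027, matching the slide's 14.026 up to rounding
```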
For a 2x2 Table
Observed counts O:

            !c_i    c_i
   !t_k      a       b
    t_k      c       d

Expected counts E:

            !c_i            c_i             Total
   !t_k   (a+b)(a+c)/N    (a+b)(b+d)/N      a+b
    t_k   (c+d)(a+c)/N    (c+d)(b+d)/N      c+d
   Total     a+c             b+d             N
X^2 Test
Tests whether two random variables are independent.
- Null hypothesis: the 2 R.V.s are independent
- Compute the X^2 statistic
- Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); for the shoe example, df = (2-1)(5-1) = 4
- Look up the probability of the X^2 statistic value in a X^2 table
- If the probability is low (below some significance level), the null hypothesis can be rejected (a small sketch follows)
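A small sketch of the lookup step, using scipy's chi-squared survival function in place of a printed X^2 table; the 0.05 significance level is an illustrative choice.

```python
from scipy.stats import chi2

x2_statistic = 14.027      # from the shoe example above
df = (2 - 1) * (5 - 1)     # (# rows - 1) * (# cols - 1) = 4

# Probability of seeing an X^2 value at least this large under independence.
p_value = chi2.sf(x2_statistic, df)
print(round(p_value, 4))               # ~0.0072
if p_value < 0.05:
    print("Reject the null hypothesis: the variables appear dependent.")
```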
Requirements for the X^2 Test
- Events assumed independent and identically distributed
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient values per cell: > 5