Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Feature selection may make using a particular classifier feasible: some classifiers can’t deal with 100,000s of features.
- It reduces training time: training time for some methods is quadratic or worse in the number of features.
- It can improve generalization (performance): it eliminates noise features and avoids overfitting.
- Think of the two NB models (multivariate Bernoulli and multinomial). (IIR 13.5)
Basic feature selection algorithm
For a given class c, we compute a utility measure A(t, c) for each term t of the vocabulary, then select the k terms that have the highest values of A(t, c).

SELECTFEATURES(D, c, k)
  V ← EXTRACTVOCABULARY(D)
  L ← []
  for each t ∈ V
  do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
     APPEND(L, ⟨A(t, c), t⟩)
  return FEATURESWITHLARGESTVALUES(L, k)
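The selection loop can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the `(tokens, label)` document layout, the `doc_frequency` utility, and the toy documents are assumptions made for the example. Any utility function A(t, c) could be passed in instead.

```python
from typing import Callable, Iterable

# Hypothetical document layout: (list of tokens, class label).
Document = tuple[list[str], str]


def select_features(
    docs: Iterable[Document],
    c: str,
    k: int,
    utility: Callable[[list[Document], str, str], float],
) -> list[str]:
    """Return the k terms with the highest utility A(t, c)."""
    docs = list(docs)
    vocabulary = {t for tokens, _ in docs for t in tokens}
    scored = [(utility(docs, t, c), t) for t in vocabulary]
    # Sort by descending utility; break ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:k]]


def doc_frequency(docs: list[Document], t: str, c: str) -> float:
    """Frequency-based utility: number of documents in class c that contain t."""
    return float(sum(1 for tokens, label in docs if label == c and t in tokens))


# Toy data, invented for illustration only.
toy_docs = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "shanghai"], "china"),
    (["tokyo", "japan", "chinese"], "other"),
]
top_terms = select_features(toy_docs, "china", 2, doc_frequency)
```

Passing the utility as a parameter keeps the selection loop identical for every utility measure.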
Feature selection: how?
Three utility measures:
- Information theory: how much information does the value of one categorical variable give you about the value of another? Measure: mutual information.
- Hypothesis testing statistics: are we confident that the value of one categorical variable is associated with the value of another? Measure: chi-square test.
- Frequency. (IIR 13.5)
Feature selection via Mutual Information
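In the training set, choose the k words that best discriminate (give the most information about) the categories.
The mutual information between a word and a class, computed for each word w and each category c, is:
\[ I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} P(U = e_t, C = e_c) \log_2 \frac{P(U = e_t, C = e_c)}{P(U = e_t)\, P(C = e_c)} \]
where U indicates whether a document contains the word (e_t = 1) or not (e_t = 0), and C indicates whether the document is in the category (e_c = 1) or not (e_c = 0).
For MLEs of the probabilities, with N_{e_t e_c} the number of documents that have the given indicator values and N the total number of documents:
\[ I(U; C) = \sum_{e_t \in \{1,0\}} \sum_{e_c \in \{1,0\}} \frac{N_{e_t e_c}}{N} \log_2 \frac{N \, N_{e_t e_c}}{N_{e_t \cdot}\, N_{\cdot e_c}} \]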
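As a sketch of the MLE computation, the function below evaluates mutual information from the four document counts of a term/class contingency table. The argument names and the example counts are illustrative assumptions: `n11` counts documents that contain the term and are in the class, `n10` those that contain the term but are not in the class, `n01` those in the class without the term, and `n00` the rest.

```python
import math


def mutual_information(n11: int, n10: int, n01: int, n00: int) -> float:
    """Mutual information I(U; C) in bits, using MLE probabilities from counts."""
    n = n11 + n10 + n01 + n00
    # Each entry: (joint count, term marginal, class marginal).
    cells = [
        (n11, n11 + n10, n11 + n01),  # term present, in class
        (n10, n11 + n10, n10 + n00),  # term present, not in class
        (n01, n01 + n00, n11 + n01),  # term absent, in class
        (n00, n01 + n00, n10 + n00),  # term absent, not in class
    ]
    total = 0.0
    for joint, term_marginal, class_marginal in cells:
        if joint > 0:  # a zero cell contributes nothing (0 * log 0 := 0)
            total += (joint / n) * math.log2(n * joint / (term_marginal * class_marginal))
    return total


# Illustrative counts for a rare class and a moderately frequent term.
mi_example = mutual_information(49, 27652, 141, 774106)
# Term distributed identically inside and outside the class: MI is 0.
mi_independent = mutual_information(10, 10, 10, 10)
# Term present if and only if the document is in the class: MI is maximal.
mi_perfect = mutual_information(5, 0, 0, 5)
```

With equal class sizes, the perfect indicator yields 1 bit, which is the entropy of the class variable.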
Feature selection via Mutual Information (example)
Feature selection via Mutual Information
Mutual information measures how much information, in the information-theoretic sense, a term contains about the class. If a term’s distribution is the same in the class as it is in the collection as a whole, then I(U; C) = 0. MI reaches its maximum value if the term is a perfect indicator for class membership, that is, if the term is present in a document if and only if the document is in the class.
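χ2 statistic
In statistics, the χ2 test is applied to test the independence of two events. In feature selection, the two events are occurrence of the term and occurrence of the class:
\[ \chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}} \]
where N is the observed count and E the expected count under independence.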
χ2 statistic (example)
χ2 is a measure of how much expected counts E and observed counts N deviate from each other. A high value of χ2 indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect.
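A short sketch of the statistic, computed as the sum over the four cells of (observed − expected)² / expected, may help. The count names and the example counts are assumptions for illustration (`n11` = term present and in class, and so on), and the sketch assumes every marginal total is nonzero.

```python
def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_totals = {1: n11 + n10, 0: n01 + n00}    # term present / absent
    class_totals = {1: n11 + n01, 0: n10 + n00}   # in class / not in class
    chi2 = 0.0
    for (e_t, e_c), obs in observed.items():
        # Expected count if term and class were independent.
        expected = term_totals[e_t] * class_totals[e_c] / n
        chi2 += (obs - expected) ** 2 / expected
    return chi2


# Illustrative counts: the term is far more common in the class than expected.
chi2_example = chi_square(49, 27652, 141, 774106)
# Observed counts equal expected counts: the statistic is 0.
chi2_independent = chi_square(10, 10, 10, 10)
```

For one degree of freedom, a value above 10.83 rejects independence at the 0.001 level, so the example counts indicate a strong association.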
Frequency-based feature selection
- Selects the terms that are most common in the class.
- Frequency can be defined either as document frequency (the number of documents in the class c that contain the term t) or as collection frequency (the number of tokens of t that occur in documents in c).
Discussion:
- Frequency-based feature selection selects some frequent terms that have no specific information about the class, for example the days of the week (Monday, Tuesday, ...), which are frequent across classes in newswire text.
- When many thousands of features are selected, frequency-based feature selection often does well.
- If somewhat suboptimal accuracy is acceptable, frequency-based feature selection can be a good alternative to more complex methods.
Comparison of feature selection methods
- χ2 selects more rare terms than mutual information does.
- The independence of term t and class c can sometimes be rejected with high confidence even if t carries little information about membership of a document in c.
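- All three methods (MI, χ2, and frequency-based) are greedy methods: they may select features that contribute no incremental information over previously selected features.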
Feature selection for NB
In general, feature selection is necessary for multivariate Bernoulli NB. “Feature selection” really means something different for multinomial NB: it means dictionary truncation.
Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Accuracy is adequate if there is one class per document; otherwise, use the F measure for each class.
- Results can vary based on sampling error due to different training and test sets. (IIR 13.6)
Classifier Accuracy Measures
Confusion matrix (rows: true class, columns: class assigned by the classifier):

              C1               C2
C1            True positive    False negative
C2            False positive   True negative

Example:

classes               buy_computer = yes   buy_computer = no   total   recognition (%)
buy_computer = yes    6954                 46                  7000    99.34
buy_computer = no     412                  2588                3000    86.27
total                 7366                 2634                10000   95.42

- Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M.
- Error rate (misclassification rate) of M = 1 − acc(M).
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j.
- Alternative accuracy measures (e.g., for cancer diagnosis):
  sensitivity = t-pos/pos   /* true positive recognition rate */
  specificity = t-neg/neg   /* true negative recognition rate */
  precision = t-pos/(t-pos + f-pos)
  accuracy = sensitivity * pos/(pos + neg) + specificity * neg/(pos + neg)
- This model can also be used for cost-benefit analysis.
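The measures above can be checked against the buy_computer table with a few lines of Python. This is a sketch. The function name and the dictionary keys are chosen here for illustration, and buy_computer = yes is treated as the positive class.

```python
def classifier_measures(t_pos: int, f_neg: int, f_pos: int, t_neg: int) -> dict[str, float]:
    """Compute the accuracy measures from the four confusion-matrix cells."""
    pos = t_pos + f_neg   # all truly positive tuples
    neg = f_pos + t_neg   # all truly negative tuples
    sensitivity = t_pos / pos
    specificity = t_neg / neg
    precision = t_pos / (t_pos + f_pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
    }


# Cells taken from the buy_computer confusion matrix.
measures = classifier_measures(t_pos=6954, f_neg=46, f_pos=412, t_neg=2588)
```

The weighted form of accuracy reduces to (t-pos + t-neg)/(pos + neg), which is why it reproduces the 95.42% in the bottom-right cell of the table.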
Evaluating the Accuracy of a Classifier
Holdout method
- The given data is randomly partitioned into two independent sets: a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation.
Cross-validation (k-fold, where k = 10 is most popular)
- Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of approximately equal size.
- At the i-th iteration, use Di as the test set and the others as the training set.
- Leave-one-out: k folds where k = number of tuples, for small data sets.
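The k-fold partitioning can be sketched as below. The function name, the fixed seed, and the choice to split index positions rather than the tuples themselves are assumptions made for the example.

```python
import random
from typing import Iterator


def k_fold_splits(n: int, k: int, seed: int = 0) -> Iterator[tuple[list[int], list[int]]]:
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 10-fold cross-validation over 20 tuples.
splits = list(k_fold_splits(n=20, k=10))
# Leave-one-out is the special case k = n.
leave_one_out = list(k_fold_splits(n=5, k=5))
```

Each tuple appears in exactly one test set, so averaging the k test accuracies uses every tuple once for evaluation.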
Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap
- Works well with small data sets.
- Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
- There are several bootstrap methods; a common one is the .632 bootstrap.
  - Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set.
  - About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set, since (1 − 1/d)^d ≈ e^{−1} ≈ 0.368.
  - Repeat the sampling procedure k times; the overall accuracy of the model is:
\[ acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set} \right) \]
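The 63.2% / 36.8% split can be observed empirically with a small simulation. The sketch below samples indices rather than tuples, and its `bootstrap_632_accuracy` helper assumes the usual .632 weighting of test and training accuracy averaged over the k rounds. Function names, the seed, and the accuracy values passed in are illustrative.

```python
import random


def bootstrap_split(d: int, rng: random.Random) -> tuple[list[int], list[int]]:
    """Sample d indices with replacement; unsampled indices form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test


def bootstrap_632_accuracy(acc_test: list[float], acc_train: list[float]) -> float:
    """Average of 0.632 * test accuracy + 0.368 * training accuracy over k rounds."""
    pairs = list(zip(acc_test, acc_train))
    return sum(0.632 * t + 0.368 * r for t, r in pairs) / len(pairs)


rng = random.Random(42)
d = 10_000
train, test = bootstrap_split(d, rng)
unique_fraction = len(set(train)) / d   # should be close to 0.632
test_fraction = len(test) / d           # should be close to 0.368

# Made-up per-round accuracies, only to show the weighting.
combined = bootstrap_632_accuracy(acc_test=[0.8], acc_train=[1.0])
```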
Naïve Bayes on spam email
This graph is from “Naive-Bayes vs. Rule-Learning in Classification of Email” by Jefferson Provost, UT Austin. (IIR 13.6)
Violation of NB Assumptions
Conditional independence Examples?
Example: Sensors
Two sensors, M1 and M2, report whether it is raining (r) or sunny (s).

Reality (the two sensors always agree):
P(+,+,r) = 3/8   P(-,-,r) = 1/8
P(+,+,s) = 1/8   P(-,-,s) = 3/8

NB model: Raining? is the parent of M1 and M2.
NB factors: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
Predictions:
P(r,+,+) = (½)(¾)(¾)
P(s,+,+) = (½)(¼)(¼)
P(r|+,+) = 9/10
P(s|+,+) = 1/10
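To see how far the NB prediction is from the truth, the posterior can be computed both ways with exact fractions. The sketch below uses only the probabilities listed for the sensor example; the variable names are chosen here for illustration.

```python
from fractions import Fraction as F

# True joint distribution over (M1, M2, weather) from the "Reality" table.
joint = {
    ("+", "+", "r"): F(3, 8),
    ("-", "-", "r"): F(1, 8),
    ("+", "+", "s"): F(1, 8),
    ("-", "-", "s"): F(3, 8),
}

# True posterior of rain given that both sensors read "+".
true_posterior_rain = joint[("+", "+", "r")] / (
    joint[("+", "+", "r")] + joint[("+", "+", "s")]
)

# NB factors: class prior and per-sensor likelihood of a "+" reading.
prior = {"r": F(1, 2), "s": F(1, 2)}
p_plus = {"r": F(3, 4), "s": F(1, 4)}

# NB multiplies the likelihood once per sensor, as if the sensors were independent.
nb_score = {w: prior[w] * p_plus[w] * p_plus[w] for w in prior}
nb_posterior_rain = nb_score["r"] / sum(nb_score.values())
```

The true posterior is 3/4, while NB reports 9/10: the second, perfectly correlated sensor is counted as fresh evidence. The predicted class is still correct, only the probability is overconfident.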
Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate. Correct probability estimation implies accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (we just need the right ordering of probabilities).
Naive Bayes is Not So Naive
- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a direct mail response prediction model for the financial services industry, predicting whether the recipient of mail will actually respond to the advertisement (750,000 records).
- Robust to irrelevant features: irrelevant features cancel each other without affecting results. Decision trees, in contrast, can suffer heavily from this.
- Very good in domains with many equally important features. Decision trees suffer from fragmentation in such cases, especially with little data.
- A good, dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes optimal classifier for the problem.
- Very fast: learning takes one pass of counting over the data; testing is linear in the number of attributes and the document collection size.
- Low storage requirements.
Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent.
- It is a graphical model of causal relationships: it represents dependency among the variables and gives a specification of the joint probability distribution.
- Nodes: random variables. Links: dependency.
- Example graph over X, Y, Z, P: X and Y are the parents of Z, and Y is the parent of P. There is no dependency between Z and P.
- The graph has no loops or cycles.
Bayesian Belief Network: An Example
Nodes in the example network: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, Dyspnea.

The conditional probability table (CPT) for the variable LungCancer:

        (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
LC      0.8       0.5        0.7        0.1
~LC     0.2       0.5        0.3        0.9

The CPT shows the conditional probability for each possible combination of values of the variable’s parents.

Derivation of the probability of a particular combination of values x_1, ..., x_n of X from the CPTs:
\[ P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid \mathrm{Parents}(X_i)) \]
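A small sketch shows how a joint probability is read off the CPTs as a product of one factor per node. The LungCancer CPT values are the ones from the table. The prior probabilities `P_FH` and `P_S` are hypothetical numbers invented for this example, and the sketch assumes FamilyHistory and Smoker have no parents of their own.

```python
from itertools import product

# Hypothetical priors for the two parent nodes (not given in the slides).
P_FH = 0.1
P_S = 0.3

# P(LungCancer = true | FamilyHistory, Smoker), from the CPT.
P_LC_GIVEN = {
    (True, True): 0.8,
    (True, False): 0.5,
    (False, True): 0.7,
    (False, False): 0.1,
}


def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH = fh, S = s, LC = lc) as the product of each node given its parents."""
    p_fh = P_FH if fh else 1 - P_FH
    p_s = P_S if s else 1 - P_S
    p_lc_true = P_LC_GIVEN[(fh, s)]
    p_lc = p_lc_true if lc else 1 - p_lc_true
    return p_fh * p_s * p_lc


# The joint distribution must sum to 1 over all value combinations.
total = sum(joint(*values) for values in product([True, False], repeat=3))
# Marginal probability of lung cancer under the assumed priors.
p_lung_cancer = sum(joint(fh, s, True) for fh, s in product([True, False], repeat=2))
```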
Resources
- IIR 13
- Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
- Yiming Yang and Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
- Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.
- Tom Mitchell. Machine Learning. McGraw-Hill, 1997. Clear, simple explanation of Naïve Bayes.
- Open Calais: Automatic Semantic Tagging. Free (but they can keep your data), provided by Thomson Reuters.
- Weka: a data mining software package that includes an implementation of Naive Bayes.
- Reuters-21578: the most famous text classification evaluation set and still widely used by lazy people (but now it’s too small for realistic experiments; you should use Reuters RCV1).
Classification by decision tree induction
Decision Tree Induction: Training Dataset
This follows an example of Quinlan’s ID3
Output: A Decision Tree for “buys_computer”
age?
  <=30:   student?
            no:  no
            yes: yes
  31..40: yes
  >40:    credit rating?
            excellent: no
            fair:      yes
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
Attribute Selection Measure: Information Gain (ID3/C4.5)
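- Select the attribute with the highest information gain.
- Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}|/|D|.
- Expected information (entropy) needed to classify a tuple in D:
\[ Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \]
- Information needed (after using A to split D into v partitions) to classify D:
\[ Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j) \]
- Information gained by branching on attribute A:
\[ Gain(A) = Info(D) - Info_A(D) \]
- In the older notation, I is the expected information needed to classify a given sample, written Info(D) here, and E (entropy) is the expected information based on the partitioning into subsets by A, written Info_A(D) here.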
Attribute Selection: Information Gain
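Class P: buys_computer = “yes” (9 tuples). Class N: buys_computer = “no” (5 tuples).
\[ Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \]
\[ Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694 \]
Here \frac{5}{14} I(2,3) means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
\[ Gain(age) = Info(D) - Info_{age}(D) = 0.246 \]
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048.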
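The age example can be reproduced with a few lines of Python. This is a sketch: the function names are chosen here, and the class counts for the “31..40” and “>40” groups (4 yes / 0 no and 3 yes / 2 no) are assumed from the usual version of the buys_computer example, since only the “<=30” group is spelled out in the text.

```python
import math


def info(counts: list[int]) -> float:
    """Entropy in bits of a class distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def info_gain(parent_counts: list[int], partitions: list[list[int]]) -> float:
    """Information gained by splitting the parent into the given partitions."""
    total = sum(parent_counts)
    info_after = sum(sum(p) / total * info(p) for p in partitions)
    return info(parent_counts) - info_after


# [yes, no] counts: whole data set, then one entry per age group.
info_d = info([9, 5])
gain_age = info_gain([9, 5], [[2, 3], [4, 0], [3, 2]])
```

Computed without intermediate rounding, the gain is about 0.2467; subtracting the rounded values 0.940 and 0.694 gives 0.246.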
Gain Ratio for Attribute Selection (C4.5)
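- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
\[ SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \]
  GainRatio(A) = Gain(A)/SplitInfo(A)
- Ex. income splits the 14 tuples into partitions of 4, 6, and 4, so SplitInfo_income(D) = 1.557 and gain_ratio(income) = 0.029/1.557 = 0.019.
- The attribute with the maximum gain ratio is selected as the splitting attribute.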
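A quick sketch of the normalization, assuming SplitInfo is the entropy of the partition sizes. The partition sizes 4, 6, and 4 for income (low, medium, high) are taken from the usual version of the buys_computer data and are an assumption here. For those sizes SplitInfo evaluates to about 1.557, which gives a gain ratio of about 0.019 for a gain of 0.029.

```python
import math


def split_info(sizes: list[int]) -> float:
    """Entropy of the partition sizes produced by splitting on an attribute."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)


def gain_ratio(gain: float, sizes: list[int]) -> float:
    """Information gain normalized by the split information."""
    return gain / split_info(sizes)


income_sizes = [4, 6, 4]            # low, medium, high (assumed counts)
income_split_info = split_info(income_sizes)
income_gain_ratio = gain_ratio(0.029, income_sizes)
```

An attribute with many small partitions has a large SplitInfo, so its gain ratio is pulled down, which counteracts the bias of plain information gain.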
Gini index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the gini index gini(D) is defined as
\[ gini(D) = 1 - \sum_{j=1}^{n} p_j^2 \]
  where p_j is the relative frequency of class j in D.
- If D is split on A into two subsets D1 and D2, the gini index of the split, gini_split(D), is defined as
\[ gini_{split}(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2) \]
- Reduction in impurity:
\[ \Delta gini(A) = gini(D) - gini_{split}(D) \]
- The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (we need to enumerate all the possible splitting points for each attribute).
Gini index (CART, IBM IntelligentMiner)
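Ex. D has 9 tuples in buys_computer = “yes” and 5 in “no”:
\[ gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459 \]
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} (7 yes, 3 no) and 4 in D2: {high} (2 yes, 2 no):
\[ gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443 \]
The other two binary splits give 0.458 for {low, high} vs. {medium} and 0.450 for {medium, high} vs. {low}, so the split {low, medium} vs. {high} is the best since it has the lowest gini index.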
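The binary splits on income can be enumerated with a short sketch. The per-value class counts (low: 3 yes / 1 no, medium: 4 yes / 2 no, high: 2 yes / 2 no) are assumed from the usual version of the buys_computer data. They are consistent with the 9 yes / 5 no totals and the 10 / 4 partition sizes stated in the example. Function and variable names are illustrative.

```python
def gini(counts: list[int]) -> float:
    """Gini index of a class distribution given as counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(d1: list[int], d2: list[int]) -> float:
    """Size-weighted gini index of a binary split into d1 and d2."""
    n1, n2 = sum(d1), sum(d2)
    n = n1 + n2
    return n1 / n * gini(d1) + n2 / n * gini(d2)


# (yes, no) counts per income value; assumed, see the lead-in.
income = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}


def merge(*levels: str) -> list[int]:
    """Add up the class counts of several income values."""
    return [sum(income[level][i] for level in levels) for i in range(2)]


scores = {
    "{low, medium} | {high}": gini_split(merge("low", "medium"), merge("high")),
    "{low, high} | {medium}": gini_split(merge("low", "high"), merge("medium")),
    "{medium, high} | {low}": gini_split(merge("medium", "high"), merge("low")),
}
best_split = min(scores, key=scores.get)
gini_d = gini([9, 5])
```

Under these counts the three splits score roughly 0.443, 0.458, and 0.450, so `best_split` is the {low, medium} vs. {high} partition.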
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2).
- The target function could be discrete- or real-valued.
- For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to xq.
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: positive (+) and negative (_) training examples around the query point xq]
Discussion on the k-NN Algorithm
- k-NN for real-valued prediction for a given unknown tuple: returns the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query xq, giving greater weight to closer neighbors.
- Robust to noisy data by averaging over the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes. To overcome it, eliminate the least relevant attributes.
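The following sketch puts the pieces together: majority-vote classification, an optional distance weighting, and mean-value regression. The 1/d² weight is one common choice and is an assumption here, as are the function names and the toy training points.

```python
import math
from collections import defaultdict

Point = tuple[float, ...]


def nearest(train: list[tuple[Point, object]], query: Point, k: int):
    """Return the k training items closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query: Point, k: int, weighted: bool = False):
    """Most common (or most heavily weighted) label among the k nearest neighbors."""
    votes: dict[object, float] = defaultdict(float)
    for point, label in nearest(train, query, k):
        if weighted:
            d = math.dist(point, query)
            votes[label] += 1.0 / (d * d + 1e-12)  # closer neighbors count more
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


def knn_regress(train, query: Point, k: int) -> float:
    """Mean target value of the k nearest neighbors."""
    neighbors = nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)


# Toy training set, invented for illustration.
labeled = [((0, 0), "-"), ((1, 0), "-"), ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
xq = (0.5, 0.2)
plain_k3 = knn_classify(labeled, xq, k=3)
plain_k5 = knn_classify(labeled, xq, k=5)
weighted_k5 = knn_classify(labeled, xq, k=5, weighted=True)
mean_k2 = knn_regress([((0.0,), 1.0), ((1.0,), 3.0), ((10.0,), 50.0)], (0.4,), k=2)
```

With k = 5 the unweighted vote is won by the three distant “+” points, while the distance-weighted vote still returns “-”, which is the effect the weighting is meant to achieve.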
Genetic Algorithms (GA)
- Genetic algorithm: based on an analogy to biological evolution.
- An initial population is created consisting of randomly generated rules.
  - Each rule is represented by a string of bits.
  - E.g., “if A1 and ¬A2 then C2” can be encoded as 100.
  - If an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which each rule in P satisfies a prespecified fitness threshold.
- Slow but easily parallelizable.
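One generation of such a procedure might look like the sketch below. It is a simplified illustration under several assumptions: rules are fixed-length bit strings, the fitter half of the population survives unchanged, and the `toy_fitness` function (number of bits matching a target rule) stands in for classification accuracy on training examples, which would need a data set.

```python
import random
from typing import Callable


def crossover(a: str, b: str, point: int) -> tuple[str, str]:
    """Single-point crossover: swap the tails of two bit strings."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(rule: str, rate: float, rng: random.Random) -> str:
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit for bit in rule
    )


def next_generation(
    population: list[str],
    fitness: Callable[[str], float],
    rng: random.Random,
    mutation_rate: float = 0.05,
) -> list[str]:
    """Keep the fittest half and refill the population with mutated offspring."""
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[: max(2, len(ranked) // 2)]
    children: list[str] = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        child, _ = crossover(a, b, rng.randrange(1, len(a)))
        children.append(mutate(child, mutation_rate, rng))
    return survivors + children


def toy_fitness(rule: str) -> float:
    """Stand-in fitness: bits in common with the rule 100 (hypothetical target)."""
    return float(sum(x == y for x, y in zip(rule, "100")))


rng = random.Random(0)
population = ["000", "111", "101", "010", "110", "001"]
new_population = next_generation(population, toy_fitness, rng)
```

Because the fitness of each rule is evaluated independently, the ranking step is the part that parallelizes easily.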
