A Brief Introduction and Issues on the Classification Problem
Jin Mao, Postdoc, School of Information, University of Arizona
Sept 18, 2015
Outline
- Classification Examples
- Classic Classifiers
- Train the Classifier
- Evaluation Method
- Apply the Classifier
Classification Examples
- Spam filtering
- Fraud detection
- Self-driving automobiles
The Classification Problem
Classic Classifiers
- Naïve Bayes
- Decision Tree: J48 (C4.5)
- KNN
- Random Forest
- SVM: SMO, LibSVM
- Neural Network
- …
How to Choose the Classifier?
- Observe your data: its amount and its features.
- Consider your application: precision/recall needs, explainability, incremental updates, computational complexity.
- A decision tree is easy to understand, but a standard classification tree cannot predict numeric values and can be slow.
- Naïve Bayes is fairly robust and easy to update incrementally.
- Neural networks and SVMs are "black boxes".
- An SVM is fast when you only need a yes/no prediction.
- Never mind: you can try all of them and pick by model selection with cross validation, as sketched below.
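As a rough sketch of "try all of them", assuming scikit-learn and its bundled iris data (substitute your own X and y); the slide does not prescribe this library:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)

    candidates = {
        "Naive Bayes":   GaussianNB(),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
        "KNN":           KNeighborsClassifier(n_neighbors=5),
        "Random Forest": RandomForestClassifier(random_state=0),
        "SVM":           SVC(kernel="rbf"),
    }

    # 5-fold cross validation; the classifier with the best mean score wins.
    for name, clf in candidates.items():
        scores = cross_val_score(clf, X, y, cv=5)
        print(f"{name:15s} mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")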
Discussions:
- machine-learning-classifier-to-try-first
- kind-of-classifier-to-use-1.html
- _based_on_the_data-set_provided
Train Your Classifier
Obtain Training Set
Instances should be labeled. Sources of labels:
- Systems running in practice
- Annotation by multiple experts (check inter-rater agreement; see the sketch below)
- Crowdsourcing (e.g., Google's reCAPTCHA)
- …
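When multiple experts annotate, inter-rater agreement can be checked with Cohen's kappa; a minimal sketch, assuming scikit-learn and two hypothetical annotators' labels:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical labels from two annotators on the same ten instances.
    rater_a = ["spam", "ham", "spam", "spam", "ham", "ham", "spam", "ham", "ham", "spam"]
    rater_b = ["spam", "ham", "ham", "spam", "ham", "ham", "spam", "spam", "ham", "spam"]

    # Cohen's kappa corrects raw agreement for agreement expected by chance;
    # values near 1 mean reliable labels, values near 0 mean chance-level labeling.
    print(cohen_kappa_score(rater_a, rater_b))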
Obtain Training Set
Make it large enough:
- More data reduces noise.
- The benefit of enough data can even outweigh the choice of classification algorithm.
- Redundant data helps little.
Selection strategies: nearest neighbors, ordered removals, random sampling, particle swarms, or evolutionary methods.
Obtain Training Set
Unbalanced training instances across classes cause two problems:
- Evaluation: with simple measures such as overall precision/recall, a classifier that mostly predicts the majority class (the class with many samples) still scores highly. (AUC is a better measure.)
- Learning: the minority class gives the features too little information to locate the class boundaries.
Obtain Training Set
Strategies (a SMOTE sketch follows):
- Divide the majority class into L distinct clusters, train L predictors, and average them into the final one.
- Generate synthetic data for the rare class, e.g., SMOTE.
- Reduce the imbalance level by cutting down the majority class.
- …
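A minimal SMOTE sketch, assuming the third-party imbalanced-learn package (imblearn) and a synthetic 9:1 dataset; the slide does not prescribe this library:

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import SMOTE

    # A synthetic binary problem with a 9:1 class imbalance.
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print("before:", Counter(y))

    # SMOTE synthesizes new minority-class points by interpolating between
    # existing minority samples and their nearest minority neighbors.
    X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
    print("after: ", Counter(y_res))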
Obtain Training Set
More materials:
- unbalanced-training-set
- unbalanced-test-data-set-and-balanced-training-data-in-classification
- He, Haibo, and Edwardo A. Garcia. "Learning from imbalanced data." IEEE Transactions on Knowledge and Data Engineering 21, no. 9 (2009): 1263-1284.
Feature Selection
Why:
- Unrelated features introduce noise and heavy computation.
- Interdependent features are redundant.
- Fewer, better features yield a better model.
See Guyon and Elisseeff, "An Introduction to Variable and Feature Selection" (PDF).
Feature Selection
Feature selection methods (a filter-method sketch follows):
- Filter methods apply a statistical measure to assign a score to each feature, e.g., the chi-squared test, information gain, and correlation coefficient scores.
- Wrapper methods treat the selection of a set of features as a search problem.
- Embedded methods learn which features best contribute to the accuracy of the model while the model is being created, e.g., LASSO, Elastic Net, and Ridge Regression.
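A small sketch of a filter method, assuming scikit-learn's SelectKBest with the chi-squared score on the bundled iris data (chi-squared requires non-negative feature values):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)

    # Filter method: score each feature with the chi-squared statistic
    # against the class labels, then keep only the top k features.
    selector = SelectKBest(score_func=chi2, k=2)
    X_selected = selector.fit_transform(X, y)
    print(selector.scores_)    # one score per original feature
    print(X_selected.shape)    # (150, 2): only the two best features remain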
Evaluation Method
Basic evaluation measures (a sketch follows):
- Precision
- Confusion matrix
- Per-class accuracy
- AUC (Area Under the Curve): the ROC curve shows the sensitivity of the classifier by plotting the true-positive rate against the false-positive rate.
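A minimal sketch of these measures with scikit-learn; the ground truth, hard predictions, and scores below are made-up examples:

    from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

    # Hypothetical ground truth, hard predictions, and predicted scores.
    y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
    y_pred  = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.3]

    print(confusion_matrix(y_true, y_pred))       # rows: true class, cols: predicted class
    print(classification_report(y_true, y_pred))  # per-class precision/recall/F1
    print(roc_auc_score(y_true, y_score))         # area under the ROC curve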
Evaluation Method
Cross validation variants:
- Random Subsampling
- K-fold Cross Validation
- Leave-one-out Cross Validation
Cross Validation: Random Subsampling
Cross Validation: K-fold
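A minimal K-fold sketch, assuming scikit-learn's KFold on ten toy instances; every instance lands in the test fold exactly once:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)  # ten toy instances

    # 5 folds: each pass trains on 8 instances and tests on the other 2.
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
        print(f"fold {fold}: train={train_idx}, test={test_idx}")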
Cross Validation: Leave-one-out
Cross Validation: Three-way data splits (train/validation/test)
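A minimal three-way split sketch, assuming scikit-learn's train_test_split applied twice; hyperparameters are tuned on the validation set and the test set is touched only once at the end:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X, y = np.arange(100).reshape(50, 2), np.arange(50) % 2

    # First carve off a held-out test set, then split the remainder
    # into training and validation sets (here 60% / 20% / 20%).
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
    print(len(X_train), len(X_val), len(X_test))  # 30 10 10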
Apply the Classifier
- Save the model (a persistence sketch follows).
- Make the model dynamic: retrain and redeploy as new labeled data arrives.
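A minimal persistence sketch, assuming the joblib package and a scikit-learn model; the filename classifier.joblib is made up:

    import joblib
    from sklearn.datasets import load_iris
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    clf = SVC().fit(X, y)

    # Save the trained model so the application can load it later without
    # retraining; re-dump a fresh model when new labeled data arrives.
    joblib.dump(clf, "classifier.joblib")
    restored = joblib.load("classifier.joblib")
    print(restored.predict(X[:3]))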
Thank you!