Asymmetric Gradient Boosting with Application to Spam Filtering
Jingrui He, Bo Thiesson
Carnegie Mellon University, Microsoft Research
Roadmap
- Background
- MarginBoost Framework
- Boosting with Different Costs (BDC)
  - Cost Functions
  - BDC in the Low False Positive Region
  - Parameter Study
- Experimental Results
- Conclusion
Background
- Classification: neural networks, support vector machines
- Boosting: an ensemble classifier; a weak learner is trained on the training data, which is then reweighted
- Symmetric loss function: the same cost for misclassified instances from different classes
Email Spam Filtering
- Classification task: logistic regression, AdaBoost, SVMs, Naïve Bayes, decision trees, neural networks, etc.
- Misclassification of good emails: false positives
  - False positives are more expensive than false negatives
- Stratification
  - Spam emails are all de-emphasized in the same way
  - Unable to differentiate between noisy and characteristic spam emails
- Boosting with Different Costs to the rescue!
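MarginBoost Framework (Mason et al., 1999)
- Training set: labeled examples
- Strong classifier: a voted combination of weak learners
  - Weak learner: produces the classification result
  - Weight: the coefficient of each weak learner in the vote
- Margin: positive (+) for a correct prediction, negative (-) for an incorrect prediction
- Loss functional S: the sample average of the cost function C evaluated on the margins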
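In the formulation of Mason et al. (1999), the quantities listed on this slide can be written as below. The symbols $m$, $T$, $w_t$, $f_t$, and $F$ are assumed notation taken from that formulation and may differ from the ones used in the talk; $S$ and $C$ are the names the later slides use for the loss functional and the cost function.

```latex
\begin{align*}
\text{Training set:}\quad & \{(x_1, y_1), \dots, (x_m, y_m)\}, \quad y_i \in \{-1, +1\} \\
\text{Strong classifier:}\quad & F(x) = \sum_{t=1}^{T} w_t f_t(x), \quad f_t(x) \in \{-1, +1\} \\
\text{Margin:}\quad & y_i F(x_i) \quad (\text{positive iff } x_i \text{ is classified correctly}) \\
\text{Loss functional:}\quad & S(F) = \frac{1}{m} \sum_{i=1}^{m} C\bigl(y_i F(x_i)\bigr)
\end{align*}
```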
MarginBoost Framework cont. (Mason et al., 1999)
- To minimize the loss functional S:
  - No traditional parameter optimization
  - Gradient descent in function space
- In iteration t, given the current classifier, find the direction (a new weak learner) along which S decreases most rapidly
- This direction is the negative functional derivative of S at the current classifier, built from
  - the indicator function at each training point
  - the derivative of C with respect to the margin
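As a worked form of the step described on this slide, the negative functional derivative for a sample-average loss can be written as follows. This is a sketch in assumed notation: $F_t$ is the classifier after iteration $t$, $\mathbf{1}[x = x_i]$ is the indicator function at training point $x_i$, $C'$ is the derivative of $C$ with respect to the margin, and the training points are assumed distinct.

```latex
\begin{align*}
S(F) &= \frac{1}{m} \sum_{i=1}^{m} C\bigl(y_i F(x_i)\bigr) \\
-\nabla S(F_t)(x) &= -\frac{1}{m} \sum_{i=1}^{m} y_i \, C'\bigl(y_i F_t(x_i)\bigr) \, \mathbf{1}[x = x_i]
\end{align*}
```

Because $C$ is decreasing, $-C'$ is non-negative, so each training point pulls the new direction toward its own label with strength $-C'(y_i F_t(x_i))$.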
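MarginBoost Framework cont. (Mason et al., 1999)
- If the new weak learner must come from some fixed parameterized class, it should maximize its inner product with the negative functional derivative
  - Equivalently, it maximizes the weighted margins over all the data points, where each point's weight is proportional to the magnitude of the cost derivative at its current margin
- Coefficient for the new weak learner: line search, or a more sophisticated method
- Stopping criterion: the maximum number of iterations is reached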
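The loop described on the last two slides can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes 0/1 features, labels in {-1, +1}, stumps of the form `sign * (2 * x_j - 1)`, and a coarse grid (`step_grid`, a hypothetical parameter) standing in for the line search. The logistic cost is included only so the sketch can be run.

```python
import numpy as np

def logistic_cost(z):
    """C(z) = ln(1 + exp(-z)), evaluated stably."""
    return np.logaddexp(0.0, -z)

def logistic_cost_grad(z):
    """dC/dz = -1 / (1 + exp(z)), evaluated stably."""
    return -np.exp(-np.logaddexp(0.0, z))

def best_stump(X, y, weights):
    """Pick the binary feature whose stump maximizes the weighted margin.

    A stump on feature j votes sign * (2 * x_j - 1), i.e. +1 or -1.
    """
    scores = (weights * y) @ (2.0 * X - 1.0)   # one weighted margin per feature
    j = int(np.argmax(np.abs(scores)))
    sign = 1.0 if scores[j] >= 0 else -1.0
    return j, sign

def margin_boost(X, y, cost, cost_grad, n_rounds=20, step_grid=None):
    """Return ([(feature, sign, coefficient), ...], training scores F)."""
    if step_grid is None:
        step_grid = np.linspace(0.0, 2.0, 41)  # assumed line-search grid
    F = np.zeros(len(y))
    model = []
    for _ in range(n_rounds):                  # stop at the maximum number of iterations
        weights = -cost_grad(y * F)            # non-negative for a decreasing cost
        total = weights.sum()
        if total <= 0:
            break
        weights = weights / total
        j, sign = best_stump(X, y, weights)
        f = sign * (2.0 * X[:, j] - 1.0)
        if np.sum(weights * y * f) <= 0:       # no descent direction left
            break
        losses = [cost(y * (F + a * f)).mean() for a in step_grid]
        step = float(step_grid[int(np.argmin(losses))])
        if step == 0.0:                        # line search found no improvement
            break
        F = F + step * f
        model.append((j, sign, step))
    return model, F
```

Calling `margin_boost(X, y, logistic_cost, logistic_cost_grad)` with a 0/1 feature matrix `X` returns the selected stumps and the training scores; swapping in another differentiable, decreasing cost only requires passing a different `cost` and `cost_grad` pair.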
MarginBoost Specialization
- Cost function
  - Differentiable
  - Monotonically decreasing
- Special cases by choice of cost function: AdaBoost, LogitBoost, logistic regression
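To make the role of the cost function concrete, the sketch below writes out the two costs that Mason et al. (1999) associate with AdaBoost (exponential) and LogitBoost (logistic), together with the instance weights they induce. The function names are illustrative; the weight of an instance is taken to be the negated derivative of the cost at its margin `z`.

```python
import numpy as np

def exp_cost(z):
    """Exponential cost exp(-z), the AdaBoost case."""
    return np.exp(-z)

def exp_weight(z):
    """-dC/dz for the exponential cost: grows without bound as z -> -inf."""
    return np.exp(-z)

def logistic_cost(z):
    """Logistic cost ln(1 + exp(-z)), the LogitBoost / logistic regression case."""
    return np.logaddexp(0.0, -z)

def logistic_weight(z):
    """-dC/dz for the logistic cost: 1 / (1 + exp(z)), bounded above by 1."""
    return np.exp(-np.logaddexp(0.0, z))
```

Both costs are differentiable and monotonically decreasing in the margin, so both give the largest weights to the most badly misclassified instances; the difference is that the exponential weight is unbounded while the logistic weight saturates.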
MarginBoost Specialization cont.
- Cost function: the logistic cost
- Weak learner: decision stumps, each built on the most discriminating feature in that iteration
- Strong classifier: the weighted vote of the selected stumps
- Output
  - Upon convergence: logistic regression
  - Stopping earlier: feature selection
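One way to see why this specialization ends up at logistic regression is that a weighted vote of stumps on 0/1 features is itself a linear function of the features. The sketch below assumes binary features, stumps that vote `sign * (2 * x_j - 1)`, and a logistic output `1 / (1 + exp(-F(x)))`; these are assumptions for illustration, and the function names are hypothetical.

```python
import numpy as np

def stumps_to_linear(stumps, n_features):
    """Collapse weighted stumps into (coefficients, intercept).

    Each stump is (feature index j, sign s, weight w) and votes s * (2 * x_j - 1),
    so it contributes 2*s*w to the coefficient of x_j and -s*w to the intercept.
    """
    beta = np.zeros(n_features)
    bias = 0.0
    for j, s, w in stumps:
        beta[j] += 2.0 * s * w
        bias -= s * w
    return beta, bias

def predict_proba(X, beta, bias):
    """Logistic output 1 / (1 + exp(-F(x))) with F(x) = X @ beta + bias."""
    F = X @ beta + bias
    return np.exp(-np.logaddexp(0.0, -F))
```

Features that were never picked by a stump keep a zero coefficient, which is the sense in which stopping early acts as feature selection: `np.count_nonzero(beta)` is the number of features the model actually uses.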
Boosting with Different Costs: Advantages
- Weights of mislabeled spam
  - Regular boosting: larger and larger weights as more weak learners are combined
  - BDC: large weights for moderately misclassified spam, small weights for extremely misclassified spam
- Weights of mislabeled ham
  - Always high, as in regular boosting
Boosting with Different Costs cont.
- Cost function: defined separately for ham and for spam
- Weight for training instances: derived from the cost function of each instance's class
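A pair of cost functions consistent with what this deck describes (misclassified ham always heavily weighted; spam cost capped at a maximum, with the largest weights near the decision boundary) is the logistic cost for ham and a scaled sigmoid for spam. The sketch below uses that pair; the functional forms and the parameter names `alpha` (maximum spam cost) and `gamma` (slope control) are assumptions for illustration and are not taken from the slides.

```python
import numpy as np

def ham_cost(z):
    """Assumed ham cost: logistic, ln(1 + exp(-z)); unbounded as z -> -inf."""
    return np.logaddexp(0.0, -z)

def ham_weight(z):
    """-dC/dz for the ham cost: 1 / (1 + exp(z)); approaches 1 for bad errors."""
    return np.exp(-np.logaddexp(0.0, z))

def spam_cost(z, alpha=1.0, gamma=1.0):
    """Assumed spam cost: alpha / (1 + exp(gamma * z)); never exceeds alpha."""
    return alpha * np.exp(-np.logaddexp(0.0, gamma * z))

def spam_weight(z, alpha=1.0, gamma=1.0):
    """-dC/dz for the spam cost: alpha * gamma * s * (1 - s), s = 1 / (1 + exp(gamma * z)).

    Largest at z = 0 and vanishing for both very positive and very negative margins.
    """
    s = np.exp(-np.logaddexp(0.0, gamma * z))
    return alpha * gamma * s * (1.0 - s)
```

Under these assumed forms, a spam message with a margin far below zero (an extremely misclassified, likely noisy message) gets almost no weight, while a ham message with the same margin keeps a weight close to 1.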
BDC at Low False Positive Region
[Figure. Labels: linear threshold; noisy spam message; after one iteration in regular boosting; after one iteration in BDC; high false positive region; low false positive region.]
Parameter Study in BDC
- Parameters
  - The maximum cost for spam (stratification)
  - The slope of the cost around a fixed margin value
- Noisy data sets: noise probability 0.03, 0.05, 0.1
Parameter Study in BDC cont.
[Figure: the effect of the BDC parameters, in two panels, each plotting FN (false negatives) at FP (false positives) 0.03, FN at FP 0.05, and FN at FP 0.1.]
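The quantity plotted here, the false negative rate at a fixed false positive rate, can be computed from classifier scores as sketched below. The function name, the convention that higher scores mean "more spam-like", and the strict-inequality thresholding are assumptions for illustration; false positives are ham messages flagged as spam, as defined earlier in the deck.

```python
import numpy as np

def fn_rate_at_fp(scores, is_spam, target_fp):
    """FN rate at the lowest threshold whose FP rate does not exceed target_fp.

    scores:  higher means more spam-like.
    is_spam: boolean array, True for spam and False for ham.
    """
    scores = np.asarray(scores, dtype=float)
    is_spam = np.asarray(is_spam, dtype=bool)
    ham_scores = np.sort(scores[~is_spam])[::-1]    # ham scores, descending
    n_ham = ham_scores.size
    k = int(np.floor(target_fp * n_ham))            # ham messages we may flag
    # Flag a message when its score is strictly above the (k+1)-th highest ham score,
    # so at most k ham messages are flagged.
    threshold = ham_scores[k] if k < n_ham else -np.inf
    flagged = scores > threshold
    return float(np.mean(~flagged[is_spam]))        # fraction of spam that is missed
```

Evaluating this at small values of `target_fp` corresponds to the low false positive region that the talk focuses on.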
Experimental Results
- Data: Hotmail Feedback Loop data
  - Training set: 200,000 messages received between July 1st, 2005 and August 9th, 2005
  - Test set: 60,000 messages received between December 1st, 2005 and December 6th, 2005
- Methods for comparison
  - Logistic regression, regularized logistic regression
  - LogitBoost
  - LogitBoost and logistic regression with stratification
Experimental Results cont.
[Figure: two result panels, one with decision stumps as the weak learner and one with decision trees of depth 2 as the weak learner.]
Conclusion
- MarginBoost in email spam filtering
  - Logistic regression as a special instance
  - Smart feature selection in logistic regression
- BDC: an asymmetric boosting method
  - Different cost functions for ham and spam
  - Misclassified ham always has a large weight
  - Moderately misclassified spam has a large weight
  - Extremely misclassified spam has a small weight
  - Able to improve the false negative rate in the low false positive region
Thank you! www.cs.cmu.edu/~jingruih Q&A