BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification
Author: George Forman (Hewlett-Packard Labs)
Conference on Information and Knowledge Management (CIKM), 2009
Presenter: Lin, Shu-Han — Intelligent Database Systems Lab, N.Y.U.S.T. I.M.
Outline
- Motivation
- Objective
- Methodology
- Experiments
- Conclusion
- Comments
Motivation
- Multi-class classification is a 1-of-n problem, e.g., assigning one topic category out of A, B, C, D.
- Binary classification is a 1-of-2 problem: positive vs. negative.
- Every multi-class problem can therefore be decomposed into many binary (positive-vs-negative) classification problems, as the sketch below illustrates.
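As a concrete illustration of this one-vs-rest decomposition, here is a minimal sketch using scikit-learn (an assumption; the slides name no library), with toy documents and labels:

```python
# Minimal one-vs-rest sketch: one positive-vs-negative SVM per topic.
# scikit-learn is assumed; documents and labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

docs = ["wheat harvest rises", "cup final tonight",
        "election results due", "grain exports fall"]
labels = ["grain", "sport", "politics", "grain"]  # 1-of-n topic categories

X = CountVectorizer(binary=True).fit_transform(docs)   # binary term features
clf = OneVsRestClassifier(LinearSVC()).fit(X, labels)  # n binary problems
print(clf.predict(X))
```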
Motivation
- Feature "scaling" = "weighting" = "scoring".
- In the TF·IDF representation, IDF is oblivious to the class labels, so it scales some features inappropriately:

Term | Positive (100) | Negative (900) | IDF
X    | 80 (80%)       | 0 (0%)         | log(1000/80) = 1.1
Y    | 8 (8%)         | 0 (0%)         | log(1000/8)  = 2.1

- IDF weights the rarer term Y above X, even though X is far more predictive of the positive class (see the sketch below).
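The slide's arithmetic can be checked directly; a base-10 logarithm is assumed, since it matches the values shown:

```python
# Reproducing the IDF values above (base-10 log assumed from the slide).
import math

N = 1000  # 100 positive + 900 negative documents
print(math.log10(N / 80))  # term X: ~1.1, occurs in 80% of positives
print(math.log10(N / 8))   # term Y: ~2.1, occurs in only 8% of positives
# The rarer but less predictive term Y gets the larger IDF weight.
```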
Objectives
- Maximize classification performance through:
  - Feature selection
  - Feature scaling: give a larger numeric range to more predictive features.
- A feature is maximally predictive when it occurs in 100% of positive and 0% of negative documents, or in 0% of positive and 100% of negative documents.
Methodology – Feature Scoring Metrics
- BNS (Bi-Normal Separation): |F⁻¹(tpr) − F⁻¹(fpr)|, where F⁻¹ is the inverse normal cumulative distribution function, tpr = tp/pos, and fpr = fp/neg.
- IDF: log(N/df), where df is the number of documents containing the term.
- LOR (log odds ratio): log( tpr(1 − fpr) / ((1 − tpr) fpr) ).
- IG (information gain): the entropy-reduction metric, used here for feature selection.

A minimal BNS implementation follows.
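This sketch assumes scipy and follows the convention from Forman's earlier work of clipping rates to [0.0005, 0.9995] so that F⁻¹ stays finite:

```python
# Sketch of the BNS score; scipy's norm.ppf is the inverse normal CDF F^-1.
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|, rates clipped away from 0/1."""
    tpr = min(max(tp / pos, eps), 1 - eps)  # true-positive rate tp/pos
    fpr = min(max(fp / neg, eps), 1 - eps)  # false-positive rate fp/neg
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

print(bns(30, 0, 30, 400))    # 100% vs 0%: maximally predictive, ~6.6
print(bns(15, 200, 30, 400))  # 50% vs 50%: non-predictive, 0.0
```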
Methodology – Feature Scoring Metrics

Running example: 30 positive and 400 negative documents.

Term    | Positive (30) | Negative (400)
Italy   | 30 (100%)     | 0 (0%)
x       | 2 (7%)        | 0 (0%)
patient | 30 (100%)     | 400 (100%)    (LOR = 0.00)
cost    | 0 (0%)        | 400 (100%)
y       | 15 (50%)      | 200 (50%)
Methodology – Feature Scoring Metrics

Case 1: terms occurring only in positive documents.

Term  | Positive (30) | Negative (400)
Italy | 30 (100%)     | 0 (0%)
x     | 3 (10%)       | 0 (0%)

The positive rate can range over [0% ~ 100%] while the negative rate stays at 0%.
Methodology – Feature Scoring Metrics

Case 2: terms occurring only in negative documents (the mirror of Case 1), e.g., 0 (0%) positive vs. 40 (10%) negative, or 0 (0%) positive vs. 400 (100%) negative.

The positive rate stays at 0% while the negative rate can range over [0% ~ 100%].
Methodology – Feature Scoring Metrics

Case 3: terms with equal rates in both classes are non-predictive.

Term    | Positive (30) | Negative (400)
patient | 30 (100%)     | 400 (100%)    (LOR = 0.00)
y       | 15 (50%)      | 200 (50%)

Both rates range over [0% ~ 100%] but match each other.
Methodology – Feature Scoring Metrics

Case 4: maximally predictive terms at opposite extremes.

Term  | Positive (30) | Negative (400)
Italy | 30 (100%)     | 0 (0%)
cost  | 0 (0%)        | 400 (100%)

As the positive rate goes from 0% to 100%, the negative rate goes from 100% to 0%.
Methodology – Feature Scoring Metrics (summary)

Term    | Positive (30) | Negative (400)
Italy   | 30 (100%)     | 0 (0%)
x       | 2 (7%)        | 0 (0%)
patient | 30 (100%)     | 400 (100%)    (LOR = 0.00)
cost    | 0 (0%)        | 400 (100%)
y       | 15 (50%)      | 200 (50%)

BNS separates the predictive terms (Italy, cost) from the non-predictive ones (patient, y), while IDF rewards mere rarity (x); the sketch below scores these terms.
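To make the contrast concrete, this hedged sketch scores the example terms with BNS and IDF (the clipping constant and base-10 IDF are assumptions; the numbers illustrate the ordering, not the slides' exact values):

```python
# BNS vs. IDF for the running example (30 positive, 400 negative documents).
import math
from scipy.stats import norm

POS, NEG = 30, 400

def bns(tp, fp, eps=0.0005):
    clip = lambda r: min(max(r, eps), 1 - eps)
    return abs(norm.ppf(clip(tp / POS)) - norm.ppf(clip(fp / NEG)))

def idf(df):
    return math.log10((POS + NEG) / df)  # df = documents containing the term

for term, tp, fp in [("Italy", 30, 0), ("x", 2, 0), ("patient", 30, 400),
                     ("cost", 0, 400), ("y", 15, 200)]:
    print(f"{term:8s} BNS={bns(tp, fp):5.2f}  IDF={idf(tp + fp):5.2f}")
# BNS ranks Italy and cost highest and zeroes out patient and y,
# while IDF rewards mere rarity (x) and nearly ignores cost.
```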
Experiments – Accuracy & F-measure

Experiments – Precision vs. Recall

Experiments – The Effect of Class Distribution
Experiments – Comparison with Other Scoring Metrics
Experiments – Feature Selection + Feature Scaling
Conclusions
- BNS measures the separation between a term's occurrence rate in the positive class and its rate in the negative class.
- If feature selection is required, combine IG selection with BNS scaling.
- Otherwise there is no need for feature selection: using all features gives the best performance.
- It is better to simply use all binary features, BNS-scaled; a sketch of this pipeline follows.
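A hedged sketch of the recommended pipeline (binary term presence, BNS-scaled, linear SVM); scikit-learn, scipy, and the toy corpus are assumptions, not the paper's code:

```python
# Binary features x BNS weights -> linear SVM, per the conclusion above.
import numpy as np
from scipy.stats import norm
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def bns_weights(X, y, eps=0.0005):
    """Per-term BNS scores from a binary doc-term matrix X and 0/1 labels y."""
    pos, neg = (y == 1), (y == 0)
    tp = np.asarray(X[pos].sum(axis=0)).ravel()  # positive docs containing term
    fp = np.asarray(X[neg].sum(axis=0)).ravel()  # negative docs containing term
    tpr = np.clip(tp / pos.sum(), eps, 1 - eps)
    fpr = np.clip(fp / neg.sum(), eps, 1 - eps)
    return np.abs(norm.ppf(tpr) - norm.ppf(fpr))

docs = ["cheap meds online now", "meeting agenda attached",
        "win cheap prizes now", "quarterly agenda and minutes"]
y = np.array([1, 0, 1, 0])  # toy spam(+)/ham(-) labels

X = CountVectorizer(binary=True).fit_transform(docs)  # all binary features
Xw = X.multiply(bns_weights(X, y)).tocsr()            # BNS-scaled features
clf = LinearSVC().fit(Xw, y)
print(clf.predict(Xw))
```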
Comments
- Advantage: the idea is clear; the metric takes the class distribution into account.
- Drawbacks: restricted to 2-class problems; using all features is time-consuming.
- Application: use BNS in place of IDF when weighting features.