Intelligent Database Systems Lab, N.Y.U.S.T. I.M.

BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification
Presenter: Lin, Shu-Han
Author: George Forman (Hewlett-Packard Labs)
Conference on Information and Knowledge Management (CIKM), 2009
Outline
- Motivation
- Objective
- Methodology
- Experiments
- Conclusion
- Comments
Motivation
- Multi-class classification is a 1-of-n problem, e.g., assigning one topic category among classes A, B, C, D.
- Binary classification is a 1-of-2 problem: positive vs. negative.
- Every multi-class problem can therefore be decomposed into many binary (positive/negative) classification problems.
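The decomposition described above is the standard one-vs-rest scheme; a minimal sketch of it (the labels and helper name are my own illustration, not from the slides):

```python
# Sketch: one-vs-rest decomposition of a 4-class problem (A-D) into
# binary positive/negative problems, as described on the slide.
labels = ["A", "B", "C", "A", "D", "B"]

def one_vs_rest(labels, positive_class):
    """Relabel a multi-class problem as binary: positive vs. rest."""
    return [lab == positive_class for lab in labels]

# One binary problem per class; each would get its own binary classifier.
binary_problems = {c: one_vs_rest(labels, c) for c in sorted(set(labels))}
print(binary_problems["A"])  # [True, False, False, True, False, False]
```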
Motivation
- Feature "scaling" (= "weighting" = "scoring").
- The TF·IDF representation: IDF is oblivious to the class labels and therefore scales some features inappropriately.

Feature | Positive (100) | Negative (900) | IDF
X       | 80 (80%)       | 0 (0%)         | log(1000/80) = 1.1
Y       | 8 (8%)         | 0 (0%)         | log(1000/8)  = 2.1

Here X is the stronger predictor (it marks 80% of the positives), yet IDF gives the rarer Y nearly twice the weight.
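The table's IDF values can be reproduced with a few lines (a sketch of the slide's arithmetic; the function name is mine):

```python
# Sketch: IDF for the slide's example corpus of 1000 documents
# (100 positive, 900 negative). Note that IDF never sees the labels.
import math

def idf(num_docs, doc_freq):
    """Inverse document frequency, log base 10 as on the slide."""
    return math.log10(num_docs / doc_freq)

idf_x = idf(1000, 80)  # X occurs in 80 docs (all positive)
idf_y = idf(1000, 8)   # Y occurs in 8 docs (all positive)
# The highly predictive X gets a *smaller* weight than the rarer Y,
# which is exactly the slide's complaint about TF-IDF.
print(round(idf_x, 1), round(idf_y, 1))  # 1.1 2.1
```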
Objectives
- Maximize classification performance.
- Feature selection.
- Feature scaling: give a greater numeric range to the more predictive features.
  - Most predictive: 100% positive / 0% negative, or 0% positive / 100% negative.
Methodology – Feature Scoring Metrics
- F⁻¹: the inverse of the standard normal cumulative distribution function.
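The BNS formula itself (shown as an image on the original slide) is, per Forman: BNS(tpr, fpr) = |F⁻¹(tpr) − F⁻¹(fpr)|, where tpr = tp/pos and fpr = fp/neg are the feature's occurrence rates in the two classes, clipped away from 0 and 1 so that F⁻¹ stays finite. A minimal stdlib-only sketch follows; the 0.0005 clipping bound is a common choice and my assumption, so absolute values can differ from the tables in these slides depending on the convention used:

```python
# Sketch of Bi-Normal Separation (BNS) feature scoring.
# The clipping constant 0.0005 is an assumption, not taken from the slides.
from statistics import NormalDist

_INV = NormalDist().inv_cdf  # F^-1: inverse standard normal CDF

def bns(tp, fp, pos, neg, clip=0.0005):
    """|F^-1(tpr) - F^-1(fpr)| with both rates clipped to [clip, 1-clip]."""
    tpr = min(max(tp / pos, clip), 1 - clip)
    fpr = min(max(fp / neg, clip), 1 - clip)
    return abs(_INV(tpr) - _INV(fpr))

# A feature occurring in every positive and no negative document gets the
# maximum score under this clipping; a feature occurring at the same rate
# in both classes scores 0.
print(round(bns(tp=30, fp=0, pos=30, neg=400), 2))   # 6.58
print(round(bns(tp=15, fp=200, pos=30, neg=400), 2))  # 0.0
```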
Methodology – Feature Scoring Metrics

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR   | IG
Italy   | 30 (100%)     | 0              | 3.29 | 1.16 | 4.68  | 0.37
x       | 2 (7%)        | 0              | 0.14 | 2.33 | 1.76  | 0.02
patient | 30 (100%)     | 400 (100%)     | 0.00 |      |       |
cost    | 0             | 400 (100%)     | 3.29 | 0.03 | -4.68 | 0.37
y       | 15 (50%)      | 200 (50%)      | 0.00 | 0.30 | 0.00  |
Methodology – Feature Scoring Metrics

Features occurring only in the positive class (+ rate in [0% ~ 100%], − rate = 0%):

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR  | IG
Italy   | 30 (100%)     | 0              | 3.29 | 1.16 | 4.68 | 0.37
x       | 3 (10%)       | 0              | 0.36 | 2.16 | 1.95 | 0.03
Methodology – Feature Scoring Metrics

Features occurring only in the negative class (+ rate = 0%, − rate in [0% ~ 100%]):

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR   | IG
Italy   | 0             | 40 (10%)       | 0.36 | 1.03 | -0.82 | 0.01
x       | 0             | 400 (100%)     | 3.29 | 0.03 | -4.68 | 0.37
Methodology – Feature Scoring Metrics

Features occurring at the same rate in both classes (+ rate in [0% ~ 100%], − rate in [0% ~ 100%]):

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR  | IG
patient | 30 (100%)     | 400 (100%)     | 0.00 |      |      |
y       | 15 (50%)      | 200 (50%)      | 0.00 | 0.30 | 0.00 |
Methodology – Feature Scoring Metrics

Features with opposite occurrence rates (+ rate in [0% ~ 100%], − rate in [100% ~ 0%]):

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR   | IG
Italy   | 30 (100%)     | 0              | 3.29 | 1.16 | 4.68  | 0.37
cost    | 0             | 400 (100%)     | 3.29 | 0.03 | -4.68 | 0.37
Methodology – Feature Scoring Metrics (summary of the examples)

Feature | Positive (30) | Negative (400) | BNS  | IDF  | LOR   | IG
Italy   | 30 (100%)     | 0              | 3.29 | 1.16 | 4.68  | 0.37
x       | 2 (7%)        | 0              | 0.14 | 2.33 | 1.76  | 0.02
patient | 30 (100%)     | 400 (100%)     | 0.00 |      |       |
cost    | 0             | 400 (100%)     | 3.29 | 0.03 | -4.68 | 0.37
y       | 15 (50%)      | 200 (50%)      | 0.00 | 0.30 | 0.00  |
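For contrast with BNS, the log-odds-ratio (LOR) column can be sketched as below. The 0.5 Laplace-style smoothing is my assumption, so absolute values need not match the table, but the qualitative behavior is the same: positive for positively correlated features, negative for negatively correlated ones, zero when the rates are equal:

```python
# Sketch of a smoothed log odds ratio; the 0.5 smoothing constant is
# an assumption for illustration, not a value from the slides.
import math

def log_odds_ratio(tp, fp, pos, neg, smooth=0.5):
    """log10 of the smoothed odds ratio between the two classes."""
    fn, tn = pos - tp, neg - fp
    pos_odds = (tp + smooth) / (fn + smooth)
    neg_odds = (fp + smooth) / (tn + smooth)
    return math.log10(pos_odds / neg_odds)

# 'y' occurs at the same 50% rate in both classes -> LOR = 0,
# matching the 0.00 the table reports for uninformative features.
print(log_odds_ratio(tp=15, fp=200, pos=30, neg=400))  # 0.0
```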
Experiments – Accuracy & F-measure

Experiments – Precision vs. Recall

Experiments – The Effect of Class Distribution

Experiments – Comparison with Other Scoring Metrics

Experiments – Feature Selection + Feature Scaling

(The result figures from these slides are not reproduced here.)
Conclusions
- BNS measures the difference between a feature's occurrence rate in the positive class and its rate in the negative class.
- Use IG for feature selection + BNS for feature scaling.
- No need for feature selection: it is better to use all features for the best performance.
- Better to simply use all binary features.
Comments
Advantage
- The idea is clear: take the class distribution into account.
Drawbacks
- Restricted to the 2-class problem.
- Using all features takes time.
Application
- Could be used in place of IDF.