OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization. Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, Weiguo Fan, et al.


1 OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization. Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, Weiguo Fan, et al. Microsoft Research Asia; Peking University; Tsinghua University; Chinese University of Hong Kong; Virginia Polytechnic Institute and State University

2 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

3 Motivation: Dimension reduction (DR) is highly desirable for web-scale text data. DR can improve both efficiency and effectiveness. Feature selection (FS) is more applicable than feature extraction (FE). Most FS algorithms are greedy. Goal: a simple, effective, efficient, and optimal FS algorithm.

4 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

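5 Problem Formulation: Dimension reduction maps a document vector $x \in \mathbb{R}^{d}$ to $y \in \mathbb{R}^{p}$ with $p \ll d$. Consider the linear case: suppose $y = W^{T} x$, where $W \in \mathbb{R}^{d \times p}$.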

6 Problem Formulation: Given a set of labeled training documents X, learn a transformation matrix W that is optimal according to some criterion J(W) within the solution space. For feature selection (FS), the solution space is discrete.
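The formulas on this slide are not in the transcript. One way to write the FS problem that is consistent with the text is sketched below. The symbols $H^{d \times p}$, $w_{ik}$, and the use of $\arg\max$ (reading "optimal" as "maximal") are assumptions, and $W$ is the $d \times p$ transformation matrix.

```latex
\begin{align*}
% Assumed discrete solution space: 0/1 matrices whose columns each pick one feature.
H^{d \times p} &= \Bigl\{ W \in \{0,1\}^{d \times p} \;:\; \sum_{i=1}^{d} w_{ik} = 1
                  \ \text{for } k = 1, \dots, p \Bigr\} \\
% FS searches this discrete space; FE would search all of R^{d x p} instead.
W^{*} &= \arg\max_{W \in H^{d \times p}} J(W)
\end{align*}
```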

7 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

8 Related Works – IG: Information gain (IG) aims to select a group of optimal features by scoring each term against the class labels. Finding the global optimum is an NP problem, so IG is computed greedily.
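The IG formula itself is missing from the transcript. As a rough illustration, the sketch below uses the textbook definition for binary term presence, IG(t) = H(C) - P(t) H(C|t) - P(not t) H(C|not t). The function names and the dense NumPy input are assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def _entropy(p):
    # Shannon entropy (bits) along axis 0, treating 0 * log(0) as 0.
    logp = np.log2(p, out=np.zeros_like(p), where=p > 0)
    return -(p * logp).sum(axis=0)

def information_gain(X, y):
    """IG of every term, scored independently (the greedy shortcut).

    X: (n_docs, n_terms) array, nonzero where a term occurs in a document.
    y: (n_docs,) array of class labels.
    """
    present = (np.asarray(X) != 0).astype(float)
    y = np.asarray(y)
    n = present.shape[0]
    onehot = (y[:, None] == np.unique(y)[None, :]).astype(float)

    n_c = onehot.sum(axis=0)           # documents per class, shape (c,)
    n_t = present.sum(axis=0)          # documents containing each term, shape (d,)
    n_ct = onehot.T @ present          # class-by-term co-occurrence counts, (c, d)
    n_cnt = n_c[:, None] - n_ct        # class-by-term counts where the term is absent

    p_t = n_t / n
    # Class distributions given presence / absence; the maximum() guards
    # against division by zero for terms found in no or all documents.
    p_c_given_t = n_ct / np.maximum(n_t, 1)
    p_c_given_not_t = n_cnt / np.maximum(n - n_t, 1)

    return (_entropy(n_c / n)
            - p_t * _entropy(p_c_given_t)
            - (1 - p_t) * _entropy(p_c_given_not_t))
```

Each term is scored on its own and the top-scoring terms are kept, which is what makes the procedure greedy rather than a search over feature subsets.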

9 Related Works – CHI: The chi-square statistic (CHI) aims to select a group of features by scoring the dependence between each term and each class.
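The CHI formulas are likewise missing from the transcript. The sketch below uses the standard 2x2 contingency form, chi2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing t inside and outside class c, and C and D count documents lacking t inside and outside class c. The function name and the choice of how to combine per-class scores are assumptions.

```python
import numpy as np

def chi_square(X, y):
    """Per-class chi-square score of every term; returns an array of shape (c, d)."""
    present = (np.asarray(X) != 0).astype(float)
    y = np.asarray(y)
    n = present.shape[0]
    onehot = (y[:, None] == np.unique(y)[None, :]).astype(float)

    a = onehot.T @ present                    # in class, term present
    b = present.sum(axis=0)[None, :] - a      # outside class, term present
    c = onehot.sum(axis=0)[:, None] - a       # in class, term absent
    d = n - a - b - c                         # outside class, term absent

    num = n * (a * d - c * b) ** 2
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # Degenerate cells (a zero marginal) get a score of 0 instead of NaN.
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
```

A single score per term can then be obtained by, for example, taking the maximum over classes with `chi_square(X, y).max(axis=0)`; averaging weighted by class prior is another common choice.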

10 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

11 Orthogonal Centroid Algorithm: Orthogonal centroid (OC) is an FE algorithm that is effective for DR in text classification problems. Its computation is based on QR matrix decomposition. Theorem: the solution of the OC algorithm equals the solution of an optimization problem for a criterion J.
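The optimization problem in the theorem is not in the transcript. Below is a hedged reconstruction, assuming the criterion is the trace of the projected between-class scatter, as it is usually stated for OC. Here $m_j$ and $n_j$ are the centroid and size of class $j$, $m$ is the centroid of all $n$ training documents, and $c$ is the number of classes.

```latex
\begin{align*}
W^{*} &= \arg\max_{W} J(W)
       = \arg\max_{W} \operatorname{trace}\bigl(W^{T} S_b W\bigr)
         \quad \text{subject to } W^{T} W = I, \\
S_b   &= \sum_{j=1}^{c} \frac{n_j}{n}\,(m_j - m)(m_j - m)^{T}
\end{align*}
```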

12 Intuition of Our Work: OC from the FE perspective.

13 Intuition of Our Work: OC from the FS perspective: how can J be optimized in the discrete space (by FS) rather than in the continuous space (by FE)?

14 The OCFS Algorithm: FS problem: suppose we want to preserve the m-th and n-th features of a document vector and discard the others.
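Keeping only two features can be written as a linear map with a 0/1 matrix, which is what lets FS be treated as a special case of the linear DR problem. A minimal sketch, where the helper name `selection_matrix` and the example vector are hypothetical:

```python
import numpy as np

def selection_matrix(d, keep):
    """Build a d x p 0/1 matrix whose k-th column selects feature keep[k]."""
    W = np.zeros((d, len(keep)))
    W[keep, np.arange(len(keep))] = 1.0
    return W

x = np.array([0.5, 0.0, 2.0, 1.5, 0.3])    # hypothetical 5-term document vector
W = selection_matrix(d=5, keep=[1, 3])     # preserve features m=1 and n=3 (0-based)
y = W.T @ x                                # same values as x[[1, 3]]
```

Each column of `W` contains a single 1, so `W.T @ x` copies the chosen coordinates and discards the rest.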

15 The OCFS Algorithm Optimization:
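The derivation on this slide is not in the transcript. The sketch below assumes that $J(W) = \operatorname{trace}(W^{T} S_b W)$ with $S_b = \sum_{j=1}^{c} \frac{n_j}{n}(m_j - m)(m_j - m)^{T}$, and that $W$ is a 0/1 selection matrix whose columns $w_i$ pick features $k_1, \dots, k_p$. Under those assumptions the criterion decomposes into per-feature terms:

```latex
\begin{align*}
\operatorname{trace}\bigl(W^{T} S_b W\bigr)
  &= \sum_{i=1}^{p} w_i^{T} S_b\, w_i
   = \sum_{i=1}^{p} (S_b)_{k_i k_i} \\
  &= \sum_{i=1}^{p} \sum_{j=1}^{c} \frac{n_j}{n}\,\bigl(m_j^{k_i} - m^{k_i}\bigr)^{2}
\end{align*}
```

Here $m_j^{k}$ denotes the $k$-th coordinate of the centroid of class $j$. Because every term is non-negative and depends on one feature only, the sum is maximized by choosing the $p$ features with the largest individual terms.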

16 The OCFS Algorithm: Solution: take the p largest feature scores.

17 The OCFS Algorithm
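The algorithm listing for this slide is not in the transcript. The sketch below assumes the score of feature i is s(i) = sum over classes j of (n_j / n) (m_j[i] - m[i])^2, that is, the class-size-weighted squared distance between each class centroid and the global centroid. The function names and the dense NumPy input are illustrative choices, not the authors' code.

```python
import numpy as np

def ocfs_scores(X, y):
    """Score every feature by the weighted squared centroid spread (assumed form)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n = X.shape[0]
    m = X.mean(axis=0)                       # centroid of all training documents
    scores = np.zeros(X.shape[1])
    for label in np.unique(y):
        rows = X[y == label]
        m_j = rows.mean(axis=0)              # centroid of this class
        scores += (rows.shape[0] / n) * (m_j - m) ** 2
    return scores

def ocfs_select(X, y, p):
    """Return the indices of the p highest-scoring features, best first."""
    scores = ocfs_scores(X, y)
    order = np.argsort(-scores, kind="stable")   # descending, ties keep index order
    return order[:p]
```

The loop visits each class once and touches each feature once per class, which matches the O(cd) cost quoted in the analysis once the centroids are available.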

18 Algorithm Analysis: The number of selected features is chosen subject to a constraint on an energy function of the feature scores.
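The energy function is not in the transcript. One plausible reading, assumed here, is E(p) = (sum of the p largest scores) / (sum of all scores), with p chosen as the smallest value for which E(p) reaches a threshold T. The function name and the default threshold of 0.8 are assumptions for illustration.

```python
import numpy as np

def choose_num_features(scores, threshold=0.8):
    """Smallest p whose top-p scores hold at least `threshold` of the total energy."""
    ordered = np.sort(np.asarray(scores, dtype=float))[::-1]   # largest first
    total = ordered.sum()
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    energy = np.cumsum(ordered) / total          # E(1), E(2), ..., E(d)
    # First position where the cumulative energy reaches the threshold.
    p = int(np.searchsorted(energy, threshold, side="left")) + 1
    return min(p, ordered.size)                  # guard against rounding at T = 1.0
```

For scores `[5, 3, 1, 1]` the cumulative energies are 0.5, 0.8, 0.9, and 1.0, so a threshold of 0.75 selects two features.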

19 Algorithm Analysis: Complexity: the time complexity is O(cd), where c is the number of classes and d the number of features. OCFS only computes simple squares, rather than costlier functions such as the logarithms in IG.

20 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

21 Experiments Setup: Datasets: 20 Newsgroups (5 classes; 5,000 documents; 131,072 dimensions), Reuters Corpus Volume 1 (4 classes; 800,000 documents; 500,000 dimensions), Open Directory Project (13 classes). Baselines: IG and CHI. Performance measures: CPU runtime and F1. Classifier: SVM (SMO).

22 Experimental Results – 20NG [figures: F1 and CPU runtime on 20NG]

23 Experimental Results – 20NG

24 Experimental Results – RCV1 [figures: F1 and CPU runtime on RCV1]

25 Experimental Results – ODP [figure: F1 on ODP]

26 Results Analysis: OCFS performs better than IG and CHI while taking only half the time. The advantage is largest when the number of selected features is small: at low dimension, the optimal method outperforms the greedy ones. As the number of selected features increases, feature saturation makes each additional feature less valuable.

27 Outline: Motivation; Problem formulation; Related works; The OCFS algorithm; Experiments; Conclusion and future works

28 Conclusion: We proposed a novel, efficient, and effective feature selection algorithm for text categorization. Main advantages: optimality, better performance, and higher efficiency.

29 Future Works: handle unbalanced data; combine OCFS with other approaches, e.g. OCFS + PCA.

30 The End Thanks! Q & A

