OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization
Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan
Microsoft Research Asia / Peking University / Tsinghua University / Chinese University of Hong Kong / Virginia Polytechnic Institute and State University
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Motivation
Dimension reduction (DR) is highly desired for web-scale text data
DR can improve both efficiency and effectiveness
Feature selection (FS) is more applicable than feature extraction (FE)
Most FS algorithms are greedy
Goal: a simple, effective, efficient, and optimal FS algorithm
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Problem Formulation
Dimension reduction: map each document x in R^d to y in R^p, with p << d
Consider the linear case: y = W^T x, where W in R^{d x p}
Problem Formulation
Denote the discrete solution space H = { W in {0,1}^{d x p} : each column of W contains exactly one 1, and the columns are pairwise orthogonal }
Given a set of labeled training documents X, learn a transformation matrix W that is optimal according to some criterion J in the space H
The FS problem: W* = arg max_{W in H} J(W)
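A minimal sketch (not from the slides) of what a matrix in the discrete space H does: with exactly one 1 per column, y = W^T x simply picks p of the d original coordinates, which is why optimizing over H is feature selection. The index values are illustrative.

```python
import numpy as np

d, p = 5, 2
selected = [1, 3]            # hypothetical indices of the kept features

# build a 0/1 selection matrix: one 1 per column, columns orthogonal
W = np.zeros((d, p))
for col, idx in enumerate(selected):
    W[idx, col] = 1.0

x = np.arange(d, dtype=float)      # x = [0, 1, 2, 3, 4]
y = W.T @ x                        # picks coordinates 1 and 3
print(y)                           # [1. 3.]
```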
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Related Works – IG
Information gain aims to select a group of optimal features by the gain
IG(t) = -Σ_i P(c_i) log P(c_i) + P(t) Σ_i P(c_i|t) log P(c_i|t) + P(t̄) Σ_i P(c_i|t̄) log P(c_i|t̄)
Finding the globally optimal feature group is NP-hard, so IG is computed greedily, one feature at a time
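A hedged sketch of the standard information-gain score for one term and a binary class, computed from the four cells of its document contingency table; the function and argument names are mine, not the paper's.

```python
import math

def info_gain(n_tc, n_t_notc, n_nott_c, n_nott_notc):
    """IG of term t for binary class c from document counts:
    n_tc = docs with t in c, n_t_notc = with t not in c,
    n_nott_c = without t in c, n_nott_notc = without t not in c."""
    n = n_tc + n_t_notc + n_nott_c + n_nott_notc

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # class prior entropy H(C)
    p_c = (n_tc + n_nott_c) / n
    h_c = entropy([p_c, 1 - p_c])

    # conditional entropy H(C | t present) and H(C | t absent)
    p_t = (n_tc + n_t_notc) / n
    h_t = entropy([n_tc / (n_tc + n_t_notc), n_t_notc / (n_tc + n_t_notc)])
    h_nott = entropy([n_nott_c / (n_nott_c + n_nott_notc),
                      n_nott_notc / (n_nott_c + n_nott_notc)])
    return h_c - (p_t * h_t + (1 - p_t) * h_nott)

# a term that perfectly predicts the class gains the full H(C) = 1 bit
print(info_gain(50, 0, 0, 50))   # 1.0
```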
Related Works – CHI
CHI aims to select a group of features by the χ² statistic between term t and class c:
χ²(t, c) = N (AD − CB)² / ((A+C)(B+D)(A+B)(C+D))
where A, B, C, D count documents with/without t inside/outside c, N = A+B+C+D, and per-term scores are combined over classes, e.g. χ²_max(t) = max_i χ²(t, c_i)
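The formula above can be sketched directly from the 2x2 table; this is a generic implementation of the standard χ² statistic, not code from the paper.

```python
def chi_square(a, b, c, d):
    """chi^2 of term t vs. class c from the 2x2 document table:
    a = docs with t in c,    b = docs with t not in c,
    c = docs without t in c, d = docs without t not in c."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

print(chi_square(25, 25, 25, 25))   # 0.0  (term independent of class)
print(chi_square(50, 0, 0, 50))     # 100.0 (term fully determines class)
```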
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Orthogonal Centroid Algorithm
Orthogonal Centroid (OC): an FE algorithm, effective for DR of text classification problems
Computation is based on QR matrix decomposition
Theorem: the solution of the OC algorithm equals the solution of the optimization problem
W* = arg max_{W^T W = I} trace(W^T S_b W),
where S_b = Σ_j (n_j/n)(m_j − m)(m_j − m)^T is the between-class scatter matrix (m_j: centroid of class j with n_j samples; m: global centroid of all n samples)
Intuition of Our Work
OC from the FE perspective: maximize J(W) = trace(W^T S_b W) subject to W^T W = I, where W ranges over continuous orthonormal matrices
Intuition of Our Work
OC from the FS perspective: optimize the same criterion J in the discrete space
By FE: solve over all orthonormal W (continuous)
By FS: solve over the 0/1 selection matrices in H (discrete)
The OCFS Algorithm
FS problem: suppose we want to preserve the m-th and n-th features of x and discard the others; then W in H has two columns, with single ones in rows m and n respectively, so W^T x = (x_m, x_n)^T
The OCFS Algorithm
Optimization: W* = arg max_{W in H} trace(W^T S_b W)
For a selection matrix W, trace(W^T S_b W) = Σ_{i in selected} s(i), with the per-feature score s(i) = Σ_j (n_j/n)(m_j^i − m^i)²
The OCFS Algorithm
Solution: take the p largest values of the feature score s(i) = Σ_j (n_j/n)(m_j^i − m^i)²
OCFS reduces optimal feature selection to ranking all features by s(i)
The OCFS Algorithm
Step 1: compute the centroid m_j of each class j in the training data
Step 2: compute the global centroid m
Step 3: compute the feature score s(i) = Σ_j (n_j/n)(m_j^i − m^i)² for every feature i
Step 4: select the p features with the largest scores
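The scoring rule described above can be sketched in a few lines; this is a minimal illustration under my own variable names, not the authors' implementation.

```python
import numpy as np

def ocfs(X, y, p):
    """Score each feature by sum_j (n_j/n) * (m_j[i] - m[i])^2 and
    return the indices of the p highest-scoring features."""
    n, d = X.shape
    m = X.mean(axis=0)                       # global centroid
    scores = np.zeros(d)
    for cls in np.unique(y):
        Xj = X[y == cls]
        mj = Xj.mean(axis=0)                 # class centroid
        scores += (len(Xj) / n) * (mj - m) ** 2
    return np.argsort(scores)[::-1][:p], scores

# toy data: feature 0 separates the classes, feature 1 is pure noise
X = np.array([[5.0, 1.0], [6.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
y = np.array([0, 0, 1, 1])
keep, s = ocfs(X, y, p=1)
print(keep)    # [0]
```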
Algorithm Analysis
The number of selected features p can be chosen subject to E(p) ≥ α, where the energy function E(p) = Σ_{i=1}^{p} s_(i) / Σ_{i=1}^{d} s_(i) (scores sorted in decreasing order) and α is a user-given threshold
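A hedged sketch of this energy criterion: given the feature scores, pick the smallest p whose top-p scores reach a fraction alpha of the total score mass. The function name and threshold value are mine.

```python
import numpy as np

def choose_p(scores, alpha=0.9):
    """Smallest p such that the top-p scores hold >= alpha of total energy."""
    s = np.sort(scores)[::-1]            # scores in decreasing order
    energy = np.cumsum(s) / s.sum()      # E(p) for p = 1..d
    return int(np.searchsorted(energy, alpha) + 1)

scores = np.array([6.0, 2.0, 1.0, 0.5, 0.5])
print(choose_p(scores, alpha=0.8))   # 2  (6 + 2 = 8 of 10 -> 80%)
```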
Algorithm Analysis
Complexity: the time complexity of scoring is O(cd) for c classes and d features
OCFS only computes simple squares, instead of costlier functions such as the logarithms required by IG
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Experiments Setup
Datasets: 20 Newsgroups (5 classes; 5,000 documents; 131,072 dimensions), Reuters Corpus Volume 1 (4 classes; 800,000 documents; 500,000 dimensions), Open Directory Project (13 classes)
Baselines: IG and CHI
Performance measurements: F1 and CPU runtime
Classifier: SVM (SMO)
Experimental Results – 20NG
[Figures: F1 and CPU runtime on 20NG]
Experimental Results – RCV1
[Figure: F1 and CPU runtime on RCV1]
Experimental Results – ODP
[Figure: F1 on ODP]
Results Analysis
OCFS performs better than IG and CHI while using only about half their CPU time
The advantage is largest when the selected dimension is small: with few features, an optimal method clearly outperforms greedy ones
As more features are selected, the feature set saturates and additional features add little value, so the gap narrows
Outline Motivation Problem formulation Related works The OCFS algorithm Experiments Conclusion and future works
Conclusion
We proposed a novel, efficient, and effective feature selection algorithm for text categorization
Main advantages: optimal; better performance; more efficient
Future Works
Handle unbalanced data
Combine with other approaches, e.g., OCFS + PCA
The End Thanks! Q & A