OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization
Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, and Weiguo Fan
Microsoft Research Asia; Peking University; Tsinghua University; Chinese University of Hong Kong; Virginia Polytechnic Institute and State University
Outline
- Motivation
- Problem formulation
- Related work
- The OCFS algorithm
- Experiments
- Conclusion and future work
Motivation
- Dimension reduction (DR) is highly desirable for web-scale text data.
- DR can improve both the efficiency and the effectiveness of text categorization.
- Feature selection (FS) is more applicable than feature extraction (FE) at this scale.
- Most existing FS algorithms are greedy.
- Goal: a simple, effective, efficient, and optimal FS algorithm.
Problem Formulation
Dimension reduction maps each document $x \in \mathbb{R}^d$ to a low-dimensional representation $y \in \mathbb{R}^p$ with $p \ll d$. In the linear case, $y = W^T x$ for some transformation matrix $W \in \mathbb{R}^{d \times p}$.
Problem Formulation
For feature selection, $W$ is restricted to the discrete solution space
$\mathcal{H} = \{ W \in \{0,1\}^{d \times p} : \text{the columns of } W \text{ are } p \text{ distinct standard basis vectors } e_i \}$,
so that $y = W^T x$ simply keeps $p$ of the $d$ original features. Given a set of labeled training documents $X$, the FS problem is to learn a transformation matrix that is optimal according to some criterion $J$ over this space:
$W^* = \arg\max_{W \in \mathcal{H}} J(W)$.
Related Work – IG
Information gain (IG) scores each term $t$ by
$IG(t) = -\sum_{i=1}^{c} P(c_i)\log P(c_i) + P(t)\sum_{i=1}^{c} P(c_i \mid t)\log P(c_i \mid t) + P(\bar{t})\sum_{i=1}^{c} P(c_i \mid \bar{t})\log P(c_i \mid \bar{t})$
and keeps the top-scoring terms. Finding the globally optimal feature subset is NP-hard, so IG is computed greedily, ranking features one at a time.
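For concreteness, a minimal sketch of this score (the helper name is ours; it assumes boolean NumPy arrays for term presence and integer labels). It uses the equivalent form $H(C) - H(C \mid t)$, which rearranges to the formula above:

```python
import numpy as np

def information_gain(presence, y):
    """IG of a binary term-presence vector w.r.t. class labels y,
    computed as H(C) minus the conditional entropy given the term."""
    def entropy(labels):
        if labels.size == 0:
            return 0.0
        _, counts = np.unique(labels, return_counts=True)
        q = counts / counts.sum()
        return -np.sum(q * np.log2(q))
    p_t = presence.mean()          # P(t): fraction of docs containing t
    return entropy(y) - (p_t * entropy(y[presence])
                         + (1 - p_t) * entropy(y[~presence]))
```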
Related Work – CHI
CHI scores each term $t$ against each class $c_i$ with the $\chi^2$ statistic
$\chi^2(t, c_i) = \frac{N (AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$,
where $A$ ($B$) counts documents in (not in) $c_i$ that contain $t$, $C$ ($D$) counts documents in (not in) $c_i$ that do not contain $t$, and $N = A + B + C + D$. Features are then selected by their maximum or class-averaged $\chi^2$ scores.
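A short sketch of the per-term, per-class statistic (the helper name is ours; boolean NumPy inputs assumed):

```python
import numpy as np

def chi_square(presence, in_class):
    """chi^2 statistic for one term and one class from the 2x2 table."""
    A = int(np.sum(presence & in_class))     # in class, term present
    B = int(np.sum(presence & ~in_class))    # other classes, term present
    C = int(np.sum(~presence & in_class))    # in class, term absent
    D = int(np.sum(~presence & ~in_class))   # other classes, term absent
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return 0.0 if denom == 0 else N * (A * D - C * B) ** 2 / denom
```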
Orthogonal Centroid Algorithm
Orthogonal centroid (OC) is a feature extraction algorithm that is effective for dimension reduction in text classification; its computation is based on QR matrix decomposition.
Theorem: the solution of the OC algorithm equals the solution of the optimization problem
$W^* = \arg\max_{W^T W = I} \operatorname{trace}(W^T S_b W)$,
where $S_b = \sum_{j=1}^{c} \frac{n_j}{n} (m_j - m)(m_j - m)^T$ is the between-class scatter matrix, $m_j$ is the centroid of the $n_j$ training documents in class $j$, and $m$ is the global centroid.
Intuition of Our Work
From the FE perspective, OC maximizes $J(W) = \operatorname{trace}(W^T S_b W)$ over the continuous space of orthonormal matrices $W$.
Intuition of Our Work
From the FS perspective, the question becomes: how do we optimize the same criterion $J$ in discrete space? FE searches the continuous space $\{W : W^T W = I\}$; FS searches the discrete space $\mathcal{H}$ of feature-selection matrices.
The OCFS Algorithm
FS problem: suppose we want to preserve the $m$-th and $n$-th features of $x$ (so $p = 2$) and discard the others. The corresponding matrix is $W = [e_m, e_n]$, whose columns are the standard basis vectors of the two preserved features.
The OCFS Algorithm
Optimization: restricted to $\mathcal{H}$, the objective decomposes over the selected features. Writing $K$ for the set of selected indices,
$J(W) = \operatorname{trace}(W^T S_b W) = \sum_{i \in K} \sum_{j=1}^{c} \frac{n_j}{n} (m_j^i - m^i)^2$,
where $m_j^i$ and $m^i$ are the $i$-th components of the class and global centroids. Maximizing $J$ therefore reduces to picking the features with the largest per-feature terms (see the check below).
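A self-contained numeric sanity check of this decomposition on toy data (all sizes hypothetical): for a selection matrix $W$, the trace objective equals the sum of the selected diagonal entries of $S_b$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 6))                  # 20 documents, 6 features
y = rng.integers(0, 3, size=20)          # 3 classes
n, d = X.shape
m = X.mean(axis=0)                       # global centroid
Sb = np.zeros((d, d))
for c in np.unique(y):
    Xc = X[y == c]
    diff = Xc.mean(axis=0) - m           # class centroid minus global
    Sb += (len(Xc) / n) * np.outer(diff, diff)
sel = [0, 2, 5]
W = np.eye(d)[:, sel]                    # keep features 0, 2, 5
assert np.isclose(np.trace(W.T @ Sb @ W), Sb.diagonal()[sel].sum())
```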
The OCFS Algorithm
Solution: score each feature by $s(i) = \sum_{j=1}^{c} \frac{n_j}{n} (m_j^i - m^i)^2$ and keep the features whose scores are the $p$ largest ones. OCFS:
1. Compute the centroid $m_j$ of each class $j = 1, \dots, c$ and the global centroid $m$ of all training documents.
2. Compute the score $s(i)$ for every feature $i = 1, \dots, d$.
3. Select the indices of the $p$ largest scores as the feature subset $K$ (a code sketch follows).
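A minimal dense-matrix sketch of these three steps (the function name is ours; the paper's web-scale corpora would in practice call for sparse matrices):

```python
import numpy as np

def ocfs_select(X, y, p):
    """OCFS: score feature i by sum_j (n_j / n) * (m_j^i - m^i)^2 and
    return the indices of the p highest-scoring features.

    X: (n, d) document-term matrix; y: (n,) integer class labels.
    """
    n, d = X.shape
    m = X.mean(axis=0)                    # step 1: global centroid
    scores = np.zeros(d)
    for c in np.unique(y):                # step 1: class centroids
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        scores += (len(Xc) / n) * (mc - m) ** 2   # step 2: scores
    return np.argsort(scores)[-p:][::-1]          # step 3: p largest
```

For example, `selected = ocfs_select(X_train, y_train, p=1000)` would return column indices into the document-term matrix.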
Algorithm Analysis
The number of selected features $p$ can be chosen subject to an energy criterion
$E(K) = \frac{\sum_{i \in K} s(i)}{\sum_{i=1}^{d} s(i)} \geq \lambda$,
where the energy function $E$ measures the fraction of the total score preserved by the selected subset $K$ and $\lambda$ is a user-chosen threshold (analogous to preserved variance in PCA).
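A small sketch of this rule (function and threshold names are our own):

```python
import numpy as np

def choose_p(scores, lam=0.9):
    """Smallest p whose top-p scores keep at least a `lam` fraction
    of the total energy sum_i s(i)."""
    s = np.sort(scores)[::-1]            # scores in decreasing order
    energy = np.cumsum(s) / s.sum()      # preserved-energy ratio E
    return int(np.searchsorted(energy, lam) + 1)
```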
Algorithm Analysis
Complexity: scoring all features takes $O(cd)$ time for $c$ classes and $d$ features. Moreover, OCFS computes only a simple square per feature, rather than more expensive functional computations such as the logarithms required by IG.
Experiments Setup
- Datasets: 20 Newsgroups (5 classes; 5,000 documents; 131,072 features), Reuters Corpus Volume 1 (4 classes; 800,000 documents; 500,000 features), Open Directory Project (13 classes)
- Baselines: IG and CHI
- Performance measures: F1 and CPU runtime
- Classifier: SVM trained with SMO (see the sketch below)
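The deck does not show the evaluation code; below is a hedged sketch of one plausible loop using scikit-learn. The assumptions are ours: `SVC`'s libsvm backend stands in for the SMO trainer, and micro-averaged F1 stands in for the unspecified F1 variant.

```python
from sklearn.metrics import f1_score
from sklearn.svm import SVC

def evaluate(X_train, y_train, X_test, y_test, selected):
    """Train a linear SVM on the selected features, report micro-F1."""
    clf = SVC(kernel="linear")           # libsvm's SMO-style solver
    clf.fit(X_train[:, selected], y_train)
    pred = clf.predict(X_test[:, selected])
    return f1_score(y_test, pred, average="micro")
```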
Experimental Results – 20NG
[Figures: F1 and CPU runtime versus number of selected features on 20 Newsgroups; plots omitted]
Experimental Results – RCV1
[Figures: F1 and CPU runtime versus number of selected features on RCV1; plots omitted]
Experimental Results – ODP
[Figure: F1 versus number of selected features on ODP; plot omitted]
Results Analysis
- OCFS achieves better F1 than IG and CHI while using only about half the CPU time.
- The advantage is largest when the number of selected features is small: at low dimension, the optimal selection clearly outperforms greedy selection.
- As more features are selected, feature saturation sets in and additional features add less value, so the gap narrows.
Conclusion
We proposed a novel, efficient, and effective feature selection algorithm for text categorization. Main advantages:
- optimal with respect to the orthogonal centroid criterion
- better classification performance
- more efficient than IG and CHI
Future Work
- Handle unbalanced data.
- Combine with other approaches, e.g., OCFS + PCA.
The End. Thanks! Q & A