Generalized and Heuristic-Free Feature Construction for Improved Accuracy
Wei Fan ‡, Erheng Zhong †, Jing Peng *, Olivier Verscheure ‡, Kun Zhang §, Jiangtao Ren †, Rong Yan ‡ and Qiang Yang ¶
‡ IBM T. J. Watson Research Center
† Sun Yat-Sen University
* Montclair State University
§ Xavier University of Louisiana
Facebook, Inc
¶ Hong Kong University of Science and Technology
- Feature construction works when the original feature pool is not good enough (feature selection won't work)
- Too many choices to construct
- Evaluate on a local space, not always on all the data points
- Better automated

Feature Construction -- Example
XOR-like problem
- Original features: not linearly separable; both features are needed to construct a "cross" model
- With a constructed feature: linearly separable; one feature, F3, is enough
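The XOR-like example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the slide does not say how F3 is built, so the product F3 = F1 * F2 is an assumption, and `make_xor_like` and `best_threshold_accuracy` are hypothetical helper names.

```python
import random

def make_xor_like(n, seed=0):
    """Sample n points in [-1, 1]^2; class 1 when F1 and F2 share a sign."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        f1 = rng.uniform(-1.0, 1.0)
        f2 = rng.uniform(-1.0, 1.0)
        points.append((f1, f2, 1 if f1 * f2 > 0 else 0))
    return points

def best_threshold_accuracy(values, labels):
    """Best accuracy of any single-threshold rule on one feature."""
    n = len(values)
    best = 0.0
    for t in set(values):
        hits = sum((v >= t) == (y == 1) for v, y in zip(values, labels))
        best = max(best, hits / n, (n - hits) / n)  # allow either polarity
    return best

data = make_xor_like(200)
labels = [y for _, _, y in data]
acc_f1 = best_threshold_accuracy([f1 for f1, _, _ in data], labels)
acc_f2 = best_threshold_accuracy([f2 for _, f2, _ in data], labels)
# Assumed constructed feature: the product of the two originals.
acc_f3 = best_threshold_accuracy([f1 * f2 for f1, f2, _ in data], labels)
print(f"F1 alone: {acc_f1:.2f}  F2 alone: {acc_f2:.2f}  F3 = F1*F2: {acc_f3:.2f}")
```

A single threshold on F1 or F2 stays close to chance, while a single threshold on the constructed product separates the two classes.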

Main Challenges
To address these, we have 3 main steps:
1. Too many ways to construct new features (e.g. x-y, x/y, and other binary operators applied to a pair x, y): Divide and Conquer
2. A feature can be insignificant on the whole data set yet highly discriminant in a local region: Local Feature Construction and Evaluation
3. Construction should be automated, not based on domain knowledge: Automatically adjusted weighting rules
- With 4 binary operators and 1000 original features, the number of candidate constructed features is on the order of millions
- F2 is not very useful unless considered together with F1
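The second challenge, a feature that looks insignificant globally but is highly discriminant locally, can be made concrete with a small information-gain computation. The data set below is synthetic and invented purely for illustration, and `entropy` and `info_gain` are hypothetical helpers rather than anything from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, threshold):
    """Gain of splitting rows (f1, f2, label) on rows[feature] > threshold."""
    labels = [r[-1] for r in rows]
    left = [r[-1] for r in rows if r[feature] <= threshold]
    right = [r[-1] for r in rows if r[feature] > threshold]
    n = len(rows)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

rows = []
# Region A (F1 < 0): the label is fully determined by the sign of F2.
rows += [(-1.0, -1.0, 0)] * 5 + [(-1.0, 1.0, 1)] * 5
# Region B (F1 >= 0): the label is unrelated to F2.
for f2 in (-1.0, 1.0):
    for y in (0, 1):
        rows += [(1.0, f2, y)] * 45

global_gain = info_gain(rows, 1, 0.0)                  # F2 on all data
local_rows = [r for r in rows if r[0] < 0]
local_gain = info_gain(local_rows, 1, 0.0)             # F2 inside region A
print(f"global gain: {global_gain:.4f}, local gain: {local_gain:.4f}")
```

Evaluated on all 190 points, F2 has almost no information gain; restricted to region A, the same split is perfect. This is the motivation for evaluating constructed features on local subsets.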

Divide-Conquer: Local Feature Construction and Evaluation
Stopping criteria:
1. The number of instances in the node is smaller than a threshold
2. The node only contains examples from one class
Constructed features (original + new)
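The divide-and-conquer loop with its two stopping criteria might look roughly like the sketch below. It is a simplification under stated assumptions: the operator set (`add`, `sub`, `mul`) is assumed, every feature pair is tried at each node instead of a random subset, and `divide_and_conquer`, `best_local_feature` and `min_size` are hypothetical names.

```python
import math
from collections import Counter
from itertools import combinations

# Assumed operator set; the slides only say "binary operators".
OPERATORS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, t):
    """Information gain of splitting on value < t versus value >= t."""
    left = [y for v, y in zip(values, labels) if v < t]
    right = [y for v, y in zip(values, labels) if v >= t]
    n = len(labels)
    return (entropy(labels) - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

def best_local_feature(rows, labels):
    """Best (gain, operator, i, j, threshold) on this node's data only."""
    best = None
    for i, j in combinations(range(len(rows[0])), 2):
        for name, op in OPERATORS.items():
            values = [op(r[i], r[j]) for r in rows]
            for t in sorted(set(values))[1:]:  # keeps both sides non-empty
                gain = split_gain(values, labels, t)
                if best is None or gain > best[0]:
                    best = (gain, name, i, j, t)
    return best

def divide_and_conquer(rows, labels, min_size=4, constructed=None):
    constructed = [] if constructed is None else constructed
    # Stopping criteria: node too small, or only one class left.
    if len(rows) < min_size or len(set(labels)) <= 1:
        return constructed
    best = best_local_feature(rows, labels)
    if best is None or best[0] <= 0.0:
        return constructed
    _, name, i, j, t = best
    constructed.append((name, i, j))
    op = OPERATORS[name]
    for side in (True, False):  # recurse into the two child nodes
        idx = [k for k, r in enumerate(rows) if (op(r[i], r[j]) < t) == side]
        divide_and_conquer([rows[k] for k in idx], [labels[k] for k in idx],
                           min_size, constructed)
    return constructed

rows = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)] * 3
labels = [1, 0, 0, 1] * 3
features = divide_and_conquer(rows, labels)
print(features)
```

On this tiny XOR-like input the root node picks the product of the two features, and both children are pure, so the recursion stops after one constructed feature.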

At every node:
1. Random subset of original features
2. "Weighted random" subset of operators, drawn according to the weighting rule

Weighting rule: an operator's weight is proportional to the information gain of the features constructed with that operator, computed as the sum of its past information gains.
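One way to realize such a weighting rule is sketched below. The class `OperatorWeights`, its method names and the positive starting weight `prior` are all assumptions made for illustration; the slides only state that the weight is proportional to the accumulated information gain.

```python
import random

class OperatorWeights:
    """Tracks accumulated info gain per operator and samples by weight."""

    def __init__(self, names, prior=1.0):
        # `prior` is an assumption: a positive starting weight so that
        # operators with no history can still be drawn.
        self.gain_sum = {name: prior for name in names}

    def record(self, name, gain):
        """Add the info gain of a feature built with operator `name`."""
        self.gain_sum[name] += gain

    def probabilities(self):
        """Selection probability of each operator, proportional to its weight."""
        total = sum(self.gain_sum.values())
        return {name: g / total for name, g in self.gain_sum.items()}

    def sample(self, k, rng):
        """Draw k distinct operators, each draw weighted by accumulated gain."""
        pool = dict(self.gain_sum)
        chosen = []
        for _ in range(min(k, len(pool))):
            names = list(pool)
            pick = rng.choices(names, weights=[pool[n] for n in names])[0]
            chosen.append(pick)
            del pool[pick]  # sample without replacement
        return chosen

weights = OperatorWeights(["add", "sub", "mul", "div"])
weights.record("mul", 3.0)  # "mul" produced a feature with gain 3.0
probs = weights.probabilities()
picked = weights.sample(2, random.Random(0))
print(probs, picked)
```

After the single update, "mul" holds 4 of the 7 total weight units, so it is drawn first more often than the other operators, which still keep a nonzero chance.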

Properties
- The number of features is bounded.
- A highly weighted operator is expected to perform better in its two child nodes (see paper).
- FCTree's error is bounded, which also explains why the features are of high quality.

Experiment -- Data Sets
- UCI repository (balanced)
- Caltech-256 database: an image database of 256 object categories; each category is processed via a 177-dimensional color correlogram (balanced)
- Landmine collection: collected via remote sensing techniques (skewed)
- Nuclear Ban data source: a nuclear explosion detection problem used by the ICDM'08 contest (skewed)

Experiment -- Baseline methods
- Original features
- TFC: enumerates all possible features generated by the operators
- FCTree
Classifiers: NB, SVM and C4.5

Performance -- Balanced Data
Best in 23 out of 33 comparisons

Performance -- Skewed Data
Best in 25 out of 33 comparisons

Scalability Analysis

Strength of Weighting Rule

Original vs. FCTree (177-dimensional color correlogram)

Conclusion
Key points:
- Divide and conquer to avoid exhaustive enumeration
- Local feature construction and subspace evaluation
- Weighting-rule-based search: free of domain knowledge, with provable performance
Code and data available from the authors
