1
Generalized and Heuristic-Free Feature Construction for Improved Accuracy
Wei Fan ‡, Erheng Zhong †, Jing Peng*, Olivier Verscheure ‡, Kun Zhang §, Jiangtao Ren †, Rong Yan ‡ and Qiang Yang ¶
‡ IBM T. J. Watson Research Center
† Sun Yat-Sen University
*Montclair State University
§ Xavier University of Louisiana
Facebook, Inc
¶ Hong Kong University of Science and Technology
Construction works when the original feature pool is not good enough (feature selection won’t work)
Too many choices to construct
Evaluate on local space, not always on all the data points
Better automated
2
Feature Construction -- Example
XOR-like problem
Not linearly separable: use both features to construct a “cross” model
Linearly separable: one feature F3 is enough
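A minimal sketch of the XOR-like example, assuming the constructed feature F3 is the product F1 * F2 (the slide does not say which operator produced F3). The four points and the helper `best_threshold_accuracy` are hypothetical, chosen only to show that no single threshold on F1 or F2 separates the classes while one threshold on F3 does.

```python
# Four XOR-like points: the label is 1 when F1 and F2 have the same sign.
points = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
labels = [1, 0, 0, 1]


def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-feature threshold rule, trying both polarities."""
    distinct = sorted(set(values))
    # Candidate cuts: below the minimum, between neighbours, above the maximum.
    cuts = [distinct[0] - 1.0]
    cuts += [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    cuts += [distinct[-1] + 1.0]
    best = 0.0
    for t in cuts:
        preds = [1 if v > t else 0 for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        # 1 - acc is the accuracy of the rule with the opposite polarity.
        best = max(best, acc, 1.0 - acc)
    return best


f1 = [x for x, _ in points]
f2 = [y for _, y in points]
f3 = [x * y for x, y in points]  # assumed constructed feature: F3 = F1 * F2

acc_f1 = best_threshold_accuracy(f1, labels)
acc_f2 = best_threshold_accuracy(f2, labels)
acc_f3 = best_threshold_accuracy(f3, labels)
print(acc_f1, acc_f2, acc_f3)
```

On this toy data each original feature alone does no better than chance, while the single constructed feature is enough.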
4
Main Challenges
To address these, we have 3 main steps:
1. Too many ways to construct new features: x y, x-y, x/y, etc.
–4 binary operators, 1000 original features up to constructed features
–Step: Divide and Conquer
2. Insignificant on the whole data set, but highly discriminant in a local region
–F2 not very useful unless considered with F1
–Step: Local Feature Construction and Evaluation
3. Automated: not based on domain knowledge
–Step: Automatically adjusted weighting rules
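The second challenge can be illustrated with a small sketch: a feature whose info gain is zero on the whole data set but maximal inside one region. The toy rows, the threshold of 0, and the helper names `entropy` and `info_gain` are assumptions made for illustration, not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    """Info gain of splitting into value <= threshold and value > threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


# Hypothetical rows (F1, F2, label): the meaning of F2 flips with the sign of F1.
data = [(-1, -1, 1), (-1, 1, 0), (1, -1, 0), (1, 1, 1)] * 5

# F2 evaluated on all data points.
global_gain = info_gain([r[1] for r in data], [r[2] for r in data], threshold=0)

# F2 evaluated only in the local region F1 > 0.
local = [r for r in data if r[0] > 0]
local_gain = info_gain([r[1] for r in local], [r[2] for r in local], threshold=0)
print(global_gain, local_gain)
```

Evaluating F2 only within the region selected by F1 reveals the discriminative power that the global evaluation hides.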
5
Divide-Conquer: Local Feature Construction and Evaluation
Stopping Criteria:
1. The number of instances in the node is smaller than a threshold
2. The node only contains examples from one class
Constructed Features (original + new)
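The recursion and its two stopping criteria can be sketched as follows. This is not the authors' FCTree code: the operator set, the mean as split point, the parameter names (`min_size`, `n_sub`) and the toy data are all assumptions, and every operator is tried in each node instead of drawing a weighted random subset.

```python
import math
import random
from collections import Counter
from itertools import combinations

# Assumed operator set; the slides only mention "4 binary operators".
OPERATORS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / b if b != 0 else 0.0,
}


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels, threshold):
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing has no gain
    n = len(labels)
    return (entropy(labels)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))


def build(rows, labels, min_size, rng, found, n_sub=2):
    """Recursively construct features; appends (operator, i, j) to `found`."""
    # Stopping criteria from the slide: small node, or single-class node.
    if len(rows) < min_size or len(set(labels)) <= 1:
        return
    # Random subset of original features, sorted for a stable pair order.
    n_features = len(rows[0])
    subset = sorted(rng.sample(range(n_features), min(n_sub, n_features)))
    best = None
    for i, j in combinations(subset, 2):
        for name, op in OPERATORS.items():
            values = [op(r[i], r[j]) for r in rows]
            threshold = sum(values) / len(values)  # assumed split point: the mean
            gain = info_gain(values, labels, threshold)
            if best is None or gain > best[0]:
                best = (gain, name, i, j, threshold, values)
    if best is None or best[0] <= 0.0:
        return  # nothing useful could be constructed in this node
    _, name, i, j, threshold, values = best
    found.append((name, i, j))
    # Divide: recurse into the two child nodes defined by the new feature.
    for side in (True, False):
        idx = [k for k, v in enumerate(values) if (v <= threshold) == side]
        build([rows[k] for k in idx], [labels[k] for k in idx],
              min_size, rng, found, n_sub)


rows = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)] * 2
labels = [1, 0, 0, 1] * 2
found = []
build(rows, labels, min_size=2, rng=random.Random(0), found=found)
print(found)
```

On this XOR-like toy set the root node picks the product of the two features, both children are pure, and the recursion stops after one split.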
6
Every node …
1. Random subset of original features
2. “Weighted random” subset of operators
Weighting Rule
7
The weight of an operator is proportional to the info gain of the features constructed by that operator (the sum of its past info gain).
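One way to read this rule, as a hedged sketch: keep a running sum of info gain per operator and draw operators with probability proportional to that sum. The class name `OperatorWeights`, the small `prior` that keeps unused operators selectable, and sampling with replacement are assumptions, not details given on the slide.

```python
import random


class OperatorWeights:
    """Tracks the summed past info gain of each operator (hypothetical helper)."""

    def __init__(self, names, prior=1e-3):
        # Assumed small prior so an operator with no history can still be drawn.
        self.gain_sum = {name: prior for name in names}

    def record(self, name, gain):
        """Add the info gain of a feature constructed by operator `name`."""
        self.gain_sum[name] += gain

    def weights(self):
        """Normalized weights, proportional to each operator's summed gain."""
        total = sum(self.gain_sum.values())
        return {name: g / total for name, g in self.gain_sum.items()}

    def sample(self, k, rng):
        """Draw k operators, with replacement, proportionally to their weight."""
        names = list(self.gain_sum)
        return rng.choices(names, weights=[self.gain_sum[n] for n in names], k=k)


tracker = OperatorWeights(["add", "sub", "mul", "div"])
tracker.record("mul", 0.9)
tracker.record("mul", 0.6)
tracker.record("add", 0.3)
w = tracker.weights()
picked = tracker.sample(3, random.Random(0))
print(w, picked)
```

Operators that produced informative features in earlier nodes are drawn more often in later nodes, without any domain knowledge being supplied.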
8
Properties
The number of features is bounded.
A highly weighted operator is expected to perform better in its two child nodes (see paper).
FCTree’s error is bounded.
–also explains why the features are of high quality
9
Experiment – Data Set
UCI repository (Balanced)
Caltech-256 database: an image database of 256 object categories. Each category is processed via a 177-dimensional color correlogram (Balanced)
Landmine collection: collected via remote sensing techniques (Skewed)
Nuclear Ban data source: a nuclear explosion detection problem used by the ICDM’08 contest (Skewed)
10
Experiment -- Baseline methods
Original Features
TFC:
–enumerates all possible features generated by operators
NB, SVM and C45
Operators
FCTree:
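To see what the TFC baseline's exhaustive enumeration involves, here is a small sketch that lists every (operator, feature pair) candidate. The operator names and the use of unordered pairs are assumptions; non-commutative operators such as subtraction or division might be applied to ordered pairs in the actual baseline.

```python
from itertools import combinations

# Assumed operator names; the slides do not list the operators explicitly.
OPERATOR_NAMES = ["add", "sub", "mul", "div"]


def enumerate_candidates(n_features, operators):
    """All (operator, i, j) candidates over unordered feature pairs i < j."""
    return [(op, i, j)
            for i, j in combinations(range(n_features), 2)
            for op in operators]


candidates = enumerate_candidates(5, OPERATOR_NAMES)
print(len(candidates))
```

The count grows quadratically with the number of original features, which is the cost the divide-and-conquer approach is meant to avoid.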
11
Performance -- Balanced Data
Best in 23 out of 33 comparisons
12
Performance -- Skewed Data
Best in 25 out of 33 comparisons
13
Scalability Analysis
14
Strength of Weighting Rule
15
Original vs. FCTree: 177-dimensional color correlogram
16
Conclusion
Key points
–Divide-conquer to avoid exhaustive enumeration;
–Local feature construction and subspace evaluation;
–Weighting-rule-based search: domain knowledge free and provable performance.
Code and data available from the authors