1
A Unifying View on Instance Selection
Thomas Reinartz DaimlerChrysler AG, Research and Technology, Germany
2
Outline
  Introduction
  Focusing Tasks
  Evaluation Criteria for Instance Selection
  Unifying Framework for Instance Selection
  Evaluation
  Conclusions
3
Introduction
CRISP-DM (Cross-Industry Standard Process for Data Mining)
  Business Understanding
  Data Understanding
  Data Preparation: data selection, cleaning, construction, integration, formatting
  Modeling
  Evaluation
  Deployment
Data Selection: huge data sets call for data shrinking or data reduction, i.e. focusing
4
Focusing Tasks (1)
Data as a table (A, T)
  A: Attribute, characterized by name, type, and domain of values
  T: Tuple or instance, a sequence of attribute values
Focusing Specification
  Focusing Input: a table or a component of a table
  Focusing Output: a subset of the focusing input
    Simple subset
    Constrained subset
    Constructed subset
  Focusing Criterion: the relation between input and output
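The table view above can be sketched as a small data structure. This is only an illustration: the names `Attribute`, `Table`, `is_valid`, and `is_simple_subset` are hypothetical, domains are assumed to be finite sets, and only the simple-subset case is covered because the slide does not define the constrained and constructed variants.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """An attribute is characterized by name, type, and domain of values."""
    name: str
    value_type: type
    domain: frozenset  # assumption: a finite set of allowed values


@dataclass
class Table:
    """A table (A, T): a list of attributes and a list of tuples."""
    attributes: list
    tuples: list

    def is_valid(self):
        """Check that every tuple has one in-domain value per attribute."""
        return all(
            len(t) == len(self.attributes)
            and all(v in a.domain for v, a in zip(t, self.attributes))
            for t in self.tuples
        )


def is_simple_subset(output, focusing_input):
    """True if every tuple of the focusing output occurs in the focusing input."""
    return all(t in focusing_input.tuples for t in output.tuples)
```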
5
Focusing Tasks (2)
Focusing Context
  Data Mining Goal: classification, prediction, description, concept description, summarization, dependency analysis
  Data Characteristics: simple statistics, information quality
  Data Mining Algorithm
Instance Selection: a particular focusing task where the input is a set of cases and the output is a subset of the input
6
Evaluation Criteria for Instance Selection
Different Evaluation Strategies
  Filter and Wrapper Evaluation
    Filter Approach: considers only the data reduction itself, with respect to the mean, variance, distribution, and joint distribution
    Wrapper Approach: evaluates with respect to a specific data mining aspect, e.g. execution time, storage requirements, accuracy, complexity
  Isolated and Comparative Evaluation (for solutions)
  Separated and Combined Evaluation (for criteria)
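The difference between filter and wrapper evaluation can be sketched as follows. This is a minimal illustration under assumptions, not the evaluation procedure of the presentation: the function names are hypothetical, the filter compares only mean and variance of a single numeric attribute, and the wrapper measures only accuracy. The majority-class learner is a stand-in for a real data mining algorithm.

```python
import statistics
from collections import Counter


def filter_evaluation(full_values, subset_values):
    """Filter approach: compare simple statistics of one numeric attribute
    in the full data and in the selected subset (smaller is better)."""
    return {
        "mean_diff": abs(statistics.mean(full_values) - statistics.mean(subset_values)),
        "variance_diff": abs(
            statistics.pvariance(full_values) - statistics.pvariance(subset_values)
        ),
    }


def wrapper_evaluation(train_subset, test_set, fit, predict):
    """Wrapper approach: train a learner on the subset and report its
    accuracy on a test set of (features, label) pairs."""
    model = fit(train_subset)
    correct = sum(predict(model, x) == y for x, y in test_set)
    return correct / len(test_set)


# Hypothetical stand-in learner: always predicts the majority class.
def fit_majority(cases):
    return Counter(label for _, label in cases).most_common(1)[0][0]


def predict_majority(model, features):
    return model
```

A filter score needs only the data, while a wrapper score depends on the learner that is plugged in, so the same subset can rank differently under the two strategies.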
7
Unifying Framework for Instance Selection
Input → Sampling → Clustering → Prototyping → Output
Evaluation
  Sampling: simple random sampling, systematic sampling, stratified sampling
  Clustering
  Prototyping
The order of the steps is not important.
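The three sampling techniques named above can be sketched in a few lines. This is an illustrative sketch using only the Python standard library; the function names, the fixed seeds, and the proportional allocation in the stratified variant are assumptions, not details taken from the presentation.

```python
import random
from collections import defaultdict


def simple_random_sample(cases, n, seed=0):
    """Draw n cases uniformly at random without replacement."""
    rng = random.Random(seed)
    return rng.sample(cases, n)


def systematic_sample(cases, step, start=0):
    """Take every step-th case, beginning at index start."""
    return cases[start::step]


def stratified_sample(cases, key, fraction, seed=0):
    """Group cases into strata by key(case), then sample each stratum
    at the same fraction (assumption: at least one case per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[key(case)].append(case)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample
```

Each function maps a list of cases to a subset of that list, which matches the definition of instance selection as a focusing task whose output is a subset of its input.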
8
Evaluation
Generic Sampling (GENSAM): implements the unifying framework with additional preparation steps
  Sorting: select by ordering the cases according to their values for the important attribute
  Stratification: separate cases by attribute intervals (continuous) or attribute values (discrete), in order of attribute relevance
  Intelligent Sampling: random sampling, stratified sampling, systematic sampling, leader sampling, and similarity-driven sampling, by combining the above methods
Experimental Setting
  Goal: classification
  Algorithms: C4.5, instance-based learning (NN classifier)
  Instance selection methods: simple random sampling (R), simple random sampling with stratification (RS), systematic sampling (S), systematic sampling with sorting (SS), leader sampling (L), leader sampling with sorting and stratification (LS)
  Data: training set (80%), test set (20%)
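Leader sampling and the 80/20 split can be sketched as below. The slide does not define leader sampling, so this assumes the usual leader-clustering scheme: cases are scanned once, and a case becomes a new leader only if it lies farther than a threshold from every existing leader. The names `leader_sample` and `train_test_split`, the Euclidean distance, and the optional `sort_key` (standing in for the sorting preparation step) are all illustrative assumptions, not GENSAM itself.

```python
import math
import random


def leader_sample(cases, threshold, distance=math.dist, sort_key=None):
    """Single pass over numeric cases: keep a case as a leader if its
    distance to every existing leader exceeds the threshold."""
    # Optional preparation step: sort the cases before the pass.
    ordered = sorted(cases, key=sort_key) if sort_key is not None else list(cases)
    leaders = []
    for case in ordered:
        if all(distance(case, leader) > threshold for leader in leaders):
            leaders.append(case)
    return leaders


def train_test_split(cases, train_fraction=0.8, seed=0):
    """Shuffle the cases and split them into training and test sets."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Because the result depends on the order in which cases are visited, sorting beforehand makes the selection reproducible for a given attribute ordering.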
10
Conclusions
  Evaluation Criteria
  More Intelligent Focusing Solutions
  Analytical and experimental studies of different instance selection techniques