1
Adaptive Sampling Methods for Scaling up Knowledge Discovery Algorithms
From Ch. 8 of Instance Selection and Construction for Data Mining (2001), by Carlos Domingo et al., Kluwer Academic Publishers
(Summarized by Jinsan Yang, SNU Biointelligence Lab)
2
Abstract
Methods for handling large amounts of data
Adaptive sampling instead of batch random sampling
Keywords: Data Mining, Knowledge Discovery, Scalability, Adaptive Sampling, Concentration Bounds
3
Outline
Introduction
General Rule Selection Problem
Adaptive Sampling Algorithm
An Application of AdaSelect: Problem and Algorithm, Experiments
Concluding Remarks
4
Introduction (1)
Analysis of large data: redesign a known algorithm, or reduce the data size
A typical task in data mining: finding or selecting rules or laws (General Rule Selection)
General Rule Selection by random sampling (batch sampling)
Proper sample size determined by concentration bounds / deviation bounds (Chernoff, Hoeffding bounds)
Problems:
An immense sample size is needed for good accuracy and confidence
In batch sampling, the sample size must be fixed a priori for the worst case, so it is overestimated in most situations (see the worst-case bound below)
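The slide does not state the worst-case batch size explicitly. A standard form, assuming each utility U(h) is estimated as the mean of n i.i.d. observations in [0,1] and combining Hoeffding's inequality with a union bound over a finite model set H (an assumption about the setting, not taken from the slide), is:

n \ge \frac{1}{2\varepsilon^{2}} \ln\!\frac{2|H|}{\delta}

Because this n must cover the hardest possible input, it is usually far larger than what an easy instance actually requires, which is the motivation for adaptive sampling.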
5
Introduction (2)
Overcoming these problems:
Sampling in an online, sequential fashion (one by one or block by block)
Adaptive sample sizes (adaptive sampling)
6
General Rule Selection Problem
Given data D (discrete, categorical?) and a model set H, select a model h with (approximately) maximum utility U(h) (a form of supervised learning)
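The slide omits the formal statement. One common (epsilon, delta)-approximation formulation in the spirit of Domingo et al. (the exact criterion on the original slide may differ) is:

\Pr\Big[\, U(h) \;\ge\; (1-\varepsilon)\,\max_{h' \in H} U(h') \,\Big] \;\ge\; 1-\delta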
7
Adaptive Sampling Algorithm (1)
Extension of the Hoeffding bound
Reliability of the algorithm
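The standard Hoeffding bound that the adaptive stopping rule extends, assuming the utility of each rule is estimated as the mean of n i.i.d. observations in [0,1] (the slide's own derivation is not shown), reads:

\Pr\big[\, |\hat{U}_{n}(h) - U(h)| \ge \varepsilon \,\big] \;\le\; 2\exp(-2 n \varepsilon^{2})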
8
Adaptive Sampling Algorithm (2)
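The algorithm itself appears only as a figure on the original slide. Below is a minimal Python sketch of the adaptive-sampling idea, assuming utilities in [0,1] estimated as running means and a Hoeffding-style stopping rule; the function names, the block-size schedule, and the exact stopping constant are illustrative and not taken from the chapter.

import math

def adaptive_select(sample_utility, models, epsilon=0.05, delta=0.05,
                    block_size=100, max_samples=1_000_000):
    """Adaptively sample until one model's empirical utility is provably
    (approximately) the best; a sketch, not the chapter's exact AdaSelect.

    sample_utility(h) must return one random utility observation in [0, 1]
    for model h (e.g. 1 if h classifies a randomly drawn example correctly).
    """
    sums = {h: 0.0 for h in models}   # running sums of observed utilities
    n = 0                             # number of observations per model
    while n < max_samples:
        for _ in range(block_size):   # draw the next block of observations
            for h in models:
                sums[h] += sample_utility(h)
        n += block_size
        # Hoeffding-style confidence radius, union-bounded over all models
        # and (crudely) over the stopping points checked so far.
        radius = math.sqrt(math.log(2 * len(models) * n / delta) / (2 * n))
        best = max(models, key=lambda h: sums[h])
        rest = [sums[h] / n for h in models if h is not best]
        gap = sums[best] / n - (max(rest) if rest else 0.0)
        # Stop once the leader is ahead by more than the estimation error.
        if gap >= 2 * radius - epsilon:
            return best
    return max(models, key=lambda h: sums[h])  # fall back to the current leader

The key property is that the sample size is not fixed in advance: easy instances, where one rule clearly dominates, stop early, while hard instances keep sampling.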
9
An Application of AdaSelect (1)
AdaSelect can be applied as a tool for the General Rule Selection problem
Example chosen: a boosting-based classification algorithm that uses a simple decision stump learner as the base learner
Decision stump: a single-split decision tree
AdaBoost performs boosting by sub-sampling or re-weighting; applying adaptive sampling to the base learner requires boosting by filtering
MadaBoost is used instead, since it keeps each example's weight bounded by its initial weight (see the filtering sketch below)
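A short sketch of the boosting-by-filtering mechanism that bounded weights make possible: each round draws examples from the data stream and accepts each one with probability given by its current weight (at most 1, because MadaBoost-style weights never exceed their initial value). The helper names and the weight function are illustrative and not MadaBoost's exact update rule.

import random

def filter_sample(draw_example, weight, n_accept):
    """Boosting by filtering: rejection-sample n_accept examples from the
    stream according to the current (bounded) weights.

    draw_example() returns a random (x, y) pair from the underlying data;
    weight(x, y) returns the example's current weight in [0, 1]
    (bounded because MadaBoost caps weights at their initial value).
    """
    accepted = []
    while len(accepted) < n_accept:
        x, y = draw_example()
        if random.random() < weight(x, y):   # accept with probability = weight
            accepted.append((x, y))
    return accepted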
10
An Application of AdaSelect (2): Algorithm
Data: discrete instance vectors with labels
Classification rule: decision stump
0-1 error measure; U: utility function (a sketch of the stump and its utility is given below)
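A minimal sketch of a decision stump over discrete attributes and its 0-1 utility, the quantity AdaSelect would estimate by sampling. The concrete representation (an attribute index plus a value-to-label table) is an assumption, since the slide does not spell it out.

from collections import Counter

class DecisionStump:
    """Single-split classifier: predict a label from one discrete attribute."""

    def __init__(self, attribute, value_to_label, default_label):
        self.attribute = attribute              # index of the attribute to split on
        self.value_to_label = value_to_label    # maps attribute value -> predicted label
        self.default_label = default_label      # used for unseen attribute values

    def predict(self, x):
        return self.value_to_label.get(x[self.attribute], self.default_label)

def utility(stump, examples):
    """0-1 utility: fraction of labeled examples the stump classifies correctly."""
    correct = sum(1 for x, y in examples if stump.predict(x) == y)
    return correct / len(examples)

def fit_stump(attribute, examples):
    """Build the stump on `attribute` that predicts the majority label per value."""
    by_value = {}
    for x, y in examples:
        by_value.setdefault(x[attribute], Counter())[y] += 1
    value_to_label = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
    default = Counter(y for _, y in examples).most_common(1)[0][0]
    return DecisionStump(attribute, value_to_label, default)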
11
An Application of AdaSelect (3): Experiments
Attributes discretized into 5 intervals; missing values treated as another value
Artificial inflation (100 copies) of the original UCI data
Only two-class problems are used
10-fold cross validation; results averaged over 10 runs
Computer: Alpha 600 MHz CPU, 250 MB memory, 4.3 GB hard disk, under Linux
C4.5 and a naïve Bayes classifier used for comparison
Number of boosting rounds: 10
Number of all possible decision stumps:
(The final hypothesis is a weighted majority of ten depth-1 decision trees)
12
An Application of AdaSelect (4)
13
An Application of AdaSelect (5)
AdaSelect is faster than C4.5, and its speed advantage grows with the sample size
14
Concluding Remarks
The adaptive sampling method is given a theoretical justification and an efficiency analysis
It is applied in the design of a base learner for a boosting algorithm