Multivariate Discretization of Continuous Variables for Set Mining Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Kuo-wei Chen
Outline Motivation Objective Introduction (1)~(2) Multivariate Discretization Approach(1)~(5) Experiment (1)~(6) Conclusions Opinion
Motivation Most discretization methods are univariate and consider only a single feature at a time. This is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in data.
Objective Explain why univariate discretization falls short of multivariate discretization. Present a bottom-up merging algorithm called "MVD". Present experiments showing that MVD's execution time is comparable to that of a popular univariate approach.
Introduction(1) In knowledge discovery, improving predictive accuracy is not the primary goal; the emphasis is on finding previously unknown and insightful patterns. The discretized intervals should not hide patterns. The intervals should be semantically meaningful. In multivariate discretization, one considers how all the variables interact before deciding on the discretized intervals.
Introduction(2) Example
Multivariate Discretization Approach(1) Past Discretization Approaches. Univariate: miss interactions among several variables. Execution time is long: O(n^2). Many rules.
Multivariate Discretization Approach(2) STUCCO: finds large differences between two probability distributions. The mining objectives of STUCCO:
$P(C \mid G_1) \neq P(C \mid G_2)$ …… (1)
$|\mathrm{support}(C \mid G_1) - \mathrm{support}(C \mid G_2)| \geq \delta$ …… (2)
STUCCO is used to control the merging process.
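The support-difference objective (2) can be illustrated with a minimal Python sketch. This is not STUCCO itself: it skips the statistical test behind objective (1) and the search over candidate sets. The names `support`, `is_large_difference`, and the threshold `delta` are hypothetical, and the toy records are made up for illustration.

```python
def support(rows, cset):
    """Fraction of rows that match every attribute=value pair in cset."""
    if not rows:
        return 0.0
    hits = sum(1 for r in rows if all(r.get(a) == v for a, v in cset.items()))
    return hits / len(rows)


def is_large_difference(group1, group2, cset, delta):
    """Objective (2): the supports differ by at least delta (assumed threshold)."""
    return abs(support(group1, cset) - support(group2, cset)) >= delta


# Toy groups G1 and G2 (illustrative data only).
g1 = [{"degree": "PhD"}, {"degree": "PhD"}, {"degree": "BS"}, {"degree": "BS"}]
g2 = [{"degree": "PhD"}, {"degree": "BS"}, {"degree": "BS"}, {"degree": "BS"}]
cset = {"degree": "PhD"}

print(support(g1, cset), support(g2, cset))
print(is_large_difference(g1, g2, cset, delta=0.2))
```

A full implementation would also test objective (1) for statistical significance, for example with a chi-square test on the group-by-match contingency table.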
Multivariate Discretization Approach(3) Algorithm
1. Partition all continuous attributes into n basic intervals.
2. Select the adjacent intervals X and Y that have the minimum combined support.
3. If Fx ~ Fy (the instances in X and Y have similar multivariate distributions), merge X and Y.
4. If there are no eligible intervals, stop. Otherwise go to step 2.
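The four steps above can be sketched in Python for a single continuous attribute. This is a simplified illustration under stated assumptions, not the author's implementation: the similarity test Fx ~ Fy is passed in as a caller-supplied predicate `similar` (a stand-in for the STUCCO call), basic intervals are equal-frequency bins, and a pair that fails the test is treated as no longer eligible until one of its intervals changes. All function names are hypothetical.

```python
def equal_frequency_bins(sorted_rows, n):
    """Step 1: split rows (sorted by the attribute) into n basic intervals."""
    size, rem = divmod(len(sorted_rows), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < rem else 0)
        bins.append(sorted_rows[start:end])
        start = end
    return [b for b in bins if b]


def merge_bottom_up(bins, similar):
    """Steps 2-4: repeatedly merge the smallest similar adjacent pair."""
    bins = [list(b) for b in bins]
    spans = [(i, i) for i in range(len(bins))]  # basic intervals covered by each bin
    rejected = set()                            # adjacent pairs already found different
    while True:
        eligible = [i for i in range(len(bins) - 1)
                    if (spans[i], spans[i + 1]) not in rejected]
        if not eligible:                        # step 4: nothing left to try
            return bins
        # Step 2: pair with minimum combined support (instance count).
        i = min(eligible, key=lambda k: len(bins[k]) + len(bins[k + 1]))
        if similar(bins[i], bins[i + 1]):       # step 3: Fx ~ Fy, so merge
            bins[i:i + 2] = [bins[i] + bins[i + 1]]
            spans[i:i + 2] = [(spans[i][0], spans[i + 1][1])]
        else:
            rejected.add((spans[i], spans[i + 1]))


def frac_positive(rows):
    return sum(label for _, label in rows) / len(rows)


def similar_by_class_rate(x_rows, y_rows, tol=0.3):
    """Stand-in for Fx ~ Fy: class rates differ by less than tol (assumption)."""
    return abs(frac_positive(x_rows) - frac_positive(y_rows)) < tol


# Toy data: (value, class) pairs whose class flips at value 6.
rows = [(x, 1 if x >= 6 else 0) for x in range(12)]
result = merge_bottom_up(equal_frequency_bins(rows, 6), similar_by_class_rate)
print([[x for x, _ in b] for b in result])
```

On this toy data the six basic intervals collapse to two, with the single surviving cutpoint at the place where the class changes. A real similarity test would compare the intervals across all other variables, not just one class label.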
Multivariate Discretization Approach(4) Efficiency. STUCCO runs efficiently on many datasets. The problems given to STUCCO are often easier than those faced by the main mining program, because it only needs to find a single difference between the groups. Calling STUCCO repeatedly will result in many passes over the database.
Multivariate Discretization Approach(5) Sensitivity to Hidden Patterns: Parity R+I Example
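To make the hidden-pattern idea concrete, here is a small Python sketch of a parity-style dataset. The slide does not spell out the construction, so the details are assumptions: R relevant and I irrelevant features drawn uniformly from [-0.5, 0.5], and a class of 1 when an even number of the relevant features are positive. The function names are hypothetical.

```python
import random


def make_parity(n, r, i, seed=0):
    """Generate n instances with r relevant and i irrelevant features (assumed setup)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = [rng.uniform(-0.5, 0.5) for _ in range(r + i)]
        positives = sum(1 for v in x[:r] if v > 0)
        data.append((x, 1 if positives % 2 == 0 else 0))
    return data


def class_rate_given_positive(data, j):
    """Estimate P(class = 1 | feature j > 0), a purely univariate view."""
    labels = [c for x, c in data if x[j] > 0]
    return sum(labels) / len(labels)


data = make_parity(4000, r=2, i=1)
for j in range(3):
    print(j, round(class_rate_given_positive(data, j), 3))
```

Viewed one feature at a time, every feature looks uninformative (the class rate stays near 0.5), yet the two relevant features together determine the class exactly. A univariate discretizer sees no reason to place a cutpoint at 0, which is the kind of pattern a multivariate method is meant to preserve.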
Experiment(1) Sun Ultra-5 with 128MB Parameter settings
Experiment(2) Discretization Time in CPU seconds
Experiment(3) Qualitative Results Discretization Cutpoints for Age on the Adult Census Data
Experiment(4) Qualitative Results Discretization Cutpoints for Capital-Loss on the Adult Census Data
Experiment(5) Qualitative Results Discretization Cutpoints for Parental Income on the UCI Admission Data
Experiment(6) Qualitative Results Discretization Cutpoints for GPA on the UCI Admission Data
Conclusions The MVD algorithm finely partitions continuous variables and then merges adjacent intervals only if their instances have similar multivariate distributions. Experimental results indicate that the MVD algorithm detects high-dimensional interactions between features and discretizes the data appropriately. The MVD algorithm runs in time comparable to a popular univariate recursive approach.
Opinion If adjacent intervals do not have similar distributions, the MVD algorithm will not be efficient. In practice, this situation occurs often.