Multivariate Discretization of Continuous Variables for Set Mining Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Kuo-wei Chen.


Multivariate Discretization of Continuous Variables for Set Mining Author: Stephen D. Bay Advisor: Dr. Hsu Graduate: Kuo-wei Chen

Outline Motivation Objective Introduction (1)~(2) Multivariate Discretization Approach(1)~(5) Experiment (1)~(6) Conclusions Opinion

Motivation Most discretization methods are univariate, considering only a single feature at a time. This is a sub-optimal approach for knowledge discovery, as univariate discretization can destroy hidden patterns in the data.
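As a small hypothetical illustration of the point above (the data and variable names are made up, not from the paper), consider an XOR-style pattern: the class depends on two variables jointly, so the class proportion looks the same on either side of any single-variable cutpoint, and a univariate discretizer has nothing to detect.

```python
import random

random.seed(0)
data = []
for _ in range(1000):
    x, y = random.random(), random.random()
    label = int((x < 0.5) != (y < 0.5))   # class depends on x and y jointly
    data.append((x, y, label))

# Split on x alone: the class proportion in each half stays near 0.5,
# so no univariate cutpoint on x (or, by symmetry, on y) is informative.
left = [c for x, y, c in data if x < 0.5]
right = [c for x, y, c in data if x >= 0.5]
print(sum(left) / len(left), sum(right) / len(right))
```

Only a method that looks at x and y together can place cutpoints that expose this pattern.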

Objective Explain why univariate discretization falls short of multivariate discretization. Present a bottom-up merging algorithm called "MVD". Present experiments evaluating MVD's execution time against univariate approaches.

Introduction(1) In knowledge discovery, improving predictive accuracy is not the most important goal; the emphasis is on previously unknown, insightful patterns. The discretized intervals should not hide patterns, and they should be semantically meaningful. Multivariate discretization considers how all the variables interact before deciding on the discretized intervals.

Introduction(2) Example

Multivariate Discretization Approach(1) Past Discretization Approaches Past approaches are univariate: they miss interactions among several variables, have long execution time, O(n²), and produce many rules.

Multivariate Discretization Approach(2) STUCCO Finds large differences between two probability distributions. The mining objectives of STUCCO: P(C|G1) ≠ P(C|G2) …… (1) |support(C|G1) − support(C|G2)| ≥ δ …… (2) These criteria control the merging process.
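A minimal sketch of criterion (2) above, with assumed helper names (`support`, `is_candidate`) and rows represented as sets of items; full STUCCO additionally applies a statistical test (e.g. chi-square) for criterion (1).

```python
def support(contrast_set, group):
    """Fraction of rows in `group` containing all items of `contrast_set`."""
    hits = sum(1 for row in group if contrast_set <= row)
    return hits / len(group)

def is_candidate(contrast_set, g1, g2, delta=0.05):
    # Criterion (2): the support difference across groups must be at least delta.
    s1, s2 = support(contrast_set, g1), support(contrast_set, g2)
    return abs(s1 - s2) >= delta

# Toy groups: each row is a set of items.
g1 = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
g2 = [{"b"}, {"c"}, {"b", "c"}, {"c"}]
print(is_candidate({"a"}, g1, g2))  # support 0.75 in g1 vs 0.0 in g2 -> True
```

Here the item "a" has support 0.75 in g1 but 0.0 in g2, so it passes the minimum-deviation check.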

Multivariate Discretization Approach(3) Algorithm Step 1. Partition all continuous attributes into n basic intervals. Step 2. Select the adjacent intervals X and Y with the minimum combined support. Step 3. If Fx ~ Fy (their distributions are similar), merge X and Y. Step 4. If there are no eligible intervals, stop; otherwise go to Step 2.
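The four steps above can be sketched as follows. Note that `similar` here is a hypothetical stand-in for MVD's actual Fx ~ Fy test, which compares the multivariate distributions of the instances falling in the two intervals; this sketch only compares means on a single attribute.

```python
def mvd_merge(values, n_basic=10):
    values = sorted(values)
    step = max(1, len(values) // n_basic)
    # Step 1: partition into n basic equal-frequency intervals.
    intervals = [values[i:i + step] for i in range(0, len(values), step)]

    def similar(x, y):
        # Placeholder for the Fx ~ Fy test; real MVD compares the
        # multivariate distributions of the two intervals' instances.
        return abs(sum(x) / len(x) - sum(y) / len(y)) < 0.2

    merged = True
    while merged and len(intervals) > 1:
        merged = False
        # Step 2: try adjacent pairs in order of smallest combined support.
        for i in sorted(range(len(intervals) - 1),
                        key=lambda k: len(intervals[k]) + len(intervals[k + 1])):
            x, y = intervals[i], intervals[i + 1]
            if similar(x, y):              # Step 3: merge X and Y
                intervals[i:i + 2] = [x + y]
                merged = True
                break
        # Step 4: loop exits when no adjacent pair is eligible to merge.
    return intervals
```

For example, `mvd_merge([0.0] * 50 + [1.0] * 50)` collapses the ten basic intervals into two: the zeros and the ones never pass the similarity test at their boundary, so that cutpoint survives.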

Multivariate Discretization Approach(4) Efficiency STUCCO runs efficiently on many datasets. The problems STUCCO faces here are often easier than those faced by the main mining program, since it only needs to find a single difference between the groups. However, calling STUCCO repeatedly results in many passes over the database.

Multivariate Discretization Approach(5) Sensitivity to Hidden Patterns Parity R+I example

Experiment(1) Environment: Sun Ultra-5 with 128 MB of memory. Parameter settings

Experiment(2) Discretization Time in CPU seconds

Experiment(3) Qualitative Results Discretization Cutpoints for Age on the Adult Census Data

Experiment(4) Qualitative Results Discretization Cutpoints for Capital-Loss on the Adult Census Data

Experiment(5) Qualitative Results Discretization Cutpoints for Parental Income on the UCI Admission Data

Experiment(6) Qualitative Results Discretization Cutpoints for GPA on the UCI Admission Data

Conclusions The MVD algorithm finely partitions continuous variables and then merges adjacent intervals only if their instances have similar multivariate distributions. Experimental results indicate that the MVD algorithm detects high-dimensional interactions between features and discretizes the data appropriately. The MVD algorithm runs in time comparable to a popular univariate recursive approach.

Opinion If adjacent intervals do not have similar distributions, the MVD algorithm will not be efficient, and in practice this condition occurs frequently.