SPRINT: A Scalable Parallel Classifier for Data Mining
CHAN Siu Lung, Daniel; CHAN Wai Kin, Ken; CHOW Chin Hung, Victor; KOON Ping Yin, Bob

Overview
–Classification
–Decision Tree
–SPRINT Serial Algorithm
–Parallelizing Classification
–Performance Evaluation
–Conclusion
–Reference

Classification
Objective
–To build a model of the classifying attribute based upon the other attributes
Given
–A set of example records, called a training set, where each record consists of several attributes
Attributes are either
1. Continuous (e.g. Age: 23, 24, 25)
2. Categorical (e.g. Gender: Female, Male)
One of the attributes, called the classifying attribute, indicates the class to which each record belongs

Decision Trees
Advantages:
–Well suited to data mining
–Can be constructed fast
–Models are simple and easy to understand
–Can be easily converted into SQL statements
–Obtain accuracy comparable to or better than other classifiers

Example – Car Insurance
(figure: a decision tree whose nodes test a continuous attribute, Age, and a categorical attribute, CarType)

Classification Algorithms
Two classification algorithms are presented in the paper
–SLIQ
–SPRINT
SLIQ
–Handles large disk-resident data
–But requires certain per-record data to stay memory-resident at all times
SPRINT
–Removes all memory restrictions
–Fast, scalable, and easily parallelized
How good is it?
–Excellent scaleup, speedup and sizeup properties

SPRINT Serial Algorithm
A decision tree classifier is built in two phases
–Growth phase
–Prune phase
The growth phase is computationally much more expensive than pruning
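To make the growth phase concrete, here is a minimal Python sketch. It assumes in-memory (value, label) records and a single numeric attribute, which is a simplification: SPRINT itself works over disk-resident attribute lists and histograms, described in the next slides. The gini criterion used here is defined a few slides below.

```python
from collections import Counter

def gini(labels):
    # Gini index of a list of class labels: Gini = 1 - sum_j p_j^2
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(records):
    # records: list of (value, class label) pairs for one numeric attribute.
    labels = [lab for _, lab in records]
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}              # pure node: growth stops
    records = sorted(records)
    n, best = len(records), None
    for i in range(1, n):                       # candidate cursor positions
        if records[i - 1][0] == records[i][0]:
            continue                            # no split between equal values
        left = [lab for _, lab in records[:i]]
        right = [lab for _, lab in records[i:]]
        g = (i * gini(left) + (n - i) * gini(right)) / n
        if best is None or g < best[0]:
            midpoint = (records[i - 1][0] + records[i][0]) / 2
            best = (g, midpoint, i)
    if best is None:                            # all values equal, mixed labels
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, threshold, i = best
    return {"test": "value < %s" % threshold,
            "left": grow_tree(records[:i]),     # records below the threshold
            "right": grow_tree(records[i:])}
```

This naive version recomputes the gini index from scratch at every cursor position; SPRINT's C_below/C_above histograms (below) turn the same search into one incremental pass.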

Serial Algorithm – Growth Phase
Key issues
–How to find the split points that define node tests
–Having chosen a split point, how to partition the data
Data structures
–Attribute lists
–Histograms

Serial Algorithm – Data Structures (Attribute Lists)
SPRINT creates an attribute list for each attribute
Entries are called attribute records, each of which contains
–Attribute value
–Class label
–Index of the record (rid)

Serial Algorithm – Data Structures (Attribute Lists)
The initial lists are associated with the root
As the tree is grown, the attribute lists belonging to each node are partitioned and associated with its children
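A sketch of how the initial attribute lists might be built at the root; the dict-based record layout and function name are illustrative assumptions, not from the paper. Lists of continuous attributes are sorted once here and, because splits preserve order, never need re-sorting.

```python
def build_attribute_lists(training_set, continuous):
    # training_set: list of dicts, e.g.
    #   {"Age": 17, "CarType": "Family", "Class": "High"}
    # continuous: set of attribute names that are numeric
    lists = {}
    attrs = [a for a in training_set[0] if a != "Class"]
    for attr in attrs:
        entries = [(rec[attr], rec["Class"], rid)        # (value, class, rid)
                   for rid, rec in enumerate(training_set)]
        if attr in continuous:
            entries.sort(key=lambda e: e[0])             # sorted once, stays sorted
        lists[attr] = entries
    return lists
```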

Serial Algorithm – Data Structures (Histograms)
For continuous attributes, two histograms are associated with each decision-tree node, denoted C_below and C_above
–C_below: maintains the class distribution of the attribute records that have already been processed
–C_above: maintains the class distribution of the attribute records that have not yet been processed

Serial Algorithm – Data Structures (Histograms)
For categorical attributes, only one histogram is associated with each node; it is called the count matrix

Serial Algorithm – Finding Split Points
While growing the tree, the goal at each node is to determine the split point that “best” divides the training records belonging to that leaf
The paper uses the gini index:
Gini(S) = 1 − Σ_j p_j²
–where p_j is the relative frequency of class j in S
For a split of S into S₁ (n₁ records) and S₂ (n₂ records):
Gini_split(S) = (n₁/n)·Gini(S₁) + (n₂/n)·Gini(S₂)

Serial Algorithm – Example: split point for a continuous attribute

Age  Class  rid
17   High   1
20   High   5
23   High   0
32   Low    4
43   High   2
68   Low    3

Cursor position 3:
           H  L
C_below    3  0
C_above    1  2

Serial Algorithm – Example: split point for a continuous attribute
After computing the gini index at every cursor position, we choose the lowest as the split point
Here the minimum is at position 3, and the candidate split point is the mid-point between ages 23 and 32 (i.e. Age < 27.5)
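A sketch of the one-pass scan over a sorted attribute list, maintaining C_below and C_above exactly as described above (the two-class histogram layout mirrors the example; the function name is an illustrative assumption):

```python
def best_continuous_split(attr_list, classes=("High", "Low")):
    # attr_list: sorted list of (value, class label, rid) entries
    n = len(attr_list)
    c_below = {c: 0 for c in classes}            # records already processed
    c_above = {c: 0 for c in classes}            # records not yet processed
    for _, label, _ in attr_list:
        c_above[label] += 1

    def gini(hist):
        total = sum(hist.values())
        if total == 0:
            return 0.0
        return 1.0 - sum((v / total) ** 2 for v in hist.values())

    best = None
    for i in range(1, n):                        # advance the cursor
        label = attr_list[i - 1][1]
        c_below[label] += 1                      # record i-1 moves below
        c_above[label] -= 1
        if attr_list[i - 1][0] == attr_list[i][0]:
            continue                             # no split between equal values
        g = (i * gini(c_below) + (n - i) * gini(c_above)) / n
        if best is None or g < best[0]:
            midpoint = (attr_list[i - 1][0] + attr_list[i][0]) / 2
            best = (g, midpoint)
    return best                                  # (gini_split, threshold)
```

On the Age list above this returns a threshold of 27.5 (gini_split ≈ 0.222 at cursor position 3), matching the example.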

Serial Algorithm – Example: split point for a categorical attribute
Count matrix for CarType:

          H  L
Family    2  1
Sports    2  0
Truck     0  1
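A sketch of the categorical case: build the count matrix from the attribute list, then evaluate splits of the form "value ∈ subset". The exhaustive subset enumeration shown is only practical for attributes with few values; for large value sets a cheaper greedy search is typically used instead. Names and layout are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations

def best_categorical_split(attr_list, classes=("High", "Low")):
    # attr_list: list of (value, class label, rid) entries
    matrix = defaultdict(lambda: dict.fromkeys(classes, 0))
    for value, label, _ in attr_list:            # build the count matrix
        matrix[value][label] += 1

    def gini(hist):
        total = sum(hist.values())
        if total == 0:
            return 0.0
        return 1.0 - sum((v / total) ** 2 for v in hist.values())

    values, n, best = sorted(matrix), len(attr_list), None
    for size in range(1, len(values)):           # proper, non-empty subsets
        for subset in combinations(values, size):
            left = {c: sum(matrix[v][c] for v in subset) for c in classes}
            right = {c: sum(matrix[v][c] for v in values if v not in subset)
                     for c in classes}
            n_left = sum(left.values())
            g = (n_left * gini(left) + (n - n_left) * gini(right)) / n
            if best is None or g < best[0]:
                best = (g, set(subset))
    return best                                  # (gini_split, value subset)
```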

Serial Algorithm – Performing the Split
Once the best split point has been found for a node, we execute the split by creating child nodes and dividing the attribute records between them
The list of the splitting attribute is partitioned directly; for the remaining attribute lists (e.g. CarType) the records are routed using their rids, as sketched below
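A sketch of this hash-join style split: the rids of the winning attribute's left partition form a probe structure, which then routes every entry of the other attribute lists (a hypothetical two-way split helper, following the description above):

```python
def split_node(attr_lists, split_attr, goes_left):
    # attr_lists: {attribute name: list of (value, class label, rid)}
    # goes_left: predicate on the split attribute, e.g. lambda age: age < 27.5
    left_rids = {rid for value, _, rid in attr_lists[split_attr]
                 if goes_left(value)}            # the probe structure
    left_lists, right_lists = {}, {}
    for attr, entries in attr_lists.items():
        left_lists[attr] = [e for e in entries if e[2] in left_rids]
        right_lists[attr] = [e for e in entries if e[2] not in left_rids]
    return left_lists, right_lists
```

Because each list is filtered in order, the children's attribute lists stay sorted and never need re-sorting.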

Serial Algorithm – Comparison with SLIQ
SLIQ does not keep separate sets of attribute lists for each node
Advantages
–The lists do not have to be rewritten during a split
–Reassignment of records is easy
Disadvantage
–Its class list must stay in memory at all times, which limits the amount of data SLIQ can classify
The goal of SPRINT was not simply to outperform SLIQ, but to build an accurate classifier efficiently for datasets too large for SLIQ. Furthermore, SPRINT is designed to be easily parallelizable, so the workload can be shared among N processors

SPRINT – Parallelizing Classification
SPRINT was specifically designed to remove any dependence on data structures that are either centralized or memory-resident
The parallel algorithms are all based on a shared-nothing parallel environment, where each of the N processors has private memory and disks
The processors are connected by a communication network and can communicate only by passing messages

Parallelizing Classification – Data Placement and Workload Balancing
The main data structures are
–attribute lists
–class histograms
SPRINT achieves uniform data placement and workload balancing by distributing the attribute lists evenly over the N processors, so each processor works on only 1/N of the total data (see the sketch below)
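A sketch of the initial placement: each sorted attribute list is cut into N contiguous, nearly equal sections, one per processor (the helper name is an illustrative assumption):

```python
def place_attribute_list(attr_list, n_processors):
    # Split one (already sorted) attribute list into N contiguous,
    # nearly equal sections; section p goes to processor p.
    n = len(attr_list)
    bounds = [round(p * n / n_processors) for p in range(n_processors + 1)]
    return [attr_list[bounds[p]:bounds[p + 1]] for p in range(n_processors)]
```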

Parallelizing Classification – Finding Split Points (Continuous Attributes)
In a parallel environment, each processor has a separate contiguous section of a “global” attribute list
–C_below must initially reflect the class distribution of all sections of the attribute list assigned to processors of lower rank
–C_above must initially reflect the class distribution of the local section as well as all sections assigned to processors of higher rank

Parallelizing Classification – Example: split point for a continuous attribute

Processor 0             Processor 1
Age  Class  rid         Age  Class  rid
17   High   1           32   Low    4
20   High   5           43   High   2
23   High   0           68   Low    3

           H  L                    H  L
C_below    0  0         C_below    3  0
C_above    4  2         C_above    1  2
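A sketch of this histogram initialization: each processor counts the classes in its local section, the per-section counts are exchanged, and processor p seeds C_below with the counts of sections 0..p-1 and C_above with sections p..N-1. Communication is modeled here as a shared list purely for illustration.

```python
def init_histograms(sections, classes=("High", "Low")):
    # sections: the N contiguous sections of one global attribute list,
    # e.g. as produced by place_attribute_list above.
    local_counts = []
    for section in sections:                 # computed locally on each processor
        counts = dict.fromkeys(classes, 0)
        for _, label, _ in section:
            counts[label] += 1
        local_counts.append(counts)          # then exchanged between processors
    histograms = []
    for p in range(len(sections)):
        c_below = {c: sum(local_counts[q][c] for q in range(p))
                   for c in classes}         # lower-rank sections
        c_above = {c: sum(local_counts[q][c] for q in range(p, len(sections)))
                   for c in classes}         # local + higher-rank sections
        histograms.append((c_below, c_above))
    return histograms
```

On the two sections of the example above, this yields C_below = {H:0, L:0}, C_above = {H:4, L:2} for processor 0 and C_below = {H:3, L:0}, C_above = {H:1, L:2} for processor 1.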

Parallelizing Classification – Finding Split Points (Categorical Attributes)
Each processor builds a local count matrix for the leaf
The processors exchange these matrices
Each processor sums the local matrices to obtain the global count matrix
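A sketch of the summation step after the exchange, assuming each local matrix maps an attribute value to per-class counts (the layout is an assumption carried over from the serial sketch):

```python
def global_count_matrix(local_matrices, classes=("High", "Low")):
    # local_matrices: one count matrix per processor, each mapping
    # attribute value -> {class label: count}; after the exchange every
    # processor holds all of them and sums element-wise.
    values = {v for m in local_matrices for v in m}
    return {v: {c: sum(m.get(v, {}).get(c, 0) for m in local_matrices)
                for c in classes}
            for v in values}
```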

Parallelizing Classification – Performing the Splits
Splitting the attribute lists for each leaf is identical to the serial algorithm
The only difference is that, before building the probe structure, the rids must be collected from all the processors
After the rids are exchanged, each processor constructs a probe structure with all the rids and uses it to split the leaf’s remaining attribute lists

Parallelizing Classification – Parallelizing SLIQ
Two approaches for parallelizing SLIQ
–SLIQ/R: the class list is replicated in the memory of every processor
–SLIQ/D: the class list is distributed so that each processor’s memory holds only a portion of the entire list
Disadvantages
–SLIQ/R: the size of the training set is limited by the memory size of a single processor
–SLIQ/D: high communication cost

Performance Evaluation
The primary metrics for evaluating classifier performance are
–Classification accuracy
–Classification time
–Size of the decision tree
Since the accuracy and tree-size characteristics of SPRINT are identical to those of SLIQ, the evaluation focuses only on classification time

Performance Evaluation – Datasets
Each record in the synthetic database consists of nine attributes
Class labels are assigned by synthetic classification functions; the paper presents results for two of them
–Function 2 results in a fairly small decision tree
–Function 7 produces a very large tree
Both functions divide the database into two classes

Performance Evaluation – Serial Performance
Training sets ranging in size from 10,000 records to 2.5 million records were used
This examines how well SPRINT performs both in operating regions where SLIQ can run and in regions where it cannot

Performance Evaluation – Comparison of Parallel Algorithms
In these experiments, each processor held 50,000 training examples, and the number of processors varied from 2 to 16

Performance Evaluation – Scaleup (results figure)

Performance Evaluation – Speedup (results figure)

Performance Evaluation – Sizeup (results figure)

Conclusion
SPRINT makes it possible to build classifiers over very large databases
–It efficiently classifies datasets of virtually any size
–It scales nicely with the size of the dataset
–It parallelizes efficiently in a shared-nothing environment

References
John Shafer, Rakesh Agrawal and Manish Mehta. SPRINT: A Scalable Parallel Classifier for Data Mining. In Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB), 1996