Shared Memory Parallelization of Decision Tree Construction Using a General Middleware

Ruoming Jin and Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Motivation

Can the algorithms for a variety of mining tasks be parallelized using a common parallelization structure? If so, we can:
- Have a common set of parallelization techniques
- Develop general runtime / middleware support
- Parallelize starting from a high-level interface

Context

Part of the FREERIDE (Framework for Rapid Implementation of Datamining Engines) system:
- Supports parallelization on shared-nothing configurations
- Supports parallelization on shared-memory configurations
- Supports processing of large datasets

Previously reported work:
- Distributed-memory parallelization and processing of disk-resident datasets (SDM 01, IPDPS 01 workshop)
- Shared-memory parallelization applied to association mining and clustering (SDM 02, IPDPS 02 workshop, SIGMETRICS 02)

Decision Tree Construction

- One of the key mining problems
- Previous parallel algorithms have had a significantly different structure than parallel algorithms for association mining, clustering, etc.:
  - frequently require sorting of the data (SPRINT, SLIQ, etc.)
  - require attributes to be written back
  - difficult to obtain very high speedups
- Can we perform shared-memory parallelization of decision tree construction using the same techniques that were used for association mining and clustering?

Outline

- Previous work on shared-memory parallelization
- Observations from major mining algorithms
- Parallelization techniques
- Decision tree construction: problem and algorithms
- RainForest-based approach
- Parallelization methods
- Experimental results

Common Processing Structure

Structure of common data mining algorithms:

    {* Outer Sequential Loop *}
    While () {
        {* Reduction Loop *}
        Foreach (element e) {
            (i, val) = process(e);
            Reduc(i) = Reduc(i) op val;
        }
    }

This structure applies to the major association mining, clustering, and decision tree construction algorithms. How do we parallelize it on a shared-memory machine?
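As an illustration, here is a minimal C++ rendering of this canonical loop for a histogram-style reduction. The process() function, the reduction array, and the use of += as the reduction op are hypothetical stand-ins, not the FREERIDE API:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-element processing: maps an input element to a
// reduction-object index and a contribution value.
struct Update { std::size_t index; int value; };

Update process(int element) {
    // Toy example: bucket elements by value modulo the object size.
    return { static_cast<std::size_t>(element) % 16, 1 };
}

int main() {
    std::vector<int> data = {3, 7, 3, 12, 7, 7};   // one pass worth of elements
    std::vector<int> reduc(16, 0);                 // the reduction object

    bool done = false;
    while (!done) {                                // outer sequential loop
        for (int e : data) {                       // reduction loop
            Update u = process(e);
            reduc[u.index] += u.value;             // Reduc(i) = Reduc(i) op val
        }
        done = true;                               // single pass in this toy case
    }
    return 0;
}
```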

Challenges in Parallelization

- Statically partitioning the reduction object to avoid race conditions is generally impossible
- Runtime preprocessing or scheduling cannot be applied either: which elements need updating cannot be known without processing the element
- The size of the reduction object means replication incurs significant memory overheads
- Locking and synchronization costs can be significant because of the fine-grained updates to the reduction object

Parallelization Techniques

- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put each element and its corresponding lock on the same cache block
- Fixed Locking: use a fixed number of locks
- Cache-Sensitive Locking: one lock for all elements in a cache block

Memory Layout for Various Locking Schemes

[Figure: memory layouts for full locking, fixed locking, optimized full locking, and cache-sensitive locking; legend: lock, reduction element]
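For concreteness, here is a C++ sketch of the two cache-conscious layouts, assuming 64-byte cache lines; the SpinLock type and the element counts per block are illustrative assumptions, not the middleware's actual data structures. Optimized full locking pads each (lock, element) pair to its own cache block, while cache-sensitive locking amortizes one lock over all elements in a block:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// A tiny test-and-set spinlock, standing in for the middleware's
// lightweight locks (hypothetical, not the FREERIDE implementation).
struct SpinLock {
    std::atomic<bool> held{false};
    void lock()   { while (held.exchange(true, std::memory_order_acquire)) {} }
    void unlock() { held.store(false, std::memory_order_release); }
};

// Optimized full locking: each reduction element shares a 64-byte cache
// block with its own lock, so taking the lock also brings the element
// into cache.
struct alignas(64) PaddedElement {
    SpinLock  lock;
    long long value = 0;
};

// Cache-sensitive locking: one lock guards all elements packed into the
// rest of its cache block, cutting per-element memory overhead.
struct alignas(64) LockedBlock {
    SpinLock  lock;
    long long values[7] = {};
};

void update_ofl(std::vector<PaddedElement>& reduc, std::size_t i, long long v) {
    reduc[i].lock.lock();
    reduc[i].value += v;
    reduc[i].lock.unlock();
}

void update_csl(std::vector<LockedBlock>& reduc, std::size_t i, long long v) {
    LockedBlock& b = reduc[i / 7];   // block holding element i
    b.lock.lock();
    b.values[i % 7] += v;
    b.lock.unlock();
}
```

The trade-off is visible in the layouts: optimized full locking spends a full cache line per element, whereas cache-sensitive locking packs several elements per line at the cost of coarser-grained synchronization.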

Trade-offs between Techniques

- Memory requirements: high memory requirements can cause memory thrashing
- Contention: if the number of reduction elements is small, contention for locks can be a significant factor
- Coherence cache misses and false sharing: more likely with a small number of reduction elements

Summary of Results on Association Mining and Clustering

- Applied the techniques to apriori association mining and k-means clustering
- Each of full replication, optimized full locking, and cache-sensitive locking can outperform the others, depending upon the size of the reduction object
- Near-linear speedups were obtained in all our experiments

Decision Tree Construction Problem

- Input: a set of training records, each with several attributes
- One attribute is special, called the class label; the others are predictor attributes
- The goal is to construct a prediction model that predicts the class label for a new record, using the values of its predictor attributes
- The tree is constructed in a top-down fashion

Existing Algorithms

- Initial algorithms required the training records to fit in memory
- SLIQ: scalable, but
  - requires sorting of numerical attributes
  - separates attribute lists
  - maintains a data structure called the class list in main memory, with size proportional to the number of training records
- SPRINT: scalable and parallelizable, but
  - requires sorting of numerical attributes
  - separates attribute lists
  - partitions attribute lists while splitting a node

RainForest-Based Decision Tree Construction

- A general approach to scaling decision tree construction
- Key idea: the AVC-set (Attribute Value, Class label)
  - sufficient information for deciding on a split condition
  - for a given attribute and node, its size is proportional to the number of distinct values and class labels
  - easily constructed in one pass over the data
- A number of different algorithms: RF-read, RF-write, RF-hybrid
- RF-read has a structure that fits very well with the canonical loop presented earlier
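To make the AVC-set concrete, here is a minimal sketch for one (node, attribute) pair, assuming a map-based representation; the types and names are illustrative, not the RainForest implementation:

```cpp
#include <cstddef>
#include <map>
#include <vector>

// AVC-set for a single (node, attribute) pair: for each distinct
// attribute value, a vector of per-class-label counts.
struct AVCSet {
    std::size_t num_classes;
    std::map<int, std::vector<long long>> counts;  // value -> class counts

    explicit AVCSet(std::size_t nc) : num_classes(nc) {}

    void add(int attr_value, std::size_t class_label) {
        auto& row = counts[attr_value];
        if (row.empty()) row.assign(num_classes, 0);
        ++row[class_label];
    }
};

// One pass over the training records suffices to build the AVC-set.
AVCSet build_avc(const std::vector<int>& attr,
                 const std::vector<std::size_t>& labels,
                 std::size_t num_classes) {
    AVCSet avc(num_classes);
    for (std::size_t r = 0; r < attr.size(); ++r)
        avc.add(attr[r], labels[r]);
    return avc;
}
```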

RF-read Algorithm

High-level structure of the algorithm:

While the stop condition is not satisfied:
- read the data
- build the AVC-groups for the nodes
- choose the splitting attributes and split the nodes
- select a new set of nodes to process, as many as main memory can hold

The algorithm never needs to write any data back to disk, but it may require multiple passes to process one level of the tree.
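Schematically, the RF-read control loop might look as follows; the Node type, the stub helpers, and the depth-based stop condition are assumptions made so the sketch is self-contained:

```cpp
#include <vector>

// Schematic RF-read control loop (hypothetical helpers with trivial
// stubs so the sketch compiles; the real algorithm scans a
// disk-resident dataset and computes actual splits).
struct Node { int depth = 0; };

static bool stop_condition(const std::vector<Node>& active, int max_depth) {
    return active.empty() || active.front().depth >= max_depth;
}

static void build_avc_groups(std::vector<Node>&) { /* one pass over data */ }

static std::vector<Node> split_and_select(const std::vector<Node>& active) {
    // Choose splitting attributes, then select as many child nodes as
    // have AVC-groups that fit in main memory; a tree level may
    // therefore need multiple passes.
    std::vector<Node> next;
    for (const Node& n : active) {
        next.push_back({n.depth + 1});
        next.push_back({n.depth + 1});
    }
    return next;
}

int main() {
    std::vector<Node> active = {Node{}};          // root node
    while (!stop_condition(active, /*max_depth=*/16)) {
        build_avc_groups(active);                 // read the data
        active = split_and_select(active);        // split, pick next nodes
    }
    return 0;
}
```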

Overall Parallelization Approach

- Training records can be processed independently by the processors
- The AVC-sets of the nodes are the reduction objects: race conditions can arise in updating their values
- The different parallelization techniques described earlier are used to avoid these race conditions
- Higher memory requirements can mean more passes to process one level of the tree
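As one instance of these techniques, here is a sketch of full replication for a flattened array of AVC counts (the names and structure are hypothetical, not FREERIDE's runtime): each thread accumulates into a private copy, and the replicas are merged after the pass, so no locks are needed during the scan:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Full replication for a flattened AVC count array: each thread updates
// a private copy, and the copies are reduced at the end of the pass.
void parallel_counts(const std::vector<int>& slot,    // precomputed AVC slot per record
                     std::vector<long long>& global,  // merged counts (output)
                     unsigned num_threads) {
    std::vector<std::vector<long long>> local(
        num_threads, std::vector<long long>(global.size(), 0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread scans a contiguous chunk of the records.
            std::size_t begin = slot.size() * t / num_threads;
            std::size_t end   = slot.size() * (t + 1) / num_threads;
            for (std::size_t r = begin; r < end; ++r)
                ++local[t][static_cast<std::size_t>(slot[r])];
        });
    }
    for (auto& w : workers) w.join();
    for (unsigned t = 0; t < num_threads; ++t)        // merge the replicas
        for (std::size_t i = 0; i < global.size(); ++i)
            global[i] += local[t][i];
}
```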

Parallelization Strategies

- Pure approach: apply only one of full replication, optimized full locking, and cache-sensitive locking
- Vertical approach: use replication at the top levels of the tree, locking at the lower levels
- Horizontal approach: use replication for attributes with a small number of distinct values, locking otherwise
- Mixed approach: combine the vertical and horizontal approaches
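For concreteness, a minimal sketch of how a mixed strategy could dispatch between replication and locking; the depth and distinct-value thresholds, and the choice to combine the two criteria with a logical OR, are illustrative assumptions rather than the paper's exact policy:

```cpp
#include <cstddef>

enum class Technique { FullReplication, OptimizedFullLocking };

// Hypothetical mixed policy: replicate the (small) AVC-sets near the
// root and those of low-cardinality attributes; lock otherwise.
Technique choose_technique(int node_depth, std::size_t distinct_values,
                           int replicate_depth, std::size_t value_threshold) {
    bool vertical   = node_depth < replicate_depth;        // top of the tree
    bool horizontal = distinct_values <= value_threshold;  // few distinct values
    return (vertical || horizontal) ? Technique::FullReplication
                                    : Technique::OptimizedFullLocking;
}
```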

Experimental Setup

SMP machine:
- SunFire SMP (only up to 8 processors used for our experiments)
- 64 KB L1 cache, 8 MB L2 cache, 24 GB main memory

Dataset:
- 1.3 GB with 32 million training records
- Synthetic data, generated using a tool available from IBM Almaden
- 9 attributes: 3 categorical, 6 numerical
- Used functions 1 and 7 (function 1 in the paper)

Results

[Figure: performance of the pure versions; 1.3 GB dataset with 32 million training records, function 7, decision tree depth = 16]

Results

[Figure: combining full replication and optimized full locking]

Results

[Figure: combining full replication and cache-sensitive locking]

Summary

- A common set of techniques can be used for shared-memory parallelization of different mining algorithms
- A combination of parallelization techniques gives the best performance for decision tree construction
- Best speedup of 5.9 on 8 processors, comparable with other results on shared-memory and distributed-memory parallelization