FREERIDE: System Support for High Performance Data Mining
Ruoming Jin, Leo Glimcher, Xuan Zhang, Ge Yang, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Motivation: Data Mining Problem
- Datasets available for mining are often large
- Our understanding of which algorithms and parameters will give the desired insights is limited
- The time required to create scalable implementations of different algorithms, and to run them with different parameters on large datasets, slows down the data mining process

Project Overview
- FREERIDE (Framework for Rapid Implementation of Datamining Engines) as the base system
- Demonstrated for a variety of standard mining algorithms

FREERIDE offers:
- The ability to rapidly prototype a high-performance mining implementation
- Distributed memory parallelization
- Shared memory parallelization
- Ability to process large and disk-resident datasets
- Only modest modifications to a sequential implementation for the above three

Key Observation from Mining Algorithms
- Popular algorithms have a common canonical loop
- Can be used as the basis for supporting a common middleware

    While( ) {
        forall(data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        …….
    }
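As a rough illustration of the canonical loop, the sketch below casts one-dimensional k-means in that shape. It is not FREERIDE code: the function names (`nearest`, `kmeans_1d`), the `[sum, count]` layout of the reduction object, and the sample data are all assumptions made for this example.

```python
def nearest(point, centroids):
    """process(d): index of the reduction element that point d updates."""
    return min(range(len(centroids)), key=lambda k: abs(point - centroids[k]))


def kmeans_1d(data, centroids, max_iters=100):
    """Hypothetical 1-D k-means written in the shape of the canonical loop."""
    centroids = list(centroids)
    for _ in range(max_iters):                 # While( ) { ... }
        # Reduction object R: one [sum, count] element per cluster.
        reduction = [[0.0, 0] for _ in centroids]
        for d in data:                         # forall(data instances d)
            i = nearest(d, centroids)          # I = process(d)
            reduction[i][0] += d               # R(I) = R(I) op d
            reduction[i][1] += 1
        # Work after the pass: recompute centroids from the reduction object.
        updated = [s / n if n else c for (s, n), c in zip(reduction, centroids)]
        if updated == centroids:               # stop once nothing changes
            break
        centroids = updated
    return centroids


centers = kmeans_1d([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 5.0])
print(centers)
```

The only state shared across data instances is the reduction object, and the update is commutative and associative, which is what makes the loop a natural target for a common parallel middleware.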

Shared Memory Parallelization Techniques
- Full Replication: create a copy of the reduction object for each thread
- Full Locking: associate a lock with each element
- Optimized Full Locking: put the element and corresponding lock on the same cache block
- Fixed Locking: use a fixed number of locks
- Cache Sensitive Locking: one lock for all elements in a cache block
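The contrast between full replication and full locking can be sketched with a simple counting reduction. This is a minimal illustration using Python threads, not the middleware's implementation; the helper names, the strided data partitioning, and the use of `d % num_elements` as the `process` step are assumptions.

```python
import threading


def run_threads(worker, num_threads):
    """Start one thread per id and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_full_replication(data, num_elements, num_threads):
    """Full replication: each thread updates a private copy; copies are merged at the end."""
    copies = [[0] * num_elements for _ in range(num_threads)]

    def worker(tid):
        local = copies[tid]
        for d in data[tid::num_threads]:      # assumed strided partition of the data
            local[d % num_elements] += 1      # no lock: the copy is private to this thread

    run_threads(worker, num_threads)
    return [sum(column) for column in zip(*copies)]


def count_full_locking(data, num_elements, num_threads):
    """Full locking: one shared reduction object with one lock per element."""
    shared = [0] * num_elements
    locks = [threading.Lock() for _ in range(num_elements)]

    def worker(tid):
        for d in data[tid::num_threads]:
            i = d % num_elements
            with locks[i]:                    # protect only the element being updated
                shared[i] += 1

    run_threads(worker, num_threads)
    return shared


data = list(range(1000))
replicated = count_full_replication(data, num_elements=8, num_threads=4)
locked = count_full_locking(data, num_elements=8, num_threads=4)
print(replicated == locked)
```

Replication pays in memory and in the final merge; locking pays on every update. The remaining schemes in the list vary how many locks exist and where they sit in memory.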

Memory Layout for Various Locking Schemes
[Figure: memory layouts for full locking, fixed locking, optimized full locking, and cache-sensitive locking. Legend: lock, reduction element.]
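Since the layout figure itself is not recoverable from the transcript, the sketch below shows one plausible way to express which lock guards which reduction element under each scheme. The cyclic assignment for fixed locking and the figure of 7 elements per cache block (a 64-byte block holding one 8-byte lock plus seven 8-byte elements) are assumptions for illustration, not values taken from the slides.

```python
def lock_for_element(i, scheme, num_fixed_locks=4, elems_per_block=7):
    """Return the index of the lock guarding reduction element i (hypothetical mapping)."""
    if scheme in ("full", "optimized_full"):
        # One lock per element; the two schemes differ only in where the lock
        # is placed in memory (separate array vs. next to its element).
        return i
    if scheme == "fixed":
        # Assumed cyclic assignment of elements to a fixed pool of locks.
        return i % num_fixed_locks
    if scheme == "cache_sensitive":
        # One lock shared by all elements that live in the same cache block.
        return i // elems_per_block
    raise ValueError(f"unknown scheme: {scheme}")


def locks_needed(num_elements, scheme, **kwargs):
    """Count the distinct locks used for a reduction object of the given size."""
    return len({lock_for_element(i, scheme, **kwargs) for i in range(num_elements)})


for scheme in ("full", "optimized_full", "fixed", "cache_sensitive"):
    print(scheme, locks_needed(28, scheme))
```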

Trade-offs between Techniques
- Memory requirements: high memory requirements can cause memory thrashing
- Contention: if the number of reduction elements is small, contention for locks can be a significant factor
- Coherence cache misses and false sharing: more likely with a small number of reduction elements
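The memory side of the trade-off can be made concrete with back-of-the-envelope arithmetic. The sizes below (8-byte elements, 8-byte locks, 64-byte cache blocks, 1024 fixed locks) are illustrative assumptions, and the function is a hypothetical estimator rather than anything measured in the talk.

```python
import math


def memory_bytes(scheme, num_elements, num_threads=1, elem_size=8,
                 lock_size=8, block_size=64, fixed_locks=1024):
    """Rough footprint of the reduction object plus its locks, under assumed sizes."""
    data = num_elements * elem_size
    if scheme == "full_replication":
        return num_threads * data                 # one private copy per thread, no locks
    if scheme in ("full_locking", "optimized_full_locking"):
        return data + num_elements * lock_size    # one lock per element
    if scheme == "fixed_locking":
        return data + fixed_locks * lock_size     # lock pool size independent of data size
    if scheme == "cache_sensitive_locking":
        # Each cache block holds one lock plus as many elements as fit beside it.
        elems_per_block = (block_size - lock_size) // elem_size
        return math.ceil(num_elements / elems_per_block) * block_size
    raise ValueError(f"unknown scheme: {scheme}")


for scheme in ("full_replication", "full_locking", "fixed_locking", "cache_sensitive_locking"):
    print(scheme, memory_bytes(scheme, num_elements=7000, num_threads=4))
```

Under these assumptions, replication grows with the thread count while per-element locking roughly doubles the footprint, which is the kind of pressure the memory-thrashing point refers to.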

Combining Shared Memory and Distributed Memory Parallelization
- Distributed memory parallelization by replication of reduction object
- Naturally combines with full replication on shared memory
- For locking with non-trivial memory layouts, two options:
  - Communicate locks also
  - Copy reduction elements to a separate buffer
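A minimal single-process simulation of the distributed-memory scheme: each node reduces its own partition into a local replica, and the replicas are then combined element-wise. The interleaved `[lock, element, lock, element, ...]` layout used by `pack_elements` is an assumed stand-in for a "non-trivial memory layout"; a real system would exchange replicas over a network instead of passing Python lists.

```python
def local_reduce(partition, num_elements):
    """Each node reduces its own data partition into a private replica."""
    replica = [0] * num_elements
    for d in partition:
        replica[d % num_elements] += 1
    return replica


def global_combine(replicas):
    """Combine the per-node replicas element by element (here: a sum)."""
    return [sum(column) for column in zip(*replicas)]


def pack_elements(interleaved):
    """Second option: copy only the reduction elements into a contiguous buffer.

    Assumes a layout of [lock0, elem0, lock1, elem1, ...], so elements sit at odd slots.
    """
    return interleaved[1::2]


partitions = [[0, 1, 2, 3], [1, 1, 2], [3, 3, 0]]       # data held by three simulated nodes
replicas = [local_reduce(p, num_elements=4) for p in partitions]
combined = global_combine(replicas)
print(combined)
```

Packing costs an extra copy but keeps lock words off the wire; sending the locks along avoids the copy at the price of a larger message.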

Apriori Association Mining
500 MB dataset, N2000, L20, 4 threads

K-means Shared Memory Parallelization

Performance on Cluster of SMPs Apriori Association Mining

Results from EM Clustering Algorithm
- EM is a popular data mining algorithm
- Can we parallelize it using the same support that worked for another clustering algorithm (k-means) and for algorithms for other mining tasks?

Results from FP-tree
FP-tree: 800 MB dataset, 20 frequent itemsets

A Case Study: Decision Tree Construction
- Question: can we parallelize decision tree construction using the same framework?
- Most existing parallel algorithms have a fairly different structure (sorting, writing back …)
- Being able to support decision tree construction will significantly add to the usefulness of the framework

Approach
- Implemented the RainForest framework (Gehrke)
- Currently focus on RF-read
- Overview of the algorithm:

    While the stop condition is not satisfied
        read the data
        build the AVC-group for nodes
        choose the splitting attributes to split nodes
        select a new set of nodes to process, as long as main memory can hold it
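The central data structure in this loop is the AVC-group: for each attribute, a table of counts indexed by attribute value and class label. The sketch below builds an AVC-group for the records at a single node and picks a splitting attribute from it. The slides do not name a split criterion, so the weighted Gini index used here is an assumption, as are the record format and function names.

```python
from collections import Counter, defaultdict


def build_avc_group(records, attributes, label="label"):
    """One pass over the node's records: count (attribute value, class label) pairs."""
    avc_group = {a: defaultdict(Counter) for a in attributes}
    for rec in records:
        for a in attributes:
            avc_group[a][rec[a]][rec[label]] += 1    # a reduction-style update
    return avc_group


def gini(class_counts):
    """Gini impurity of a class-label histogram."""
    n = sum(class_counts.values())
    return 1.0 - sum((c / n) ** 2 for c in class_counts.values())


def weighted_gini(avc_set):
    """Impurity after splitting on every distinct value of one attribute."""
    total = sum(sum(counts.values()) for counts in avc_set.values())
    return sum(sum(counts.values()) / total * gini(counts) for counts in avc_set.values())


def choose_split(avc_group):
    """Pick the attribute whose split leaves the lowest weighted impurity (assumed criterion)."""
    return min(avc_group, key=lambda a: weighted_gini(avc_group[a]))


records = [
    {"outlook": "sunny", "windy": "no", "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rain", "windy": "no", "label": "stay"},
    {"outlook": "rain", "windy": "yes", "label": "stay"},
]
group = build_avc_group(records, ["outlook", "windy"])
print(choose_split(group))
```

Because building the AVC-group is just counting, it has the same shape as the canonical reduction loop, and the split decision needs only the counts, not the records.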

Parallelization Strategies
- Pure approach: only apply one of full replication, optimized full locking and cache-sensitive locking
- Vertical approach: use replication at top levels, locking at lower
- Horizontal: use replication for attributes with a small number of distinct values, locking otherwise
- Mixed approach: combine the above two
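The four strategies can be read as a rule that picks a technique per tree level and per attribute. The sketch below encodes one such rule. The cutoff values are hypothetical, and the slide does not spell out how the mixed approach combines the other two, so the rule used here (replicate when either the vertical or the horizontal rule would replicate) is only one possible interpretation.

```python
def choose_technique(approach, level, distinct_values,
                     pure="full_replication",
                     locking="cache_sensitive_locking",
                     level_cutoff=3, value_cutoff=16):
    """Pick a shared-memory technique for one attribute's AVC-set at one tree level.

    level_cutoff and value_cutoff are illustrative thresholds, not values from the talk.
    """
    replicate_by_level = level < level_cutoff              # vertical rule
    replicate_by_values = distinct_values <= value_cutoff  # horizontal rule
    if approach == "pure":
        return pure
    if approach == "vertical":
        return "full_replication" if replicate_by_level else locking
    if approach == "horizontal":
        return "full_replication" if replicate_by_values else locking
    if approach == "mixed":
        # Assumed combination: replicate if either rule says so.
        return "full_replication" if (replicate_by_level or replicate_by_values) else locking
    raise ValueError(f"unknown approach: {approach}")


for approach in ("pure", "vertical", "horizontal", "mixed"):
    print(approach, choose_technique(approach, level=5, distinct_values=1000))
```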

Results
Performance of pure versions: 1.3 GB dataset with 32 million records in the training set, function 7, decision tree depth = 16.

Results
Combining full replication and full locking

Results
Combining full replication and cache-sensitive locking

Combining Distributed Memory and Shared Memory Parallelization for Decision Tree
- The key problem: large size of AVC groups means very high communication volume
- Results in very limited speedups
- Can we modify the algorithm to reduce communication volume?

SPIES On (a) FREERIDE
- Developed a new communication-efficient decision tree construction algorithm: Statistical Pruning of Intervals for Enhanced Scalability (SPIES)
- Combines RainForest with statistical pruning of intervals of numerical attributes to reduce memory requirements and communication volume
- Does not require sorting of data, or partitioning and writing-back of records
- Paper in SDM regular program

Applying FREERIDE for Scientific Data Mining
- Joint work with Machiraju and Parthasarathy
- Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
- A feature is a region of interest in a dataset
- A suite of algorithms for extracting and tracking features

Summary
- Demonstrated a common framework for parallelization of a wide range of mining algorithms
  - Association mining – apriori and fp-tree
  - Clustering – k-means and EM
  - Decision tree construction
  - Nearest neighbor search
- Both shared memory and distributed memory parallelism
- A number of advantages
  - Eases parallelization
  - Supports higher-level interfaces