Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster
Vignesh Ravi and Gagan Agrawal


Performance Issues in Parallelizing Data-Intensive Applications on a Multi-core Cluster
Vignesh Ravi and Gagan Agrawal

Outline
- Motivation
- FREERIDE middleware
- Generalized reduction structure
- Shared memory parallelization techniques
- Scalability results: K-means, Apriori & EM
- Performance analysis results
- Related work & conclusion

Motivation
- Availability of huge amounts of data drives data-intensive applications
- Advent of multi-core processors
- Need for abstractions and parallel programming systems
- It is still not clear which Shared Memory Parallelization (SMP) technique is best

Goals
- Scalability of data-intensive applications on multi-core machines
- Comparison of different shared memory parallelization (SMP) techniques
- Performance analysis of SMP techniques

Context: FREERIDE
- A middleware for parallelizing data-intensive applications
- Motivated by the difficulties of implementing parallel data mining applications
- Provides high-level APIs for easier parallel programming
- Based on the observation that many data mining and scientific applications share a similar generalized reduction structure

FREERIDE: Core Concepts
- Reduction Object: a shared data structure where results from processed data instances are stored (see the sketch below)
- Types of reduction:
  - Local reduction: reduction within a single node
  - Global reduction: reduction across the nodes of a cluster
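Read literally, the slide suggests an interface with a per-cell update plus the two combine steps. The following is a minimal sketch under that reading, with hypothetical names (ReductionObject, accumulate, localCombine, globalCombine), not FREERIDE's actual API.

```cpp
// Minimal sketch of a reduction-object abstraction; the names are
// hypothetical and only illustrate the two reduction levels.
#include <mpi.h>
#include <cstddef>
#include <vector>

class ReductionObject {
public:
    explicit ReductionObject(std::size_t n) : cells_(n, 0.0) {}

    // Update one cell with a value produced from a data instance.
    void accumulate(std::size_t i, double val) { cells_[i] += val; }

    // Local reduction: merge another thread's copy on the same node.
    void localCombine(const ReductionObject& other) {
        for (std::size_t i = 0; i < cells_.size(); ++i)
            cells_[i] += other.cells_[i];
    }

    // Global reduction: combine the per-node results across the cluster.
    void globalCombine(MPI_Comm comm) {
        MPI_Allreduce(MPI_IN_PLACE, cells_.data(),
                      static_cast<int>(cells_.size()),
                      MPI_DOUBLE, MPI_SUM, comm);
    }

private:
    std::vector<double> cells_;
};
```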

Generalized Reduction structure
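The figure for this slide is not reproduced in the transcript. The loop below is a plausible rendering of the structure it refers to: each element is processed into an (index, value) pair that is folded into the reduction object. The process() and reduce() callbacks and the types are illustrative assumptions, not the middleware's exact signatures.

```cpp
// Illustrative shape of the generalized reduction loop.
#include <cstddef>
#include <vector>

struct Update { std::size_t index; double value; };

template <typename Element, typename ProcessFn, typename ReduceFn>
void generalizedReduction(const std::vector<Element>& chunk,
                          std::vector<double>& reductionObject,
                          ProcessFn process, ReduceFn reduce) {
    for (const Element& e : chunk) {
        Update u = process(e);                      // (i, val) = process(e)
        reduce(reductionObject[u.index], u.value);  // RObj(i) = reduce(RObj(i), val)
    }
    // After this local pass, per-thread copies are merged within a node
    // (local reduction) and then across nodes (global reduction).
}
```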

Parallelization Challenges
- The reduction object cannot be statically partitioned between threads/nodes, so data races must be handled at runtime
- The reduction object can be large, so replication can cause memory overhead
- Updates to the reduction object are fine-grained, so locking schemes can cause significant overhead
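A minimal illustration of the first challenge: which reduction-object cell an element updates is only known at runtime, so a static partition of the object cannot keep two threads from touching the same cell. The function name is hypothetical.

```cpp
// Hypothetical illustration of the race on the reduction object.
#include <cstddef>
#include <vector>

void unsafeUpdate(std::vector<double>& reductionObject,
                  std::size_t cell, double contribution) {
    // Two threads may both reach here with the same 'cell' (e.g., two points
    // nearest the same centroid): a data race unless the update is
    // replicated per thread, locked, or made atomic.
    reductionObject[cell] += contribution;
}
```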

Techniques in FREERIDE
- Full replication (f-r)
- Locking-based techniques (sketched below):
  - Full locking (f-l)
  - Optimized full locking (o-f-l)
  - Cache-sensitive locking (cs-l)
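A rough sketch of where the cost sits in the first two schemes: full replication pays nothing per update but must merge the per-thread copies afterwards, while full locking pays a lock acquire/release on every fine-grained update. Names and structure are assumptions for illustration, not FREERIDE code; the optimized and cache-sensitive variants are sketched after the next slide.

```cpp
// Rough sketch contrasting full replication (f-r) and full locking (f-l).
#include <cstddef>
#include <mutex>
#include <vector>

// Full replication: each thread updates its private copy without locks;
// the copies are merged once, after the data has been processed.
void replicatedUpdate(std::vector<double>& privateCopy,
                      std::size_t cell, double contribution) {
    privateCopy[cell] += contribution;   // no synchronization on the fast path
}

// Full locking: one shared copy, one lock per reduction element; every
// fine-grained update pays for a lock acquire/release.
struct LockedReductionObject {
    std::vector<double>     cells;
    std::vector<std::mutex> locks;       // locks[i] guards cells[i]

    explicit LockedReductionObject(std::size_t n) : cells(n, 0.0), locks(n) {}

    void update(std::size_t cell, double contribution) {
        std::lock_guard<std::mutex> guard(locks[cell]);
        cells[cell] += contribution;
    }
};
```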

Memory Layout of locking schemes
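The layout figure itself is not in the transcript. The structs below give one plausible layout consistent with the scheme names, where the variants differ mainly in where the locks live relative to the reduction elements; they are illustrative assumptions, not the middleware's actual data structures.

```cpp
// Plausible layouts consistent with the scheme names (illustrative only).
#include <mutex>
#include <vector>

// Full locking (f-l): locks kept in a separate array, so an update usually
// touches two unrelated cache lines (one for the lock, one for the element).
struct FullLocking {
    std::vector<double>     elements;
    std::vector<std::mutex> locks;       // locks[i] guards elements[i]
};

// Optimized full locking (o-f-l): each lock is stored next to the element it
// guards, so both tend to fall on the same cache line.
struct OptimizedFullLocking {
    struct Cell { std::mutex lock; double value; };
    std::vector<Cell> cells;
};

// Cache-sensitive locking (cs-l): one lock per group of elements, with the
// group sized so that lock plus elements approximate one hardware cache
// block, reducing the memory spent on locks.
struct CacheSensitiveLocking {
    struct Block {
        std::mutex lock;
        double     values[7];            // group size is illustrative
    };
    std::vector<Block> blocks;
};
```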

Applications Implemented on FREERIDE
- Apriori (association mining)
- K-means (clustering)
- Expectation Maximization (E-M) (clustering); a k-means sketch follows below
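As an example of how these applications map onto the generalized reduction structure, a k-means-style local reduction could look like the sketch below: processing a point finds its nearest centroid, and the reduction object holds per-cluster coordinate sums and counts. The data layout and names are assumptions for illustration, not FREERIDE code.

```cpp
// Illustrative k-means local reduction in the generalized-reduction style.
#include <cstddef>
#include <limits>
#include <vector>

constexpr int DIM = 3;                   // 3-dimensional points, as in the setup

struct KMeansReduction {
    std::vector<double> sums;            // k * DIM running coordinate sums
    std::vector<long>   counts;          // points assigned to each cluster

    explicit KMeansReduction(std::size_t k) : sums(k * DIM, 0.0), counts(k, 0) {}
};

// process(e): find the nearest centroid; reduce: add the point into that
// cluster's cells of the reduction object.
void processPoint(const double* point, const std::vector<double>& centroids,
                  KMeansReduction& robj) {
    std::size_t k = robj.counts.size(), best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t c = 0; c < k; ++c) {
        double d = 0.0;
        for (int j = 0; j < DIM; ++j) {
            double diff = point[j] - centroids[c * DIM + j];
            d += diff * diff;
        }
        if (d < bestDist) { bestDist = d; best = c; }
    }
    for (int j = 0; j < DIM; ++j)        // reduction-object update
        robj.sums[best * DIM + j] += point[j];
    robj.counts[best] += 1;
}
```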

Goals of the Experimental Study
- Scalability of data-intensive applications on multi-core machines
- Comparison of different shared memory parallelization (SMP) techniques and MPI
- Performance analysis of SMP techniques

Experimental Setup
- Each node in the cluster has:
  - 2 quad-core Intel Xeon E5345 CPUs (2.33 GHz per core)
  - 6 GB main memory
- Nodes in the cluster are connected by InfiniBand

Experiments
Two sets of experiments:
- Comparison of scalability results for f-r, cs-l, o-f-l and MPI with k-means, Apriori and E-M
  - Single node
  - Cluster of nodes
- Performance analysis results with k-means, Apriori and E-M

Application Data Setup
- Apriori: 900 MB dataset; support = 3%, confidence = 9%
- K-means: 6.4 GB dataset of 3-dimensional points; 250 clusters
- E-M: 6.4 GB dataset of 3-dimensional points; 60 clusters

Apriori (single node)

Apriori (cluster)

K-means (single node)

K-means (cluster)

E-M (single node)

E-M (cluster)

Performance Analysis of SMP Techniques
- Given an application, can we predict the factors that determine the best SMP technique?
- Why do locking techniques suffer with Apriori, yet compete well on the other applications?
- What factors limit the overall scalability of data-intensive applications?

Performance Analysis Setup
- Valgrind used for dynamic binary analysis
- Cachegrind (Valgrind's cache profiler) used to analyze cache utilization

Performance Analysis: Locking vs. Merge Overhead

Performance Analysis (contd.): Relative L2 Misses for the Reduction Object

Performance Analysis (contd.): Total Program Read/Write Misses

Analysis
- Important trade-off:
  - Memory needs of the application
  - Frequency of updates to the reduction object
- E-M is compute and memory intensive
  - Locking overhead is very low
  - Replication overhead is high
- Apriori has a high update fraction and very little computation
  - Locking overhead is extremely high
  - Replication performs best

Related Work
- Google MapReduce
- Yahoo Hadoop
- Phoenix (Stanford University)
- SALSA (Indiana University)

Conclusion
- Replication and locking schemes can each outperform the other, depending on the application
- Locking schemes incur high overhead when there is little computation between updates to the reduction object
- MPI processes compete well up to 4 threads, but experience communication overheads with 8 threads
- Performance analysis shows that an application's memory needs and update fraction are significant factors for scalability

Thank you! Questions?