Integrating and Optimizing Transactional Memory in a Data Mining Middleware
Vignesh Ravi and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio
Outline
– Motivation
– Software Transactional Memory
– Shared-Memory Parallelization Schemes
– Transactional Locking II (TL2)
– Hybrid Replicated STM Scheme
– FREERIDE Processing Structure
– Experimental Results
– Conclusions
Motivation
– Availability of large data for analysis: on the scale of terabytes and petabytes
– Advent of multi-core and many-core architectures: Intel's Polaris 80-core chip, Larrabee many-core architecture
– Programmability challenge: coarse-grained parallelization gives insufficient performance; fine-grained parallelization is better left to experts
– Need for a transparent, scalable shared-memory parallelization technique
Software Transactional Memory (STM)
– Maps the notion of concurrent database transactions onto concurrent thread operations
– Programmer: identifies critical sections, tags them as transactions, launches multiple threads
– Transactions run as atomic, isolated operations
– Data races handled automatically
– Guarantees absence of deadlock!
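A minimal C++ sketch of this programming model: the critical section is tagged as a transaction and multiple threads are launched. The run_transaction wrapper is hypothetical (here backed by a single global lock only so the sketch behaves correctly); a real STM such as RSTM instead logs reads and writes optimistically and retries on conflict.

    #include <mutex>
    #include <thread>
    #include <vector>

    // Hypothetical wrapper standing in for an STM "atomic" block. NOT the RSTM
    // interface: a single global lock is used only so this sketch is correct;
    // a real STM tracks reads/writes and retries transactions on conflict.
    template <typename F>
    void run_transaction(F critical_section) {
        static std::mutex global;
        std::lock_guard<std::mutex> guard(global);
        critical_section();
    }

    static long shared_counter = 0;

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) {
            threads.emplace_back([] {
                for (int i = 0; i < 1000; ++i) {
                    // The programmer only tags the critical section as a transaction;
                    // data races inside it are handled by the STM runtime.
                    run_transaction([] { ++shared_counter; });
                }
            });
        }
        for (auto& th : threads) th.join();
        return 0;
    }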
Contributions
FREERIDE Processing Structure (Framework for Rapid Implementation of Datamining Engines)

    {* Outer sequential loop *}
    While( ) {
        {* Reduction loop *}
        Foreach( element e ) {
            (i, val) = compute(e)
            RObj(i) = Reduc(RObj(i), val)
        }
    }

– Map-reduce is two-stage; FREERIDE is one-stage, with the intermediate structure exposed
– Better performance than map-reduce [Cluster '09]
– Middleware API: process each data instance, reduce the result into the reduction object, local combination from all threads if needed
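The pseudocode above is the generalized reduction structure. A minimal C++ sketch of that structure follows; the names compute, reduce, and ReductionObject are illustrative only, not the actual FREERIDE API.

    #include <utility>
    #include <vector>

    struct ReductionObject {
        std::vector<double> cells;                       // RObj(i) in the pseudocode
        explicit ReductionObject(std::size_t n) : cells(n, 0.0) {}
    };

    // compute(e): map an element to the reduction-object index it updates
    // and the value to fold in (illustrative rule only).
    std::pair<std::size_t, double> compute(double element, std::size_t num_cells) {
        std::size_t index = static_cast<std::size_t>(element) % num_cells;
        return {index, element};
    }

    // Reduc(RObj(i), val): an associative, commutative combination.
    double reduce(double current, double value) { return current + value; }

    void reduction_loop(const std::vector<double>& chunk, ReductionObject& robj) {
        for (double e : chunk) {                               // Foreach(element e)
            auto [i, val] = compute(e, robj.cells.size());
            robj.cells[i] = reduce(robj.cells[i], val);        // RObj(i) = Reduc(RObj(i), val)
        }
    }

    // The outer sequential loop (While) repeats reduction_loop over data chunks;
    // with multiple threads, each thread processes a chunk and the per-thread
    // reduction objects are combined locally if replication is used.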
Shared-Memory Parallelization Techniques
Context: FREERIDE (Framework for Rapid Implementation of Datamining Engines)
– Replication-based (lock-free): full-replication (f-r)
– Lock-based: full-locking, cache-sensitive locking (cs-l)
(Figure: full locking vs. cache-sensitive locking; each reduction element is paired with a lock)
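A sketch of how the two lock-based layouts differ, assuming a 64-byte cache line and a 1-byte spinlock; the sizes and field layout are illustrative, not the FREERIDE implementation.

    #include <atomic>
    #include <cstddef>
    #include <mutex>
    #include <vector>

    // Full locking: one lock per reduction element, but locks and elements live
    // in separate arrays, so an update may touch two different cache lines.
    struct FullLocking {
        std::vector<std::mutex> locks;      // locks[i] guards elements[i]
        std::vector<double>     elements;
        explicit FullLocking(std::size_t n) : locks(n), elements(n, 0.0) {}
    };

    // Cache-sensitive locking: a small lock is packed into the same cache line
    // as the elements it guards, so locking and updating cost a single line.
    struct alignas(64) CacheLineGroup {
        std::atomic_flag lock = ATOMIC_FLAG_INIT;  // 1-byte spinlock for this group
        double elements[7] = {};                   // 7 * 8 = 56 bytes; group fits in one 64-byte line
    };

    void update(CacheLineGroup& group, int slot, double val) {
        while (group.lock.test_and_set(std::memory_order_acquire)) { /* spin */ }
        group.elements[slot] += val;
        group.lock.clear(std::memory_order_release);
    }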
Motivation for STM Integration
– Potential downsides of existing schemes [CCGRID '09]:
  – Full-replication: very high memory requirements
  – Cache-sensitive locking: tuned for a specific cache architecture; risk of introducing bugs and deadlocks when porting
– Advantages of STM:
  – Leverage the large body of STM work: easier programmability, no deadlocks!
  – Provide transparent integration: the programmer need not deal with STM details
– What do we need? Use the easy programmability of STM while achieving competitive performance
Transactional Locking II (TL2)
– Word-based, lock-based STM algorithm
– Faster than non-blocking STM techniques
– API: STMBeginTransaction(), STMWrite(), STMRead(), STMCommit()
– We used the Rochester STM implementation (RSTM-TL2)
– Downside of STM: a large number of conflicts leads to a large number of aborts
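The reduction update RObj(i) = Reduc(RObj(i), val) expressed as a TL2-style transaction, using the API names listed on the slide. The exact signatures below are assumptions for illustration, not RSTM's actual interface.

    #include <cstddef>

    // Assumed signatures, for illustration only.
    void   STMBeginTransaction();
    double STMRead(double* addr);              // read is logged and validated at commit
    void   STMWrite(double* addr, double v);   // write is buffered until commit
    bool   STMCommit();                        // returns false on conflict (abort)

    // One reduction update, retried until the transaction commits.
    void transactional_update(double* robj, std::size_t i, double val) {
        for (;;) {
            STMBeginTransaction();
            double current = STMRead(&robj[i]);
            STMWrite(&robj[i], current + val);
            if (STMCommit()) break;   // an abort simply retries the transaction
        }
    }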
Optimization: Hybrid Replicated STM (rep-stm)
– Best of two worlds: replication and STM
– Replicated STM: group 'n' threads into 'm' groups; keep 'm' copies of the reduction object; each group of threads has a private copy; the n/m threads within a group share their copy using STM
– Advantages of replicated STM: fewer reduction-object copies, lower merge overhead, and fewer conflicts than pure STM
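A sketch of the replicated-STM layout: 'm' copies of the reduction object for 'n' threads, with the n/m threads in each group sharing one copy through STM transactions. Class and function names are illustrative, not the actual implementation.

    #include <cstddef>
    #include <vector>

    struct ReplicatedSTM {
        std::size_t num_groups;                          // m
        std::vector<std::vector<double>> copies;         // one reduction-object copy per group

        ReplicatedSTM(std::size_t m, std::size_t robj_size)
            : num_groups(m), copies(m, std::vector<double>(robj_size, 0.0)) {}

        // Each thread updates only its group's copy; the n/m threads sharing a
        // copy synchronize their updates through STM transactions (as in the
        // TL2-style transactional_update sketch above).
        std::vector<double>& copy_for(std::size_t thread_id) {
            return copies[thread_id % num_groups];
        }

        // Final merge combines only m copies rather than one per thread, so
        // memory use and merge overhead both drop relative to full replication.
        std::vector<double> merge() const {
            std::vector<double> result(copies[0].size(), 0.0);
            for (const auto& copy : copies)
                for (std::size_t i = 0; i < copy.size(); ++i) result[i] += copy[i];
            return result;
        }
    };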
Experimental Goals and Setup
Setup:
– Intel Xeon E5345 processors, two quad-core chips (8 cores), each core at 2.33 GHz
– 6 GB main memory, 8 MB L2 cache
Goals:
– Compare f-r, cs-l, TL2, and rep-stm for three data mining algorithms: k-means, Expectation Maximization (E-M), and Principal Component Analysis (PCA)
– Evaluate different read-write mixes
– Evaluate conflicts and aborts
Parallel Efficiency of PCA
– Principal Component Analysis, 8.5 GB data
– Best result: rep-stm (6.1x speedup)
– Observations: all techniques are competitive; STM overheads are 2.3%
– PCA specific: the computation for finding the co-variance matrix is high, which amortizes the cost of revalidation and lock acquire/release
Parallel Efficiency of EM
– Expectation-Maximization (EM), 6.4 GB data
– Best result: cs-l (~5x speedup)
– Observations: the STM schemes are competitive and show better scalability; the difference between stm-TL2 and rep-stm is not observed with 8 cores
– EM specific: the computation between updates is high; again, the initial overhead is high
Canonical Loop: Parallel Efficiency for Read-Write Mixes
– Canonical loop: a synthetic computation that follows the generalized reduction structure
– Different workloads with varying read/write mixes
– All results are from 8 threads
– Interesting: a different technique wins for each workload
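A sketch of what such a synthetic canonical loop can look like; the write_fraction parameter and the update rule are assumptions used only to show how the read/write mix can be varied, not the benchmark used in the experiments.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // Generalized-reduction loop with a tunable read/write mix.
    void canonical_loop(std::vector<double>& robj,
                        const std::vector<double>& data,
                        double write_fraction) {            // e.g. 0.1 = 10% writes
        for (double e : data) {
            std::size_t i = static_cast<std::size_t>(e) % robj.size();
            if (static_cast<double>(std::rand()) / RAND_MAX < write_fraction) {
                robj[i] += e;                   // write: must go through a lock or an STM transaction
            } else {
                volatile double r = robj[i];    // read-only access to the reduction object
                (void)r;
            }
        }
    }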
Evaluation of Conflicts and Aborts
– Same canonical loop; compare the rate of aborts for stm-TL2 and rep-stm
– Demonstrates the advantage of rep-stm over stm-TL2 for large numbers of threads
– In all cases, for rep-stm: the rate of growth of aborts is much slower, and aborts are reduced by 40-55%
Conclusions
– Transparent use of STM schemes
– Developed hybrid replicated-STM to reduce memory requirements and conflicts/aborts
– TL2 and rep-stm are competitive with the highly-tuned locking scheme
– rep-stm significantly reduces the number of aborts compared with TL2
Thank You! Questions?
Contacts: Vignesh Ravi, Gagan Agrawal
Parallel Efficiency of K-means
– K-means clustering, 6 GB data, k = 250
– Best result: f-r (6.57x speedup)
– STM overheads: 15.3% (revalidation of reads/writes, lock acquire/release)
– K-means specific: the computation between updates is quite low