Evaluating FERMI Features for Data Mining Applications
Master's Thesis Presentation
Sinduja Muralidharan
Advised by: Dr. Gagan Agrawal

Outline
– Motivation and Background
– The FERMI and TESLA series GPUs
– Reduction-based Data Mining Algorithms
– Parallelization Methods for GPUs
– Experimental Evaluation
– Conclusion

Background
GPUs have recently emerged as a major player in high performance computing, due to:
– the excellent price-to-performance ratio they provide, and
– the suitability and popularity of CUDA for programming a variety of high performance applications.
GPU hardware and software have evolved rapidly: new GPU products and successive versions of CUDA have added new functionality and better performance.

The FERMI GPU
The Fermi series of cards:
– includes the C2050 and C2070 cards,
– also referred to as the 20-series family of NVIDIA Tesla GPUs.
New features include:
– Support for double precision atomic operations.
– A much larger shared memory/L1 cache, which can be configured as either
  – 48KB shared memory and 16KB L1 cache, or
  – 16KB shared memory and 48KB L1 cache.
– Presence of an L2 cache.
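
As a sketch, the shared memory/L1 split is selected per kernel through the CUDA runtime; the kernel name here is a hypothetical stand-in:

    #include <cuda_runtime.h>

    // Hypothetical kernel standing in for any shared-memory-heavy reduction.
    __global__ void reduction_kernel(float *out) { }

    int main() {
        // Request the 48KB shared / 16KB L1 configuration for this kernel;
        // cudaFuncCachePreferL1 would select the 16KB/48KB split instead.
        cudaFuncSetCacheConfig(reduction_kernel, cudaFuncCachePreferShared);
        return 0;
    }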

TESLA vs FERMI

Thesis Objective
Optimizing and evaluating the new features of the Fermi series GPUs:
– increased shared memory, and
– support for atomic operations on floating point data,
using three parallelization approaches on reduction-based mining algorithms:
– full replication in shared memory,
– improved locking with inbuilt atomic operations, and
– creation of several hybrid versions for optimal performance.

Generalized Reductions
– op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object.
– The data instances (or records, or transactions) are divided among the processing threads.
– Which elements of the reduction object are updated depends on the results of previous processing: the element updated in iteration i of the loop is known only at runtime.
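
As a sketch of this structure, a sequential generalized reduction looks as follows; the process and op stubs here are hypothetical stand-ins, since the real ones are application specific:

    #define K 8                // size of the reduction object (hypothetical)
    float Reduc[K];            // the reduction object

    // Stub for process(): picks which entry to update and the contribution
    // to fold in. Assumes nonnegative data; a real process() depends on the
    // application (e.g., nearest-centroid assignment in k-means).
    void process(float elem, int *i, float *val) {
        *i = (int)elem % K;
        *val = elem;
    }

    // op must be commutative and associative so updates can be reordered.
    float op(float a, float b) { return a + b; }

    void generalized_reduction(const float *data, int n) {
        for (int e = 0; e < n; e++) {
            int i; float val;
            process(data[e], &i, &val);    // entry depends on the element
            Reduc[i] = op(Reduc[i], val);  // fold into the reduction object
        }
    }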

Parallelizing Generalized Reductions
– It is not possible to statically partition the reduction object so that different processors update disjoint portions of it, so runtime updates can lead to race conditions.
– The process function can take up a major chunk of the total execution time of a loop iteration, so runtime preprocessing and static scheduling techniques cannot be applied.
– The reduction object may be too large to keep replicas in memory without significant overheads.

Earlier Parallelization Techniques
Earlier attempts to parallelize the MapReduce class of applications on GPUs faced:
– a lack of support for atomic operations on floating point numbers, and
– the large number of threads required for effective parallelization.
On Fermi, the larger shared memory allows total replication of the reduction object for some thread configurations, which largely avoids the possibility of race conditions and thread contention.

Full Replication
In any shared memory system, the best way to avoid race conditions is to:
– have each thread keep its own copy of the reduction object in device memory and process its share of the data separately;
– at the end of each iteration, perform a global combination, either by a single thread or using a tree structure;
– copy the final object back to host memory.
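
A minimal CUDA sketch of this scheme, assuming addition as op and the same hypothetical K-entry object and indexing stand-in as the sketch above:

    #define K 8

    // Each thread folds its share of the data into its own private copy of
    // the reduction object, stored contiguously in device memory.
    __global__ void update_replicated(float *reduc, const float *data, int n) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        float *mine = reduc + t * K;            // this thread's private copy
        for (int e = t; e < n; e += stride)
            mine[(int)data[e] % K] += data[e];  // op is + in this sketch
    }

    // Single-pass global combination into copy 0 (a tree-structured
    // combination is the alternative mentioned above). Launch with K threads.
    __global__ void combine(float *reduc, int ncopies) {
        int i = threadIdx.x;
        for (int c = 1; c < ncopies; c++)
            reduc[i] += reduc[c * K + i];
    }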

Full Replication in Shared Memory
The factors affecting the performance of the full replication mode of reduction are:
– the size of the reduction object (which depends on the number of threads per multiprocessor),
– the amount of computation relative to the amount of data copied between host and device, and
– whether or not global data can be copied into shared memory.
On Tesla, it was not possible to fit all copies of the reduction object within the 16KB of available shared memory, so the higher latency device memory had to be used.

Full Replication in Shared Memory (continued)
The larger shared memory on Fermi can hold all copies of the reduction object entirely on-chip for smaller configurations:
– no race conditions and no contention among threads, because each thread updates its own copy of the object;
– global memory accesses are replaced by low latency shared memory accesses.
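
A corresponding sketch with the per-thread copies moved into shared memory; THREADS and K are hypothetical and must satisfy THREADS * K * sizeof(float) <= 48KB:

    #define K 8
    #define THREADS 64   // threads per block (2KB of copies in this sketch)

    // Every thread keeps its private copy on-chip, so the update loop
    // touches only shared memory and needs no atomics.
    __global__ void update_shared(float *out, const float *data, int n) {
        __shared__ float copies[THREADS][K];
        int t = threadIdx.x;
        for (int i = 0; i < K; i++) copies[t][i] = 0.0f;

        int gt = blockIdx.x * blockDim.x + t;
        int stride = gridDim.x * blockDim.x;
        for (int e = gt; e < n; e += stride)
            copies[t][(int)data[e] % K] += data[e];
        __syncthreads();

        if (t < K) {                             // combine within the block
            float sum = 0.0f;
            for (int c = 0; c < THREADS; c++) sum += copies[c][t];
            out[blockIdx.x * K + t] = sum;       // per-block partial result
        }
    }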

Locking Scheme
The shared memories of different multiprocessors have no synchronization mechanism, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor.
While updating the reduction object, all threads of a thread block use locking to avoid race conditions.
Finally, a global combination is performed over the updates accumulated on the different multiprocessors.
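
A sketch of the locking scheme, with per-entry atomic adds standing in for fine-grained locks (again with a hypothetical K-entry object and addition as op):

    #define K 128   // one copy of a K-entry object per thread block

    // All threads of the block update the single shared copy; the per-entry
    // atomic adds serialize only conflicting updates.
    __global__ void update_locked(float *out, const float *data, int n) {
        __shared__ float copy[K];
        for (int i = threadIdx.x; i < K; i += blockDim.x) copy[i] = 0.0f;
        __syncthreads();

        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int e = t; e < n; e += stride)
            atomicAdd(&copy[(int)data[e] % K], data[e]); // native float atomic on Fermi
        __syncthreads();

        for (int i = threadIdx.x; i < K; i += blockDim.x)
            out[blockIdx.x * K + i] = copy[i];   // combined across blocks afterwards
    }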

Locking: TESLA vs. FERMI
(The original slide shows the fine-grained locking code for TESLA and for FERMI side by side.)
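
The transcript does not preserve the slide's code, but the contrast is presumably the standard one between a software wrapper and the native instruction; a sketch:

    // On Tesla, a float atomic add had to be emulated with a
    // compare-and-swap loop on the integer view of the word:
    __device__ float atomicAddWrapper(float *addr, float val) {
        int *p = (int *)addr;
        int old = *p, assumed;
        do {
            assumed = old;
            old = atomicCAS(p, assumed,
                            __float_as_int(val + __int_as_float(assumed)));
        } while (assumed != old);   // retry if another thread got in first
        return __int_as_float(old);
    }

    // On Fermi, the same update is a single hardware-supported call:
    //     atomicAdd(addr, val);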

The Hybrid Scheme
Full replication:
– needs a private copy of the reduction object for each thread in a block;
– larger reduction objects must be stored in the high latency global device memory;
– the cost of combination can be very high.
Locking:
– a single copy of the reduction object is stored in the shared memory, eliminating the need for global combination;
– but contention among the threads of a block is very high.
Configuring an application with a larger number of threads per multiprocessor typically leads to better performance, since latencies can be masked by context switching between warps.

The Hybrid Scheme (continued)
The threads of a block are divided into M groups, each sharing one copy of the reduction object (see the sketch below). While choosing the number of groups M:
– the M copies of the reduction object must still fit into the shared memory;
– if the reduction object is big, the overhead of combination is higher than the overhead of contention;
– when the object is smaller, the contention overhead dominates the combination overhead, so, since it is desirable to keep contention small, a larger number of groups is preferable.
Several hybrid versions were created and evaluated on Fermi to study the optimal balance between contention and combination overheads.
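
A sketch of one hybrid version (M and K are hypothetical choices): contention is limited to the blockDim.x / M threads of a group, and only M copies per block need combining:

    #define K 16
    #define M 4          // number of groups per block

    __global__ void update_hybrid(float *out, const float *data, int n) {
        __shared__ float copies[M][K];
        int g = threadIdx.x % M;                 // this thread's group
        for (int i = threadIdx.x; i < M * K; i += blockDim.x)
            ((float *)copies)[i] = 0.0f;
        __syncthreads();

        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int e = t; e < n; e += stride)
            atomicAdd(&copies[g][(int)data[e] % K], data[e]);
        __syncthreads();

        if (threadIdx.x < K) {                   // combine the M group copies
            float sum = 0.0f;
            for (int c = 0; c < M; c++) sum += copies[c][threadIdx.x];
            out[blockIdx.x * K + threadIdx.x] = sum;
        }
    }

Assigning groups with threadIdx.x % M spreads the threads of a warp across different copies, which reduces intra-warp contention on any single entry.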

Experimental Evaluation
Environment:
– TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.3 GHz, and 4GB of device memory.
– FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz, and 3GB of device memory.

Observations
– For larger reduction objects, the hybrid approach generally outperforms the replication and locking approaches: the combination overhead dominates.
– For smaller reduction objects, full replication in shared memory yields the best performance: the contention overhead dominates, and replication avoids it entirely.
– Inbuilt support for atomic floating point operations outperforms the previously used wrapper-based implementation.

K-Means Results
(Figures: wrapper-based implementation of atomic floating point operations, k=10; inbuilt support for atomic floating point operations, k=10.)

K-Means Results
(Figures: wrapper-based implementation of atomic floating point operations, k=100; inbuilt support for atomic floating point operations, k=100.)

K-Means Results
(Figures: hybrid versions for k=10; hybrid versions for k=100.)

PCA Results
(Figures: comparison of parallelization schemes with the wrapper-based implementation for 16 columns; comparison with inbuilt atomic floating point for 32 columns.)

PCA Results
(Figures: hybrid versions for 16 columns; hybrid versions for 32 columns.)

kNN Results

Conclusions
The new features of the Fermi series GPU cards:
– support for inbuilt double precision atomic operations, and
– an increase in the amount of available shared memory,
were evaluated against three reduction-based data mining algorithms. The key trade-off is the balance between the overheads of thread contention and global combination:
– for smaller reduction objects (e.g., fewer clusters in k-means), contention is the dominant factor;
– for larger reduction objects, combination overhead dominates.

Thank You! Questions?