Optimizing MapReduce for GPUs with Effective Shared Memory Usage
Linchuan Chen and Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University

Presentation transcript:

Outline
- Motivations
- Methods: reduction-based MapReduce, memory hierarchy, multi-group scheme
- Experimental results
- Conclusions

Motivations
- The MapReduce programming model emerged with the development of data-intensive computing.
- GPUs are cost-effective and power-efficient, and are well suited to implementing MapReduce.
- Effective utilization of the fast shared memory is hard: shared memory is small, while MapReduce generates a large number of key-value pairs.
- Reduction-intensive applications are a good fit: the reduction operation is associative and commutative, and many key-value pairs share the same key.

GPU Architecture
- Processing component: the host launches kernels; each kernel executes as a grid of thread blocks, and each block as an array of threads.
- Memory component: each thread has its own registers and local memory; each block has shared memory; all threads access device memory, plus read-only constant and texture memory, with host memory on the other side of the bus.
[Two figures: the CUDA grid/block/thread execution model and the CUDA memory hierarchy, from the Kirk/Hwu ECE 498AL slides.]

Reduction-Based MapReduce
- Rather than buffering all key-value pairs, map() inserts each pair directly into a reduction object, which reduces values sharing a key as they arrive:

    map(input) {
        (key, value) = process(input);
        reduction_object->insert(key, value);
    }

    reduce(value1, value2) {
        value1 = operation(value1, value2);
    }

(A CUDA sketch of such an insert operation follows the transcript.)

Memory Hierarchy
- Stores reduction objects in both shared memory and device memory: each thread block owns reduction objects in its shared memory, backed by a larger reduction object in device memory.
- Combines reduction results into a result array in device memory, which is finally copied back to host memory.
- Handles shared memory overflow by falling back to the device memory reduction object.
[Figure: per-block reduction objects in each block's shared memory, the device memory reduction object and result array, and the final copy to host memory for the CPU MapReduce stage.]

Memory Allocator
- Each reduction object stores, per entry, an index pair (KeyIdx[i], ValIdx[i]) pointing into a data pool that holds the key size, key data, value size, and value data.
[Figure: the reduction object memory layout.]

Multi-Group Scheme
- Divides each thread block into sub-groups.
- Each sub-group owns a shared memory reduction object.
- Trades off memory overhead against locking overhead: more groups mean fewer threads contending for each lock, but more copies of the object in shared memory. (A sketch follows the transcript.)

Group by Key
[Figure: mappers (M) emit key-value pairs such as K1:v, K2:v, ...; the pairs are grouped by key (K1: v, v, v, v; K2: v; K3: v, v; K4: v, v, v; K5: v) and each group is processed by a reducer (R).]

Experimental Results
- Configuration: NVIDIA Quadro FX 5800 GPU, 4 GB device memory, 16 KB shared memory per block.
[Charts: effect of the multi-group scheme; speedup over sequential code; speedup over MapCG on reduction-intensive and non-reduction-intensive applications; speedup over Ji et al.'s work.]

Conclusions
- Presents methods of utilizing shared memory for MapReduce on GPUs.
- Uses a reduction-based method to reduce the memory overhead of MapReduce, and performs reductions in shared memory.
- Designs an approach for storing reduction objects across the memory hierarchy of a GPU.
- Balances the memory overhead and locking overhead by developing a multi-group scheme.
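
Code sketches

The transcript's insert pseudocode leaves the concurrency details implicit. Below is a minimal CUDA sketch of a per-block, shared-memory reduction object with one lock per bucket. Integer keys and values, a sum as the reduction operation, the bucket count, and the hashing scheme are all simplifying assumptions, not the authors' actual data structure:

    #define NUM_BUCKETS 128   // assumed size; 128 x 3 ints = 1.5 KB, fits in 16 KB shared memory

    struct ReductionObject {
        int lock[NUM_BUCKETS];    // 0 = free, 1 = held
        int key[NUM_BUCKETS];     // -1 marks an empty bucket
        int value[NUM_BUCKETS];
    };

    // Insert a (key, value) pair; values that share a key are reduced in place.
    __device__ void insert(ReductionObject *ro, int k, int v)
    {
        unsigned b = (unsigned)k % NUM_BUCKETS;            // trivial hash
        bool done = false;
        while (!done) {
            if (atomicCAS(&ro->lock[b], 0, 1) == 0) {      // acquire the bucket lock
                if (ro->key[b] == -1) {                    // empty bucket: claim it
                    ro->key[b] = k;
                    ro->value[b] = v;
                    done = true;
                } else if (ro->key[b] == k) {              // same key: reduce in place
                    ro->value[b] += v;                     // operation(value1, value2) = sum here
                    done = true;
                }
                __threadfence_block();                     // publish writes before unlocking
                atomicExch(&ro->lock[b], 0);               // release the bucket lock
                if (!done)                                 // a different key hashed here:
                    b = (b + 1) % NUM_BUCKETS;             // linear probing (table assumed not full)
            }
        }
    }

    __global__ void map_kernel(const int *keys, const int *vals, int n)
    {
        __shared__ ReductionObject ro;                     // one reduction object per thread block
        for (int j = threadIdx.x; j < NUM_BUCKETS; j += blockDim.x) {
            ro.lock[j] = 0; ro.key[j] = -1; ro.value[j] = 0;
        }
        __syncthreads();
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            insert(&ro, keys[i], vals[i]);                 // the map() step from the slides
        __syncthreads();
        // ... each block would now merge ro into the device memory reduction object ...
    }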
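The memory allocator slide shows index pairs and a pool of variable-length key and value records. A rough reconstruction of that layout, reusing NUM_BUCKETS from the sketch above, with guessed field names, an assumed pool size, and a bump-pointer allocator (none of which are the authors' definitions):

    #define POOL_BYTES 8192    // assumed pool size

    struct VarReductionObject {
        int  keyIdx[NUM_BUCKETS];   // offset of each bucket's key record in pool[], -1 if empty
        int  valIdx[NUM_BUCKETS];   // offset of each bucket's value record in pool[]
        int  pool_used;             // bytes allocated so far
        char pool[POOL_BYTES];      // records laid out as: key size, key data, val size, val data
    };

    // Bump-pointer allocation from the shared pool; returns the new record's offset.
    __device__ int alloc_record(VarReductionObject *ro, int bytes)
    {
        return atomicAdd(&ro->pool_used, bytes);   // overflow check omitted for brevity
    }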
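For the multi-group scheme, a sketch under the same assumptions. GROUPS_PER_BLOCK is the tunable knob the slides describe: more groups divide the threads contending for each lock, but multiply the shared memory consumed:

    #define GROUPS_PER_BLOCK 4    // assumed group count; 4 x 1.5 KB objects still fit in 16 KB

    __global__ void map_kernel_multigroup(const int *keys, const int *vals, int n)
    {
        __shared__ ReductionObject ro[GROUPS_PER_BLOCK];
        for (int j = threadIdx.x; j < NUM_BUCKETS; j += blockDim.x)
            for (int g = 0; g < GROUPS_PER_BLOCK; g++) {
                ro[g].lock[j] = 0; ro[g].key[j] = -1; ro[g].value[j] = 0;
            }
        __syncthreads();
        int group = threadIdx.x % GROUPS_PER_BLOCK;        // assign each thread to a sub-group
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            insert(&ro[group], keys[i], vals[i]);          // contention drops by ~GROUPS_PER_BLOCK
        __syncthreads();
        // ... merge the per-group objects, then merge into device memory ...
    }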
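Finally, the "combines reduction results" step. Assuming small, dense integer keys and a sum reduction (a simplification of the authors' general device memory reduction object), each block can merge its shared-memory object into a zero-initialized device-memory table:

    // device_table has one slot per possible key, zeroed by the host before launch.
    __device__ void merge_to_device(const ReductionObject *ro, int *device_table)
    {
        for (int b = threadIdx.x; b < NUM_BUCKETS; b += blockDim.x)
            if (ro->key[b] != -1)
                atomicAdd(&device_table[ro->key[b]], ro->value[b]);
    }

Overflow handling would take the same path: when insert() cannot find a free bucket in shared memory, the pair goes directly into the device memory object instead.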