Utilization of GPUs for General Computing
Presenter: Charlene DiMeglio
Paper: "Aspects of GPU for General Purpose High Performance Computing," Suda, Reiji, et al.

Overview
- Problem: We want to use the GPU for things other than graphics, but the costs can be high
- Solution: Improve the CUDA drivers
- Results: Compared to a node of a supercomputer, it is worth it
- Conclusion: These improvements make using GPGPUs more feasible

Problem: Need for computation power
- Why GPUs?
  - GPUs are not being fully realized as a resource, often sitting idle when not being used for graphics
  - Better performance for less power compared to CPUs
- What's the issue? Cost.
  - Efficient scheduling: timing data loads with their uses
  - Memory management: using the small amount of available memory effectively
  - Loads and stores: waiting for memory transfers, which take hundreds of cycles

Solutions
- Brook+ by AMD, Larrabee by Intel
- CUDA by NVIDIA: the greatest technological maturity at the time
- The paper investigates the existing technology and suggests improvements
- [Architecture diagram: 30 multiprocessors, each with 8 streaming processors and 16 KB of shared memory]
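
The figures in that diagram can be checked at runtime with the standard CUDA runtime API; a minimal sketch (device index 0 is assumed) that prints the multiprocessor count and per-block shared memory of the installed card:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Query the properties quoted in the architecture diagram for device 0.
int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    printf("device: %s\n", prop.name);
    printf("multiprocessors: %d\n", prop.multiProcessorCount);              // 30 on a Tesla C1060
    printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock); // 16 KB on that generation
    printf("registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
```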

NVIDIA's Tesla C1060 GPU vs. Hitachi HA8000-tc/RS425 (T2K) Supercomputer
- T2K: fastest supercomputer in Japan

                                T2K           C1060
    Cores / MPs                 16            30
    Clock frequency             2.3 GHz       1.3 GHz
    Single SIMD vector length   4             32
    Single-precision peak       294 Gflops    933 Gflops
    Main memory                 32 GB         4 GB
    Memory single peak          -             -
    Cost                        ~$40,000      ~$2,500
    Power                       300 W         200 W

Issues to Overcome
- High SIMD vector length
- Small main memory size
- High register spill cost
- No L2 cache, but rather read-only texture caches
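
The register-spill issue is something the programmer can influence directly in CUDA: the `__launch_bounds__` qualifier (or the nvcc flag `-maxrregcount`) tells the compiler how large a block will be, so it can budget registers per thread rather than spilling to local memory unpredictably. A minimal sketch; the kernel body is a hypothetical placeholder, and the bounds of 256 threads per block and 2 resident blocks per multiprocessor are illustrative:

```cuda
#include <cuda_runtime.h>

// Sketch: __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
// lets the compiler cap register use per thread for this kernel instead of
// spilling. The body below is a trivial stand-in for a register-heavy kernel.
__global__ void __launch_bounds__(256, 2)
scale(float *data, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= alpha;
}
```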

Methods to Hide Latency

- If the computation time between communications exceeds the communication latency, it is worth sending the data over to the GPU (a sketch of this break-even test follows below)
- Increasing the bandwidth and the size of messages makes the constant term in the latency overhead relatively smaller
- Efficient use of registers to prevent spills
- Deciding what work to do where (GPU vs. CPU): work sharing
- Minimizing divergent warps using the atomic operations found in CUDA
  - Divergent warps occur when the threads of a warp must follow both branch paths
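
A hedged host-side sketch of the first bullet's break-even rule. The helper name and the latency/bandwidth constants are illustrative assumptions, not numbers from the paper; in practice they would be measured for the actual system:

```cuda
#include <cstddef>

// Offload only when the estimated kernel time exceeds the estimated
// round-trip transfer cost ("computation time between communications >
// communication latency"). All constants are placeholders.
bool worth_offloading(std::size_t bytes, double est_kernel_seconds) {
    const double per_transfer_latency_s = 10e-6;  // assumed fixed overhead per transfer
    const double bandwidth_bytes_per_s  = 5.0e9;  // assumed host<->device bandwidth
    double transfer_s =
        2.0 * (per_transfer_latency_s + bytes / bandwidth_bytes_per_s);  // to device and back
    return est_kernel_seconds > transfer_s;
}
```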

Results
- Variable-sized multi-round data transfer scheduling (a sketch of the multi-round idea follows below)
- [Plot: performance as a function of the number of rounds]
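
As a hedged illustration of what multi-round transfer scheduling looks like in CUDA (not the paper's implementation): the input is split into chunks, and chunks are issued on alternating streams so the copy of one round can overlap with the kernel of the previous one. The kernel, the two-stream layout, and the equal-sized rounds are assumptions; the paper's variable-sized rounds would adjust the chunk lengths:

```cuda
#include <cuda_runtime.h>

// Hypothetical element-wise kernel standing in for the real workload.
__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;
}

// Send the data in `rounds` chunks on two streams so transfers and kernels
// from different rounds can overlap. Assumes h_data is pinned host memory
// (cudaHostAlloc / cudaHostRegister), which asynchronous copies require.
void run_in_rounds(float *h_data, float *d_data, int n, int rounds) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    int chunk = (n + rounds - 1) / rounds;
    for (int r = 0; r < rounds; ++r) {
        int off = r * chunk;
        int len = (off + chunk <= n) ? chunk : n - off;
        if (len <= 0) break;
        cudaStream_t st = s[r % 2];
        cudaMemcpyAsync(d_data + off, h_data + off, len * sizeof(float),
                        cudaMemcpyHostToDevice, st);
        process<<<(len + 255) / 256, 256, 0, st>>>(d_data + off, len);
        cudaMemcpyAsync(h_data + off, d_data + off, len * sizeof(float),
                        cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```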

Results
- Use of atomic instructions in CUDA to minimize latency (an illustration follows below)
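
A hedged illustration of the atomic-instruction idea, since the slide gives no code: rather than having every thread branch on a condition (which makes a warp execute both paths), threads that need the expensive path reserve a slot in a work list with atomicAdd, and a later kernel processes that dense list without divergence. The kernels, names, and the "expensive vs. cheap path" split are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

// Divergent version, for contrast: threads of one warp that disagree on the
// branch force the warp to execute both paths one after the other.
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (in[i] > 0.0f)
        out[i] = in[i] * in[i];   // stand-in for an expensive path
    else
        out[i] = 0.0f;            // cheap path
}

// Atomic compaction sketch: threads needing the expensive path only record
// their index via atomicAdd; a second kernel (not shown) then walks the
// compacted index list with every thread doing the same work.
__global__ void compact(const float *in, int *indices, int *count, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] > 0.0f) {
        int slot = atomicAdd(count, 1);  // reserve a unique slot in the work list
        indices[slot] = i;
    }
}
```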

Conclusion
- CUDA gives programmers the ability to harness the power of the GPU for general-purpose computation.
- The improvements presented make this option more feasible.
- Strategic use of GPGPUs as a resource will improve speed and efficiency.
- However, the material presented is mainly theoretical, with little strong data to back it up.
- The paper offers more suggestions than implementations, mainly promoting GPGPU use.