Presentation transcript:

1) Leverage the raw computational power of the GPU  Order-of-magnitude performance gains possible

2) Leverage the maturation of GPU hardware and software
 HW: dedicated fixed-function 3D accelerators  programmable graphics pipeline (shaders)  general computing (nVidia G80)
 SW: assembly code  shader programming languages (Cg/HLSL)  general programming languages (CUDA)

 Nanoscale Molecular Dynamics (NAMD), University of Illinois, Urbana-Champaign
  Tools for simulating and visualizing biomolecular processes
  Yields 3.5x – 8x performance gains

 Develop a high-performance library of core computational methods using the GPU
 Library level
  BLAS (Basic Linear Algebra Subprograms)
  Numerical methods
  Image processing kernels
 Application level
  Port LONI algorithms

 G80 chipset: nVidia 8800 GTX
  680 million transistors (vs. 290 million for an Intel Core 2)
  128 micro-processors: 16 multiprocessors at 1.3 GHz, 8 processors per multiprocessor unit
  Device memory: 768 MB
 High-performance parallel architecture
  On-chip shared memory (16 KB per multiprocessor)
  Texture cache (8 KB)
  Constant memory (64 KB) and cache (8 KB)
 These figures can also be queried at runtime; see the sketch below
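
A minimal sketch of querying these per-device figures through the standard CUDA runtime API (device 0 is assumed):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);                     // query device 0
        printf("Name:                 %s\n", prop.name);
        printf("Multiprocessors:      %d\n", prop.multiProcessorCount);
        printf("Clock rate:           %d kHz\n", prop.clockRate);
        printf("Global memory:        %lu bytes\n", (unsigned long)prop.totalGlobalMem);
        printf("Shared mem per block: %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);
        printf("Constant memory:      %lu bytes\n", (unsigned long)prop.totalConstMem);
        return 0;
    }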

 Compatible with all cards that have a CUDA driver
  Linux / Windows
  Mobile (GeForce 8M), desktop (GeForce 8), server (Quadro)
 Scalable to multiple GPUs
  nVidia SLI
  Workstation cluster (nVidia Tesla)
   1.5 GB dedicated memory
   2 or 4 G80 GPUs (256 or 512 micro-processors)
 Attractive cost-to-performance ratio
  nVidia 8800 GTX: $550
  nVidia Tesla: up to $12,000

 nVidia CUDA is a first-generation technology
 Not all algorithms scale well to the GPU
 Host-memory-to-device-memory transfers are a bottleneck
 Single-precision floating point only
 Cross-GPU development is currently not available

Task and timeline:
 a) Identify computational methods to implement, b) evaluate if scalable to GPU: weeks
 Experimentation / implementation: 3 - 4 months
 Develop prototype: Feb 2008

 Basic definitions
  BLOCK = conceptual computational node
   Max number =
   Optimal if the number of blocks is a multiple of the number of multiprocessors (16)
  Each BLOCK runs a number of threads
   Max threads per block = 512
   Optimal if the number of threads is a multiple of the warp size (32)
 Pivot-divide for 3D volume data
  Matrix pivot-divide applied to each slice independently
  Each slice is mapped to a "block" (NUMBLOCKS = N)
  Each thread in a block handles one row of its slice (NUMTHREADS = N); see the launch sketch below
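
A minimal sketch of this block/thread mapping, assuming a hypothetical pivotDivideSlice kernel and an N x N x N volume already resident in device memory (the kernel body is only a placeholder for the actual pivot-divide):

    // One block per slice, one thread per row (NUMBLOCKS = N, NUMTHREADS = N).
    // pivotDivideSlice and d_volume are illustrative names, not from the original library.
    __global__ void pivotDivideSlice(float *volume, int N) {
        int slice = blockIdx.x;            // each block handles one slice
        int row   = threadIdx.x;           // each thread handles one row of that slice
        float *r  = volume + (slice * N + row) * N;
        for (int col = 1; col < N; ++col)  // placeholder body: divide the row by its pivot element
            r[col] /= r[0];
    }

    // Launch with N blocks of N threads (valid while N <= 512, the per-block limit):
    // pivotDivideSlice<<<N, N>>>(d_volume, N);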

 As long as there is no synchronization among slices, this scales well to the GPU
  Concurrent reads of other slices should be possible
 Host-to-device latency
  1 GB/s measured (2 GB/s reported)
  PCIe settings?
 Needs investigation:
  NUMBLOCKS vs. multiprocessor count?
  Fine-tune the number of slices per block?
  The CUDA scheduler seems to handle it well when NUMBLOCKS = N
 Scaling issues
  What if N > NUMTHREADS?
  Will we ever hit the BLOCK limit?

 t(total) = t(mem) + t(compute)
 GPU
  t(mem) = host-to-device transfer
  t(compute) = kernel time
 CPU
  t(mem) = memcpy()
  t(compute) = loop time
 Parameters
  for N = 16…256, BLOCKS = 256
  for N = 272…512, BLOCKS = 512
 (A timing sketch using CUDA events follows below.)
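
One way to measure t(mem) and t(compute) separately on the GPU side is with CUDA events; a minimal sketch, assuming device/host buffers d_data and h_data and a kernel named run_kernel (illustrative names):

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start); cudaEventCreate(&mid); cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);      // t(mem): host-to-device transfer
    cudaEventRecord(mid, 0);
    run_kernel<<<numBlocks, numThreads>>>(d_data, N);               // t(compute): kernel time
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                                     // wait for the kernel to finish

    float t_mem = 0.0f, t_compute = 0.0f;
    cudaEventElapsedTime(&t_mem, start, mid);                       // both times in milliseconds
    cudaEventElapsedTime(&t_compute, mid, stop);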

 Host-to-device memory transfer is the bottleneck
 Pageable vs. pinned memory allocation
  2x faster with pinned memory (see the sketch below)
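
Pinned (page-locked) host memory is allocated through the CUDA runtime rather than malloc; a minimal sketch with illustrative names and sizes:

    size_t bytes = N * N * N * sizeof(float);        // illustrative buffer size
    float *h_pinned = NULL;

    // Page-locked allocation: the GPU can DMA directly from this buffer,
    // which is what makes the host-to-device copy roughly 2x faster here.
    cudaMallocHost((void**)&h_pinned, bytes);

    // ... fill h_pinned with the volume data ...
    cudaMemcpy(d_data, h_pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(h_pinned);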

 Single Instruction Multiple Data (SIMD) model
  Less synchronization  higher performance
  v1.0 – no synchronization among blocks
 High arithmetic intensity
  Arithmetic intensity = arithmetic ops / memory ops (a worked example follows below)
  Computation can overlap with memory operations
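
As a rough, assumed illustration of arithmetic intensity (not from the original slides), consider a SAXPY-style kernel:

    // y[i] = a * x[i] + y[i]
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];     // 2 arithmetic ops (multiply + add)
    }
    // Per element: 2 arithmetic ops vs. 3 memory ops (read x, read y, write y),
    // so arithmetic intensity is roughly 0.67 and the kernel is memory-bound.
    // Kernels with more math per memory operation hide memory latency better.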

 Memory operations have the highest latency
 Shared memory
  As fast as a register access when there are no bank conflicts
  Limited to 16 KB
 Texture memory
  Cached from device memory
  Optimized for 2D spatial locality
  Built-in filtering/interpolation methods
  Reads packed data in one operation (e.g., RGBA)
 Constant memory
  Cached from device memory
  As fast as a register if all threads read the same address
 Device memory
  Uncached, very slow
  Faster if accesses are byte-aligned and coalesced into a single contiguous access
 (A shared-memory sketch follows below.)
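
A minimal shared-memory sketch (an assumed example: reversing each block's segment of an array, which requires threads to exchange data; the array length is assumed to be a multiple of the block size):

    #define BLOCK_SIZE 256                     // assumed block size

    __global__ void blockReverse(float *d) {
        __shared__ float s[BLOCK_SIZE];        // 1 KB of the 16 KB on-chip shared memory
        int t = threadIdx.x;
        int i = blockIdx.x * blockDim.x + t;

        s[t] = d[i];                           // coalesced load from device memory
        __syncthreads();                       // intra-block sync only (no inter-block sync in v1.0)
        d[i] = s[blockDim.x - 1 - t];          // read a neighbour's element; conflict-free access pattern
    }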

 Arithmetic operations
  4 clock cycles for float (+, *, multiply-add) and int (+)
  16 clock cycles for 32-bit int multiply (4 cycles for 24-bit __mul24)
  36 clock cycles for float division
  Int division and modulo are very costly
  v1.0 – floats only (doubles are converted to float)
 Atomic operations (v1.1 only)
  Provide locking mechanisms (see the sketch below)
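
A minimal sketch of a v1.1 atomic operation, using atomicAdd on 32-bit integers in device memory (histogram binning is an assumed use case):

    // Each thread increments the bin its value falls into; atomicAdd serializes
    // conflicting updates to the same counter (requires compute capability 1.1+).
    __global__ void histogram(const int *data, int *bins, int n, int numBins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            int b = data[i] % numBins;         // note: int modulo is itself costly (see above)
            atomicAdd(&bins[b], 1);            // read-modify-write without races
        }
    }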

 Minimize host-to-device memory transfers
 Minimize device memory accesses
  Optimize with byte alignment and coalescing
 Minimize execution divergence (see the sketch below)
  Minimize branching in the kernel
  Unroll loops
 Make heavy use of shared memory
  Data must be striped correctly to avoid bank conflicts
  For image processing tasks, texture memory may be more efficient
 Number of threads per block = a multiple of 32
 Number of blocks = ?
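
An assumed illustration (not from the original slides) of how branch placement affects warp divergence; memory accesses stay coalesced in both versions because thread i always touches element i:

    // Divergent: even and odd threads in the same warp take different paths,
    // so the warp executes both branches serially.
    __global__ void divergent(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (threadIdx.x % 2 == 0) out[i] = in[i] * 2.0f;
        else                      out[i] = in[i] + 1.0f;
    }

    // Branching at warp granularity: all 32 threads of a warp take the same path,
    // so there is no serialization.
    __global__ void warpAligned(float *out, const float *in, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if ((threadIdx.x / 32) % 2 == 0) out[i] = in[i] * 2.0f;
        else                             out[i] = in[i] + 1.0f;
    }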