GPUs and Accelerators
Jonathan Coens, Lawrence Tan, Yanlin Li

Outline
  Graphics Processing Units
    o Features
    o Motivation
    o Challenges
  Accelerator
    o Methodology
    o Performance Evaluation
    o Discussion
  Rigel
    o Methodology
    o Performance Evaluation
    o Discussion
  Conclusion

Graphics Processing Units (GPUs)
  GPU
    o Special-purpose processors designed to render 3D scenes
    o In almost every desktop today
  Features
    o Highly parallel processors (see the sketch below)
    o Better floating-point performance than CPUs
      - ATI Radeon x Gflops
  Motivation
    o Use GPUs for general-purpose programming
  Challenges
    o Difficult for programmers to program
    o Trade-off between programmability and performance
  [Figure: GeForce 6600GT (NV43) GPU]
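"Highly parallel" here means that GPU-friendly workloads are loops whose iterations are all independent, so they can be spread across hundreds of SIMD lanes instead of running one at a time. Below is a minimal C sketch of such a data-parallel kernel (SAXPY); on a GPU, each iteration would become its own thread or fragment:

```c
/* saxpy: y[i] = a * x[i] + y[i]
 * Every iteration is independent of every other, so a GPU can run
 * them across many SIMD lanes at once, while a CPU executes them
 * one (or a few vector widths) at a time. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```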

Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses
  Methodology
    o Data parallelism to program the GPU (SIMD)
    o Parallel Array C# object
    o No aspects of the GPU are exposed to the programmer
    o Programmer only needs to know how to use the Parallel Array
    o Accelerator takes care of the conversion to pixel shader code
    o Parallel programs can be represented as DAGs (see the sketch below)
  [Figure: Simplified block diagram for a GPU]
  [Figure: Expression DAG with shader breaks marked]
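The key idea behind the Parallel Array abstraction is deferred evaluation: whole-array operations do not execute immediately but instead build up an expression DAG, which Accelerator later splits at shader breaks and compiles into pixel-shader passes. The real system exposes this as a C# library; the sketch below is a hypothetical C rendering of the deferred-evaluation idea only, with invented names, not the actual Accelerator API:

```c
/* Hypothetical sketch of deferred whole-array operations building an
 * expression DAG (invented names; not the real Accelerator C# API). */
#include <stdlib.h>

typedef enum { OP_LEAF, OP_ADD, OP_MUL } OpKind;

typedef struct Node {
    OpKind       op;
    const float *data;       /* only used for OP_LEAF */
    int          length;
    struct Node *lhs, *rhs;  /* only used for interior nodes */
} Node;

static Node *leaf(const float *data, int length) {
    Node *n = calloc(1, sizeof *n);
    n->op = OP_LEAF; n->data = data; n->length = length;
    return n;
}

static Node *binop(OpKind op, Node *a, Node *b) {
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->lhs = a; n->rhs = b; n->length = a->length;
    return n;
}

/* Whole-array operators: they return DAG nodes, not results. */
Node *pa_add(Node *a, Node *b) { return binop(OP_ADD, a, b); }
Node *pa_mul(Node *a, Node *b) { return binop(OP_MUL, a, b); }

/* A per-element interpreter, for illustration only; Accelerator would
 * instead split the DAG at shader breaks and emit one pixel-shader
 * pass per piece. */
float eval_at(const Node *n, int i) {
    switch (n->op) {
    case OP_LEAF: return n->data[i];
    case OP_ADD:  return eval_at(n->lhs, i) + eval_at(n->rhs, i);
    default:      return eval_at(n->lhs, i) * eval_at(n->rhs, i);
    }
}
```

Because the programmer only ever manipulates whole arrays through operators like these, no GPU details (textures, shader passes, render targets) leak into the program, which is exactly the programmability-versus-performance trade-off the earlier slide describes.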

Accelerator: Using Data Parallelism to Program GPUs for General Purpose Uses
  Performance Evaluation
  [Figure: Performance of Accelerator versus hand-coded pixel shader programs on a GeForce 7800 GTX and an ATI x1800; performance is shown as speedup relative to the C++ version of the programs]
  [Figure: Speedup of Accelerator programs on various GPUs compared to C++ programs running on a CPU]

Rigel: 1024-core Accelerator
  Specific architecture
    o SPMD programming model
    o Global address space
    o RISC instruction set
    o Write-back cache
    o Cores laid out in clusters of 8, each cluster with a local cache
    o Custom cores (optimized for space / power)
  Hierarchical task queueing (see the sketch below)
    o Single queue from the programmer's perspective
    o Architecture handles distributing tasks
    o Customizable via API
      - Task granularity
      - Static vs. dynamic scheduling
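The hierarchical queue is what lets "a single queue from the programmer's perspective" scale to 1024 cores: the runtime stripes work into per-cluster queues so that most enqueue/dequeue traffic stays inside a cluster's local cache. The sketch below is a hypothetical C illustration of that two-level policy, with invented names and no locking or overflow handling; it is not the actual Rigel task API:

```c
/* Two-level task queueing sketch: one global queue visible to the
 * programmer, per-cluster local queues managed by the runtime.
 * Synchronization and overflow handling are omitted for brevity. */
#define NUM_CLUSTERS   128    /* 1024 cores / 8 cores per cluster */
#define QUEUE_CAPACITY 4096
#define REFILL_BATCH   16     /* tasks pulled from the global queue per refill */

typedef struct { void (*fn)(void *arg); void *arg; } Task;
typedef struct { Task items[QUEUE_CAPACITY]; int head, tail; } Queue;

static Queue global_q;
static Queue local_q[NUM_CLUSTERS];

static int q_pop(Queue *q, Task *out) {
    if (q->head == q->tail) return 0;                /* queue is empty */
    *out = q->items[(q->head++) % QUEUE_CAPACITY];
    return 1;
}

static void q_push(Queue *q, Task t) {
    q->items[(q->tail++) % QUEUE_CAPACITY] = t;
}

/* Programmer-visible interface: one logical queue. */
void task_enqueue(Task t) { q_push(&global_q, t); }

/* Per-core worker step: dequeue locally when possible; refill the
 * cluster-local queue from the global queue in batches so only one
 * core per cluster touches the shared queue at a time. */
void run_one_task(int cluster_id) {
    Task t;
    if (!q_pop(&local_q[cluster_id], &t)) {
        for (int i = 0; i < REFILL_BATCH; i++) {
            Task g;
            if (!q_pop(&global_q, &g)) break;
            q_push(&local_q[cluster_id], g);
        }
        if (!q_pop(&local_q[cluster_id], &t)) return;  /* nothing to run */
    }
    t.fn(t.arg);
}
```

The real interface additionally lets the programmer tune task granularity and choose between static and dynamic distribution; the sketch hard-codes those decisions in REFILL_BATCH and the refill-on-empty policy.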

Rigel's Performance
  Fairly successful
    o Achieved speedup utilizing all 1024 cores
    o Hierarchical task structure effectively scaled to 1024
  Issues
    Cache coherence! (see the sketch below)
      o Memory invalidate broadcasts slow system down
        - Barrier flags
        - Task enqueue / dequeue variables
      o Not done in hardware...
        - Lazy-evaluation write-through barriers at cluster level
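The coherence pain shows up most clearly at barriers: if all 1024 cores spin on one shared flag, every write to it triggers invalidate traffic to every sharer. One way to make the cluster-level fix concrete is a two-level barrier in which only a single representative per 8-core cluster ever touches the globally shared counter, cutting the sharers of the hot location from 1024 to 128. The C11 sketch below is a hypothetical illustration of that idea, not the actual Rigel mechanism (which works through the cluster caches rather than generic atomics):

```c
/* Hypothetical two-level sense-reversing barrier: cores synchronize
 * within their cluster first; one representative per cluster then
 * synchronizes globally. Not the actual Rigel implementation. */
#include <stdatomic.h>

#define CORES_PER_CLUSTER 8
#define NUM_CLUSTERS      128

static atomic_int cluster_count[NUM_CLUSTERS]; /* arrivals within each cluster */
static atomic_int global_count;                /* arrivals of cluster representatives */
static atomic_int global_sense;                /* flips each time the barrier opens */

void barrier(int core_id) {
    int cluster = core_id / CORES_PER_CLUSTER;
    int sense   = atomic_load(&global_sense);

    /* The last core to arrive in its cluster becomes the representative. */
    if (atomic_fetch_add(&cluster_count[cluster], 1) + 1 == CORES_PER_CLUSTER) {
        atomic_store(&cluster_count[cluster], 0);         /* reset for next use */
        /* The last representative to arrive globally releases everyone. */
        if (atomic_fetch_add(&global_count, 1) + 1 == NUM_CLUSTERS) {
            atomic_store(&global_count, 0);
            atomic_store(&global_sense, !sense);          /* open the barrier */
        }
    }

    /* Everyone spins on a value that changes exactly once per barrier. */
    while (atomic_load(&global_sense) == sense)
        ;
}
```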

Improving Rigel
  1. Will the hierarchical task structure continue to scale? If not, where will the boundary be? (Think multiple cache levels, but with processor tasks.)
  2. How could we implement barriers or queues that avoid contention but still scale? (Is memory-managed cache coherence appropriate?)
  3. Is specialized hardware the way to go (clusters of 8 custom cores), or can it be replaced by general-purpose cores?

Generic and Custom Accelerators
  It is difficult to make a sufficiently generic programming interface between the programmer and a multi-core system
    o GPUs are limited by the SIMD programming model
    o Specific hardware platforms still have issues for SPMD
  Efficiently scaling to more cores is still an issue
  How do we solve these issues?