A many-core GPU architecture. Price, performance, and evolution.

- CPU (Central Processing Unit): a general-purpose processor able to execute computer programs.
- GPU (Graphics Processing Unit): a dedicated graphics rendering device.

- The NVIDIA GeForce 6800 Ultra reaches about 40 GFLOPS, whereas a 3 GHz Intel Pentium 4 reaches only about 6 GFLOPS. [1]
- More impressively, more recent cards such as the ATI Radeon HD 5870, AMD FireStream 9250, and NVIDIA GeForce 9800 deliver on the order of 1 to 3 TFLOPS.
- Reasons for this include highly parallel vector processing, fast onboard memory, and a constrained pipeline that streams data through without stalls.
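To see where such numbers come from, a back-of-the-envelope estimate is peak FLOPS ≈ ALU lanes × clock × FLOPs per lane per cycle. The sketch below applies this with illustrative figures assumed for a Radeon HD 5870; the specific values are not from the slides.

```cuda
#include <cstdio>

int main() {
    // Peak FLOPS ~= ALU lanes * clock (GHz) * FLOPs per lane per cycle.
    // Illustrative figures assumed for a Radeon HD 5870:
    double lanes           = 1600;  // shader ALU lanes
    double clock_ghz       = 0.85;  // core clock in GHz
    double flops_per_cycle = 2;     // one fused multiply-add = 2 FLOPs

    double peak_gflops = lanes * clock_ghz * flops_per_cycle;
    std::printf("Estimated peak: %.0f GFLOPS (~%.2f TFLOPS)\n",
                peak_gflops, peak_gflops / 1000.0);  // ~2720 GFLOPS here
    return 0;
}
```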

- GPU performance has approximately doubled every 6 months since the mid-1990s.
- CPU performance doubles roughly every 18 months on average (Moore's law).
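To make the two doubling times concrete, the implied speedup after $t$ years is (an illustrative extrapolation, not a figure from the slides):

\[
S_{\mathrm{GPU}}(t) = 2^{t/0.5}, \qquad S_{\mathrm{CPU}}(t) = 2^{t/1.5},
\]

so over three years a GPU improves by $2^{6} = 64\times$ while a CPU improves by $2^{2} = 4\times$.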

How we use GPUs.

- A newer trend is GPU use in scientific computing through data-parallel algorithms. Examples include the following (a code sketch of the common pattern appears after the list):

Clustering: a GPU cluster used to simulate the dispersion of airborne contaminants in New York City.

Image stitching: fast seamless stitching and tone-mapping of gigapixel images (~1 hour on a notebook PC).

Molecular dynamics: evaluating forces between atoms that do not share bonds.
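As a sketch of the data-parallel pattern behind examples like the molecular-dynamics one, the CUDA kernel below assigns one thread per atom and sums pairwise Lennard-Jones forces over all other atoms. The kernel name, parameters, and the brute-force all-pairs loop are illustrative assumptions, not code from any of the cited projects.

```cuda
#include <cuda_runtime.h>

__global__ void lj_force_kernel(const float3* pos, float3* force, int n)
{
    const float EPS = 1.0f, SIG = 1.0f;            // Lennard-Jones parameters (assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x; // one thread per atom
    if (i >= n) return;

    float fx = 0.0f, fy = 0.0f, fz = 0.0f;
    for (int j = 0; j < n; ++j) {                  // brute-force all-pairs sum
        if (j == i) continue;
        float dx = pos[i].x - pos[j].x;
        float dy = pos[i].y - pos[j].y;
        float dz = pos[i].z - pos[j].z;
        float r2 = dx * dx + dy * dy + dz * dz + 1e-12f;        // avoid divide-by-zero
        float sr2 = SIG * SIG / r2;
        float sr6 = sr2 * sr2 * sr2;
        float f  = 24.0f * EPS * (2.0f * sr6 * sr6 - sr6) / r2; // |force| / r
        fx += f * dx;
        fy += f * dy;
        fz += f * dz;
    }
    force[i] = make_float3(fx, fy, fz);
}

// Host-side launch, one thread per atom, 256 threads per block (a typical choice):
// lj_force_kernel<<<(n + 255) / 256, 256>>>(d_pos, d_force, n);
```

Every thread runs the same loop on different data, which is exactly the structure the GPU's many cores are built for.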

How it is built.

TYPICAL GPU
- An ordered sequence of rendering steps.
- Fixed hardware dedicated to each step.

LARRABEE
- Runs most of its rendering pipeline in software on multiple general-purpose x86 cores.
- This allows the pipeline to be reconfigured dynamically: steps can be skipped, or extra resources allocated where required.
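A minimal sketch of what "pipeline in software" means in practice, with hypothetical stage names; this is not Intel's renderer code. Because the pipeline is just a list of functions, a stage can be dropped or replaced at run time:

```cuda
#include <vector>
#include <functional>

struct Tile { /* a screen region plus the primitives touching it */ };

using Stage = std::function<void(Tile&)>;

void vertex_shade(Tile&) { /* ... */ }
void rasterize(Tile&)    { /* ... */ }
void early_z(Tile&)      { /* ... */ }
void pixel_shade(Tile&)  { /* ... */ }

int main() {
    // A fixed-function GPU hard-wires this sequence; here it is just data,
    // so e.g. skipping early_z for transparent geometry is a one-line change.
    std::vector<Stage> pipeline = { vertex_shade, rasterize, early_z, pixel_shade };

    Tile tile;
    for (Stage& stage : pipeline)
        stage(tile);
    return 0;
}
```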

- The Larrabee core is "derived" from the original Pentium processor.
- Each core has one scalar unit for single operations and one vector unit for operating on many data elements at once.
- Each core has a 32 KB L1 data cache and a 32 KB L1 instruction cache.
- Each core also has a 256 KB local subset of the L2 cache; these L2 subsets are connected by a ring network.
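The Larrabee paper [2] describes the vector unit as 16 floats wide. As a sketch (plain scalar code, not LRBni intrinsics), the loop below is the kind of work that unit executes 16 elements at a time, so n iterations become roughly n/16 vector fused multiply-adds:

```cuda
// y[i] = a * x[i] + y[i]: one fused multiply-add per element.
// A 16-wide vector unit maps 16 of these iterations onto a single instruction.
void saxpy(float a, const float* x, float* y, int n)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```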

- The 32 KB L1 caches are four times larger than the original Pentium's 8 KB. This is because each core performs four-way multithreading to reduce thread-switching overhead (not to be confused with simultaneous multithreading).
- The 256 KB L2 subsets share the ring network: if a core cannot find data in its own L2 subset, it places a request on the ring bus and the data is eventually brought into its L2.
- Larrabee uses a rendering technique called binning, which divides the screen into regions and renders the polygons that fall into each region accordingly.
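A rough sketch of the binning idea, with assumed tile size and data structures (not the actual Larrabee renderer): the screen is split into tiles, each triangle is appended to the bin of every tile its bounding box overlaps, and the tiles can then be rendered independently, e.g. one per core out of that core's L2 subset.

```cuda
#include <vector>
#include <algorithm>

struct Tri { float minx, miny, maxx, maxy; };   // screen-space bounding box

const int TILE = 64;                            // tile edge in pixels (assumed)
const int W = 1920, H = 1080;                   // example framebuffer size
const int TX = (W + TILE - 1) / TILE;
const int TY = (H + TILE - 1) / TILE;

void bin_triangles(const std::vector<Tri>& tris,
                   std::vector<std::vector<int>>& bins)  // one bin of triangle ids per tile
{
    bins.assign(TX * TY, std::vector<int>());
    for (int t = 0; t < (int)tris.size(); ++t) {
        int x0 = std::max(0, (int)tris[t].minx / TILE);
        int y0 = std::max(0, (int)tris[t].miny / TILE);
        int x1 = std::min(TX - 1, (int)tris[t].maxx / TILE);
        int y1 = std::min(TY - 1, (int)tris[t].maxy / TILE);
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                bins[y * TX + x].push_back(t);   // triangle t touches tile (x, y)
    }
}
```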

Benefits of Larrabee:
- Game physics
- Real-time ray tracing
- Image and video processing
- Physical simulation
- Extended rendering capabilities

[1] Z. Fan, F. Qiu, A. Kaufman, S. Yoakum-Stover. GPU Cluster for High Performance Computing. ACM/IEEE Supercomputing Conference 2004, November 6-12, 2004, Pittsburgh, PA.
[2] L. Seiler et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. ACM Transactions on Graphics, vol. 27, no. 3, Article 18, August 2008.