Introduction to CUDA Programming

Slides:



Advertisements
Similar presentations
The Fetch – Execute Cycle
Advertisements

Machine cycle.
Vectors, SIMD Extensions and GPUs COMP 4611 Tutorial 11 Nov. 26,
Central Processing Unit
Multicore Architectures Michael Gerndt. Development of Microprocessors Transistor capacity doubles every 18 months © Intel.
Lecture 38: Chapter 7: Multiprocessors Today’s topic –Vector processors –GPUs –An example 1.
1 Copyright © 2012, Elsevier Inc. All rights reserved. Chapter 4 Data-Level Parallelism in Vector, SIMD, and GPU Architectures Computer Architecture A.
The University of Adelaide, School of Computer Science
Memory Address Decoding
GPUs. An enlarging peak performance advantage: –Calculation: 1 TFLOPS vs. 100 GFLOPS –Memory Bandwidth: GB/s vs GB/s –GPU in every PC and.
Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow Wilson W. L. Fung Ivan Sham George Yuan Tor M. Aamodt Electrical and Computer Engineering.
SYNAR Systems Networking and Architecture Group CMPT 886: Architecture of Niagara I Processor Dr. Alexandra Fedorova School of Computing Science SFU.
CSCE101 – 4.2, 4.3 October 17, Power Supply Surge Protector –protects from power spikes which ruin hardware. Voltage Regulator – protects from insufficient.
Synergistic Processing In Cell’s Multicore Architecture Michael Gschwind, et al. Presented by: Jia Zou CS258 3/5/08.
CPE 731 Advanced Computer Architecture Multiprocessor Introduction
Veynu Narasiman The University of Texas at Austin Michael Shebanow
Presenter MaxAcademy Lecture Series – V1.0, September 2011 Introduction and Motivation.
Michael Monroig Michael Fiorelli.  The Processor is also known as the CPU or Central Processing Unit.  Processors carry out the instructions of computer.
COMPUTER ARCHITECTURE
Adam Meyer, Michael Beck, Christopher Koch, and Patrick Gerber.
1 Chapter 04 Authors: John Hennessy & David Patterson.
OCR GCSE Computing © Hodder Education 2013 Slide 1 OCR GCSE Computing Chapter 2: CPU.
Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.
1 Latest Generations of Multi Core Processors
Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Yang Yu, Tianyang Lei, Haibo Chen, Binyu Zang Fudan University, China Shanghai Jiao Tong University, China Institute of Parallel and Distributed Systems.
Morgan Kaufmann Publishers Multicores, Multiprocessors, and Clusters
HOW COMPUTERS WORK THE CPU & MEMORY. THE PARTS OF A COMPUTER.
Computer Architecture Lecture 24 Parallel Processing Ralph Grishman November 2015 NYU.
RISC / CISC Architecture by Derek Ng. Overview CISC Architecture RISC Architecture  Pipelining RISC vs CISC.
Moore’s Law Electronics 19 April Moore’s Original Data Gordon Moore Electronics 19 April 1965.
My Coordinates Office EM G.27 contact time:
S. Pardi Frascati, 2012 March GPGPU Evaluation – First experiences in Napoli Silvio Pardi.
Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, Andreas Moshovos University of Toronto Demystifying GPU Microarchitecture through Microbenchmarking.
● Cell Broadband Engine Architecture Processor ● Ryan Layer ● Ben Kreuter ● Michelle McDaniel ● Carrie Ruppar.
The Present and Future of Parallelism on GPUs
Itanium® 2 Processor Architecture
Lynn Choi School of Electrical Engineering
Chapter 10: Computer systems (1)
GPU Computing Jan Just Keijser Nikhef Jamboree, Utrecht
CS 286 Computer Architecture & Organization
Inc. 32 nm fabrication process and Intel SpeedStep.
CS 147 – Parallel Processing
Introduction to Computer Architecture
Simple Illustration of L1 Bandwidth Limitations on Vector Performance
QuickPath interconnect GB/s GB/s total To I/O
Prof. Zhang Gang School of Computer Sci. & Tech.
How does an SIMD computer work?
The fetch-execute cycle
Array Processor.
Mattan Erez The University of Texas at Austin
ECEG-3202 Computer Architecture and Organization
Multicore / Multiprocessor Architectures
Introduction to Computing
Introduction to Computer Architecture
Coe818 Advanced Computer Architecture
Introduction to Multiprocessors
ECEG-3202 Computer Architecture and Organization
The Little Man Computer
Chapter 1 Introduction.
PIPELINING Santosh Lakkaraju CS 147 Dr. Lee.
Introduction to CUDA.
General Purpose Graphics Processing Units (GPGPUs)
Chapter 4 Multiprocessors
The University of Adelaide, School of Computer Science
CSE 502: Computer Architecture
Utsunomiya University
Computer Architecture
Introduction to CUDA Programming
Presentation transcript:

Introduction to CUDA Programming Intel’s Sandy Bridge Architecture Andreas Moshovos Winter 2012 Based on David Kanter’s article: http://realworldtech.com/page.cfm?ArticleID=RWT091810191937 Introduction to CUDA Programming

Chip Architecture 32nm

Overall Core Architecture

Core Architecture – Instruction Fetch

Instruction Decode

Out-of-Order Execution

Execution Units – Scalar, SIMD, FP and AVX (vector)

Memory Access Units L1 load to use latency is 4 cycles Banked to support multiple accesses Poor man’s multiporting? L2 load to use latency is 12 cycles

L3 and ring interconnect CPUs, GPU and system agent Communicate via a ring Each CPU has an 2MB 8-way SA L3 slice Static hash of addresses to slices Latency varies: which core to which slice 26-31 cycles Max bandwidth: 435.2GB/s at 3.4Ghz

Overall Core Architecture