Introduction to CUDA Programming

Intel’s Sandy Bridge Architecture
Andreas Moshovos, Winter 2012
Based on David Kanter’s article: http://realworldtech.com/page.cfm?ArticleID=RWT091810191937

Chip Architecture 32nm

Overall Core Architecture

Core Architecture – Instruction Fetch

Instruction Decode

Out-of-Order Execution

Execution Units – Scalar, SIMD, FP and AVX (vector)

Memory Access Units
  L1 load-to-use latency is 4 cycles
  Banked to support multiple accesses
  Poor man’s multiporting?
  L2 load-to-use latency is 12 cycles
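
The banking idea above can be illustrated with a toy model. This is only a sketch: the bank count, the bank width, and the rule that two same-cycle accesses to one bank always conflict are assumptions made for illustration, not figures from the slides. The names NUM_BANKS, BANK_WIDTH_BYTES, bank_of and same_cycle_conflict are hypothetical.

```python
# Toy model of a banked L1 data cache.
# NUM_BANKS and BANK_WIDTH_BYTES are illustrative assumptions,
# not Sandy Bridge specifications.
NUM_BANKS = 8
BANK_WIDTH_BYTES = 8


def bank_of(addr: int) -> int:
    """Return the bank an address maps to when consecutive
    BANK_WIDTH_BYTES-sized chunks are interleaved across banks."""
    return (addr // BANK_WIDTH_BYTES) % NUM_BANKS


def same_cycle_conflict(addr_a: int, addr_b: int) -> bool:
    """Simplified rule: two accesses issued in the same cycle
    conflict when they map to the same bank."""
    return bank_of(addr_a) == bank_of(addr_b)


if __name__ == "__main__":
    for a, b in [(0x1000, 0x1008), (0x1000, 0x1040)]:
        status = "conflict" if same_cycle_conflict(a, b) else "parallel"
        print(f"{a:#x} (bank {bank_of(a)}) and "
              f"{b:#x} (bank {bank_of(b)}): {status}")
```

Under these assumptions, neighbouring 8-byte words land in different banks and can be served together, while addresses 64 bytes apart fall in the same bank and must be serialized. That is why banking is a cheaper approximation of true multiporting rather than a replacement for it.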

L3 and ring interconnect
  CPUs, GPU and system agent communicate via a ring
  Each CPU has a 2MB 8-way set-associative (SA) L3 slice
  Static hash of addresses to slices
  Latency varies with which core accesses which slice: 26-31 cycles
  Max bandwidth: 435.2GB/s at 3.4GHz
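
The L3 figures above can be checked with a little arithmetic, and the static hash can be sketched in a few lines. This is a rough sketch under stated assumptions: the slide gives 3.4GHz only for the bandwidth figure, so using it to convert latency to nanoseconds is an assumption. The hash is a made-up XOR fold, not Intel's actual function, and the slice count and line size (NUM_SLICES, LINE_BYTES) are hypothetical values.

```python
CLOCK_HZ = 3.4e9        # from the slide
PEAK_BW_BYTES = 435.2e9  # from the slide

# Assumptions for the toy hash (not stated on the slide).
NUM_SLICES = 4   # must be a power of two for this fold
LINE_BYTES = 64


def bytes_per_cycle(bandwidth: float, clock_hz: float) -> float:
    """Peak bytes moved per clock cycle."""
    return bandwidth / clock_hz


def cycles_to_ns(cycles: int, clock_hz: float) -> float:
    """Convert a latency in cycles to nanoseconds."""
    return cycles / clock_hz * 1e9


def slice_of(addr: int) -> int:
    """Hypothetical static hash: XOR-fold the cache-line number
    into a slice index. The same address always maps to the
    same slice, which is what 'static' means here."""
    line = addr // LINE_BYTES
    h = 0
    while line:
        h ^= line % NUM_SLICES
        line //= NUM_SLICES
    return h


if __name__ == "__main__":
    print(f"{bytes_per_cycle(PEAK_BW_BYTES, CLOCK_HZ):.1f} bytes/cycle")
    for c in (26, 31):
        print(f"{c} cycles = {cycles_to_ns(c, CLOCK_HZ):.2f} ns")
    for addr in (0x0, 0x40, 0x80, 0x140):
        print(f"address {addr:#x} -> slice {slice_of(addr)}")
```

The peak bandwidth works out to about 128 bytes per cycle in total, and the 26-31 cycle range corresponds to roughly 7.6 to 9.1 ns if the cores run at the same 3.4GHz. Because the mapping depends only on the address, every core can compute which slice holds a line without any lookup.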

Overall Core Architecture