CSE 502: Computer Architecture

CSE 502: Computer Architecture Data-Level Parallelism

Vector Processors
SCALAR (1 operation): add r3, r1, r2 reads two scalar registers (r1, r2) and writes one (r3)
VECTOR (N operations): vadd.vv v3, v1, v2 reads two vector registers (v1, v2) and writes one (v3), performing N element-wise adds up to the vector length
Scalar processors operate on single numbers (scalars)
Vector processors operate on sequences of numbers (vectors)
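
To make the scalar-versus-vector contrast concrete, here is a small sketch (not from the slides; the function name and types are illustrative): a scalar machine walks this loop one add instruction per element, while a vector machine covers a whole register's worth of iterations with a single vadd.vv.

    // Element-wise add, written as a scalar loop.
    // a, b, c each point to n doubles.
    void add_scalar(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];   // one "add r3, r1, r2" per element on a scalar CPU
        }
    }
    // On a vector processor, a single "vadd.vv v3, v1, v2" performs the adds
    // for up to a full vector register (vector length) of elements at once.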

What's in a Vector Processor?
A scalar processor (e.g., a MIPS processor)
  Scalar register file (32 registers)
  Scalar functional units (arithmetic, load/store, etc.)
A vector register file (a 2D register array)
  Each register is an array of elements, e.g., 32 registers with 32 64-bit elements per register
  MVL = maximum vector length = max # of elements per register
A set of vector functional units
  Integer, FP, load/store, etc.
  Sometimes the vector and scalar units are combined (share ALUs)
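
As a rough mental model of the register state listed above, here is a hedged sketch in C-style code; the sizes follow the slide's example (32 scalar registers, 32 vector registers of 32 64-bit elements each), and all type names are made up.

    #include <stdint.h>

    enum { NUM_SCALAR_REGS = 32, NUM_VECTOR_REGS = 32, MVL = 32 };

    // Scalar register file: 32 flat 64-bit registers.
    typedef struct { uint64_t r[NUM_SCALAR_REGS]; } scalar_regfile_t;

    // Vector register file: a 2D array; each of the 32 vector registers
    // holds MVL (maximum vector length) 64-bit elements.
    typedef struct { uint64_t v[NUM_VECTOR_REGS][MVL]; } vector_regfile_t;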

Example of Simple Vector Processor

Advantages of Vector ISAs
Compact: a single instruction defines N operations
  Amortizes the cost of instruction fetch/decode/issue
  Also reduces the frequency of branches
Parallel: the N operations are (data) parallel
  No dependencies between them, so no complex hardware is needed to detect parallelism
  Can execute in parallel, assuming N parallel datapaths
Expressive: memory operations describe patterns
  Continuous or regular memory access patterns
  Can prefetch or accelerate them using wide/multi-banked memory
  Amortizes the high latency of the 1st element over the sequential pattern
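
The "compact" and "parallel" points show up in how a long loop is strip-mined into MVL-sized chunks. The sketch below is illustrative (MVL, names, and the instruction counts in the comments are assumptions about a generic vector machine, not a specific ISA).

    enum { MVL = 32 };  // assumed maximum vector length

    void add_strip_mined(const double *a, const double *b, double *c, int n) {
        for (int i = 0; i < n; i += MVL) {
            int len = (n - i < MVL) ? (n - i) : MVL;  // shorter vector length for the tail
            // On a vector machine the inner loop is roughly four instructions:
            //   vld v1, a+i ; vld v2, b+i ; vadd.vv v3, v1, v2 ; vst v3, c+i
            // so fetch/decode/issue cost is amortized over up to MVL elements.
            for (int j = 0; j < len; j++)
                c[i + j] = a[i + j] + b[i + j];
        }
    }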

SIMD Example: Intel Xeon Phi
[Slide diagram: each in-order core has 4 hardware thread contexts, an in-order decoder, a scalar/x87 pipe and a 512-bit VPU pipe, with 32KB L1 code and data caches; cores connect through the L2 and tag directory to GDDR memory controllers and PCIe client logic.]
Multi-core chip with Pentium-based SIMD processors
Targeting the HPC market (goal: high GFLOPS and GFLOPS/Watt)
4 hardware threads + wide SIMD units
Vector ISA: 32 vector registers (512b), 8 mask registers, scatter/gather
In-order, short pipeline
Why in-order?
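
The mask registers and scatter/gather support are what let loops with conditionals and indexed accesses stay vectorized. The loop below is a hypothetical example (all names are made up); the comments describe one plausible mapping onto a 512-bit SIMD unit, not the exact code a particular compiler emits.

    // Hypothetical loop: conditional update through an index array.
    void scale_selected(const int *idx, const double *x, double *y,
                        double threshold, int n) {
        for (int i = 0; i < n; i++) {
            if (x[i] > threshold)          // -> per-element bit in a mask register
                y[idx[i]] = 2.0 * x[i];    // -> masked scatter through idx[]
        }
    }
    // With 512-bit vectors of 64-bit doubles, each vector iteration covers
    // 8 elements; lanes whose mask bit is 0 are simply left untouched.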

Graphics Processing Unit (GPU)
An architecture for compute-intensive, highly data-parallel computation
  Exactly what graphics rendering needs
Transistors are devoted to data processing
  Not to caching and flow control
[Slide diagram: compared with a CPU, a GPU devotes far more die area to ALUs and far less to control logic and cache; both connect to DRAM.]

Data Parallelism in GPUs
GPUs use massive DLP to provide high FLOPS
  More than 1 TFLOPS of double-precision performance in NVIDIA's GK110
SIMT execution model: Single Instruction, Multiple Threads
  Intended to distinguish it from both "vectors" and "SIMD"
  A key difference: better support for conditional control flow
Program it with CUDA or OpenCL (among other things)
  Extensions to C
  Run a "shader task" (a snippet of scalar code) on many elements
  Internally, the GPU uses scatter/gather and vector-mask-like operations
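
A minimal CUDA sketch of the "shader task on many elements" idea: the kernel is scalar code for a single element, and the launch asks for one thread per element (the names and the 256-thread block size are illustrative; error handling is omitted, and a, b, c must be device pointers, e.g., from cudaMalloc).

    #include <cuda_runtime.h>

    // Scalar "shader task": each thread handles one element.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                 // threads past the end of the array do nothing
            c[i] = a[i] + b[i];
    }

    // Launch: one thread per element, grouped into 256-thread blocks.
    void launch_vec_add(const float *a, const float *b, float *c, int n) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vec_add<<<blocks, threads>>>(a, b, c, n);
    }

The if (i < n) guard is exactly the kind of conditional control flow the SIMT model absorbs with per-thread masking, rather than exposing a vector length or mask register to the programmer.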

GPGPU (General-Purpose GPU)
In the beginning...
  GPUs could originally only perform "shader" computations on images
  So programmers started using this framework for general computation
  It was a puzzle to work around the limitations and unlock the raw potential
As GPU designers noticed this trend...
  Hardware provided more "hooks" for computation
  Vendors provided some limited software tools
GPU designs are now fully embracing compute
  More programmability features in each generation
  Industrial-strength tools, documentation, tutorials, etc.
  Can be used for in-game physics, etc.
  A major initiative to push GPUs beyond graphics (HPC)