UC Regents Spring 2014 © UCB. CS 152 L22: GPU + SIMD + Vectors. John Lazzaro (not a prof - “John” is always OK). CS 152 Computer Architecture and Engineering, www-inst.eecs.berkeley.edu/~cs152/. TA: Eric Love. Lecture: GPU + SIMD + Vectors I.

UC Regents Fall 2006 © UCB. CS 152 L22: GPU + SIMD + Vectors. Today: Architecture for data parallelism. The Landscape: Three chips that deliver TeraOps/s in 2014, and how they differ. GK110: nVidia’s flagship Kepler GPU, customized for compute applications. Short Break. E5-2600v2: Stretching the Xeon server approach for compute-intensive apps.

Sony/IBM Playstation PS3 Cell Chip - Released 2006

Sony PS3 Cell Processor SPE Floating-Point: 32-bit Single-Instruction Multiple-Data. 4 single-precision multiply-adds issue in lockstep (SIMD) per cycle, with 6-cycle latency (shown in blue on the slide). 6 gamer SPEs at a 3.2 GHz clock --> 150 GigaOps/s.
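To make “lockstep” concrete, here is a scalar C sketch of the semantics of one such SIMD issue (the function name and array form are ours for illustration; these are not SPU intrinsics):

    /* One 4-wide SIMD multiply-add issue, written out as scalar C:
       the same operation applied to all four lanes in lockstep. */
    void fma4_lockstep(const float a[4], const float b[4],
                       const float c[4], float d[4]) {
        for (int i = 0; i < 4; i++)
            d[i] = a[i] * b[i] + c[i];  /* one vector issue = 4 identical FMAs */
    }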

Sony PS3 Cell Processor SPE Floating-Point 32-bit Single-Instruction Multiple-Data In the 1970s a big part of a computer architecture class would be learning how to build units like this. Top-down (f.p. format) && Bottom-up (logic design)

Sony PS3 Cell Processor SPE Floating-Point The PS3 ceded ground to Xbox not because it was underpowered, but because it was hard to program. Today, the formats are standards (IEEE f.p.) and the bottom-up is now “EE.” Architects focus on how to organize floating point units into programmable machines for application domains.

UC Regents Spring 2014 © UCBCS 152 L22: GPU + SIMD + Vectors 2014: TeraOps/Sec Chips

Intel E5-2600v2: 12-core Xeon Ivy Bridge. 0.52 TeraOps/s @ 2.7 GHz. Each core can issue 16 single-precision operations per cycle. $2,600 per chip. Haswell: 1.04 TeraOps/s.
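Sanity-checking that figure (the 2.7 GHz clock is the Ivy Bridge Xeon speed quoted later in this deck): 12 cores × 16 single-precision ops/cycle × 2.7 GHz ≈ 518 GigaOps/s ≈ 0.52 TeraOps/s. Doubling the per-core issue width to 32 gives Haswell’s 1.04 TeraOps/s.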

EECS 150: Graphics Processors. UC Regents Fall 2013 © UCB. nVidia GPU: Kepler GK110. 5.12 TeraOps/s: 2,880 single-precision multiply-adds issue at 889 MHz. $999 GTX Titan Black with 6GB GDDR5 (and 1 GPU).

XC7VX980T: the Xilinx Virtex 7 with the most DSP blocks. Running at several hundred MHz, it is comparable in single-precision floating-point TeraOps/s. Typical application: medical imaging scanners, for the first stage of processing after the A/D converters. $16,824 per chip (die photo of a related part).

Intel E5-2600v2 @ 2.7 GHz: each core can issue 16 single-precision ops/cycle. How does Haswell double that? Haswell cores issue 32/cycle.

Die closeup of one Sandy Bridge core. The Advanced Vector Extension (AVX) unit is smaller than the L3 cache, but larger than the L2 cache. Its relative area has increased in Haswell.

Programmer’s Model: AVX. IA-32 Nehalem: 128-bit registers. Each register holds 4 IEEE single-precision floats. The programmer’s model has many variants, which we will introduce in the slides that follow.

Example AVX Opcode: VMULPS XMM4, XMM2, XMM3 (XMM4 = XMM2 * XMM3). Multiply two 4-element vectors of single-precision floats, element by element. New issue every cycle; 5-cycle latency (Haswell). Aside from its use of a special register set, VMULPS executes like a normal IA-32 instruction.
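Compilers expose VMULPS and friends through C intrinsics; a minimal, self-contained sketch (the intrinsics are standard <immintrin.h> calls; the values are arbitrary):

    #include <immintrin.h>
    #include <stdio.h>

    int main(void) {
        /* Four packed single-precision floats per 128-bit XMM register.
           Note: _mm_set_ps lists elements high-to-low. */
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);
        __m128 r = _mm_mul_ps(a, b);   /* compiles to (V)MULPS */
        float out[4];
        _mm_storeu_ps(out, r);
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]); /* 5 12 21 32 */
        return 0;
    }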

Sandy Bridge, Haswell. Sandy Bridge extends the register set to 256 bits: vectors are twice the size. In 64-bit mode, AVX/AVX2 has 16 registers (IA-32: 8). Haswell adds 3-operand fused multiply-add (FMA) instructions: a * b + c. Two EX units with FMA --> 2X increase in ops/cycle.
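A sketch of a 256-bit FMA in C, assuming an FMA-capable (Haswell-class) CPU and compilation with -mavx2 -mfma (the function and array names are ours):

    #include <immintrin.h>

    /* d[i] = a[i] * b[i] + c[i] for 8 single-precision lanes at once. */
    void fma8(const float *a, const float *b, const float *c, float *d) {
        __m256 va = _mm256_loadu_ps(a);
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vc = _mm256_loadu_ps(c);
        __m256 vd = _mm256_fmadd_ps(va, vb, vc);  /* 3-operand fused multiply-add */
        _mm256_storeu_ps(d, vd);
    }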

OoO Issue: Haswell (2013). Haswell sustains 4 micro-op issues per cycle. One possibility: 2 for AVX, and 2 for loads, stores, and bookkeeping. Haswell has two copies of the FMA engine, on separate ports.

AVX: Not just single-precision floating-point. AVX instruction variants interpret 128-bit registers as 4 floats, 2 doubles, 16 8-bit integers, etc. The 256-bit version --> double-precision vectors of length 4.
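A brief sketch of the same register widths carrying different element types (standard intrinsics; the values are arbitrary):

    #include <immintrin.h>

    void type_variants(void) {
        __m256  f8  = _mm256_set1_ps(1.5f);  /* 8 single-precision floats (256-bit) */
        __m256d d4  = _mm256_set1_pd(1.5);   /* 4 double-precision floats (256-bit) */
        __m128i b16 = _mm_set1_epi8(7);      /* 16 8-bit integers (128-bit) */

        f8  = _mm256_mul_ps(f8, f8);         /* VMULPS */
        d4  = _mm256_mul_pd(d4, d4);         /* VMULPD */
        b16 = _mm_add_epi8(b16, b16);        /* PADDB  */
        (void)f8; (void)d4; (void)b16;       /* silence unused warnings */
    }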

Exception Model. MXCSR: the AVX control and status register, which holds the floating-point exception flags and masks. Floating-point exceptions: always a contentious issue in ISA design...

Exception Handling. Use MXCSR to configure AVX to halt the program for divide by zero, etc... Or, configure AVX for “show must go on” semantics: on error, results are set to +Inf, -Inf, NaN,...
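Both policies are reachable from C via the MXCSR helper macros in <xmmintrin.h> (available through <immintrin.h>); a sketch:

    #include <immintrin.h>

    int main(void) {
        /* Default (“show must go on”): divide-by-zero is masked,
           and the offending lane simply becomes +Inf. */
        __m128 inf = _mm_div_ps(_mm_set1_ps(1.0f), _mm_setzero_ps());
        (void)inf;

        /* Unmask divide-by-zero in MXCSR: the next such divide traps (SIGFPE). */
        _MM_SET_EXCEPTION_MASK(_MM_GET_EXCEPTION_MASK() & ~_MM_MASK_DIV_ZERO);
        return 0;
    }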

Data Moves. AVX register file reads pass through permute and shuffle networks in both the “X” and “Y” dimensions. Many AVX instructions rely on this feature...

Pure data move opcode. Or, part of a math opcode.

Permutes over 2 sets of 4 fields of one vector. Arbitrary data alignment. Shuffling two vectors.
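In C intrinsics, the one-vector permute and the two-vector shuffle look like this (a sketch; variable names are ours):

    #include <immintrin.h>

    void data_moves(void) {
        __m256 y = _mm256_set_ps(7.f, 6.f, 5.f, 4.f, 3.f, 2.f, 1.f, 0.f);
        /* VPERMILPS: permutes over 2 sets of 4 fields of one vector --
           here, reversing the order within each 128-bit half. */
        __m256 rev = _mm256_permute_ps(y, _MM_SHUFFLE(0, 1, 2, 3));

        __m128 a = _mm_set_ps(3.f, 2.f, 1.f, 0.f);
        __m128 b = _mm_set_ps(7.f, 6.f, 5.f, 4.f);
        /* SHUFPS: shuffles two vectors -- low 2 fields from a, high 2 from b. */
        __m128 mix = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 2, 1, 0));
        (void)rev; (void)mix;
    }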

Memory System: Gather. Reading non-unit-stride memory locations into arbitrary positions in an AVX register, while minimizing redundant reads. (Slide figure: values in memory, specified indices, final result.)
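AVX2 exposes gather as a single intrinsic; a minimal sketch (the index values are made up for illustration):

    #include <immintrin.h>

    void gather8(const float *table /* at least 16 elements */, float *out) {
        /* Non-unit-stride: indices may be in any order and may repeat. */
        __m256i idx = _mm256_setr_epi32(0, 3, 3, 7, 2, 9, 15, 1);
        /* out[i] = table[idx[i]]; scale 4 = sizeof(float). VGATHERDPS. */
        __m256 v = _mm256_i32gather_ps(table, idx, 4);
        _mm256_storeu_ps(out, v);
    }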

Positive observations... Best for applications that are a good fit for Xeon’s memory system: large on-chip caches and up to a terabyte of DRAM, but only moderate bandwidth requirements to DRAM. Applications that do “a lot of everything” -- integer, random-access loads/stores, string ops -- gain access to a significant fraction of a TeraOp/s of floating point, with no context switching. And if you’re planning on experimenting with GPUs, you need a Xeon server anyway... aside from $$$, why not buy a high-core-count variant?

Negative observations... AVX changes each generation, in a backward-compatible way, to add the latest features. AVX is difficult for compilers; ideally, someone has written a library of hand-crafted AVX assembly code that does exactly what you want. Two FMA units per core (50% of issue width) is probably the limit, so scaling vector size or scaling core count are the only upgrade paths. 0.52 TeraOp/s (Ivy Bridge) << 5.12 TeraOp/s (GK110). And $2700 (chip only) >> $999 (Titan Black card). 60 GB/s << 336 GB/s (memory bandwidth).

UC Regents Spring 2014 © UCB. CS 152 L22: GPU + SIMD + Vectors. Break.

EECS 150: Graphics Processors. UC Regents Fall 2013 © UCB. nVidia GPU: Kepler GK110. The granularity of SMX cores (15 per die) matches the Xeon core count (12 per die).

SMX core (28 nm) Sandy Bridge core (32 nm)

889 MHz GK110 SMX core vs. 2.7 GHz Haswell core. 1024-bit SIMD vectors: 4X wider than Haswell -- 32 single-precision floats or 16 double-precision floats. (Slide figure: the SMX execution units, labeled single precision, double precision, special ops, and memory ops.) Execution units vs. Haswell: 3X (single-precision), 1X (double-precision). Clock speed vs. Ivy Bridge Xeon: 3X slower. Net: 4X single-precision, 1.33X double-precision.

CS 152 L14: Cache Design and Coherency. UC Regents Spring 2014 © UCB. Organization: Multi-threaded, like Niagara. Thread scheduler: 2048 registers in total. Several programmer models are available; the largest model has 256 registers per thread, supporting 8 active threads.

CS 152 L14: Cache Design and Coherency. UC Regents Spring 2014 © UCB. Organization: Multi-threaded, in-order. Thread scheduler: the SIMD math units live here. Each cycle, 3 threads can issue 2 in-order instructions.

Bandwidth to DRAM is 5.6X Xeon Ivy Bridge. But DRAM is limited to 6GB, and all caches are small compared to Xeon.

EECS 150: Graphics Processors. UC Regents Fall 2013 © UCB. nVidia GPU: Kepler GK110. 5.12 TeraOps/s: 2,880 single-precision multiply-adds issue at 889 MHz. $999 GTX Titan Black with 6GB GDDR5 (and 1 GPU).

On Thursday: To be continued... Have fun in section!