Computer Organization & Design 计算机组成与设计

Computer Organization & Design (计算机组成与设计)
Weidong Wang (王维东), wdwang@zju.edu.cn
College of Information Science & Electronic Engineering, Institute of Information and Communication Network Engineering (ICAN), Zhejiang University

Course Information
Instructor: Weidong WANG
  Email: wdwang@zju.edu.cn; Tel (O): 0571-87953170; Mobile: 13605812196
  Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306
TAs:
  陈彬彬 Binbin CHEN: 13071888906; 15091831397@163.com
  陈佳云 Jiayun CHEN: 13161700140; chenjy93@outlook.com
  Office Hours: Wednesday & Saturday, 14:00-16:30, Xindian (High-Tech) Building 308 (SMS and email also welcome)
  WeChat group: "2017计组群"

Lecture 13: Introduction to Multi-core Processors

Motivation: Single-Processor Performance Scaling

Multi-core Chips (aka Chip Multi-Processors, or CMPs)

Sample of Multi-core Options

Sample of Multi-core Options (continued)

And There is Much More… (heterogeneous designs, multi-socket systems, clusters)

Vector Supercomputers
Epitomized by the Cray-1, 1976:
  Scalar unit: load/store architecture
  Vector extension: vector registers, vector instructions
  Implementation: hardwired control, highly pipelined functional units, interleaved memory system, no data caches, no virtual memory

Vector Programming Model
  Scalar registers r0-r15; vector registers v0-v15, each holding elements [0] … [VLRMAX-1]
  VLR: the vector length register, which sets how many elements each vector instruction operates on
  Vector arithmetic instructions, e.g. ADDV v3, v1, v2: element-wise add of v1 and v2 into v3, over elements [0] … [VLR-1]
  Vector load and store instructions, e.g. LV v1, r1, r2: load vector register v1 from memory with base address r1 and stride r2
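To make the model concrete, here is an illustrative scalar loop (not from the slide; the function and array names are invented) showing exactly the work that one vector instruction sequence replaces: with the vector length register set appropriately, a single LV/LV/ADDV/SV sequence performs every per-element operation that the scalar loop spells out one at a time.

    // Scalar element-wise add; on a vector machine the whole loop becomes roughly:
    //   set VLR = n (strip-mined in chunks of at most VLRMAX)
    //   LV   v1, r1, r2   ; load n elements of x (base r1, stride r2)
    //   LV   v2, ...      ; load n elements of y
    //   ADDV v3, v1, v2   ; n additions in one instruction
    //   SV   v3, ...      ; store n results to z
    void vector_add(const double* x, const double* y, double* z, int n) {
        for (int i = 0; i < n; ++i) {
            z[i] = x[i] + y[i];
        }
    }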

Multimedia Extensions (aka SIMD Extensions)
  Very short vectors added to existing ISAs for microprocessors
  Use the existing 64-bit registers, split into 2x32b, 4x16b, or 8x8b lanes
  The concept was first used on the Lincoln Labs TX-2 computer in 1957, with a 36b datapath split into 2x18b or 4x9b
  Newer designs have 128-bit registers (PowerPC AltiVec, Intel SSE2/3/4)
  A single instruction operates on all elements within the register, e.g. four 16-bit adds at once
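For a present-day flavor of the same idea (a hedged sketch, not from the slide: an x86 machine with the 128-bit SSE2 extension is assumed, and the function and array names are invented), the intrinsic _mm_add_epi16 performs eight 16-bit additions with a single instruction:

    #include <emmintrin.h>   // SSE2 intrinsics

    // Element-wise add of 16-bit integers, eight lanes per instruction.
    // Assumes n is a multiple of 8; a real routine would also handle the tail.
    void add_i16(const short* a, const short* b, short* c, int n) {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_loadu_si128((const __m128i*)(a + i));   // load 8 x 16b
            __m128i vb = _mm_loadu_si128((const __m128i*)(b + i));
            __m128i vc = _mm_add_epi16(va, vb);                      // 8 adds at once
            _mm_storeu_si128((__m128i*)(c + i), vc);                 // store 8 x 16b
        }
    }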

Supercomputers
Definitions of a supercomputer:
  Fastest machine in the world at a given task
  A device to turn a compute-bound problem into an I/O-bound problem
  Any machine costing $30M+
  Any machine designed by Seymour Cray
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.

CDC 6600 (Seymour Cray, 1963)
  A fast pipelined machine with 60-bit words; 128 Kword main memory capacity, 32 banks
  Ten functional units (parallel, unpipelined)
    Floating point: adder, 2 multipliers, divider
    Integer: adder, 2 incrementers, ...
  Hardwired control (no microcoding)
  Scoreboard for dynamic scheduling of instructions
  Ten peripheral processors for input/output, each a fast multi-threaded 12-bit integer ALU
  Very fast clock, 10 MHz (FP add in 4 clocks)
  >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel Freon-based cooling
  Fastest machine in the world for 5 years (until the 7600); over 100 sold ($7-10M each)

CDC6600: Vector Addition
        B0 ← -n
  loop: JZE B0, exit
        A0 ← B0 + a0     ; load X0
        A1 ← B0 + b0     ; load X1
        X6 ← X0 + X1
        A6 ← B0 + c0     ; store X6
        B0 ← B0 + 1
        jump loop
  (Ai = address register, Bi = index register, Xi = data register)

Supercomputer Applications
Typical application areas:
  Military research (nuclear weapons, cryptography)
  Scientific research
  Weather forecasting
  Oil exploration
  Industrial design (car crash simulation)
  Bioinformatics
  Cryptography
All involve huge computations on large data sets.
In the 1970s-80s, supercomputer effectively meant vector machine.

BlueGene/Q Compute Chip
  System-on-a-chip design: integrates processors, memory, and networking logic on a single chip
  360 mm², Cu-45 technology (SOI), ~1.47 B transistors
  16 user cores + 1 service core, plus 1 redundant core; all cores symmetric
    Each 4-way multi-threaded, 64-bit PowerISA™, 1.6 GHz
    L1 I/D cache = 16 KB/16 KB, with L1 prefetch engines
    Each core has a quad FPU (4-wide double-precision SIMD)
    Peak performance 204.8 GFLOPS @ 55 W
  Central shared L2 cache: 32 MB eDRAM
    Multiversioned cache; supports transactional memory, speculative execution, and atomic ops
  Dual memory controller: 16 GB external DDR3 memory, 1.33 Gb/s, 2 × 16-byte-wide interface (+ECC)
  Chip-to-chip networking: router logic integrated into the BQC chip
  External I/O: PCIe Gen2 interface

Blue Gene/Q Packaging Hierarchy
  1. Chip: 16 cores
  2. Module: single chip
  3. Compute card: one single-chip module, 16 GB DDR3 memory
  4. Node card: 32 compute cards, optical modules, link chips, torus
  5a. Midplane: 16 node cards
  5b. I/O drawer: 8 I/O cards, 8 PCIe Gen2 slots
  6. Rack: 2 midplanes, 1, 2 or 4 I/O drawers
  7. System: 20 PF/s
5-D topology: 16x16x16x12x2; a Q32 card is 2x2x2x2x2 and a midplane is 4x4x4x4x2. (Ref: SC2010)

Graphics Processing Units (GPUs)
  Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units
    Provided workstation-class graphics for PCs
    Users could configure the graphics pipeline, but not really program it
  Over time, more programmability was added (2001-2005)
    E.g., the new language Cg for writing small programs run on each vertex or each pixel, plus Windows DirectX variants
    Massively parallel (millions of vertices or pixels per frame), but a very constrained programming model
  Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading operations
    An incredibly difficult programming model, since general computation had to be forced through the graphics pipeline
  Vertex shader: colors are computed only at a triangle's three vertices; the other pixels inside the triangle are obtained by interpolation
  Pixel shader: the color of every pixel is computed

General-Purpose GPUs (GP-GPUs)
  In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA, "Compute Unified Device Architecture"
    Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas
  Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
  Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution
  This lecture presents a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics (graphics processing would need another course)
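As a sketch of the attached-processor model (illustrative only; the array names and the 256-thread block size are choices made here, not something from the slide), the host CPU allocates device memory, copies data over, issues a data-parallel kernel, and copies the result back:

    #include <cuda_runtime.h>

    // Device kernel: one CUDA thread computes one output element.
    __global__ void vadd(const float* x, const float* y, float* z, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) z[i] = x[i] + y[i];
    }

    // Host side: the CPU sets everything up and launches the kernel on the GPU.
    void host_vadd(const float* hx, const float* hy, float* hz, int n) {
        float *dx, *dy, *dz;
        size_t bytes = n * sizeof(float);
        cudaMalloc((void**)&dx, bytes);                        // allocate GPU memory
        cudaMalloc((void**)&dy, bytes);
        cudaMalloc((void**)&dz, bytes);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);     // host -> device
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vadd<<<blocks, threads>>>(dx, dy, dz, n);              // issue the kernel
        cudaMemcpy(hz, dz, bytes, cudaMemcpyDeviceToHost);     // device -> host
        cudaFree(dx); cudaFree(dy); cudaFree(dz);
    }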

"Single Instruction, Multiple Thread" (SIMT)
  GPUs use a SIMT model: the individual scalar instruction streams of the CUDA threads are grouped together for SIMD execution on the hardware (Nvidia groups 32 CUDA threads into a warp)
  Example from the figure: a scalar stream of ld x; mul a; ld y; add; st y is issued across microthreads µT0-µT7, so each instruction executes SIMD-fashion across the warp
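The scalar stream in that example (load x, multiply by a, load y, add, store y) is one thread's view of a SAXPY computation; a hedged CUDA sketch of the per-thread code is below (the kernel name and the bounds check are additions for illustration). The hardware then issues each of these instructions across the 32 threads of a warp.

    // Each CUDA thread (one µT in the figure) runs this same scalar code
    // with its own index i; the warp executes each line in SIMD fashion.
    __global__ void saxpy(float a, const float* x, float* y, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float xi = x[i];      // ld x
            float t  = a * xi;    // mul a
            float yi = y[i];      // ld y
            float s  = t + yi;    // add
            y[i] = s;             // st y
        }
    }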

Nvidia Fermi GF100 GPU [Nvidia, 2010] 21

GPU Future
  High-end desktops have a separate GPU chip, but the trend is toward integrating the GPU on the same die as the CPU (already done in laptops, tablets and smartphones)
    Advantage: memory shared with the CPU, no need to transfer data
    Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system (graphics DRAM, GDDR, versus regular DDR3 DRAM)
  Will GP-GPUs survive? Or will improvements in CPU data-level parallelism make GP-GPUs redundant?
    On the same die, the CPU and GPU have the same memory bandwidth
    The GPU might still have more FLOPS, since these are needed for graphics anyway

Another HW Issue: the Memory Model for Multi-core (implicit, and hard to get right)

Symmetric Multiprocessors
  [Figure: processors and memory on a CPU-memory bus, with a bridge to an I/O bus carrying the I/O controller, graphics output, and networks]
  Symmetric means:
    All memory is equally far away from all processors
    Any processor can do any I/O (e.g., set up a DMA transfer)

Synchronization
  The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system)
  Producer-consumer: a consumer process must wait until the producer process has produced the data
  Mutual exclusion: ensure that only one process uses a shared resource at a given time
  [Figure: a producer feeding a consumer; two processes P1 and P2 contending for a shared resource]

A Producer-Consumer Example
  Producer (posting item x):
        Load  Rtail, (tail)
        Store (Rtail), x
        Rtail = Rtail + 1
        Store (tail), Rtail
  Consumer:
        Load  Rhead, (head)
  spin: Load  Rtail, (tail)
        if Rhead == Rtail goto spin
        Load  R, (Rhead)
        Rhead = Rhead + 1
        Store (head), Rhead
        process(R)
  The program is written assuming instructions are executed in order. Problems?
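The answer to "Problems?" is that real processors and compilers may reorder memory operations: the consumer can observe the new tail before the producer's store of x is visible, and so read stale data. One standard fix, sketched here in C++11 atomics (an illustration under assumptions not in the slide: a single producer, a single consumer, and a buffer that never overflows), is a release store when publishing tail paired with an acquire load when reading it:

    #include <atomic>

    constexpr int N = 1024;
    int buffer[N];                       // shared queue storage
    std::atomic<int> head{0}, tail{0};   // shared indices

    // Producer: write the data first, then publish it with a release store,
    // so the data is guaranteed to be visible before the new tail is.
    void produce(int x) {
        int t = tail.load(std::memory_order_relaxed);
        buffer[t % N] = x;                              // Store (Rtail), x
        tail.store(t + 1, std::memory_order_release);   // Store (tail), Rtail
    }

    // Consumer: the acquire load of tail pairs with the producer's release,
    // so once the spin loop exits, the buffered value is guaranteed visible.
    int consume() {
        int h = head.load(std::memory_order_relaxed);
        while (tail.load(std::memory_order_acquire) == h) { /* spin */ }
        int r = buffer[h % N];                           // Load R, (Rhead)
        head.store(h + 1, std::memory_order_release);    // Store (head), Rhead
        return r;
    }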

Performance of Symmetric Shared-Memory Multiprocessors
  Cache performance is a combination of:
    Uniprocessor cache miss traffic
    Traffic caused by communication, which results in invalidations and subsequent cache misses
  This adds a 4th C, the coherence miss, to the classic compulsory, capacity, and conflict misses
    (Sometimes called a communication miss)

Coherence Misses
  True sharing misses arise from the communication of data through the cache coherence mechanism
    Invalidations due to the first write to a shared block
    Reads by another CPU of a modified block held in a different cache
    These misses would still occur if the block size were 1 word
  False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written
    The invalidation does not communicate a new value; it only causes an extra cache miss
    The block is shared, but no word in it is actually shared, so the miss would not occur if the block size were 1 word
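To make false sharing concrete, here is an illustrative sketch (the counter layout, the thread count, and the assumed 64-byte line size are not from the slide): two threads increment different counters, yet if the counters sat in the same cache line, every write by one thread would invalidate the line in the other thread's cache. Padding each counter onto its own line removes those coherence misses.

    #include <thread>

    // Each counter is padded and aligned to a (presumed) 64-byte cache line,
    // so the two threads never touch the same line and never falsely share.
    // Remove the alignment and padding and the same code suffers false sharing.
    struct alignas(64) PaddedCounter {
        long value;
        char pad[64 - sizeof(long)];
    };

    PaddedCounter counters[2];

    void work(int id, long iters) {
        for (long i = 0; i < iters; ++i) {
            counters[id].value++;     // each thread updates only its own counter
        }
    }

    int main() {
        std::thread t0(work, 0, 100000000L);
        std::thread t1(work, 1, 100000000L);
        t0.join();
        t1.join();
        return 0;
    }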

HomeWork
  Readings:
    Read the book; read "Parallel Processors from Client to Cloud.pdf"
    Read Appendix B (TH-2 HPC) in Computer Organization and Design (COD), Fifth Edition
  Homework:
    HW4 due (hand in)
    HW5
    Project 2

Acknowledgements
  These slides contain material from the following courses:
    UC Berkeley CS152
    Stanford EE108B
    MIT 6.823