Computer Organization & Design (计算机组成与设计)
Weidong Wang (王维东), wdwang@zju.edu.cn
College of Information Science & Electronic Engineering, Institute of Information and Communication Network Engineering (ICAN), Zhejiang University
Course Information
Instructor: Weidong WANG; Email: wdwang@zju.edu.cn; Tel (O): 0571-87953170; Mobile: 13605812196; Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306.
TAs: Binbin CHEN (陈彬彬), Mobile: 13071888906, Email: 15091831397@163.com; Jiayun CHEN (陈佳云), Mobile: 13161700140, Email: chenjy93@outlook.com. Office Hours: Wednesday & Saturday 14:00-16:30, Xindian (High-Tech) Building 308 (SMS or email also works). WeChat group: "2017计组群".
Lecture 13: Introduction to Multi-core Processors
Motivation: Single-Processor Performance Scaling
Multi-core Chips (aka Chip Multi-Processors, or CMPs)
Sample of Multi-core Options
And There Is Much More… (heterogeneous designs, multi-socket systems, clusters)
Vector Supercomputers
Epitomized by the Cray-1 (1976):
Scalar unit with a load/store architecture; vector extension with vector registers and vector instructions.
Implementation: hardwired control, highly pipelined functional units, interleaved memory system, no data caches, no virtual memory.
Vector Programming Model
Scalar registers r0-r15; vector registers v0-v15, each holding up to VLRMAX elements; the vector length register (VLR) sets how many elements [0]..[VLR-1] an instruction operates on.
Vector arithmetic instructions operate element-wise, e.g. ADDV v3, v1, v2 adds corresponding elements of v1 and v2 into v3.
Vector load and store instructions move whole vectors between memory and a vector register, e.g. LV v1, r1, r2 loads v1 from memory starting at base address r1 with stride r2.
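As a rough illustration, here is a scalar-C sketch of what the strided LV followed by ADDV compute; vlr, base, and stride mirror VLR, r1, and r2, and all function and variable names are hypothetical:

```c
#include <stddef.h>

/* Scalar-C sketch of one strided vector load plus a vector add.
 * Assumes vlr <= 64 (the hypothetical VLRMAX). */
void vector_add(const double *mem, size_t base, size_t stride,
                const double *v2, double *v3, size_t vlr)
{
    double v1[64];                       /* one vector register */
    for (size_t i = 0; i < vlr; i++)     /* LV v1, r1, r2 */
        v1[i] = mem[base + i * stride];
    for (size_t i = 0; i < vlr; i++)     /* ADDV v3, v1, v2 */
        v3[i] = v1[i] + v2[i];
}
```

On a real vector machine each of these loops is a single instruction whose trip count is set by the VLR register.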
Multimedia Extensions (aka SIMD Extensions)
Very short vectors added to existing microprocessor ISAs: the existing 64-bit registers are split into 2x32b, 4x16b, or 8x8b lanes. This concept was first used on the Lincoln Labs TX-2 computer in 1957, with its 36b datapath split into 2x18b or 4x9b. Newer designs have 128-bit registers (PowerPC AltiVec, Intel SSE2/3/4). A single instruction operates on all elements within the register, e.g. four 16-bit adds in parallel.
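On today's 128-bit registers the same idea gives eight 16-bit lanes per instruction; a minimal sketch using Intel SSE2 intrinsics (compile with -msse2):

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Eight 16-bit lanes packed into one 128-bit XMM register. */
    __m128i a = _mm_set_epi16(8, 7, 6, 5, 4, 3, 2, 1);
    __m128i b = _mm_set_epi16(80, 70, 60, 50, 40, 30, 20, 10);

    /* One instruction (PADDW) adds all eight lanes in parallel. */
    __m128i sum = _mm_add_epi16(a, b);

    int16_t out[8];
    _mm_storeu_si128((__m128i *)out, sum);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);       /* prints: 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}
```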
Supercomputers
Definitions of a supercomputer:
Fastest machine in the world at a given task.
A device to turn a compute-bound problem into an I/O-bound problem.
Any machine costing $30M+.
Any machine designed by Seymour Cray.
The CDC 6600 (Cray, 1964) is regarded as the first supercomputer.
CDC 6600 (Seymour Cray, 1963)
A fast pipelined machine with 60-bit words; 128 Kword main memory capacity, 32 banks.
Ten functional units (parallel, unpipelined). Floating point: adder, 2 multipliers, divider. Integer: adder, 2 incrementers, ...
Hardwired control (no microcoding); scoreboard for dynamic scheduling of instructions.
Ten peripheral processors for input/output: a fast multi-threaded 12-bit integer ALU.
Very fast clock, 10 MHz (FP add in 4 clocks).
>400,000 transistors, 750 sq. ft., 5 tons, 150 kW; novel freon-based cooling technology.
Fastest machine in the world for 5 years (until the 7600); over 100 sold ($7-10M each).
CDC6600: Vector Addition

      B0 <- -n
loop: JZE B0, exit        # exit when B0 reaches 0
      A0 <- B0 + a0       # load X0  (X0 <- a[i])
      A1 <- B0 + b0       # load X1  (X1 <- b[i])
      X6 <- X0 + X1
      A6 <- B0 + c0       # store X6 (c[i] <- X6)
      B0 <- B0 + 1
      jump loop

Ai = address register; Bi = index register; Xi = data register. Setting an address register Ai implicitly performs the annotated load or store on the corresponding data register Xi.
Supercomputer Applications
Typical application areas: military research (nuclear weapons, cryptography), scientific research, weather forecasting, oil exploration, industrial design (e.g., car-crash simulation), bioinformatics, cryptography. All involve huge computations on large data sets.
In the 1970s-80s, "supercomputer" effectively meant "vector machine".
BlueGene/Q Compute Chip
System-on-a-chip design: integrates processors, memory, and networking logic into a single chip. 360 mm² in Cu-45 (SOI) technology, ~1.47 B transistors.
16 user processors + 1 service processor, plus 1 redundant processor; all processors are symmetric, each 4-way multi-threaded, 64-bit PowerISA™, 1.6 GHz.
L1 I/D cache = 16 kB/16 kB, with L1 prefetch engines. Each processor has a quad FPU (4-wide double-precision SIMD); peak performance 204.8 GFLOPS @ 55 W.
Central shared L2 cache: 32 MB eDRAM; multiversioned cache will support transactional memory and speculative execution; supports atomic ops.
Dual memory controller: 16 GB external DDR3 memory, 1.33 Gb/s, 2 x 16-byte-wide interface (+ECC).
Chip-to-chip networking: router logic integrated into the BQC chip. External I/O: PCIe Gen2 interface.
Blue Gene/Q Packaging Hierarchy
1. Chip: 16 cores.
2. Module: single chip.
3. Compute Card: one single-chip module, 16 GB DDR3 memory.
4. Node Card: 32 compute cards, optical modules, link chips, torus.
5a. Midplane: 16 node cards.
5b. I/O Drawer: 8 I/O cards, 8 PCIe Gen2 slots.
6. Rack: 2 midplanes; 1, 2, or 4 I/O drawers.
7. System: 20 PF/s.
5-D topology: 16x16x16x12x2. A Q32 card is 2x2x2x2x2 and a midplane is 4x4x4x4x2. Ref: SC2010.
Graphics Processing Units (GPUs)
Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units. They provided workstation-like graphics for PCs; the user could configure the graphics pipeline, but not really program it.
Over time, more programmability was added (2001-2005), e.g., the new language Cg for writing small programs run on each vertex or each pixel, and Windows DirectX variants. Massively parallel (millions of vertices or pixels per frame), but a very constrained programming model.
(Vertex shader: colors are computed only at a triangle's three vertices; the remaining pixels inside the triangle are obtained by interpolation. Pixel shader: the color of every pixel is computed.)
Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex- and pixel-shading computations. This was an incredibly difficult programming model, as the graphics pipeline had to be used for general computation.
General-Purpose GPUs (GP-GPUs)
In 2006, Nvidia introduced the GeForce 8800 GPU, supporting a new programming language: CUDA, the "Compute Unified Device Architecture". Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas.
Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing. Attached-processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution.
This lecture presents a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics; describing graphics processing would probably need another course.
"Single Instruction, Multiple Thread" (SIMT)
GPUs use a SIMT model, in which the individual scalar instruction streams of CUDA threads are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp). For example, the scalar stream ld x; mul a; ld y; add; st y (computing y = a*x + y) executes in lockstep across microthreads µT0..µT7 of a warp.
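As a rough sketch, the loop below is a plain-C analogy of that per-thread instruction stream; in CUDA each iteration would be one thread, and the hardware would run 32 of them in lockstep as a warp (the function name is illustrative):

```c
/* Scalar-C analogy of the warp's per-thread stream above. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++) {   /* i plays the role of the thread ID */
        float xi = x[i];            /* ld x  */
        float t  = a * xi;          /* mul a */
        float yi = y[i];            /* ld y  */
        yi = t + yi;                /* add   */
        y[i] = yi;                  /* st y  */
    }
}
```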
Nvidia Fermi GF100 GPU [Nvidia, 2010]
GPU Future
High-end desktops have a separate GPU chip, but the trend is toward integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones). The advantage is memory shared with the CPU, with no need to transfer data; the disadvantage is reduced memory bandwidth compared to a dedicated, smaller-capacity specialized memory system (graphics DRAM, GDDR, versus regular DRAM, DDR3).
Will GP-GPUs survive, or will improvements in CPU data-level parallelism make them redundant? On the same die, the CPU and GPU should have the same memory bandwidth, though the GPU might have more FLOPS, since these are needed for graphics anyway.
Another Hardware Issue: the Memory Model for Multi-core (implicit, and hard to handle)
Symmetric Multiprocessors (SMPs)
Processors and memory share a common CPU-memory bus; a bridge connects it to an I/O bus with I/O controllers, graphics output, and network interfaces.
"Symmetric" means all memory is equally far away from all processors, and any processor can do any I/O (e.g., set up a DMA transfer).
Synchronization
The need for synchronization arises whenever there are concurrent processes in a system (even in a uniprocessor system).
Producer-consumer: a consumer process must wait until the producer process has produced data.
Mutual exclusion: ensure that only one process uses a shared resource at a given time (see the lock sketch below).
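A minimal pthreads sketch of mutual exclusion; the lock and resource names are illustrative:

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static long shared_resource;

void use_resource(void)
{
    pthread_mutex_lock(&lock);    /* at most one thread past this point */
    shared_resource++;            /* critical section */
    pthread_mutex_unlock(&lock);
}
```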
A Producer-Consumer Example
A shared FIFO queue in memory, with head and tail pointers.

Producer (posting item x):
  Load Rtail, (tail)
  Store (Rtail), x
  Rtail = Rtail + 1
  Store (tail), Rtail

Consumer:
  Load Rhead, (head)
spin:
  Load Rtail, (tail)
  if Rhead == Rtail goto spin
  Load R, (Rhead)
  Rhead = Rhead + 1
  Store (head), Rhead
  process(R)

The program is written assuming instructions are executed in order. Problems?
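One problem: if the hardware or compiler reorders memory accesses, the consumer can observe the new tail before the producer's store of x is visible, and read a stale slot. Below is a hedged single-producer/single-consumer sketch in C11 that fixes this with release/acquire ordering; the buffer size and all names are illustrative, and overflow checking is omitted:

```c
#include <stdatomic.h>

#define N 1024
static int buf[N];
static atomic_uint head, tail;   /* both start at 0 */

/* Producer: write the slot first, then publish it with a release store. */
void produce(int x)
{
    unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
    buf[t % N] = x;
    atomic_store_explicit(&tail, t + 1, memory_order_release);
}

/* Consumer: the acquire load of tail guarantees the matching slot
 * write is visible before buf is read. */
int consume(void)
{
    unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
    while (atomic_load_explicit(&tail, memory_order_acquire) == h)
        ;                        /* spin: queue empty */
    int x = buf[h % N];
    atomic_store_explicit(&head, h + 1, memory_order_release);
    return x;
}
```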
Performance of Symmetric Shared-Memory Multiprocessors
Cache performance is a combination of:
1. uniprocessor cache-miss traffic, and
2. traffic caused by communication, which results in invalidations and subsequent cache misses.
This adds a 4th C, the coherence miss (sometimes called a communication miss), to compulsory, capacity, and conflict misses.
Coherence Misses
True sharing misses arise from the communication of data through the cache-coherence mechanism: invalidations due to the first write to a shared block, and reads by another CPU of a block modified in a different cache. Such a miss would still occur if the block size were one word.
False sharing misses occur when a block is invalidated because some word in the block, other than the one being read, is written. The invalidation does not communicate a new value; it only causes an extra cache miss. The block is shared, but no word in it is actually shared; the miss would not occur if the block size were one word.
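A hedged pthreads sketch of false sharing and the usual fix, padding each thread's counter out to a presumed 64-byte cache line; all names and the line size are assumptions:

```c
#include <pthread.h>

/* Two counters in the same cache line ping-pong between cores even
 * though each thread touches only its own counter. Padding keeps
 * them in separate blocks. */
struct padded { long count; char pad[64 - sizeof(long)]; };

static struct padded counters[2];   /* one line each: no false sharing   */
static long shared_pair[2];         /* likely same line: false sharing   */

static void *worker(void *arg)
{
    int id = *(int *)arg;
    for (long i = 0; i < 100000000L; i++)
        counters[id].count++;       /* swap in shared_pair[id]++ to compare */
    return 0;
}

int main(void)
{
    pthread_t t[2];
    int ids[2] = {0, 1};
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], 0, worker, &ids[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], 0);
    return 0;
}
```

Timing the two variants (compile with -pthread) typically shows the padded version running several times faster, even though neither version shares any data.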
Homework
Readings: read the book; read "Parallel Processors from Client to Cloud.pdf"; read Appendix B, "TH-2 HPC", in Computer Organization and Design (COD), Fifth Edition.
Homework: hand in HW4; HW5; Project 2.
Acknowledgements
These slides contain material from the following courses: UCB CS152, Stanford EE108B, and MIT 6.823.