MBG 1 CIS501, Fall 99 Lecture 23: Intro to Multi-processors Michael B. Greenwald Computer Architecture CIS 501 Fall 1999
Administrative stuff
  Final exam will be in room Moore 216, 8:30-10:30am on Thursday, December 16th.
  HW #6 delayed until Thursday, Dec. 9th.
  Project extension: no penalty if I get it by the time I show up tomorrow morning (Friday, 9am-ish).
  Final: open book? Vote
  Penn CISter’s women’s luncheon on Wednesday, December 8th, 12:30-2:30
    – Polar Bear Lounge (129 Pender)
    – Hosted by Professors Martha Palmer & Susan Davidson
    – questions?
Why multiprocessors?
  Exploit parallelism (duplicate every resource, so no structural hazards).
  Increase availability (single processors may fail but system remains robust).
  Simplify parallelization
  Goal: increase performance by factor of N, if there are N processors.
  Pay more money, increase speedup!
  Rarely achievable
Barriers to factor of N speedup
  Not all resources are duplicated (structural hazards)
    – High cost or low utilization
    – Need to maintain identity, or used for sharing information.
  Data dependencies:
    – A depends upon result of B, true dependencies
    – Name dependencies: false sharing
  Synchronization
    – x := 25; x := x+1; x := x+1; => individual reads and writes
    – Timing, Barriers
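The "individual reads and writes" point can be illustrated with a small sketch. It assumes, as a simplification, that each x := x+1 runs on a different processor and decomposes into one read and one write; the processor names A and B and the helper functions are hypothetical. The code enumerates every legal interleaving and collects the final values of x.

```python
from itertools import permutations

def run(schedule):
    """Execute one interleaving of two increments; each is a read then a write."""
    x = 25                      # x := 25
    reg = {}                    # private register of each (hypothetical) processor
    for proc, op in schedule:
        if op == "read":
            reg[proc] = x       # read shared x into a private register
        else:
            x = reg[proc] + 1   # write back the incremented private copy
    return x

def interleavings():
    """Yield every ordering in which each processor reads before it writes."""
    ops = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
    for order in permutations(ops):
        a_ok = order.index(("A", "read")) < order.index(("A", "write"))
        b_ok = order.index(("B", "read")) < order.index(("B", "write"))
        if a_ok and b_ok:
            yield order

results = sorted({run(s) for s in interleavings()})
print(results)  # both 26 (a lost update) and 27 appear
```

Without synchronization the second increment can read x before the first has written it, so one update is lost; making the read-modify-write atomic removes the 26 outcome.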
Impact of barriers: lack of duplication/structural hazards
  Well understood in CIS501:
    – Stalls
    – Bottleneck (e.g. shared bus)
    – Cost of arbitration
Impact of barriers: Data Dependencies
  Increased Memory Costs
    – Cache misses as memory goes from cache 1 to cache 2.
  Proc A stalls waiting for B to finish (lack of parallelism)
  Communication costs between subtasks
    – Stall waiting for data to be transmitted
    – Increased memory costs (more misses)
  False sharing
    – Example: 2 objects in 1 cache line.
    – Increases memory costs
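The false-sharing example (2 objects in 1 cache line) can be sketched numerically. The 64-byte line size and the addresses below are assumptions chosen for illustration, and the helper names are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value, real machines vary

def cache_line(addr, line_size=LINE_SIZE):
    """Index of the cache line that holds byte address addr."""
    return addr // line_size

def falsely_shared(addr_a, addr_b, line_size=LINE_SIZE):
    """True if two distinct objects land in the same cache line."""
    return addr_a != addr_b and cache_line(addr_a, line_size) == cache_line(addr_b, line_size)

counter_a = 0x1000   # written only by processor A (hypothetical address)
counter_b = 0x1008   # written only by processor B, 8 bytes later
padded_b = 0x1040    # same counter moved one full line further on

print(falsely_shared(counter_a, counter_b))
print(falsely_shared(counter_a, padded_b))
```

In the first case every write by A invalidates B's copy of the line and vice versa, even though the two processors never touch the same object; padding the objects onto separate lines trades memory for fewer misses.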
Impact of barriers: Synchronization
  Hotspot/Bottleneck (leads to data dependencies on lock)
  Increased communication
  Lack of parallelism (mutual exclusion)
Structure of Multiprocessors
  A multiprocessor has N processors, with some manner of shared memory or communications
  In what sense do they “run the same program”? (How do they process Instructions/Data?)
  Memory Hierarchy: How is the memory organized?
  Memory/Communication Interface: How is state shared?
Popular Flynn Categories
  SISD (Single Instruction Single Data)
    – Uniprocessors
  MISD (Multiple Instruction Single Data)
    – ??? (Image processing? Cellular automata?)
  SIMD (Single Instruction Multiple Data)
    – Examples: Illiac-IV, CM-2 (early multiprocessors, special purpose)
      » Simple programming model
      » Low overhead
      » Flexibility
      » All custom integrated circuits
  MIMD (Multiple Instruction Multiple Data)
    – Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
      » Flexible
      » Economy of scale (each microprocessor is the same as a commodity off-the-shelf uniprocessor)
      » Independent tasks can operate independently
Memory Organization
  Centralized Shared-memory architecture; also known as “UMA (Uniform Memory Access)”:
    – Shared bus (low latency, high throughput)
    – Shared physical memory (shared L3 cache?)
    – Shared I/O system
    – Separate L1 (and L2?) caches
  Distributed Memory architecture; NUMA, “cluster”:
    – Independent I/O, memory, and caches per processor
    – Scales memory bandwidth, I/O bandwidth, fast access to local memory
    – Large spectrum of interconnection networks (each node may be a UMA multiprocessor)
Memory Architecture, Communication Models
  Distributed Shared Memory vs. Message passing
  DSM
    – Load/Store
    – Addressing:
      » One physical address space
      » One virtual address space
  Message passing
    – Synchronous (RPC)
    – Asynchronous (Pure message passing)
      » (Null RPC makes this distinction less important).
Communication Models
  Shared Memory
    – Processors communicate via a shared address space
    – Easy on small-scale machines
    – Advantages:
      » Model of choice for uniprocessors, small-scale MPs
      » Ease of programming
      » Lower latency
      » Easier to use hardware-controlled caching
  Message passing
    – Processors have private memories, communicate via messages
    – Advantages:
      » Less hardware, easier to design
      » Focuses attention on costly non-local operations
  Can support either SW model on either HW base
Parallel Applications: What programs can usefully use a multiprocessor?
  What applications can we make parallel?
  Need independent computations
  SPLASH benchmark
Structure of parallel programs (Amdahl’s Law): never faster than setup + cleanup
  [Figure: program timelines built from the blocks Setup, Loop body 1, Loop body 2, Loop body 3, ..., Loop body (n-1), Loop body n, Cleanup]
Structure of parallel programs (Amdahl’s Law), continued
  [Figure: same Setup / Loop body 1 ... Loop body n / Cleanup timelines as the previous slide]
  Too simple!
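Even this too-simple picture gives a hard ceiling, which a few lines of arithmetic can show. The sketch below assumes that setup plus cleanup is purely serial and that the n loop bodies divide evenly over P processors with no communication cost; the function names and the 10/90 split are illustrative only.

```python
def amdahl_time(serial, parallel, p):
    """Run time with p processors: the serial part is untouched, the parallel part is split p ways."""
    return serial + parallel / p

def amdahl_speedup(serial, parallel, p):
    """Speedup relative to one processor."""
    return amdahl_time(serial, parallel, 1) / amdahl_time(serial, parallel, p)

serial = 10.0     # setup + cleanup (assumed time units)
parallel = 90.0   # all loop bodies together (assumed time units)

for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(serial, parallel, p), 2))
```

However many processors are added, the run time stays above setup + cleanup, so with these numbers the speedup stays below 100/10 = 10.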
Effect of parallelization
  If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
  Simple answer is “yes”, but...
  Reality is no: data dependencies; must communicate results from one sub-computation to another
    – Must spend the time transmitting data (throughput)
    – Must wait for data to arrive (latency)
Effect of parallelization
  If you divide a program block that takes time T(n) into P blocks, will each block take T(n)/P?
  Computation cost scales as 1/P
  Communication cost scales in an algorithm-specific way.
  Example: particle simulation.
    – 2-D grid: communication cost is O(1/sqrt(P)) per processor, which shrinks more slowly than the 1/P computation cost. Aggregate communication cost (P processors times 1/sqrt(P) each) therefore grows as sqrt(P) as we add processors, and the performance increase is sublinear.
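The particle-simulation scaling can be made concrete with a rough cost model. This is a sketch under stated assumptions: n grid points are split into P square blocks, computation is proportional to the points in a block, communication is proportional to the block's perimeter, and one unit of computation costs the same as one unit of communication. None of these constants come from the lecture.

```python
import math

def per_proc_costs(n, p, t_comp=1.0, t_comm=1.0):
    """Per-processor (compute, communication) cost for an n-point 2-D grid on p processors."""
    side = math.sqrt(n / p)          # edge length of one square block
    compute = t_comp * n / p         # scales as 1/P
    comm = t_comm * 4 * side         # perimeter exchange, scales as 1/sqrt(P)
    return compute, comm

def model_speedup(n, p, t_comp=1.0, t_comm=1.0):
    """Speedup over a single processor, which does no communication."""
    compute, comm = per_proc_costs(n, p, t_comp, t_comm)
    return (t_comp * n) / (compute + comm)

n = 1_000_000
for p in (100, 10_000):
    compute, comm = per_proc_costs(n, p)
    print(p, compute, comm, p * comm, round(model_speedup(n, p), 1))
```

Going from 100 to 10,000 processors cuts per-processor computation by 100x but per-processor communication only by 10x, so aggregate communication rises by 10x and efficiency falls.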
Effect of parallelization (continued)
  Inter-processor Communication is expensive.
    – Inter-proc Communication costs (computation/communication ratio only 1st order effect)
    – Memory costs (locality)
    – Redundant computation
  Trade off computation for communication
  Change memory layout (more cache misses on uni-processor, but fewer on multi-proc).
Fundamental Issues
  4 Issues to characterize parallel machines/systems
    1) Naming
    2) Synchronization
    3) Latency and Bandwidth
    4) Consistency
Fundamental Issue #1: Naming
  Naming: how to solve large problem fast
    – what data is shared
    – how it is addressed
    – what operations can access data
    – how processes refer to each other
  Choice of naming affects code produced by a compiler: with a shared address space a load only needs to remember the address; with message passing the code must keep track of processor number and local virtual address
  Choice of naming affects replication of data: via load in the cache memory hierarchy, or via SW replication and consistency
Fundamental Issue #1: Naming
  Global physical address space: any processor can generate address, and access it in a single operation
    – memory can be anywhere: virtual addr. translation handles it
  Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program
  Segmented shared address space: locations are named uniformly for all processes of the parallel program
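The difference between naming data by a single global address and naming it by processor number plus local address can be sketched as a simple mapping. The block-partitioned layout and the per-node memory size below are assumptions for illustration, not a description of any particular machine.

```python
MEM_PER_NODE = 1 << 20  # bytes of memory owned by each node (assumed)

def locate(global_addr, mem_per_node=MEM_PER_NODE):
    """Split a global address into (node number, local offset) under a block-partitioned layout."""
    return divmod(global_addr, mem_per_node)

def global_name(node, offset, mem_per_node=MEM_PER_NODE):
    """Inverse mapping: the single global address that names (node, offset)."""
    return node * mem_per_node + offset

addr = global_name(3, 0x40)
print(hex(addr), locate(addr))
```

With a global address space the hardware or translation layer performs the equivalent of locate on every load; with message passing the program carries the (node, offset) pair itself.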
Fundamental Issue #2: Synchronization
  To cooperate, processes must coordinate
  Message passing is implicit coordination with transmission or arrival of data
  Shared address space => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor, atomic operation
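The contrast between implicit and explicit coordination can be sketched with threads standing in for processors. This uses Python's standard queue.Queue and threading.Event as stand-ins for a message channel and a shared flag; it illustrates the programming model only, not multiprocessor hardware.

```python
import queue
import threading

def message_passing():
    """Coordination is implicit: the receiver blocks until the data itself arrives."""
    chan = queue.Queue()
    producer = threading.Thread(target=lambda: chan.put(42))
    producer.start()
    value = chan.get()          # returns only once the message has been sent
    producer.join()
    return value

def shared_memory():
    """Coordination is explicit: the writer must also set a flag the reader waits on."""
    shared = {"data": None}
    ready = threading.Event()   # the explicit "write a flag" operation

    def produce():
        shared["data"] = 42     # ordinary store to shared state
        ready.set()             # separate operation announcing the data is valid

    producer = threading.Thread(target=produce)
    producer.start()
    ready.wait()                # without this, the read could see None
    value = shared["data"]
    producer.join()
    return value

print(message_passing(), shared_memory())
```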
Fundamental Issue #3: Latency and Bandwidth
  Bandwidth
    – Need high bandwidth in communication
    – Cannot scale, but stay close
    – Match limits in network, memory, and processor
    – Overhead to communicate is a problem in many machines
  Latency
    – Affects performance, since processor may have to wait
    – Affects ease of programming, since requires more thought to overlap communication and computation
  Latency Hiding
    – How can a mechanism help hide latency?
    – Examples: overlap message send with computation, prefetch data, switch to other tasks
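The benefit of overlapping communication with computation can be estimated with a simple timing model. It assumes k identical steps, each needing one remote fetch of fixed latency followed by a fixed amount of computation, and it assumes the fetch for the next step can be issued while the current step computes. The function names and numbers are illustrative.

```python
def time_no_overlap(k, compute, latency):
    """Each step waits for its data, then computes."""
    return k * (compute + latency)

def time_with_prefetch(k, compute, latency):
    """The fetch for step i+1 is issued while step i computes."""
    if k == 0:
        return 0
    # First fetch is fully exposed; each later step is paced by the slower of
    # compute and fetch; the last step has only its computation left.
    return latency + (k - 1) * max(compute, latency) + compute

k, compute, latency = 100, 10, 10
print(time_no_overlap(k, compute, latency), time_with_prefetch(k, compute, latency))
```

When latency is no larger than the computation per step, nearly all of it is hidden; when latency dominates, overlap still helps but the run time is bounded below by the communication.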
SMP Interconnect
  Processors to Memory AND to I/O
  Bus based: all memory locations have equal access time, so SMP = “Symmetric MP”
    – Limited BW is shared as processors and I/O are added
    – (see Chapter 1, Figs 1-18/19, page of [CSG96])
  Crossbar: expensive to expand
  Multistage network (less expensive to expand than crossbar with more BW)
  “Dance Hall” designs: All processors on the left, all memories on the right