Multiprocessor Architectures and Parallel Programs
What’s It All About, Then?
An old definition: a parallel computer is a “collection of processing elements that communicate and cooperate to solve large problems fast.”
Is life that easy?! What about:
- “Collection” size and scalability
- Processing element power
- Communication infrastructure
- Communication protocols for cooperation
- Data sharing
- …
The answer (and more) is in ECE 5504, so stick around!
Notes: The definition is not enough to describe a parallel architecture. Life is not as simple as it appears in the definition; we need to answer many, many questions, and that is what this course is here for.
Why Are We Going Parallel?
- Application trends
  - Performance: “travel into the future using Moore’s Law!” Ticket price is an exponential function of distance! (how?)
  - Parallelizable examples: scientific/engineering, commercial (do you know Toy Story’s Buzz Lightyear?!)
- Technology trends (VLSI)
  - Component size decreasing
  - Useful chip area increasing
  - Multiple processors on a chip possible
- Architectural trends: see the next couple of slides
Notes: The wording of the quote above is mine, and I take the blame or the credit for it; the main idea, however, is not something I came up with. The Toy Story movie was produced on a parallel computer system composed of 100s of Sun workstations. The inflection point for microprocessor design in the mid-1980s came with the arrival of processors with 32-bit words and caches.
Recent History of Microprocessor Design
- Up to the mid-1980s: bit-level parallelism (word size); limited gain after a certain point
- Mid-1980s to mid-1990s: instruction-level parallelism (pipelined and superscalar processors)
  - Bigger caches to keep more instructions ready
  - Branch predictors
  - Replacement of CISC by RISC
  - Problems: costly cache misses (performance-wise), very costly design (money-wise)
- Natural question: “What is next?” Answer: process/thread-level parallelism
Recent History of Computing Systems
- Since the mid-1980s, microprocessors have supported multiprocessor configurations
- The multiprocessor trend migrated to the desktop
- Mid-1990s: the shared-memory multiprocessor trend covered everything from servers to desktops
- Case study: Pentium Pro (1995): four processors wired directly to the shared bus, with no glue logic and no extra bus drivers
- Software vendors (e.g., database companies) call for parallel architectures side by side with the hardware
Notes: The difference between large servers and small desktops is in the number of processors per system: desktops support a few, large servers support 10s, and large commercial systems support 100s.
Supercomputing
- Used mostly for scientific computing
- Two main trends:
  - Vector processors, e.g., Cray X-MP (2 then 4 processors), Cray Y-MP (8), Cray T94 (32)
  - Microprocessor-based massively parallel processors (MPPs): 100s then 1000s of processors
- MPPs now dominate; even Cray Research announced the T3D, based on DEC Alpha processors
Vector Processors vs. MPPs
Cray started producing MPPs too, in 1993!
Common Parallel Architectures
- Shared address space
- Message passing
- Data parallel
- Dataflow
- Systolic
Shared Address Space (1)
- Remember “shared-memory” multiprocessing for interprocess communication, and logical vs. physical address spaces?
- Processes define shared portions of their address spaces
  [Figure: private and shared regions of process i’s and process j’s address spaces mapping onto physical memory]
- A little history: increase memory capacity (and maybe bandwidth) by adding memory blocks; the same for I/O (which requires direct memory access) and for processing capacity
Notes: Several memory blocks may be needed for historical reasons.
Shared Address Space (2)
- An interconnect is required between I/O and memory, and between processors and memory
- Dancehall organization, also called UMA (Uniform Memory Access)
  [Figure: dancehall organization; memory modules (M) on one side of the interconnect, processors (P) with caches ($) and I/O on the other]
- SMP: Symmetric Multiprocessor
- Interconnect choices and their scalability (cost, performance):
  1. Crossbar: ( , )
  2. Multistage network: ( , )
  3. Bus (SMP): ( , )
Shared Address Space (3)
- Dancehall UMA model vs. NUMA (Nonuniform Memory Access) model
  [Figure: NUMA organization; each processor (P) with its cache ($) has a local memory block (M), all connected by a scalable network]
- Each processor has a local memory block
- In either model, the user-level operations are load and store (a minimal sketch follows)
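As a concrete illustration of the load/store programming model (not on the original slides), here is a minimal C/pthreads sketch in which two threads communicate purely by writing and reading a shared array; the array contents, thread count, and partitioning are assumptions chosen for illustration.

    #include <pthread.h>
    #include <stdio.h>

    #define N 8

    /* Shared data: visible to all threads through ordinary loads and stores. */
    static int shared_data[N];

    /* Each worker fills its half of the shared array (a hypothetical partitioning). */
    static void *worker(void *arg) {
        int id = *(int *)arg;               /* 0 or 1 */
        for (int i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            shared_data[i] = i * i;         /* plain store into shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[2];
        int ids[2] = {0, 1};

        for (int k = 0; k < 2; k++)
            pthread_create(&t[k], NULL, worker, &ids[k]);
        for (int k = 0; k < 2; k++)
            pthread_join(t[k], NULL);       /* join doubles as the synchronization point */

        for (int i = 0; i < N; i++)
            printf("%d ", shared_data[i]);  /* plain loads from shared memory */
        printf("\n");
        return 0;
    }

Note that no explicit communication call appears anywhere: the sharing happens entirely through the address space.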
Message Passing (1)
- Like “message passing” for interprocess communication, except communication is among processors
- Like the shared address space’s NUMA, except communication is at the I/O level
- Like networks/clusters, except:
  - Node packaging is tighter
  - No human I/O devices per node
  - The network is much faster than a LAN
Message Passing (2)
- User-level operations are send and receive (a minimal sketch follows)
  - FIFO and blocking operations
  - DMA and non-blocking operations
- Topologies: hypercube, ring, grid
- Communication:
  - Between neighbors only (classic)
  - Between arbitrary nodes: store & forward (modern)
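To make the send/receive model concrete, here is a minimal MPI sketch in C (MPI is one common realization of message passing and is not named on the slide); rank 0 sends an integer to rank 1 using blocking operations. The payload and tag values are arbitrary.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int value = 42;                                   /* arbitrary payload */
            /* Blocking send: returns once the buffer may be reused. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int value;
            /* Blocking receive: waits until a matching message arrives. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Here the only way data moves between processors is through explicit send/receive pairs; there is no shared address space.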
Got Confused Enough?
- That’s the point; architectures are converging!
- Blurry boundary at the user level:
  - Message passing supported on shared memory
  - Shared address space supported on message passing, either user-defined or at the paging level (like virtual memory, but the “disk” is remote memory)
- Fast LAN-based clusters are converging toward parallel machines
Data Parallel and Flynn’s Taxonomy (1972)
- Taxonomy based on:
  - Number of instruction streams issued
  - Number of data elements worked on
- Resulting classes:
  - SISD: single instruction stream, single data stream
  - SIMD: single instruction stream, multiple data streams
  - MISD: multiple instruction streams, single data stream (?)
  - MIMD: multiple instruction streams, multiple data streams
- “I thought you were gonna talk about data parallel!!” Well, SIMD is also called data-parallel processing; the compiler generates the low-level parallel code (a small data-parallel loop is sketched below)
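As a small illustration of the data-parallel style (not on the original slides), the C loop below applies the same operation to every element of an array; the OpenMP simd directive is one possible way to ask the compiler to emit SIMD code for it, chosen here as an assumption for illustration.

    #include <stdio.h>

    #define N 16

    int main(void) {
        float x[N], y[N];
        const float a = 2.0f;

        for (int i = 0; i < N; i++) {   /* initialize with arbitrary data */
            x[i] = (float)i;
            y[i] = 1.0f;
        }

        /* Same operation applied to every element: a data-parallel computation.
           The pragma (OpenMP 4.0+) hints that the compiler may vectorize the loop;
           without OpenMP support the pragma is simply ignored. */
        #pragma omp simd
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        for (int i = 0; i < N; i++)
            printf("%g ", y[i]);
        printf("\n");
        return 0;
    }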
Dataflow Architecture
- Program represented by a graph
- Simple example: z = (x + y) * (u − v)
  [Figure: dataflow graph; x and y feed a + node, u and v feed a − node, and their results feed a * node producing z, with the operations assigned to processing elements over a network]
- Static vs. dynamic operation-to-processor mapping
Systolic Architecture
- Array of simple processing elements (PEs)
  [Figure: array of processing elements]
- Like SIMD data parallel, except each PE may perform a different operation
What About Distributed Systems?
- The main purpose of a distributed system is to make use of resources that are distributed geographically or across several machines
- Exploiting parallelism is not necessarily the purpose of using a distributed system
- Distributed object programming examples:
  - CORBA (Common Object Request Broker Architecture)
  - DCOM (Distributed Component Object Model)
  - Java RMI (Remote Method Invocation)
- Using distributed object programming does not guarantee parallelism
Parallel Programs
Why Bother?
This is an “architecture” course. Why would we care about software?
- As a parallel computer architect:
  - To understand the effect of the target software on the design’s degrees of freedom
  - To understand the role and limitations of the architecture
- As an algorithm designer:
  - To understand how to design effective parallel algorithms (parallel algorithm design can differ significantly from sequential algorithm design!)
Why Bother? (Continued)
- As a programmer:
  - To obtain the best performance out of a parallel system through careful coding
- For parallel programs and systems:
  - The interaction between program and system is very strong
  - There is at least one more degree of freedom: the number of processors
Parallelizing a Once-Sequential Program
[Figure: a sequential execution of time ts, split into a parallelizable part and an inherently sequential part, shown next to its parallelized counterpart of time tp]
Speedup = ts / tp
Example: Simulating Ocean Currents
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
- Model the ocean as two-dimensional grids, discretized in space and time; finer spatial/temporal resolution means greater accuracy
- Many different computations per time step: set up and solve equations
- Concurrency exists both across and within the grid computations (a small grid-sweep sketch follows)
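As a hedged illustration of one such grid computation (not the actual Ocean application code), the C sketch below performs a single Jacobi-style sweep in which every interior point is updated from its four neighbors; all interior updates are independent, which is where the concurrency within a grid computation comes from. The grid size and update rule are assumptions for illustration.

    #include <stdio.h>

    #define N 6   /* tiny grid for illustration; the real application uses much larger grids */

    int main(void) {
        double a[N][N], b[N][N];

        /* Initialize: boundary fixed at 1.0, interior at 0.0 (arbitrary choice). */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = (i == 0 || j == 0 || i == N - 1 || j == N - 1) ? 1.0 : 0.0;

        /* One sweep: each interior point becomes the average of its four neighbors.
           Every (i, j) update reads only the old grid, so all updates could run in parallel. */
        for (int i = 1; i < N - 1; i++)
            for (int j = 1; j < N - 1; j++)
                b[i][j] = 0.25 * (a[i - 1][j] + a[i + 1][j] + a[i][j - 1] + a[i][j + 1]);

        printf("b[1][1] after one sweep = %g\n", b[1][1]);
        return 0;
    }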
How Is It Done?
- Manually: by the programmer at design time
- Automatically: by the compiler at compile time, or by the operating system at run time (still a research problem)
Terminology
- Task: smallest unit of concurrency a parallel program can exploit (done by one processor)
- Process: abstract entity performing a subset of the tasks (this is not an OS definition)
- Thread: used here interchangeably with process (non-OS definition)
- Processor: physical entity executing processes
Parallelization Steps (1)
- Partitioning:
  - Decomposition: of the computation into tasks
  - Assignment: of tasks to processes
- Orchestration: communication/synchronization among processes
- Mapping: of processes to processors
Parallelization Steps (2)
[Figure: sequential computation → (decomposition) → tasks → (assignment) → processes p0..p3 → (orchestration) → parallel program → (mapping) → parallel architecture; decomposition and assignment together form partitioning]
Decomposition
- Description: break down the sequential computation into tasks
- Goal: to expose concurrency
- Limitation: limited concurrency in the computation itself
- Architecture dependence: usually independent
Assignment
- Description: specifying how tasks are distributed over processes
- Goals:
  - Balance the workload (computation, data, I/O, communication)
  - Reduce interprocess communication
  - Reduce the run-time overhead of managing the assignment
- Types: static, or dynamic (reassign for better balance)
- Architecture dependence: usually independent
Orchestration
- Description: specifying how processes communicate and synchronize
- Goals:
  - Reduce the cost of interprocessor communication/synchronization
  - Preserve locality of data reference
  - Optimize the schedule
  - Reduce parallelism-management overhead
- Dependent on:
  - Architecture
  - Programming model (coding style and primitives provided)
  - Programming language
- Related issues:
  - Data organization
  - Local task scheduling (within a process) for locality exploitation
  - Explicit vs. implicit communication
  - Message size
  - Choice of communication/synchronization primitives
Mapping
- Description: specifying which processes run on which processors; a form of resource allocation
- Schemes:
  - Space sharing
  - Static vs. dynamic
  - Program vs. OS control
(the four steps are illustrated together in the sketch below)
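To tie the four steps together, here is a hedged C/pthreads sketch (not from the original slides) that parallelizes a simple array sum; the comments mark where decomposition, assignment, orchestration, and mapping show up. The array size, thread count, and variable names are illustrative assumptions.

    #include <pthread.h>
    #include <stdio.h>

    #define N      1000
    #define NPROCS 4             /* number of processes/threads (assumed) */

    static double a[N];
    static double global_sum = 0.0;
    static pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Decomposition: the tasks are the element-wise additions.
       Assignment: each thread statically gets one contiguous chunk of the array. */
    static void *partial_sum(void *arg) {
        int id = *(int *)arg;
        int chunk = N / NPROCS;
        int lo = id * chunk;
        int hi = (id == NPROCS - 1) ? N : lo + chunk;

        double local = 0.0;
        for (int i = lo; i < hi; i++)
            local += a[i];

        /* Orchestration: synchronize updates to the shared result. */
        pthread_mutex_lock(&sum_lock);
        global_sum += local;
        pthread_mutex_unlock(&sum_lock);
        return NULL;
    }

    int main(void) {
        pthread_t t[NPROCS];
        int ids[NPROCS];

        for (int i = 0; i < N; i++)
            a[i] = 1.0;                    /* arbitrary data; the exact sum is N */

        /* Mapping: which processor each thread runs on is left to the OS here. */
        for (int k = 0; k < NPROCS; k++) {
            ids[k] = k;
            pthread_create(&t[k], NULL, partial_sum, &ids[k]);
        }
        for (int k = 0; k < NPROCS; k++)
            pthread_join(t[k], NULL);

        printf("sum = %g (expected %d)\n", global_sum, N);
        return 0;
    }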
General Notes on Parallelization
- A task is the smallest unit of computation; therefore, decomposition can determine the number of processes (and hence processors) that can be used effectively
- Complete workload balance may be inherently unachievable for some computations
- Mapping, if not done by the user, is done by the OS (not necessarily optimally)
Speedup Analysis (1)
- Speedup = “sequential time” ÷ “parallel time”
- Assume:
  - A system with p processors
  - Tp: “parallel time”, with Tp = Tcomp + Tcomm
  - Ts: “sequential time”
  - Tcomp = s·Ts + (1 − s)·Ts/p, where s is the inherently sequential fraction of the program
- Speedup = Ts ÷ Tp, i.e.
  Speedup = Ts / (s·Ts + (1 − s)·Ts/p + Tcomm) = 1 / (s + (1 − s)/p + Tcomm/Ts)
Speedup Analysis (2): Amdahl’s Law
- Ignore communication time: Speedup = 1 / (s + (1 − s)/p)
- Amdahl’s limit: as p → ∞, Speedup → 1/s
  [Figure: speedup vs. number of processors for several values of the inherently sequential fraction s]
  s:    0.5   0.2   0.1   0.01   0.001
  1/s:  2     5     10    100    1000
Notes: For the case where s = 0.5, the best you can do is a speedup of 2, and that is with a virtually infinite number of processors.
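As a quick check of these numbers (not part of the original slides), the short C program below evaluates Amdahl’s law for the same values of s at a very large p, showing how the speedup approaches the 1/s limit; the particular choice of p is an assumption for illustration.

    #include <stdio.h>

    /* Amdahl's law, ignoring communication time. */
    static double amdahl_speedup(double s, double p) {
        return 1.0 / (s + (1.0 - s) / p);
    }

    int main(void) {
        const double s_values[] = {0.5, 0.2, 0.1, 0.01, 0.001};
        const double p = 1e6;                 /* "virtually infinite" processor count */

        for (int i = 0; i < 5; i++) {
            double s = s_values[i];
            printf("s = %-6g  speedup(p = %g) = %-8.1f  limit 1/s = %g\n",
                   s, p, amdahl_speedup(s, p), 1.0 / s);
        }
        return 0;
    }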
Speedup Analysis (3)
- Communication limit:
  - Assume Tcomm(p) = f(p)·Ts, where f is the communication-to-computation ratio
  - Assume a perfectly parallelizable program (s = 0) and do the math (worked out below)
- Degree-of-parallelism limit:
  - Degree of parallelism: the number of parallel operations in a program (the speedup cannot exceed it)
- Efficiency
- Linear speedup
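Working out the math hinted at on the slide (a sketch under the slide’s assumptions, with efficiency and linear speedup stated in their standard forms):

    % Communication limit: perfectly parallelizable program (s = 0),
    % with T_comm(p) = f(p) T_s.
    \[
      \text{Speedup}(p)
      = \frac{T_s}{T_s/p + f(p)\,T_s}
      = \frac{1}{1/p + f(p)}
      \;<\; \frac{1}{f(p)},
    \]
    % so communication alone caps the speedup at 1/f(p), no matter how many
    % processors are added.

    % Efficiency: speedup per processor; linear speedup means Speedup(p) = p,
    % i.e. efficiency equal to 1.
    \[
      E(p) = \frac{\text{Speedup}(p)}{p}, \qquad
      \text{linear speedup} \iff \text{Speedup}(p) = p \iff E(p) = 1 .
    \]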