Models of Parallel Computation
References: W+A, Appendix D; Culler, D., R. Karp, D. Patterson, et al., "LogP: Towards a Realistic Model of Parallel Computation," PPoPP, May 1993; Alpern, B., L. Carter, and J. Ferrante, "Modeling Parallel Computers as Memory Hierarchies," Programming Models for Massively Parallel Computers, Giloi, W. K., S. Jahnichen, and B. D. Shriver, eds., IEEE Press, 1993. CSE 160/Berman
Computation Models A model provides an underlying abstraction that is useful for analyzing costs and designing algorithms. Serial algorithm design uses the RAM or the TM (Turing machine) as its underlying model. CSE 160/Berman
RAM [Random Access Machine] A RAM consists of: an unalterable program of optionally labeled instructions; a memory composed of a sequence of words, each capable of containing an arbitrary integer; an accumulator, referenced implicitly by most instructions; a read-only input tape; and a write-only output tape. CSE 160/Berman
RAM Assumptions We assume that all instructions take the same time to execute, that word length is unbounded, that the RAM has an arbitrary amount of memory, and that any memory location can be accessed in the same amount of time. The RAM provides an idealized model of a serial computer for analyzing the efficiency of serial algorithms. CSE 160/Berman
PRAM [Parallel Random Access Machine] The PRAM provides an idealized model of a parallel computer for analyzing the efficiency of parallel algorithms. A PRAM is composed of: P unmodifiable programs, each composed of optionally labeled instructions; a single shared memory composed of a sequence of words, each capable of containing an arbitrary integer; P accumulators, one associated with each program; a read-only input tape; and a write-only output tape. CSE 160/Berman
More PRAM PRAM is a synchronous, MIMD, shared memory parallel computer. Different protocols can be used for reading and writing shared memory. EREW (exclusive read, exclusive write) CREW (concurrent read, exclusive write) CRCW (concurrent read, concurrent write) -- requires additional protocol for arbitrating write conflicts PRAM can emulate a message-passing machine by logically dividing shared memory into private memories for the P processors. CSE 160/Berman
Broadcasting on a PRAM “Broadcast” can be done on a CREW PRAM in O(1) time: the broadcaster writes the value to shared memory, and then all processors read it from shared memory (a small sketch follows below). CSE 160/Berman
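To make the two PRAM steps concrete, here is a minimal Python sketch; it models the shared memory as a list and the synchronous PRAM step boundary as a barrier. The names (shared_mem, processor, and so on) are illustrative, not part of any standard API.

import threading

# CREW PRAM broadcast sketch: an exclusive write by the broadcaster,
# followed by a concurrent read by all processors.
P = 8
shared_mem = [None]                 # one shared memory cell
local = [None] * P                  # each processor's private copy
step_barrier = threading.Barrier(P)

def processor(pid, value=None):
    if pid == 0:                    # step 1: only the broadcaster writes
        shared_mem[0] = value
    step_barrier.wait()             # end of the PRAM step (synchronous model)
    local[pid] = shared_mem[0]      # step 2: every processor reads the cell

threads = [threading.Thread(target=processor, args=(i, 42 if i == 0 else None))
           for i in range(P)]
for t in threads: t.start()
for t in threads: t.join()
print(local)                        # every processor now holds 42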
LogP machine model Model of a distributed-memory multicomputer, developed by Culler, Karp, Patterson, et al. The authors tried to model the prevailing parallel architectures (circa 1993). The machine model represents the prevalent MPP organization: a machine constructed from at most a few thousand nodes; each node contains a powerful processor and substantial memory; the interconnection structure has limited bandwidth and significant latency. CSE 160/Berman
LogP parameters L: upper bound on latency incurred by sending a message from a source to a destination o: overhead, defined as the time the processor is engaged in sending or receiving a message, during which time it cannot do anything else g: gap, defined as the minimum time between consecutive message transmissions or receptions P: number of processor/memory modules CSE 160/Berman
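As a quick illustration of how the parameters combine, here is a small Python sketch of the basic LogP cost bookkeeping. It is a sketch under the usual reading of the model (all times in processor cycles); the function names and the example values L = 6, o = 2, g = 4 (the values that match the broadcast figure later in these slides) are assumptions, not part of the model's definition.

def point_to_point_time(L, o):
    # Sender busy for o cycles, message in the network for L cycles,
    # receiver busy for o cycles: L + 2o total.
    return o + L + o

def max_messages_in_transit(L, g):
    # Capacity constraint: at most ceiling(L/g) messages may be in transit
    # from any one processor to any other at one time.
    return -(-L // g)               # ceiling division

print(point_to_point_time(6, 2))      # 10 cycles
print(max_messages_in_transit(6, 4))  # 2 messages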
LogP Assumptions The network has finite capacity: at most ceiling(L/g) messages can be in transit from any one processor to any other at one time. Communication is asynchronous: the latency and the order of arrival of messages are unpredictable. All messages are small. Context-switching overhead is 0 (not modeled). Multithreading (virtual processors) may be employed, but only up to a limit of L/g virtual processors. CSE 160/Berman
LogP notes All parameters measured in processor cycles Local operations take one cycle Messages are assumed to be small LogP was particularly well-suited to modeling CM-5. Not clear if the same correlation is found with other machines. CSE 160/Berman
LogP Analysis of PRAM Broadcasting Algorithm Broadcaster sends value to shared memory (we’ll assume the value starts in P0’s memory). Processors read from shared memory (i.e., the other P-1 processors each receive a message from P0). Since consecutive sends must be at least g apart (and g >= o), P0 is busy injecting its P-1 messages until time (P-2)g + o. The last processor finishes receiving at (P-2)g + o + L + o = (P-2)g + L + 2o (a small cost calculator follows below). CSE 160/Berman
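The following Python sketch just evaluates that expression; the parameter values in the example (L = 6, o = 2, g = 4, P = 8) are the ones implied by the figure on the later slides, not values stated here.

def linear_broadcast_time(P, L, o, g):
    # P0 injects message k (k = 1..P-1) at time (k-1)*g; that message is
    # fully received at (k-1)*g + o + L + o. The last one dominates.
    return (P - 2) * g + o + L + o

print(linear_broadcast_time(8, 6, 2, 4))   # 34 cycles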
Efficient Broadcasting in LogP Model The gap includes the overhead time, so overhead < gap (o < g). [Figure: timing diagram of an efficient LogP broadcast among processors P0-P7, with the time axis annotated with o, g, and L.] CSE 160/Berman
Mapping induced by LogP Broadcasting algorithm on 8 processors [Figure: broadcast tree rooted at P0 over processors P0-P7; each processor is labeled with the cycle at which it receives the value (10, 14, 18, 20, 22, 24), and the time axis is annotated with o, g, and L.] CSE 160/Berman
Analysis of LogP Broadcasting Algorithm to 7 Processors Time for the first processor reached (P5) to receive the message from P0 is L+2o. Time for the last processor to receive the message is max{3g+L+2o, 2g+L+2o, g+2L+4o, 4o+2L, g+4o+2L} = max{3g+L+2o, g+2L+4o}. Compare to the LogP analysis of the PRAM-style broadcast, (P-2)g + L + 2o = 6g + L + 2o for P = 8; with the figure’s parameter values (L = 6, o = 2, g = 4) that is 34 cycles, versus 24 cycles for the tree. [Figure: same timing diagram as the previous slide.] CSE 160/Berman
Scalable Performance LogP Broadcast utilizes a tree structure to optimize broadcast time; the shape of the tree depends on the values of L, o, g, and P. This strategy is much more scalable (and ultimately more efficient) than the PRAM-style broadcast (a simulation sketch follows below). [Figure: the same broadcast tree, with per-processor arrival times 10, 14, 18, 20, 22, 24.] CSE 160/Berman
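Here is a minimal Python simulation of the greedy tree broadcast: every processor that already holds the value keeps forwarding it as soon as it is able. This is a sketch, not the authors' code; it assumes a newly informed processor may begin sending once its receive overhead completes, and that a sender may issue another message g cycles after its previous send. With the parameter values implied by the figure (L = 6, o = 2, g = 4, P = 8) it reproduces the completion time of 24 cycles.

import heapq

def tree_broadcast_time(P, L, o, g):
    informed = 1
    ready = [0]                      # times at which informed processors can start a send
    finish = 0
    while informed < P:
        t = heapq.heappop(ready)     # earliest available sender
        arrival = t + o + L + o      # the new processor has fully received the value
        finish = max(finish, arrival)
        informed += 1
        heapq.heappush(ready, t + g)       # this sender can send again after the gap
        heapq.heappush(ready, arrival)     # the new processor can start forwarding
    return finish

print(tree_broadcast_time(8, 6, 2, 4))     # 24 cycles, matching the figure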
Moral An analysis can be no better than its underlying model; the more accurate the model, the more accurate the analysis. (This is why we use the TM to determine undecidability but the RAM to analyze complexity.) CSE 160/Berman
Other Models used for Analysis BSP (Bulk Synchronous Parallel): a slight precursor to, and competitor of, LogP. PMH (Parallel Memory Hierarchy): focuses on memory costs. CSE 160/Berman
BSP [Bulk Synchronous Parallel] BSP was proposed by Valiant. The BSP model consists of: P processors, each with local memory; a communication network for point-to-point message passing between processors; and a mechanism for synchronizing all or some of the processors at defined intervals. CSE 160/Berman
BSP Programs BSP programs are composed of supersteps. In each superstep, processors execute L computational steps using locally stored data, and send and receive messages. Processors are synchronized at the end of the superstep (at which time all messages have been received). BSP programs can be implemented through mechanisms like the Oxford BSP library (C routines for implementing BSP programs) and BSP-L. [Figure: supersteps separated by synchronization barriers.] CSE 160/Berman
BSP Parameters P: number of processors (with memory). L: synchronization periodicity. g: communication cost. s: processor speed (measured in number of time steps/second). A processor sends at most h messages and receives at most h messages in a single superstep (such a communication pattern is called an h-relation). CSE 160/Berman
BSP Notes Complete program = set of supersteps Communication startup not modeled, g is for continuous traffic conditions Message size is one data word More than one process or thread can be executed by a processor. Generally assumed that computation and communication are not overlapped Time for a superstep = max number of local operations performed by any processor + g(max number of messages sent or received by a processor) + L CSE 160/Berman
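As a small illustration of the superstep cost expression above, here is a Python sketch. The variable names w and h follow the usual BSP convention (they are not defined on the slide), and the example numbers are made up.

def superstep_time(local_ops, msgs_sent, msgs_recv, g, L):
    w = max(local_ops)                                        # slowest local computation
    h = max(max(s, r) for s, r in zip(msgs_sent, msgs_recv))  # h-relation size
    return w + g * h + L

# Example: 4 processors, one does extra local work, one sends 3 messages.
print(superstep_time(local_ops=[100, 80, 120, 90],
                     msgs_sent=[3, 1, 0, 1],
                     msgs_recv=[0, 2, 2, 1],
                     g=4, L=50))        # 120 + 4*3 + 50 = 182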
BSP Analysis of PRAM Broadcast Algorithm: Broadcaster sends value to shared memory (we’ll assume the value is in P0’s memory); processors read from shared memory (the other processors receive messages from P0). In the BSP model a processor is only allowed to send or receive at most h messages in a single superstep, so a broadcast to more than h processors requires a tree structure. If there were more than Lh processors, then a tree broadcast would require more than one superstep. How much time does it take for a P-processor broadcast? CSE 160/Berman
BSP Analysis of PRAM Broadcast How much time does it take for a P-processor broadcast? … h-ary tree (one possible analysis is sketched below). CSE 160/Berman
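The slide leaves the answer elided; the Python sketch below fills it in under one common reading, which may not be exactly the intended one: the broadcast proceeds level by level down a strict h-ary tree, one tree level per superstep, and each superstep costs at most gh + L (ignoring local work). This gives roughly ceil(log_h P) supersteps, i.e. a total time of about ceil(log_h P) * (gh + L).

def bsp_hary_broadcast(P, h, g, L):
    # Strict h-ary tree, one level per superstep: the processors informed in
    # the previous superstep each forward the value to h new processors.
    supersteps, informed, frontier = 0, 1, 1
    while informed < P:
        frontier *= h               # processors reached in this superstep
        informed += frontier
        supersteps += 1
    return supersteps, supersteps * (g * h + L)

# Example (made-up numbers): 64 processors, fan-out 4, g = 4, L = 50.
print(bsp_hary_broadcast(P=64, h=4, g=4, L=50))   # (3, 198)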
PMH [Parallel Memory Hierarchy] Model PMH seeks to represent memory. The goal is to model algorithms so that good decisions can be made about where to allocate data during execution. The model represents the costs of interprocessor communication and of memory-hierarchy traffic (e.g., between main memory and disk, or between registers and cache). Proposed by Alpern, Carter, and Ferrante. CSE 160/Berman
PMH Model The computer is modeled as a tree of memory modules with the processors at the leaves. All data movement takes the form of block transfers between children and their parents. A PMH is composed of a tree of modules: all modules hold data; leaf modules also perform computation; the data in a module is partitioned into blocks. Each module has 4 parameters. CSE 160/Berman
Un-parameterized PMH Models for a Cluster of Workstations [Figure: two module trees built from a network, main memories, disks, caches, and ALU/registers. In the first, each workstation has its own disks, so bandwidth from a processor to its disk > bandwidth from the processor to the network; in the second, the disks form a shared disk system reached through the network, so bandwidth between 2 processors > bandwidth to disk.] CSE 160/Berman
PMH Module Parameters Blocksize s_m tells how many bytes there are per block of m. Blockcount n_m tells how many blocks fit in m. Childcount c_m tells how many children m has. Transfer time t_m tells how many cycles it takes to transfer a block between m and its parent. The size of a "node" and the length of an "edge" in the PMH graph should correspond to the blocksize, blockcount, and transfer time. Generally all modules at a given level of the tree will have the same parameters (a small sketch follows below). CSE 160/Berman
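Here is a minimal Python sketch of a PMH module and a block-transfer cost estimate based on the four parameters above. The class, the helper methods, and the example numbers are illustrative assumptions (they are not taken from the Alpern/Carter/Ferrante paper).

from dataclasses import dataclass
from math import ceil

@dataclass
class Module:
    blocksize: int      # s_m: bytes per block of this module
    blockcount: int     # n_m: number of blocks the module holds
    childcount: int     # c_m: number of children in the tree
    transfer_time: int  # t_m: cycles to move one block to/from the parent

    def capacity_bytes(self):
        return self.blocksize * self.blockcount

    def transfer_cost(self, nbytes):
        # Cycles to move nbytes between this module and its parent,
        # counted in whole blocks.
        return ceil(nbytes / self.blocksize) * self.transfer_time

# Example: a cache-like module with 64-byte blocks.
cache = Module(blocksize=64, blockcount=512, childcount=1, transfer_time=20)
print(cache.capacity_bytes())      # 32768 bytes
print(cache.transfer_cost(1000))   # 16 blocks * 20 cycles = 320 cycles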
Summary The goal of parallel computation models is to provide a realistic representation of the costs of programming. A model gives algorithm designers and programmers a measure of algorithm complexity that helps them decide what is “good” (i.e., performance-efficient). Next up: Mapping and Scheduling. CSE 160/Berman