LogP Model: Motivation
BSP Model
–Limited to BW of network (g) and load of PE
–Requires large load per superstep
Need better models for portable algorithms
Converging hardware
–Independent from network topology
–Programming models
Assumption
–Number of data elements much bigger than number of PEs

Parameters
L: Latency –delay on the network
o: Overhead on PE
g: Gap –minimum interval between consecutive messages (due to bandwidth)
P: Number of PEs
Note: L, o, g are independent of P and of node distances
Message length: short message
–L, o, g are per word or per message of fixed length
–k-word message: k short messages (k*o overhead)
–L is independent of message length

Parameters (continued)
Bandwidth: 1/g * unit message length
Number of messages in transit to or from each PE: at most L/g
Send-to-receive total time: L+2o
If o >> g, ignore o
–Similar to BSP except no synchronization step
No communication/computation overlapping
–Overlapping gives a speed-up factor of at most two
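The parameter definitions above can be turned into a few one-line cost formulas. The sketch below is illustrative only: the class name `LogP` and its method names are hypothetical, the sample values L=6, o=2, g=4 are borrowed from the CM5 slide, and the k-word cost assumes the sender issues one short message every max(g, o) time units.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Hypothetical container for the four LogP parameters."""
    L: float  # network latency
    o: float  # per-message overhead on the PE
    g: float  # minimum gap between consecutive messages
    P: int    # number of PEs

    def point_to_point(self) -> float:
        # Send overhead + latency + receive overhead.
        return self.L + 2 * self.o

    def capacity(self) -> int:
        # Upper bound on messages in transit to or from one PE.
        return math.ceil(self.L / self.g)

    def k_word_message(self, k: int) -> float:
        # k short messages, issued max(g, o) apart (an assumption);
        # the last one still needs o + L + o to be received.
        return (k - 1) * max(self.g, self.o) + self.L + 2 * self.o


m = LogP(L=6, o=2, g=4, P=8)
print(m.point_to_point(), m.capacity(), m.k_word_message(3))
```

With these sample values a single short message takes L+2o = 10 time units, and a 3-word message completes 2*g later.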

Broadcast
Optimal broadcast tree
[Figure: broadcast tree over P0..P7 for P=8, L=6, g=4, o= ; timeline for p0 and p1 marking o, g, L]
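One way to reproduce a broadcast tree like the one in the figure is a greedy schedule in which every PE that already holds the datum keeps sending to new PEs as early and as often as the gap allows. The sketch below is an assumption-laden illustration, not the slide's own construction: the function name `broadcast_times` is hypothetical, and o=2 is taken from the CM5 slide because the value on this slide did not survive.

```python
import heapq


def broadcast_times(P, L, o, g):
    """Return the times at which each of P PEs holds the datum (greedy schedule)."""
    step = max(g, o)                 # interval between sends from one PE
    informed = [0]                   # the root holds the datum at time 0
    # Heap entries: (time the receiver finishes receiving, sender's send start).
    heap = [(L + 2 * o, 0)]
    while len(informed) < P:
        recv_done, send_start = heapq.heappop(heap)
        informed.append(recv_done)
        # The same sender can start its next message one step later.
        nxt = send_start + step
        heapq.heappush(heap, (nxt + L + 2 * o, nxt))
        # The newly informed PE starts sending immediately.
        heapq.heappush(heap, (recv_done + L + 2 * o, recv_done))
    return informed


times = broadcast_times(P=8, L=6, o=2, g=4)  # o=2 is an assumed value
print(times, "broadcast completes at", max(times))
```

Under these assumed values the root's direct children receive the datum at 10, 14, 18 and 22, and the last PE is informed at time 24.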

Optimal Sum
Given time T, how many items can we add?
Approach: recursive
–At the root, if T <= L+2o, use a single PE (can add T+1 items)
–If T > L+2o, the root must have the result ready at T, so the sender must have its sum ready at T - L - 2o - 1
–Recursively construct the sum tree at the sender
–If T - g > L+2o, the root can also receive another partial sum; build that subtree as if the root's deadline were T-g
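The recursion above can be sketched as a small counting function. This is a simplified illustration under several assumptions that are not on the slide: one addition costs one time unit, g >= o+1 so the root can fit a receive and an add between messages, the number of PEs is unlimited, and a child is used only when it contributes more than the o+1 local items the root gives up to receive from it. The function name `max_items` is hypothetical.

```python
def max_items(T, L, o, g):
    """Count the items that can be summed by time T under the assumptions above."""
    assert g >= o + 1, "sketch assumes the root can receive and add within one gap"
    if T < 0:
        return 0
    if T <= L + 2 * o:
        return T + 1                      # single PE: T additions, T+1 items
    from_children = 0
    k = 0                                 # number of children of the root
    deadline = T - (L + 2 * o + 1)        # when the latest child must be done
    while deadline >= 0:
        child = max_items(deadline, L, o, g)
        if child <= o + 1:                # not worth the receive overhead
            break
        from_children += child
        k += 1
        deadline -= g                     # earlier children finish g sooner
    # The root spends o + 1 per child (receive + add); the rest is local adds.
    local = T - k * (o + 1) + 1
    return from_children + local


print(max_items(14, L=6, o=2, g=4))
```

For example, with L=6, o=2, g=4 and T=14 the root sums 12 local items and receives a 4-item partial sum from one child, for 16 items in total, one more than a single PE could add in the same time.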

Applications
FFT on the butterfly network
–Data placement
cyclic layout: first log(n/P) iterations are local, last log P are global
blocked layout: first log P iterations are global, remaining are local
hybrid: after log(n/P) iterations, re-map from cyclic to blocked so that the remaining iterations are also local
Communication time: g * (n/P**2) * (P-1) + L
each PE has n/P data, and 1/P of it goes to each other PE
Total time is within a factor of (1 + g/log n) of optimal
–All-to-all communication schedule
Approach 1: each PE sends to PE1, PE2, ... in this order => bottleneck at PE1, then at PE2
Approach 2 (staggered re-map): no congestion
–PE1 sends to PE2, PE3, ...
–PE2 sends to PE3, PE4, etc.
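The difference between the two all-to-all schedules can be checked by counting how many messages target the same PE in one round. The sketch below is illustrative: PEs are numbered from 0, a "round" is an idealised step in which every PE sends one message, and all function names are hypothetical. The last function evaluates the communication-time formula given above.

```python
def naive_schedule(P):
    # Round r: every PE sends to the same destination r (skipping itself).
    return [[(src, r) for src in range(P) if src != r] for r in range(P)]


def staggered_schedule(P):
    # Round r: PE i sends to PE (i + r) mod P, so destinations never collide.
    return [[(src, (src + r) % P) for src in range(P)] for r in range(1, P)]


def max_fan_in(schedule):
    """Largest number of messages aimed at one PE within a single round."""
    worst = 0
    for rnd in schedule:
        counts = {}
        for _, dst in rnd:
            counts[dst] = counts.get(dst, 0) + 1
        worst = max(worst, max(counts.values()))
    return worst


def remap_time(n, P, L, g):
    # g * (n / P**2) * (P - 1) + L, as on the slide.
    return g * (n / P**2) * (P - 1) + L


print(max_fan_in(naive_schedule(8)), max_fan_in(staggered_schedule(8)))
print(remap_time(n=1024, P=8, L=6, g=4))
```

With 8 PEs the naive order aims 7 messages at one PE per round, while the staggered order never aims more than one.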

Implementation on CM5
CM:
–33MHz
–Fat trees
–Global control for scan/prefix/broadcast
–one CM MFLOPs
–FFT on local: MFLOPs (cache effect)
each cycle:
–multiply and add: 4.5 us
–o: 2 us
–L: 6 us
–g: 4 us
–load and store overhead per cycle: 1 us
communication time: (n/P) * max(1 us + 2o, g) + L
bottleneck: processing and overhead, not bandwidth
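Plugging the measured values into the communication-time formula shows why the bottleneck is overhead rather than bandwidth: 1 us + 2o = 5 us exceeds g = 4 us. A minimal sketch, with a hypothetical function name and an illustrative problem size (n=1024, P=8) that is not taken from the slides:

```python
def cm5_comm_time(n, P, o=2.0, g=4.0, L=6.0, load_store=1.0):
    """Return (time in us, limiting factor) for (n/P) * max(load_store + 2o, g) + L."""
    per_point_overhead = load_store + 2 * o   # processing cost per data point
    per_point = max(per_point_overhead, g)
    limit = "overhead" if per_point_overhead >= g else "bandwidth"
    return (n / P) * per_point + L, limit


time_us, limit = cm5_comm_time(n=1024, P=8)   # n and P are illustrative
print(time_us, limit)
```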

LU decomposition
–Data arrangement is critical

Matching the model with real machines
–Average distance: the topology-independent assumption usually works for n=1024 nodes
–The average distance and the max distance do not differ much

Potential Concerns
Algorithmic concerns
–Theory?
–Too complex?
Communication concerns
–How to use trivial communication such as local exchange
–Topology dependencies?

Comparison with BSP
–Length of superstep
–Message not usable till the next superstep
–Special hardware for sync
–When the virtual/physical processor ratio is large, context switching may be expensive