LoGPC: Modeling Network Contention in Message-Passing Programs
Csaba Andras Moritz, Matthew I. Frank
Laboratory for Computer Science, Massachusetts Institute of Technology
{andras,mfrank}@lcs.mit.edu
Introduction
LogP and LogGP are good models for capturing first-order system costs.
Our new model, LoGPC, extends LogP and LogGP by capturing pipelining and network contention.
Results preview: in 3 applications, contention of 50-76% was found.
Motivation: why do we care?
Regular, tightly synchronized communication patterns are modeled successfully with LogP and LogGP.
Important classes of applications have irregular communication patterns, are not tightly synchronized, or use large messages.
Outline of presentation
LoGPC methodology
Contention-free models: LogP, LogGP
Pipelining model
Network contention model
Applications
LoGPC framework
(Diagram: the platform's network interface drives a pipelining model and its communication layer drives the contention-free models, yielding a performance signature; the contention models combine this signature with the application and the interconnection network to predict application performance.)
Short messages: LogP (Culler et al.)
4 parameters:
L = latency
o = overhead for send and receive
g = gap, the minimum time interval between consecutive sends or receives
P = number of processors
(Timeline diagram: o_send, L, o_recv, with gap g between consecutive messages.)
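As an illustration, a minimal contention-free LogP cost sketch in Python; the parameter values are hypothetical, chosen only to make the formulas concrete:

    # Contention-free LogP cost model for short messages (sketch).
    L = 10      # latency (cycles), hypothetical value
    o = 5       # send/receive overhead (cycles), hypothetical value
    g = 8       # gap between consecutive sends/receives (cycles), hypothetical value

    def one_way_time():
        # a single short message: send overhead, network latency, receive overhead
        return o + L + o

    def n_message_time(n):
        # n back-to-back sends from one processor: issue rate limited by max(o, g);
        # the last message still needs L plus the receive overhead
        return (n - 1) * max(o, g) + o + L + o

    print(one_way_time(), n_message_time(4))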
Long messages: LogGP (Alexandrov et al.)
A new parameter is introduced:
G = gap per byte for long messages
(Timeline diagram: the sender injects a k-byte message over o_s + (k-1)G, the network adds latency L, and the receiver pays overhead o_r.)
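Continuing the sketch above (reusing L and o), the standard LogGP estimate for a k-byte message:

    G = 0.5     # gap per byte for long messages (cycles/byte), hypothetical value

    def loggp_time(k):
        # time until a k-byte message has been received:
        # send overhead + (k-1) per-byte gaps + network latency + receive overhead
        return o + (k - 1) * G + L + o

    print(loggp_time(1024))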
LoGPC framework, revisited
(Diagram: same framework, now highlighting the path from the platform's network interface through the pipelining model.)
Pipelining model: network interface of MIT Alewife
(Block diagram: processor and data cache on a bus to the controller, which moves data between memory and the network through two DMA engines, input and output queues, IPI in/out, and an IRQ line.)
Pipelining model
(Timeline diagram: sender and receiver; the k-byte injection (k-1)G, the overhead o, a DMA interrupt, and the latency L, with the memory transfer pipelined against the network transfer on both ends.)
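A minimal sketch of how pipelining could change the long-message estimate. This is a simplification under our own assumptions, not the paper's exact model: memory and network transfers are assumed to overlap fully except for a DMA setup cost on the sender and a completion interrupt on the receiver (G_mem, o_dma, and o_irq are hypothetical parameters):

    G_mem = 0.4   # memory-transfer time per byte (cycles/byte), hypothetical
    o_dma = 20    # DMA setup cost on the sender (cycles), hypothetical
    o_irq = 50    # DMA-completion interrupt cost on the receiver (cycles), hypothetical

    def pipelined_time(k):
        # the per-byte cost is set by the slower of the memory and network
        # pipelines when they overlap; setup, latency, and interrupt do not overlap
        per_byte = max(G, G_mem)
        return o_dma + (k - 1) * per_byte + L + o_irq

    print(pipelined_time(1024))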
LoGPC framework, revisited
(Diagram: same framework, now highlighting the contention model, which combines the performance signature with the interconnection network and the application to predict application performance.)
Network contention model
Performance signature {L, o, G} (Active Messages on MIT Alewife)
Application-specific inputs: inter-message time, average message distance
Network inputs: network dimension, network distance
Output: contention delay per message, which feeds the application performance estimate
Contention delay per message: Cn
Start with the open network model by Agarwal for expressing the contention per message:
Cn = network contention delay
L = network latency
Total message latency = L + Cn
Contention delay per message: Cn (closed model)
Close the model into a P-customer system: each processor alternates between an inter-message time T0 and having a message in the network for L + Cn.
Apply Little's equation and solve for m.
T0: inter-message time; P: processors; m: message rate; Cn: contention delay; L: network latency
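A minimal sketch of the closed-model fixed point, again reusing the parameters above. The function cn_open below is only a stand-in for Agarwal's open-network contention expression (any increasing function of channel utilization illustrates the mechanism); kd and B are hypothetical locality and message-size parameters:

    def cn_open(m, kd=4.0, B=16.0):
        # stand-in for the open-model contention delay at offered message rate m;
        # kd = average hop distance, B = message size
        rho = min(m * kd * B, 0.99)      # rough channel utilization
        return rho * B / (1.0 - rho)

    def solve_closed_model(T0, L, kd=4.0, iters=100):
        # closed model: each processor alternates T0 of computation with L + Cn
        # in the network, so by Little's law m = 1 / (T0 + L + Cn(m));
        # iterate this to a fixed point in the message rate m
        cn = 0.0
        for _ in range(iters):
            m = 1.0 / (T0 + L + cn)
            cn = cn_open(m, kd=kd)
        return m, cn

    m, cn = solve_closed_model(T0=200.0, L=L)
    print("message rate:", m, "contention delay per message:", cn)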
LoGPC step-by-step
1. Extract the communication signature {L, o, G}.
2. Estimate the inter-message time(s) {T0} from the application's communication pattern(s).
3. Estimate the application's locality (= average message distance).
4. Use the contention model to get the contention delay per message {Cn}.
5. Estimate the runtime along the critical path (see the sketch below).
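Tying the steps together, a hedged end-to-end sketch that reuses the functions above; the critical-path accounting here (n_msgs equally spaced messages per processor) is purely illustrative and is application-specific in practice:

    def logpc_runtime(n_msgs, k_bytes, T0, avg_dist=4.0):
        # steps 1-3: the signature {L, o, G}, inter-message time T0, and locality
        # (avg_dist) are inputs; step 4: contention delay from the closed model;
        # step 5: runtime along one processor's critical path
        _, cn = solve_closed_model(T0=T0, L=L, kd=avg_dist)
        per_msg = o + (k_bytes - 1) * G + L + cn + o
        return n_msgs * max(T0, per_msg)

    print(logpc_runtime(n_msgs=100, k_bytes=64, T0=200.0))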
Applications: all-to-all remap
(Plot: measured execution time, LoGPC prediction, and contention-free prediction.)
Diamond DAG, used in DNA chain comparison
(Plots: with a random mapping, measured times vs. LoGPC; with a perfect mapping, measured times vs. LogGP = LoGPC.)
EM3D: hot-spot elimination
Simulates the propagation of electromagnetic waves in solids.
Asynchronous communication pattern with bulk transfers.
LoGPC was used to eliminate performance bugs: we improved performance by 20% by reducing contention by up to 70%.
(Plot: time breakdown into synchronization, communication, and computation.)
Summary
We found network contention to be significant: all-to-all remap: 50%; Diamond DAG: up to 56%; EM3D: up to 70% (20% improvement in overall performance).
LoGPC is a simple way to evaluate how much locality matters for an application.
LoGPC is a simple way to evaluate whether network contention is significant for an application.