LoGPC: Modeling Network Contention in Message-Passing Programs

Presentation transcript:

LoGPC: Modeling Network Contention in Message-Passing Programs
Csaba Andras Moritz and Matthew I. Frank
Laboratory for Computer Science, Massachusetts Institute of Technology
{andras,mfrank}@lcs.mit.edu

Introduction
- LogP and LogGP are good models for capturing first-order system costs.
- Our new model, LoGPC, extends LogP and LogGP to capture message pipelining and network contention.
- Results preview: across 3 applications, we found 50-70% of communication cost due to contention.

Motivation: why do we care?
- Regular, tightly synchronized communication patterns are modeled successfully with LogP and LogGP.
- Important classes of applications have irregular communication patterns, are not tightly synchronized, or use large messages.

Outline of presentation
- LoGPC methodology
- Contention-free models: LogP, LogGP
- Pipelining model
- Network contention model
- Applications

LoGPC framework
[Diagram: the application and the platform (network interface, communication layer, interconnection network) feed three model components - the contention-free models (LogP/LogGP), the pipelining model, and the contention model - which together yield the performance signature and the predicted application performance.]

Short messages: LogP (Culler et al.)
Four parameters:
- L = latency
- o = overhead (processor time spent sending or receiving)
- g = gap: minimum time interval between consecutive sends or receives
- P = number of processors
[Timeline diagram: send overhead o_send, network latency L, receive overhead o_recv, with gap g separating consecutive messages.]
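For reference, the standard LogP accounting implied by this timeline (not spelled out on the slide):

    T_1 = o_send + L + o_recv                  (one short message)
    T_n = o_send + (n-1)g + L + o_recv         (n pipelined sends, assuming g >= o)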

Long messages: LogGP (Alexandrov et al.)
One new parameter is introduced:
- G = gap per byte for long messages
[Timeline diagram: the sender emits a k-byte message; after send overhead o_s the bytes leave at intervals of G, so the last byte departs (k-1)G later; it arrives after latency L and is consumed with receive overhead o_r.]
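A minimal worked form of the resulting cost (standard LogGP, consistent with the timeline above): a k-byte message is delivered in

    T(k) = o_s + (k-1)G + L + o_r

so sustained bandwidth for large messages approaches 1/G bytes per unit time.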

LoGPC framework (continued)
[The framework diagram repeated; the pipelining model component is discussed next.]

Pipelining model: Network Interface (Alewife)
[Block diagram: the processor and its data cache share a bus with the controller; the controller contains two DMA engines (DMA1, DMA2), input and output queues, and IPI-in/IPI-out ports connecting memory to the network; an IRQ line signals the processor.]

Pipelining model
[Timeline diagram: on the sender, the send overhead o, the DMA memory transfer, and the (k-1)G network transfer overlap in a pipeline; after latency L, the receiver takes an interrupt and DMA drains the message into memory.]
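One plausible reading of this timeline (my notation, not spelled out on the slide): with the memory and network transfers overlapped, the cost of a k-byte message is governed by the slower pipeline stage,

    T(k) ~ o + max(T_mem(k), (k-1)G) + L + T_interrupt

where T_mem(k) is the DMA memory-transfer time and T_interrupt is the receiver's interrupt-handling cost.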

LoGPC framework (continued)
[The framework diagram repeated; the contention model component is discussed next.]

Network contention model
Inputs:
- Performance signature {L, o, G} (Active Messages on the MIT Alewife)
- Application-specific parameters: inter-message time, average message distance
- Network parameters: network dimension, network distance
The contention model produces the contention delay per message, which feeds the application performance estimate.

Contention per message: Cn
Start from the open network model by Agarwal to express the contention per message:
- L = network latency
- Cn = network contention delay
Total network time per message: L + Cn

Contention delay per message: Cn (closed model)
Close the model as a P-customer system and apply Little's equation to solve for the message rate m:
- T0: inter-message time (compute time between messages)
- P: number of processors
- m: per-processor message rate
- Cn: contention delay per message
- L: network latency
Each message cycle takes T0 + L + Cn, so Little's equation gives m = 1 / (T0 + L + Cn), which is solved self-consistently because Cn itself depends on m.
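A minimal numerical sketch of closing the model, in Python; the function names and the contention expression below are hypothetical stand-ins (the actual contention model also depends on network dimension and average message distance):

    def solve_message_rate(T0, L, P, contention, iters=100):
        # Fixed-point iteration for the closed model:
        #   m = 1 / (T0 + L + Cn(m))
        # T0: inter-message time, L: network latency,
        # P: processors, contention(m, P): delay per message.
        m = 1.0 / (T0 + L)  # contention-free starting point
        for _ in range(iters):
            m = 1.0 / (T0 + L + contention(m, P))
        return m

    # Hypothetical stand-in: delay grows with channel utilization.
    def example_contention(m, P):
        rho = min(0.99, 0.1 * m * P)   # utilization (made-up constant)
        return rho / (1.0 - rho)       # M/M/1-style queueing delay

    m = solve_message_rate(T0=100.0, L=10.0, P=32,
                           contention=example_contention)
    print("per-processor message rate:", m)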

LoGPC step-by-step
1. Extract the communication signature {L, o, G}.
2. Estimate the inter-message time(s) {T0} from the application's communication pattern(s).
3. Estimate application locality (= average message distance).
4. Use the contention model to obtain the contention delay per message {Cn}.
5. Estimate the runtime from the critical path (sketched below).
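Putting the steps together, a rough sketch in my own notation (not from the slide): each message on the critical path costs roughly

    T_msg ~ o + L + Cn              (short messages)
    T_msg ~ o + (k-1)G + L + Cn     (bulk transfers)

and the runtime estimate accumulates computation plus these message costs along the application's critical path.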

Applications: all-to-all remap
[Graph: measured performance vs. the LoGPC prediction vs. a no-contention prediction.]

Diamond DAG (used in DNA chain comparison)
[Graphs: with a random mapping, measured performance vs. the LoGPC prediction; with a perfect mapping, measured performance vs. LogGP = LoGPC (with no contention, the two models coincide).]

Em3d: hot-spot elimination
- Models the propagation of electromagnetic waves in solids.
- Asynchronous communication pattern with bulk transfers.
- LoGPC was used to eliminate performance bugs: we improved performance by 20% by reducing contention by up to 70%.
[Chart: execution time broken down into synchronization, communication, and computation phases.]

Summary
We found network contention to be significant:
- all-to-all remap: 50%
- Diamond DAG: up to 56%
- EM3D: up to 70% (overall performance improved by 20%)
LoGPC is a simple way to evaluate how much locality matters for an application, and whether network contention is significant for it.