Dynamic Time Variant Connection Management for PGAS Models on InfiniBand
Abhinav Vishnu (1), Manoj Krishnan (1), and Pavan Balaji (2)
(1) Pacific Northwest National Laboratory, Richland, WA
(2) Argonne National Laboratory, Argonne, IL

Outline
- Introduction
- Background and Motivation: InfiniBand Connection Semantics; Global Arrays and ARMCI
- Overall Design: Efficient Connection Teardown; Connection Cache Management
- Performance Evaluation: Computational Chemistry; Sub-surface Modeling
- Conclusions and Future Work

Introduction
- For runtime systems, scalable communication data structures are critical
- Communication data structures:
  - Buffers (data, control messages, ...)
  - Connections: end-points on Gemini, SeaStar, and Blue Gene; one-to-one mapping on InfiniBand
  - Registration data structures (local for MPI, local + remote for PGAS)
- Efficient connection management is important: 213 InfiniBand systems in the TOP500
- PGAS models are becoming popular

InfiniBand Connection Management
- On-demand pair-wise connection creation: Cluster'02 (VIA), IPDPS'06, Cluster'08 (IB-MPI), CCGrid'10 (IB-PGAS)
  - Connections persist through the application lifetime
- Unreliable datagram based approaches (ICS'07)
  - A natural fit for two-sided communication (send/receive model)
  - Designing get and bulk data transfer is prohibitive; reliability must be maintained in software
- eXtended Reliable Connection (XRC): connection memory increases with the number of nodes rather than processes (ICS'07, Cluster'08)
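
To make the on-demand approach concrete, the following is a hypothetical C sketch (not code from the cited papers or from ARMCI): a per-peer table is consulted on every send, a connection is established only on first use, and it then persists for the rest of the run. The table size, helper names, and driver are assumptions; on InfiniBand the per-connection state would be a reliable-connection queue pair, which is why a fully populated table becomes expensive at scale.

#include <stddef.h>
#include <stdio.h>

#define MAX_PROCS 8192                 /* assumed upper bound on peer processes */

static void *conn_table[MAX_PROCS];    /* one potential persistent connection per peer */

/* Stubs standing in for real work: creating an RC queue pair, exchanging its
 * identifiers with the peer, and posting a send descriptor. */
static void *establish_connection(int peer)
{
    printf("establishing connection to peer %d\n", peer);
    return (void *)1;                  /* a real runtime would return a QP handle */
}

static void post_send(void *conn, const void *buf, size_t len)
{
    (void)conn; (void)buf; (void)len;
}

/* On-demand creation: pay the connection setup cost only for peers we actually
 * talk to; the connection then persists for the application lifetime. */
static void send_to(int peer, const void *buf, size_t len)
{
    if (conn_table[peer] == NULL)
        conn_table[peer] = establish_connection(peer);
    post_send(conn_table[peer], buf, len);
}

int main(void)
{
    char msg[8] = "hello";
    send_to(3, msg, sizeof msg);       /* first send creates the connection */
    send_to(3, msg, sizeof msg);       /* later sends reuse it */
    return 0;
}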

Use Cases for PGAS Models
- Frequently combined with non-SPMD execution models: task-based computations, dynamic load balancing and work stealing
- Linear communication over the application execution lifetime
- Time variance in execution: little temporal reuse of connections (SC'09)
- Connection persistence is not useful

Problem Statement
- Given the low temporal locality of connections for PGAS models and non-SPMD executions:
  - What are the design choices for a disconnection protocol?
  - What are the memory benefits and possible performance degradations?

Outline
- Introduction
- Background and Motivation: InfiniBand Connection Semantics; Global Arrays and ARMCI
- Overall Design: Efficient Connection Teardown; Connection Cache Management
- Performance Evaluation: Computational Chemistry; Sub-surface Modeling
- Conclusions and Future Work

InfiniBand Transport Semantics
- Reliable Connection (RC): most frequently used; supports in-order delivery, RDMA, QoS, ...
- Reliable Datagram (RD): most RC features, but ...
- Unreliable Connection (UC): supports RDMA and requires a dedicated QP; no ordering
- Unreliable Datagram (UD): connectionless; message size limited to the MTU; no ordering or reliability guarantees
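
The choice among these services is made when a queue pair is created through the verbs interface. Below is a minimal libibverbs sketch (illustrative only, not code from this work); the completion-queue depth and work-request limits are arbitrary assumptions and error handling is kept to a bare check. The connected transports (RC and UC) need one such queue pair per communicating peer, which is the per-connection memory the rest of the talk is concerned with.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no InfiniBand devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* Reliable Connection: in-order delivery, RDMA */
        /* IBV_QPT_UC would give RDMA writes without reliability or ordering;   */
        /* IBV_QPT_UD is connectionless with messages limited to a single MTU. */
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1 },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (qp)
        printf("created RC queue pair number %u\n", qp->qp_num);

    if (qp) ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}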

Global Arrays
- Global Arrays (GA) is a PGAS programming model
- GA presents a shared view of physically distributed data and provides a one-sided communication model
- Used in a wide variety of applications:
  - Computational chemistry: NWChem, Molcas, Molpro, ...
  - Bioinformatics: ScalaBLAST
  - Ground water modeling: STOMP
[Figure: a global address space presented over physically distributed data]
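
To make the shared view and one-sided access concrete, here is a minimal Global Arrays sketch in C (illustrative, not taken from this talk); the array dimensions, MA buffer sizes, and the 4x4 patch are assumptions.

#include <stdio.h>
#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    GA_Initialize();                       /* start the GA runtime (uses ARMCI underneath) */
    MA_init(C_DBL, 1000000, 1000000);      /* memory allocator for internal buffers */

    int dims[2] = {1024, 1024}, chunk[2] = {-1, -1};
    int g_a = NGA_Create(C_DBL, 2, dims, "A", chunk);   /* collectively create a global array */

    double zero = 0.0;
    GA_Fill(g_a, &zero);
    GA_Sync();

    int me = GA_Nodeid();
    double buf[16];
    int lo[2] = {0, 0}, hi[2] = {3, 3}, ld[1] = {4};    /* a 4x4 patch of the array */

    if (me == 0) {
        for (int i = 0; i < 16; i++) buf[i] = (double)i;
        NGA_Put(g_a, lo, hi, buf, ld);     /* one-sided write: the owner posts no receive */
    }
    GA_Sync();
    if (me == 1 || GA_Nnodes() == 1) {
        NGA_Get(g_a, lo, hi, buf, ld);     /* one-sided read of the same patch */
        printf("buf[5] = %f\n", buf[5]);
    }

    GA_Destroy(g_a);
    GA_Terminate();
    MPI_Finalize();
    return 0;
}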

Aggregate Remote Memory Copy Interface (ARMCI)
- Communication runtime system for Global Arrays; also used in Global Trees and Chapel
- Provides one-sided communication runtime primitives
- Currently supported platforms:
  - Commodity networks: InfiniBand, Ethernet, ...
  - Leadership-class machines: Cray XE6, Cray XTs, IBM Blue Genes (BG/Q and Blue Waters ports ongoing)
- Upcoming features: fault-tolerant continued execution (5.1), energy-efficiency modes (5.2)
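
One level down, ARMCI exposes the one-sided primitives that GA is built on. The sketch below (again illustrative, not the talk's code) collectively allocates remotely accessible memory and has rank 0 write into rank 1's buffer with a one-sided put; the element count is an assumption.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include "armci.h"

#define NELEM 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    ARMCI_Init();                                    /* ARMCI runs on top of MPI here */

    int me, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Collective allocation: ptrs[p] is the remotely accessible buffer on rank p. */
    void **ptrs = malloc(nproc * sizeof(void *));
    ARMCI_Malloc(ptrs, NELEM * sizeof(double));

    double *local = malloc(NELEM * sizeof(double));
    for (int i = 0; i < NELEM; i++) local[i] = (double)me;

    ARMCI_Barrier();
    if (me == 0 && nproc > 1)
        ARMCI_Put(local, ptrs[1], (int)(NELEM * sizeof(double)), 1); /* one-sided write to rank 1 */
    ARMCI_AllFence();                                /* complete outstanding puts remotely */
    ARMCI_Barrier();

    if (me == 1)
        printf("rank 1 sees %f from rank 0\n", ((double *)ptrs[1])[0]);

    ARMCI_Free(ptrs[me]);
    free(ptrs);
    free(local);
    ARMCI_Finalize();
    MPI_Finalize();
    return 0;
}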

Outline
- Introduction
- Background and Motivation: InfiniBand Connection Semantics; Global Arrays and ARMCI
- Overall Design: Efficient Connection Teardown; Connection Cache Management
- Performance Evaluation: Computational Chemistry; Sub-surface Modeling
- Conclusions and Future Work

Connection Structure in ARMCI
[Figure: connection structure showing client processes, the master process, and its data server thread]

Connection Cache Management
- How many active connections to keep? Hard to model given the dynamic behavior of task-based computations
- Finding a victim connection:
  - LRU: insufficient with communication cliques and multi-phase applications (use case: flow + chemistry)
  - Modified LRU (LRU-M): exploits the temporal locality of connections
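
As a concrete reference point for the victim-selection discussion, here is a hypothetical sketch of a plain LRU connection cache in C. It is not ARMCI's implementation and not the LRU-M variant described above; the slot count, helper names, and driver loop are assumptions.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_SLOTS 32                 /* assumed cache size (the evaluation uses 128/32/4) */

struct conn_slot {
    int      peer;                     /* remote rank, -1 if the slot is empty */
    uint64_t last_used;                /* logical timestamp of the last use */
    void    *qp;                       /* transport handle, e.g. an RC queue pair */
};

static struct conn_slot cache[CACHE_SLOTS];
static uint64_t now;

/* Stubs standing in for real connection setup and the teardown protocol. */
static void *conn_open(int peer)  { printf("open %d\n", peer); return (void *)1; }
static void  conn_close(void *qp) { (void)qp; }

static void cache_init(void)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        cache[i] = (struct conn_slot){ .peer = -1, .last_used = 0, .qp = NULL };
}

static void *get_connection(int peer)
{
    /* Hit: refresh the timestamp and reuse the open connection. */
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].peer == peer) {
            cache[i].last_used = ++now;
            return cache[i].qp;
        }
    }

    /* Miss: evict the least recently used slot (empty slots have last_used == 0). */
    int victim = 0;
    for (int i = 1; i < CACHE_SLOTS; i++)
        if (cache[i].last_used < cache[victim].last_used)
            victim = i;

    if (cache[victim].peer != -1)
        conn_close(cache[victim].qp);  /* tear down before the slot is reused */

    cache[victim].peer = peer;
    cache[victim].qp = conn_open(peer);
    cache[victim].last_used = ++now;
    return cache[victim].qp;
}

int main(void)
{
    cache_init();
    for (int p = 0; p < 64; p++)       /* more peers than slots forces evictions */
        get_connection(p % 40);
    return 0;
}

Under communication cliques or multi-phase applications, this plain LRU policy can evict exactly the connection the next phase needs again, which is the behavior LRU-M is designed to avoid.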

Overlap Disconnection Protocol
[Figure: disconnection timeline involving the client, data server, and master process: flush during WaitProc, teardown request, acknowledgement, connection break]
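
A rough rendering of the handshake in the figure as a hypothetical C sketch (not the ARMCI implementation): the client flushes outstanding one-sided operations, sends a teardown request, and releases the connection only after the data server acknowledges, polling for the acknowledgement from its wait loop so the handshake overlaps with time it would spend waiting anyway. The control-message tags and helper functions are assumptions.

#include <stdbool.h>
#include <stdio.h>

enum ctrl_msg { TEARDOWN_REQ = 0x71, TEARDOWN_ACK = 0x72 };   /* assumed tags */

/* Trivial stubs so the sketch is self-contained; a real runtime would drive
 * these against the network and the data server thread. */
static bool server_got_request;
static void flush_outstanding(int peer)          { (void)peer; }  /* complete pending puts/gets */
static void send_ctrl(int peer, enum ctrl_msg m) { (void)peer; if (m == TEARDOWN_REQ) server_got_request = true; }
static bool poll_ctrl(int peer, enum ctrl_msg m) { (void)peer; return m == TEARDOWN_ACK && server_got_request; }
static void destroy_qp(int peer)                 { printf("connection to %d destroyed\n", peer); }

/* Phase 1: issued when the connection is chosen for teardown. Nothing may be
 * in flight on the connection when the request is sent. */
static void begin_teardown(int peer)
{
    flush_outstanding(peer);
    send_ctrl(peer, TEARDOWN_REQ);       /* ask the data server to stop using this connection */
}

/* Phase 2: polled from the client's wait loop (e.g., while in WaitProc), so the
 * handshake latency is hidden behind waiting the client does anyway. */
static bool try_finish_teardown(int peer)
{
    if (!poll_ctrl(peer, TEARDOWN_ACK))
        return false;                    /* not acknowledged yet; keep overlapping */
    destroy_qp(peer);                    /* both sides agree the connection is idle */
    return true;
}

int main(void)
{
    begin_teardown(1);
    while (!try_finish_teardown(1))
        ;                                /* a real client would make progress on other work */
    return 0;
}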

Outline
- Introduction
- Background and Motivation: InfiniBand Connection Semantics; Global Arrays and ARMCI
- Overall Design: Efficient Connection Teardown; Connection Cache Management
- Performance Evaluation: Computational Chemistry and Sub-surface Modeling
- Conclusions and Future Work

Performance Evaluation
- Test bed: 160 Tflop system with 2310 dual-socket, quad-core Barcelona nodes; InfiniBand DDR over PCI Express with Voltaire DDR switches
- Original implementation: Global Arrays (GA) version 4.3; the presented design is available with GA 5.0
- Methodologies: LRU and LRU-M, varying the number of entries in the connection cache
- Applications: Northwest Chemistry (NWChem), Subsurface Transport Over Multiple Phases (STOMP)

Performance Evaluation with NWChem
- Evaluation with the pentane input deck on 6144 processes
- The connection cache has a total of 128, 32, and 4 entries
- Negligible performance degradation for the 128- and 32-entry cache sizes
- Total connections created: – times with the 32-entry cache, ~32 times with the 4-entry cache

Performance Evaluation: NWChem (contd.)
- Evaluation with the siosi7 input deck on 4096 processes
- The connection cache has a total of 128, 32, and 4 entries
- Negligible performance degradation for the 128- and 32-entry cache sizes
- Total connections created: – times with the 32-entry cache, ~32 times with the 4-entry cache

Performance Evaluation: STOMP
- Evaluation on 8192 processes
- The connection cache has a total of 128, 32, and 4 entries
- LRU-M reduces the overall connection establishment and teardown time in comparison to LRU

Conclusions and Future Work
- Persistent on-demand connection approaches are insufficient
- Presented a design for connection management:
  - Efficient connection cache management
  - A disconnection protocol well suited to PGAS models
  - Memory benefits for two classes of applications
- Future work:
  - Extend the approach to two-sided (pair-wise) connections
  - Apply the approach to other communication data structures (remote registration caches)

Questions?
- Global Arrays
- ARMCI
- HPC-PNL