uGNI-based Charm++ Runtime for Cray Gemini Interconnect

Slides:

Advertisements

Similar presentations

MINJAE HWANG THAWAN KOOBURAT CS758 CLASS PROJECT FALL 2009 Extending Task-based Programming Model beyond Shared-memory Systems.

Advertisements

The Who, What, Why and How of High Performance Computing Applications in the Cloud Soheila Abrishami 1.

AMLAPI: Active Messages over Low-level Application Programming Interface Simon Yau, Tyson Condie,

Toward Efficient Support for Multithreaded MPI Communication Pavan Balaji 1, Darius Buntinas 1, David Goodell 1, William Gropp 2, and Rajeev Thakur 1 1.

Parallel Programming Laboratory1 Fault Tolerance in Charm++ Sayantan Chakravorty.

Abhinav Bhatele, Laxmikant V. Kale University of Illinois at Urbana-Champaign Sameer Kumar IBM T. J. Watson Research Center.

Multilingual Debugging Support for Data-driven Parallel Languages Parthasarathy Ramachandran Laxmikant Kale Parallel Programming Laboratory Dept. of Computer.

A Grid Parallel Application Framework Jeremy Villalobos PhD student Department of Computer Science University of North Carolina Charlotte.

Multiprocessors ELEC 6200: Computer Architecture and Design Instructor : Agrawal Name: Nam.

A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS University of Illinois at Urbana-Champaign Abhinav Bhatele, Eric Bohm, Laxmikant V.

Topology Aware Mapping for Performance Optimization of Science Applications Abhinav S Bhatele Parallel Programming Lab, UIUC.

BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines Gengbin Zheng Gunavardhan Kakulapati Laxmikant V. Kale University.

A Framework for Collective Personalized Communication Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan.

Parallel Communications and NUMA Control on the Teragrid’s New Sun Constellation System Lars Koesterke with Kent Milfeld and Karl W. Schulz AUS Presentation.

Application-specific Topology-aware Mapping for Three Dimensional Topologies Abhinav Bhatelé Laxmikant V. Kalé.

Synchronization and Communication in the T3E Multiprocessor.

1 Scaling Collective Multicast Fat-tree Networks Sameer Kumar Parallel Programming Laboratory University Of Illinois at Urbana Champaign ICPADS ’ 04.

High Performance User-Level Sockets over Gigabit Ethernet Pavan Balaji Ohio State University Piyush Shivam Ohio State University.

Boston, May 22 nd, 2013 IPDPS 1 Acceleration of an Asynchronous Message Driven Programming Paradigm on IBM Blue Gene/Q Sameer Kumar* IBM T J Watson Research.

© 2010 IBM Corporation Enabling Concurrent Multithreaded MPI Communication on Multicore Petascale Systems Gabor Dozsa 1, Sameer Kumar 1, Pavan Balaji 2,

A Comparative Study of the Linux and Windows Device Driver Architectures with a focus on IEEE1394 (high speed serial bus) drivers Melekam Tsegaye

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device Shuang LiangRanjit NoronhaDhabaleswar K. Panda IEEE.

A Fault Tolerant Protocol for Massively Parallel Machines Sayantan Chakravorty Laxmikant Kale University of Illinois, Urbana-Champaign.

Charm Workshop CkDirect: Charm++ RDMA Put Presented by Eric Bohm CkDirect Team: Eric Bohm, Sayantan Chakravorty, Pritish Jetley, Abhinav Bhatele.

Overcoming Scaling Challenges in Bio-molecular Simulations Abhinav Bhatelé Sameer Kumar Chao Mei James C. Phillips Gengbin Zheng Laxmikant V. Kalé.

Fault Tolerant Extensions to Charm++ and AMPI presented by Sayantan Chakravorty Chao Huang, Celso Mendes, Gengbin Zheng, Lixia Shi.

Workshop BigSim Large Parallel Machine Simulation Presented by Eric Bohm PPL Charm Workshop 2004.

The influence of system calls and interrupts on the performances of a PC cluster using a Remote DMA communication primitive Olivier Glück Jean-Luc Lamotte.

Cray Inc. Hot Interconnects 1 Bob Alverson, Duncan Roweth, Larry Kaplan Cray Inc.

Optimizing Charm++ Messaging for the Grid Gregory A. Koenig Parallel Programming Laboratory Department of Computer.

1 Lecture 4: Part 2: MPI Point-to-Point Communication.

Parallelizing Spacetime Discontinuous Galerkin Methods Jonathan Booth University of Illinois at Urbana/Champaign In conjunction with: L. Kale, R. Haber,

IPDPS Workshop: Apr 2002PPL-Dept of Computer Science, UIUC A Parallel-Object Programming Model for PetaFLOPS Machines and BlueGene/Cyclops Gengbin Zheng,

Scalable and Topology-Aware Load Balancers in Charm++ Amit Sharma Parallel Programming Lab, UIUC.

A uGNI-Based Asynchronous Message- driven Runtime System for Cray Supercomputers with Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale.

Programming an SMP Desktop using Charm++ Laxmikant (Sanjay) Kale Parallel Programming Laboratory Department of Computer Science.

Fault Tolerance in Charm++ Gengbin Zheng 10/11/2005 Parallel Programming Lab University of Illinois at Urbana- Champaign.

1 Opportunities and Challenges of Modern Communication Architectures: Case Study with QsNet CAC Workshop Santa Fe, NM, 2004 Sameer Kumar* and Laxmikant.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab University of Illinois at Urbana-Champaign.

Hierarchical Load Balancing for Large Scale Supercomputers Gengbin Zheng Charm++ Workshop 2010 Parallel Programming Lab, UIUC 1Charm++ Workshop 2010.

Effects of contention on message latencies in large supercomputers Abhinav S Bhatele and Laxmikant V Kale Parallel Programming Laboratory, UIUC IS TOPOLOGY.

FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI Gengbin Zheng Lixia Shi Laxmikant V. Kale Parallel Programming Lab.

Experiences with VI Communication for Database Storage Yuanyuan Zhou, Angelos Bilas, Suresh Jagannathan, Cezary Dubnicki, Jammes F. Philbin, Kai Li.

Synergy.cs.vt.edu VOCL: An Optimized Environment for Transparent Virtualization of Graphics Processing Units Shucai Xiao 1, Pavan Balaji 2, Qian Zhu 3,

Teragrid 2009 Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University.

Robust Non-Intrusive Record-Replay with Processor Extraction Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.

Debugging Large Scale Applications in a Virtualized Environment Filippo Gioachin Gengbin Zheng Laxmikant Kalé Parallel Programming Laboratory Departement.

Group Members Hamza Zahid (131391) Fahad Nadeem khan Abdual Hannan AIR UNIVERSITY MULTAN CAMPUS.

Parallel Programming Models

Effects of Contention on Message Latencies in Large Supercomputers

Interconnection topologies

Accelerating Large Charm++ Messages using RDMA

Performance Evaluation of Adaptive MPI

Gengbin Zheng Xiang Ni Laxmikant V. Kale Parallel Programming Lab

Department of Computer Science University of California, Santa Barbara

CS 258 Reading Assignment 4 Discussion Exploiting Two-Case Delivery for Fast Protected Messages Bill Kramer February 13, 2002 #

Shreeni Venkataramaiah

User-level Distributed Shared Memory

Recent Communication Optimizations in Charm++

Upcoming Improvements and Features in Charm++

Application taxonomy & characterization

Faucets: Efficient Utilization of Multiple Clusters

Communication Framework

BigSim: Simulating PetaFLOPS Supercomputers

Gengbin Zheng, Esteban Meneses, Abhinav Bhatele and Laxmikant V. Kale

IXPUG, SC’16 Lightning Talk Kavitha Chandrasekar*, Laxmikant V. Kale

Department of Computer Science University of California, Santa Barbara

Support for Adaptivity in ARMCI Using Migratable Objects

Emulating Massively Parallel (PetaFLOPS) Machines

Presentation transcript:

uGNI-based Charm++ Runtime for Cray Gemini Interconnect 10th Annual Workshop on Charm++ Applications uGNI-based Charm++ Runtime for Cray Gemini Interconnect Yanhua Sun, Gengbin Zheng, Laximant(Sanjay) Kale Parallel Programming Lab University of Illinois at Urbana-Champaign Ryan Olson, Cray Inc Terry R. Jones, Oak Ridge National Lab

Motivation Modern Interconnects are complex Challenging to obtain good performance for applications with various communication patterns Multiple programming models/languages are developed For example NQueens

Motivation How to attain good performance for applications in alternative models on different interconnects ? For example NQueens

Motivation How to attain good performance for applications in alternative models? Focus: Exploit the performance of Charm++ programming model on Gemini Interconnect For example NQueens

Why not MPI-based Charm++ It works Not best one Unnecessary features in MPI lead to overhead Data interaction pattern in Charm++ is different from MPI Only sender-involving For example NQueens

Initial Pingpong Performance This motivated

Outline Overview of Gemini and uGNI Design of uGNI-based Charm++ Optimization to improve performance Micro-benchmark and application results This motivated

Gemini Interconnect Upgraded from SeaStar+ Low latency (700ns) High bandwidth (8GBytes/sec) Scale to 100,000 nodes Hardware support for one-sided communication Fast Memory Access (FMA) Block Transfer Engine (BTE) Framework . User do not need to know any of these problems and techniques

uGNI Two libraries uGNI Charm++ Distributed Memory Applications (DMAPP) User-level Generic Network Interface (uGNI) uGNI Charm++

uGNI APIs Memory Registration/de- Post FMA/BTE transactions Completion Queues

Lower Runtime System (LRTS) LrtsInit() LrtsSyncSend() LrtsAdvanceCommunication()

Design of uGNI-based Charm++

Design of uGNI-based Charm++ Small messages SMSG directly send with data_tag 1024 bytes default Registered memory increases linearly with maximum msg size and the number of nodes

Baseline Pingpong Performance

Performance Issues?

Memory Pool Memory registration/de-registration costs a lot Charm++ controls all memory allocation/de-allocation

Memory Pool Memory registration/de-registration costs a lot Charm++ controls all memory allocation/de-allocation Register big chucks of memory Allocation/de- is from memory pool

Performance of Memory Pool

Performance – Message Latency

Performance - Bandwidth

N-Queens (fine-grained)

NAMD 100M-atom on Titan

Conclusion Gemini Interconnect Charm++ LRTS Interface Memory Pool Optimization Reference Yanhua Sun, Gengbin Zheng, Ryan Olson, Terry Jones, Laxmikant Kale. A uGNI-Based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect [IPDPS 2012]