Upcoming Improvements and Features in Charm++

Presentation transcript:

Upcoming Improvements and Features in Charm++
Discussion Moderator: Eric Bohm
Charm++ Workshop 2018

One-sided, RDMA, zero-copy
- Direct API integrated into Charm++; see Nitin Bhat's talk for details. Note: uses IPC (i.e., CMA) for cross-process transfers within a node.
- Get vs. Put
  - Put semantics: when is it safe to write to remote memory? Message-layer completion notification is weak for put.
  - Get semantics: when is the remote data available? If your application already has that knowledge, get will have lower latency.
- Memory registration
  - You can't access a page that isn't mapped. There are four strategies with different costs to choose from (next slide).
  - We handle this for messages already, but if you want to zero-copy send from your own buffers, you have to think about this issue.
- Should the Direct API cover GPU-to-GPU operations?
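A minimal sketch of how the Direct API is typically used, assuming the CkNcpyBuffer interface described in the Charm++ manual (the chare names, entry methods, and member variables such as localData, destData, numElems, and destProxy are hypothetical placeholders):

```cpp
// Source side: wrap an existing application buffer and ship the small
// CkNcpyBuffer metadata object to the destination via a normal entry method.
void Sender::start() {
  CkCallback srcDone(CkIndex_Sender::sourceDone(NULL), thisProxy[thisIndex]);
  CkNcpyBuffer src(localData,                 // pointer to the user's buffer
                   numElems * sizeof(double), // size in bytes
                   srcDone,                   // fires when the buffer is safe to reuse
                   CK_BUFFER_REG);            // registration mode (see next slide)
  destProxy.recvMeta(src);                    // ordinary parameter-marshalled send
}

// Destination side: issue a one-sided get into its own buffer.
void Receiver::recvMeta(CkNcpyBuffer src) {
  CkCallback dstDone(CkIndex_Receiver::dataArrived(NULL), thisProxy[thisIndex]);
  CkNcpyBuffer dst(destData, numElems * sizeof(double), dstDone, CK_BUFFER_REG);
  dst.get(src);  // RDMA get; dataArrived() runs once the remote data has landed
}
```

A put would be the mirror image (src.put(dst) on the side that owns the data), with the weaker completion semantics noted above.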

Memory registration
- CK_BUFFER_PREREG – use CkRdmaAlloc / CkRdmaFree to allocate your buffers; you assert that all your buffers are accessible (i.e., pinned) to get the maximum performance benefit.
- CK_BUFFER_UNREG – you expect the runtime system to handle registration as necessary; may incur registration or copy overhead.
- CK_BUFFER_REG – request that the Direct API register your buffers; may incur per-transaction registration overhead.
- CK_BUFFER_NOREG – no guarantee regarding pinning; RDMA not supported. Generic support for Ncpy operations via the standard message protocols, with the associated copy overheads.
- Is this API sufficient?
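For the PREREG mode in particular, the buffer itself has to come from the pre-registered pool. A short sketch, assuming CkRdmaAlloc / CkRdmaFree have malloc/free-style signatures as implied above (the surrounding variable names are hypothetical):

```cpp
// Allocate from the pre-registered (pinned) pool so transfers can start
// immediately, with no per-transaction registration cost.
double *buf = static_cast<double *>(CkRdmaAlloc(numElems * sizeof(double)));
// ... fill buf with application data ...
CkNcpyBuffer src(buf, numElems * sizeof(double), doneCb, CK_BUFFER_PREREG);
// ... hand 'src' to the destination as in the previous sketch ...
// Once the completion callback has fired and the buffer is no longer needed:
CkRdmaFree(buf);
```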

Command Line Options
- How wide should a run be if the user provides no arguments?
  - The default for MPI is 1 process with 1 core and 1 rank: a conservative choice, but is that really what the user intended?
- +autoProvision: everything we can see is ours.
  - +processPerSocket +wthPerCore is probably the right answer, unless you need to leave cores free for OS voodoo (+excludecore).
- Are there other common command-line issues we should address?
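For illustration, the difference in launch lines might look like the following (the application name is a placeholder; only +autoProvision and +p are flags named in this talk or in standard charmrun usage):

```
# Let the runtime claim every core it can see:
./charmrun ./myapp +autoProvision

# The conservative alternative: spell out the provisioning by hand:
./charmrun +p8 ./myapp
```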

Offload API
- Manage the offloading of work to accelerators (CUDA only).
  - Support multiple accelerators per host and per process.
  - Completion is converted to a Charm++ callback event.
- Allow work to be done on the GPU or the CPU, based on utilization and suitability.
- CUDA only: that is where the platforms have been and are going.
- Are there other aspects of accelerator interaction that we should prioritize?
- How much priority should we place on other accelerator APIs?
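The "completion converted to a Charm++ callback" point follows the pattern below. This is a rough sketch assuming the hapiAddCallback() hook of Charm++'s GPU manager (HAPI); the exact name and signature may differ between versions, and the chare, kernel, and member variables are hypothetical:

```cpp
// Inside a chare's entry method: launch asynchronous GPU work on a stream and
// let the runtime fire a Charm++ callback when that stream's work completes.
void Worker::offload() {
  myKernel<<<grid, block, 0, stream>>>(devIn, devOut, n);        // async launch
  cudaMemcpyAsync(hostOut, devOut, n * sizeof(double),
                  cudaMemcpyDeviceToHost, stream);               // async copy-back
  CkCallback *done =
      new CkCallback(CkIndex_Worker::gpuWorkDone(), thisProxy[thisIndex]);
  hapiAddCallback(stream, done);  // runtime polls the stream and invokes 'done'
}
```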

C++ Integration
- vector directly supported in reductions
- [inline] supports templated methods
- R-value references supported
- PUP supports enums, deque, forward_list
- CkLoop supports lambda syntax
- Which advanced C++ features should be prioritized?
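A small sketch of what the PUP additions mean in practice, assuming the operator| overloads in pup_stl.h cover these types as the slide states (the struct and member names are hypothetical):

```cpp
#include "pup_stl.h"     // PUP operator| overloads for STL containers
#include <deque>
#include <forward_list>

enum Phase { INIT, RUN, DONE };

struct State {
  Phase phase;                     // enums now pup directly
  std::deque<int> pending;         // std::deque support
  std::forward_list<double> log;   // std::forward_list support

  void pup(PUP::er &p) {
    p | phase;
    p | pending;
    p | log;
  }
};
```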

CharmPy
- Basic Charm++ support in Python
- Limited load balancer selection
- No nodegroup
- No SMP mode (Python global interpreter lock)
- Which parts of the Python ecosystem should we prioritize for compatibility?
- Are there use cases for CharmPy that sound interesting to you?

Within-Node Parallelism
- Support for Boost threads (uFcontext)
  - Default choice on platforms where they don't break other features (not OS X)
  - Lowest context-switching cost
- Integration of the LLVM OpenMP implementation
  - Supports clean interoperability between Charm++ ULTs (CkLoop, [threaded], etc.) and OpenMP, to avoid oversubscription and resource contention
- Finer controls for CkLoop work-stealing strategies
- Our support for the OpenMP task API is weak; how important is that to you?
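A rough illustration of the interoperability point, assuming Charm++ is built with the integrated LLVM OpenMP runtime (the chare, proxies, and reduction target here are hypothetical): the parallel region's iterations are spread over idle PEs on the node by the integrated runtime rather than by a separate, competing pthread pool.

```cpp
#include <omp.h>

// Entry method of a chare: the loop below is parallelized by the integrated
// OpenMP runtime, so its workers share PEs with other Charm++ work instead of
// oversubscribing the node.
void Solver::relax(int n, const double *in, double *out) {
  #pragma omp parallel for schedule(static)
  for (int i = 1; i < n - 1; ++i)
    out[i] = 0.5 * (in[i - 1] + in[i + 1]);

  // Hand control back to the Charm++ side of the computation.
  contribute(CkCallback(CkReductionTarget(Main, iterationDone), mainProxy));
}
```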

AMPI
- Improved point-to-point communication latency and bandwidth, particularly for messages within a process.
- Updated AMPI_Migrate() with built-in MPI_Info objects, such as AMPI_INFO_LB_SYNC.
- Fixes to MPI_Sendrecv_replace, MPI_(I)Alltoall{v,w}, MPI_(I)Scatter(v), MPI_IN_PLACE in gather collectives, MPI_Type_free, MPI_Op_free, and MPI_Comm_free.
- Implemented MPI_Comm_create_group and MPI_Dist_graph support.
- Added support for using -tlsglobals for privatization of global/static variables in shared objects; previously -tlsglobals required static linking.
- AMPI now only renames the user's MPI calls from MPI_* to AMPI_* if Charm++/AMPI is built on top of another MPI implementation for communication.
- Support for compiling mpif.h in both fixed form and free form.
- PMPI profiling interface support added.
- Added an ampirun script that wraps charmrun, to enable easier integration with build and test scripts that already take mpirun/mpiexec as an option.
- Which incomplete aspects of MPI 3 are of highest importance to you?
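A minimal sketch of how the updated AMPI_Migrate() call is typically used in an iterative AMPI code, assuming the built-in AMPI_INFO_LB_SYNC info object mentioned above (the loop structure and function name are placeholders):

```cpp
#include <mpi.h>

// Iterative AMPI program: periodically give the runtime a chance to migrate
// this rank (a user-level thread) for load balancing.
void run(int iters) {
  for (int it = 0; it < iters; ++it) {
    // ... compute and exchange halos with the usual MPI calls ...

    if (it % 10 == 0) {
      // Collective hint: synchronize with the load balancer at this point.
      AMPI_Migrate(AMPI_INFO_LB_SYNC);
    }
  }
}
```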

Charm Next Generation (post 6.9)
- How attached are you to the current load-balancing API?
- Should per-PE load balancing be the focus, or per-host?
- Should chares be bound to a PE by default?
- Should entry methods be non-reentrant by default? Should unbound chares be non-reentrant by default?
- How much of a burden is charmxi to your application development?
- Dedicated schedulers 1:1 with execution streams and hardware threads, vs. a selectable number of schedulers bound to execution streams with the remainder as drones executing work-stealing queues?
- Should we implement multiple comm threads per process? No dedicated comm threads (à la the PAMI layer)?