HPX-5 ParalleX in Action


HPX-5: ParalleX in Action

Martin Swany
Associate Chair and Professor, Intelligent Systems Engineering
Deputy Director, Center for Research in Extreme Scale Technology (CREST)
Indiana University

ParalleX Execution Model

Core tenets:
- Fine-grained parallelism
- Hide latency with concurrency
- Runtime introspection and adaptation

Formal components:
- Global address space (shared-memory programming)
- Processes
- Compute complexes
- Lightweight control objects
- Parcels

Fully flexible, but promotes fine-grained dataflow programs. HPX-5 is based on ParalleX and is part of the Center for Shock-Wave Processing of Advanced Reactive Materials (C-SWARM) effort in PSAAP-II.

Model: Global Address Space

- Flat, byte-addressable global addresses
- Put/get with local and remote completion (see the sketch below)
- Active-message targets
- Array collectives
- Controls thread distribution and load balance

Current implementation:
- Block-based allocation
- Malloc/free with distribution (local, cyclic, user-defined, etc.)
- Traditional PGAS or directory-based AGAS
- High-performance local allocation (for high-frequency LCO allocation)
- Soft core affinity for NUMA
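To make put/get with completion concrete, here is a minimal C sketch against the HPX-5 API. hpx_gas_memput and hpx_lco_wait are assumed from the HPX-5 documentation and their exact signatures may vary across releases; hpx_lco_future_new and hpx_lco_delete_sync appear elsewhere in this deck.

#include <hpx/hpx.h>

// Put n ints into the global address dst, using two zero-size futures as
// the local- and remote-completion LCOs.
static void put_example(hpx_addr_t dst, const int *src, size_t n) {
  hpx_addr_t lsync = hpx_lco_future_new(0); // set when src is reusable
  hpx_addr_t rsync = hpx_lco_future_new(0); // set when data is visible at dst
  hpx_gas_memput(dst, src, n * sizeof(int), lsync, rsync);
  hpx_lco_wait(lsync);                      // safe to overwrite src now
  hpx_lco_wait(rsync);                      // put is globally visible now
  hpx_lco_delete_sync(lsync);
  hpx_lco_delete_sync(rsync);
}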

Model: Parcels

- Active messages with continuations
- Target: data action, global address, immediate data
- Continuation: data action, global address
  - lco_set, lco_delete, memput, free, etc.
- Execute local to the target address
- Unified local and remote execution model
  - send() is equivalent to thread_create()
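A hedged sketch of explicit parcel construction, showing the target/continuation pairing described above. The hpx_parcel_* calls are assumed from the HPX-5 headers; signatures may differ between releases.

#include <hpx/hpx.h>

// Build and send a parcel: `act` runs at `target`, and whatever value it
// continues is delivered to `cont_act` at `cont` (e.g. an LCO set).
static void send_parcel(hpx_addr_t target, hpx_action_t act,
                        hpx_addr_t cont, hpx_action_t cont_act,
                        const void *data, size_t n) {
  hpx_parcel_t *p = hpx_parcel_acquire(NULL, n);
  hpx_parcel_set_target(p, target);        // where the action executes
  hpx_parcel_set_action(p, act);           // what executes there
  hpx_parcel_set_cont_target(p, cont);     // where the continued value goes
  hpx_parcel_set_cont_action(p, cont_act); // e.g. lco_set
  hpx_parcel_set_data(p, data, n);         // immediate payload
  hpx_parcel_send(p, HPX_NULL);            // asynchronous; like thread_create()
}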

Model: User-Level Threads

- Cooperative threads
  - Block on dynamic dependencies (lco_get, memput, etc.)
- Continuation-passing style
  - The progenitor parcel specifies the continuation target and action
  - A thread "continues" a value
  - Call/cc "pushes" a continuation parcel
- Isomorphic with parcels
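A minimal sketch of a cooperative thread in continuation-passing style, using the HPX_ACTION/HPX_THREAD_CONTINUE idioms from the Fibonacci slide later in this deck. The square action and its dep argument are hypothetical.

#include <hpx/hpx.h>

HPX_ACTION_DECL(square);

// Blocks cooperatively on an LCO dependency, then "continues" a value to
// whatever continuation the progenitor parcel named.
int square_handler(hpx_addr_t dep) {
  int n = 0;
  hpx_lco_get(dep, sizeof(n), &n);  // yields this thread until dep is set
  int out = n * n;
  return HPX_THREAD_CONTINUE(out);  // value flows to the parcel's continuation
}
HPX_ACTION(HPX_DEFAULT, 0, square, square_handler, HPX_ADDR);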

Model: Local Control Objects (LCOs)

- Abstract synchronization interface
  - Unified local/remote access
  - Threads: get, set, wait, reset, compound operations
  - Parcel sends can be made dependent on LCOs
- Built-in classes: futures, reductions, generation counts, semaphores, …
- User-defined classes: initialize, set handler, predicate
- Colocates data with control and synchronization
- Implements dataflow with parcel continuations
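A minimal producer/consumer sketch with a built-in future LCO. hpx_lco_set's lsync/rsync parameters are assumed from the HPX-5 API; produce is a hypothetical action.

#include <hpx/hpx.h>

HPX_ACTION_DECL(produce);

int produce_handler(hpx_addr_t fut) {
  int v = 42;
  hpx_lco_set(fut, sizeof(v), &v, HPX_NULL, HPX_NULL); // wakes any waiters
  return HPX_SUCCESS;
}
HPX_ACTION(HPX_DEFAULT, 0, produce, produce_handler, HPX_ADDR);

// The future's global address works identically whether produce runs on
// this locality or a remote one: unified local/remote access.
void consume(void) {
  hpx_addr_t fut = hpx_lco_future_new(sizeof(int));
  hpx_call(HPX_HERE, produce, HPX_NULL, fut); // spawn the producer
  int v = 0;
  hpx_lco_get(fut, sizeof(v), &v);            // block until set
  hpx_lco_delete_sync(fut);
}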

Control: Parallel Parcels and Threads

- Serial work: thread_continue, thread_call/cc
  - Happens-before ordering: Thread 1 < Thread 2
- Parallel work: parcel_send
  - Unordered: Thread 1 <> Thread 4
- Higher level: hpx_call, local parfor, hierarchical parfor

[Slide diagram: Thread 1 reaches Thread 2 via thread_continue(x), while parcel_send(p), parcel_send(q), and parcel_send(r) spawn unordered work such as Thread 4]
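A sketch of the fork/join shape these rules enable: hpx_call spawns unordered tasks (like parcel_send), and an and-gate LCO restores a happens-before edge at the join. hpx_lco_and_new and hpx_lco_wait are assumed from the HPX-5 API; work is a hypothetical action taking one int.

#include <hpx/hpx.h>

void spawn_parallel(hpx_action_t work, int n) {
  hpx_addr_t done = hpx_lco_and_new(n); // gate requiring n completions
  for (int i = 0; i < n; ++i) {
    hpx_call(HPX_HERE, work, done, i);  // unordered, like parcel_send
  }
  hpx_lco_wait(done);                   // every task happens-before this line
  hpx_lco_delete_sync(done);
}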

Control: LCO Synchronization

- Thread-thread synchronization
  - Traditional monitor-style synchronization
  - Dynamic output dependencies: blocked threads as continuations
- Data-flow execution
  - Pending parcels as continuations
  - Execution "consumes" the output; it can be manually regenerated for iterative execution
- Generic user-defined LCOs
  - Any set of continuations
  - Any function and predicate
  - Lazy evaluation of the function

[Slide diagram: threads block in lco_get on a future until lco_set releases them; an "and" LCO collects lco_set(a) and lco_set(b), evaluates f(a, b, …, x) when pred() holds, sets x, and fires parcel_send(p)]
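A hedged dataflow sketch matching the diagram: a pending parcel is attached to an and-gate and fires only when both inputs have been set. hpx_call_when and hpx_lco_and_set are assumed from the HPX-5 API; step is a hypothetical no-argument action.

#include <hpx/hpx.h>

void dataflow(hpx_action_t step) {
  hpx_addr_t gate = hpx_lco_and_new(2);          // needs two inputs
  hpx_call_when(gate, HPX_HERE, step, HPX_NULL); // pending parcel as continuation
  hpx_lco_and_set(gate, HPX_NULL);               // input a ready
  hpx_lco_and_set(gate, HPX_NULL);               // input b ready; step now runs
}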

Data Structures, Distribution

- Global linked data structures: graphs, trees, DAGs
- Global cyclic block arrays: locality(block address)
- Global user-defined distributions: locality[block address]
- Active GAS: a distributed directory allows blocks to be dynamically remapped away from their home localities
  - Application-specific explicit load balancing
  - Automatic load balancing through GAS tracing and graph partitioning (slow)
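A minimal sketch of a global cyclic block array and block addressing. hpx_gas_alloc_cyclic and hpx_addr_add are assumed from the HPX-5 API; visit is a hypothetical no-argument action.

#include <hpx/hpx.h>

#define BLOCKS 128
#define BSIZE  (1024 * sizeof(double))

void touch_block(hpx_action_t visit) {
  // Blocks are dealt round-robin across localities; under Active GAS the
  // directory may later remap any block away from its home.
  hpx_addr_t base = hpx_gas_alloc_cyclic(BLOCKS, BSIZE, 0);
  hpx_addr_t b7 = hpx_addr_add(base, 7 * BSIZE, BSIZE); // address block 7
  hpx_call(b7, visit, HPX_NULL); // the parcel runs local to block 7's data
}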

Fibonacci

fib(n) = fib(n-1) + fib(n-2)

HPX_ACTION_DECL(fib);

int fib_handler(int n) {
  if (n < 2) {
    return HPX_THREAD_CONTINUE(n);                  // sequential
  }
  int l = n - 1;
  int r = n - 2;
  hpx_addr_t lhs = hpx_lco_future_new(sizeof(int)); // GAS malloc
  hpx_addr_t rhs = hpx_lco_future_new(sizeof(int)); // GAS malloc
  hpx_call(HPX_HERE, fib, lhs, l);                  // parallel
  hpx_call(HPX_HERE, fib, rhs, r);                  // parallel
  hpx_lco_get(lhs, sizeof(int), &l);                // LCO synchronization
  hpx_lco_get(rhs, sizeof(int), &r);                // LCO synchronization
  hpx_lco_delete_sync(lhs);                         // GAS free
  hpx_lco_delete_sync(rhs);                         // GAS free
  int fn = l + r;
  return HPX_THREAD_CONTINUE(fn);                   // sequential
}
HPX_ACTION(HPX_DEFAULT, 0, fib, fib_handler, HPX_INT);
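For context, here is a hedged sketch of how the fib action above might be launched. The hpx_init/hpx_run/hpx_exit entry-point signatures vary across HPX-5 releases, and fib_main is hypothetical, so treat this as illustrative rather than as the deck's code.

#include <hpx/hpx.h>
#include <stdio.h>

HPX_ACTION_DECL(fib_main);

int fib_main_handler(int n) {
  int fn = 0;
  hpx_call_sync(HPX_HERE, fib, &fn, sizeof(fn), n); // spawn fib and wait
  printf("fib(%d) = %d\n", n, fn);
  hpx_exit(HPX_SUCCESS);                            // tear down the run
}
HPX_ACTION(HPX_DEFAULT, 0, fib_main, fib_main_handler, HPX_INT);

int main(int argc, char *argv[]) {
  hpx_init(&argc, &argv);       // boot the runtime on every locality
  int n = 10;
  return hpx_run(&fib_main, n); // run the entry action to completion
}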

Networking / Comms

Internal interfaces (Photon, Isend/Irecv):
- Preferred: put/get with remote completion
- Legacy: parcel send

Photon:
- RDMA put/get with remote-completion operations
- Native PSM (libfabric), IB verbs, uGNI, and sockets (libfabric)
- Parcel emulation through eager buffers, synchronized with fine-grained point-to-point locking

Isend/Irecv:
- MPI_THREAD_FUNNELED implementation
- PWC emulated through Isend/Irecv
- Portability and a legacy upgrade path

Networking / Comms

A key idea in the Photon library is put/get with completion:
- Minimal overhead to trigger a waiting thread via an LCO
- A useful paradigm when combined with an "unexpected active message" capability
- Essentially, parcel continuations (either already-running threads or yet-to-be-instantiated parcels) can be attached to both local and remote completion operations

Networking / Comms

One of the key lessons from HPX-5 is the power of memget/memput-with-completion primitives (and the associated low-level photon_pwc and photon_gwc calls), which provide a very powerful abstraction. One-sided operations in AMTs are not that useful by themselves; it is the ability to continue threads or spawn parcels on completion that delivers the performance benefit.
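A hedged sketch of that lesson in HPX-5 terms: instead of blocking on a put, attach a yet-to-be-instantiated parcel to the remote-completion LCO. hpx_gas_memput and hpx_call_when are assumed from the HPX-5 API; next is a hypothetical no-argument action.

#include <hpx/hpx.h>

void put_then_spawn(hpx_addr_t dst, const double *buf, size_t n,
                    hpx_action_t next) {
  hpx_addr_t rsync = hpx_lco_future_new(0);
  hpx_gas_memput(dst, buf, n * sizeof(double), HPX_NULL, rsync);
  // No blocking: `next` is spawned at dst once the data has landed,
  // i.e. a parcel continuation attached to remote completion.
  // (The future is leaked here for brevity; real code would reclaim it.)
  hpx_call_when(rsync, dst, next, HPX_NULL);
}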

Thank you

hpx.crest.iu.edu