HPX-5 ParalleX in Action


HPX-5: ParalleX in Action

Martin Swany
Associate Chair and Professor, Intelligent Systems Engineering
Deputy Director, Center for Research in Extreme Scale Technology (CREST)
Indiana University

ParalleX Execution Model

Core tenets:
- Fine-grained parallelism
- Hide latency with concurrency
- Runtime introspection and adaptation

Formal components:
- Global address space (shared-memory programming)
- Processes
- Compute complexes
- Lightweight control objects
- Parcels

Fully flexible, but promotes fine-grained dataflow programs. HPX-5 is based on ParalleX and is part of the Center for Shock-Wave Processing of Advanced Reactive Materials (C-SWARM) effort in PSAAP-II.

Model: Global Address Space

- Flat, byte-addressable global addresses
- Put/get with local and remote completion (see the sketch below)
- Active-message targets
- Array collectives
- Controls thread distribution and load balance

Current implementation:
- Block-based allocation
- Malloc/free with distribution (local, cyclic, user-defined, etc.)
- Traditional PGAS or directory-based AGAS
- High-performance local allocation (for high-frequency LCO allocation)
- Soft core affinity for NUMA
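To make put/get with completion concrete, here is a minimal C sketch against the HPX-5 API. hpx_gas_memput and hpx_lco_wait are assumed from the HPX-5 documentation and their exact signatures may vary across releases; hpx_lco_future_new and hpx_lco_delete_sync appear elsewhere in this deck.

#include <hpx/hpx.h>

// Put n ints into the global address dst, using two zero-size futures as
// the local- and remote-completion LCOs.
static void put_example(hpx_addr_t dst, const int *src, size_t n) {
  hpx_addr_t lsync = hpx_lco_future_new(0); // set when src is reusable
  hpx_addr_t rsync = hpx_lco_future_new(0); // set when data is visible at dst
  hpx_gas_memput(dst, src, n * sizeof(int), lsync, rsync);
  hpx_lco_wait(lsync);                      // safe to overwrite src now
  hpx_lco_wait(rsync);                      // put is globally visible now
  hpx_lco_delete_sync(lsync);
  hpx_lco_delete_sync(rsync);
}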

Model: Parcels

- Active messages with continuations
- Target: data action, global address, immediate data
- Continuation: data action, global address
  - lco_set, lco_delete, memput, free, etc.
- Execute local to the target address
- Unified local and remote execution model
  - send() is equivalent to thread_create()
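A hedged sketch of explicit parcel construction, showing the target/continuation pairing described above. The hpx_parcel_* calls are assumed from the HPX-5 headers; signatures may differ between releases.

#include <hpx/hpx.h>

// Build and send a parcel: `act` runs at `target`, and whatever value it
// continues is delivered to `cont_act` at `cont` (e.g. an LCO set).
static void send_parcel(hpx_addr_t target, hpx_action_t act,
                        hpx_addr_t cont, hpx_action_t cont_act,
                        const void *data, size_t n) {
  hpx_parcel_t *p = hpx_parcel_acquire(NULL, n);
  hpx_parcel_set_target(p, target);        // where the action executes
  hpx_parcel_set_action(p, act);           // what executes there
  hpx_parcel_set_cont_target(p, cont);     // where the continued value goes
  hpx_parcel_set_cont_action(p, cont_act); // e.g. lco_set
  hpx_parcel_set_data(p, data, n);         // immediate payload
  hpx_parcel_send(p, HPX_NULL);            // asynchronous; like thread_create()
}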

Model: User-Level Threads

- Cooperative threads
  - Block on dynamic dependencies (lco_get, memput, etc.)
- Continuation-passing style
  - The progenitor parcel specifies the continuation target and action
  - A thread "continues" a value
  - Call/cc "pushes" a continuation parcel
- Isomorphic with parcels
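A minimal sketch of a cooperative thread in continuation-passing style, using the HPX_ACTION/HPX_THREAD_CONTINUE idioms from the Fibonacci slide later in this deck. The square action and its dep argument are hypothetical.

#include <hpx/hpx.h>

HPX_ACTION_DECL(square);

// Blocks cooperatively on an LCO dependency, then "continues" a value to
// whatever continuation the progenitor parcel named.
int square_handler(hpx_addr_t dep) {
  int n = 0;
  hpx_lco_get(dep, sizeof(n), &n);  // yields this thread until dep is set
  int out = n * n;
  return HPX_THREAD_CONTINUE(out);  // value flows to the parcel's continuation
}
HPX_ACTION(HPX_DEFAULT, 0, square, square_handler, HPX_ADDR);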

Model: Local Control Objects (LCOs)

- Abstract synchronization interface
  - Unified local/remote access
  - Threads: get, set, wait, reset, compound operations
  - Parcel sends can be made dependent on LCOs
- Built-in classes: futures, reductions, generation counts, semaphores, …
- User-defined classes: initialize, set handler, predicate
- Colocates data with control and synchronization
- Implements dataflow with parcel continuations
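A minimal producer/consumer sketch with a built-in future LCO. hpx_lco_set's lsync/rsync parameters are assumed from the HPX-5 API; produce is a hypothetical action.

#include <hpx/hpx.h>

HPX_ACTION_DECL(produce);

int produce_handler(hpx_addr_t fut) {
  int v = 42;
  hpx_lco_set(fut, sizeof(v), &v, HPX_NULL, HPX_NULL); // wakes any waiters
  return HPX_SUCCESS;
}
HPX_ACTION(HPX_DEFAULT, 0, produce, produce_handler, HPX_ADDR);

// The future's global address works identically whether produce runs on
// this locality or a remote one: unified local/remote access.
void consume(void) {
  hpx_addr_t fut = hpx_lco_future_new(sizeof(int));
  hpx_call(HPX_HERE, produce, HPX_NULL, fut); // spawn the producer
  int v = 0;
  hpx_lco_get(fut, sizeof(v), &v);            // block until set
  hpx_lco_delete_sync(fut);
}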

Control: Parallel Parcels and Threads

- Serial work: thread_continue, thread_call/cc
  - Happens-before ordering: Thread 1 < Thread 2
- Parallel work: parcel_send
  - Unordered: Thread 1 <> Thread 4
- Higher level: hpx_call, local parfor, hierarchical parfor

[Slide diagram: Thread 1 reaches Thread 2 via thread_continue(x), while parcel_send(p), parcel_send(q), and parcel_send(r) spawn unordered work such as Thread 4]
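A sketch of the fork/join shape these rules enable: hpx_call spawns unordered tasks (like parcel_send), and an and-gate LCO restores a happens-before edge at the join. hpx_lco_and_new and hpx_lco_wait are assumed from the HPX-5 API; work is a hypothetical action taking one int.

#include <hpx/hpx.h>

void spawn_parallel(hpx_action_t work, int n) {
  hpx_addr_t done = hpx_lco_and_new(n); // gate requiring n completions
  for (int i = 0; i < n; ++i) {
    hpx_call(HPX_HERE, work, done, i);  // unordered, like parcel_send
  }
  hpx_lco_wait(done);                   // every task happens-before this line
  hpx_lco_delete_sync(done);
}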

Control: LCO Synchronization

- Thread-thread synchronization
  - Traditional monitor-style synchronization
  - Dynamic output dependencies: blocked threads as continuations
- Data-flow execution
  - Pending parcels as continuations
  - Execution "consumes" the output; it can be manually regenerated for iterative execution
- Generic user-defined LCOs
  - Any set of continuations
  - Any function and predicate
  - Lazy evaluation of the function

[Slide diagram: threads block in lco_get on a future until lco_set releases them; an "and" LCO collects lco_set(a) and lco_set(b), evaluates f(a, b, …, x) when pred() holds, sets x, and fires parcel_send(p)]
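A hedged dataflow sketch matching the diagram: a pending parcel is attached to an and-gate and fires only when both inputs have been set. hpx_call_when and hpx_lco_and_set are assumed from the HPX-5 API; step is a hypothetical no-argument action.

#include <hpx/hpx.h>

void dataflow(hpx_action_t step) {
  hpx_addr_t gate = hpx_lco_and_new(2);          // needs two inputs
  hpx_call_when(gate, HPX_HERE, step, HPX_NULL); // pending parcel as continuation
  hpx_lco_and_set(gate, HPX_NULL);               // input a ready
  hpx_lco_and_set(gate, HPX_NULL);               // input b ready; step now runs
}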

Data Structures, Distribution

- Global linked data structures: graphs, trees, DAGs
- Global cyclic block arrays: locality(block address)
- Global user-defined distributions: locality[block address]
- Active GAS: a distributed directory allows blocks to be dynamically remapped away from their home localities
  - Application-specific explicit load balancing
  - Automatic load balancing through GAS tracing and graph partitioning (slow)
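A minimal sketch of a global cyclic block array and block addressing. hpx_gas_alloc_cyclic and hpx_addr_add are assumed from the HPX-5 API; visit is a hypothetical no-argument action.

#include <hpx/hpx.h>

#define BLOCKS 128
#define BSIZE  (1024 * sizeof(double))

void touch_block(hpx_action_t visit) {
  // Blocks are dealt round-robin across localities; under Active GAS the
  // directory may later remap any block away from its home.
  hpx_addr_t base = hpx_gas_alloc_cyclic(BLOCKS, BSIZE, 0);
  hpx_addr_t b7 = hpx_addr_add(base, 7 * BSIZE, BSIZE); // address block 7
  hpx_call(b7, visit, HPX_NULL); // the parcel runs local to block 7's data
}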

Fibonacci

fib(n) = fib(n-1) + fib(n-2)

HPX_ACTION_DECL(fib);

int fib_handler(int n) {
  if (n < 2) {
    return HPX_THREAD_CONTINUE(n);                  // sequential
  }
  int l = n - 1;
  int r = n - 2;
  hpx_addr_t lhs = hpx_lco_future_new(sizeof(int)); // GAS malloc
  hpx_addr_t rhs = hpx_lco_future_new(sizeof(int)); // GAS malloc
  hpx_call(HPX_HERE, fib, lhs, l);                  // parallel
  hpx_call(HPX_HERE, fib, rhs, r);                  // parallel
  hpx_lco_get(lhs, sizeof(int), &l);                // LCO synchronization
  hpx_lco_get(rhs, sizeof(int), &r);                // LCO synchronization
  hpx_lco_delete_sync(lhs);                         // GAS free
  hpx_lco_delete_sync(rhs);                         // GAS free
  int fn = l + r;
  return HPX_THREAD_CONTINUE(fn);                   // sequential
}
HPX_ACTION(HPX_DEFAULT, 0, fib, fib_handler, HPX_INT);
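For context, here is a hedged sketch of how the fib action above might be launched. The hpx_init/hpx_run/hpx_exit entry-point signatures vary across HPX-5 releases, and fib_main is hypothetical, so treat this as illustrative rather than as the deck's code.

#include <hpx/hpx.h>
#include <stdio.h>

HPX_ACTION_DECL(fib_main);

int fib_main_handler(int n) {
  int fn = 0;
  hpx_call_sync(HPX_HERE, fib, &fn, sizeof(fn), n); // spawn fib and wait
  printf("fib(%d) = %d\n", n, fn);
  hpx_exit(HPX_SUCCESS);                            // tear down the run
}
HPX_ACTION(HPX_DEFAULT, 0, fib_main, fib_main_handler, HPX_INT);

int main(int argc, char *argv[]) {
  hpx_init(&argc, &argv);       // boot the runtime on every locality
  int n = 10;
  return hpx_run(&fib_main, n); // run the entry action to completion
}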

Networking / Comms

Internal interfaces (Photon, Isend/Irecv):
- Preferred: put/get with remote completion
- Legacy: parcel send

Photon:
- RDMA put/get with remote-completion operations
- Native PSM (libfabric), IB verbs, uGNI, and sockets (libfabric)
- Parcel emulation through eager buffers, synchronized with fine-grained point-to-point locking

Isend/Irecv:
- MPI_THREAD_FUNNELED implementation
- PWC emulated through Isend/Irecv
- Portability and a legacy upgrade path

Networking / Comms

A key idea in the Photon library is put/get with completion:
- Minimal overhead to trigger a waiting thread via an LCO
- A useful paradigm when combined with an "unexpected active message" capability
- Essentially, parcel continuations (either already-running threads or yet-to-be-instantiated parcels) can be attached to both local and remote completion operations

Networking / Comms

One of the key lessons from HPX-5 is the power of memget/memput-with-completion primitives (and the associated low-level photon_pwc and photon_gwc calls), which provide a very powerful abstraction. One-sided operations in AMTs are not that useful by themselves; it is the ability to continue threads or spawn parcels on completion that delivers the performance benefit.
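A hedged sketch of that lesson in HPX-5 terms: instead of blocking on a put, attach a yet-to-be-instantiated parcel to the remote-completion LCO. hpx_gas_memput and hpx_call_when are assumed from the HPX-5 API; next is a hypothetical no-argument action.

#include <hpx/hpx.h>

void put_then_spawn(hpx_addr_t dst, const double *buf, size_t n,
                    hpx_action_t next) {
  hpx_addr_t rsync = hpx_lco_future_new(0);
  hpx_gas_memput(dst, buf, n * sizeof(double), HPX_NULL, rsync);
  // No blocking: `next` is spawned at dst once the data has landed,
  // i.e. a parcel continuation attached to remote completion.
  // (The future is leaked here for brevity; real code would reclaim it.)
  hpx_call_when(rsync, dst, next, HPX_NULL);
}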

Thank you

hpx.crest.iu.edu