Chandra S. Martha, Min Lee 02/10/2016

Presentation transcript:

“MPI+X” support in OCR Chandra S. Martha, Min Lee 02/10/2016 Acknowledgment: This material is based upon work supported by the Department of Energy Office of Science under cooperative agreement DE-SC0008717 and Lawrence Livermore National Labs subcontract B608115.

OCR Today – Still Evolving

The good:
- Event-driven, task-based runtime: enables resiliency, introspection, and dynamic load balancing
- Platform agnostic: compute resources are virtualized; OCR can run on top of a homogeneous or heterogeneous set of compute resources

The bad:
- Still in the prototype phase
- Today’s OCR API is very limited: no rich API features, and it is not a high-level language
- As a result, the programmer must manage many “OCR” objects (events, datablocks, and tasks), which leads to many races, and garbage collection gets tricky
- The machine model is not clearly abstracted yet
- No easy way to create millions of tasks or impromptu collectives
- Native OCR apps are hard to understand (they resemble spaghetti code, since the task DAG can be expressed in very convoluted ways)
- Composability is hard: you need to understand the DAG fairly well to change the app

Effective migration via OCR to Exascale

For better community adoption, let OCR support legacy, intermediate, and revolutionary styles of refactoring effort:
- Legacy: MPI-lite on top of OCR. A subset of MPI is supported; long-lived EDTs.
- Intermediate: enable an “MPI+X” style directly on top of native OCR. Should work well for statically load-balanced, domain-decomposition workloads.
- Revolutionary: native OCR apps exposing the full level of parallelism to OCR. Should work well for dynamic, load-imbalanced workloads (Graph500, etc.).

The focus of this talk is extending the OCR API to support the “MPI+X” style natively.

“MPI+X” via OCR to enable Migration to Exascale

DOE is enamored with the “MPI+X” line of thinking, especially for domain-decomposition based applications. So, let’s take the best features of MPI:
- Machine model: a cluster of communicating “end-points”
- Asynchronous communications (i.e., no bulk-synchronous communication)

And the best features of “X”:
- Tasking, data-parallel tasks

(Note: ideally, we would like “X” to be supported beyond a coherency domain; the scope of “X” can be a homogeneous or heterogeneous set of compute resources.)

“MPI+X” via OCR to enable Migration to Exascale

Let’s borrow some features of MPI into OCR, viewing MPI not as a runtime but as a “standard” for message passing:
- An ordered set of MPI processes in a group becomes an ordered set of SPMD EDTs in a group
- Asynchronous point-to-point and collectives: enable communication among the SPMD EDTs through “implicit” dependencies (ranks, tags, communication world, etc.)
- Provide default communicator contexts, e.g., COMM_WORLD_VIRTUAL_NODES
  - A virtual node may be user-defined to suit the application; OCR might dynamically grow or shrink the node size
  - A virtual node could be a coherency domain, a group of coherency domains, a TG chip, or a Xeon node with a co-processor

And borrow some features of “X” (OpenMP/OpenCL/Kokkos, etc.) for “OCR-lite”:
- Tasks with explicit dependencies (oversubscription)
- Virtualized tasks (can be mapped to a uniform or heterogeneous node to keep all resources busy)
- Data-sharing constructs, loop schedulers, etc.

Preserve the affinity relationship: all tasks, datablocks, and events (OCR objects) should inherit the affinity properties of the parent SPMD EDT.
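As a concrete point of reference for the “tasks with explicit dependencies” feature being borrowed, here is a minimal sketch in OpenMP, one of the candidate “X” models named above. This is illustrative only, not OCR code; the function name is invented for the example.

```c
#include <assert.h>

/* Illustrative sketch of "X"-style tasks with explicit dependencies,
 * shown in OpenMP (one of the candidate "X" models on this slide).
 * The depend clauses order the two tasks: the consumer cannot start
 * until the producer has written 'a'. */
static int run_dependent_tasks(void)
{
    int a = 0, b = 0;
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                  /* producer task */

        #pragma omp task depend(in: a) depend(out: b)
        b = a + 1;              /* consumer: runs only after the producer */
    }                           /* implicit barrier: both tasks complete here */
    return b;
}
```

Because the dependence edge is declared explicitly, the result is deterministic (b == 2) regardless of how many threads the runtime uses; the same code compiles and runs serially if OpenMP is disabled.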

“MPI+X” via OCR to enable Migration to Exascale

What might we need in OCR to support the “MPI+X” style? Augment the OCR API so that “MPI + X” becomes “OCR + OCR-lite”:
- Machine model: OCR across “virtual” nodes, plus “OCR-lite” within each “virtual” node

[Slide diagram: a grid of virtual nodes]

“MPI+X” via OCR to enable Migration to Exascale

The application DAG now becomes:
- OCR-lite: a local DAG, with most edges staying within the virtual node
- OCR: some edges crossing virtual-node boundaries to enable communication

Note that most DAG edges stay within a virtual node.

[Slide diagram: virtual nodes with mostly node-local DAG edges and a few edges crossing node boundaries]
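The node-local vs. cross-node split above can be made concrete with a small sketch. These are hypothetical helpers, not part of the OCR API, and they assume a fixed number of SPMD EDT ranks per virtual node:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helpers (not OCR API): map a global SPMD EDT rank to its
 * virtual node, assuming a fixed number of ranks per node, and classify
 * a DAG edge as node-local ("OCR-lite") or cross-node ("OCR"). */
static int rank_to_node(int globalRank, int ranksPerNode)
{
    return globalRank / ranksPerNode;
}

static bool edge_is_local(int srcRank, int dstRank, int ranksPerNode)
{
    /* An edge stays inside a virtual node iff both endpoints map there. */
    return rank_to_node(srcRank, ranksPerNode)
        == rank_to_node(dstRank, ranksPerNode);
}
```

With 4 ranks per virtual node, an edge from rank 1 to rank 3 is an OCR-lite (local) edge, while an edge from rank 3 to rank 4 crosses into the next node and becomes an OCR communication edge.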

SPMD EDTs

SPMD EDTs are created through hints:

    ocrTemplateCreate( &SPMD_Template, FNC_SPMD, paramc, depc );

    // Set up a strong “hint” for the runtime about the SPMD EDT
    ocrHint_t SPMD_hint;
    ocrHintInit( &SPMD_hint, OCR_HINT_EDT_T );
    nRanks = 1e6;
    ocrSetHintValue( &SPMD_hint, OCR_HINT_EDT_NRANKS, nRanks );
    // E.g., the communication world of OCR policy domains, or a derived communicator
    ocrSetHintValue( &SPMD_hint, OCR_HINT_EDT_COMM, OCR_COMM_WORLD_PDS );

    ocrEdtCreate( &SPMD_EDT_labelled_guid, SPMD_Template,
                  EDT_PARAM_DEF, paramv, EDT_DEP_DEF, depv,
                  EDT_PROP_SPMD, &SPMD_hint, &EVT_out_labelled_guid );

[Slide diagram: a group of SPMD EDTs]

SPMD EDT: Usage scenario – (1/3 slides)

    ocrGuid_t FNC_SPMD( u32 paramc, u64* paramv, u32 depc, ocrEdtDep_t depv[] )
    {
        // Get the SPMD hint back; use it to figure out which communicator
        // context this EDT belongs to
        ocrCommSize( OCR_COMM_WORLD_PDS, &nRanks );
        ocrCommRank( OCR_COMM_WORLD_PDS, &myEdtRank );

        if( myEdtRank % 2 == 0 ) {
            // OCR defines “FNC_ocrSend” as part of the API; hence paramv and
            // depv must follow the spec:
            ocrTemplateCreate( &TML_ocrSend, FNC_ocrSend, paramc, depc );
            paramv[] = {count, datatype, destinationEDTRank, tag, OCR_COMM_WORLD_PDS};
            depv[]   = {DBK_sendbuf, EVT_trigger};
            ocrEdtCreate( &EDT_ocrSend, TML_ocrSend, EDT_PARAM_DEF, &paramv,
                          EDT_DEP_DEF, &depv, EDT_PROP_COMM, NULL_HINT,
                          &EVT_OUT_ocrSend );
        }
        // Continued on the next slide

SPMD EDT: Usage scenario (cont.) – (2/3 slides)

    ocrGuid_t FNC_SPMD( u32 paramc, u64* paramv, u32 depc, ocrEdtDep_t depv[] )
        // Continued from the previous slide
        if( myEdtRank % 2 == 1 ) {
            // OCR defines “FNC_ocrRecv” as part of the API; hence paramv and
            // depv must follow the spec:
            ocrTemplateCreate( &TML_ocrRecv, FNC_ocrRecv, paramc, depc );
            paramv[] = {count, datatype, sourceEDTRank, tag, OCR_COMM_WORLD_PDS};
            depv[]   = {DBK_recvbuf, EVT_trigger};
            ocrEdtCreate( &EDT_ocrRecv, TML_ocrRecv, EDT_PARAM_DEF, &paramv,
                          EDT_DEP_DEF, &depv, EDT_PROP_COMM, NULL_HINT,
                          &EVT_OUT_ocrRecv );
        }
        // Continued on the next slide

SPMD EDT: Usage scenario (cont.) – (3/3 slides)

    ocrGuid_t FNC_SPMD( u32 paramc, u64* paramv, u32 depc, ocrEdtDep_t depv[] )
        // Continued from the previous slide
        // OCR defines “FNC_ocrBarrier” as part of the API; hence paramv and
        // depv must follow the spec:
        ocrTemplateCreate( &TML_ocrBarrier, FNC_ocrBarrier, paramc, depc );
        paramv[] = {OCR_COMM_WORLD_PDS};
        depv[]   = {EVT_trigger};
        ocrEdtCreate( &EDT_ocrBarrier, TML_ocrBarrier, EDT_PARAM_DEF, &paramv,
                      EDT_DEP_DEF, &depv, EDT_PROP_COMM, NULL_HINT,
                      &EVT_OUT_ocrBarrier );

        if( myEdtRank == 0 ) {
            // Create a wrap-up task that depends on the event EVT_OUT_ocrBarrier
            // and calls ocrShutdown()
        }
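The rank pairing that drives the three slides above reduces to plain integer logic: even ranks create a send EDT targeting rank+1, odd ranks create a receive EDT expecting rank-1, and all ranks then join the barrier. A minimal sketch in plain C, with no OCR runtime; rank_role and partner_of are invented names for illustration, and an even total rank count is assumed:

```c
#include <assert.h>

/* Plain-C sketch of the rank pairing used in the usage scenario:
 * even ranks send to rank+1, odd ranks receive from rank-1.
 * rank_role/partner_of are illustrative names, not OCR API;
 * assumes nRanks is even so every rank has a partner. */
typedef enum { ROLE_SEND, ROLE_RECV } role_t;

static role_t rank_role(int myEdtRank)
{
    return (myEdtRank % 2 == 0) ? ROLE_SEND : ROLE_RECV;
}

static int partner_of(int myEdtRank)
{
    return (myEdtRank % 2 == 0) ? myEdtRank + 1 : myEdtRank - 1;
}
```

The pairing is symmetric (my partner’s partner is me), so every send EDT created by an even rank has exactly one matching receive EDT on the adjacent odd rank before the collective barrier.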