LLNL Summer School 07/08/2014 What is OCR? Traleika Glacier Team (presenters: Romain Cledat & Bala Seshasayee) July 8, 2014 https://xstack.exascale-tech.com/wiki/


This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

OCR
OCR: Open Community Runtime
– Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
The term ‘OCR’ is used to refer to
– A programming model
– A user-level API
– A runtime framework
– One of several reference runtime implementations
In this talk
– Presentation of the programming model
– Presentation of the API and implementations through demos

Traleika Glacier (TG) X-Stack project goals
Design a software stack to meet Exascale goals
– Target a strawman architecture
– Provide a programming model, API, reference implementation and tools
Concerns
– Extreme hardware parallelism
– Data locality
– Fine-grained resource management
– Resiliency
– Power and energy, not just performance
– Platform independence

Dataflow programming model
[Figure: data-flow graph for Fib(N) made of mainEdt, fibIterEdt, sumEdt and doneEdt, with datablocks labelled N, N-1, N-2, Fib(N-1), Fib(N-2) and Fib(N). The runtime maps the constructed data-flow graph to the architecture (shared LLC, interconnect).]
Legend
– EDT: a non-blocking unit of work, runnable once all dependences are satisfied
– Datablock: data shared between EDTs
– Creation link: source EDT creates destination
– Dependence: source EDT satisfies one of destination’s dependences
– A link can be both a creation and a dependence link
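The Fib(N) graph on this slide can be imitated with a very small scheduler. The following is only a sketch in Python, not OCR code: `Edt`, `create`, `satisfy`, `spawn_fib` and `run_fib` are hypothetical names, and a single-threaded ready queue stands in for the runtime. It shows tasks that never block, that create other tasks, and that become runnable only once all their dependences are satisfied.

```python
from collections import deque

class Edt:
    """Hypothetical stand-in for an EDT: runnable once every slot is filled."""
    def __init__(self, func, num_deps):
        self.func = func
        self.slots = [None] * num_deps
        self.missing = num_deps

ready = deque()  # tasks whose dependences are all satisfied

def create(func, num_deps):
    edt = Edt(func, num_deps)
    if num_deps == 0:
        ready.append(edt)
    return edt

def satisfy(edt, slot, value):
    """Fill one dependence slot; schedule the task when none are missing."""
    edt.slots[slot] = value
    edt.missing -= 1
    if edt.missing == 0:
        ready.append(edt)

def spawn_fib(n, target, slot):
    """Create the fibIter task for n; its result goes to target's slot."""
    def fib_iter_edt():
        if n < 2:
            satisfy(target, slot, n)
            return
        def sum_edt(a, b):
            satisfy(target, slot, a + b)
        s = create(sum_edt, 2)       # creation link
        spawn_fib(n - 1, s, 0)       # dependence link into slot 0
        spawn_fib(n - 2, s, 1)       # dependence link into slot 1
    create(fib_iter_edt, 0)

def run_fib(n):
    result = []
    done = create(lambda value: result.append(value), 1)  # plays doneEdt
    spawn_fib(n, done, 0)
    while ready:                     # the "runtime": run whatever is ready
        edt = ready.popleft()
        edt.func(*edt.slots)
    return result[0]
```

Calling `run_fib(10)` drains the queue and returns the value delivered to the final task. No task ever waits inside its body; waiting is expressed only through unsatisfied slots.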

OCR level of abstraction

TBB user-friendly API:

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<size_t>( 1, n ), avg );
    }

hides…

    if( !range.empty() ) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

which in turn hides…

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

OCR’s level of abstraction is at the very bottom.

High-level OCR concepts
Event Driven Task (EDT)
– Distinct from the notion of a thread/core
– Executes when all required data-blocks have been provided to it
– Creates other EDTs and provides data-blocks to them
Data: globally visible namespace of data-blocks
– Explicitly created and destroyed
– Only available “global” memory
– Data-blocks can move
Dependence (between EDT 1 and EDT 2)
– EDT 1 provides data to EDT 2
– EDT 1 creates EDT 2
– Visible to the runtime
[Figure: an EDT with its accessible data-blocks as input; it produces data-blocks for other EDTs and creates other EDTs.]

Example 1: Producer/Consumer
– Dynamic dependence construction
– Producer and consumer need not know about each other
– Focus on minimum needed for placement and scheduling
[Figure, concept and OCR views: Producer EDT, Data, Consumer EDT.]

OCR execution model
EDTs
– An EDT executes after all its dependences are satisfied
– The number of dependences must be known at creation time
– Dependence satisfaction can occur in any order
– An EDT can, during its execution:
  – Create other EDTs and datablocks (DBs)
  – Manipulate the dependence graph for future (not ready) EDTs
  – Access its stack and an ephemeral local heap, but NO global state outside datablocks
  – Access datablocks passed in as a dependence or created by the EDT
DBs
– Contiguous block of global memory visible to any EDT (via acquire/release semantics)
– OCR enforces no restrictions on its access

Example 2a: Simple synchronization
Steps 1, 2-a and 2-b need not know about each other; they may all have been created by another EDT.
[Figure. Concept view: Step 1 EDT, Step 2-a EDT, Step 2-b EDT. OCR view: the same three EDTs connected through event Evt1.]

Events
– Events are used to satisfy one or more of an EDT’s dependences
– Events are also used to change the flow graph dynamically
– Events can be used for
  – Data dependence: by passing datablocks during satisfaction
  – Pure control dependence (as in the example)
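A pure control dependence through an event, as in Example 2a, can be sketched as follows. This is a Python model with hypothetical names (`Event`, `Task`, `add_dependence`, `satisfy`); the event here fires once and remembers its payload, which is an assumption of the sketch rather than a description of OCR's event types.

```python
class Event:
    """Hypothetical event: keeps its dependents until it is satisfied."""
    def __init__(self):
        self.dependents = []   # (task, slot) pairs waiting on this event
        self.fired = False
        self.payload = None

class Task:
    """Hypothetical task that 'runs' by appending its name to a shared log."""
    def __init__(self, name, num_deps, log):
        self.name, self.log = name, log
        self.slots = [None] * num_deps
        self.pending = num_deps

    def satisfy_slot(self, slot, payload):
        self.slots[slot] = payload
        self.pending -= 1
        if self.pending == 0:
            self.log.append(self.name)

def add_dependence(event, task, slot):
    """Connect an event to one slot of a task."""
    if event.fired:
        task.satisfy_slot(slot, event.payload)
    else:
        event.dependents.append((task, slot))

def satisfy(event, payload=None):
    """Fire the event; payload None models a control-only dependence."""
    event.fired, event.payload = True, payload
    for task, slot in event.dependents:
        task.satisfy_slot(slot, payload)
    event.dependents.clear()

log = []
evt1 = Event()
step_2a = Task("step 2-a", 1, log)
step_2b = Task("step 2-b", 1, log)
add_dependence(evt1, step_2a, 0)
add_dependence(evt1, step_2b, 0)

log.append("step 1")   # step 1 does its work ...
satisfy(evt1)          # ... then releases both successors, passing no data
```

Step 1 only holds a reference to the event, and the two successors only have a slot wired to it, so none of the three needs to know the others exist.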

Example 2b: Multiple dependences
Slots are used to track dependences of an EDT.
[Figure. Concept view: Step 1a EDT and Step 1b EDT, then Step 2 EDT. OCR view: the same EDTs connected through events E1 and E2.]

Slots of an EDT
– Slots represent the dependences of an EDT, with 1 slot per dependence
– Slots are initially unsatisfied; when an event along a dependence fires, the slot is satisfied
– An EDT can be run once all its slots are satisfied, with the order of satisfaction unimportant
– Slots also “carry” the arguments of an EDT, making the ordering of slots important
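The last two points can look contradictory, so here is a small Python sketch of the difference between the order in which slots are satisfied and the index of each slot. `run_task` and the values used are hypothetical; the task body is an arbitrary non-commutative expression chosen so that argument position matters.

```python
import itertools

def run_task(satisfactions, num_slots=3):
    """Apply (slot, value) satisfactions in sequence; run when none remain."""
    slots = [None] * num_slots
    unsatisfied = num_slots
    result = None
    for slot, value in satisfactions:
        slots[slot] = value
        unsatisfied -= 1
        if unsatisfied == 0:
            a, b, c = slots          # argument position is the slot index
            result = (a - b) * c
    return result

wiring = [(0, 10), (1, 3), (2, 2)]

# Same wiring, every possible satisfaction order: one single outcome.
any_order = {run_task(p) for p in itertools.permutations(wiring)}

# Different wiring (values 10 and 3 attached to the other slots).
swapped = run_task([(1, 10), (0, 3), (2, 2)])
```

`any_order` contains a single value because timing does not matter, while `swapped` differs because the wiring, not the timing, decides which argument each value becomes.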

Example 3a: Data dependences do not imply ordering
Parallel_1 and Parallel_2 access the same datablock; OCR does not enforce any ordering between their accesses.
[Figure. Concept view: Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Shared Data. OCR view: the same EDTs and Shared Data, connected through events E1 and E2.]

Example 3b: Single assignment update
[Figure. Concept view: Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data. OCR view: Setup EDT with Data; Parallel_1 EDT with Data1; Parallel_2 EDT with Data2; Wrapup EDT with Data1 and Data2.]
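One way to read the single assignment variant: instead of both parallel steps writing into one shared block, each writes a block of its own and the wrap-up step depends on both. The Python sketch below uses threads and immutable tuples to stand in for EDTs and datablocks; the function names mirror the slide labels, but the computation (sum of squares) is an illustrative assumption.

```python
from concurrent.futures import ThreadPoolExecutor

def setup():
    """Create the input block once; it is never modified afterwards."""
    return tuple(range(8))

def parallel_1(data):
    """Reads the input and writes only its own output block (Data1)."""
    return tuple(x * x for x in data[:4])

def parallel_2(data):
    """Reads the input and writes only its own output block (Data2)."""
    return tuple(x * x for x in data[4:])

def wrapup(data1, data2):
    """Runs only after both blocks exist, so no access ordering is needed."""
    return sum(data1) + sum(data2)

def run():
    data = setup()
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(parallel_1, data)
        f2 = pool.submit(parallel_2, data)
        return wrapup(f1.result(), f2.result())
```

Because every block is written by exactly one task, the result of `run()` does not depend on how the two parallel steps interleave, which is the property Example 3a does not give.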

Example 4: FFT with a Finish-EDT
[Figure, OCR view: FibIter(n) EDT, FibIter(n-1) EDT, FibIter(n-2) EDT, Sum(n) EDT, Output(n) EDT, Result.]

R: TODO: Explanation/description of finish EDT

R: TODO: OCR execution model for runtime EDTs
Runtime EDTs
– Created by the runtime to handle more complex synchronization situations
– 0..N pre slots
  – Slots are initially “unconnected” and “unsatisfied”
– Runtime EDTs have a “trigger” rule that determines when they “satisfy” their outgoing edges and what gets propagated
Finish EDT (TODO: update description)
– 2 pre slots: “waiting-on” count and current count
– When: satisfy outgoing edges when the number of satisfies on both pre slots matches (similar to reference count in TBB)
– What: NULL (incoming data-blocks are ignored)
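The two-counter trigger rule can be illustrated with a small latch. This Python sketch is an interpretation of the slide, not the OCR implementation: `FinishLatch`, `register`, `complete` and `run_tree` are hypothetical names, and the rule that a child is registered before its parent completes is an assumption added so that the counts cannot match too early.

```python
class FinishLatch:
    """Fires once when completions catch up with registrations."""
    def __init__(self, on_done):
        self.registered = 0    # plays the "waiting-on" count
        self.completed = 0     # plays the current count
        self.fired = False
        self.on_done = on_done

    def register(self):
        self.registered += 1

    def complete(self):
        self.completed += 1
        if not self.fired and self.completed == self.registered:
            self.fired = True
            self.on_done(None)  # nothing is propagated, only the signal

def run_tree(depth, fanout):
    """Run a tree of tasks under one latch; return (tasks run, signals)."""
    signals = []
    latch = FinishLatch(signals.append)
    work = [depth]
    latch.register()                 # the root task
    executed = 0
    while work:
        level = work.pop()
        executed += 1
        if level > 0:
            for _ in range(fanout):
                latch.register()     # child counted before parent completes
                work.append(level - 1)
        latch.complete()
    return executed, signals
```

With `run_tree(2, 2)` the latch sees seven registrations and fires exactly once, after the seventh completion, regardless of the order in which the pending tasks are popped.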

API cheat sheet
EDT
– Templates: ocrEdtTemplateCreate(), ocrEdtTemplateDestroy()
– Tasks: ocrEdtCreate(), ocrEdtDestroy()
DBs
– Datablock management: ocrDbCreate(), ocrDbDestroy()
– Datablock usage: ocrDbAcquire(), ocrDbRelease()
Events
– Event management: ocrEventCreate(), ocrEventDestroy()
– Event satisfaction: ocrEventSatisfy(), ocrEventSatisfySlot()
– Dependence definition: ocrAddDependence()
Misc
– Entry point of OCR: mainEdt()
– Shutdown: ocrShutdown()

OCR ecosystem
[Figure: layered diagram with the following labels]
– Programming platforms: C, Array DSL, CnC, Hero Code, HC, HTA, PIL; CnC Translator, HC Compiler, R-Stream
– OCR API + Tuning Annotations
– Open Community Runtime
– OCR implementations: OCR targeting TG, OCR targeting x86
– Low-level compilers: LLVM, GCC
– Platforms: FSim (TG Architecture), x86, Cluster
– Evaluation platforms

List of applications available on OCR
Handwritten
– LULESH
– Fast Fourier Transform, Stream, HPCG
– Synthetic Aperture Radar (SAR)
– Cholesky factorization
– CoMD
Generated by high-level tools
– LULESH (from Concurrent Collections)
– NAS Parallel Benchmarks (from Hierarchical Tile Array): FFT, Conjugate Gradient, Integer Sort, Embarrassingly Parallel, etc.
– Jacobi, Successive Over-relaxation, etc. (R-Stream)

Case Study: FFT in OCR

Background
– Final-year undergraduate project at Oregon State University
– OCR implementation of Fast Fourier Transform
  – Cooley-Tukey algorithm
  – Evolution from serial version
  – OCR behavior

Algorithm
– Divide-and-conquer
– Data-flow friendly
[Figure; source: Wikimedia Commons]

Serial implementation
[Figure; source: Wikimedia Commons]

Naïve implementation
[Figure; source: Wikimedia Commons]

Bounded implementation
[Figure; source: Wikimedia Commons]

Bounded implementation with datablock
[Figure; source: Wikimedia Commons]

Behavior
[Table: Serial, Naïve parallel, Bounded parallel and Bounded parallel w/ datablocks, compared by No. of EDTs, mean EDT longevity (us), load variance across cores (%) and running time (s)]
– OCR x86 running FFT on a 2^32-sized dataset
  – 2.9GHz Xeon, 16 cores; 8 cores made available to OCR
– Balance to be achieved between number and size of EDTs

Summary
– Serial implementation
– Naïve parallelization: recursive division of DFT
– Bounded parallelization: division bounded by a working set size
– Bounded parallelization with datablocks: additionally, use 3 datablocks (input, real, imaginary portions)
– Possible next steps for better parallelism
  – Finer datablocks
  – Staggered creation of EDTs in the combination phase
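The difference between the naïve and the bounded version can be sketched with a recursive radix-2 Cooley-Tukey FFT that stops dividing below a threshold. The Python below is a serial illustration, not the project's OCR code: each recursive call stands in for one EDT, and `leaf_size` and the `stats` counter are hypothetical names for the working set bound and the task count.

```python
import cmath

def dft(x):
    """Direct O(n^2) discrete Fourier transform, used at the leaves."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def fft_bounded(x, leaf_size, stats):
    """Radix-2 Cooley-Tukey; len(x) must be a power of two, leaf_size >= 1."""
    stats["tasks"] += 1                # one call models one task
    n = len(x)
    if n <= leaf_size:
        return dft(x)                  # bounded: stop dividing here
    even = fft_bounded(x[0::2], leaf_size, stats)
    odd = fft_bounded(x[1::2], leaf_size, stats)
    out = [0j] * n
    for k in range(n // 2):            # combination phase
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

signal = [complex(i % 5, 0) for i in range(16)]
naive_stats = {"tasks": 0}
bounded_stats = {"tasks": 0}
naive = fft_bounded(signal, 1, naive_stats)      # divide down to single points
bounded = fft_bounded(signal, 4, bounded_stats)  # stop at blocks of 4 points
```

For 16 points the naïve setting creates 31 tasks and the bounded one 7, with the same spectrum. That is the trade-off between the number and the size of tasks that the slides point at.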

TODO: Take-aways
– OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
– Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
– Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach

Areas of investigation
– Development of a specification:
  – Memory model
– Tuning hints and annotations
– More expressive support for collectives

Backup

Strawman architecture
– Heterogeneous, hierarchical architecture
– Tapered memory bandwidth
– Global, shared address space
– Software managed non-coherent memories
– Functional simulator available
[Figure labels]
– Execution Engine (XE), application specific: DP FP FMAC, 32KB I$, 64KB SP, RF?
– Control Engine (CE), system SW: GP Int, 32KB I$, 64KB SP, RF?
– Block (8 XE + CE), 1MB shared L2
– Cluster (16 Blocks)
– 8MB Shared LLC, Interconnect
– Processor Chip (16 Clusters)

OCR concepts: building blocks
– EDT: N pre slots (N known at creation time); optional attached “completion event”
– Data: no pre slots; post slot always “satisfied”
– Evt: N pre slots (N fixed by type of event, NOT determined by user); post slot initially “unsatisfied”
– A slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure labels: pre slots; post slots (multiple connections)]

OCR concepts: add dependence
– Argument 1: Data OR Evt
– Argument 2: Evt OR EDT
– Result: connected (1 of 4 possible combinations)

OCR concepts: satisfy
– Argument 1: EDT OR Evt
– Argument 2: Data OR NULL
– Result: satisfied/triggered (1 of 4 possible combinations)

Example 1: Producer/Consumer
– Dynamic dependence construction
– Producer and consumer never know about each other
– Focus on minimum needed for placement and scheduling
[Figure. Concept view: Producer EDT, Data, Consumer EDT. OCR view: Producer EDT, Evt, Consumer EDT, Data, with calls (1) dbCreate, (2) edit Data, (3) satisfy and (*) addDep. Legend: who executes call; data dependence; control dependence.]
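The numbered calls in the figure can be traced with a toy model. The Python sketch below uses hypothetical classes (`Datablock`, `Event`, `Runtime`) and is not the OCR API; comments map each line to the step labels on the slide. The producer is handed only an event and the consumer only receives a datablock, so neither refers to the other.

```python
class Datablock:
    """Hypothetical datablock: a fixed-size buffer."""
    def __init__(self, size):
        self.buf = [0] * size

class Event:
    """Hypothetical event carrying at most one datablock."""
    def __init__(self):
        self.waiters, self.fired, self.db = [], False, None

class Runtime:
    """Collects (task, datablock) pairs that have become runnable."""
    def __init__(self):
        self.ready = []

    def add_dependence(self, event, edt):
        if event.fired:
            self.ready.append((edt, event.db))
        else:
            event.waiters.append(edt)

    def satisfy(self, event, db):
        event.fired, event.db = True, db
        self.ready.extend((edt, db) for edt in event.waiters)
        event.waiters.clear()

def producer(rt, out_event):
    db = Datablock(4)              # (1) dbCreate
    db.buf[:] = [1, 2, 3, 4]       # (2) edit Data
    rt.satisfy(out_event, db)      # (3) satisfy

def consumer(db):
    return sum(db.buf)

def main():
    rt = Runtime()
    evt = Event()
    rt.add_dependence(evt, consumer)   # (*) addDep, by whoever wires the graph
    producer(rt, evt)
    return [edt(db) for edt, db in rt.ready]
```

`main()` plays the role of a third party that builds the graph; swapping in a different consumer requires no change to `producer`.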

Example 2: Simple synchronization
Control dependence is no different from a data dependence.
[Figure. Concept view: Step 1 EDT, Step 2-a EDT, Step 2-b EDT. OCR view: Step 1 EDT, Evt, Step 2-a EDT, Step 2-b EDT, with calls (1) satisfy (payload NULL) and (*) addDep.]

Example 3: In place parallel update
[Figure. Concept view: Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data. OCR view: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Finish EDT, Wrapup EDT, with calls (1) dbCreate, (1) edtCreate, (2) addDep, (3) edtCreate, (3) edtCreate, (4) addDep.]

Example 4: Single assignment update
[Figure. Concept view: Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Data, Data1, Data2. OCR view: the same EDTs with Data, Data1, Data2, Evt1 and Evt2, and calls (1) dbCreate, (1) edtCreate, (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate, (5) satisfy.]

Preliminary results
– On some code, OCR matches or bests OMP
– Simple scheduler, no data-blocks (very preliminary but promising)

OCR vs other solutions
Aspect | CnC | MPI | OCR | OpenMP | TBB
Execution model | Tasks | Bulk Sync | Fine-grained tasks | Bulk Sync | Tasks
Memory model | Shared memory | Explicit message passing | Explicit; global | Shared memory | Shared memory
Separation of concerns? | Yes | No | Yes | No | Yes (but can dig deeper)
Synchronization | ? | API | Explicit | Implicit & Explicit mechanisms | ?