What is OCR?
LLNL Summer School, July 8, 2014
TG Team (presenters: Romain Cledat & Bala Seshasayee)
https://xstack.exascale-tech.com/wiki/


This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

OCR
OCR: Open Community Runtime
  – Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
The term ‘OCR’ is used to refer to
  – A programming model
  – A user-level API
  – A runtime framework
  – One of several reference runtime implementations
In this talk
  – Presentation of the programming model
  – Presentation of the API and implementations through demos

TG X-Stack project goals
Design a software stack to meet Exascale goals
  – Target a strawman architecture
  – Provide a programming model, API, reference implementation and tools
Concerns
  – Extreme hardware parallelism
  – Data locality
  – Fine grained resource management
  – Resiliency
  – Power and energy and not just performance
  – Platform independence

Dataflow programming model
[Figure: data-flow graph for Fib(N), built from mainEdt, fibIterEdt, sumEdt and doneEdt, with datablocks N, N-1, N-2, Fib(N-1), Fib(N-2) and Fib(N). The runtime maps the constructed data-flow graph to the architecture (shared LLC, interconnect).]
Legend
  – EDT: a non-blocking unit of work. Runnable once all pre-slots are satisfied.
  – Datablock: data shared between EDTs
  – Creation link: source EDT creates destination
  – Event/Data link: source EDT provides data to the destination
  – Both creation and event/data link
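The Fibonacci graph on this slide can be imitated with a toy scheduler to show how a data-flow graph unfolds dynamically. The sketch below is plain Python and not the OCR API: `ToyTask`, `ready` and `run_fib` are hypothetical names, and the function names only mirror the EDT labels in the figure. Tasks become runnable once all their pre-slots are satisfied, and a running task may create further tasks.

```python
from collections import deque

ready = deque()  # toy scheduler queue: tasks whose pre-slots are all satisfied


class ToyTask:
    """Hypothetical stand-in for an EDT; not the OCR API."""

    def __init__(self, func, num_pre_slots, params):
        self.func = func
        self.params = params
        self.args = [None] * num_pre_slots
        self.pending = num_pre_slots
        if self.pending == 0:          # nothing to wait for: runnable at once
            ready.append(self)

    def satisfy(self, slot, value):
        self.args[slot] = value
        self.pending -= 1
        if self.pending == 0:          # last pre-slot satisfied
            ready.append(self)


def fib_iter_edt(params, args):
    n, dest, slot = params
    if n < 2:
        dest.satisfy(slot, n)          # base case: provide data directly
        return
    total = ToyTask(sum_edt, 2, (dest, slot))    # waits for both halves
    ToyTask(fib_iter_edt, 0, (n - 1, total, 0))
    ToyTask(fib_iter_edt, 0, (n - 2, total, 1))


def sum_edt(params, args):
    dest, slot = params
    dest.satisfy(slot, args[0] + args[1])


def done_edt(params, args):
    params.append(args[0])             # params is the result list


def run_fib(n):
    result = []
    done = ToyTask(done_edt, 1, result)
    ToyTask(fib_iter_edt, 0, (n, done, 0))       # the role mainEdt plays
    while ready:                       # the "runtime" loop
        task = ready.popleft()
        task.func(task.params, task.args)
    return result[0]


print(run_fib(10))  # 55
```

No task ever blocks: a task that cannot finish its work creates successors and returns, which is the property the legend calls "non-blocking".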

OCR level of abstraction
TBB user-friendly API:
    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range ( 1, n ), avg );
    }
hides…
    if(!range.empty()) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }
    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 ) local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }
OCR’s level of abstraction is at the very bottom

OCR concepts
Common
  – All objects globally and uniquely identifiable and relocatable
Computation: Event Driven Task (EDT)
  – Does not perform synchronization
  – Distinct from the notion of thread or core
Data: Data-block (DB)
  – Relocatable consecutive chunk of data
Synchronization, links
  – Events: runtime-visible
  – Slots: positional end-points for dependences

Outline
Simplest OCR concepts: EDTs, datablocks
  Example: producer/consumer
Clarifying concepts: what EDTs can/can’t do, what DBs are/aren’t
  Example: simple synchronization
More concepts: events, slots
  Example: complex synch
More concepts: latch events

OCR concepts: 3 core building blocks
Event Driven Task (EDT)
  – N pre-slots (known at creation time)
  – Available states on a slot:
      Connected (attached to another slot) or unconnected
      Satisfied or unsatisfied
Data
  – A globally visible name space of data blocks
  – Explicitly created
  – EDTs can only access data either created by them or passed through their pre-slots
EDT 1 creates EDT 2
EDT 1 provides data on EDT 2’s pre-slot
  – Possibly through an indirection chain
[Figure: an EDT with pre-slots 0..N receiving data]

OCR execution model for EDTs
EDTs
  – 0..N in/out pre-slots
      Slots are initially “unconnected” and “unsatisfied”
      At creation time, the number of incoming slots must be known
  – An EDT executes after all pre-slots are “satisfied”
      Satisfaction of pre-slots can happen in any order
  – An EDT can access memory:
      Data-blocks:
        – passed in through one of its in/out slots (the EDT gets a C pointer)
        – created by the EDT
      Stack and ephemeral heap (local)
      NO global memory
  – An EDT, during its execution, can at any time:
      Write to any accessible data-blocks
      Manipulate the dependence graph for future (not yet runnable) EDTs
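The firing rule above (a fixed number of pre-slots, satisfied in any order, execution only once all are satisfied) can be illustrated with a small model. This is a sketch in Python under stated assumptions, not the OCR API: the class `ToyEdt` and its `satisfy` method are hypothetical names introduced for illustration, and the task runs inline rather than being handed to a scheduler.

```python
from typing import Any, Callable, List, Optional

_UNSATISFIED = object()  # sentinel: this pre-slot has not been satisfied yet


class ToyEdt:
    """Toy stand-in for an EDT (hypothetical, not the OCR API)."""

    def __init__(self, func: Callable[[List[Any]], Any], num_pre_slots: int):
        # The number of pre-slots is fixed at creation time.
        self.func = func
        self.slots: List[Any] = [_UNSATISFIED] * num_pre_slots
        self.result: Optional[Any] = None
        self.has_run = False

    def satisfy(self, slot: int, data: Any = None) -> None:
        """Satisfy one pre-slot; run the task once all are satisfied."""
        if self.slots[slot] is not _UNSATISFIED:
            raise ValueError(f"pre-slot {slot} is already satisfied")
        self.slots[slot] = data
        if all(s is not _UNSATISFIED for s in self.slots):
            self.result = self.func(list(self.slots))
            self.has_run = True


add = ToyEdt(lambda deps: deps[0] + deps[1], num_pre_slots=2)
add.satisfy(1, 40)           # slots may be satisfied in any order
ran_early = add.has_run      # False: slot 0 is still unsatisfied
add.satisfy(0, 2)
print(ran_early, add.has_run, add.result)  # False True 42
```

The task body only sees what arrived on its pre-slots, which mirrors the rule that an EDT has no access to global memory.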

Example 1: Producer/Consumer
Dynamic dependence construction
Producer and consumer never know about each other
Focus on minimum needed for placement and scheduling
[Figure (animated): concept view and OCR view, each showing a Producer EDT, a Consumer EDT and Data. Legend: creation link; event/data link; both creation & event.]

Example 2: Simple synchronization
Control dependence is no different than a data dependence
[Figure (animated): concept view and OCR view, each showing Step 1 EDT, Step 2-a EDT and Step 2-b EDT; the links in the OCR view carry empty (Ø) data.]

OCR execution model for runtime EDTs
Runtime EDTs
  – Created by the runtime to handle more complex synchronization situations
  – 0..N pre slots
      Slots are initially “unconnected” and “unsatisfied”
  – Runtime EDTs have a “trigger” rule that determines when they “satisfy” their outgoing edges and what gets propagated
Latch runtime EDT (multi-party synchronization)
  – 2 pre slots; “waiting-on” count and current count
  – When: satisfy outgoing edges when number of satisfies on both pre slots matches (similar to reference count in TBB)
  – What: NULL (incoming data-blocks are ignored)
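The latch trigger rule can be sketched as two counters that fire when they match. The model below is illustrative Python, not the OCR API: `ToyLatch`, the slot constants and `add_listener` are hypothetical names, and the model assumes every party is counted on the “waiting-on” slot before any of them reports on the other slot.

```python
from typing import Callable, List


class ToyLatch:
    """Toy latch with two pre-slots (hypothetical names, not the OCR API)."""

    WAITING_ON_SLOT = 0  # counts parties the latch has to wait for
    CURRENT_SLOT = 1     # counts parties that have reported in

    def __init__(self) -> None:
        self.counts = [0, 0]
        self.triggered = False
        self.listeners: List[Callable[[object], None]] = []

    def add_listener(self, callback: Callable[[object], None]) -> None:
        self.listeners.append(callback)

    def satisfy(self, slot: int, data: object = None) -> None:
        if self.triggered:
            raise RuntimeError("latch already triggered")
        self.counts[slot] += 1       # incoming data is ignored
        if self.counts[self.WAITING_ON_SLOT] == self.counts[self.CURRENT_SLOT]:
            self.triggered = True
            for callback in self.listeners:
                callback(None)       # "What": NULL is propagated


received = []
latch = ToyLatch()
latch.add_listener(received.append)
for _ in range(3):                   # three parties to wait on
    latch.satisfy(ToyLatch.WAITING_ON_SLOT)
latch.satisfy(ToyLatch.CURRENT_SLOT, "ignored payload")
latch.satisfy(ToyLatch.CURRENT_SLOT)
fired_early = latch.triggered        # False: 3 vs 2
latch.satisfy(ToyLatch.CURRENT_SLOT)
print(fired_early, latch.triggered, received)  # False True [None]
```

If a party reported in before all parties were counted, the counters could match too soon; ordering the two kinds of satisfy is the caller's responsibility in this sketch.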

Example 3: In place parallel update
[Figure (animated): the concept view shows Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data; the OCR view shows the same EDTs and Data plus a Sync REDT, with empty (Ø) data on the synchronization links.]

OCR ecosystem
[Figure: layered diagram. Recoverable labels: programming platforms (C, Array DSL, CnC, Hero Code, HC, CnC Translator, HC Compiler, R-Stream, HTA, PIL); OCR API + Tuning Annotations; Open Community Runtime; OCR implementations (OCR targeting TG, OCR targeting x86); low-level compilers (LLVM, GCC); evaluation platforms (FSim - TG Architecture, x86, Cluster).]

Take-aways
OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
Few simple concepts, multiple ways to use them
  – Interested in determining “best” use
Dependence graph built on the fly:
  – Complicates the writing of the program
  – Scalable approach

Preliminary results
On some code, OCR matches or bests OMP
Simple scheduler, no data-blocks (very preliminary but promising)

Areas of investigation
Development of a specification:
  – Memory model
Tuning hints and annotations
More expressive support for collectives

Case Study: FFT in OCR

Background
Final-year undergraduate project at Oregon State University
OCR implementation of Fast Fourier Transform
  – Cooley-Tukey algorithm
  – Evolution from serial version
  – OCR behavior

Algorithm
Divide-and-conquer
Data-flow friendly
[Figure: illustration of the algorithm. Source: Wikimedia Commons]

Versions
(1) Serial implementation: 1 EDT running the entire program
(2) Naive parallelization: division of DFT is carried out by EDTs recursively, combination of outputs is done by 1 EDT for each step of recursion
(3) Bounded parallelization: both stages of butterfly are parallelized, but up to a user-specified block size (to minimize scheduling overhead)
(4) Bounded parallelization with datablocks: previous implementations operated on a single datablock; this uses 3 datablocks (input, output real & imaginary terms)
Scope for better parallelism
  – Finer datablocks
  – Staggered creation of EDTs in the combination phase
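The effect of a block-size bound on the number of tasks can be sketched with a plain radix-2 Cooley-Tukey recursion. This is a serial Python illustration, not the project's OCR code: `fft_bounded`, `block_size` and the `stats` counter are hypothetical names, and a "task" here is only counted, never run in parallel. Sub-problems larger than `block_size` each count as one task; smaller ones are finished serially inside their parent.

```python
import cmath
from typing import Dict, List, Sequence


def combine(even: List[complex], odd: List[complex]) -> List[complex]:
    """Butterfly step: merge two half-size DFTs into one."""
    half = len(even)
    n = 2 * half
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def fft_serial(x: Sequence[complex]) -> List[complex]:
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    if len(x) == 1:
        return [complex(x[0])]
    return combine(fft_serial(x[0::2]), fft_serial(x[1::2]))


def fft_bounded(x: Sequence[complex], block_size: int,
                stats: Dict[str, int]) -> List[complex]:
    """Same recursion, but only sub-problems above block_size spawn tasks."""
    stats["tasks"] += 1                 # this call stands for one task
    if len(x) <= block_size:
        return fft_serial(x)            # small enough: no further tasks
    even = fft_bounded(x[0::2], block_size, stats)
    odd = fft_bounded(x[1::2], block_size, stats)
    return combine(even, odd)


def naive_dft(x: Sequence[complex]) -> List[complex]:
    """O(n^2) reference used only to check the result."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


signal = [complex(i % 5, 0) for i in range(64)]
fine, coarse = {"tasks": 0}, {"tasks": 0}
y_fine = fft_bounded(signal, 1, fine)       # every sub-problem is a task
y_coarse = fft_bounded(signal, 16, coarse)  # stop splitting at 16 elements
max_err = max(abs(a - b) for a, b in zip(y_coarse, naive_dft(signal)))
print(fine["tasks"], coarse["tasks"])  # 127 7
```

For 64 points the unbounded recursion counts 127 tasks while a block size of 16 counts 7, which is the number-versus-size trade-off the bounded versions are meant to tune.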

Behavior
OCR x86 running FFT on a 2^32-sized dataset: 2.9GHz Xeon, 16 cores; 8 cores made available to OCR
Table columns: Version | No. of EDTs | Mean EDT longevity (us) | Load variance across cores (%) | Running time (s)
Table rows: Serial; Naïve parallel; Bounded parallel; Bounded parallel w/ datablocks (cell values are not present in the transcript)
Balance to be achieved between number and size of EDTs

Backup

Strawman architecture
Heterogeneous, hierarchical architecture
Tapered memory bandwidth
Global, shared address space
Software managed non-coherent memories
Functional simulator available
[Figure: architecture hierarchy]
  – Execution Engine (XE): application specific; DP FP FMAC; 32KB I$; 64KB SP; RF?
  – Control Engine (CE): system SW; GP Int; 32KB I$; 64KB SP; RF?
  – Block (8 XE + CE): 1MB shared L2
  – Cluster (16 Blocks): 8MB shared LLC
  – Processor Chip (16 Clusters): interconnect

OCR vs other solutions
                        | CnC           | MPI                      | OCR                | OpenMP        | TBB
Execution model         | Tasks         | Bulk Sync                | Fine-grained tasks | Bulk Sync     | Tasks
Memory model            | Shared memory | Explicit message passing | Explicit; global   | Shared memory | Shared memory
Separation of concerns? | Yes           | No                       | Yes                | No            | Yes (but can dig deeper)

OCR concepts: building blocks
EDT
  – N pre slots (N known at creation time)
  – Optional attached “completion event”
Data
  – No pre slots
  – Post slot always “satisfied”
Event (Evt)
  – N pre slots (N fixed by type of event, NOT determined by user)
  – Post slot initially “unsatisfied”
Slot is:
  – Connected (attached to another slot) or unconnected
  – Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure legend: pre slots; post slots (multiple connections)]

OCR concepts: add dependence
[Figure: argument 1 is a Data or an Evt; argument 2 is an EDT or an Evt; the two become connected.]
=> 1 of 4 possible combinations

OCR concepts: satisfy
[Figure: argument 1 is an EDT or an Evt; argument 2 is a Data or NULL; the target slot becomes satisfied/triggered.]
=> 1 of 4 possible combinations

Example 1: Producer/Consumer
Dynamic dependence construction
Producer and consumer never know about each other
Focus on minimum needed for placement and scheduling
[Figure: the concept view shows Producer EDT, Consumer EDT and Data; the OCR view adds an Evt between them. Annotated calls: (1) dbCreate, (2) edit Data, (3) satisfy, (*) addDep. Legend: who executes call; data dependence; control dependence.]
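The call sequence annotated in the figure (dbCreate, edit, satisfy, plus an addDep made independently) can be walked through with a toy event that sits between the two tasks. This is a Python sketch, not the OCR API: `ToyEvent`, `ToyEdt`, `add_dependence` and the dictionary standing in for a data-block are all hypothetical.

```python
class ToyEdt:
    """Runs its function once every pre-slot has been satisfied."""

    def __init__(self, func, num_pre_slots):
        self.func = func
        self.args = [None] * num_pre_slots
        self.pending = num_pre_slots

    def satisfy(self, slot, payload):
        self.args[slot] = payload
        self.pending -= 1
        if self.pending == 0:
            self.func(self.args)


class ToyEvent:
    """Forwards whatever it is satisfied with to the connected pre-slots."""

    def __init__(self):
        self.targets = []        # (edt, slot) pairs connected so far
        self.fired = False
        self.payload = None

    def add_dependence(self, edt, slot):
        if self.fired:           # connected late: forward immediately
            edt.satisfy(slot, self.payload)
        else:
            self.targets.append((edt, slot))

    def satisfy(self, payload):
        self.fired = True
        self.payload = payload
        for edt, slot in self.targets:
            edt.satisfy(slot, payload)


log = []


def consumer(deps):
    # The consumer only sees the data that arrived on its pre-slot.
    log.append(("consumer saw", deps[0]["value"]))


def producer(event):
    db = {"value": 0}            # (1) dbCreate: a dict stands in for a DB
    db["value"] = 42             # (2) edit the data
    event.satisfy(db)            # (3) satisfy the event with the data


evt = ToyEvent()
consumer_edt = ToyEdt(consumer, num_pre_slots=1)
evt.add_dependence(consumer_edt, 0)   # (*) addDep, done by neither task body
producer(evt)
print(log)  # [('consumer saw', 42)]
```

The producer holds only the event and the consumer holds only its pre-slot data, so neither refers to the other, matching the second bullet above.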

Example 2: Simple synchronization
Control dependence is no different than a data dependence
[Figure: the concept view shows Step 1 EDT, Step 2-a EDT and Step 2-b EDT; the OCR view adds an Evt that carries NULL. Annotated calls: (1) satisfy, (*) addDep.]

Example 3: In place parallel update
[Figure: the concept view shows Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data; the OCR view adds a Finish EDT. Annotated calls: (1) dbCreate, (1) edtCreate, (2) addDep, (3) edtCreate, (3) edtCreate, (4) addDep.]

Example 4: Single assignment update
[Figure: the concept view shows Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT and Data; the OCR view adds Data1, Data2, Evt1 and Evt2. Annotated calls: (1) dbCreate, (1) edtCreate, (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate, (5) satisfy.]