LLNL Summer School 07/08/2014 What is OCR? Traleika Glacier Team (presenters: Romain Cledat & Bala Seshasayee) July 8, 2014 https://xstack.exascale-tech.com/wiki/


1 What is OCR? Traleika Glacier Team (presenters: Romain Cledat & Bala Seshasayee), July 8, 2014, https://xstack.exascale-tech.com/wiki/
This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

2 OCR
OCR: Open Community Runtime
– Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
The term ‘OCR’ is used to refer to
– A programming model
– A user-level API
– A runtime framework
– One of several reference runtime implementations
In this talk
– Presentation of the programming model
– Presentation of the API and implementations through demos

3 Traleika Glacier (TG) X-Stack project goals
Design a software stack to meet Exascale goals
– Target a strawman architecture
– Provide a programming model, API, reference implementation and tools
Concerns
– Extreme hardware parallelism
– Data locality
– Fine-grained resource management
– Resiliency
– Power and energy, not just performance
– Platform independence

4 Dataflow programming model
[Figure: Fibonacci data-flow graph. EDTs: mainEdt, fibIterEdt, sumEdt, doneEdt. Datablocks: N, N-1, N-2, Fib(N-1), Fib(N-2), Fib(N). The runtime maps the constructed data-flow graph to the architecture (shared LLC, interconnect).]
Legend
– EDT: a non-blocking unit of work, runnable once all dependences are satisfied
– Datablock: data shared between EDTs
– Creation link: source EDT creates destination
– Dependence: source EDT satisfies one of destination’s dependences
– Both creation and dependence link
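The graph on this slide can be mimicked in a few lines. The sketch below is a toy, single-threaded model written in Python, not OCR code: the `Edt` class, the `ready` queue and `run_fib` are hypothetical names invented for illustration, and only the task names (mainEdt, fibIterEdt, sumEdt, doneEdt) come from the slide. It shows the two kinds of link in the legend: a task creates other tasks, and a task satisfies one of another task's dependences.

```python
from collections import deque

ready = deque()  # tasks whose dependences are all satisfied


class Edt:
    """Toy task (hypothetical, not the OCR API): runnable once all slots are filled."""

    def __init__(self, func, num_slots, *params):
        self.func, self.params = func, params
        self.deps = [None] * num_slots
        self.missing = num_slots
        if num_slots == 0:
            ready.append(self)  # no dependences: runnable immediately

    def satisfy(self, slot, value):
        self.deps[slot] = value
        self.missing -= 1
        if self.missing == 0:
            ready.append(self)


def fib_iter_edt(params, deps):
    n, target, slot = params
    if n < 2:
        target.satisfy(slot, n)  # dependence link: hand the value on
        return
    # Creation links: this task creates a sum task and two child tasks.
    sum_task = Edt(sum_edt, 2, target, slot)
    Edt(fib_iter_edt, 0, n - 1, sum_task, 0)
    Edt(fib_iter_edt, 0, n - 2, sum_task, 1)


def sum_edt(params, deps):
    target, slot = params
    target.satisfy(slot, deps[0] + deps[1])


def run_fib(n):
    result = []
    # Plays the role of doneEdt: receives Fib(n) on its single slot.
    done = Edt(lambda params, deps: result.append(deps[0]), 1)
    # Plays the role of mainEdt: kicks off the first fibIterEdt.
    Edt(fib_iter_edt, 0, n, done, 0)
    while ready:  # stand-in for the runtime's scheduler loop
        task = ready.popleft()
        task.func(task.params, task.deps)
    return result[0]
```

No task ever blocks waiting for a result. The sum task simply does not become runnable until both of its slots have been satisfied.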

5 OCR level of abstraction
TBB user-friendly API

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range ( 1, n ), avg );
    }

hides…

    if(!range.empty()) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 ) local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

OCR’s level of abstraction is at the very bottom

6 Outline
Simplest OCR concepts – EDTs, datablocks
– EDTs & DBs and dependences (simple data-flow graph)
Example – producer/consumer
Clarifying concepts – what EDTs can/can’t do, what DBs are/aren’t
– Rules of EDTs
Examples – simple synchronization
– Introduce notion of event (fan-out) and then slots (fan-in)
– Representation introduces events (lozenge); examples 2a and 2b
More concepts – events, slots
– Clarify events and slot notion (take from slide 12)
Example – complex synch
– Example 3: show that you can shoot yourself in the foot
– Example 4: “proper” way to do it (for performance/locality)
More concepts – finish EDT
– Example Fib: show finish EDT. NO mention of latch events
Cheat sheet for OCR

7 High level OCR concepts
Event Driven Task (EDT)
– Distinct from the notion of a thread/core
– Executes when all required data-blocks have been provided to it
– Creates other EDTs and provides data-blocks to them
Data: globally visible namespace of data-blocks
– Explicitly created and destroyed
– Only available “global” memory
– Data-blocks can move
Dependence (EDT 1 to EDT 2)
– EDT 1 provides data to EDT 2
– EDT 1 creates EDT 2
– Visible to the runtime
[Figure: an EDT with its accessible data-blocks as input; as output it provides data-blocks for other EDTs and creates other EDTs.]

8 B: TODO: Example 1: Producer/Consumer
Dynamic dependence construction
Producer and consumer never know about each other
Focus on minimum needed for placement and scheduling
[Figure, “Concept” and “OCR” panels. Nodes: Producer EDT, Data, Consumer EDT. Legend: creation link; event/data link; both creation & event.]

9 B: TODO: OCR execution model for EDTs
EDTs
– 0..N in/out pre-slots
  Slots are initially “unconnected” and “unsatisfied”
  At creation time, the number of incoming slots must be known
– An EDT executes after all pre-slots are “satisfied”
  Satisfaction of pre-slots can happen in any order
– An EDT can access memory:
  Data-blocks:
  – passed in through one of its in/out slots (the EDT gets a C pointer)
  – created by the EDT
  Stack and ephemeral heap (local)
  NO global memory
– An EDT, during its execution, can at any time:
  Write to any accessible data-blocks
  Manipulate the dependence graph for future (not yet runnable) EDTs
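The slot rules on this slide can be sketched with a small Python model. This is an illustration only, under the assumption that a "satisfied" slot simply holds a datablock (modelled as a dict here); the `SlottedTask` class and its method names are hypothetical and are not part of the OCR API.

```python
class SlottedTask:
    """Toy task: the slot count is fixed at creation, and it runs once all are satisfied."""

    def __init__(self, func, num_slots):
        self.func = func
        self.slots = [None] * num_slots        # datablock delivered on each pre-slot
        self.satisfied = [False] * num_slots   # every slot starts "unsatisfied"
        self.ran = False
        self.result = None

    def satisfy(self, slot, datablock):
        """Satisfy one pre-slot; the order of satisfaction does not matter."""
        if self.satisfied[slot]:
            raise ValueError(f"slot {slot} already satisfied")
        self.slots[slot] = datablock
        self.satisfied[slot] = True
        if all(self.satisfied):
            # All dependences are met. A real runtime would schedule the task;
            # this toy simply runs it inline.
            self.ran = True
            self.result = self.func(*self.slots)


def add(a, b):
    # The task only touches the datablocks it received through its slots.
    return a["value"] + b["value"]


task = SlottedTask(add, num_slots=2)
task.satisfy(1, {"value": 40})   # satisfying slot 1 first is allowed
ran_early = task.ran             # still False: slot 0 is unsatisfied
task.satisfy(0, {"value": 2})    # last slot satisfied, so the task runs
```

The task body receives its inputs as arguments and never reaches for global state, mirroring the "NO global memory" rule.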

10 B: TODO: Example 2a: Simple synchronization
Control dependence is no different than a data dependence
TODO: Introduce event node in graph
[Figure, “Concept” and “OCR” panels. Nodes: Step 1 EDT, Step 2-a EDT, Step 2-b EDT; links in the OCR panel are labeled Ø.]

11 B: TODO: Talk about events

12 B: TODO: Example 2b: Simple synchronization
Control dependence is no different than a data dependence
[Figure, “Concept” and “OCR” panels. Nodes: Step 1 EDT, Step 2-a EDT, Step 2-b EDT; links in the OCR panel are labeled Ø.]

13 B: TODO: Talk about slots

14 Example 3a: Data dependences do not imply ordering
[Figure, “Concept” and “OCR” panels. Nodes: Setup EDT, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT, Shared Data; links in the OCR panel are labeled Ø.]
APIs – NEW APIs if any

15 Example 3b: Single assignment update
[Figure, “Concept” and “OCR” panels. Nodes: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT; the OCR panel adds Data1 and Data2.]
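The difference between examples 3a and 3b can be made concrete with a toy Python model. The update functions (`x * 2`, `x + 10`) and the way Wrapup combines its inputs are invented for illustration and do not come from the slides. The sketch assumes the runtime is free to run Parallel_1 and Parallel_2 in either order, and tries both.

```python
from itertools import permutations


def run_in_place(order):
    """Example 3a style: both parallel tasks overwrite the one shared datablock."""
    data = [1, 2, 3, 4]                        # created by Setup

    def parallel_1():
        data[:] = [x * 2 for x in data]

    def parallel_2():
        data[:] = [x + 10 for x in data]

    tasks = {"parallel_1": parallel_1, "parallel_2": parallel_2}
    for name in order:                         # order picked by the "runtime"
        tasks[name]()
    return list(data)                          # what Wrapup would see


def run_single_assignment(order):
    """Example 3b style: each parallel task writes a fresh datablock of its own."""
    data = (1, 2, 3, 4)                        # never modified after Setup
    outputs = {}

    def parallel_1():
        outputs["data1"] = [x * 2 for x in data]

    def parallel_2():
        outputs["data2"] = [x + 10 for x in data]

    tasks = {"parallel_1": parallel_1, "parallel_2": parallel_2}
    for name in order:
        tasks[name]()
    # Wrapup depends on both Data1 and Data2, so it runs after both tasks.
    return [a + b for a, b in zip(outputs["data1"], outputs["data2"])]


orders = list(permutations(["parallel_1", "parallel_2"]))
in_place_results = {tuple(run_in_place(o)) for o in orders}
single_results = {tuple(run_single_assignment(o)) for o in orders}
```

With in-place updates the final contents depend on which task happened to run first. With single assignment, every order gives Wrapup the same inputs.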

16 Dataflow programming model (same figure and legend as slide 4)

17 Example 4: Fibonacci with a Finish-EDT
[Figure, “OCR” panel. Nodes: FibIter(n) EDT, FibIter(n-1) EDT, FibIter(n-2) EDT, Sum(n) EDT, Output(n) EDT, Result.]

18 R: TODO: Explanation/description of finish EDT

19 R: TODO: OCR execution model for runtime EDTs
Runtime EDTs
– Created by the runtime to handle more complex synchronization situations
– 0..N pre slots
  Slots are initially “unconnected” and “unsatisfied”
– Runtime EDTs have a “trigger” rule that determines when they “satisfy” their outgoing edges and what gets propagated
Finish EDT (TODO: update description)
– 2 pre slots: “waiting-on” count and current count
– When: satisfy outgoing edges when the number of satisfies on both pre slots matches (similar to reference count in TBB)
– What: NULL (incoming data-blocks are ignored)
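One possible reading of the finish EDT's trigger rule is a pair of counters that fire once they match. The Python sketch below is a toy under that assumption: `FinishLatch`, `child_created`, `child_finished` and `run_tree` are hypothetical names, and the slide itself marks the description as still to be updated.

```python
class FinishLatch:
    """Toy counting rule: trigger when the two satisfy counts match."""

    def __init__(self, on_trigger):
        self.waiting_on = 0      # satisfies seen on the "waiting-on" slot
        self.current = 0         # satisfies seen on the "current count" slot
        self.triggered = False
        self.on_trigger = on_trigger

    def child_created(self):
        self.waiting_on += 1

    def child_finished(self):
        self.current += 1
        if not self.triggered and self.current == self.waiting_on:
            self.triggered = True
            self.on_trigger(None)    # "What: NULL": incoming data is ignored


def run_tree(depth, latch):
    """Binary tree of tasks; each registers its children before it finishes."""
    latch.child_created()            # the root task itself
    pending = [depth]
    while pending:
        d = pending.pop()
        if d > 0:
            for _ in range(2):
                latch.child_created()
                pending.append(d - 1)
        latch.child_finished()


events = []
latch = FinishLatch(events.append)
run_tree(2, latch)                   # 1 + 2 + 4 = 7 tasks in total
```

Because a task registers its children before reporting its own completion, the two counts cannot match while any task is still outstanding, so the latch fires exactly once, after the last task.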

20 B: TODO: API cheat sheet

21 OCR ecosystem
[Figure: layered diagram.
– Layer labels: Programming platforms; Open Community Runtime (OCR API + Tuning Annotations); OCR implementations; Low-level compilers; Platforms; Evaluation platforms
– Box labels: C, Array DSL; CnC; Hero Code; HC; CnC Translator; HC Compiler; R-Stream; HTA; PIL; OCR targeting TG; OCR targeting x86; LLVM; GCC; FSim - TG Architecture; x86; Cluster]

22 TODO: OCR vs other solutions
                        | CnC           | MPI                      | OCR                | OpenMP        | TBB
Execution model         | Tasks         | Bulk Sync                | Fine-grained tasks | Bulk Sync     | Tasks
Memory model            | Shared memory | Explicit message passing | Explicit; global   | Shared memory (OpenMP and TBB)
Separation of concerns? | Yes           | No                       | Yes                | No            | Yes (but can dig deeper)

23 B: TODO: Apps list (what’s available)

24 Case Study: FFT in OCR

25 Background
Final-year undergraduate project at Oregon State University
OCR implementation of Fast Fourier Transform
– Cooley-Tukey algorithm
– Evolution from serial version
– OCR behavior

26 Algorithm
Divide-and-conquer
Data-flow friendly
[Figure. Source: Wikimedia Commons]

27 Versions
(1) Serial implementation – 1 EDT running the entire program
(2) Naive parallelization – division of the DFT is carried out by EDTs recursively; combination of outputs is done by 1 EDT for each step of recursion
(3) Bounded parallelization – both stages of the butterfly are parallelized, but only up to a user-specified block size (to minimize scheduling overhead)
(4) Bounded parallelization with datablocks – previous implementations operated on a single datablock; this uses 3 datablocks (input, output real & imaginary terms)
Scope for better parallelism
– Finer datablocks
– Staggered creation of EDTs in the combination phase
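The idea behind the bounded version, stopping the recursion at a block size so that fewer and larger tasks are created, can be sketched in plain Python. This is not the project's code: each recursive call merely stands in for one EDT, `fft_bounded` and `stats` are hypothetical names, the input length is assumed to be a power of two, and blocks at or below the cutoff use a direct DFT purely to keep the sketch short.

```python
import cmath


def dft(x):
    """Direct O(n^2) DFT, used below the block size and as a reference."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]


def fft_bounded(x, block_size, stats):
    """Radix-2 Cooley-Tukey; every call is counted as one task."""
    stats["tasks"] += 1
    n = len(x)
    if n <= block_size:
        return dft(x)                      # small block: stop creating tasks
    even = fft_bounded(x[0::2], block_size, stats)
    odd = fft_bounded(x[1::2], block_size, stats)
    out = [0j] * n
    for k in range(n // 2):                # butterfly combination step
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


x = [float(i) for i in range(16)]
reference = dft(x)

naive_stats = {"tasks": 0}
naive = fft_bounded(x, 1, naive_stats)        # recurse all the way down

bounded_stats = {"tasks": 0}
bounded = fft_bounded(x, 4, bounded_stats)    # stop at blocks of 4 points
```

For 16 points, recursing to single elements creates 31 tasks, while a block size of 4 creates 7; both produce the same transform.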

28 Behavior
Version | No. of EDTs | Mean EDT longevity (us) | Load variance across cores (%) | Running time (s)
Serial | 2 | 1673420 | 70.7 | 3.36
Naïve parallel | 125829132535.1877.0 (values fused in the source; column boundaries not recoverable)
Bounded parallel | 1793 | 1982 | 2.7 | 0.46
Bounded parallel w/ datablocks | 1793 | 1946 | 2.9 | 0.45
OCR x86 running FFT on a 2^32-sized dataset
– 2.9GHz Xeon, 16 cores; 8 cores made available to OCR
Balance to be achieved between number and size of EDTs
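A quick sanity check on the two bounded rows: if the work were spread perfectly over the 8 cores, the running time would be roughly the number of EDTs times the mean EDT longevity, divided by 8. The sketch below assumes the fused digits in the transcript split as 1793 EDTs with 1982 us and 1946 us mean longevity and reported times of 0.46 s and 0.45 s. That split is an assumption about the garbled table, and `ideal_runtime_s` is a hypothetical helper.

```python
def ideal_runtime_s(num_edts, mean_longevity_us, cores):
    """Running time if total EDT work were spread evenly over the cores."""
    return num_edts * mean_longevity_us / 1e6 / cores


# Values as read from the slide's table (assumed column split, see above).
bounded = ideal_runtime_s(1793, 1982, 8)
bounded_db = ideal_runtime_s(1793, 1946, 8)

# Gap between the reported running time and the evenly balanced estimate.
gap_bounded = (0.46 - bounded) / 0.46
gap_bounded_db = (0.45 - bounded_db) / 0.45
```

Under that reading, both estimates land within a few percent of the reported running times, which fits the low load variance listed for these two versions.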

29 TODO: Take-aways
OCR API is at the “assembly” level; other tools are meant to sit between it and programmers
Few simple concepts, multiple ways to use them
– Interested in determining “best” use
Dependence graph built on the fly:
– Complicates the writing of the program
– Scalable approach

30 Areas of investigation
Development of a specification:
– Memory model
Tuning hints and annotations
More expressive support for collectives

31 Backup

32 Strawman architecture
Heterogeneous hierarchical architecture
Tapered memory bandwidth
Global, shared address space
Software managed non-coherent memories
Functional simulator available
[Figure labels, from smallest unit to largest:
– Execution Engine (XE), application specific: DP FP FMAC, 32KB I$, 64KB SP, RF?
– Control Engine (CE), system SW: GP Int, 32KB I$, 64KB SP, RF?
– Block (8 XE + CE); 1MB shared L2
– Cluster (16 Blocks); 8MB shared LLC; interconnect
– Processor chip (16 Clusters)]

33 OCR concepts: building blocks
EDT
– N pre slots (N known at creation time)
– Optional attached “completion event”
Data
– No pre slots
– Post slot always “satisfied”
Evt
– N pre slots (N fixed by type of event, NOT determined by user)
– Post slot initially “unsatisfied”
Slot is:
– Connected (attached to another slot) or unconnected
– Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure: EDT and Evt drawn with pre slots 0..N; post slots allow multiple connections.]

34 OCR concepts: add dependence
Argument 1: Data OR Evt
Argument 2: EDT OR Evt
=> 1 of 4 possible combinations
Result: Connected

35 OCR concepts: satisfy
Argument 1: EDT OR Evt
Argument 2: Data OR NULL
=> 1 of 4 possible combinations
Result: Satisfied/triggered

36 Example 1: Producer/Consumer
Dynamic dependence construction
Producer and consumer never know about each other
Focus on minimum needed for placement and scheduling
[Figure, “Concept” and “OCR” panels. Nodes: Producer EDT, Data, Consumer EDT; the OCR panel adds an Evt node. Calls shown: (1) dbCreate, (2) edit Data, (3) satisfy, (*) addDep. Legend: who executes call; data dependence; control dependence.]
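The call sequence in this example (dbCreate, edit, satisfy, with addDep done separately) can be illustrated with a toy Python model. The `Event` class and its methods are hypothetical stand-ins, not the OCR API, and a `bytearray` plays the role of a datablock. The point of the sketch is that the producer only ever sees the event, never the consumer.

```python
class Event:
    """Toy event: forwards whatever it is satisfied with to its dependents."""

    def __init__(self):
        self.dependents = []     # callables registered through add_dependence
        self.payload = None
        self.fired = False

    def add_dependence(self, callback):
        if self.fired:
            callback(self.payload)       # a late dependent still gets the data
        else:
            self.dependents.append(callback)

    def satisfy(self, datablock):
        self.payload, self.fired = datablock, True
        for callback in self.dependents:
            callback(datablock)


log = []


def producer(out_event):
    db = bytearray(4)            # (1) "dbCreate": allocate a datablock
    db[:] = b"data"              # (2) edit the datablock
    out_event.satisfy(db)        # (3) satisfy the event with the datablock


def consumer(db):
    log.append(bytes(db).decode())


evt = Event()
evt.add_dependence(consumer)     # (*) addDep: done by whoever builds the graph
producer(evt)                    # the producer is handed only the event
```

Swapping in a different consumer, or adding a second one, requires no change to `producer`, which is the decoupling the slide describes.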

37 Example 2: Simple synchronization
Control dependence is no different than a data dependence
[Figure, “Concept” and “OCR” panels. Nodes: Step 1 EDT, Step 2-a EDT, Step 2-b EDT; the OCR panel adds an Evt node. Calls shown: (1) satisfy, (*) addDep. Label: NULL.]

38 Example 3: In place parallel update
[Figure, “Concept” and “OCR” panels. Nodes: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT; the OCR panel adds a Finish EDT. Calls shown: (1) dbCreate, (1) edtCreate, (2) addDep, (3) edtCreate (twice), (4) addDep.]

39 Example 4: Single assignment update
[Figure, “Concept” and “OCR” panels. Nodes: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT; the OCR panel adds Data1, Data2, Evt1 and Evt2. Calls shown: (1) dbCreate, (1) edtCreate, (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate, (5) satisfy.]

40 Preliminary results
On some code, OCR matches or bests OMP
Simple scheduler, no data-blocks (very preliminary but promising)

