Exascale Programming Models Lecture Series 06/12/2014 What is OCR? TG Team (presenter: Romain Cledat) June 12, 2014 https://xstackwiki.modelado.org/Traleika_Glacier/

Presentation transcript:

This research was, in part, funded by the U.S. Government, DOE and DARPA. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

OCR
OCR: Open Community Runtime
  - Developed collaboratively with partners (mainly Rice University and Reservoir Labs)
The term 'OCR' is used to refer to way too many concepts
  - A programming model
  - A user-level API
  - A runtime framework
  - One of a multitude of reference runtime implementations

TG X-Stack project goals
Design a software stack to meet Exascale goals
  - Target a strawman architecture
  - Provide a programming model, API, reference implementation and tools
Concerns
  - Extreme hardware parallelism
  - Data locality
  - Fine-grained resource management
  - Resiliency
  - Power and energy and not just performance
  - Platform independence

Dataflow programming model
[Figure: data-flow graph with nodes mainEdt, fibIterEdt (labelled N, N-1, N-2), sumEdt and finishEdt. Legend: EDT, Datablock, Event, Create.]
Runtime maps the constructed data-flow graph to architecture
[Figure: target architecture with shared LLC and interconnect.]

OCR level of abstraction
TBB user-friendly API:

    void ParallelAverage( float* output, const float* input, size_t n ) {
        Average avg;
        avg.input = input;
        avg.output = output;
        parallel_for( blocked_range<size_t>( 1, n ), avg );
    }

hides…

    if( !range.empty() ) {
        start_for& a = *new(task::allocate_root()) start_for(range,body,partitioner);
        task::spawn_root_and_wait(a);
    }

    void generic_scheduler::local_spawn_root_and_wait( task& first, task*& next ) {
        internal::reference_count n = 0;
        for( task* t=&first; ; t=t->prefix().next ) {
            ++n;
            t->prefix().parent = &dummy;
            if( &t->prefix().next==&next ) break;
        }
        dummy.prefix().ref_count = n+1;
        if( n>1 )
            local_spawn( *first.prefix().next, next );
        local_wait_for_all( dummy, &first );
    }

OCR's level of abstraction is at the very bottom

OCR concepts
Common
  - All objects are globally and uniquely identifiable, and relocatable
Computation
  - Event Driven Task (EDT)
  - Does not perform synchronization
  - Distinct from the notion of thread or core
Data
  - Data-block (DB)
  - Relocatable consecutive chunk of data
Synchronization, links
  - Events
  - Runtime-visible
Slots
  - Positional end-points for dependences

OCR concepts: building blocks
EDT
  - N pre slots (N known at creation time)
  - Optional attached "completion event"
Data
  - No pre slots
  - Post slot always "satisfied"
Event (Evt)
  - N pre slots (N fixed by the type of event, NOT determined by the user)
  - Post slot initially "unsatisfied"
A slot is:
  - Connected (attached to another slot) or unconnected
  - Satisfied (user-triggered or runtime-triggered) or unsatisfied
[Figure: EDT, Evt and Data boxes, each drawn with pre slots 0..N on one side and post slots on the other; post slots allow multiple connections.]

OCR concepts: add dependence
[Figure: the two arguments of an add-dependence call and the resulting connection.]
Argument 1: Data OR Evt
Argument 2: a pre slot (0..N) of an EDT OR of an Evt
=> 1 of 4 possible combinations
Result: the two end-points are connected

OCR concepts: satisfy
[Figure: the two arguments of a satisfy call and the resulting state.]
Argument 1: a pre slot (0..N) of an EDT OR of an Evt
Argument 2: Data OR NULL
=> 1 of 4 possible combinations
Result: the slot is satisfied/triggered and carries the data

OCR execution model for EDTs
EDTs
  - 0..N in/out pre-slots
      - Slots are initially "unconnected" and "unsatisfied"
      - At creation time, the number of incoming slots must be known
  - An EDT executes after all pre slots are "satisfied"
      - Satisfaction of pre slots can happen in any order
  - An EDT can access memory:
      - Data-blocks:
          - passed in through one of its in/out slots (the EDT gets a C pointer)
          - created by the EDT
      - Stack and ephemeral heap (local)
      - NO global memory
  - An EDT, during its execution, can at any time:
      - Write to any accessible data-blocks
      - Manipulate the dependence graph for future (not yet runnable) EDTs by adding dependences, satisfying events, etc.
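The firing rule above can be illustrated with a small toy model. This is a sketch in Python, not the OCR C API: the class name Edt, its satisfy method and the num_slots parameter are all made up for illustration. It only shows that a task with N pre slots runs exactly when the last slot is satisfied, whatever the order.

```python
class Edt:
    """Toy event-driven task (hypothetical, not the OCR API).

    The number of pre slots is fixed at creation time and the task body
    runs as soon as every pre slot has been satisfied.
    """

    def __init__(self, func, num_slots):
        self.func = func
        self.slots = [None] * num_slots        # payload received on each pre slot
        self.pending = set(range(num_slots))   # slots that are still unsatisfied
        self.result = None
        self.done = False
        self._maybe_run()                      # a task with 0 pre slots is runnable at once

    def satisfy(self, slot, payload=None):
        if slot not in self.pending:
            raise ValueError(f"slot {slot} is already satisfied or does not exist")
        self.slots[slot] = payload
        self.pending.remove(slot)
        self._maybe_run()

    def _maybe_run(self):
        if not self.pending and not self.done:
            # Payloads are passed positionally, one per pre slot.
            self.result = self.func(*self.slots)
            self.done = True


sub = Edt(lambda a, b: a - b, num_slots=2)
sub.satisfy(1, 3)          # slot 1 is satisfied first: order does not matter
ran_early = sub.done       # still waiting on slot 0
sub.satisfy(0, 10)         # last slot satisfied, the body runs now
print(ran_early, sub.done, sub.result)  # False True 7
```

The payloads stay positional (slot 0 is the first argument), which mirrors the idea of slots as positional end-points for dependences.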

Example 1: Producer/Consumer
  - Dynamic dependence construction
  - Producer and consumer never know about each other
  - Focus on minimum needed for placement and scheduling
[Figure, "Concept" view: Producer EDT, Data, Consumer EDT.]
[Figure, "OCR" view: Producer EDT, Data, Evt, Consumer EDT, annotated with the calls (1) dbCreate, (2) edit Data, (3) satisfy and (*) addDep. Legend: who executes the call, data dependence, control dependence.]
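The producer/consumer pattern can be sketched with a toy model. Everything here is hypothetical Python written for illustration (Edt, Event, add_dep and the list standing in for a data-block are not OCR calls); the step numbers in the comments follow the labels in the figure.

```python
class Edt:
    """Toy task (hypothetical): runs once all of its pre slots are satisfied."""

    def __init__(self, func, num_slots):
        self.func = func
        self.slots = [None] * num_slots
        self.pending = set(range(num_slots))

    def satisfy(self, slot, payload=None):
        self.slots[slot] = payload
        self.pending.discard(slot)
        if not self.pending:
            self.func(*self.slots)


class Event:
    """Toy pass-through event: forwards its payload to every connected slot."""

    def __init__(self):
        self.targets = []   # list of (edt, slot) pairs connected to the post slot

    def satisfy(self, payload=None):
        for edt, slot in self.targets:
            edt.satisfy(slot, payload)


def add_dep(event, edt, slot):
    """Connect the event's post slot to one pre slot of an EDT.

    Simplification: dependences must be added before the event is satisfied.
    """
    event.targets.append((edt, slot))


log = []
consumer = Edt(lambda db: log.append(sum(db)), num_slots=1)
evt = Event()
add_dep(evt, consumer, 0)    # (*) addDep: done by whoever builds the graph


def producer():
    db = [0] * 4             # (1) dbCreate: a list stands in for a data-block
    for i in range(4):
        db[i] = i * i        # (2) edit the data
    evt.satisfy(db)          # (3) satisfy: the producer only knows the event


producer()
print(log)  # [14]
```

The producer never references the consumer, and the consumer only sees what arrives on its slot, which is the decoupling the slide describes.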

Example 2: Simple synchronization
  - Control dependence is no different than a data dependence
[Figure, "Concept" view: Step 1 EDT followed by Step 2-a EDT and Step 2-b EDT.]
[Figure, "OCR" view: Step 1 EDT, Evt, Step 2-a EDT, Step 2-b EDT, annotated with the calls (1) satisfy (with NULL) and (*) addDep.]

OCR execution model for events
Events
  - 0..N pre slots
      - Slots are initially "unconnected" and "unsatisfied"
  - Events have a "trigger" rule that determines when their post slot transitions to "satisfied" and what gets connected to it
  - Simple event (pass-through)
      - 1 pre slot
      - When: satisfy post slot on incoming slot satisfaction
      - What: whatever is on incoming slot (pass GUID)
  - Latch event (multi-party synchronization)
      - 2 pre slots; "waiting-on" count and current count
      - When: satisfy outgoing slot when number of satisfies on both pre slots matches (similar to reference count in TBB)
      - What: NULL (incoming data-blocks are ignored)
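The two trigger rules can be compared side by side in a toy model. This is a Python sketch under stated assumptions, not the OCR API: the class names, the on_trigger callback hook and the slot constants WAITING_ON and CURRENT are invented here, and None stands in for NULL.

```python
class SimpleEvent:
    """Toy pass-through event: one pre slot, forwards whatever it receives."""

    def __init__(self):
        self.listeners = []
        self.fired = False

    def on_trigger(self, callback):
        self.listeners.append(callback)

    def satisfy(self, payload=None):
        self.fired = True
        for callback in self.listeners:
            callback(payload)          # "What": whatever came in on the pre slot


class LatchEvent:
    """Toy latch event: two pre slots, fires when their satisfy counts match."""

    WAITING_ON, CURRENT = 0, 1         # hypothetical names for the two pre slots

    def __init__(self):
        self.counts = [0, 0]
        self.listeners = []
        self.fired = False

    def on_trigger(self, callback):
        self.listeners.append(callback)

    def satisfy(self, slot, payload=None):
        if self.fired:
            raise RuntimeError("latch has already fired")
        self.counts[slot] += 1
        if self.counts[0] == self.counts[1]:
            self.fired = True
            for callback in self.listeners:
                callback(None)         # "What": NULL, incoming payloads are ignored


received = []

simple = SimpleEvent()
simple.on_trigger(received.append)
simple.satisfy("db-guid-42")           # passed through unchanged

latch = LatchEvent()
latch.on_trigger(received.append)
for _ in range(3):                     # three parties register
    latch.satisfy(LatchEvent.WAITING_ON)
latch.satisfy(LatchEvent.CURRENT, "ignored")
latch.satisfy(LatchEvent.CURRENT, "ignored")
fired_after_two = latch.fired          # counts are 3 and 2: not yet
latch.satisfy(LatchEvent.CURRENT, "ignored")
print(received, fired_after_two, latch.fired)  # ['db-guid-42', None] False True
```

In this toy the counts are only compared after a satisfy, so a fresh latch (0 and 0) does not fire on its own; how a real runtime treats that corner case is not stated on the slide.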

Example 3: In place parallel update
[Figure, "Concept" view: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Wrapup EDT.]
[Figure, "OCR" view: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Finish EDT, Wrapup EDT, annotated with the calls (1) dbCreate, (1) edtCreate, (2) addDep, (3) edtCreate, (3) edtCreate, (4) addDep.]

Example 4: Single assignment update
[Figure, "Concept" view: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Data1, Data2, Wrapup EDT.]
[Figure, "OCR" view: Setup EDT, Data, Parallel_1 EDT, Parallel_2 EDT, Data1, Data2, Evt1, Evt2, Wrapup EDT, annotated with the calls (1) dbCreate, (1) edtCreate, (1) evtCreate, (2) addDep, (3) addDep, (4) dbCreate, (5) satisfy.]

OCR ecosystem
[Figure: layered software stack; items recovered from the diagram are grouped by layer.]
Programming platforms: C, Array DSL, CnC, Hero Code, HC, CnC Translator, HC Compiler, R-Stream, HTA, PIL
OCR API + Tuning Annotations
Open Community Runtime
OCR implementations: OCR targeting TG, OCR targeting x86
Low-level compilers: LLVM, GCC
Platforms (evaluation platforms): FSim - TG Architecture, x86, Cluster

Take-aways
  - OCR API is at the "assembly" level; other tools are meant to sit between it and programmers
  - Few simple concepts, multiple ways to use them
      - Interested in determining "best" use
  - Dependence graph built on the fly:
      - Complicates the writing of the program
      - Scalable approach

Preliminary results
  - On some code, OCR matches or bests OMP
  - Simple scheduler, no data-blocks (very preliminary but promising)

Areas of investigation
  - Development of a specification:
      - Memory model
  - Tuning hints and annotations
  - More expressive support for collectives

Backup

Strawman architecture
  - Heterogeneous
  - Hierarchical architecture
  - Tapered memory bandwidth
  - Global, shared address space
  - Software managed non-coherent memories
  - Functional simulator available
[Figure: chip hierarchy; components recovered from the diagram:]
  - Execution Engine (XE): application specific; DP FP FMAC; 32KB I$; 64KB SP; RF?
  - Control Engine (CE): system SW; GP Int; 32KB I$; 64KB SP; RF?
  - Block (8 XE + CE): 1MB shared L2
  - Cluster (16 Blocks)
  - 8MB shared LLC
  - Interconnect
  - Processor chip (16 Clusters)

OCR vs other solutions

                         CnC            MPI                       OCR                 OpenMP         TBB
Execution model          Tasks          Bulk Sync                 Fine-grained tasks  Bulk Sync      Tasks
Memory model             Shared memory  Explicit message passing  Explicit; global    Shared memory  Shared memory
Separation of concerns?  Yes            No                        Yes                 No             Yes (but can dig deeper)