Programming and Timing Analysis of Parallel Programs on Multicores
Eugene Yip, Partha Roop, Morteza Biglari-Abhari, Alain Girault
ACSD 2013

Presentation transcript:

Programming and Timing Analysis of Parallel Programs on Multicores. Eugene Yip, Partha Roop, Morteza Biglari-Abhari, Alain Girault. ACSD 2013.

Introduction
Safety-critical systems:
– Perform specific real-time tasks.
– Must meet strict safety standards (IEC 61508, DO-178).
– Time-predictability is useful in real-time designs.
– Shifting towards multicore designs.
[Diagram: safety-critical concerns within embedded systems.]
[Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures. [Pellizzoni et al 2009] Handling Mixed-Criticality in SoC-Based Real-Time Embedded Systems. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

Introduction
Designing safety-critical systems:
– Certified Real-Time Operating Systems (RTOS), e.g., VxWorks, LynxOS, and SafeRTOS. The programmer manages shared variables, and timing is hard to verify.
[Sandell et al 2006] Static Timing Analysis of Real-Time Operating System Code.

Introduction
Designing safety-critical systems:
– Certified Real-Time Operating Systems (RTOS), e.g., VxWorks, LynxOS, and SafeRTOS. The programmer manages shared variables, and timing is hard to verify.
– Synchronous languages, e.g., Esterel, Esterel C Language (ECL), and PRET-C. Deterministic concurrency (synchrony hypothesis), but difficult to distribute: instantaneous communication or sequential semantics.
[Benveniste et al 2003] The Synchronous Languages 12 Years Later. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Girault 2005] A Survey of Automatic Distribution Methods for Synchronous Programs.

Research Objective
To design a C-based, parallel programming language that:
– has deterministic execution behaviour,
– can take advantage of multicore execution, and
– is amenable to static timing analysis.

Outline: Introduction, ForeC Language, Timing Analysis, Results, Conclusions.


ForeC (Foresee) Language
C-based, multi-threaded, synchronous language, inspired by Esterel and PRET-C:
– Minimal set of synchronous constructs.
– Fork/join parallelism and shared-memory thread communication.
– Structured preemption.

Execution Example

    shared int sum = 1 combine with plus;  // shared variable and its combine function
    int plus(int copy1, int copy2) {
        return (copy1 + copy2);
    }
    void main(void) {
        par(f(1), f(2));                   // fork-join; arbitrary thread execution order
    }
    void f(int i) {
        sum = sum + i;
        pause;                             // blocking statement: global synchronisation barrier
        ...
    }

Execution Example (same program)
Global tick start: global sum = 1.

Threads get a conceptual copy of the shared variables at the start of every global tick.
Global: sum = 1. Copies: sum1 = 1 (thread f1), sum2 = 1 (thread f2).

Threads modify their own copy during execution.
Global: sum = 1. Copies: sum1 = 2 (f1), sum2 = 3 (f2).

Global tick end: the copies hold sum1 = 2 and sum2 = 3, while global sum is still 1.

When a global tick ends, the modified copies are combined and assigned to the actual shared variable: sum = plus(2, 3) = 5. The combine function is defined by the programmer and must be commutative and associative.
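To make the combine step concrete, here is a minimal C sketch of what the runtime conceptually does at the end of a global tick (an assumed form using the program's plus function; the actual generated code may differ):

    /* Fold the threads' modified copies into the shared variable.
     * Only modified copies participate; commutativity and associativity
     * make the fold order irrelevant. */
    int combine(const int copies[], int n) {
        int result = copies[0];
        for (int i = 1; i < n; i++)
            result = plus(result, copies[i]);
        return result;
    }
    /* here: sum = combine((int[]){2, 3}, 2) = plus(2, 3) = 5 */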

Modifications are isolated, so thread interleaving does not matter: no locks or critical sections are needed. In return, the programmer has to specify the combine function and the placement of pauses.

At the start of the next global tick, the threads receive fresh copies of the updated value: sum1 = 5, sum2 = 5.

ForeC (Foresee) Language
Preemption construct:

    int x = 1;        // initialise variable x
    abort {
        x = 2;        // abort body starts executing
        pause;        // the abort condition is checked at the tick boundary
        x = 3;
    } when (x > 0);   // x > 0, so the abort body is preempted;
                      // execution continues after the abort

ForeC (Foresee) Language
Preemption construct syntax:

    [weak] abort { st } when [immediate] (cond)

– immediate: the abort condition is also checked when execution first reaches the abort.
– weak: lets the abort body execute one last time before it is preempted.
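As an illustrative sketch (not from the slides), the two qualifiers compose as follows:

    weak abort {
        // weak: the body may execute one last reaction after cond holds
        pause;
        ...
    } when immediate (cond);  // immediate: cond is also checked on entry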

ForeC (Foresee) Language
Variable type-qualifiers: input and output. These declare variables whose values are sampled from, or emitted to, the environment at each global tick. E.g., input int x;
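For example (illustrative names):

    input  int sensor;    // value refreshed from the environment at each global tick
    output int actuator;  // value emitted to the environment at each global tick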

Scheduling
Light-weight static scheduling:
– Takes advantage of multicore performance while delivering time-predictability (eases static timing analysis).
– Thread allocation and the scheduling order on each core are decided at compile time by the programmer.
– Cooperative (non-preemptive) scheduling.
– Fork/join semantics and the notion of a global tick are preserved via synchronisation.

Scheduling
One core performs housekeeping tasks at the end of the global tick:
– combining of shared variables,
– emitting outputs and sampling inputs,
– starting the next global tick.

Outline: Introduction, ForeC Language, Timing Analysis, Results, Conclusions.

Timing Analysis
Compute the program's Worst-Case Reaction Time (WCRT):
WCRT = max(reaction times). Must validate: WCRT ≤ maximum time allowed (design specification).
[Timeline: reaction times plotted against physical time (1s–4s), with the maximum allowed time from the design specification.]
[Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
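A trivial sketch of this validation step (illustrative; the reaction times would come from the analysis or from measurement):

    /* WCRT = max over all reaction times; validate against the deadline. */
    double wcrt(const double rt[], int n) {
        double w = 0.0;
        for (int i = 0; i < n; i++)
            if (rt[i] > w)
                w = rt[i];
        return w;
    }
    /* usage: assert(wcrt(reaction_times, n) <= MAX_TIME_ALLOWED); */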

Timing Analysis
Construct a Concurrent Control-Flow Graph (CCFG) of the executable binary (illustrated for the execution example program shown earlier).

Timing Analysis
One existing approach for multicores: [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. It uses ILP, which is NP-complete; there is no tightness result, and analysis results are reported only for a 4-core processor.
Existing approaches for single-core:
– Integer Linear Programming (ILP)
– Model checking/reachability
– Max-Plus
[Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.

Reachability
Traverse the CCFG to find all possible global ticks, e.g., g1, g2, g3a, g3b, g4a, g4b, g4c, where RTi is the reaction time of global tick gi. WCRT = MAX(RT1 … RT4c).
Drawbacks: state-space explosion; precision vs. analysis time.

Reachability
The traversal also identifies the path leading to the WCRT (here, the global tick with reaction time RT4b), which is good for understanding the timing behaviour.
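A toy, self-contained sketch of such a traversal (illustrative only: the successor relation and reaction times are stubbed; in the real analysis a state is the tuple of thread program points at a tick boundary):

    #include <stdio.h>

    #define MAX_STATES 64
    typedef int State;

    /* Stubbed successor relation: state s branches to 2s+1 and 2s+2. */
    static int successors(State s, State out[2]) {
        int n = 0;
        if (2 * s + 1 < MAX_STATES) out[n++] = 2 * s + 1;
        if (2 * s + 2 < MAX_STATES) out[n++] = 2 * s + 2;
        return n;
    }

    int main(void) {
        State stack[MAX_STATES];
        int top = 0, wcrt = 0;
        int seen[MAX_STATES] = {0};
        stack[top++] = 0; seen[0] = 1;
        while (top > 0) {                 /* explore every reachable global tick */
            State s = stack[--top];
            int rt = s + 1;               /* stubbed reaction time of this tick */
            if (rt > wcrt) wcrt = rt;
            State next[2];
            int n = successors(s, next);
            for (int i = 0; i < n; i++)
                if (!seen[next[i]]) { seen[next[i]] = 1; stack[top++] = next[i]; }
        }
        printf("WCRT = max over all explored global ticks = %d\n", wcrt);
        return 0;
    }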

Max-Plus
Makes the safe assumption that the program's WCRT occurs when all threads execute their longest reaction together:
– Compute the WCRT of each thread separately.
– Compute the program's WCRT from the WCRTs of the threads.
– Fast analysis time, but the over-estimation can be large.
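A minimal sketch of the Max-Plus combination step (illustrative, not the paper's implementation; it assumes threads are statically allocated to cores and folds all scheduling costs into one conservative overhead term):

    #define MAX_CORES 8

    /* Each core is charged the sum of its threads' worst reactions;
     * the program WCRT is the slowest core plus a conservative overhead. */
    double maxplus_wcrt(const double thread_wcrt[], const int core_of[],
                        int n_threads, int n_cores, double overhead) {
        double core_time[MAX_CORES] = {0.0};
        for (int t = 0; t < n_threads; t++)
            core_time[core_of[t]] += thread_wcrt[t];
        double w = 0.0;
        for (int c = 0; c < n_cores; c++)
            if (core_time[c] > w)
                w = core_time[c];
        return w + overhead;
    }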

Timing Analysis
We propose the use of reachability for multicore analysis:
– Trade analysis time for higher precision.
– Analyse inter-core synchronisations in detail.
– Handle state-space explosion by reducing the program's CCFG before reachability analysis.
Tool flow: program binary (annotated) → program's reduced CCFG → compute each global tick → WCRT.

Timing Analysis
CCFG optimisations:
– merge: reduces the number of CFG nodes that need to be traversed.
– merge-b: reduces the number of alternate paths in the CFG (reduces the number of global ticks).

Timing Analysis
Computing each global tick must account for:
1. Parallel thread execution and inter-core synchronisations.
2. Scheduling overheads.
3. Variable delay in accessing the shared bus.

Timing Analysis
1. Parallel thread execution and inter-core synchronisations:
– An integer counter tracks each core's execution time.
– Static scheduling allows us to determine the thread execution order on each core (e.g., Core 1: main, f2; Core 2: f1).
– Synchronisation occurs at fork/join and at the end of the global tick (see the sketch below).
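A minimal sketch of how such counters might be synchronised (an assumed representation, not the paper's implementation):

    /* At a synchronisation point (fork/join, end of global tick), every
     * participating core's counter is raised to the latest arrival time. */
    void synchronise(long core_time[], const int cores[], int n) {
        long latest = 0;
        for (int i = 0; i < n; i++)
            if (core_time[cores[i]] > latest)
                latest = core_time[cores[i]];
        for (int i = 0; i < n; i++)
            core_time[cores[i]] = latest;  /* all wait for the slowest core */
    }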

Timing Analysis
2. Scheduling overheads:
– Synchronisation at fork/join and at the global tick, via global memory.
– Thread context-switching: copying of shared variables at the start of the thread's local tick, via global memory.
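The copying overhead corresponds to a generated step along these lines (an illustrative sketch for the sum example; the compiler generates the actual code):

    typedef struct { int sum; } Shared;   /* the program's shared variables */

    /* At the start of a thread's local tick, it takes its conceptual copy
     * of the shared variables from global memory (a scheduling overhead). */
    void start_local_tick(Shared *copy, const Shared *global) {
        *copy = *global;
    }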

Timing Analysis
2. Scheduling overheads (continued):
– The required scheduling routines are statically known.
– Analyse the control-flow of the routines.
– Compute the execution time for each scheduling overhead.

Timing Analysis
3. Variable delay in accessing the shared bus:
– Global memory is accessed by the scheduling routines.
– The TDMA bus delay has to be considered.

Timing Analysis
3. Variable delay in accessing the shared bus (continued):
[Diagram: the TDMA bus schedule assigns Core 1 and Core 2 alternating time slots; an access must wait for its core's slot.]

Timing Analysis
3. Variable delay in accessing the shared bus (continued): each global-memory access by a core is delayed until that core's TDMA slot, and the analysis accounts for this delay along each path (see the sketch below).
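A deliberately conservative sketch of the slot-wait computation (illustrative; it assumes fixed-width slots in core order and charges a full wait to the start of the core's next slot):

    /* Cycles core `c` must wait from time `t` until its next TDMA slot
     * begins, with `slot` cycles per slot and `n_cores` slots per round. */
    long tdma_delay(long t, int c, int n_cores, long slot) {
        long round = slot * n_cores;
        long slot_start = (long)c * slot;   /* offset of c's slot in a round */
        long pos = t % round;               /* position within current round */
        if (pos <= slot_start)
            return slot_start - pos;
        return round - pos + slot_start;    /* wait for the next round */
    }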

Outline: Introduction, ForeC Language, Timing Analysis, Results, Conclusions.

Results
For the proposed reachability-based timing analysis, we demonstrate:
– the precision of the computed WCRT, and
– the efficiency of the analysis, in terms of analysis time.

Results
Timing analysis tool flow: program binary (annotated) → program CCFG (with optimisations) → proposed reachability or Max-Plus analysis → WCRT.

Results
Multicore simulator (Xilinx MicroBlaze), based on an existing simulator and extended to be cycle-accurate and to support multiple cores and a TDMA bus.
[Architecture: each core has private data and instruction memories (16KB and 32KB, 1-cycle access); cores share a data memory over a TDMA shared bus (5-cycle access, 5 cycles per core slot; bus schedule round = 5 × number of cores).]

Results
Benchmark programs: a mix of control/data computations, thread structures, and computation loads.
* [Pop et al 2011] A Stream-Computing Extension to OpenMP.
# [Nemer et al 2006] A Free Real-Time Benchmark.
[Table of benchmark programs omitted; sources marked * and # above.]

Results
Each benchmark program was distributed over 1 to n cores, where n is the maximum number of parallel threads.
Observed WCRT: input vectors were chosen to elicit the worst-case execution path identified by the reachability analysis.
Computed WCRT: reachability and Max-Plus.

802.11a Results
Observed: the WCRT decreases as cores are added, up to 5 cores. Beyond that, the TDMA bus becomes a bottleneck: global memory accesses become more expensive, and synchronisation overheads grow.

802.11a Results
Reachability: ~2% over-estimation, thanks to explicit path exploration.

802.11a Results
Max-Plus: assumes one global tick in which all threads execute their worst case. Loss of thread execution context: it must take the maximum execution time of the scheduling routines.

802.11a Results
Both approaches: the estimation of synchronisation cost is conservative; it is assumed that the receiver only starts after the last sender.

802.11a Results
Analysis time: Max-Plus takes less than 2 seconds; reachability takes noticeably longer, which motivates the CCFG optimisations evaluated next.

802.11a Results
Reachability with the merge optimisation: state reduction of ~9.34x.


802.11a Results
Reachability with merge: ~9.34x reduction; with merge-b: ~342x reduction. Analysis takes less than 7 seconds.

802.11a Results
Reduction in states → reduction in analysis time (states = number of global ticks explored).

Results
Across all benchmarks, reachability gives ~1 to 8% over-estimation. The loss in precision comes mainly from over-estimating the synchronisation costs.

Results
Max-Plus: the over-estimation is very dependent on program structure. FmRadio and Life are very imprecise: loops can "amplify" over-estimations. Matrix is quite precise: it executes in one global tick, so the Max-Plus assumption is valid.

Results
Our tool generates a timing trace for the computed WCRT:
– For each core: thread start/end times, context-switching, fork/join, ...
– Can be used to tune the thread distribution; it was used to find good thread distributions for each benchmark program.

Outline: Introduction, ForeC Language, Timing Analysis, Results, Conclusions.

Conclusions
– ForeC: a language for deterministic parallel programming, based on the synchronous framework.
– Able to achieve WCRT speedup while providing time-predictability.
– Precise, fast, and scalable timing analysis for multicore programs using reachability.

Future work
Implementation: WCRT-guided, automatic thread distribution; decrease the global synchronisation overhead without increasing analysis complexity.
Analysis: prune additional infeasible paths using value analysis; include the use of caches/scratchpads in the multicore memory hierarchy.

Questions?

Introduction
Existing parallel programming solutions:
– Shared memory model: OpenMP, Pthreads, Intel Cilk Plus, Thread Building Blocks, Unified Parallel C, ParC, X10.
– Message passing model: MPI, SHIM.
These provide ways to manage shared resources but do not prevent concurrency errors.
[Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing. [SHIM] SHIM: A Language for Hardware/Software Integration.

Introduction
– Desktop variants are optimised for average-case performance (FLOPS), not time-predictability.
– Threaded programming model: non-deterministic thread interleaving makes understanding and debugging hard.
[Lee 2006] The Problem with Threads.

Introduction
Parallel programming:
– The programmer manages the shared resources.
– Concurrency errors: deadlocks, race conditions, atomicity violations, order violations.
[McDowell et al 1989] Debugging Concurrent Programs. [Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.

Introduction
Deterministic runtime support:
– For Pthreads: dOS, Grace, Kendo, CoreDet, Dthreads.
– For OpenMP: Deterministic OMP.
– Based on a concept of logical time; each logical time step is broken into an execution phase and a communication phase.
[Bergan et al 2010] Deterministic Process Groups in dOS. [Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software. [Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. [Liu et al 2011] Dthreads: Efficient Deterministic Multithreading. [Aviram 2012] Deterministic OpenMP.

ForeC language
The behaviour of shared variables is similar to:
– Esterel: valued signals
– Intel Cilk Plus: reducers
– Unified Parallel C: collectives
– DOMP: workspace consistency
– Grace, Dthreads: copy-on-write

ForeC language
Parallel programming patterns can be expressed by specifying an appropriate combine function: map-reduce, scatter-gather, software pipelining, delayed broadcast or point-to-point communication. This is the sacrifice made for deterministic parallel programs; a sketch follows below.
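As an illustrative sketch (not from the slides), a map-reduce minimum could be expressed with a combine function like:

    shared int best = 2147483647 combine with min2;     // INT_MAX sentinel
    int min2(int a, int b) { return (a < b) ? a : b; }  // commutative, associative
    // Hypothetical worker threads forked with par(...) each update their
    // own copy of best; at the global tick end the copies are combined
    // with min2, yielding the overall minimum deterministically.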