
1 Programming Safety-Critical Embedded Systems. Work mainly by Sidharta Andalam and Eugene Yip. Main supervisor: Dr. Partha Roop (UoA). Advisor: Dr. Alain Girault (INRIA).

2 Outline Introduction Synchronous Languages PRET-C ForeC


4 Introduction Safety-critical systems: – Perform specific real-time tasks. – Comply with strict safety standards [IEC 61508, DO-178]. – Time-predictability is useful in real-time designs. [Paolieri et al 2011] Towards Functional-Safe Timing-Dependable Real-Time Architectures. [Figure: embedded systems sit at the intersection of safety-critical concerns, timing/functionality requirements, and timing analysis.]

5 Introduction [Figure: languages and frameworks positioned by domain of application (embedded vs. desktop) and processor type (single-core, multicore, manycore): C, RTOS (VxWorks), UPC, X10, Intel Cilk Plus, SharC, Grace, SHIM, Sigma C, ForkLight, Esterel, SCADE, Simulink, Protothreads, OpenMP, OpenCL, Pthreads, ParC, PRET-C, ForeC.]

6 Outline Introduction Synchronous Languages PRET-C ForeC

7 Synchronous Languages Deterministic concurrency (formal semantics). – Concurrent control behaviours. – Typically compiled away. Execution model similar to digital circuits. – Threads execute in lock-step to a global clock. – Threads communicate via instantaneous signals. [Benveniste et al 2003] The Synchronous Languages 12 Years Later. [Figure: inputs sampled and outputs emitted at global ticks 1 to 4.]

8 Synchronous Languages [Figure: reaction time within each tick along physical time (1s, 2s, 3s, 4s).] Must validate: max(reaction time) < min(time for each tick). The time for each tick is specified by the system's timing requirements. [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

9 Synchronous Languages Esterel, Lustre, Signal. Synchronous extensions to C retain the essence of C and add deterministic concurrency and thread communication: – PRET-C – Reactive Shared Variables – Synchronous C – Esterel C Language. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

10 Outline Introduction Synchronous Languages PRET-C ForeC

11 PRET-C Stages: 1. PRET-C: simple synchronous extension to C (using macros). 2. TCCFG: intermediate format. 3. TCCFG': updated after cache analysis. 4. Model checking: binary search for the WCRT. Example PRET-C source:

  void main() {
    while (1) {
      abort
        PAR(sampler, display);
      when (reset);
      EOT;
    }
  }

[Flow: PRET-C → TCCFG → cache analysis → model checker → WCRT (final output).]

12 PRET-C Simple set of synchronous extensions to C: – Light-weight multi-threading. – Macro-based implementation. – Thread-safe shared memory accesses. – Amenable to timing analysis for ensuring time-predictability.

13 PRET-C
Statement – Description
ReactiveInput I – Declares I as a reactive input coming from the environment.
ReactiveOutput O – Declares O as a reactive output emitted to the environment.
PAR(T1, ..., Tn) – Synchronously executes threads T1 to Tn in parallel. Thread Ti has higher execution priority than Ti+1.
EOT – Marks the end of a tick.
[weak] abort P when C – Terminates P when C is true.
The semantics of PRET-C is presented in structural operational style, along with proofs of reactivity and determinism [IEEE TC 2013 March].

14 PRET-C Code:

  ...
  PAR(T1, T2);
  ...
  T1: A; EOT; C; EOT
  T2: B; EOT; D; EOT

[Figure: execution trace over time. In the first global tick, T1 runs A then T2 runs B (their local ticks); in the second global tick, T1 runs C then T2 runs D.]

15 Outline Introduction Synchronous Languages PRET-C ForeC

16 Introduction Safety-critical systems: – Shift from single-core to multicore processors. – Cheaper, with a better power vs. execution-performance trade-off. [Figure: cores 0 to n connected by a system bus to a shared resource.] [Blake et al 2009] A Survey of Multicore Processors. [Cullmann et al 2010] Predictability Considerations in the Design of Multi-Core Embedded Systems.

17 Introduction Parallel programming: – From supercomputers to mainstream computers. – Frameworks designed for systems without resource constraints or safety concerns. Optimised for average-case performance (FLOPS), not time-predictability. – Threaded programming model. Pthreads, OpenMP, Intel Cilk Plus, ParC, ... Non-deterministic thread interleaving makes understanding and debugging hard. [Lee 2006] The Problem with Threads.

18 Introduction Parallel programming: – The programmer is responsible for shared resources. – Concurrency errors: deadlock, race condition, atomicity violation, order violation. A classic race condition is sketched below. [McDowell et al 1989] Debugging Concurrent Programs. [Lu et al 2008] Learning from Mistakes: A Comprehensive Study on Real World Concurrency Bug Characteristics.
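As an illustration (ours, not from the slides), a minimal Pthreads program with a data race: the final count depends on how the two threads interleave their non-atomic read-modify-write sequences, so different runs can print different results.

  #include <pthread.h>
  #include <stdio.h>

  static long count = 0;   /* shared, unprotected */

  static void *worker(void *arg) {
      (void)arg;
      for (int i = 0; i < 1000000; i++)
          count++;         /* non-atomic read-modify-write: a data race */
      return NULL;
  }

  int main(void) {
      pthread_t t1, t2;
      pthread_create(&t1, NULL, worker, NULL);
      pthread_create(&t2, NULL, worker, NULL);
      pthread_join(t1, NULL);
      pthread_join(t2, NULL);
      /* expected 2000000, but interleavings can lose updates */
      printf("count = %ld\n", count);
      return 0;
  }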

19–20 Introduction Synchronous languages: Esterel, Lustre, Signal. Synchronous extensions to C: PRET-C, Reactive Shared Variables, Synchronous C, Esterel C Language. These have sequential execution semantics, and compilation produces sequential programs: unsuitable for parallel execution. [Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs. [Boussinot 1993] Reactive Shared Variables Based Systems. [Hanxleden et al 2009] SyncCharts in C - A Proposal for Light-Weight, Deterministic Concurrency. [Lavagno et al 1999] ECL: A Specification Environment for System-Level Design.

21 ForeC ("Foresee") C-based, multi-threaded, synchronous language. Inspired by PRET-C and Esterel. Deterministic parallel execution on embedded multicores. Fork/join parallelism and shared-memory thread communication. Program behaviour is independent of the chosen thread scheduling.

22 ForeC

23 ForeC Additional constructs to C: – pause: synchronisation barrier. Pauses the thread's execution until all threads have paused. – par(st1, ..., stn): forks each statement to execute as a parallel thread. Each statement is implicitly scoped. – [weak] abort st when [immediate] exp: preempts the statement st when exp evaluates to a non-zero value. exp is evaluated in each global tick before st is executed.

24 ForeC Additional variable type-qualifiers to C: – input and output: declare variables whose values are updated from, or emitted to, the environment at each global tick.


26 ForeC Additional variable type-qualifiers to C: – shared: declares a shared variable that can be accessed by multiple threads. 1. Threads make local copies of the shared variables they may use at the start of their local ticks. 2. Threads only modify their local copies during execution. 3. If a par statement terminates: modified copies from the child threads are combined (using a commutative and associative function) and assigned to the parent. 4. If the global tick ends: the modified copies are combined and assigned to the actual shared variables.

27 Execution Example

  shared int sum = 1 combine with plus;
  int plus(int copy1, int copy2) {
    return (copy1 + copy2);
  }
  void main(void) {
    par(f(1), f(2));
  }
  void f(int i) {
    sum = sum + i;
    pause;
    ...
  }

Callouts: pause – synchronisation; par – fork-join; sum – shared variable; plus – commutative and associative combine function.

28–34 Execution Example 1 A tick-by-tick trace of the program above:
– Global tick start: global sum = 1; f1 and f2 copy it into local copies (sum1 = 1, sum2 = 1).
– Local ticks: f1 computes sum1 = 1 + 1 = 2; f2 computes sum2 = 1 + 2 = 3.
– Global tick end: the modified copies are combined with plus and assigned back: sum = 2 + 3 = 5.
– Next global tick start: f1 and f2 copy the new value (sum1 = sum2 = 5) and continue.
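To make the copy-and-combine semantics concrete, here is a minimal plain-C emulation of the trace above (an illustrative sketch, not the ForeC compiler's output): each thread's local tick works on a private copy, and the tick end combines the copies with the user's function.

  #include <stdio.h>

  static int plus(int copy1, int copy2) { return copy1 + copy2; }

  int main(void) {
      int sum = 1;                 /* the shared variable */

      /* Global tick start: each thread snapshots the shared variable. */
      int sum1 = sum, sum2 = sum;

      /* Local ticks: threads modify only their own copies. */
      sum1 = sum1 + 1;             /* f(1) */
      sum2 = sum2 + 2;             /* f(2) */

      /* Global tick end: combine the modified copies and write back. */
      sum = plus(sum1, sum2);
      printf("sum = %d\n", sum);   /* prints sum = 5, as in the trace */
      return 0;
  }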

35 Execution Example 2 Sum a set of data.

  shared int v = 0 combine with plus;
  int[4] data = {1, 2, 3, 4};
  void main(void) {
    f(data);
  }
  void f(int *data) {
    par(add(0, data), add(2, data));
  }
  void add(int x, int *data) {
    v = data[x] + data[x+1];
  }

36 Execution Example 2 Sum sets of data in parallel.

  shared int v = 0 combine with plus;
  int[4] data = {1, 2, 3, 4};
  int[4] data1 = {5, 6, 7, 8};
  void main(void) {
    f(data);
  }
  void f(int *data) {
    par(add(0, data), add(2, data));
  }
  void add(int x, int *data) {
    v = data[x] + data[x+1];
  }

37 Execution Example 2 Sum sets of data together in parallel.

  shared int v = 0 combine with plus;
  int[4] data = {1, 2, 3, 4};
  int[4] data1 = {5, 6, 7, 8};
  void main(void) {
    par(f(data), f(data1));
  }
  void f(int *data) {
    par(add(0, data), add(2, data));
  }
  void add(int x, int *data) {
    v = data[x] + data[x+1];
  }

38–39 Execution Example 2 [Figure: thread hierarchy. main forks two f threads, each forking two add threads. With a single global v (slide 38), all copies combine into one shared variable; giving each f its own v (slide 39) keeps the two sums separate, as in the next slide.]

40 Execution Example 2

  int[4] data = {1, 2, 3, 4};
  int[4] data1 = {5, 6, 7, 8};
  void main(void) {
    par(f(data), f(data1));
  }
  void f(int *data) {
    shared int v = 0 combine with plus;
    par(add(0, data, &v), add(2, data, &v));
  }
  void add(int x, int *data, shared int *const v combine with +) {
    *v = data[x] + data[x+1];
  }

41 Execution Example Shared variables: – Threads modify local copies of shared variables. Isolation of thread execution allows threads to truly execute in parallel. Thread interleaving does not affect the program's behaviour. – Prevents most concurrency errors. Deadlock, race condition: no locks. Atomicity and order violation: local copies. – Copies of a shared variable can be split into groups and combined in parallel.

42–44 Execution Example Shared variables: – The programmer has to define a suitable combine function for each shared variable, and must ensure the combine function is indeed commutative and associative. – The notion of "combine functions" is not entirely new. Related constructs: Intel Cilk Plus reducers and holders (cilk::reducer_op, cilk::holder_op); OpenMP reduction(operator: var); MPI collectives (MPI_Reduce, MPI_Gather); UPC collectives; X10 aggregates; Esterel valued signals with combine operators; Reactive Shared Variables with combine operators. [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [OpenMP] http://openmp.org [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [Unified Parallel C] http://upc.lbl.gov/ [X10] http://x10-lang.org/ [Berry et al 1992] The Esterel Synchronous Programming Language: Design, Semantics and Implementation. [Boussinot 1993] Reactive Shared Variables Based Systems.
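For comparison, a minimal OpenMP example (ours, not from the slides) of the reduction clause named above: each thread accumulates into a private copy initialised to the operator's identity, and the copies are combined when the loop ends. Compile with -fopenmp; without it the pragma is ignored and the sequential result is the same.

  #include <stdio.h>

  int main(void) {
      int sum = 0;
      /* Each thread gets a private sum (initialised to 0, the identity
         of +); the private copies are combined with + at the end. */
      #pragma omp parallel for reduction(+: sum)
      for (int i = 1; i <= 100; i++)
          sum += i;
      printf("sum = %d\n", sum);   /* 5050 regardless of thread count */
      return 0;
  }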

45 Shared Variable Design Patterns – Point-to-point – Broadcast – Software pipelining – Divide and conquer (scatter/gather, map/reduce)

46 Point-to-point

  shared int sum = 0 combine with plus;
  void main(void) {
    par( f(), g() );
  }
  void f(void) {
    while (1) { sum = comp1(); pause; }
  }
  void g(void) {
    while (1) { comp2(sum); pause; }
  }

The new value of sum is received in the next global tick. A combine operation is not required.

47 Broadcast

  shared int sum = 0 combine with plus;
  void main(void) {
    par( f(), g(), g() );
  }
  void f(void) {
    while (1) { sum = comp1(); pause; }
  }
  void g(void) {
    while (1) { comp2(sum); pause; }
  }

Multiple receivers. A combine operation is not required. The new value of sum is received in the next global tick.

48 Software Pipelining

  shared int s1 = 0, s2 = 0 combine with plus;
  void main(void) {
    par( stage1(), stage2(), stage3() );
  }
  void stage1(void) {
    while (1) { s1 = comp1(); pause; }
  }
  void stage2(void) {
    pause;
    while (1) { s2 = comp2(s1); pause; }
  }
  void stage3(void) {
    pause;
    while (1) { comp3(s2); pause; }
  }

Outputs from each stage are buffered: use the delayed behaviour of shared variables to buffer each stage.

49 Divide and Conquer Count the number of edges in an image.

  input int[1024] image;
  shared int edges = 0 combine with plus;
  void main(void) {
    par( analyse(0, 511), analyse(512, 1023) );
  }
  void analyse(int start, int end) {
    while (1) {
      edges = 0;
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges++;
      }
      pause;
    }
  }

50 Scheduling Light-weight static scheduling: – Take advantage of multicore performance while delivering time-predictability. – Generate code to execute directly on hardware (bare metal/no OS). – Thread allocation and scheduling order on each core decided at compile time by the programmer. Develop a WCRT-aware scheduling heuristic. Thread isolation allows for scheduling flexibility. – Cooperative (non-preemptive) scheduling.

51 Scheduling Cores synchronise to fork/join threads and to end each global tick. One core performs housekeeping tasks at the end of the global tick: – Combining shared variables. – Emitting outputs. – Sampling inputs and triggering the next global tick. A sketch of this barrier scheme appears below.
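A minimal sketch of such a tick barrier using POSIX threads (our illustration; the actual ForeC runtime is bare-metal and OS-free): all cores wait at a barrier, exactly one of them performs the housekeeping, and a second barrier releases everyone into the next global tick.

  #include <pthread.h>
  #include <stdio.h>

  #define NCORES 2
  static pthread_barrier_t tick_barrier;

  /* Placeholders for the housekeeping steps described on the slide. */
  static void combine_shared_variables(void) { /* ... */ }
  static void emit_outputs(void)              { /* ... */ }
  static void sample_inputs(void)             { /* ... */ }

  static void global_tick_end(void) {
      /* Exactly one waiter is designated the "serial" thread. */
      if (pthread_barrier_wait(&tick_barrier) == PTHREAD_BARRIER_SERIAL_THREAD) {
          combine_shared_variables();
          emit_outputs();
          sample_inputs();                    /* triggers the next global tick */
      }
      pthread_barrier_wait(&tick_barrier);    /* release all cores */
  }

  static void *core(void *arg) {
      long id = (long)arg;
      for (int tick = 0; tick < 3; tick++) {
          printf("core %ld: runs the local ticks of its threads\n", id);
          global_tick_end();
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[NCORES];
      pthread_barrier_init(&tick_barrier, NULL, NCORES);
      for (long i = 0; i < NCORES; i++)
          pthread_create(&t[i], NULL, core, (void *)i);
      for (int i = 0; i < NCORES; i++)
          pthread_join(t[i], NULL);
      pthread_barrier_destroy(&tick_barrier);
      return 0;
  }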

52 Results Multicore simulator (Xilinx MicroBlaze): – Based on http://www.jwhitham.org/c/smmu.html, extended to be cycle-accurate and to support multiple cores and a TDMA bus. [Figure: cores 0 to n, each with 16KB instruction memory and 32KB data memory (local access 1 cycle), connected by a TDMA shared bus to global memory (5-cycle access; 5 cycles/core, so a bus schedule round = 5 × number of cores).]

53 WCRT Execution Results Able to achieve speed-ups for all programs. The benefit of multicore execution diminishes as the number of cores increases, due to overheads (bus and memory accesses, scheduling routines).

54 Programming PTARM using ForeC

  shared int sum = 1 combine with plus;
  int plus(int copy1, int copy2) {
    return (copy1 + copy2);
  }
  void main(void) {
    par(f(1), f(2));
  }
  void f(int i) {
    sum = sum + i;
    pause;
    ...
  }

55–56 Execution of ForeC

  int main(void) {
    SET_THREAD_LOCATION(0, _pt_hwt0);
    SET_THREAD_LOCATION(1, _pt_hwt1);
    SET_THREAD_LOCATION(2, _pt_idle);
    SET_THREAD_LOCATION(3, _pt_idle);
  _pt_hwt0:
    initialize code;
    goto main;
  _pt_hwt1:
    wait for par;
    goto f_2;
  _pt_idle:
    goto _pt_idle;

continues...

57 Execution of ForeC

  main:
    fork f_1 and f_2;
  _par_resume:
    return 0;
  f_1:
    sum = 1;
    synchronization code;
    thread termination code;
  f_2:
    sum = 2;
    synchronization code;
    thread termination code;
  }

58 Non-Realtime Threads in ForeC A non-realtime thread (NRT) has: – No strict timing requirements. – Possibly unbounded execution time. – Asynchronous computation. – E.g., file archiving, compression, data analysis.

59 Non-Realtime Threads in ForeC Split the execution time of NRTs into periods: guarantee that f() executes for at least min_t and at most max_t in each global tick. – When the period elapses, the execution pauses. – Execution resumes in the next global tick.

  // Non-realtime thread.
  void nrt(void) {
    do {
      f();
    } until (min_t, max_t);
  }

60 Non-Realtime Threads in ForeC The do ... until construct above expands to:

  // Non-realtime thread.
  void nrt(void) {
    // Set deadline equal to the current time + min_t.
    setDeadline(min_t);
    // Enable a timing exception and register a handler.
    enableException(max_t, handler);
    // Execute the body.
    f();
    // The body has finished executing.
    // Disable the timing exception.
    disableException();
    goto end;
  // Timing exception handler.
  handler: {
      // Save the execution context.
      pause;
      setDeadline(min_t);
      // Restore the execution context.
    }
  end:;
  }

61 PTARM modifications Boot-up: – Modified to allow loading of multiple hardware threads. Exceptions: – Added the exception handler to the boot loader. Context saving: – Modified the VHDL to save the PC to the LR. – Saves registers onto the stack in the exception routine.

62 Tick Precise Allocation Device Work by Matthew Kuo. Main supervisor: Partha Roop.

63 Introduction [Figure: caches rate high on performance but low on timing precision.] Traditionally, caches: – Bridge the memory gap. – Small, fast piece of memory exploiting temporal and spatial locality. – Hardware controlled. Hard real-time systems must compute the WCRT, which requires modelling the architecture; cache models are complex and not tight.

64 Introduction [Figure: scratchpads rate high on timing precision but lower on performance.] Scratchpads: – Small piece of memory. – Software controlled; requires an allocation algorithm (ILP, greedy). Hard real-time systems: easy to compute a tight WCRT. But scratchpads reduce average-case performance, may even be worse than a cache for worst-case performance, and are not as efficient as caches.

65–66 Introduction [Figure: cache vs. scratchpad on performance and timing-precision axes; TickPAD is positioned to combine the performance of caches with the timing precision of scratchpads.]

67 Tick Precise Allocation Device TickPAD: Tick Precise Allocation Device. A memory controller that is a hybrid between caches and scratchpads: software-controlled memory like a scratchpad, with hardware-controlled features. Targets hard real-time synchronous programs.

68 TickPAD System Specifications [Figure: one cache line = 4 instructions (4 × 32 bits, addresses 0x00, 0x04, 0x08, 0x0C) and takes 1 burst transfer from main memory; buffers are 1 cache line in size.]
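A small sketch (our illustration) of the address arithmetic this layout implies: with 4 × 32-bit instructions per line, the line base and the slot within a line come from the low 4 bits of the address.

  #include <stdint.h>
  #include <stdio.h>

  #define LINE_BYTES 16u   /* 4 instructions x 4 bytes */

  int main(void) {
      uint32_t pc = 0x2C8;
      uint32_t line_base = pc & ~(LINE_BYTES - 1);  /* start of the line */
      uint32_t slot = (pc & (LINE_BYTES - 1)) / 4;  /* instruction 0..3 */
      printf("pc 0x%03X -> line 0x%03X, slot %u\n", pc, line_base, slot);
      return 0;
  }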

69–73 TickPAD – scratchpad memory for synchronous programs [Figure: TickPAD architecture, with callouts for each component:]
– Spatial memory pipeline: accelerates linear code.
– Associative loop memory: for predictable temporal locality; statically allocated, dynamically loaded.
– Tick address queue: stores the resumption addresses of active threads.
– Tick instruction buffer: stores the instructions at the resumption point of the next active thread, to reduce context-switching overhead at state/tick boundaries.
– Command table: stores a set of commands to be executed by the TickPAD controller (Command: the type of operation; Address: the PC value at which the command is activated; Operand: data needed by the command).
– Operand buffer: a buffer to store operands fetched from main memory, for commands requiring 2+ operands.

74 Spatial Memory Pipeline Exploits spatial locality: predictably prefetch the next line of instructions.
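An illustrative software model of this policy (ours, with assumed names, not the TickPAD RTL): whenever execution enters line k, line k+1 is fetched into a single prefetch buffer, so linear code hits after the first line and the behaviour is trivially predictable.

  #include <stdint.h>
  #include <stdio.h>

  #define LINE_BYTES 16u

  static uint32_t buffered_line = UINT32_MAX;   /* line held by the prefetch buffer */

  /* Returns 1 on a hit in the prefetch buffer, 0 on a miss (main-memory
     fetch), and always prefetches the next sequential line. */
  static int fetch(uint32_t pc) {
      uint32_t line = pc / LINE_BYTES;
      int hit = (line == buffered_line);
      buffered_line = line + 1;                 /* predictable next-line prefetch */
      return hit;
  }

  int main(void) {
      /* Linear code: every line after the first hits the buffer. */
      for (uint32_t pc = 0x200; pc < 0x260; pc += LINE_BYTES)
          printf("pc 0x%03X: %s\n", pc, fetch(pc) ? "hit" : "miss");
      return 0;
  }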

75–86 Spatial Memory Pipeline [Figure: animation stepping through the spatial memory pipeline as it prefetches successive instruction lines.]

87 Command Table A look-up table used to dynamically load the: – Tick instruction buffer. – Tick address queue. – Associative loop memory. Statically allocated. Commands are executed when the PC matches the address stored in the command.
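A behavioural sketch of that matching rule (our illustration; the names, operations, and addresses are assumptions, not the TickPAD RTL): on every fetch, the controller compares the PC against the command table and dispatches any matching command.

  #include <stdint.h>
  #include <stdio.h>

  typedef enum {
      STORE_TICK_ADDRESS_QUEUE,
      LOAD_TICK_INSTRUCTION_BUFFER,
      STORE_LOOP_MEMORY,
      DISCARD_LOOP_MEMORY
  } Op;

  typedef struct {
      Op op;
      uint32_t address;   /* PC value at which the command is activated */
      uint32_t operand;   /* data needed by the command */
  } Command;

  static void dispatch(const Command *c) {
      printf("pc 0x%03X: executing command %d (operand 0x%X)\n",
             c->address, (int)c->op, c->operand);
  }

  /* Called on every instruction fetch. */
  static void tickpad_step(uint32_t pc, const Command *table, int n) {
      for (int i = 0; i < n; i++)
          if (table[i].address == pc)
              dispatch(&table[i]);
  }

  int main(void) {
      const Command table[] = {
          { STORE_TICK_ADDRESS_QUEUE,     0x2B0, 0x2F0 },  /* at a FORK */
          { LOAD_TICK_INSTRUCTION_BUFFER, 0x310, 0x4F0 }   /* at an EOT */
      };
      for (uint32_t pc = 0x2B0; pc <= 0x320; pc += 4)
          tickpad_step(pc, table, 2);
      return 0;
  }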

88–90 TickPAD Design Flow [Figure: the TickPAD design flow.]

91–95 Command Table Allocation
Node – Command – Address
FORK – Store tick address queue × N – address of the FORK.
EOT – Store tick address queue; load tick instruction buffer – address of the EOT.
KILL – Load tick instruction buffer – address of the KILL.
Loops – Discard loop associative memory; store loop associative memory – address at the start of the loop.

96–99 Tick Address Queue and Tick Instruction Buffer Reduce the cost of context switching by making context-switching points appear as linear code (paired with the spatial memory pipeline). The tick address queue stores an ordered list of the resumption addresses of each thread; the tick instruction buffer stores the instructions of the next active thread.

100–108 [Figure: worked example of the tick address queue and tick instruction buffer as the PC advances through 0x2B0, 0x2C0, 0x2F0, 0x300, 0x310 and 0x4F0.]

109 Associative Loop Memory Statically allocated (greedy or ILP allocation). Fetches a loop before executing it: – Predictable: easy to model tightly. – Exploits temporal locality.

110 Results [Figure: WCRT results.]
111 Gains of 8.5% compared to a locked scratchpad memory and 12.3% compared to a thread-interleaved scratchpad.
112 Results [Figure.]
113 Results – Synthesis [Figure: synthesis results.]

114 Conclusions C-based synchronous languages for writing deterministic, time-predictable software: – PRET-C: single-cores. – ForeC: multicores. Can achieve WCRT speed-up while providing time-predictability. Very precise and fast timing analysis for PRET-C and ForeC programs using reachability.

115 Conclusions A new time-precise memory architecture: TickPAD. Showed that using TickPAD is comparable to using cache and scratchpad memories. Future directions: – The use of TickPAD for data caches. – Implementing TickPAD on precision-timed architectures.

116 Questions?

117 Outline Introduction ForeC Language Timing Analysis Results Conclusions

118 Timing Analysis Compute the program's worst-case reaction time (WCRT). [Figure: reaction time within each tick along physical time (1s, 2s, 3s, 4s).] Must validate: max(reaction time) < min(time for each tick). The time for each tick is specified by the system's timing requirements. [Benveniste et al 2003] The Synchronous Languages 12 Years Later.

119 Timing Analysis Existing approaches for synchronous programs: integer linear programming (ILP), "coarse-grained" reachability (Max-Plus), and model checking. One existing approach analyses the WCRT of synchronous programs on multicores: [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors. Uses ILP; no tightness result; all experiments performed on a 4-core processor.

120 Timing Analysis Existing approaches for synchronous programs. Integer linear programming (ILP): – The execution time of the program is described as a set of integer equations. – Solving ILP is NP-complete. [Ju et al 2010] Timing Analysis of Esterel Programs on General-Purpose Multiprocessors.

121 Timing Analysis Existing approaches for synchronous programs. "Coarse-grained" reachability (Max-Plus): – Compute the WCRT of each thread. – From the thread WCRTs, compute the WCRT of the program. – Assumes there is a global tick in which all threads execute their worst case. [M. Boldt et al 2008] Worst Case Reaction Time Analysis of Concurrent Reactive Programs.
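A sketch of the coarse-grained bound (our illustration, assuming the threads are sequentialised onto one core and all hit their worst case in the same tick, as the approach assumes): the reaction bound is simply the sum of the thread WCRTs plus a fixed overhead, regardless of whether those worst-case local ticks can actually co-occur.

  /* Coarse-grained (Max-Plus style) WCRT bound: thread WCRTs are combined
     without asking whether the worst cases can happen in the same tick. */
  #include <stdio.h>

  static long coarse_wcrt(const long thread_wcrt[], int n, long overhead) {
      long bound = overhead;
      for (int i = 0; i < n; i++)
          bound += thread_wcrt[i];
      return bound;
  }

  int main(void) {
      long wcrt[] = { 120, 80, 45 };   /* per-thread WCRTs in cycles (made up) */
      printf("WCRT bound: %ld cycles\n", coarse_wcrt(wcrt, 3, 30));
      return 0;
  }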

122 Timing Analysis Existing approaches for synchronous programs. Model checking: – Computes the execution time along all possible execution paths. – State-space explosion problem. – Binary search: check whether the WCRT is less than "x". – Trades off analysis time for precision. – Counter-example: the execution trace for the WCRT. [P. S. Roop et al 2009] Tight WCRT Analysis of Synchronous C Programs.

123 Timing Analysis Proposed "fine-grained" reachability approach: only consider local ticks that can execute together in the same global tick. Produces a timed execution trace for the WCRT. To handle the state-space explosion: reduce the program's CCFG before analysis. [Flow: program binary (annotated) → reconstruct the program's CCFG → find all global ticks (reachability) → WCRT.]

124 Timing Analysis Programs executed on the following multicore architecture: [Figure: cores 0 to n, each with local data and instruction memories, connected by a TDMA shared bus to global memory.]

125 Timing Analysis Computing the execution time must account for: 1. Overlapping of thread execution time from parallelism and inter-core synchronizations. 2. Scheduling overheads. 3. Variable delay in accessing the shared bus.

126 Timing Analysis 1. Overlapping of thread execution time from parallelism and inter-core synchronisations. An integer counter tracks each core's execution time. Synchronisation occurs when forking/joining and when ending the global tick; it advances the execution time of the participating cores. [Figure: main and f1 on core 1, f2 on core 2, with synchronisation points aligning the two time lines.]
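A minimal sketch (ours, not the authors' tool) of this counter mechanism: each core accumulates the cost of the blocks it executes, and every synchronisation point advances all participating cores to the latest arrival time.

  #include <stdio.h>

  #define NCORES 2

  static long core_time[NCORES];   /* one execution-time counter per core */

  /* Account for a code block of `cost` cycles executed on `core`. */
  static void execute(int core, long cost) { core_time[core] += cost; }

  /* Fork/join or global-tick synchronisation: every participating core
     waits for the slowest one, so all counters advance to the maximum. */
  static void synchronise(void) {
      long latest = 0;
      for (int c = 0; c < NCORES; c++)
          if (core_time[c] > latest) latest = core_time[c];
      for (int c = 0; c < NCORES; c++)
          core_time[c] = latest;
  }

  int main(void) {
      execute(0, 40);   /* main and f1 on core 1 (illustrative costs) */
      execute(1, 25);   /* f2 on core 2 */
      synchronise();    /* join at the end of the global tick */
      printf("reaction-time bound: %ld cycles\n", core_time[0]);
      return 0;
  }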

127 Timing Analysis 2. Scheduling overheads. – Synchronisation: fork/join and global tick, via global memory. – Thread context-switching: copying of shared variables at the start of the thread's local tick, via global memory. [Figure: synchronisation and context-switch costs marked on the two cores' time lines within a global tick.]

128 Timing Analysis 2. Scheduling overheads. – The required scheduling routines are statically known. – Analyse the scheduling control-flow. – Compute the execution time of each scheduling overhead. [Figure: alternative distributions of main, f1 and f2 across cores 1 and 2.]

129–131 Timing Analysis 3. Variable delay in accessing the shared bus. – Global memory is accessed by the scheduling routines. – The TDMA bus delay has to be considered. [Figure: the bus alternates slots between core 1 and core 2; an access issued outside the core's slot waits for the next one.]
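As an illustration (ours, assuming core c owns the c-th slot of every round), the waiting time a TDMA analysis has to add to each global-memory access can be computed like this:

  /* Delay before core `c` may start a global-memory access on a TDMA bus
     with `n` cores and `slot` cycles per slot, when the request is issued
     at absolute time `t`. A sketch; a real analysis must also ensure the
     access completes within the slot. */
  #include <stdio.h>

  static long tdma_delay(long t, int c, int n, long slot) {
      long round = (long)n * slot;
      long pos = t % round;              /* position within the current round */
      long start = (long)c * slot;       /* where c's slot begins */
      if (pos >= start && pos < start + slot)
          return 0;                      /* already inside our slot */
      return (start - pos + round) % round;  /* wait for the next slot */
  }

  int main(void) {
      /* 2 cores, 5 cycles/core (round = 10), as in the simulator. */
      for (long t = 0; t < 10; t++)
          printf("t=%ld: core 1 waits %ld cycles\n", t, tdma_delay(t, 1, 2, 5));
      return 0;
  }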

132 Timing Analysis CCFG optimisations: – merge: Reduces the number of CFG nodes that need to be traversed. – merge-b: Reduces the number of alternate paths in the CFG. (Reduces the number of global ticks) – Precision of the analysis is unaffected because we are not performing value analysis to prune infeasible paths.

133 Timing Analysis CCFG optimisations: – merge: reduces the number of CFG nodes that need to be traversed. – merge-b: reduces the number of alternate paths in the CFG (reduces the number of global ticks). [Figure: a CCFG fragment before and after merge and merge-b.]

134 Outline Introduction ForeC Language Timing Analysis Results Conclusions

135 Results For the proposed reachability-based timing analysis, we demonstrate: – the precision of the computed WCRT. – the efficiency of the analysis, in terms of analysis time.

136 Results Timing analysis tool: [Flow: program binary (annotated) → program CCFG (optimisations) → fine-grained reachability (proposed; takes the 3 factors into account) or coarse-grained reachability (Max-Plus) → WCRT.]

137 Results Multicore simulator (Xilinx MicroBlaze): – Based on http://www.jwhitham.org/c/smmu.html, extended to be cycle-accurate and to support multiple cores and a TDMA bus. [Figure: cores 0 to n, each with 16KB instruction memory and 32KB data memory (local access 1 cycle), connected by a TDMA shared bus to global memory (5-cycle access; 5 cycles/core, so a bus schedule round = 5 × number of cores).]

138 Results Benchmark programs with a mix of control/data computations, thread structure and computation load. Sources: [Pop et al 2011] A Stream-Computing Extension to OpenMP; [Nemer et al 2006] A Free Real-Time Benchmark.

139 Results Each benchmark program was distributed over a varying number of cores, up to the maximum number of parallel threads. Observed WCRT: test vectors to elicit different execution paths. Computed WCRT: the proposed approach and Max-Plus.

140 802.11a Results Observed: the WCRT decreases up to 5 cores; beyond that, global memory becomes increasingly expensive and scheduling overheads grow.

141 802.11a Results Proposed: ~2% over-estimation. The benefit of fine-grained reachability.

142 802.11a Results Max-Plus: loss of execution context. Uses only the thread WCRTs; assumes one global tick in which all threads execute their worst case; takes the max execution time of the scheduling routines.

143 802.11a Results Both approaches: the estimation of synchronisation cost is conservative; it is assumed that the receiver only starts after the last sender.

144–146 802.11a Results Max-Plus takes less than 2 seconds. Proposed: the merge optimisation reduces the state space by ~9.34×; merge-b reduces it by ~342×, bringing the analysis time to less than 7 seconds.

147 Results Reduction in states → reduction in analysis time. [Figure: number of global ticks explored.]

148 Results Proposed: ~1 to 8% over-estimation. Loss in precision mainly from over-estimating the synchronisation costs.

149 Results Max-Plus: over-estimation is very dependent on program structure. FmRadio and Life are very imprecise: loops iterate over par statement(s) multiple times, so over-estimations accumulate. Matrix is quite precise: it executes in one global tick, so the thread-WCRT assumption is valid.

150 Results Our tool generates a timed execution trace for the computed WCRT: – For each core: thread start/end times, context-switching, fork/join, ... – Can be used to tune the thread distribution. Was used to manually find good thread distributions for each benchmark program.

151 Outline Introduction ForeC Language Timing Analysis Results Conclusions

152 ForeC language for deterministic parallel programming of embedded multicores. Based on the synchronous framework, but amenable to parallel execution. Can achieve WCRT speedup while providing time-predictability. Very precise and fast timing analysis for parallel programs using reachability.

153 Future work Complete the formal semantics of ForeC. Automatic WCRT-aware scheduling. Cache hierarchy. Prune additional infeasible paths using value analysis.

154 Questions?

155 Design Patterns – Point-to-point – Broadcast – Software pipelining – Divide and conquer (scatter/gather, map/reduce)


159 Divide and Conquer Count the number of edges in an image. Sequential 1:

  input int[1024] image;
  int edges = 0;
  void main(void) {
    analyse(0, 1023);
  }
  void analyse(int start, int end) {
    while (1) {
      edges = 0;
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges++;
      }
      pause;
    }
  }

160 Divide and Conquer Parallel 1:

  input int[1024] image;
  shared int edges = 0 combine with plus;
  void main(void) {
    par( analyse(0, 511), analyse(512, 1023) );
  }
  void analyse(int start, int end) {
    while (1) {
      edges = 0;
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges++;
      }
      pause;
    }
  }

161 Divide and Conquer Keep a running total of the number of edges in an image (so edges is no longer reset each tick). For the parallel version, it is not as easy as this. Sequential 2:

  input int[1024] image;
  int edges = 0;
  void main(void) {
    analyse(0, 1023);
  }
  void analyse(int start, int end) {
    while (1) {
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges++;
      }
      pause;
    }
  }

162–167 Divide and Conquer Parallel 2 (naive running total):

  input int[1024] image;
  shared int edges = 0 combine with plus;
  void main(void) {
    par( analyse(0, 511), analyse(512, 1023) );
  }
  void analyse(int start, int end) {
    while (1) {
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges++;
      }
      pause;
    }
  }

Trace (assuming the threads find 1 and 2 edges per tick, so the expected running total after two ticks is (1+2) + (1+2) = 6):
– Tick 1: both copies start at 0; they end at 1 and 2; combined with plus: edges = 3.
– Tick 2: both copies start at the new global value 3; they end at 4 and 5; combined with plus: edges = 9, not 6.
The tick-1 total of 3 is copied into both threads and counted twice. We should track the running total separately from the number of new edges.

168–172 Divide and Conquer Parallel 3 (running total tracked separately from the new edges):

  input int[1024] image;
  typedef struct { int total; int new; } Edges;
  shared Edges edges = { .total = 0, .new = 0 } combine with accum;
  Edges accum(Edges copy1, Edges copy2) {
    copy1.total = copy1.total + copy1.new + copy2.new;
    copy1.new = 0;
    return copy1;
  }
  void main(void) {
    par( analyse(0, 511), analyse(512, 1023) );
  }
  void analyse(int start, int end) {
    while (1) {
      edges.new = 0;
      for (i = start; i <= end; ++i) {   /* end is inclusive */
        ... image[i] ...;
        edges.new++;
      }
      pause;
    }
  }

After two ticks, edges.total = (1+2) + (1+2) = 6, as intended.

173 Introduction Existing parallel programming solutions. – Shared memory model: OpenMP, Pthreads, Intel Cilk Plus, Thread Building Blocks, Unified Parallel C, ParC, X10. – Message passing model: MPI, SHIM. – These provide ways to manage shared resources but do not prevent concurrency errors. [OpenMP] http://openmp.org [Pthreads] https://computing.llnl.gov/tutorials/pthreads/ [X10] http://x10-lang.org/ [Intel Cilk Plus] http://software.intel.com/en-us/intel-cilk-plus [Intel Thread Building Blocks] http://threadingbuildingblocks.org/ [Unified Parallel C] http://upc.lbl.gov/ [Ben-Asher et al] ParC – An Extension of C for Shared Memory Parallel Processing. [MPI] http://www.mcs.anl.gov/research/projects/mpi/ [SHIM] SHIM: A Language for Hardware/Software Integration.

174 Introduction Deterministic runtime support. – Pthreads: dOS, Grace, Kendo, CoreDet, Dthreads. – OpenMP: Deterministic OMP. – Concept of logical time: each logical time step is broken into an execution phase and a communication phase. [Bergan et al 2010] Deterministic Process Groups in dOS. [Olszewski et al 2009] Kendo: Efficient Deterministic Multithreading in Software. [Bergan et al 2010] CoreDet: A Compiler and Runtime System for Deterministic Multithreaded Execution. [Liu et al 2011] Dthreads: Efficient Deterministic Multithreading. [Aviram 2012] Deterministic OpenMP.

175 ForeC Language Behaviour of shared variables is similar to: – Intel Cilk+ (Reducers) – Unified Parallel C (Collectives) – DOMP (Workspace consistency) – Grace (Copy-on-write) – Dthreads (Copy-on-write)

