Time-Predictable Execution of Embedded Software on Multi-core Platforms. Sudipta Chattopadhyay, under the guidance of A/P Abhik Roychoudhury
Embedded Systems
Real-time Constraints: [Figure: an embedded system is either hard real-time or soft real-time.]
Timing Analysis: Hard real-time systems require absolute timing guarantees. Timing analysis has two levels: system-level analysis and single-task analysis. Worst-case execution time (WCET) analysis computes an upper bound on execution time for all possible inputs. A sound over-approximation is obtained by static analysis.
WCET Analysis: [Figure: analysis flow. Program; control flow graph; micro-architectural modeling (WCET of basic blocks); constraints (infeasible path constraints, loop bound); path analysis; WCET bound.]
Architecture: [Figure: Core 1 ... Core n, L1 cache, shared L2 cache, memory, shared bus. The shared cache and shared bus are the points of resource sharing.]
Overview: Dissertation work (time-predictable execution in multi-core) covers unified cache; shared cache + shared bus; a multi-core WCET tool; cache-related preemption delay analysis; coherence miss modeling; shared scratchpad allocation. [Figure: Core 1 ... Core n, L1 cache, shared L2 cache, memory, shared bus; resource sharing.] [Figure: unified cache. The processor issues instruction accesses to the L1 instruction cache and data accesses to the L1 data cache, backed by an L2 unified cache, a bus, and main memory. The unified cache sees conflicts between different instruction and data memory blocks.]
Micro-architectural Modeling: Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus.
Comparison:
Work | Micro-arch. level technique | Program level technique | Precision | Scalability
Classical abstract interpretation (AI) | AI | | × | √
Classical model checking (MC) | MC | | √ | ×
RTS'00 (aiT, Chronos) | AI | Integer linear programming | Can be improved | √
RTSS'10 | AI | MC | Can be improved | _
Our approach | (AI+MC) | Integer linear programming | > RTS'00 | = RTS'00
Our approach | (AI+MC) | MC | > RTSS'10 | = RTSS'10
Imprecision in Abstract Interpretation: [Figure: paths p1 and p2 reach a join point with cache state C1 and cache state C2; the joined cache state is C3. The abstract cache set along one path holds a and b, along the other b and x; the joined abstract cache set holds only b.] Path p1 or path p2? The joined cache state loses information about paths p1 and p2.
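The information loss at the join can be illustrated with a minimal sketch of a must-analysis join for one abstract LRU cache set. The representation (a dict from memory block to an upper bound on its age) and the concrete ages are assumptions made for illustration; the slide only shows the blocks a, b, and x.

```python
def join_must(state1, state2):
    """Join two abstract must-cache states.

    Each state maps a memory block to an upper bound on its LRU age.
    A block survives the join only if it is present in both states,
    and it keeps the larger (more pessimistic) age bound.
    """
    return {
        block: max(age, state2[block])
        for block, age in state1.items()
        if block in state2
    }

# Hypothetical ages: 0 is the youngest position in the set.
c1 = {"a": 0, "b": 1}  # cache state along path p1
c2 = {"b": 0, "x": 1}  # cache state along path p2
c3 = join_must(c1, c2)
print(c3)  # {'b': 1}
```

After the join, a and x are no longer guaranteed to be cached, and b is assumed to be as old as in the worse of the two paths, regardless of which path was actually taken.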
Model Checking Alone?: Model checking performs a path-sensitive search. Path-sensitive search is expensive because of path explosion. Worse, every path is combined with the possible cache states, which leads to state explosion. [Figure: paths p1 and p2 with cache states C1 and C2; abstract LRU cache sets over the blocks a, b, and x are tracked separately for each path.]
Cache Analysis: [Figure: analysis flow. Program; micro-architectural modeling (pipeline analysis, branch predictor modeling, cache analysis by abstract interpretation); WCET of basic blocks; constraints (infeasible path constraints, loop bound); path analysis by IPET. The outcome of the cache analysis is refined by a model checker until all outcomes are checked or a timeout occurs.] Refinement by the model checker can be terminated at any point. The model checker refinement steps are inherently parallel. Each refinement step checks a lightweight assertion property.
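Refinement (Inter-core): [Figure: a task accesses memory block m twice between start and exit. A conflicting task accesses m1 under the condition x < y and m2 under the condition x == y. The abstract analysis classifies the second access to m as a cache miss (m is no longer in the cache), but the path that executes both m1 and m2 is infeasible, so this miss is spurious and the access is a cache hit.]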
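Refinement (Inter-core): [Figure: the same task and conflicting task. Each access in the conflicting task that conflicts with m (m1 under x < y, m2 under x == y) is instrumented with C_m++ to increment the conflict count. The path through both accesses is infeasible, so the property assert (C_m <= 1) is verified and the access to m is a cache hit.]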
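Refinement (Why It Works?): [Figure: on path 2, the access to m' is a conflict to m and leads to a cache miss for m. Each conflict is instrumented with C_m++ and the checked property is assert (C_m <= 0). The branches on x < y and x == y that do not access a conflicting block do not affect the value of C_m.]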
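The refinement check on these slides can be sketched as follows. This is an illustration only: exhaustive enumeration over a small input range stands in for the model checker, and the function names and the input domain are assumptions, not part of the original tool flow.

```python
from itertools import product

def conflicts_on_path(x, y):
    """Instrumented conflicting task: count accesses that conflict with m."""
    c_m = 0
    if x < y:
        c_m += 1  # access to m1 conflicts with m (C_m++)
    if x == y:
        c_m += 1  # access to m2 conflicts with m (C_m++)
    return c_m

def assertion_holds(bound, domain=range(-4, 5)):
    """Check assert (C_m <= bound) on every input in a small, assumed domain."""
    return all(
        conflicts_on_path(x, y) <= bound
        for x, y in product(domain, repeat=2)
    )

print(assertion_holds(1))  # True
print(assertion_holds(0))  # False
```

The bound 1 holds because x < y and x == y can never be true together, so no feasible path increments C_m twice. The bound 0 fails because a single conflict is feasible.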
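Experimental Setup (Chronos Toolkit): [Figure: tool flow. C source is compiled by GCC (SimpleScalar) to binary code, from which the CFG is built. Micro-architectural modeling (cache, pipeline, branch prediction) produces micro-architectural constraints; together with flow constraints they form an ILP whose solution is the WCET. CBMC (C bounded model checking) is used for refinement.]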
Experimental Result
Experimental Result: [Figure: WCET results for the L1 cache and the shared L2 cache. Cache configurations shown: 4-way associative, 8 KB; direct-mapped, 256 bytes. Tasks: cnt, jfdctint, edn, fir, fdct, ndes.] Average time = 70 secs.
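Extension Using Symbolic Execution: [Figure: the conflicting task accesses m1 under x < y and m2 under x == y; each access is instrumented with C_m++ and the property is assert (C_m <= 1). Symbolic execution forks at x < y into the branches x < y and x ≥ y. At x == y, on the x < y branch, the constraint solver is queried with x < y ˄ x = y. If the answer is NO, assert (C_m <= 1) is satisfied; if the answer is unknown, the check aborts.]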
Extension Using KLEE: [Figure: the Chronos tool flow as before (C source, GCC for SimpleScalar, binary code, CFG, micro-architectural modeling of cache, pipeline, and branch prediction, micro-architectural constraints, flow constraints, ILP, WCET), with CBMC/KLEE as the refinement back end.]
A Generic Framework: Three different architectural/application settings, each with its own source of cache conflict. Intra-task (WCET in single core). Inter-task (cache-related preemption delay analysis): a high-priority and a low-priority task share a cache. Inter-core (WCET in multi-core): a task in Core 1 and a task in Core 2 have private L1 caches and share the L2 cache.
Micro-architectural Modeling: Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus.
Task-level Interference: [Figure: tasks T1, T2, T3 mapped to Core 1 ... Core n with L1 caches, a shared L2 cache, and a shared bus; a timeline of T1, T2, T3; the resulting task interference graph.]
Shared Cache + TDMA Shared Bus: [Figure: task graphs with T1, T2, T3, T4 mapped to Core 1 and Core 2, each with an L1 cache, sharing an L2 cache and a bus. Bus access follows Time Division Multiple Access (TDMA) with alternating Core 1 and Core 2 slots. The timeline shows an L2 miss due to T2, tasks with disjoint lifetimes, and T4 waiting (WAIT) for its bus slot.]
Overview of the Framework: [Figure: for each core, L1 cache analysis, a filter, and L2 cache analysis. The results feed L2 conflict analysis, starting from an initial interference, followed by bus-aware analysis and WCRT computation. If the interference changes (Yes), the analysis repeats; otherwise (No) the estimated WCRT is reported.] Task interference monotonically decreases.
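The iteration in this framework can be sketched as a fixed-point loop. Everything concrete below is a toy assumption made for illustration: the task start times, the base execution time, the per-interferer penalty, and the lifetime-overlap test stand in for the actual cache, bus, and WCRT analyses.

```python
def iterate_until_stable(initial, analyze, may_interfere):
    """Repeat the analysis until the interference set stops changing.

    Pairs are only ever removed, so the set shrinks monotonically
    and the loop terminates.
    """
    interference = frozenset(initial)
    while True:
        wcrt = analyze(interference)
        refined = frozenset(p for p in interference if may_interfere(p, wcrt))
        if refined == interference:
            return wcrt, interference
        interference = refined

# Toy model (hypothetical numbers).
START = {"T1": 0, "T2": 12, "T3": 30}  # earliest start time of each task
BASE = 10                              # execution time without interference
PENALTY = 5                            # extra time per interfering task

def analyze(interference):
    """Toy WCRT: base time plus a fixed penalty per interfering pair."""
    return {
        t: BASE + PENALTY * sum(t in pair for pair in interference)
        for t in START
    }

def lifetimes_overlap(pair, wcrt):
    """Two tasks may interfere only if their lifetimes overlap."""
    a, b = pair
    return START[a] < START[b] + wcrt[b] and START[b] < START[a] + wcrt[a]

initial = {("T1", "T2"), ("T1", "T3"), ("T2", "T3")}
wcrt, final = iterate_until_stable(initial, analyze, lifetimes_overlap)
print(wcrt)   # {'T1': 15, 'T2': 20, 'T3': 15}
print(sorted(final))  # [('T1', 'T2'), ('T2', 'T3')]
```

In this toy run the first pass removes the pair (T1, T3) because their lifetimes turn out to be disjoint, which lowers the WCRT of both tasks; the second pass changes nothing, so the loop stops.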
Evaluation (2-core): One core runs statemate; the other core runs the program under evaluation.
Evaluation (4-core): The 4 cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdctint, statemate), one program per core.
Micro-architectural Modeling: Single core: pipeline, cache, branch predictor. Multi-core: shared cache, shared bus. Next: the interactions between these components.
Timing Anomaly (Shared Cache): [Figure: alternative sequences of cache hits and misses.] The path with the local cache miss may not be the worst-case path.
Baseline Abstraction (Timing Interval): Each pipeline stage (IF, ID, EX, WB, CM) is represented as a timing interval. [Figure: the instructions R1 := R2 + 5, R5 := R1 * R7, R3 := R5 * 5 with structural dependency and contention edges between their stages.] A fixed-point analysis derives the timing of each stage as an interval. Example: start = [3,7], latency = [1,3], finish = [4,10], since End = Start + cache miss latency interval.
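The interval arithmetic behind the example on this slide can be written as a short sketch. The tuple representation and the function name are assumptions for illustration.

```python
def add_intervals(start, latency):
    """Finish interval = start interval + latency interval (bounds add pairwise)."""
    return (start[0] + latency[0], start[1] + latency[1])

finish = add_intervals((3, 7), (1, 3))
print(finish)  # (4, 10)
```

The earliest finish combines the earliest start with the smallest latency, and the latest finish combines the latest start with the largest latency.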
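TDMA Shared Bus Analysis: Time Division Multiple Access (TDMA) with offset abstraction. [Figure: bus rounds made of alternating Core 0 and Core 1 slots. A request T from core 1 arrives at some offset within the round and incurs a delay until the core 1 slot; a request T' from core 0 arrives inside the core 0 slot, so delay = 0.]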
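A minimal sketch of the delay computation under the offset abstraction follows. It assumes equal-length slots assigned to cores in index order and ignores the transfer time of the access itself; both simplifications, along with the function name and parameters, are assumptions rather than details from the slides.

```python
def tdma_delay(offset, core, n_cores, slot_len):
    """Waiting time before `core` may use the bus.

    `offset` is the arrival time of the request; only its position
    within the TDMA round matters.
    """
    round_len = n_cores * slot_len
    offset %= round_len
    slot_start = core * slot_len
    if slot_start <= offset < slot_start + slot_len:
        return 0  # the request arrives inside the core's own slot
    # Otherwise wait for the next start of the core's slot.
    return (slot_start - offset) % round_len

print(tdma_delay(3, core=0, n_cores=2, slot_len=10))   # 0
print(tdma_delay(3, core=1, n_cores=2, slot_len=10))   # 7
print(tdma_delay(15, core=0, n_cores=2, slot_len=10))  # 5
```

Because the delay depends only on the offset within the round, an analysis can track offsets instead of absolute time.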
Loop Construct: How do we define bus context? [Figure: pipeline stages (IF, ID, EX, WB, CM) of the previous iteration and the current iteration, linked by cross-iteration edges.] Property: If the bus offsets of the cross-iteration edges do not change, the WCET of the loop iteration cannot change.
Loop Construct: Bus context flow graph, where C_i = bus context of the loop body at the i-th iteration. [Figure: contexts C_1, C_2, C_3, C_4; C_5 coincides with C_3.] Property: If C_i = C_j, then C_{i+k} = C_{j+k} for any k > 0.
Loop Construct: Compute the WCET for each bus context in the bus context flow graph (C_1, C_2, C_3, C_4). E(C_1) = number of times context C_1 is executed. Generate linear constraints: E(C_1) + E(C_2) + E(C_3) + E(C_4) ≤ loop bound; E(C_1) ≥ E(C_2). [Figure: WCET analysis flow. Program; control flow graph; micro-architectural modeling (WCET of basic blocks); constraints (infeasible path constraints, loop bound); path analysis by an ILP solver.] ILP = Integer Linear Programming.
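Branch Prediction + Cache: [Figure: from the branch location, up to the maximum number of speculated instructions may be fetched; an access to m' on the speculated path causes a cache conflict with m. The cache content with and without speculation is merged by JOIN, which leaves the later access to m as an unclear cache access.]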
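Experimental Setup (Chronos Toolkit): [Figure: tool flow. C source is compiled by GCC (SimpleScalar) to binary code, from which the CFG is built. Micro-architectural modeling (private cache, pipeline, branch prediction, shared cache, shared bus) produces micro-architectural constraints; together with flow constraints they form an ILP whose solution is the WCET.]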
Evaluation (Cache + Pipeline): [Figure: results for jfdctint and statemate, showing the imprecision of shared cache analysis; the shared cache is either vertically partitioned or horizontally partitioned between Core 1 and Core 2.]
Evaluation (Cache + Pipeline + Speculation): [Figure: imprecision of modeling speculation.]
Evaluation (Bus + Pipeline): [Figure: imprecision of shared bus analysis; imprecision of path analysis.]
Recap: Dissertation work (time-predictable execution in multi-core) covers unified cache; shared cache + shared bus; a multi-core WCET tool; cache-related preemption delay analysis; coherence miss modeling; shared scratchpad allocation. [Figure: coherence miss modeling. Core 1 ... Core n with L1 data caches, a shared L2 cache, memory, and a shared bus; coherence miss traffic and stale data items.] [Figure: cache-related preemption delay. A high-priority task and a low-priority task cause cache conflict.] [Figure: shared scratchpad allocation. PE-0, PE-1, ..., PE-N with SPM-0, SPM-1, ..., SPM-N connected by fast on-chip communication media; an external memory interface and a shared off-chip data bus lead to off-chip memory.]
Perspective: From time-predictable execution in single-core to time-predictable execution in multi-core, through testing and static analysis. New challenges: resource sharing (cache and bus) and data sharing (cache coherence). Shared cache, shared bus, and cache coherence: ARM Cortex A9 MPCore, Samsung Exynos, Nvidia Tegra II (smart phones). Customized hardware: Time Division Multiple Access, Aethreal network-on-chip. Shared scratchpad: Sony PSP, IBM Cell.
Perspective: [Figure: abstraction refinement for functionality verification, as in SLAM (Microsoft), BLAST (UC Berkeley), MAGIC (CMU). A property and an abstraction of the concrete domain go to a verifier; the result is either verified or a spurious counterexample that drives abstraction refinement.] [Figure: quantitative verification. The abstract domain in abstract interpretation (AI) abstracts the concrete domain and generates a quantitative property that may be spurious; path-sensitive verification refines it.] Anytime verification of quantitative properties.
Future Work: Static performance analysis + testing: performance testing by symbolic execution, with a quantitative property (e.g. cache conflict) as input. Energy analysis of software and energy-aware software testing, motivated by battery life in mobile devices. [Figure: symbolic execution of the conflicting task with branches x < y and x ≥ y, accesses m1 and m2, the path condition x < y ˄ x ≠ y, and assert (C_m <= 1) leading to abort.]
Thank You: My sincere thanks to all the Examiners and especially the anonymous Examiner 1 for his comment on symbolic execution.