FAST: Frequency-Aware Static Timing Analysis. By Kiran Seth, Aravindh Anantaraman, Frank Mueller and Eric Rotenberg. Center for Embedded Systems Research, Departments of CS & ECE, North Carolina State University
Real-Time Systems: Tasks have a deadline and must terminate on time. Classification: hard real-time (a missed deadline is catastrophic); soft real-time (a missed deadline lowers QoS). Multi-tasking real-time systems require scheduling algorithms: the scheduler arbitrates among tasks online; a schedulability test (static) ensures that deadlines are met and requires a known Worst-Case Execution Time (WCET).
Static Timing Analysis: To schedule tasks in real-time systems, the Worst-Case Execution Time (WCET) and Worst-Case Execution Cycles (WCEC) are needed. Experimentally measured WCET gives unsafe bounds, due to input & hardware complexity. Use a static timing analysis toolset to obtain safe WCET bounds.
Static Instruction Cache Analysis: Work explained in [Mueller RTS-J’00]. Interprocedural data-flow analysis predicts each cache reference as one of: always-hit, always-miss, first-hit, first-miss. Each instruction is categorized for each loop level and function (a function is treated as a loop w/ 1 iteration).
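The four categories can be turned into a worst-case miss count per loop. The sketch below assumes the usual reading of the categories (first-miss: only the first reference misses; first-hit: only the first reference hits); the names `Category` and `worst_case_misses` are illustrative and not taken from the actual toolset.

```python
from enum import Enum


class Category(Enum):
    ALWAYS_HIT = "always-hit"
    ALWAYS_MISS = "always-miss"
    FIRST_HIT = "first-hit"
    FIRST_MISS = "first-miss"


def worst_case_misses(category: Category, iterations: int) -> int:
    """Worst-case misses of one instruction over `iterations` loop iterations.

    Assumed reading of the categories (not quoted from the slides):
      always-hit  -> never misses
      always-miss -> misses on every iteration
      first-miss  -> misses once, then hits
      first-hit   -> hits once, then misses
    """
    if iterations < 1:
        raise ValueError("a loop level has at least one iteration")
    if category is Category.ALWAYS_HIT:
        return 0
    if category is Category.ALWAYS_MISS:
        return iterations
    if category is Category.FIRST_MISS:
        return 1
    return iterations - 1  # Category.FIRST_HIT


if __name__ == "__main__":
    for cat in Category:
        print(cat.value, worst_case_misses(cat, 10))
```

A function body is handled by the same helper with `iterations=1`, matching the "loop w/ 1 iteration" convention.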
Static Data Cache Simulation: Accurate static timing analysis needs data cache analysis. Currently, the data cache analysis tool is not accurate enough: too many restrictions, not general enough for real code. Improvements by [Vera RTSS’03]. Candidate solutions: treat all data accesses as hits (WCET highly underestimated); treat all data accesses as misses (WCET highly overestimated); assume a cache big enough to fit the whole data set and treat first-time accesses as misses (cold misses only), all others as hits. Accurate? Yes. But what if caches are smaller? No significant impact on this study.
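Under the "big enough cache" assumption, the data-miss count reduces to counting distinct cache lines touched. A minimal sketch follows; the function name, the address list and the 32-byte line size are hypothetical and chosen only for illustration.

```python
def cold_misses(addresses, line_size=32):
    """Count cold misses for a sequence of data addresses.

    Assumes a cache large enough to hold the whole data set, so only the
    first access to each cache line misses and every later access hits.
    `line_size` (bytes) is a hypothetical parameter, not from the slides.
    """
    touched_lines = {addr // line_size for addr in addresses}
    return len(touched_lines)


if __name__ == "__main__":
    trace = [0, 4, 8, 32, 36, 64, 0]  # hypothetical byte addresses
    print(cold_misses(trace))
```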
Static Timing Analyzer: Path- & tree-based approach [Healy IEEE TC’99]. Find nodes in the CFG and derive the WCEC for each node; a node is a function or loop. WCET is calculated bottom-up. Standard timing analysis assumptions apply: no recursion; all loop bounds must be known; no function pointers.
Motivation of FAST: Dynamic Voltage Scaling (DVS) scheduling schemes change the frequency/voltage of the system to save power without missing deadlines. Several DVS scheduling schemes are available. Good fit for real-time systems: most real-time systems have low utilization and are low-power embedded systems. Potential for considerable energy savings with DVS.
Problem: Current DVS schemes ignore the effects of frequency scaling on WCEC. They assume WCEC is constant with frequency and therefore overestimate WCET at lower frequencies. To demonstrate the problem: obtain the WCET of a C-Lab benchmark with the static timing analysis tool for frequencies from 100 MHz to 1 GHz; compare the observed WCEC & WCET against the assumption made by DVS schemes.
Actual vs. Assumed WCEC for FFT: WCEC changes with frequency modulation. WCEC increases with higher frequency. Constant memory latency: 100 ns.
Actual vs. Assumed WCET for FFT: Difference in chosen frequency for DVS w/ WCET = 5 ms: assumed ~550 MHz, actual ~150 MHz.
Parametric Frequency Model. Problem: DVS considers processor frequency scaling but ignores the effect of frequency scaling on memory accesses. With frequency scaling, the cycle count of processor operations remains constant, except for memory operations. Resulting problem: DVS schemes overestimate the WCET at lower frequencies, cannot fully utilize the available slack, and largely waste the power savings potential.
Parametric Frequency Model. Solution: Calculate WCEC accounting for the effects of memory accesses using the new parametric frequency model. Model: $\mathrm{WCEC}(f) = i + mN = i + mLf$, where $i$ is the invariant number of worst-case cycles (for non-memory operations), $m$ is the number of worst-case memory accesses, and $N$ is the number of cycles per memory access, which depends on memory latency $L$ and frequency $f$: $N = Lf$.
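A minimal sketch of the model and of how it changes the frequency a DVS scheme would pick for a given deadline. The function names are illustrative. The values of `i` and `m` are hypothetical round numbers, not measured FFT data; they were picked so that the results land near the ~550 MHz and ~150 MHz figures quoted for the FFT example. The 100 ns latency and 1 GHz top frequency are the ones stated in the slides.

```python
def wcec(i, m, latency_s, freq_hz):
    """WCEC(f) = i + m*N with N = L*f cycles per memory access."""
    return i + m * latency_s * freq_hz


def wcet(i, m, latency_s, freq_hz):
    """WCET(f) = WCEC(f) / f = i/f + m*L, in seconds."""
    return wcec(i, m, latency_s, freq_hz) / freq_hz


def min_freq_constant_wcec(i, m, latency_s, f_max_hz, deadline_s):
    """Frequency chosen when WCEC is assumed constant (taken at f_max)."""
    return wcec(i, m, latency_s, f_max_hz) / deadline_s


def min_freq_parametric(i, m, latency_s, deadline_s):
    """Lowest f with i/f + m*L <= deadline, i.e. f >= i / (deadline - m*L)."""
    remaining = deadline_s - m * latency_s
    if remaining <= 0:
        raise ValueError("memory time alone exceeds the deadline")
    return i / remaining


if __name__ == "__main__":
    I, M = 400_000, 23_500          # hypothetical task parameters
    L, F_MAX, D = 100e-9, 1e9, 5e-3  # 100 ns, 1 GHz, 5 ms
    print(min_freq_constant_wcec(I, M, L, F_MAX, D) / 1e6, "MHz (assumed)")
    print(min_freq_parametric(I, M, L, D) / 1e6, "MHz (parametric)")
```

With these inputs the constant-WCEC assumption asks for about 550 MHz, while the parametric model allows about 151 MHz, because the memory time `m*L` does not grow when the clock slows down.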
Using the Parametric Frequency Model: Instruction sequence: A: add R2, R1, R3; B: load R4, [M1]; C: add R2, R1, R4; D: add R2, R1, R5. The sequence is simulated through a simple pipeline to explain the parametric frequency model. Simple pipeline: 6 stages, data & instruction cache, N = 10.
Example 0: Cache Hits. Recall: B is a load instruction. WCEC = 9 + 0N. Each row represents a pipeline stage. Time (and cycle count) increases horizontally.
Example 1: Effect of I-cache Miss. WCEC = 9 + 1N. The stall due to the I-cache miss is shown. The model accurately captures memory latency, however long.
Example 2: Effect of D-cache Miss. Recall: B is a load instruction. WCEC = 9 + 1N. The stall due to the D-cache miss is shown. Again, the model captures memory latency, however long. Notice: during stall cycles, no useful work is done.
Example 3: Effect of I- & D-cache Miss. WCEC = 9 + 2N. I-cache miss first, then D-cache miss. Overlap between useful cycles & stall cycles; also high-latency execution operations (e.g. floating-point, multiply, …) can overlap w/ a D-cache miss. This leads to overestimation, which is rare in practice, and the WCET is still safe.
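The four pipeline examples can be checked with a few lines of arithmetic, using the 9 base cycles and N = 10 from the slides. The helper name `example_wcec` is illustrative.

```python
BASE_CYCLES = 9         # invariant cycles of the 4-instruction sequence
CYCLES_PER_ACCESS = 10  # N = 10, as in the pipeline examples


def example_wcec(misses, n=CYCLES_PER_ACCESS):
    """WCEC = 9 + misses * N for the example instruction sequence."""
    return BASE_CYCLES + misses * n


EXAMPLES = [
    ("Example 0: cache hits", 0),
    ("Example 1: I-cache miss", 1),
    ("Example 2: D-cache miss", 1),
    ("Example 3: I- & D-cache miss", 2),
]

if __name__ == "__main__":
    for label, misses in EXAMPLES:
        print(f"{label}: WCEC = 9 + {misses}N = {example_wcec(misses)}")
```

This prints 9, 19, 19 and 29 cycles. Since N = Lf, a value of N = 10 corresponds, for instance, to a 100 ns memory latency at 100 MHz.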
Experimental Validation: Combine the frequency model with our static timing analyzer to form the FAST tool, which expresses WCEC as FAST equations. Experiment to validate results from the FAST tool: run benchmarks through the FAST tool, obtaining an equation representing the WCEC of each benchmark; run the same benchmarks through the traditional timing analysis tool; vary frequencies from 100 MHz to 1 GHz.
Frequency-Aware Static Timing Analysis (FAST): The FAST tool is “as accurate” as traditional static timing analysis. Slight overestimation in the case of floating-point benchmarks.
FAST in EDF Scheduling with DVS. DVS with EDF: $\sum_k C_k/P_k \le \alpha$, where $\alpha = f_c/f_m$. FAST with EDF: utilization $\sum_k (i_k + m_k L f_m)/(P_k f_m)$. Schedulability test: $\left(\sum_k i_k/P_k\right)/f_m \le 1 - L \sum_k m_k/P_k$. Implemented the frequency model for 3 EDF-DVS algorithms by [Pillai & Shin]. Look-ahead improved: at completion, consider the next deadline. Up to 34% additional energy savings (5-11% on avg.) at low utilization, but 0.5-8% less savings at high utilization.
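The schedulability test can be sketched as a single pass over the task set. The sketch assumes the test means "FAST utilization at the chosen frequency is at most 1"; the `Task` class, the function names, the task parameters and the list of available frequencies are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    i: float         # invariant worst-case cycles
    m: float         # worst-case memory accesses
    period_s: float  # period P_k in seconds


def fast_utilization(tasks, latency_s, freq_hz):
    """Sum over k of (i_k + m_k*L*f) / (P_k*f)."""
    return sum((t.i + t.m * latency_s * freq_hz) / (t.period_s * freq_hz)
               for t in tasks)


def fast_min_frequency(tasks, latency_s):
    """Smallest f with sum(i_k/P_k)/f <= 1 - L*sum(m_k/P_k); one O(n) pass."""
    cpu_demand = sum(t.i / t.period_s for t in tasks)
    mem_util = latency_s * sum(t.m / t.period_s for t in tasks)
    if mem_util >= 1.0:
        raise ValueError("memory time alone makes the task set infeasible")
    return cpu_demand / (1.0 - mem_util)


def lowest_feasible(tasks, latency_s, freqs_hz):
    """Lowest available frequency that passes the test, or None."""
    f_min = fast_min_frequency(tasks, latency_s)
    candidates = [f for f in sorted(freqs_hz) if f >= f_min]
    return candidates[0] if candidates else None


if __name__ == "__main__":
    tasks = [Task(400_000, 23_500, 10e-3), Task(200_000, 5_000, 20e-3)]
    freqs = [k * 100e6 for k in range(1, 11)]  # 100 MHz .. 1 GHz
    print(lowest_feasible(tasks, 100e-9, freqs) / 1e6, "MHz")
```

Both sums are accumulated in one loop over the tasks, so the modified test stays linear in the number of tasks.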
Improving DVS Schemes: Use the parametric frequency model to improve DVS schemes: it provides accurate WCET and improved energy savings. Architectural simulator: SimpleScalar + Wattch [Brooks ISCA’00]; 6-stage simple in-order pipeline processor model; I-cache and D-cache (8 KB each); run 4-8 tasks simultaneously (the scheduler runs as its own task). More accurate than the E ~ V²f model? Results are newer than the paper.
Static RT-DVS vs. FAST Static RT-DVS: Base case: EDF, tasks at 1 GHz, idle at 100 MHz (no sleep mode), small task periods. Task sets: 1: integer, 2: float, 3: mix. High: 0.9 utilization; low: 0.5. The static scheme is better than base EDF: 12-60% energy savings. FAST-Static is even better: 40-78% savings at high + lower utilization.
Cycle-conserving RT-DVS vs. FAST Cycle-conserving RT-DVS: Dynamic scheduling: early completion is reclaimed as slack. Cycle-conserving: 57-72% energy savings. FAST: 71-80% savings.
Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS: Most aggressive DVS: early completion + max. deferral. Look-ahead: slightly higher savings than cycle-conserving, at 68-80%. FAST: slightly better in most cases, at 72-83%.
Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS with the E ~ V²f model: Higher savings: up to 96%? The ratio look-ahead / FAST is similar. The detailed Wattch power model is probably more accurate.
Conclusion: Energy savings in real-time systems can be significantly improved by considering the effects of frequency scaling on WCET. FAST + Static RT-DVS is as good as Look-ahead RT-DVS, with less overhead. The parameterized frequency model can easily track the effects of frequency scaling on WCET. The FAST tool works best with: many cache misses (if D-cache analysis is highly inaccurate, which is usually true, FAST can make up for it); high memory latency; insufficient dynamic slack reclaiming (during DVS scheduling). Integrated into real-time hardware support [VISA ISCA’03].
BACKUP SLIDES
The V²f model
Old DVS Scheduling Simulator: Event-based simulator of the scheduler. Has to assume a miss rate for the tasks in dynamic schemes. Uses the E ~ V²f energy model. Gives a good idea about savings, but is it accurate?
Static RT-DVS vs. FAST Static RT-DVS
Cycle-conserving RT-DVS vs. FAST cycle-conserving RT-DVS
Look-ahead RT-DVS vs. FAST Look-ahead RT-DVS
DVS schemes (Pillai & Shin) Static RT-DVS – Uses static slack available in the schedule. Cycle-conserving RT-DVS – Uses static slack + dynamic slack due to early completion. Look-ahead RT-DVS – Uses static slack + dynamic slack due to early completion + latest possible scheduling (look-ahead).
Complexity: Original EDF test: O(n). Modified EDF test: still O(n).