Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures

Arijit Biswas, Paul Racunas, Shubu Mukherjee (FACT Group, DEG, Intel)
Joel Emer (VSSAD, Intel)
Razvan Cheveresan (Sun Microsystems; Intern, FACT Group)
Ram Rangan (Princeton University; Intern, FACT Group)

FACT Group, Intel

Moore’s Law Graph
Soft errors are a serious problem
–Assuming a certain error rate, the failure rate of the whole chip increases 12x
[Figure label: GAP. Chart based on 200,000 latches as used in the Fujitsu SPARC Processor (2003)]

All bits are not created equal!
[Figure: a stored bit (1, 0). Particle strike causes bit flip!]

All bits are not created equal!
Particle strike causes bit flip!
Bit read?
–no: benign fault, no error
–yes: bit has error protection?
  –no protection: does bit matter?
    –yes: Silent Data Corruption
    –no: benign fault, no error
  –detection only: does bit matter?
    –yes: True Detected Unrecoverable Error
    –no: False Detected Unrecoverable Error
  –detection & correction: benign fault, no error
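
The decision tree on this slide can be sketched as a small function. This is only an illustration of the flow as reconstructed from the slide text; the names `Protection` and `classify_fault` are hypothetical and not part of the original work.

```python
from enum import Enum


class Protection(Enum):
    """Protection levels named on the slide (enum itself is illustrative)."""
    NONE = "no protection"
    DETECT_ONLY = "detection only"
    DETECT_AND_CORRECT = "detection & correction"


def classify_fault(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Walk the slide's decision tree for a single bit flip."""
    if not bit_read:
        # A flipped bit that is never read cannot affect anything.
        return "benign fault, no error"
    if protection is Protection.DETECT_AND_CORRECT:
        # The fault is detected and repaired before it is used.
        return "benign fault, no error"
    if protection is Protection.DETECT_ONLY:
        # Detection always raises an error, even if the bit did not matter.
        return "true DUE" if bit_matters else "false DUE"
    # Unprotected bit: the flip is silent.
    return "SDC" if bit_matters else "benign fault, no error"
```

For example, `classify_fault(True, Protection.DETECT_ONLY, False)` returns `"false DUE"`: the error is reported although the bit would not have changed the program outcome.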

Does bit matter?
Architectural Vulnerability Factor (AVF)
–Probability that a bit flip will cause a user-visible error
Soft Error Rate (SER) of a structure = (AVF_bit) x (# bits) x (intrinsic error rate_bit)
Reducing AVF reduces SER
–High AVF indicates need for protection
–Low AVF can help remove protection hardware
SER protection can be expensive
–Impacts area, power, performance, design time
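
A minimal sketch of the SER formula from this slide. The function name, the parameter names, and the example bit count and per-bit rate are assumptions for illustration; only the 29% AVF figure is taken from the slides (the instruction queue example).

```python
def structure_ser(avf: float, num_bits: int, intrinsic_rate_per_bit: float) -> float:
    """SER of a structure = AVF per bit x number of bits x intrinsic error rate per bit.

    The result is in whatever unit intrinsic_rate_per_bit uses (assumed
    here to be any consistent failures-per-time unit).
    """
    if not 0.0 <= avf <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return avf * num_bits * intrinsic_rate_per_bit


# Hypothetical structure: 1,000 bits, 0.001 errors per bit per unit time.
unprotected = structure_ser(1.0, 1000, 0.001)   # pessimistic: every bit matters
derated = structure_ser(0.29, 1000, 0.001)      # AVF of 29%, as for the instruction queue
```

Because SER scales linearly with AVF, the derated estimate is 29% of the pessimistic one, which is the sense in which a low AVF can justify removing protection hardware.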

Simple Examples
Committed Program Counter: AVF ~ 100%
Branch Predictor: AVF = 0%

Complex Examples
Instruction Queue: AVF = 29%
Execution Units: AVF = 9%
Used a new concept
–Architecturally Correct Execution (ACE)

Architecturally Correct Execution (ACE)
An ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
Anything else (un-ACE path) can be derated away
[Figure: data flow graph from Program Input to Program Outputs]

Example of un-ACE instruction: Dynamically Dead Instruction
[Figure: data flow graph with a Dynamically Dead Instruction marked]
Most bits of an un-ACE instruction do not affect program output

ACE Breakdown of Instruction Queue
Average across all Spec2K slices for an IA64-like processor
ACE % = AVF = 29%

A New AVF Analysis – Address-Based Structures
Caches, data translation buffers, store buffers
–Make up large portions of a modern chip
Simple ACE analysis is no longer enough
Data & Tag structures need new concepts
–Extended Lifetime Analysis
–Hamming-Distance-1 Analysis
–Cooldown
–AVF Reduction - Flushing

Lifetime Analysis
Idle is unACE
–Assuming all time intervals are equal
–For 3/5 of the lifetime the bit is valid
–Gives a measure of the structure’s utilization
  –Number of useful bits
  –Amount of time useful bits are resident in the structure
Valid for a particular trace
[Figure: entry lifetime timeline with Fill, Read, and Evict events; intervals labeled Idle and Valid]

Lifetime Analysis of Write-through Data Cache
Valid is not necessarily ACE
ACE % = AVF = 2/5 = 40%
Example lifetime components
–ACE: fill-to-read, read-to-read
–unACE: idle, read-to-evict, write-to-evict
[Figure: write-through data cache timeline with Fill, Read, and Evict events; Idle intervals marked]

Lifetime Analysis of Write-through Data Cache
Data ACEness is a function of instruction ACEness
Second read is by an unACE instruction
AVF = 1/5 = 20%
[Figure: write-through DCache timeline with Fill, Read, and Evict events; Idle intervals marked]
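
The two write-through examples (2/5 and 1/5) can be reproduced with a small lifetime-analysis sketch. The timeline used here (fill at t=1, reads at t=2 and t=3, evict at t=4, total lifetime of 5 equal intervals) is an assumption reconstructed from the slide numbers, and `Event` and `wt_cache_data_avf` are hypothetical names. Writes are left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass
class Event:
    time: float
    kind: str                 # "fill", "read", or "evict"
    ace_reader: bool = True   # for reads: is the reading instruction ACE?


def wt_cache_data_avf(events, total_time):
    """ACE fraction of one write-through cache entry over total_time.

    Between a fill and its evict, the data is ACE from the fill up to the
    last read by an ACE instruction; everything else is unACE. An entry
    that is never evicted is ignored here (that is the edge effect).
    """
    ace_time = 0.0
    fill_time = None
    last_ace_read = None
    for ev in sorted(events, key=lambda e: e.time):
        if ev.kind == "fill":
            fill_time, last_ace_read = ev.time, None
        elif ev.kind == "read" and fill_time is not None and ev.ace_reader:
            last_ace_read = ev.time
        elif ev.kind == "evict" and fill_time is not None:
            if last_ace_read is not None:
                ace_time += last_ace_read - fill_time
            fill_time = None
    return ace_time / total_time


both_reads_ace = [Event(1, "fill"), Event(2, "read"), Event(3, "read"), Event(4, "evict")]
second_read_unace = [Event(1, "fill"), Event(2, "read"),
                     Event(3, "read", ace_reader=False), Event(4, "evict")]
```

With these inputs, `wt_cache_data_avf(both_reads_ace, 5)` gives 0.4 and `wt_cache_data_avf(second_read_unace, 5)` gives 0.2, matching the two slides.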

Tags are Hard
A fault in a tag that is nominally associated with one instruction can affect the correct execution of a different, independent instruction
False negatives (expected hit, but got miss) are errors only if a writeback is necessary
–Uses standard lifetime analysis
False positives always result in an error
–Need bit-level analysis

False Positive
Expected tag miss, but got hit: error
How do you compute the AVF? Fault injection?
[Figure: incoming address compared against the tag address. Expect: MISS. Acquire: HIT]
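
Hamming-Distance-1 Analysis
Assuming a single-bit error model
–Only a tag that differs from the incoming address in exactly one bit can be turned into a false hit, and only by a flip of that bit
Now we can use lifetime analysis on the identified bit(s)
[Figure: incoming address compared against the tag array to find Hamming-distance-1 matches]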
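
A minimal sketch of the Hamming-distance-1 match itself, assuming tags are held as plain integers. The function name and the example tag values are hypothetical; the lifetime analysis that follows the match is not shown.

```python
def hamming_distance_1_bits(tag_array, incoming_tag):
    """Map entry index -> bit position for tags exactly one bit away.

    A stored tag that differs from the incoming tag in exactly one bit
    would become a false hit if that single bit flipped, so that bit is
    the one to track with lifetime analysis.
    """
    matches = {}
    for index, tag in enumerate(tag_array):
        diff = tag ^ incoming_tag
        # diff has exactly one bit set iff it is a nonzero power of two.
        if diff != 0 and diff & (diff - 1) == 0:
            matches[index] = diff.bit_length() - 1
    return matches


tags = [0b1010, 0b1011, 0b0010, 0b0101]
result = hamming_distance_1_bits(tags, 0b1010)
# result == {1: 0, 2: 3}: entry 1 differs only in bit 0, entry 2 only in bit 3.
# Entry 0 is a true hit (distance 0) and entry 3 differs in four bits.
```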

Edge Effects
Simulation introduces unknown component
–Simulation not run to completion
–Only execute small segment of code
Worst Case AVF = Known AVF + Unknown AVF
How do we reduce/eliminate unknown?
[Figure: timeline with Fill and Read events followed by Sim End; the interval up to Sim End is Unknown because the Evict is not simulated]

Cooldown
Run the simulation beyond the end of the interval
–Any bits that were already valid (the unknown bits) are resolved
Trend: unknown AVF primarily resolves to unACE
Best Estimate AVF = Known AVF after Cooldown
[Figure: No Cooldown vs. Cooldown; 10 million instructions of simulation followed by 10 million instructions of cooldown]
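
The relationship between known, unknown, and worst-case AVF, and the effect of cooldown, can be sketched as follows. This is only a bookkeeping illustration under assumed inputs: the function names and the `"ace"` / `"unace"` / `None` outcome labels are hypothetical, and the numbers in the usage lines are made up.

```python
def avf_bounds(known_ace_time, unknown_time, total_time):
    """Return (known AVF, worst-case AVF) when simulation stops early.

    Worst case treats every unresolved interval as ACE:
    Worst Case AVF = Known AVF + Unknown AVF.
    """
    known = known_ace_time / total_time
    worst = (known_ace_time + unknown_time) / total_time
    return known, worst


def resolve_with_cooldown(known_ace_time, unknown_intervals, total_time):
    """Fold cooldown results into the AVF estimate.

    unknown_intervals is a list of (duration, outcome) pairs, where
    outcome is "ace", "unace", or None if the cooldown period also ended
    before the interval could be resolved.
    Returns (best-estimate AVF, remaining unknown AVF).
    """
    ace_time = known_ace_time
    still_unknown = 0.0
    for duration, outcome in unknown_intervals:
        if outcome == "ace":
            ace_time += duration
        elif outcome is None:
            still_unknown += duration
        # "unace" intervals add nothing to ACE time.
    return ace_time / total_time, still_unknown / total_time


known, worst = avf_bounds(20.0, 30.0, 100.0)
best, remaining = resolve_with_cooldown(
    20.0, [(10.0, "ace"), (15.0, "unace"), (5.0, None)], 100.0)
```

In this made-up case the worst-case bound is 0.5, while cooldown narrows the estimate to 0.3 with 0.05 still unresolved.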

Data AVFs (Average)
STB AVF is lower due to a large idle component and bytemasks
DTB AVF is higher due to high average utilization
DCache (WB) AVF is higher than DCache (WT) AVF since dirty bytes are still ACE after the last read

Data AVF of DTB
Large variability in AVF
Ranges from ~0% to 80%
Based on structure utilization by benchmark

Tag AVFs (Average)
Tag AVFs are lower than expected for DTB and DCache (WT)
–Only Hamming-Distance-1 matches contribute ACE time
Tag AVFs are higher than data AVFs for STB and DCache (WB)
–Dynamically dead tags are still ACE for dirty bytes

Tag AVF of DTB
AVFs are surprisingly small, with little variation
Protection was added to the DTB CAMs prior to the AVF calculation (large # bits)
The AVF calculation shows NO protection was needed in this case

AVF Observations
DTB and Write-through Data Cache
–Typically Tag AVF < Data AVF
  –Only Hamming-distance-1 hits contribute to Tag AVF
  –Dynamically dead data are unACE
STB and Write-back Data Cache
–Typically Tag AVF ≥ Data AVF
  –Tag is ACE until eviction if the line is dirty
  –Dynamically dead data can be ACE
  –Bytemasks and writes may make certain bytes of data unACE, while all bits of the tag are always ACE

AVF Reduction: Flushing
Flushing (emulates a context switch)
–Also eliminates unknowns by flushing all live entries at the end of simulation
Main concept: transform part of the ACE time into unACE time at the expense of some performance
[Figure: timeline with Fill, Read, Flush, Fill, and Evict events; intervals labeled Idle and ACE]

AVF Reduction: Flushing
–More than 50% AVF reduction for a 100K-cycle flush (flush takes 0 time)
Max IPC reduction: 1.77% DTB, 1.25% WT/WB DCache
Avg IPC reduction: 0.56% DTB, 0.19% WT/WB DCache
[Figure: Data and Tags panels comparing No Flushing, 5M-cycle flush, 1M-cycle flush, and 100K-cycle flush]
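
The trade-off behind flushing can be illustrated with a toy model of a single write-through cache entry. This is a sketch under stated assumptions, not the simulation setup used for the results above: flushes occur at every multiple of `flush_period`, a flush takes 0 time (as on the slide), every read is by an ACE instruction, and a read after a flush refills the entry instantly at the cost of one extra miss. The function name and the example times are hypothetical.

```python
def ace_time_with_flush(fill_time, read_times, flush_period=None):
    """Return (ACE time, extra misses) for one entry.

    Without flushing, the entry is ACE from the fill to the last read.
    With flushing, any gap between two accesses that crosses a flush
    boundary becomes unACE, and the read after the flush is a miss.
    """
    ace_time = 0
    extra_misses = 0
    prev = fill_time
    for t in sorted(read_times):
        flushed = (flush_period is not None
                   and prev // flush_period != t // flush_period)
        if flushed:
            extra_misses += 1      # entry was invalidated; this read refills it
        else:
            ace_time += t - prev   # data had to stay correct over this gap
        prev = t
    return ace_time, extra_misses


no_flush = ace_time_with_flush(0, [40, 90, 130, 250])
with_flush = ace_time_with_flush(0, [40, 90, 130, 250], flush_period=100)
# no_flush == (250, 0); with_flush == (90, 2)
```

In this made-up trace, flushing every 100 time units cuts the ACE time from 250 to 90 at the cost of two extra misses, which is the ACE-for-performance exchange the slides describe.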

Summary
SER is an ever-increasing problem
–Need a standard, quantitative way to evaluate the design cost of adding protection/recovery to structures
AVF gives us a quantitative way to measure the cost of adding protection
Presented a methodology to compute the AVF of address-based structures
–Lifetime Analysis
–False Negatives and False Positives
  –Hamming-Distance-1 Analysis for False Positives
–Edge Effects and Cooldown
  –Analogous to Warmup
–AVF Reduction - Flushing

Backup Slides

Simulation Setup (Backup)
Simulated regions of several Spec2000 benchmarks for 10 million instructions
Simulated AVFs for 3 address-based structures on an IA64-like processor using ASIM
–Data Translation Buffer (DTB)
  –RAM and CAM arrays
  –128 entries, 92 bits/entry
–Store Buffer (STB)
  –Data and Address arrays
  –32 entries, 16 bytes/entry
–Level 1 Data Cache (DCache)
  –Data and Tags
  –Simulated both Write-Through and Write-Back modes
  –16 KB, 4-way set associative, 32-byte lines
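
For reference, the bit counts implied by this configuration can be worked out directly, since the number of bits is one factor of a structure's SER. The sketch assumes 1 KB = 1024 bytes and that the 92 bits/entry figure covers a whole DTB entry; tag and address widths are not given on the slide, so they are left out. The function name is hypothetical.

```python
def structure_sizes():
    """Derive sizes from the slide's configuration (assumes 1 KB = 1024 bytes)."""
    dtb_bits = 128 * 92                 # 128 entries x 92 bits/entry
    stb_data_bits = 32 * 16 * 8         # 32 entries x 16 bytes x 8 bits
    dcache_bytes = 16 * 1024            # 16 KB of data
    dcache_lines = dcache_bytes // 32   # 32-byte lines
    dcache_sets = dcache_lines // 4     # 4-way set associative
    return {
        "dtb_bits": dtb_bits,
        "stb_data_bits": stb_data_bits,
        "dcache_data_bits": dcache_bytes * 8,
        "dcache_lines": dcache_lines,
        "dcache_sets": dcache_sets,
    }


sizes = structure_sizes()
# dtb_bits = 11776, stb_data_bits = 4096, dcache_data_bits = 131072,
# dcache_lines = 512, dcache_sets = 128
```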

Lifetime Analysis – DCache (WT)
Lifetime breakdown tracks Data AVFs
Red components: unACE
Black components: ACE
FP benchmarks show a low utilization of the cache (large Idle component)

Lifetime Analysis - STB
STB lifetime shows a low utilization
STB Address AVF is always higher than Data AVF due to bytemasks
On average, 6 bytes out of every 16-byte valid entry are used (valid)
Red components: unACE
Black components: ACE

[Figure: Data AVF for Write-through Data Cache (Backup)]
[Figure: Data AVF for Write-back Data Cache (Backup)]
[Figure: Data AVF for STB (Backup)]
[Figure: Data AVF for DTB (Backup)]
[Figure: Tag AVF for Write-through Data Cache (Backup)]
[Figure: Tag AVF for Write-back Data Cache (Backup)]
[Figure: Tag AVF for STB (Backup)]
[Figure: Tag AVF for DTB (Backup)]