Dynamic Prediction of Architectural Vulnerability

Kristen Walcott, Greg Humphreys, Sudhanva Gurumurthi
University of Virginia
{walcott, humper, gurumurthi}@cs.virginia.edu

Challenge

As soft errors become more of a problem, protection will be needed even for everyday PCs. Providing total redundancy is too expensive and assumes that AVF is 100%. Our work shows that AVF varies over time and across applications.

Transient faults due to particle strikes are a key challenge in microprocessor design. As transistor counts increase exponentially, per-chip faults are a growing burden. Spatial and temporal redundancy techniques are used to protect against faults. Redundancy techniques assume that any fault will result in a visible program error (i.e., that the Architectural Vulnerability Factor (AVF) is 100 percent). Over-design can hurt performance and drain power.

Rising Problem

[Figure; source: Intel Corporation]
FIT = Failure in Time; 1 FIT = 1 failure in a billion hours

Calculating Vulnerability

What bits matter?

Particle strike causes bit flip!
  Bit read?
    no: benign fault, no error
    yes: Bit has error protection?
      detection & correction: benign fault, no error
      detection only: Does bit matter?
        yes: True Detected Unrecoverable Error
        no: False Detected Unrecoverable Error
      no: Does bit matter?
        yes: Silent Data Corruption
        no: benign fault, no error

Dynamic AVF Prediction

We identify strong correlations between structural AVF values and a small set of processor metrics. Using linear and quadratic regression, we determined an AVF characterization that uses only a few easily measurable variables. These characterizations can be used to predict AVF accurately.

[Figure labels: Outliers (Correlation to AVF); Microarchitectural Metrics]

Prediction Results

[Figure panels: 2 SimPoints of bzip2; galgel benchmark]

Future Work

With an accurate predictor, redundancy may be turned on only when vulnerability is high. Preliminary results show that partial redundancy provides a significant performance boost over full redundancy.
Next we will perform a more rigorous exploration of the design space of partial redundant multithreading implementations and investigate redundancy toggling policies.

The AVF of a bit is the probability that the bit matters:

$$\mathrm{AVF}_{\mathrm{bit}} = P(\text{bit matters}) = \frac{\#\ \text{of visible errors}}{\#\ \text{of bit flips from particle strikes}}$$

Computer Science at the University of Virginia
http://www.cs.virginia.edu/~krw7c/avf.html
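
The AVF definition and the regression-based characterization described on this poster can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`avf`, `polyfit`, `predict`), the single `occupancy` metric, and all sample numbers are hypothetical and chosen only for illustration; the poster does not name the processor metrics it uses. The fit is an ordinary least-squares polynomial fit (degree 1 for linear, degree 2 for quadratic) solved through the normal equations.

```python
from typing import List, Sequence


def avf(visible_errors: int, bit_flips: int) -> float:
    """AVF of a bit: visible errors divided by bit flips from particle strikes."""
    if bit_flips <= 0:
        raise ValueError("bit_flips must be positive")
    return visible_errors / bit_flips


def polyfit(xs: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares polynomial fit; returns coefficients, lowest power first.

    Assumes xs contains at least degree + 1 distinct values, otherwise the
    normal equations are singular.
    """
    n = degree + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)), b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [ar - f * ac for ar, ac in zip(a[r], a[col])]
                b[r] -= f * b[col]
    return [b[i] / a[i][i] for i in range(n)]


def predict(coeffs: Sequence[float], x: float) -> float:
    """Evaluate the fitted polynomial at metric value x."""
    return sum(c * x ** k for k, c in enumerate(coeffs))


if __name__ == "__main__":
    # Hypothetical samples: (metric value, visible errors, injected bit flips).
    samples = [(0.1, 4, 100), (0.3, 11, 100), (0.5, 22, 100),
               (0.7, 35, 100), (0.9, 52, 100)]
    occupancy = [m for m, _, _ in samples]
    measured = [avf(v, f) for _, v, f in samples]

    linear = polyfit(occupancy, measured, 1)
    quadratic = polyfit(occupancy, measured, 2)
    print("linear prediction at 0.6:   ", predict(linear, 0.6))
    print("quadratic prediction at 0.6:", predict(quadratic, 0.6))
```

In a sketch like this, the predicted AVF could then be compared against a threshold to decide when to enable redundancy, in the spirit of the toggling policies mentioned under Future Work; the threshold and policy themselves are left unspecified here.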