Cost-Effective Register File Soft Error Reduction. Pablo Montesinos, Wei Liu and Josep Torrellas, University of Illinois at Urbana-Champaign
Overview Study of register file vulnerability to SDC (Silent Data Corruption). Shield: cost-effective protection for register files. Highlights of the policies and techniques used in Shield. Experiments and results.
Register File AVF RF-AVF (the register file's Architectural Vulnerability Factor) is the probability that a fault in the register file leads to an error. A register's lifetime is divided into PreWrite, Useful, and PostLastRead periods. For the AVF calculation, the lifetime of each bit is divided into ACE (Architecturally Correct Execution) and un-ACE cycles.
Register File AVF During the PreWrite period the register is un-ACE. If the register is read at least once after the write, it switches to the ACE state. After the last read it switches back to un-ACE for the PostLastRead period.
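The lifetime split described above can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the `RegisterVersion` record, its field names, and the choice to count the Useful period from the write to the last read are assumptions made for the example, and the AVF is approximated as the fraction of register-cycles spent in the ACE state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RegisterVersion:
    """One physical-register version (hypothetical record for this sketch)."""
    alloc: int        # cycle the physical register is allocated
    write: int        # cycle the value is written
    reads: List[int]  # cycles at which the value is read (may be empty)
    dealloc: int      # cycle the physical register is freed


def split_lifetime(v: RegisterVersion) -> Tuple[int, int, int]:
    """Return (pre_write, useful, post_last_read) cycle counts."""
    pre_write = v.write - v.alloc
    if v.reads:
        last_read = max(v.reads)
        useful = last_read - v.write          # ACE cycles
        post_last_read = v.dealloc - last_read
    else:
        # Never read: the value is never ACE.
        useful = 0
        post_last_read = v.dealloc - v.write
    return pre_write, useful, post_last_read


def register_file_avf(versions: List[RegisterVersion],
                      num_registers: int, total_cycles: int) -> float:
    """Approximate AVF as ACE register-cycles over all register-cycles."""
    ace_cycles = sum(split_lifetime(v)[1] for v in versions)
    return ace_cycles / (num_registers * total_cycles)


v = RegisterVersion(alloc=0, write=2, reads=[5, 8], dealloc=10)
print(split_lifetime(v))            # (2, 6, 2)
print(register_file_avf([v], 2, 10))  # 0.3
```

Only the Useful cycles enter the numerator, which is why protecting a register outside its Useful period buys no reduction in AVF.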
Highlighting Insights (1) The combined %-USEFUL time of all registers is small
Highlighting Insights (1) The average number of useful (live) registers is less than 20 (SPECint) and 17 (SPECfp). It is thus possible to reduce the vulnerability of the register file by protecting only a subset of carefully chosen registers at a time.
Highlighting Insights (2) Only a few long-lived register versions contribute a large share of the total useful time. On average, less than 10% of register versions are long-lived.
Highlighting Insights (2) On average 40% of useful time comes from the few long-lived versions. In SPECfp, 5% of long-lived versions account for 46% of the useful time.
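To make the two percentages concrete, the sketch below computes, for a list of per-version useful times, what fraction of versions count as long-lived and what share of the total useful time they hold. The `threshold` parameter is hypothetical (the slides do not say where the long-lived cutoff lies), and the sample numbers are toy values chosen for illustration, not measurements from the study.

```python
from typing import List, Tuple


def long_lived_share(useful_times: List[int],
                     threshold: int) -> Tuple[float, float]:
    """Return (fraction of versions that are long-lived,
               share of total useful time they account for)."""
    total = sum(useful_times)
    if total == 0:
        return 0.0, 0.0
    long_lived = [t for t in useful_times if t >= threshold]
    return len(long_lived) / len(useful_times), sum(long_lived) / total


# Toy data: nine short-lived versions and one long-lived version.
fraction, share = long_lived_share([1] * 9 + [6], threshold=5)
print(fraction, share)  # 0.1 0.4
```

A skew of this kind is what makes selective protection attractive: covering a small number of versions covers a disproportionate part of the useful time.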
Motivation Register files have a very high access rate and therefore run at a high temperature, which lowers Qcrit for the devices. An error in the RF can propagate with a high failure probability. If we isolate a few register versions by predicting their lifetime, and protect these register versions alone, high reliability can be achieved with limited overhead.
Shield - Architecture Lifetime prediction, shielding decision, register error check, error recovery.
Reg-Version Lifetime Prediction P12 => Used(1), Renamed(1); P7 => Used(0), Renamed(1)
Shielding Decision These prediction bits are stored as status in the ECC table. Whether an incoming (newly written) register version is shielded depends first on the availability of a free ECC-table entry. Replacement policy: an entry with the same register number already present in the ECC table is replaced with the new entry; otherwise an existing register version with a shorter predicted lifetime than the incoming version is replaced.
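The decision rules above can be sketched as a small function. This is an illustration under stated assumptions rather than the hardware design: `EccEntry`, `shield_decision`, and the numeric `lifetime` score are hypothetical names (the slides only say that prediction bits are kept as status), and the same-register check is done first here so that a tag never appears twice in the table.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EccEntry:
    reg: int       # physical register number, used as the tag
    lifetime: int  # predicted lifetime score (hypothetical encoding)


def shield_decision(table: List[Optional[EccEntry]],
                    reg: int, lifetime: int) -> Optional[int]:
    """Try to shield an incoming register version.

    Returns the ECC-table index used, or None if the version is left
    unprotected. The table is updated in place.
    """
    if not table:
        return None
    # 1. Same register number already present: replace that entry.
    for i, entry in enumerate(table):
        if entry is not None and entry.reg == reg:
            table[i] = EccEntry(reg, lifetime)
            return i
    # 2. Otherwise take a free entry if one exists.
    for i, entry in enumerate(table):
        if entry is None:
            table[i] = EccEntry(reg, lifetime)
            return i
    # 3. Otherwise evict the entry with the shortest predicted lifetime,
    #    but only if it is shorter than the incoming version's.
    victim = min(range(len(table)), key=lambda i: table[i].lifetime)
    if table[victim].lifetime < lifetime:
        table[victim] = EccEntry(reg, lifetime)
        return victim
    return None


table: List[Optional[EccEntry]] = [None, None]
print(shield_decision(table, reg=12, lifetime=5))  # 0
print(shield_decision(table, reg=7, lifetime=1))   # 1
print(shield_decision(table, reg=3, lifetime=4))   # 1
```

In the last call the table is full, so the entry for register 7 (the shortest predicted lifetime) is evicted in favour of register 3.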
Register Error Check & Recovery On a read request, the register data is sent both to the original datapath and to Shield. If the register number matches a tag entry, the register data is checked for errors by the ECC checker. If an error is detected: the processor stalls the instruction I that is reading register P; the register data is corrected and written back into the RF; the oldest instruction in the ROB that reads register P, and all succeeding instructions, are flushed; the processor resumes from the flushed instruction.
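The flush step of the recovery sequence can be illustrated with a simple model of the ROB. This sketch covers only the selection of the flush point; ECC checking and correction are outside it. The representation of the ROB as a list of `(name, source_registers)` tuples ordered oldest first, and the function name `flush_on_error`, are assumptions made for the example.

```python
from typing import List, Set, Tuple

Instr = Tuple[str, Set[int]]  # (instruction name, source register numbers)


def flush_on_error(rob: List[Instr],
                   reg: int) -> Tuple[List[Instr], List[Instr]]:
    """Split the ROB at the oldest instruction that reads `reg`.

    `rob` is ordered oldest first. Returns (kept, flushed): instructions
    older than the flush point are kept, and the oldest reader of `reg`
    together with everything younger is flushed and re-executed.
    """
    for i, (_, sources) in enumerate(rob):
        if reg in sources:
            return rob[:i], rob[i:]
    return rob, []  # no in-flight reader of `reg`: nothing to flush


rob = [("i1", {1, 2}), ("i2", {12}), ("i3", {4}), ("i4", {12})]
kept, flushed = flush_on_error(rob, reg=12)
print([name for name, _ in kept])     # ['i1']
print([name for name, _ in flushed])  # ['i2', 'i3', 'i4']
```

Execution would then resume from the first flushed instruction, here `i2`, which now reads the corrected value of the register.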
Experiments - Results AVF computation for the RF with Shield
Experiments - Results The AVF of intREG is reduced by a different amount under each replacement policy: LRU = 31%, Effective = 63%, OptEffective = 84% (Effective plus pinning of global pointers to particular ECC entries). The AVF of fpREG can be reduced by up to 100%, because fewer fp registers are in the useful state.
Power and Area Impact Shield uses only 3 ECC generators and 3 ECC checkers. Shield has a 45% power overhead over a plain register file (full ECC has 2X). Shield introduces an overall 10% area overhead.
Conclusion A cost-effective architectural technique has been proposed that reduces the vulnerability of the RF by 84%. The area and power overhead is a small cost for the reliability achieved.