FaultSim: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems*
Memory Reliability Tutorial: HPCA-2016
Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech)
*HiPEAC-2016
HOW TO EVALUATE MEMORY RELIABILITY?
Goal: Accurately evaluate memory reliability across different systems and solutions, in less than one minute.
Memory-system simulators by metric:
– Performance: DRAMSim2, USIMM, NVMain
– Power: DRAMSim2, USIMM, NVSim, CACTI
– Reliability: (none shown)
Fast and accurate simulators are vital for comparing the effectiveness of different solutions.
TYPES OF MEMORY FAILURES
DRAM devices can encounter faults during operation:
– Permanent failures
– Transient failures
[Figure label: Cores + Caches]
Memory reliability evaluations must account for both transient and permanent failures.
GRANULARITY OF MEMORY FAILURES
Failures occur at small and large granularities: Bit, Word, Column, Row, Bank, Multi-Bank.
[Figure: single DRAM die (top view), showing banks]
A memory reliability simulator should capture the interaction of failures at different granularities.
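REAL WORLD FAILURE RATE [SRIDHARAN+ SC’13]
[Table: Transient Fault Rate (FIT) and Permanent Fault Rate (FIT) for each failure mode (Bit, Word, Column, Row, Bank) plus a Total row. Only the Bank row survives extraction: 0.8 FIT transient, 10 FIT permanent. A bracketed group of entries is annotated "= 24.1". Check marks label SECDED and Chipkill.]
1. Permanent faults are more than 2x as likely as transient faults.
2. Large-granularity faults are as common as bit faults.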
COMPLEXITY IN FAULT INTERACTIONS WITH ECC
Several techniques (SECDED, Chipkill, sparing) are often used in combination with periodic scrubbing.
These techniques interact in complex ways with fault modes and granularities.
How do we evaluate their effectiveness?
ANALYTICAL MODELS FOR MEMORY RELIABILITY
Analytical models are complex and cumbersome, and they change with the fault model.
A PRDC paper* needs a nearly 3-page model for Chipkill.
A small change in the ECC scheme forces massive changes in the model.
Takeaway: use empirical evaluation instead of analytical models.
*Jian et al., PRDC 2013
OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY
FAULTSIM: A MONTE-CARLO FAULT SIMULATOR
FaultSim is written in C++ and is configured from the command line or a file. The configuration describes:
– The memory system, including chips per rank, number of channels, and the interconnect to the processor
– The fault rate per component, derived from field studies; it can be changed
– The mitigation technique(s), which can be a combination and can be used with or without scrubbing
FAULTSIM OPERATION
Divide the system lifetime (7 years) into smaller intervals (3 hours each), giving about 20,000 intervals.
Check for uncorrectable failures in every interval.
Repeat for about 1 million trials.
FaultSim therefore performs 20K x 1 million interval simulations per chip for each fault type, which takes days of simulation time.
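The interval-based loop described on this slide can be sketched roughly as follows. This is a minimal illustration in Python, not FaultSim's actual C++ code: the function names are hypothetical, only fault injection is shown (the ECC check for uncorrectable failures is omitted), and the conversion assumes the standard definition of FIT as failures per 10^9 device-hours.

```python
import random

HOURS_PER_YEAR = 365 * 24


def num_intervals(lifetime_years=7, interval_hours=3):
    """Number of simulation intervals in the system lifetime."""
    return (lifetime_years * HOURS_PER_YEAR) // interval_hours


def fault_probability_per_interval(fit, interval_hours=3):
    """Convert a FIT rate (failures per 1e9 device-hours) into the
    probability of a fault within one interval."""
    return fit * 1e-9 * interval_hours


def simulate_trial_interval_based(fit, rng, lifetime_years=7, interval_hours=3):
    """One Monte-Carlo trial: consult the RNG once per interval and
    return the indices of the intervals in which a fault was injected."""
    p = fault_probability_per_interval(fit, interval_hours)
    fault_intervals = []
    for i in range(num_intervals(lifetime_years, interval_hours)):
        if rng.random() < p:  # one RNG call per interval, fault or not
            fault_intervals.append(i)
    return fault_intervals


# 7 years of 3-hour intervals: 20,440, i.e. the "20K" on the slide.
print(num_intervals())
```

Because the random number generator is consulted in every interval, the cost of one trial grows with the number of intervals rather than with the number of faults.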
FAULTSIM: DATA STRUCTURES
Memory chips are organized as Fault Domains.
A Fault Domain (FD) consists of Fault Ranges (FR).
Each FR uses Address (ADDR) and Mask fields.
[Figure labels: Channel, Chip, Fault]
Space-efficient representation: large- and small-granularity faults use only one type of FR data structure.
FAULT REPRESENTATION: EXAMPLE
Memory with 8 rows and 8 bits per row.
Fault ranges A, B and C (A and B intersect).
Mask field: where Mask bit i is 1, fault address bit i can be 0 or 1.
Address field: the specific address bit values for positions where Mask bit i == 0.
Fault intersection is computed from the mask and address fields.
[Table: Address and Mask values for fault ranges A, B and C; values not recoverable]
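A small Python sketch of the address/mask representation may make the intersection rule concrete. It assumes a 6-bit address (3 row bits followed by 3 column bits) for the 8x8 memory, and the values chosen for A, B and C are hypothetical, since the slide's table did not survive. The class and function names are illustrative and are not FaultSim's own.

```python
from dataclasses import dataclass

ADDR_BITS = 6                      # assumed layout: 3 row bits, then 3 column bits
ALL_ONES = (1 << ADDR_BITS) - 1


@dataclass(frozen=True)
class FaultRange:
    addr: int  # required bit values where the mask bit is 0
    mask: int  # 1 = this address bit may be 0 or 1 (wildcard)

    def covers(self, address):
        """True if `address` matches on every non-wildcard bit."""
        return ((address ^ self.addr) & ~self.mask & ALL_ONES) == 0


def intersects(a, b):
    """Two ranges overlap iff they agree on every bit fixed in both."""
    fixed_in_both = ~(a.mask | b.mask) & ALL_ONES
    return ((a.addr ^ b.addr) & fixed_in_both) == 0


def row_col(row, col):
    """Pack a (row, column) pair into a 6-bit address."""
    return (row << 3) | col


# Hypothetical fault ranges for illustration:
A = FaultRange(addr=row_col(2, 0), mask=0b000111)  # row fault: all of row 2
B = FaultRange(addr=row_col(0, 5), mask=0b111000)  # column fault: all of column 5
C = FaultRange(addr=row_col(6, 1), mask=0b000000)  # single-bit fault at row 6, column 1

print(intersects(A, B), intersects(A, C), intersects(B, C))
```

Under these assumed values, A and B overlap at row 2, column 5, while C overlaps neither, which matches the slide's statement that A and B intersect. A row, a column and a single bit all use the same structure; only the mask differs.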
VALIDATION: WITH ANALYTICAL MODELS
System: 1 rank, 18 chips.
FaultSim closely follows the analytical model (within 2%).
RESULTS: SIMULATION TIME
Time for a million trials with FaultSim (wall clock):
Repair scheme    Simulation time
SECDED           49.5 hours
ChipKill         49.2 hours
FaultSim still has a simulation time on the order of days. How do we reduce this to less than a minute?
OBSERVATION: FEW FAULTS IN SYSTEM LIFETIME
FaultSim consults the random number generator at least once during each interval (20K intervals).
System with 2 DIMMs, 9 chips each, over 7 years:
Faults encountered (total)    Trials
0                             92.9%
1                             6.7%
2                             0.2%
3+                            0.2%
Can we consult the random number generator in proportion to the number of faults, instead of in every time interval?
INSIGHT: COMPUTE DISTANCE TO NEXT FAULT
The time between events in a process in which events occur continuously and independently at a constant average rate is exponentially distributed.
Example: Let the likelihood of a lottery ticket being a winner be 1/1000. We buy 5000 tickets. What is the likelihood of "X" winning tickets?
– Naïve method: draw 5000 tickets and check each one to see whether it is a winner.
– Distance method: compute the distance to the next winning ticket using an exponential distribution (avg = 1000). Repeat until the sum of the distances exceeds 5K.
[Figure: number line from 1K to 5K with successive distances of 1.5K, 2K, 0.5K and 1.5K; the fourth distance exceeds 5K, so this run has three winners]
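The two methods from the lottery example can be compared with a short Python sketch. The function names are hypothetical. Note that the exponential distribution is a continuous approximation here: the exact distance between winning tickets is geometric, and the two are close when the win probability is small.

```python
import random


def winners_naive(n_tickets, p_win, rng):
    """Naive method: one RNG call per ticket."""
    return sum(1 for _ in range(n_tickets) if rng.random() < p_win)


def winners_distance(n_tickets, p_win, rng):
    """Distance method: draw the gap to the next winner from an
    exponential distribution with mean 1 / p_win, and stop once the
    running position passes the last ticket."""
    position = 0.0
    winners = 0
    while True:
        position += rng.expovariate(p_win)  # one RNG call per winner (+1 to stop)
        if position > n_tickets:
            return winners
        winners += 1


rng = random.Random(2016)
print(winners_naive(5000, 1 / 1000, rng), winners_distance(5000, 1 / 1000, rng))
```

Both functions return about 5 winners on average, but the naive method makes 5000 random draws per run while the distance method makes roughly one draw per winner plus one final draw that overshoots.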
FAULTSIM: EVENT-BASED FAULT INJECTION
Event-based fault injection asks: when is the next fault?
The time stamps of all faults are computed at the start of the simulation.
The simulation then skips from one fault to the next.
Calls to the random number generator drop from 20K to 1 (or 2).
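As a rough illustration of event-based injection, the sketch below precomputes the fault time stamps for one component over a 7-year lifetime. It is written in Python with hypothetical names and is not FaultSim's implementation; it assumes FIT means failures per 10^9 device-hours and that faults arrive independently at a constant rate.

```python
import random

LIFETIME_HOURS = 7 * 365 * 24


def fault_timestamps(fit, rng, lifetime_hours=LIFETIME_HOURS):
    """Return (timestamps, rng_calls) for one component.

    Inter-arrival times are drawn from an exponential distribution
    whose rate is the FIT value converted to faults per hour.
    `fit` must be positive.
    """
    rate_per_hour = fit * 1e-9
    timestamps = []
    rng_calls = 0
    t = 0.0
    while True:
        t += rng.expovariate(rate_per_hour)
        rng_calls += 1
        if t > lifetime_hours:
            return timestamps, rng_calls  # next fault falls after end of life
        timestamps.append(t)


# Hypothetical FIT value, chosen only for illustration.
times, calls = fault_timestamps(fit=30.0, rng=random.Random(7))
print(len(times), calls)
```

With realistic FIT values most trials see no fault at all, so a single draw is enough to establish that the next fault lies beyond the system lifetime. The simulator then only needs to evaluate the ECC scheme at the returned time stamps.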
RESULTS: SIMULATION TIME
Time for a million trials with FaultSim (wall clock):
Scheme                       Simulation time
SECDED (interval based)      49.5 hours
SECDED (event based)         34 seconds
Chipkill (interval based)    49.2 hours
Chipkill (event based)       33 seconds
FaultSim is ~5000x faster with event-based fault injection (49.5 hours / 34 seconds ≈ 5,200x), which brings reliability simulation to less than one minute.
OBTAINING AND RUNNING FAULTSIM
Clone it from GitHub:
$ git clone
Running FaultSim:
$ ./faultsim --help
(prints a list of command line parameters)
$ ./faultsim --configfile configs/DIMM_none.ini --outfile out.txt
Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures [MICRO 2014]
Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech)
INTRODUCTION TO 3D DRAM
DRAM systems face a bandwidth wall.
Stacking DRAM dies over each other gives 3D DRAM.
Through-Silicon Vias (TSVs) connect the dies.
A higher density of TSVs gives higher bandwidth.
[Figure courtesy of Micron, ExtremeTech]
Takeaway: go 3D to scale the bandwidth wall.
FAILURES IN 3D DRAM
3D DRAM dies communicate using TSVs.
This introduces a new failure mode: TSV failures.
TSV failures are large-granularity failures.
Takeaway: TSVs present a new kind of large-granularity failure.
A NEW FAILURE MODE FROM TSVs
TSVs are the conduit for addresses and data.
There are mainly two types of TSV faults:
– Data (incorrect data fetched from the DRAM die)
– Address (incorrect address presented to the DRAM die)
[Figure: logic die and TSVs, with an address TSV fault and a data TSV fault marked]
TSV faults cause unavailability of data and addresses.
CAPTURING THE EFFECT OF TSV FAULTS
– Data TSV fault: a few columns become faulty.
– Address TSV fault: 50% of memory becomes unavailable.
[Figure: DRAM bank with row decoder, column decoder, address TSVs and data TSVs; one faulty data TSV and one faulty address TSV are marked]
TSV faults manifest as column or bank failures.
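The 50% figure for an address TSV fault follows from simple counting, which the Python sketch below illustrates. It rests on a simplifying assumption that is not stated on the slide: each address TSV carries one address bit, and a faulty TSV behaves as if that bit were stuck at a fixed value. The function names and the address/mask encoding of the lost region are hypothetical.

```python
def unreachable_fraction(addr_bits, stuck_bit, stuck_value):
    """Fraction of addresses that can no longer be reached when one
    address bit is stuck at `stuck_value` (counted by brute force)."""
    total = 1 << addr_bits
    unreachable = sum(
        1 for a in range(total) if ((a >> stuck_bit) & 1) != stuck_value
    )
    return unreachable / total


def address_tsv_fault_range(addr_bits, stuck_bit, stuck_value):
    """Describe the lost half of memory as one (addr, mask) pair:
    the stuck bit is fixed to the value that can no longer be
    presented, and every other bit is a wildcard (mask bit = 1)."""
    all_ones = (1 << addr_bits) - 1
    lost_value = 1 - stuck_value
    addr = lost_value << stuck_bit
    mask = all_ones & ~(1 << stuck_bit)
    return addr, mask


print(unreachable_fraction(addr_bits=10, stuck_bit=3, stuck_value=0))
```

Whichever bit is stuck, exactly half of the addresses need the opposite value on that bit, so half of the memory behind that TSV becomes unavailable. This is why a single TSV fault behaves like a large-granularity failure.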
IMPACT OF TSV FAULTS
System: 8GB stacked memory (HBM).
[Figure: probability of system failure (probability of an uncorrectable error) with and without TSV faults; the gap is annotated 22X]
Takeaway: efficient techniques are needed to mitigate TSV faults.
Citadel: TSV-SWAP
DESIGN-TIME TSV SPARING
Designers provision spare TSVs alongside the data TSVs and address TSVs.
[Figure: DRAM bank with row decoder, column decoder and spare TSVs]
Additional spare TSVs can replace faulty TSVs.
DESIGN-TIME TSV SPARING: OPERATION
– Deactivate the broken TSVs.
– Activate the spare TSVs.
[Figure: DRAM bank with row decoder and column decoder; a faulty data TSV and a faulty address TSV (50% of memory unavailable) are crossed out, next to the spare TSVs]
Deactivation of faulty TSVs and activation of spare TSVs are performed at design time.
DESIGN-TIME TSV SPARING: PROBLEMS
– Additional TSVs are required for TSV sparing.
– What happens if TSVs turn faulty at runtime?
TSV-SWAP: RUNTIME TSV SPARING
STEP-1: CREATE STANDBY TSVs
– Use a few data TSVs as standby TSVs.
– Replicate the standby data in ECC.
[Figure: DRAM bank with row decoder, column decoder and a standby TSV; labels: Data, Cache, ECC, Standby]
Data TSVs are reused as standby TSVs.
TSV-SWAP: RUNTIME TSV SPARING
STEP-2: DETECTING FAULTY TSVs
– Data and address TSV faults are told apart using CRC-32 + BIST.
– CRC-32 covers address + data.
– BIST diagnoses the faulty TSVs.
[Figure: DRAM bank with row decoder, column decoder and a standby TSV; address TSV fault: 50% of memory unavailable]
TSV-SWAP: RUNTIME TSV SPARING
STEP-3: REDIRECTING FAULTY TSVs
Swap faulty TSVs with standby TSVs at runtime.
[Figure: DRAM bank with row decoder, column decoder and a standby TSV; the faulty address TSV (50% of memory unavailable) is swapped with the standby TSV]
TSV-SWAP is a runtime technique that does not rely on additional spare TSVs.
USING FAULTSIM TO EVALUATE 3D MEMORY
[Figure: axes labeled Probability of Failure and Placement of Cache Line; from "Citadel", MICRO 2014]
FaultSim was used to evaluate TSV sparing in 3D memory.
XED: Exposing On-Die Error Detection Information for Strong Memory Reliability [ISCA 2016]
Prashant Nair (Georgia Tech), Vilas Sridharan (AMD Inc.), Moinuddin Qureshi (Georgia Tech)
INTRODUCTION
DRAM has been the building block for main memory for decades.
DRAM scaling provides higher memory capacity: moving to a smaller node provides ~2x capacity.
Shrinking DRAM cells is difficult, which is a threat to scaling. Broken cells reduce yield.
[Figure: DRAM cell capacitor; a mechanically unstable cell tilting towards ground]
Solution: ECC inside DRAM chips (On-Die ECC).
Takeaway: future DRAM chips will use On-Die ECC to enable scaling.
UTILITY OF ON-DIE ECC
On-Die ECC cannot fix multi-bit errors.
Even using an ECC-DIMM with On-Die ECC does not increase reliability.
Ideally we want Chipkill-level reliability (43x more).
But Chipkill requires activating at least 18 chips.
If only we could use an ECC-DIMM and activate only 9 chips to get Chipkill. Can we use On-Die error information for this?
Goal: use an ECC-DIMM and expose On-Die error information to obtain Chipkill.
LOOKING AT A MEMORY REQUEST
Cache line = 64 bytes.
Each chip in a DIMM stores 8B of data.
On-Die ECC computes ECC within each DRAM die.
For compatibility (DDR4) and to reduce bandwidth, On-Die ECC is invisible to the memory controller.
XED: ERROR DETECTION + PARITY
On-Die ECC can be reused as an error detection code.
– On-Die ECC detects the chip failure.
– The parity chip can recover the error.
XED: expose error detection information in a 9-chip ECC-DIMM to fix chip faults (Chipkill).
HOW TO EXPOSE ON-DIE ERROR INFO?
Use additional wires
– Makes the solution incompatible with current systems
– Current systems are starved for pin bandwidth
Add one additional burst or transaction
– Consumes bandwidth
– 8 bursts become 10 bursts: 25% more bursts per access, so 20% of the transferred bursts are overhead
– An additional transaction takes 100% more bandwidth
Goal: expose On-Die error detection with negligible changes.
EXPOSING ON-DIE ECC FOR FREE
If there are no errors, the DRAM sends all data + parity.
[Figure: the chips send D0 D1 D2 D3 D4 D5 D6 D7 and the parity chip sends PA to the memory controller]
The memory controller detects that there is no error.
EXPOSING ON-DIE ECC USING CATCHWORD
On detecting an error, the DRAM chip sends a 64-bit "Catch-Word" instead of data.
[Figure: the chips send CW D1 D2 D3 D4 D5 D6 D7 and the parity chip sends PA to the memory controller; the 64-bit CW replaces D0]
64-bit Catch-Words efficiently identify the faulty chip.
WHY CATCH-WORDS WORK
If the Catch-Word is different from the data, then there is a parity mismatch: detected error.
If the Catch-Word is the same as the data (collision), then there is no parity mismatch: no error.
– Update the Catch-Word on a collision
A collision of Catch-Words (a rare event) still leads to correct results.
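The catch-word and parity logic can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the XED hardware design: parity is taken to be the XOR of the eight 64-bit data words, the catch-word constant is hypothetical, at most one chip reports an error, and all function names are made up for illustration.

```python
from functools import reduce

CATCH_WORD = 0xDEADBEEFCAFEF00D  # hypothetical 64-bit catch-word


def parity_of(words):
    """XOR parity over a list of 64-bit words."""
    return reduce(lambda a, b: a ^ b, words, 0)


def chip_read(stored_word, on_die_error_detected, catch_word=CATCH_WORD):
    """A chip sends the catch-word instead of data when its on-die
    ECC has detected an error."""
    return catch_word if on_die_error_detected else stored_word


def controller_read(words, parity, catch_word=CATCH_WORD):
    """Return (data_words, status) as seen by the memory controller."""
    words = list(words)
    suspects = [i for i, w in enumerate(words) if w == catch_word]
    if not suspects:
        ok = parity_of(words) == parity
        return words, "ok" if ok else "parity mismatch without catch-word"
    if len(suspects) > 1:
        return None, "multiple catch-words: beyond this sketch"
    i = suspects[0]
    # Rebuild the suspect chip's word from the other chips and the parity.
    rebuilt = parity_of(words[:i] + words[i + 1:]) ^ parity
    words[i] = rebuilt
    # If the rebuilt word equals the catch-word, the data really was that
    # value (a collision): there was no error and the result is unchanged.
    return words, "collision" if rebuilt == catch_word else "corrected"


data = [0x1111 * (i + 1) for i in range(8)]
seen = [chip_read(w, on_die_error_detected=(i == 3)) for i, w in enumerate(data)]
print(controller_read(seen, parity_of(data)))
```

In both the error case and the collision case the controller ends up with the correct data, which is the point of the slide: a collision costs nothing in correctness and only signals that the catch-word should be updated.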
WHY COLLISIONS ARE NOT A PROBLEM
A chip stores 64 bits per cache line, so there are 2^64 possible values.
However, even a 16Gb chip has only 2^28 cache lines.
Thus, even if every line in the chip held different data, there would still be ~10^19 values free.
A random 64-bit Catch-Word will most likely not collide.
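The counting argument on this slide can be checked with a few lines of Python arithmetic. The variable names are illustrative, and the collision bound assumes the worst case in which every cache line in the chip holds a distinct value and the catch-word is chosen uniformly at random.

```python
CHIP_BITS = 16 * 2**30       # a 16Gb chip
BITS_PER_LINE = 64           # bits the chip stores per cache line

lines_per_chip = CHIP_BITS // BITS_PER_LINE       # 2**28 cache lines
total_values = 2**BITS_PER_LINE                   # 2**64 possible 64-bit values
free_values = total_values - lines_per_chip       # values no line can be using
collision_bound = lines_per_chip / total_values   # 2**-36 at worst

print(f"free values: {free_values:.3e}")
print(f"collision probability at most: {collision_bound:.3e}")
```

The free-value count is about 1.8 x 10^19, consistent with the ~10^19 on the slide, and a randomly chosen catch-word matches stored data with probability of at most 2^-36.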
RESULTS
A look at current commercial ECC schemes:
– Chipkill activates more chips: low performance.
– SECDED activates few chips: low reliability.
– XED: high reliability + high performance.
ECC scheme    Chips    Strength of ECC
SECDED        9        1x
Chipkill      18       43x
XED           9        172x
XED provides very high reliability with no performance loss.
SUMMARY
Memory reliability is becoming increasingly important, and there is a need for evaluation tools.
We introduce FaultSim, an efficient and fast memory-reliability simulator.
FaultSim is ~5000x faster than an interval-based Monte-Carlo simulator.
FaultSim can be used to evaluate the reliability of stacked memories and future memory systems.