
FAULTSIM: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems* Memory Reliability Tutorial: HPCA-2016 Prashant.


1 FAULTSIM: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems*
Memory Reliability Tutorial: HPCA-2016
Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech)
*HiPEAC-2016

2 HOW TO EVALUATE MEMORY RELIABILITY?
Memory system metrics and the simulators available for them:
– Performance: DRAMSim2, USIMM, NVMain
– Power: DRAMSim2, USIMM, NVSim, CACTI
– Reliability
Goal: Accurately evaluate memory reliability across different systems and solutions in less than one minute.
Fast and accurate simulators are vital for comparing the effectiveness of different solutions.

3 TYPES OF MEMORY FAILURES
DRAM devices can encounter faults during operation: permanent failures and transient failures.
Memory reliability evaluations must account for both transient and permanent failures.
[Figure: Cores + Caches connected to DRAM devices]

4 GRANULARITY OF MEMORY FAILURES
Failures occur at small and large granularities: Bit, Word, Column, Row, Bank, Multi-Bank.
A memory reliability simulator should capture the interaction of failures at different granularities.
[Figure: Single DRAM Die (Top View), showing Banks]

5 REAL WORLD FAILURE RATE [SRIDHARAN+ SC’13]
Failure Mode   Transient Fault Rate (FIT)   Permanent Fault Rate (FIT)
Bit            14.2                         18.6
Word           1.4                          0.3
Column         1.4                          5.6
Row            0.2                          8.2
Bank           0.8                          10
Total          18                           42.7
Permanent Word + Column + Row + Bank faults = 24.1 FIT.
1. Permanent faults are >2x as likely as transient faults.
2. Large-granularity faults are as common as bit faults.
[Figure annotations: ✔ SECDED, ✔ CHIPKILL]
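The totals in the table can be checked, and turned into an expected fault count, with a few lines of arithmetic. This is a minimal sketch, assuming the standard definition of 1 FIT = 1 failure per 10^9 device-hours. The function name `expected_faults` and the 18-chip, 7-year example are illustrative choices and are not taken from FaultSim.

```python
# FIT rates per DRAM chip, copied from the table above [Sridharan+ SC'13].
TRANSIENT_FIT = {"bit": 14.2, "word": 1.4, "column": 1.4, "row": 0.2, "bank": 0.8}
PERMANENT_FIT = {"bit": 18.6, "word": 0.3, "column": 5.6, "row": 8.2, "bank": 10.0}

HOURS_PER_YEAR = 24 * 365


def expected_faults(fit: float, chips: int, years: float) -> float:
    """Expected fault count, assuming 1 FIT = 1 failure per 1e9 device-hours."""
    return fit * 1e-9 * chips * years * HOURS_PER_YEAR


total_transient = sum(TRANSIENT_FIT.values())
total_permanent = sum(PERMANENT_FIT.values())
# Everything coarser than a single bit (word, column, row, bank).
large_permanent = total_permanent - PERMANENT_FIT["bit"]

print(f"transient total:            {total_transient:.1f} FIT")
print(f"permanent total:            {total_permanent:.1f} FIT")
print(f"permanent large-granularity: {large_permanent:.1f} FIT")

# Hypothetical example system: 18 chips observed for 7 years.
all_modes = total_transient + total_permanent
print(f"expected faults: {expected_faults(all_modes, chips=18, years=7):.4f}")
```

The expected count for such a system is far below one, which fits the later observation that most trials see no faults at all.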

6 COMPLEXITY IN FAULT INTERACTIONS WITH ECC
Several techniques (SECDED, Chipkill, Sparing) are often used in combination with periodic Scrubbing.
These techniques interact in complex ways with fault modes and granularities. How do we evaluate their effectiveness?

7 ANALYTICAL MODELS FOR MEMORY RELIABILITY
Analytical models are complex and cumbersome, and they change with the fault model.
A PRDC paper* has a nearly 3-page model for Chipkill.
A small change in ECC leads to massive changes in the model.
Use empirical evaluation instead of analytical models.
*Jian et al., PRDC 2013

8 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY

9 FAULTSIM: A MONTE-CARLO FAULT SIMULATOR
FaultSim is written in C++. It is configured at the command line or through a file. The configuration has three parts:
– Memory system description: chips per rank, number of channels, and the interconnect to the processor.
– Fault rate per component: derived from field studies; can be changed.
– Mitigation technique(s): can be a combination, used with or without scrubbing.

10 FAULTSIM OPERATION
Divide the system lifetime (7 years) into smaller intervals (3 hours each): ~20,000 intervals.
Check for uncorrectable failures in every interval.
Repeat for ~1 million trials.
FaultSim performs 20K × 1 million interval simulations per chip for each fault type, which leads to days of simulation time.
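The interval-based loop described above can be sketched as follows. This is an illustrative Python sketch and not FaultSim's C++ code. The names `num_intervals` and `interval_based_trial` are hypothetical, the per-interval fault probability assumes 1 FIT = 1 failure per 10^9 device-hours, and the ECC check is left as a comment.

```python
import random

HOURS_PER_YEAR = 24 * 365


def num_intervals(lifetime_years: float = 7, interval_hours: float = 3) -> int:
    """Number of simulation intervals in the system lifetime."""
    return int(lifetime_years * HOURS_PER_YEAR / interval_hours)


def interval_based_trial(fit_per_chip, chips, rng, lifetime_years=7, interval_hours=3):
    """Run one trial and return (faults injected, random numbers drawn)."""
    # Probability that a given chip develops a fault within one interval.
    p_fault = fit_per_chip * 1e-9 * interval_hours
    faults = 0
    draws = 0
    for _ in range(num_intervals(lifetime_years, interval_hours)):
        for _ in range(chips):
            draws += 1  # one random number per chip per interval
            if rng.random() < p_fault:
                faults += 1
        # A real simulator would check for uncorrectable errors here.
    return faults, draws


print(num_intervals())  # 7 years in 3-hour steps
faults, draws = interval_based_trial(60.7, chips=18, rng=random.Random(1))
print(faults, draws)
```

Nearly all of the random draws in a trial find no fault, and that wasted work is what the event-based method later in the talk removes.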

11 FAULTSIM: DATA STRUCTURES
Memory chips are organized as Fault Domains.
A Fault Domain (FD) consists of Fault Ranges (FR).
Each FR uses Address (ADDR) and Mask fields.
Space-efficient representation: large- and small-granularity faults use only one type of FR data structure.
[Figure: hierarchy of Channel, Chip, Fault]

12 FAULT REPRESENTATION: EXAMPLE
Memory with 8 rows and 8 bits per row (6 address bits).
Fault ranges A, B and C (A and B intersect).
Mask field: where Mask bit i == 1, fault address bit i can be 0 or 1.
Address field: the specific address bit values where Mask bit i == 0.
Fault intersection is computed from the mask and address fields.
FR   Address   Mask
A    000001    011000
B    010000    000111
C    110000    000111
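The intersection test for the example above can be sketched in a few lines. This is a minimal Python illustration rather than FaultSim's C++ implementation. The class name `FaultRange` and its methods are hypothetical, and the table is read as 6-bit address and mask fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRange:
    addr: int  # address bit values; only meaningful where the mask bit is 0
    mask: int  # 1 = this address bit may be 0 or 1 (wildcard)

    def intersects(self, other: "FaultRange") -> bool:
        # Bits that are fixed (not wildcards) in both ranges must agree.
        fixed_in_both = ~(self.mask | other.mask)
        return ((self.addr ^ other.addr) & fixed_in_both) == 0

    def size(self) -> int:
        # Each wildcard bit doubles the number of covered addresses.
        return 1 << bin(self.mask).count("1")


A = FaultRange(addr=0b000001, mask=0b011000)
B = FaultRange(addr=0b010000, mask=0b000111)
C = FaultRange(addr=0b110000, mask=0b000111)

print(A.intersects(B), A.intersects(C), B.intersects(C))
print(A.size(), B.size(), C.size())
```

With these values only A and B overlap. The same structure describes a single-bit fault (mask of all zeros) and a whole-bank fault (mask of all ones), which is the space-efficiency argument made on the data-structures slide.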

13 VALIDATION: WITH ANALYTICAL MODELS
System: 1 rank, 18 chips.
FaultSim closely follows the analytical model (within 2%).

14 RESULTS: SIMULATION TIME
Time for a million trials with FaultSim:
Repair Scheme   Simulation Time (Wall Clock)
SECDED          49.5 hours
ChipKill        49.2 hours
FaultSim still has a simulation time on the order of days. How do we reduce this to less than a minute?

15 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY

16 OBSERVATION: FEW FAULTS IN SYSTEM LIFETIME
FaultSim consults the random number generator at least once during each interval (20K intervals).
System with 2 DIMMs, 9 chips each, over 7 years:
Num. Faults Encountered (Total)   Trials
0                                 92.9%
1                                 6.7%
2                                 0.2%
3+                                0.2%
Can we consult the random number generator in proportion to the number of faults, instead of in every time interval?

17 INSIGHT: COMPUTE DISTANCE TO NEXT FAULT
Example: Let the likelihood that a lottery ticket is a winner be 1/1000. We buy 5000 tickets. What is the likelihood of “X” winning tickets?
Naïve Method: Draw 5000 tickets and check each one to see whether it is a winner.
Distance Method: Compute the distance to the next winning ticket using an exponential distribution (avg=1000). Repeat until the sum of distances > 5000.
The time between events in a process in which events occur continuously and independently at a constant average rate is exponentially distributed.
[Figure: ticket axis from 0K to 5K with sampled distances 1.5K, 2K, 0.5K and 1.5K; the last distance exceeds 5K]
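The two methods in the lottery example can be compared directly. This is a minimal sketch with hypothetical function names. Note that the exponential gap is the continuous approximation of the discrete ticket process, so the two methods agree closely on the average but are not identical draw for draw.

```python
import random


def winners_naive(num_tickets, p_win, rng):
    """Check every ticket: one random draw per ticket."""
    return sum(1 for _ in range(num_tickets) if rng.random() < p_win)


def winners_by_distance(num_tickets, p_win, rng):
    """Jump from winner to winner using exponentially distributed gaps."""
    winners = 0
    position = rng.expovariate(p_win)  # mean gap = 1 / p_win tickets
    while position <= num_tickets:
        winners += 1
        position += rng.expovariate(p_win)
    return winners


rng = random.Random(2016)
naive = [winners_naive(5000, 1 / 1000, rng) for _ in range(200)]
dist = [winners_by_distance(5000, 1 / 1000, rng) for _ in range(200)]
print("naive mean:   ", sum(naive) / len(naive))
print("distance mean:", sum(dist) / len(dist))
```

The naive method draws 5000 random numbers per trial. The distance method draws one more than the number of winners, about six on average.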

18 FAULTSIM: EVENT-BASED FAULT INJECTION
Event-Based Fault Injection: When is the next fault?
The time-stamps of all faults are computed at the start of the simulation.
The simulation skips from one fault to the next.
Calls to the random number generator are reduced from 20K to 1 (or 2).
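Event-based injection can be sketched as precomputing a sorted list of fault time-stamps and then visiting only those moments. This is an illustrative Python sketch and not FaultSim's C++ code. The names `fault_events` and `run_trial` are hypothetical, and the rate conversion assumes 1 FIT = 1 failure per 10^9 device-hours.

```python
import random

HOURS_PER_YEAR = 24 * 365


def fault_events(fit_per_chip, chips, rng, lifetime_years=7):
    """Return a time-sorted list of (time_in_hours, chip_id) fault events."""
    lifetime_hours = lifetime_years * HOURS_PER_YEAR
    rate_per_hour = fit_per_chip * 1e-9
    events = []
    if rate_per_hour <= 0:
        return events
    for chip in range(chips):
        # Exponential gaps between faults; stop once past the lifetime.
        t = rng.expovariate(rate_per_hour)
        while t < lifetime_hours:
            events.append((t, chip))
            t += rng.expovariate(rate_per_hour)
    events.sort()
    return events


def run_trial(fit_per_chip, chips, rng):
    """Skip from one fault to the next instead of stepping through intervals."""
    visited = 0
    for time_hours, chip in fault_events(fit_per_chip, chips, rng):
        # A real simulator would insert the fault and run the ECC check here.
        visited += 1
    return visited


print(run_trial(60.7, chips=18, rng=random.Random(3)))
```

Each chip costs one random draw plus one per fault it actually develops, so a fault-free chip needs a single draw for the whole 7-year lifetime.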

19 RESULTS: SIMULATION TIME
Time for a million trials with FaultSim:
Scheme                      Simulation Time (Wall Clock)
SECDED (Interval Based)     49.5 hours
SECDED (Event Based)        34 seconds
ChipKill (Interval Based)   49.2 hours
ChipKill (Event Based)      33 seconds
FaultSim is ~5000x faster with Event-Based Fault Injection, which brings reliability simulation to less than one minute.

20 OBTAINING AND RUNNING FAULTSIM
Clone it from GitHub:
$ git clone https://github.com/Prashant-GTech/FaultSim-A-Memory-Reliability-Simulator
Running FaultSim:
./faultsim --help (for a list of command line parameters)
./faultsim --configfile configs/DIMM_none.ini --outfile out.txt

21 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY

22 Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures [MICRO 2014]
Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech)

23 INTRODUCTION TO 3D DRAM
DRAM systems face a bandwidth wall.
3D DRAM: stack DRAM dies over each other.
Use Through-Silicon Vias (TSVs) to connect the dies.
A higher density of TSVs gives higher bandwidth.
Go 3D to scale the bandwidth wall.
[Figure courtesy of Micron, ExtremeTech]

24 FAILURES IN 3D DRAM
3D DRAM dies communicate using TSVs.
A new failure mode: TSV failures.
TSV failures are large-granularity failures.
TSVs present a new kind of large-granularity failure.

25 A NEW FAILURE MODE FROM TSVs
TSVs are the conduit for addresses and data.
There are mainly two types of TSV faults:
– Data (incorrect data fetched from the DRAM die)
– Address (incorrect address presented to the DRAM die)
TSV faults cause unavailability of data and addresses.
[Figure: Logic Die and TSVs, showing an Address TSV fault and a Data TSV fault]

26 CAPTURING THE EFFECT OF TSV FAULTS
Data TSV fault: a few columns become faulty.
Address TSV fault: 50% of the memory becomes unavailable.
TSV faults manifest as column or bank failures.
[Figure: DRAM bank with Row Decoder, Column Decoder, Addr. TSVs and Data TSVs, showing a faulty Data TSV and a faulty Addr. TSV]
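One way to see why a single address TSV fault removes half of the memory is to model the faulty line as stuck at a fixed value. The stuck-at model and the function name `reachable_addresses` are assumptions made for illustration, since the slide states only the 50% figure.

```python
def reachable_addresses(addr_bits, stuck_bit, stuck_value):
    """Addresses the DRAM die can still see when one address line is stuck."""
    reachable = set()
    for addr in range(1 << addr_bits):
        # Whatever the controller sends, the die sees the stuck value on that line.
        seen = (addr & ~(1 << stuck_bit)) | (stuck_value << stuck_bit)
        reachable.add(seen)
    return reachable


total = 1 << 6  # small 6-bit address space, for illustration only
ok = reachable_addresses(addr_bits=6, stuck_bit=3, stuck_value=0)
print(f"{len(ok)} of {total} addresses reachable")
```

Under this model, every address whose bit on the stuck line differs from the stuck value is unreachable, which is half of the address space regardless of which line fails.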

27 IMPACT OF TSV FAULTS
System: 8GB Stacked Memory (HBM).
[Figure: Prob. System Failure, i.e. Prob(Uncorrectable Error), on a log scale from 10^-3 to 1, with TSV faults (Yes) and without (No); annotated 22X]
Takeaway: efficient techniques are needed to mitigate TSV faults.

28 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– Citadel: TSV-SWAP

29 DESIGN-TIME TSV SPARING
Designers provision spare TSVs alongside Data TSVs and Address TSVs.
Additional spare TSVs can replace faulty TSVs.
[Figure: DRAM bank with Row Decoder, Column Decoder and SPARE TSVs]

30 DESIGN-TIME TSV SPARING: OPERATION
Deactivate broken TSVs.
Activate spare TSVs.
Deactivation of faulty TSVs and activation of spare TSVs are performed at design time.
[Figure: DRAM bank with Row Decoder, Column Decoder and SPARE TSVs; the faulty Data TSV and faulty Addr. TSV are marked ✖]

31 DESIGN-TIME TSV SPARING: PROBLEMS
Additional TSVs are required for TSV sparing.
What happens if TSVs turn faulty at runtime?

32 TSV-SWAP: RUNTIME TSV SPARING
STEP-1: CREATE STANDBY TSVs
Use a few Data TSVs as Standby TSVs.
Replicate the standby data in ECC.
Data TSVs are reused as Standby TSVs.
[Figure: DRAM bank with Row Decoder, Column Decoder and standby TSVs; labels: Data, Cache, ECC, Standby]

33 TSV-SWAP: RUNTIME TSV SPARING
STEP-2: DETECTING FAULTY TSVs
Data vs. Address TSV faults are distinguished using CRC-32 + BIST.
CRC-32 covers address + data.
BIST diagnoses faulty TSVs.
[Figure: DRAM bank with Row Decoder, Column Decoder and standby TSVs]

34 TSV-SWAP: RUNTIME TSV SPARING
STEP-3: REDIRECTING FAULTY TSVs
Swap faulty TSVs with Standby TSVs at runtime.
TSV-SWAP is a runtime technique that does not rely on additional spare TSVs.
[Figure: DRAM bank with Row Decoder, Column Decoder and standby TSVs; faulty TSVs are swapped with standby TSVs]

35 USING FAULTSIM TO EVALUATE 3D MEMORY
FaultSim is used to evaluate TSV sparing in 3D memory.
[Figure: Probability of Failure vs. Placement of Cache Line; from “Citadel”, MICRO 2014]

36 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY

37 XED: Exposing On-Die Error Detection Information for Strong Memory Reliability [ISCA 2016]
Prashant Nair (Georgia Tech), Vilas Sridharan (AMD Inc.), Moinuddin Qureshi (Georgia Tech)

38 INTRODUCTION
DRAM: the building block of main memory for decades.
DRAM scaling provides higher memory capacity. Moving to a smaller node provides ~2x capacity.
Shrinking DRAM cells is difficult, which is a threat to scaling.
Broken cells reduce yield.
Solution: ECC inside DRAM chips (On-Die ECC).
Future DRAM chips: On-Die ECC to enable scaling.
[Figure: DRAM cell capacitor; a mechanically unstable cell tilting towards ground]

39 UTILITY OF ON-DIE ECC
On-Die ECC cannot fix multi-bit errors.
Even using an ECC-DIMM with On-Die ECC does not increase reliability.
Ideally we want Chipkill-level reliability: 43x more.
But Chipkill requires activating at least 18 chips.
If only we could use an ECC-DIMM and activate only 9 chips to get Chipkill.
Can we use On-Die Error Information for this?
Goal: Use ECC-DIMM → Expose On-Die Error Information → Chipkill

40 LOOKING AT A MEMORY REQUEST
Cache line = 64 bytes.
Each chip in a DIMM stores 8B of data.
On-Die ECC computes ECC within each DRAM die.
For compatibility (DDR4) and to reduce bandwidth, On-Die ECC is invisible to the memory controller.

41 XED: ERROR DETECTION + PARITY
On-Die ECC can be reused as an error detection code.
On-Die ECC detects the chip failure; the parity chip can recover the error.
XED: Expose error detection information in a 9-chip ECC-DIMM to fix chip faults → Chipkill

42 HOW TO EXPOSE ON-DIE ERROR INFO?
Use additional wires
–Makes the solution incompatible with current systems
–Current systems are starved for pin bandwidth
Add one additional burst or transaction
–Consumes bandwidth
–8 bursts become 10 bursts → 20% increase in bandwidth
–An additional transaction takes 100% more bandwidth
Goal: Expose On-Die Error Detection with negligible changes

43 EXPOSING ON-DIE ECC FOR FREE
If there are no errors, the DRAM sends all Data + Parity.
The memory controller detects that there is no error.
[Figure: data chips send D0 D1 D2 D3 D4 D5 D6 D7 and the parity chip sends PA to the Memory Controller]

44 EXPOSING ON-DIE ECC USING CATCHWORD
On detecting an error, the DRAM chip sends a 64-bit “Catch-Word” instead of data.
64-bit Catch-Words efficiently identify the faulty chip.
[Figure: chips send CW D1 D2 D3 D4 D5 D6 D7 and PA to the Memory Controller; the faulty chip sends the 64-bit catch-word CW in place of its data]

45 WHY CATCH-WORDS WORK
If the Catch-Word is different from the data, then there is a parity mismatch → Detected Error!!
If the Catch-Word is the same as the data (Collision), then there is no parity mismatch → No Error
–Update Catch-Word on Collision
Collision of Catch-Words (a rare event) still leads to correct results.
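The catch-word cases above can be illustrated from the memory controller's side. This is a hedged sketch, not the XED implementation. It assumes the parity chip stores the XOR of the eight 64-bit data words, and both the `CATCH_WORD` value and the function names are hypothetical. Faults in the parity chip itself are not handled.

```python
from functools import reduce

CATCH_WORD = 0xC0FFEE_DEAD_BEEF_42  # hypothetical 64-bit catch-word value


def xor_all(words):
    """XOR of a sequence of integers (assumed parity function)."""
    return reduce(lambda a, b: a ^ b, words, 0)


def controller_read(words, parity):
    """words: list of the eight 64-bit values received from the data chips."""
    if xor_all(words) == parity:
        # No parity mismatch: use the data as received. This also covers a
        # collision, where real data happens to equal the catch-word.
        return list(words)
    # Parity mismatch: the chip that sent the catch-word is the faulty one.
    suspects = [i for i, w in enumerate(words) if w == CATCH_WORD]
    if len(suspects) != 1:
        raise ValueError("cannot locate a single faulty chip")
    bad = suspects[0]
    others = words[:bad] + words[bad + 1:]
    fixed = list(words)
    fixed[bad] = xor_all(others) ^ parity  # rebuild the lost word from parity
    return fixed


data = [i * 0x0101010101010101 for i in range(1, 9)]
parity = xor_all(data)
received = list(data)
received[3] = CATCH_WORD  # chip 3 detected an error and sent the catch-word
print(controller_read(received, parity) == data)
```

The catch-word acts as an erasure pointer: once the controller knows which chip failed, a single parity word is enough to rebuild that chip's 64 bits.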

46 WHY COLLISIONS ARE NOT A PROBLEM
A chip stores 64 bits per cache line → 2^64 combinations.
However, even a 16Gb chip has only 2^28 cache lines.
Thus, even if the entire chip contained different data, there are still ~10^19 combinations free.
A random 64-bit catch-word will most likely not collide.
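The collision argument is plain arithmetic and can be checked directly. The variable names are illustrative, and the probability is an upper bound that assumes every cache line in the chip holds a distinct 64-bit value.

```python
CHIP_BITS = 16 * 2**30   # a 16Gb chip
BITS_PER_LINE = 64       # each chip supplies 64 bits per cache line

lines_per_chip = CHIP_BITS // BITS_PER_LINE
patterns = 2**BITS_PER_LINE
free_patterns = patterns - lines_per_chip

# Chance that a randomly chosen catch-word equals some stored 64-bit value,
# in the worst case where all stored values are different.
collision_prob = lines_per_chip / patterns

print(f"cache lines per chip: 2^{lines_per_chip.bit_length() - 1}")
print(f"free 64-bit patterns: {free_patterns:.3e}")
print(f"collision probability: {collision_prob:.3e}")
```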

47 RESULTS
A look at current commercial ECC schemes:
Chipkill activates more chips → low performance.
SECDED activates few chips → low reliability.
XED → high reliability + high performance.
ECC Scheme   Chips   Strength of ECC
SECDED       9       1x
Chipkill     18      43x
XED          9       172x
XED provides very high reliability with no performance loss.

48 OVERVIEW
– WHY FAULTSIM?
– FAULTSIM: WHAT AND HOW?
– FAULTSIM: LESS THAN 1 MINUTE
– FAULTSIM: APPLIED TO 3D MEMORY
– FAULTSIM: APPLIED WITH ON-DIE ECC
– SUMMARY

49 SUMMARY
Memory reliability is becoming increasingly important, and there is a need for evaluation tools.
We introduce FaultSim: an efficient and fast memory reliability simulator.
FaultSim is ~5000x faster than an interval-based Monte-Carlo simulator.
FaultSim can be used for evaluating the reliability of stacked memories and future memory systems.

