
FaultSim: A Fast, Configurable Memory-Reliability Simulator for Conventional and 3D-Stacked Systems* Memory Reliability Tutorial: HPCA-2016 Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech) *HiPEAC-2016

HOW TO EVALUATE MEMORY RELIABILITY? Goal: accurately evaluate memory reliability across different systems and solutions, in less than one minute. Performance and power already have mature memory-system simulators (performance: DRAMSim2, USIMM, NVMain; power: DRAMSim2, USIMM, NVSim, CACTI), but reliability does not. Fast and accurate simulators are vital to compare the effectiveness of different solutions.

TYPES OF MEMORY FAILURES DRAM devices can encounter faults during operation: permanent failures and transient failures. Memory-reliability evaluations must account for both transient and permanent failures. [Diagram: cores + caches connected to DRAM devices.]

GRANULARITY OF MEMORY FAILURES Failures occur at small and large granularities: bit, word, column, row, bank, multi-bank. A memory-reliability simulator should capture the interaction of failures at different granularities. [Diagram: single DRAM die, top view, showing its banks.]

REAL-WORLD FAILURE RATES [SRIDHARAN+ SC'13]
[Table: transient and permanent fault rates (FIT) per failure mode (bit, word, column, row, bank); most numeric values were lost in transcription. Recovered values: bank 0.8 FIT transient, 10 FIT permanent; one bracketed group of rates sums to 24.1 FIT. Checkmarks indicate which modes SECDED and Chipkill cover.]
1. Permanent faults are >2x as likely as transient faults.
2. Large-granularity faults are as common as bit faults.

COMPLEXITY IN FAULT INTERACTIONS WITH ECC Several techniques (SECDED, Chipkill, sparing) are often used in combination with periodic scrubbing. These techniques interact with fault modes and granularities in complex ways: how do we evaluate their effectiveness?

ANALYTICAL MODELS FOR MEMORY RELIABILITY Analytical models are complex, cumbersome, and change with the fault model: one PRDC paper* needs nearly three pages of equations to model Chipkill, and a small change in the ECC scheme means massive changes to the model. Use empirical evaluation instead of analytical models. *Jian et al., PRDC 2013

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- FAULTSIM: APPLIED WITH ON-DIE ECC
- SUMMARY

FAULTSIM: A MONTE-CARLO FAULT SIMULATOR FaultSim is written in C++ and is configured at the command line or through a configuration file. The configuration describes: (1) the memory system, including chips per rank, number of channels, and the interconnect to the processor; (2) the fault rate per component, derived from field studies and easily changed; and (3) the mitigation technique(s), which can be used in combination and with or without scrubbing.

FAULTSIM OPERATION FaultSim divides the system lifetime (7 years) into small intervals (3 hours each), giving about 20,000 intervals, and checks for uncorrectable failures in every interval. With roughly 1 million trials, this amounts to 20K x 1 million interval simulations per chip for each fault type, which takes days of simulation time.
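To make that cost concrete, here is a minimal sketch of one interval-based Monte-Carlo trial as described above. This is my own illustration, not FaultSim's actual code: the Fault struct, the FIT-to-probability conversion, and all names are assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical fault record; the real simulator stores address/mask ranges.
struct Fault { int type; double time_hours; };

// Convert a FIT rate (failures per 10^9 device-hours) into the probability of
// a fault in one short interval (for small p, p ~= rate * interval).
double interval_fault_probability(double fit, double interval_hours) {
    return fit * 1e-9 * interval_hours;
}

// One interval-based trial: walk every 3-hour interval of a 7-year lifetime
// and draw a random number per fault type per interval (~20K draws each).
std::vector<Fault> run_trial_interval_based(const std::vector<double>& fit_rates,
                                            std::mt19937_64& rng) {
    const double kLifetimeHours = 7.0 * 365 * 24;
    const double kIntervalHours = 3.0;
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<Fault> faults;

    for (double t = 0.0; t < kLifetimeHours; t += kIntervalHours) {
        for (std::size_t f = 0; f < fit_rates.size(); ++f) {
            if (coin(rng) < interval_fault_probability(fit_rates[f], kIntervalHours)) {
                faults.push_back({static_cast<int>(f), t});
                // The real simulator would now test whether the accumulated faults
                // defeat the configured ECC/repair scheme (SECDED, Chipkill, ...).
            }
        }
    }
    return faults;
}
```

Every trial performs roughly 20,000 random draws per fault type, which is why a million trials take days.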

FAULTSIM: DATA STRUCTURES Memory chips are organized as Fault Domains (FDs); a Fault Domain consists of Fault Ranges (FRs); each FR uses an Address (ADDR) field and a Mask field. This representation is space-efficient: large- and small-granularity faults use only one type of FR data structure. [Diagram: channel, chip, fault hierarchy.]

FAULT REPRESENTATION: EXAMPLE Consider a memory with 8 rows and 8 bits per row, and fault ranges A, B, and C (A and B intersect). In the Mask field, a set bit i means fault address bit i can be either 0 or 1; the Address field gives the specific address bit values wherever Mask bit i == 0. Fault intersection is computed from the mask and address fields. [Table: the Address and Mask values for FRs A, B, and C were not recovered from the slide.]
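A minimal sketch of the address/mask representation and the intersection test it enables, assuming the semantics described above (a set mask bit means "either value"). This is my own illustration, not FaultSim's actual data structure; the names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical fault-range record: mask bit i == 1 means address bit i is a
// wildcard (the fault covers both 0 and 1); where mask bit i == 0, 'addr'
// gives the specific value of that address bit.
struct FaultRange {
    uint64_t addr;
    uint64_t mask;
};

// Two fault ranges intersect iff they agree on every address bit that is
// fixed (non-wildcard) in both ranges; bits wildcarded in either range can
// always be chosen to match.
bool intersects(const FaultRange& a, const FaultRange& b) {
    uint64_t fixed_in_both = ~a.mask & ~b.mask;
    return ((a.addr ^ b.addr) & fixed_in_both) == 0;
}
```

This is how a single data structure covers both granularities: a whole-row fault simply wildcards the column bits, while a single-bit fault fixes every bit, and the same test decides whether they overlap.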

VALIDATION: WITH ANALYTICAL MODELS FaultSim closely follows the analytical model (within 2%). System: 1 rank, 18 chips.

RESULTS: SIMULATION TIME FaultSim still has a simulation time on the order of days. How do we reduce this to less than a minute?
Time for a million trials with FaultSim:
  Repair scheme | Simulation time (wall clock)
  SECDED        | 49.5 hours
  ChipKill      | 49.2 hours

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- FAULTSIM: APPLIED WITH ON-DIE ECC
- SUMMARY

OBSERVATION: FEW FAULTS IN SYSTEM LIFETIME FaultSim consults the random number generator at least once during each of the 20K intervals. For a system with 2 DIMMs of 9 chips each, over 7 years:
  Num. faults encountered (total) | Trials
  0                               | 92.9%
  1                               | 6.7%
  2                               | 0.2%
  3+                              | 0.2%
Can we consult the random number generator in proportion to the number of faults, instead of once every time interval?

INSIGHT: COMPUTE DISTANCE TO NEXT FAULT The time between events in a process in which events occur continuously and independently at a constant average rate is exponentially distributed. Example: let the likelihood of a lottery ticket being a winner be 1/1000, and suppose we buy 5000 tickets. What is the likelihood of X winning tickets? Naive method: draw 5000 tickets and check each one for a win. Distance method: repeatedly compute the distance to the next winning ticket using an exponential distribution (average = 1000), until the sum of distances exceeds 5000. [Diagram: ticket axis 1K-5K with sampled distances 1.5K, 2K, 0.5K; the next sampled distance of 1.5K exceeds the limit.]

FAULTSIM: EVENT-BASED FAULT INJECTION Event-based fault injection asks: when is the next fault? The timestamps of all faults are computed at the start of the simulation, and the simulation skips from one fault to the next. Calls to the random number generator are reduced from 20K to 1 (or 2).
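A minimal sketch of event-based fault injection under the exponential-distribution insight above. This is my own illustration, not FaultSim's code; the FIT-to-rate conversion and the function name are assumptions.

```cpp
#include <random>
#include <vector>

// Event-based fault injection: instead of testing every 3-hour interval,
// sample the time to the next fault directly from an exponential distribution
// whose rate comes from the FIT value (failures per 10^9 device-hours).
std::vector<double> sample_fault_times(double fit, double lifetime_hours,
                                       std::mt19937_64& rng) {
    std::exponential_distribution<double> next_gap(fit * 1e-9);  // rate per hour
    std::vector<double> fault_times;

    double t = next_gap(rng);          // often already past the lifetime
    while (t < lifetime_hours) {       // so most trials need just one draw
        fault_times.push_back(t);
        t += next_gap(rng);
    }
    return fault_times;
}
```

Because realistic FIT rates make the mean gap far longer than the 7-year lifetime, most trials finish after a single draw, which matches the 20K-to-1 reduction in random-number calls.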

RESULTS: SIMULATION TIME FaultSim is ~5000x faster with event-based fault injection, enabling reliability simulation in less than one minute.
Time for a million trials with FaultSim:
  Scheme                     | Simulation time (wall clock)
  SECDED (interval-based)    | 49.5 hours
  SECDED (event-based)       | 34 seconds
  ChipKill (interval-based)  | 49.2 hours
  ChipKill (event-based)     | 33 seconds

OBTAINING AND RUNNING FAULTSIM
Clone it from GitHub:
  $ git clone
Running FaultSim (--help lists the command-line parameters):
  ./faultsim --help
  ./faultsim --configfile configs/DIMM_none.ini --outfile out.txt

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- FAULTSIM: APPLIED WITH ON-DIE ECC
- SUMMARY

Citadel: Efficiently Protecting Stacked Memory From Large Granularity Failures [MICRO 2014] Prashant Nair (Georgia Tech), David Roberts (AMD Research), Moinuddin Qureshi (Georgia Tech)

INTRODUCTION TO 3D DRAM DRAM systems face a bandwidth wall; going 3D scales past it. 3D DRAM stacks DRAM dies on top of each other and uses through-silicon vias (TSVs) to connect the dies: a higher density of TSVs gives higher bandwidth. [Images courtesy Micron, ExtremeTech.]

FAILURES IN 3D DRAM The dies in 3D DRAM communicate using TSVs, which introduces a new failure mode: TSV failures. TSVs thus present a new kind of large-granularity failure.

A NEW FAILURE MODE FROM TSVs TSVs are the conduit for both addresses and data, so there are mainly two types of TSV faults: data TSV faults (incorrect data fetched from the DRAM die) and address TSV faults (incorrect address presented to the DRAM die). TSV faults cause unavailability of data and addresses. [Diagram: logic die with a faulty address TSV and a faulty data TSV.]

CAPTURING THE EFFECT OF TSV FAULTS A data TSV fault leaves a few columns faulty; an address TSV fault makes 50% of the memory unavailable. TSV faults are therefore manifested as column/bank failures. [Diagram: DRAM bank with row and column decoders, showing a faulty data TSV and a faulty address TSV.]

IMPACT OF TSV FAULTS System: 8GB stacked memory (HBM). [Chart: probability of system failure, i.e., Prob(uncorrectable error), with and without TSV faults; including TSV faults raises it by 22X.] Efficient techniques are needed to mitigate TSV faults.

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- Citadel: TSV-SWAP

DESIGN-TIME TSV SPARING Designers provision spare TSVs alongside the data TSVs and address TSVs; these additional spare TSVs can replace faulty TSVs. [Diagram: DRAM bank with row and column decoders and spare TSVs.]

DESIGN-TIME TSV SPARING: OPERATION Broken TSVs are deactivated and spare TSVs are activated in their place. This deactivation of faulty TSVs and activation of spare TSVs is performed at design time. [Diagram: faulty data and address TSVs remapped onto spare TSVs.]

DESIGN-TIME TSV SPARING: PROBLEMS Additional TSVs are required for TSV sparing, and what happens if TSVs turn faulty at runtime?

TSV-SWAP: RUNTIME TSV SPARING, STEP 1: CREATE STANDBY TSVs A few data TSVs are reused as standby TSVs, and the data they would have carried is replicated in the ECC bits. [Diagram: DRAM bank with data TSVs reused as standby TSVs; standby data cached in ECC.]

TSV-SWAP: RUNTIME TSV SPARING, STEP 2: DETECTING FAULTY TSVs Data vs. address TSV faults are distinguished using CRC-32 + BIST: a CRC-32 over the address and data detects errors, and BIST diagnoses which TSVs are faulty. [Diagram: DRAM bank with standby TSVs.]
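As a rough illustration of the detection step only, here is a generic bitwise CRC-32 (standard IEEE polynomial) that could be computed over the concatenated address and data bits. This is my own sketch, not Citadel's detection logic, and the BIST-based diagnosis of which TSV is faulty is not shown.

```cpp
#include <cstddef>
#include <cstdint>

// Generic bitwise CRC-32 (reflected IEEE polynomial 0xEDB88320). A mismatch
// between the CRC computed at the sender and at the receiver indicates that
// some bit of the transfer (i.e., some TSV) was corrupted.
uint32_t crc32(const uint8_t* bytes, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= bytes[i];
        for (int bit = 0; bit < 8; ++bit) {
            uint32_t lsb_mask = 0u - (crc & 1u);          // all-ones if LSB set
            crc = (crc >> 1) ^ (0xEDB88320u & lsb_mask);
        }
    }
    return ~crc;
}
```

In the scheme the slide describes, BIST would then drive known patterns over the TSVs so that repeated check failures localize the faulty data or address TSV.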

TSV-SWAP: RUNTIME TSV SPARING, STEP 3: REDIRECTING FAULTY TSVs Faulty TSVs are swapped with standby TSVs at runtime. TSV-SWAP is a runtime technique that does not rely on additional spare TSVs. [Diagram: faulty TSVs swapped onto standby TSVs.]

USING FAULTSIM TO EVALUATE 3D MEMORY FaultSim was used to evaluate TSV sparing in 3D memory. [Chart: probability of failure vs. placement of cache line, from "Citadel", MICRO 2014.]

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- FAULTSIM: APPLIED WITH ON-DIE ECC
- SUMMARY

XED: Exposing On-Die Error Detection Information for Strong Memory Reliability [ISCA 2016] Prashant Nair (Georgia Tech), Vilas Sridharan (AMD Inc.), Moinuddin Qureshi (Georgia Tech)

INTRODUCTION DRAM has been the building block of main memory for decades, and DRAM scaling provides higher memory capacity: moving to a smaller node provides ~2x capacity. However, shrinking DRAM cells is difficult and threatens scaling, because broken cells reduce yield. The solution is ECC inside the DRAM chips (on-die ECC): future DRAM chips will use on-die ECC to enable scaling. [Diagram: a mechanically unstable DRAM cell with its capacitor tilting towards ground.]

UTILITY OF ON-DIE ECC On-die ECC cannot fix multi-bit errors, and even using an ECC-DIMM with on-die ECC does not increase reliability. Ideally we want Chipkill-level reliability (43x more), but Chipkill requires activating at least 18 chips. If only we could use an ECC-DIMM and activate only 9 chips to get Chipkill. Can we use the on-die error information for this? Goal: use an ECC-DIMM, expose the on-die error information, and get Chipkill.

LOOKING AT A MEMORY REQUEST A cache line is 64 bytes, and each chip in a DIMM stores 8 bytes of that data; on-die ECC is computed within each DRAM die. For compatibility (DDR4) and to reduce bandwidth, on-die ECC is invisible to the memory controller.

XED: ERROR DETECTION + PARITY The on-die ECC can be reused as an error detection code: the on-die ECC detects a chip failure, and the parity chip can recover the error. XED exposes this error-detection information in a 9-chip ECC-DIMM to fix chip faults, achieving Chipkill.
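A minimal sketch of the parity-recovery step, illustrating the general XOR-parity idea rather than the XED paper's implementation; the 8-bytes-per-chip layout follows the previous slide, and the names are hypothetical.

```cpp
#include <array>
#include <cstdint>

// Hypothetical 9-chip ECC-DIMM view: eight data chips each contribute 8 bytes
// (one uint64_t) of a 64-byte cache line; the ninth chip stores XOR parity.
constexpr int kDataChips = 8;

uint64_t compute_parity(const std::array<uint64_t, kDataChips>& data) {
    uint64_t parity = 0;
    for (uint64_t d : data) parity ^= d;
    return parity;
}

// Once on-die ECC flags which chip failed, its 8 bytes are rebuilt by XORing
// the surviving chips' data with the parity chip's contents.
uint64_t reconstruct_chip(const std::array<uint64_t, kDataChips>& data,
                          uint64_t parity, int failed_chip) {
    uint64_t rebuilt = parity;
    for (int c = 0; c < kDataChips; ++c)
        if (c != failed_chip) rebuilt ^= data[c];
    return rebuilt;
}
```

The key point is that parity alone can only correct an erasure whose location is known, and that location is exactly what the exposed on-die error detection provides.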

HOW TO EXPOSE ON-DIE ERROR INFO? Option 1: use additional wires. This makes the solution incompatible with current systems, which are already starved for pin bandwidth. Option 2: add one additional burst or transaction. This consumes bandwidth: 8 bursts become 10 bursts (a 20% increase in bandwidth), and an additional transaction costs 100% extra bandwidth. Goal: expose on-die error detection with negligible changes.

EXPOSING ON-DIE ECC FOR FREE If there are no errors, the DRAM sends all the data plus parity (D0-D7 from the data chips and PA from the parity chip), and the memory controller detects that there is no error.

EXPOSING ON-DIE ECC USING CATCH-WORDS On detecting an error, the DRAM chip sends a 64-bit "catch-word" (CW) instead of its data; the 64-bit catch-words efficiently identify the faulty chip to the memory controller.

WHY CATCH-WORDS WORK If the catch-word is different from the data, there is a parity mismatch and the error is detected. If the catch-word happens to be the same as the data (a collision), there is no parity mismatch and no error is signaled; the catch-word is updated on a collision. A catch-word collision (a rare event) therefore still leads to correct results.

WHY COLLISIONS ARE NOT A PROBLEM A chip stores 64 bits per cache line, giving 2^64 possible values, whereas even a 16Gb chip has only 2^28 cache lines. Thus, even if the entire chip contained different data in every line, there would still be ~10^19 values free, so a random 64-bit catch-word will most likely not collide.
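A quick back-of-envelope version of this argument (my own arithmetic, consistent with the ~10^19 figure on the slide):

```latex
\[
P(\text{collision}) \;\le\; \frac{2^{28}}{2^{64}} \;=\; 2^{-36} \;\approx\; 1.5\times 10^{-11},
\qquad
2^{64} - 2^{28} \;\approx\; 1.8\times 10^{19}\ \text{values remain unused}.
\]
```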

RESULTS A look at current commercial ECC schemes: Chipkill activates more chips (low performance), while SECDED activates fewer chips (low reliability). XED gives high reliability and high performance.
  ECC scheme | Chips activated | Strength of ECC
  SECDED     | 9               | 1x
  Chipkill   | 18              | 43x
  XED        | 9               | 172x
XED provides very high reliability with no performance loss.

OVERVIEW
- WHY FAULTSIM?
- FAULTSIM: WHAT AND HOW?
- FAULTSIM: LESS THAN 1 MINUTE
- FAULTSIM: APPLIED TO 3D MEMORY
- FAULTSIM: APPLIED WITH ON-DIE ECC
- SUMMARY

SUMMARY Memory reliability is becoming increasingly important, and there is a need for evaluation tools. We introduce FaultSim, an efficient and fast memory-reliability simulator. FaultSim is ~5000x faster than an interval-based Monte-Carlo simulator, and it can be used to evaluate the reliability of stacked memories and future memory systems.