1
Virtualized and Flexible ECC for Main Memory
Doe Hyun Yoon and Mattan Erez
Dept. of Electrical and Computer Engineering, The University of Texas at Austin
ASPLOS 2010
Speaker note: introduce the paper.
2
Memory Error Protection
- Applying ECC uniformly: ECC DIMMs
  - Simple and transparent to programmers
- Error protection level
  - Fixed, design-time decision
- Chipkill-correct, used in high-end servers
  - Constrains the memory module design space
  - Allows only x4 DRAMs
  - Lower energy efficiency than x8 DRAMs
- Virtualized ECC objectives
  - Provide flexible memory error protection
  - Relax the design constraints of chipkill
3
Virtualized ECC
- Two-tiered error protection
  - Tier-1 Error Code (T1EC): simple error code for detection or light-weight correction
  - Tier-2 Error Code (T2EC): strong error-correcting code
- T2EC is stored within the memory namespace itself
  - The OS manages T2EC
- Flexible memory error protection
  - Different T2EC for different data pages
  - Stronger protection for more important data
4
Virtualized ECC – Example
[Figure: virtual pages i, j, and k in the virtual address space map to page frames i, j, and k in physical memory (virtual page to physical frame mapping); each frame holds data and T1EC. A physical frame to ECC page mapping points to ECC pages j and k, which hold T2EC for chipkill or T2EC for double chipkill. The error protection level axis runs from low to high.]
5
Virtualized ECC
6
Observations on Memory Errors
- Per-system error rate is still low
  - Most of the time, error detection finds no error
- Detecting errors is the common-case operation
  - Needs a low-latency, low-complexity error detection mechanism: T1EC
- Correcting errors is the uncommon-case operation
  - Correction can be complex and take a long time
  - But error correction information must still be kept somewhere: virtualized T2EC
7
Uniform ECC
[Figure: a virtual address (VPN, offset) is translated to a physical address (PFN, offset) that selects a page frame in physical memory; each frame stores data together with its ECC.]
8
Virtualized ECC
[Figure: a virtual address (VPN, offset) is translated to a physical address (PFN, offset) that selects a page frame holding data and T1EC. The physical address is further translated to an ECC address (EA: ECC page number, offset) that selects the T2EC in an ECC page.]
- The OS manages PFN to ECC page number (EPN) translation
- The offset is scaled according to the T2EC size
9
Virtualized ECC operation
- Read: fetch data and T1EC
  - T2EC is not needed in most cases
- Write: update data, T1EC, and T2EC
- ECC Address Translation Unit: fast PA to EA translation
- T2ECs of consecutive data lines map to one T2EC line
- T2EC lines can be partially valid
  - Only valid T2EC is updated in DRAM
[Figure: LLC, ECC Address Translation Unit, and DRAM (Rank 0 and Rank 1). DRAM holds data and T1EC at 0x0000 to 0x04c0 and T2EC lines for Rank 0 and Rank 1 data at 0x0500 to 0x05c0. Numbered steps: (1) Rd 0x00c0, line A; (2) Wr 0x0200, line B0; (3) PA 0x0200 to the ECC Address Translation Unit; (4) EA 0x0540; (5) Wr 0x0540, the T2EC line.]
10
Penalty with V-ECC
- Increased data miss rate
  - T2EC lines in the LLC reduce the effective LLC size
- Increased traffic due to T2EC write-back
  - One-way write-back traffic
  - Not on the critical path
11
Chipkill-Correct
12
Chipkill-correct
- Single Device-error Correct, Double Device-error Detect
  - Can tolerate a DRAM failure
  - Can detect a second DRAM failure
- Chipkill requires x4 DRAMs
  - x8 chipkill is impractical
  - But x8 DRAM is more energy efficient
13
Baseline x4 Chipkill
- Two x4 ECC DIMMs
  - 128-bit data + 16-bit ECC (redundancy overhead: 12.5%)
  - 4-check-symbol error code using 4-bit symbols
- Access granularity
  - 64B in DDR2 (min. burst 4 x 128 bit)
  - 128B in DDR3 (min. burst 8 x 128 bit)
[Figure: x4 DRAMs on a 144-bit wide data bus]
14
x8 Chipkill
- x8 chipkill with the same access granularity
  - 152-bit wide data path: 128-bit data + 24-bit ECC
  - Redundancy overhead: 18.75%
- Needs a custom-designed DIMM
  - Increases the system cost significantly
[Figure: x8 DRAMs on a 152-bit wide data bus]
15
x8 Chipkill w/ Standard DIMMs
- Increased access granularity
  - 128B in DDR2 (min. burst 4 x 256 bit)
  - 256B in DDR3 (min. burst 8 x 256 bit)
[Figure: x8 DRAMs on a 280-bit wide data bus]
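The access granularities and redundancy overheads quoted on the chipkill slides follow from simple arithmetic on data-bus width and minimum burst length. The sketch below reproduces that arithmetic; the function and dictionary names are chosen here for illustration and do not come from the presentation.

```python
def access_granularity_bytes(data_bus_bits: int, min_burst: int) -> int:
    """Bytes of data moved by one minimum-length burst."""
    return data_bus_bits * min_burst // 8


def redundancy_overhead(data_bits: int, ecc_bits: int) -> float:
    """ECC bits as a fraction of data bits."""
    return ecc_bits / data_bits


# (data bits, ECC bits) per bus beat, as quoted on the slides.
CONFIGS = {
    "x4 chipkill, 144-bit bus": (128, 16),
    "x8 chipkill, custom 152-bit bus": (128, 24),
    "x8 chipkill, standard DIMMs, 280-bit bus": (256, 24),
}

if __name__ == "__main__":
    for name, (data_bits, ecc_bits) in CONFIGS.items():
        ddr2 = access_granularity_bytes(data_bits, 4)   # DDR2: min. burst 4
        ddr3 = access_granularity_bytes(data_bits, 8)   # DDR3: min. burst 8
        overhead = redundancy_overhead(data_bits, ecc_bits)
        print(f"{name}: DDR2 {ddr2}B, DDR3 {ddr3}B, overhead {overhead:.3%}")
```

Doubling the data width to fit x8 chipkill on standard DIMMs lowers the relative overhead but doubles the access granularity, which is the tradeoff the slides describe.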
16
V-ECC for Chipkill
- Use a 3-check-symbol error code
  - Single Symbol-error Correct and Double Symbol-error Detect
- T1EC: 2 check symbols
  - Detects up to 2 symbol errors
- T2EC: 3rd check symbol
- Combined T1EC/T2EC provides chipkill
17
V-ECC: ECC x4 configuration
- Use an 8-bit symbol error code
  - 2 bursts out of a x4 DRAM form an 8-bit symbol
  - Modern DRAMs have a minimum burst of 4 or 8
- 1 x4 ECC DIMM + 1 x4 Non-ECC DIMM
- Each DRAM access in DDR2 (burst 4)
  - 64B data, 4B T1EC
- 2B T2EC is virtualized within the memory namespace
  - 32 T2ECs per 64B cache line
[Figure: x4 DRAMs on a 136-bit wide data bus carrying data and T1EC; T2EC is virtualized within memory]
18
V-ECC: ECC x8 configuration
- Use an 8-bit symbol error code
- 2 x8 ECC DIMMs
- Each DRAM access in DDR2 (burst 4)
  - 64B data, 8B T1EC
- 4B T2EC is virtualized
  - 16 T2ECs per 64B cache line
[Figure: x8 DRAMs on a 144-bit wide data bus carrying data and T1EC; T2EC is virtualized within memory]
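Because a T2EC is much smaller than a cache line, the T2ECs of several consecutive data lines share one 64B T2EC line. The sketch below shows the packing arithmetic under an assumed simple linear layout; the real V-ECC layout may also depend on rank interleaving and is not spelled out on the slides, so `t2ec_slot` is a hypothetical helper.

```python
CACHE_LINE_BYTES = 64


def t2ecs_per_line(t2ec_bytes: int) -> int:
    """How many T2ECs fit in one 64B T2EC cache line."""
    return CACHE_LINE_BYTES // t2ec_bytes


def t2ec_slot(data_line_index: int, t2ec_bytes: int):
    """Assumed linear packing: returns (T2EC line index, byte offset in that line)."""
    per_line = t2ecs_per_line(t2ec_bytes)
    line = data_line_index // per_line
    offset = (data_line_index % per_line) * t2ec_bytes
    return line, offset


if __name__ == "__main__":
    print(t2ecs_per_line(2))   # ECC x4: 2B T2EC per data line
    print(t2ecs_per_line(4))   # ECC x8: 4B T2EC per data line
    print(t2ec_slot(33, 2), t2ec_slot(33, 4))
```

With 2B T2ECs one cached T2EC line covers 32 data lines, and with 4B T2ECs it covers 16, so the x8 configuration needs more T2EC lines for the same amount of data.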
19
Flexible Error Protection
- A single hardware design with V-ECC can provide chipkill-detect, chipkill-correct, and double chipkill-correct
  - Use different T2EC for different pages
- Reliability/performance tradeoff
  - Maximize performance/power efficiency with chipkill-detect
  - Stronger protection at the cost of additional T2EC accesses
- T2EC size (columns: Chipkill-Detect, Chipkill-Correct, Double Chipkill-Correct)
  - ECC x4: 0B, 2B, 4B
  - ECC x8: 8B
20
Evaluation
21
Simulator/Workload
- GEMS + DRAMsim
  - An out-of-order SPARC V9 core
  - Exclusive two-level cache hierarchy
  - DDR2 800MHz, 12.8GB/s (128-bit wide data path), 1 channel, 4 ranks
- Power model
  - WATTCH for processor power, scaled to 45nm
  - CACTI for cache power, 45nm
  - Micron model for DRAM power, commodity DRAMs
- Workloads
  - 12 data-intensive applications from SPEC CPU 2006 and PARSEC
  - Microbenchmarks: STREAM and GUPS
Speaker note: say why these applications were chosen (memory intensive, worst-case behavior with V-ECC), and briefly explain STREAM and GUPS.
22
Normalized Execution Time
- Less than 1% penalty on average
- The performance penalty depends on spatial locality and write-back traffic
  - Low spatial locality and high write-back traffic: omnetpp, canneal, GUPS
  - Low spatial locality, but low write-back traffic: mcf
  - High write-back traffic, but high spatial locality: lbm
23
System Energy Efficiency
- Energy Delay Product (EDP) gain
  - ECC x4: 1.1% on average
  - ECC x8: 12.0% on average
[Figure: EDP chart; surviving data labels: 1.23, 10%, 20%, 17%, 12%]
Speaker note: emphasize that the error protection level is the same or stronger.
24
Flexible Error Protection
- A single hardware design can provide chipkill-detect, chipkill-correct, and double chipkill-correct dynamically
[Figure: results for Chipkill-Detect, Chipkill-Correct, and Double Chipkill-Correct]
25
Conclusion
- Virtualized ECC
  - Two-tiered error protection, virtualized T2EC
- Improved system energy efficiency with chipkill
  - Reduces DRAM power consumption by 27%
  - Improves system EDP by 12%
  - Performance penalty: 1% on average
- Error protection even for Non-ECC DIMMs
  - Can be used for GPU memory error protection
- Flexibility in error protection
  - Adaptive error protection level by user/system demand
  - Cost of error protection is proportional to the protection level
Speaker note: say there are more details in the paper: error protection for Non-ECC DIMMs, PA to EA translation, ECC address translation unit, T2EC management, …
27
Backup
28
Virtualized ECC Operations
- DRAM read
  - Fetch data and T1EC to detect errors
  - T2EC is not needed in most cases
- DRAM write-back
  - Update data, T1EC, and T2EC
  - Cache T2EC to exploit locality in T2EC accesses
  - Need to translate PA to EA
- On-chip ECC address translation unit
  - TLB-like structure for fast PA to EA translation
- Error correction
  - Need to read T2EC, which may be in the LLC or in DRAM
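The read, write-back, and correction paths listed above can be illustrated with a toy model. This is only a sketch of the control flow: CRC32 stands in for the T1EC (detection only) and a duplicate copy of the line stands in for the T2EC, neither of which is the symbol-based code used by V-ECC, and the class and attribute names are hypothetical.

```python
import zlib


class TwoTierMemory:
    """Toy model: reads touch data + T1EC; T2EC is read only after a detected error."""

    def __init__(self):
        self.data = {}        # address -> bytearray (data line)
        self.t1ec = {}        # address -> detection code (stand-in: CRC32)
        self.t2ec = {}        # address -> correction info (stand-in: a copy)
        self.t2ec_reads = 0   # counts how often the correction tier is needed

    def write(self, addr: int, line: bytes) -> None:
        # A write-back updates data, T1EC, and T2EC.
        self.data[addr] = bytearray(line)
        self.t1ec[addr] = zlib.crc32(line)
        self.t2ec[addr] = bytes(line)

    def read(self, addr: int) -> bytes:
        # Common case: data + T1EC only.
        line = bytes(self.data[addr])
        if zlib.crc32(line) == self.t1ec[addr]:
            return line
        # Uncommon case: error detected, fetch T2EC and repair the line.
        self.t2ec_reads += 1
        fixed = self.t2ec[addr]
        self.data[addr] = bytearray(fixed)
        return fixed


if __name__ == "__main__":
    mem = TwoTierMemory()
    mem.write(0x40, b"A" * 64)
    mem.read(0x40)                 # clean read, no T2EC access
    mem.data[0x40][3] ^= 0x10      # inject a single-bit fault
    print(mem.read(0x40) == b"A" * 64, mem.t2ec_reads)
```

The counter makes the design point visible: error-free reads never touch the second tier, so its cost is paid only on writes and on the rare correction.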
29
ECC Address Translation Unit
[Figure: LLC, ECC Address Translation Unit, and DRAM (Rank 0 and Rank 1). DRAM holds data and T1EC at 0x0000 to 0x04c0 and T2EC for Rank 0 and Rank 1 data at 0x0500 to 0x05c0. Numbered steps: (1) Rd 0x00c0, line A; (2) Wr 0x0200, line B0; (3) PA 0x0200 to the ECC Address Translation Unit; (4) EA 0x0540; (5) Wr 0x0540, the T2EC line.]
30
RECAP: V-ECC
- Two-tiered error protection
  - Uniform T1EC
  - Virtualized T2EC
- V-ECC for chipkill
  - ECC x4 configuration: saves 8 data pins
  - ECC x8 configuration: more energy efficient
- Flexible error protection
  - Different T2EC for different pages
  - Stronger protection for important data
  - No protection for unimportant data
31
Power Consumption
[Figure: DRAM power saving and total power saving; surviving data label: ECC x4: 4.2%]
32
Caching T2EC
- T2EC occupancy: less than 10% on average
  - Over 10% only in omnetpp, milc, canneal, and fluidanimate
- MPKI overhead: very small
- The higher the spatial locality, the smaller the impact on caching behavior
- x8 is affected more: 32 T2ECs per cache line in x4, but 16 T2ECs per cache line in x8
33
Traffic
- Traffic increase: less than 10% on average
  - Increased demand misses due to T2EC occupancy
  - T2EC traffic
- Spatial locality is important, and so is the amount of write-back traffic
  - mcf has little spatial locality, but also little write-back traffic
34
Virtualized ECC
- Uniform T1EC
  - Low-cost error detection or light-weight correction
- Virtualized T2EC
  - Corrects errors that T1EC detects but cannot correct
  - Cacheable and memory mapped
- A read accesses data and T1EC
  - T2EC is not needed most of the time
  - Simpler common-case read operations
- A write updates data, T1EC, and T2EC
35
Flexible Error Protection
- ECC x8 DRAM configuration
- Stronger error protection at the cost of more T2EC accesses
- The additional cost of double chipkill (relative to chipkill) is quite small
- Adaptation is at per-page granularity
36
What if BW is limited?
- Half DRAM BW: 6.4GB/s
- Emulates a CMP, where BW is more scarce
37
Virtualized ECC for Non-ECC DIMMs
38
ECC for non-ECC DIMMs
- Virtualize ECC in the memory namespace
- Not a two-tiered error protection
  - No uniform ECC storage (for T1EC)
  - The ECC is still called "T2EC" to keep the notation consistent
- Virtualized T2EC both detects and corrects errors
  - A DRAM read now also triggers a T2EC access
  - Increased T2EC traffic, increased T2EC occupancy, and a larger penalty
- But errors can be detected and corrected with non-ECC DIMMs
39
ECC Address Translation Unit
[Figure: non-ECC DIMM example with LLC, ECC Address Translation Unit, and DRAM (Rank 0 and Rank 1). DRAM holds data at 0x0000 to 0x04c0 and T2EC for Rank 0 and Rank 1 data at 0x0500 to 0x05c0; lines are labeled A, B, C, and D. Numbered steps: (1) Rd 0x0180; (2) PA 0x0180; (3) EA 0x0550; (4) Rd 0x0540; (5) Wr 0x0140; (6) PA 0x00c0; (7) EA 0x0510; (8) Rd 0x0510.]
40
DIMM configurations
- Use 2-check-symbol error codes
  - Can detect and correct up to 1 symbol error
  - No 2-symbol error detection
  - Weaker protection than chipkill, but better than nothing
- Can even use x16 DRAMs (far more energy efficient than x4 DRAMs)

Configuration   DRAM type   # Data DRAMs per rank   T2EC per 64B cache line
Non-ECC x4      x4          32                      4B
Non-ECC x8      x8          16                      8B
Non-ECC x16     x16         8                       16B
41
Performance and Energy Efficiency
- More performance degradation (compared to ECC DIMMs)
  - Every read accesses T2EC
  - More T2EC traffic and more T2EC occupancy in the LLC
- Energy efficiency is sometimes better
  - x16 DRAMs save a lot of DRAM power
- Performance degradation is low if spatial locality is good
42
Flexible error protection
- A page can have different T2EC sizes
- The error protection level of a page can be
  - No protection
  - 1 chipkill detect
  - 1 chipkill correct (but cannot detect 2 chipkill)
  - 2 chipkill correct
- Penalty is proportional to the protection level
- T2EC size per 64B cache line (columns: No protection, 1 chipkill detect, 1 chipkill correct*, 2 chipkill correct)
  - Non-ECC x4: 0B, 2B, 4B, 8B
  - Non-ECC x8: 16B
  - Non-ECC x16: 32B
  * Cannot detect 2 chipkill
43
Non-ECC x8 Non-ECC x16
44
Managing T2EC
45
OS manages T2EC
- PA to EA translation structure
- T2EC storage
  - Only dirty pages require T2EC (with ECC DIMMs)
  - Can use copy-on-write T2EC allocation
  - Every data page needs T2EC in the non-ECC implementation
- Free T2EC when a data page is freed or evicted
46
PA to EA Translation
- Every write-back (with ECC DIMMs) or read/write (with non-ECC DIMMs) needs to access T2EC
- Translation is similar to VA to PA translation
- The OS manages a single translation structure
47
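Example Translation
[Figure: three-level walk from physical address (PA) to ECC address (EA). The PA is split into Level 1, Level 2, Level 3, and page offset fields. The ECC page table base register plus the Level 1 field selects an ECC table entry; that entry plus the Level 2 field selects the next entry; that entry plus the Level 3 field selects the entry holding the ECC page number. The page offset is shifted right (>> log2(T2EC)) to form the ECC page offset, and the ECC page number and ECC page offset together form the EA.]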
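A multi-level PA to EA walk of this kind can be sketched as follows. The field widths (4 KiB pages, 9 index bits per level), the nested-dictionary table, and the leaf format (a byte address where the frame's T2EC region starts) are all assumptions made for illustration; the slide does not give the actual entry format or scaling details.

```python
PAGE_BITS = 12      # assumed 4 KiB pages
LEVEL_BITS = 9      # assumed index width per table level
LINE_BYTES = 64     # cache line size used throughout the slides


def pa_to_ea(pa: int, root: dict, t2ec_bytes: int):
    """Walk a 3-level table (nested dicts); return the ECC address or None."""
    pfn = pa >> PAGE_BITS
    page_offset = pa & ((1 << PAGE_BITS) - 1)
    mask = (1 << LEVEL_BITS) - 1
    # Level 1 uses the highest PFN bits, Level 3 the lowest.
    indices = [(pfn >> (LEVEL_BITS * level)) & mask for level in (2, 1, 0)]

    entry = root
    for index in indices:
        entry = entry.get(index)
        if entry is None:
            return None     # no T2EC allocated for this frame

    # Hypothetical leaf format: byte address where this frame's T2EC starts.
    t2ec_base = entry
    # Scale the offset by the T2EC size: one T2EC per 64B data line.
    line_in_page = page_offset // LINE_BYTES
    return t2ec_base + line_in_page * t2ec_bytes


if __name__ == "__main__":
    table = {0: {0: {2: 0x5000}}}          # frame 2 has T2EC at 0x5000
    print(hex(pa_to_ea((2 << 12) | 0x0C0, table, t2ec_bytes=2)))
```

Returning `None` for an unmapped frame mirrors the idea that the OS allocates T2EC storage on demand, for example only for dirty pages when ECC DIMMs are used.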
48
Accelerating Translation
- ECC address translation unit
  - Caches PA to EA translations, like a TLB
- Hierarchical caching: 2 levels
  - 1st level manages consistency with the TLB
  - 2nd level acts as a victim cache
- Read-triggered translation
  - 100% hit; the L1 EA cache is consistent with the TLB
  - Only occurs with non-ECC DIMMs
- Write-triggered translation
  - Probably hits; the L2 EA cache can be relatively large
49
ECC Address Translation Unit
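[Figure: the TLB and the ECC address translation unit. The unit takes a PA and returns an EA. It contains a 2-level EA cache (L1 EA cache and L2 EA cache), control logic, and an EA MSHR, and falls back to external EA translation. The control logic manages consistency between the TLB and the L1 EA cache.]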
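The two-level EA cache with a victim second level can be sketched in software. This is a simplified model under stated assumptions: both levels are fully associative with LRU replacement (the slides describe the L2 as 8-way set associative), TLB consistency and the MSHR are not modeled, and `EATranslationCache` and its `walk` callback are hypothetical names.

```python
from collections import OrderedDict


class EATranslationCache:
    """Two-level PA-to-EA translation cache; L2 holds entries evicted from L1."""

    def __init__(self, walk, l1_entries=16, l2_entries=4096):
        self.walk = walk                # fallback: external translation (table walk)
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries
        self.l1 = OrderedDict()         # pfn -> ECC page, oldest entry first
        self.l2 = OrderedDict()

    def _fill_l1(self, pfn, ea_page):
        self.l1[pfn] = ea_page
        if len(self.l1) > self.l1_entries:
            # The L1 victim moves into L2 instead of being dropped.
            victim, victim_page = self.l1.popitem(last=False)
            self.l2[victim] = victim_page
            if len(self.l2) > self.l2_entries:
                self.l2.popitem(last=False)

    def lookup(self, pfn):
        """Return (ECC page, where the translation was found)."""
        if pfn in self.l1:
            self.l1.move_to_end(pfn)
            return self.l1[pfn], "l1"
        if pfn in self.l2:
            ea_page = self.l2.pop(pfn)  # promote back to L1
            self._fill_l1(pfn, ea_page)
            return ea_page, "l2"
        ea_page = self.walk(pfn)
        self._fill_l1(pfn, ea_page)
        return ea_page, "walk"


if __name__ == "__main__":
    unit = EATranslationCache(walk=lambda pfn: pfn + 100, l1_entries=2, l2_entries=4)
    for pfn in (1, 2, 3, 1, 3):
        print(pfn, unit.lookup(pfn))
```

A small first level keeps the common lookup cheap, while the larger victim level catches write-triggered translations for pages that have already left the first level.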
50
Possible Impacts
- TLB miss penalty
  - VA to PA translation, then PA to EA translation
  - Appears negligible: a doubled TLB miss penalty was already assumed in the evaluation
- Design alternative: translate VA to EA directly
  - Needs a per-process translation structure
  - But potentially less impact on TLB miss penalty
- EA cache misses per 1000 instructions
  - Configuration: 16-entry fully associative L1 EA cache, 4k-entry 8-way L2 EA cache
  - ~3 in omnetpp and canneal
  - ~12 in GUPS
  - Less than 1 in other applications
- A software TLB handler might complicate things
51
Chip-Kill-Correct
- Single device error correct, double device error detect
  - Other names: DRAM RAID, Extended ECC, Advanced ECC, …
- Can tolerate a DRAM device failure
- Using x1 DRAMs
  - SEC-DED effectively provides chip-kill-correct
  - But x1 DRAMs are no longer available (really?)
[Figure: x1 DRAMs, 64 data bits and 8 ECC bits]
52
Interleaved SEC-DED
- 4 interleaved SEC-DED codes: x4 chip-kill
  - 256-bit data width
  - Works with old DRAMs
- Modern DRAMs use burst access
  - Granularity: 128B in DDR2, 256B in DDR3
[Figure: x4 DRAMs, 64 data DRAMs and 8 ECC DRAMs, (72,64) SEC-DED]
53
[Figure: burst-4 access layouts. One layout pairs a x4 Non-ECC DIMM with a x4 ECC DIMM; the other uses two x8 ECC DIMMs. In both, the DIMMs carry data and T1EC, and T2EC is virtualized.]
54
Why is x8 chipkill impractical?
- With the same access granularity
  - Higher redundancy overhead: 128-bit data + 24-bit ECC (18.75%)
  - Needs custom-designed DIMMs
- Using standard ECC DIMMs
  - Wider data path: 256-bit data + 24-bit ECC (9.375%)
  - Increased access granularity: 128B in DDR2, 256B in DDR3
Speaker note: using x8 DRAM is preferable, since x8 DRAMs consume 30% less power than x4 DRAMs for the same total capacity. But chipkill-correct using x8 DRAMs is impractical: it either requires custom-designed DIMMs to maintain the access granularity, or increases the access granularity when commodity DIMMs are used.
55
DRAM Modules
- Non-ECC DIMMs
  - 64-bit wide data path
- ECC DIMMs
  - Additional DRAMs dedicated to storing ECC
  - Additional pins to transfer ECC
- SEC-DED
  - Single-bit Error Correction, Double-bit Error Detection
  - 64-bit data + 8-bit ECC
Speaker note: ECC DIMMs provide only the additional storage and data pins for storing and transferring redundant information. The actual ECC encoding and decoding takes place at the memory controller, so system designers can choose the error protection mechanism. Typical memory error protection using 72-bit wide ECC DIMMs is based on SEC-DED, a single-bit error correction and double-bit error detection code. With SEC-DED, each 64 bits of data are protected by an 8-bit SEC-DED code.
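To make the SEC-DED behavior concrete, the sketch below implements a scaled-down (8,4) extended Hamming code, which has the same single-correct, double-detect property as the (72,64) code used with ECC DIMMs. The bit layout and function names are illustrative choices, not the layout of any particular memory controller.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # Hamming positions that carry data bits
PARITY_POSITIONS = (1, 2, 4)    # Hamming positions that carry check bits


def syndrome(bits):
    """XOR of the positions (1 to 7) whose bit is set; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if bits[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode 4 data bits into 8 bits; index 0 is the overall parity bit."""
    bits = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (nibble >> i) & 1
    s = syndrome(bits)
    for p in PARITY_POSITIONS:
        bits[p] = 1 if s & p else 0      # force the final syndrome to 0
    bits[0] = sum(bits[1:]) % 2          # make total parity even
    return bits


def decode(bits):
    """Return (nibble, status); nibble is None for an uncorrectable double error."""
    bits = list(bits)
    s = syndrome(bits)
    overall = sum(bits) % 2
    if s == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        bits[s] ^= 1                     # s == 0: the overall parity bit flipped
        status = "corrected"
    else:
        return None, "double error detected"
    nibble = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return nibble, status


if __name__ == "__main__":
    word = encode(0b1011)
    word[5] ^= 1                         # single-bit fault
    print(decode(word))
    word[6] ^= 1                         # second fault in the same word
    print(decode(word))
```

The overall parity bit is what separates the two cases: an odd number of flipped bits is treated as a correctable single error, while a nonzero syndrome with even parity signals a double error.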
56
[Figure: 64-bit x4 Non-ECC DIMM, 64-bit x8 Non-ECC DIMM, 72-bit x4 ECC DIMM, 72-bit x8 ECC DIMM]
Speaker note: this shows the structures of standard memory modules, DIMMs without ECC support. Depending on the type of DRAM used, there are x4, x8, and x16 DIMMs. Standard DIMMs have a 64-bit wide data path, so there are 16 x4 DRAMs per rank in a x4 DIMM, 8 x8 DRAMs per rank in a x8 DIMM, and 4 x16 DRAMs per rank in a x16 DIMM.
57
High-end Servers
- Need BOTH reliability and energy efficiency
- Chipkill-correct
  - But chipkill requires x4 configurations
  - Using more energy-efficient x8 configurations is impractical with chipkill
58
High-level Memory Models
[Figure: Conventional Architecture: the program's VA space maps (VA to PA) to a PA space holding data and ECC. Virtualized ECC Architecture: the program's VA space maps (VA to PA) to a PA space holding data and T1EC, and a further PA to EA mapping reaches T2EC.]
Speaker note: this compares the high-level memory models of the conventional architecture and the virtualized ECC architecture. In the conventional architecture, virtual memory translates the program's VA into a PA that points to data and ECC. In V-ECC, the PA points only to data and T1EC. As mentioned, read operations do not need to access T2EC, but when an error is detected or a write is performed, T2EC must also be accessed. Accessing T2EC requires an ECC address (EA). PA to EA translation is done similarly to VA to PA translation, and the OS can manage it.
59
Example
[Figure: Application 1's VA space and Application 2's VA space map into DRAM holding data and T1EC (VA to PA mapping), with a further PA to EA mapping.]
60
Standard DIMMs
- x4 Non-ECC DIMMs
- x4 ECC DIMMs
- 16 x4 DRAMs per rank
[Figure: x4 Non-ECC DIMM with a 64-bit wide data bus; x4 ECC DIMM with a 72-bit wide data bus]
61
Standard DIMMs – Cont’d
- 8 x8 DRAMs per rank in Non-ECC DIMMs
- 9 x8 DRAMs per rank in ECC DIMMs
- x8 consumes 30% less power than x4
[Figure: x8 Non-ECC DIMM with a 64-bit wide data bus; x8 ECC DIMM with a 72-bit wide data bus]
62
Standard DIMMs – Cont’d
- 4 x16 DRAMs per rank in Non-ECC DIMMs
- No x16 ECC DIMMs
- More power efficient than x8 DRAMs
[Figure: x16 Non-ECC DIMM with a 64-bit wide data bus; no x16 ECC DIMM]
Speaker note: ECC DIMMs, on the other hand, have a 72-bit wide data path, of which 8 bits are for ECC. x4 ECC DIMMs have 18 x4 DRAMs and x8 ECC DIMMs have 9 x8 DRAMs, but there is no x16 ECC DIMM.
63
Configurations
- Baseline x4
  - Traditional uniform chip-kill
  - Note: x8 chip-kill is not practical
- Virtualized ECC
  - ECC x4: saves 8 data pins
  - ECC x8: uses more energy-efficient x8 DRAMs

Configuration   Data      ECC      DIMMs
Baseline x4     128 bit   16 bit   x4 ECC DIMM + x4 ECC DIMM
ECC x4          128 bit   8 bit    x4 ECC DIMM + x4 Non-ECC DIMM
ECC x8          128 bit   16 bit   x8 ECC DIMM + x8 ECC DIMM
64
Symbol based error code
- b-bit symbols, GF(2^b) based arithmetic
- Simple rules
  - 1 check symbol: 1 symbol error detect
  - 2 check symbols: 1 symbol error correct, or 2 symbol error detect
  - 3 check symbols: 1 symbol error correct + 2 symbol error detect, or 3 symbol error detect
  - 4 check symbols: 2 symbol error correct + 2 symbol error detect, or 4 symbol error detect
- A 3-check-symbol error code provides chip-kill-correct
  - Max codeword length: 2^b + 2 symbols
  - b=4: 60-bit data + 12-bit ECC
  - b=8: 2040-bit data + 24-bit ECC
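The rules on this slide follow a common pattern for symbol codes that reach minimum distance r + 1 with r check symbols: the code can correct t symbol errors while detecting up to e >= t symbol errors whenever t + e <= r. The sketch below enumerates the strongest (correct, detect) pairs and applies the slide's 2^b + 2 codeword limit; the function names are illustrative, and applying the limit to b = 8 gives 2040 data bits.

```python
def max_capabilities(check_symbols: int) -> list:
    """Strongest (correct, detect) pairs with correct <= detect
    and correct + detect == check_symbols."""
    r = check_symbols
    return [(t, r - t) for t in range(r // 2 + 1)]


def max_data_bits(symbol_bits: int, check_symbols: int) -> int:
    """Data bits in the longest codeword, using the slide's 2^b + 2 symbol limit."""
    max_symbols = 2 ** symbol_bits + 2
    return (max_symbols - check_symbols) * symbol_bits


if __name__ == "__main__":
    for r in range(1, 5):
        print(r, "check symbols:", max_capabilities(r))
    for b in (4, 8):
        print(f"b={b}: {max_data_bits(b, 3)}-bit data + {3 * b}-bit ECC")
```

For three check symbols the pair (1, 2) is the single-symbol-correct, double-symbol-detect mode that provides chipkill when each symbol maps to one DRAM device.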