Published by Joel Byrd. Modified over 9 years ago.
1
Versatile Refresh: Low Complexity Refresh Scheduling for High-throughput Multi-banked eDRAM
Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu
ACM Sigmetrics/Performance 2012
2
What Is Embedded DRAM?
1. 2nd Most Common Embedded Memory
   − Consists of a 1 Transistor, 1 Capacitor (1T1C) cell
   − 2X-3X denser than SRAM
   − 2X-4X slower than SRAM
2. Supported by Key ASIC and IP Vendors
   − IBM, TSMC, NEC, Mosys, ST
3. Used in a Number of Applications
   − Servers, Networking, Storage, Gaming, Mobile
4. Industry Examples
   − IBM's P7
   − Sony Playstations, Nintendo GameCube, Wii
   − Apple iPhone, Microsoft Zune HD, Xbox 360
   − Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C memory cell, with labels Data, Select, Storage Capacitor]
3
Problem: eDRAM Refresh Causes Memory Bandwidth Loss
DRAM Capacitor has Finite Retention Time (W = T_ref)
   − Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
   − Example: R = 64 rows
   − All 64 rows will lose data in 4050 cycles!
Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory
   − Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
[Figure: Bank 1 with R rows, an R/W port, and a refresh port]
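The single-bank arithmetic on this slide can be reproduced with a short sketch. The function names are illustrative and not from the talk; the only inputs are the slide's own numbers (18 us retention, 225 MHz clock, 64 rows).

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> int:
    """Retention time W expressed in clock cycles (microseconds * MHz = cycles)."""
    return round(retention_us * clock_mhz)


def single_bank_refresh_loss(rows: int, w_cycles: int) -> float:
    """Fraction of cycles reserved for refresh when every row is refreshed once per W."""
    return rows / w_cycles


w = retention_cycles(18, 225)          # 4050 cycles
loss = single_bank_refresh_loss(64, w)
print(w, f"{loss:.2%}")                # 4050 1.58%
```

Lowering the clock while the retention time stays fixed shrinks `w`, so the same 64 rows cost a larger share of the available cycles.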
4
Trend: Higher Density Multi-banked Macros (Mb/mm^2)
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases 'Clock-time' to Refresh (fewer clock cycles fit in W)
Shared Circuitry to Conserve Area: Shared Refresh and R/W Ports
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Periodic Refresh Does not Scale with Larger Macros, Geometry & Low Power Modes
[Figure: memory banks 1..B, each with rows 1..R, behind shared refresh and R/W ports 1..M]
5
Examples of Periodic Refresh with Multi-banked Macros
M = 1 (Ports)

Mode                  W (@ 100C)              R (Rows)  B (Banks)  Periodic Refresh Loss
Normal (250 MHz)      18.2 us = 4050 cycles   64        8          12.6%
Normal (250 MHz)      18.2 us = 4050 cycles   128       8          25%
Low Power (150 MHz)   18.2 us = 2699 cycles   64        8          18.9%
Low Power (150 MHz)   18.2 us = 2699 cycles   128       8          38%

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
The Problem is Only Getting Worse Over Time …
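The loss column follows from the RB/W estimate applied to the cycle counts in the table. A minimal sketch, assuming the cycle counts are taken as given (the function name and the `cases` list are illustrative only):

```python
def periodic_refresh_loss(rows: int, banks: int, w_cycles: int) -> float:
    """Periodic refresh bandwidth loss ~ R*B / W for a multi-banked macro."""
    return rows * banks / w_cycles


# (mode, W in cycles, R, B) as listed on the slide
cases = [
    ("Normal", 4050, 64, 8),
    ("Normal", 4050, 128, 8),
    ("Low Power", 2699, 64, 8),
    ("Low Power", 2699, 128, 8),
]

for mode, w, r, b in cases:
    print(f"{mode:10s} R={r:4d} B={b} loss={periodic_refresh_loss(r, b, w):.1%}")
```

The printed values agree with the slide to within its rounding. Doubling R doubles the loss, and the low power mode raises it again because fewer cycles fit in the same retention time.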
6
Vendor Solution: Concurrent Refresh
Concurrent Refresh ++ : Refresh a Bank Which is Not Being Concurrently Accessed
[Figure: memory banks 1..B with rows 1..R, R/W ports 1..M, and a separate concurrent refresh port]
++ T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. Solid-State Circuits, IEEE Journal of, 40(6):1377-1387, June 2005.
7
How is Concurrent Refresh Used Today?
Deficit Register Tracks Non-refreshed Bank(s)
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
   − Concurrent Refresh Overhead is Proportional to 1 bank
   − Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%
Our Observation: This is Incorrect!
   − In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler
[Figure: memory banks 1..B with per-bank refresh pointers (RP), the accessed bank, a next concurrent refresh pointer, and a deficit register holding a bank and a count]
8
Goals of Our Work: An Industry Outlook
Design a Concurrent Refresh Scheduler that can
1. Provide Deterministic Memory Performance Guarantees
   − Maximize Memory Throughput (Optimality)
2. Be Universally Applicable
   − For any eDRAM macro with B banks, R Rows, M memory ports
   − For any characteristics of cell retention time W ++, and Clock speed
3. Maximize Memory Burst Tolerance
4. Have Low Implementation Overhead
++ Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM
9
Problem Formulation
We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
Fixed TDM Constraint
   − Refresh in fixed windows: Refresh Window 1, Refresh Window 2, Refresh Window 3, Refresh Window 4, …
Sliding Window Constraint
   − Supports X idle cycles in any (t, t+Y): any span of Y timeslots is a refresh window
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
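The sliding window constraint can be checked mechanically on a trace of timeslots. The sketch below is one way to do it, assuming the trace is a list of booleans where `True` marks an idle (refresh) slot; the function name is illustrative and not part of the talk.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every run of y consecutive timeslots contains at least x idle slots."""
    if len(idle) < y:
        return True  # no complete window to check
    count = sum(idle[:y])  # idle slots in the first window
    if count < x:
        return False
    for t in range(y, len(idle)):
        # Slide the window by one slot: add the new slot, drop the oldest one.
        count += idle[t] - idle[t - y]
        if count < x:
            return False
    return True


# One idle slot every 4th slot: every window of 4 slots holds exactly one idle.
regular = [True, False, False, False] * 3
# Idle slots at positions 0 and 7 leave the window covering slots 1..4 without any idle.
gappy = [True, False, False, False, False, False, False, True]

print(satisfies_sliding_window(regular, x=1, y=4))  # True
print(satisfies_sliding_window(gappy, x=1, y=4))    # False
```

The user is free to place idle slots anywhere, as long as no window of Y slots falls short of X.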
10
Key Performance Metrics
Refresh Overhead = X / Y
   − Memory bandwidth wasted on refresh
Burst Tolerance = Y - X
   − Maximum number of consecutive memory accesses without interruption for refresh
We'll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
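Both metrics are simple functions of X and Y. A small sketch with illustrative function names; the values X = 1 and Y = 26 are chosen only as an example.

```python
from fractions import Fraction


def refresh_overhead(x: int, y: int) -> Fraction:
    """Share of timeslots the user must leave idle: X / Y."""
    return Fraction(x, y)


def burst_tolerance(x: int, y: int) -> int:
    """Longest run of back-to-back accesses that still leaves X idle slots in Y: Y - X."""
    return y - x


x, y = 1, 26
print(refresh_overhead(x, y), f"= {float(refresh_overhead(x, y)):.2%}")  # 1/26 = 3.85%
print(burst_tolerance(x, y))                                              # 25
```

Scaling X and Y together keeps the overhead fixed but lengthens the tolerated burst, which is why the two metrics are tracked separately.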
11
Our Solution: Versatile Refresh Algorithm
Bank with deficit has priority for refresh.
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
[Figure: memory banks 1..B, each with a refresh pointer (RP 1 … RP B); a next concurrent refresh pointer; a deficit register (pointer, count); and a max register (count = 3 in the example)]
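The slide names the moving parts (per-bank refresh pointers, a round-robin bank pointer, a deficit register, a maximum deficit) but not the exact rules. The toy model below is a sketch of how such a scheduler could behave for a single port (M = 1). It is not the authors' algorithm: the class name, the rule of refreshing one row in every timeslot, and the overflow error are all assumptions made for illustration.

```python
from typing import Optional


class DeficitScheduler:
    """Toy single-port refresh scheduler with one deficit register (hypothetical)."""

    def __init__(self, banks: int, rows: int, max_deficit: int) -> None:
        if banks < 2:
            raise ValueError("need at least two banks to refresh around an access")
        self.banks = banks
        self.rows = rows
        self.max_deficit = max_deficit
        self.next_bank = 0                        # next concurrent refresh pointer
        self.deficit_bank: Optional[int] = None   # bank that missed its turns
        self.deficit_count = 0                    # number of turns it missed
        self.row_ptr = [0] * banks                # RP per bank

    def _refresh(self, bank: int) -> int:
        self.row_ptr[bank] = (self.row_ptr[bank] + 1) % self.rows
        return bank

    def step(self, accessed: Optional[int]) -> int:
        """Run one timeslot; `accessed` is the bank in use, or None for an idle slot."""
        # The bank with a deficit has priority whenever it is not being accessed.
        if self.deficit_count > 0 and self.deficit_bank != accessed:
            self.deficit_count -= 1
            return self._refresh(self.deficit_bank)
        bank = self.next_bank
        if bank == accessed:
            # The round-robin bank is blocked: record a deficit and skip past it.
            if self.deficit_count >= self.max_deficit:
                raise RuntimeError("deficit limit reached: an idle slot is required")
            self.deficit_bank = bank
            self.deficit_count += 1
            bank = (bank + 1) % self.banks
        self.next_bank = (bank + 1) % self.banks
        return self._refresh(bank)


sched = DeficitScheduler(banks=4, rows=8, max_deficit=3)
print([sched.step(0) for _ in range(9)])  # adversary keeps reading bank 0
print(sched.deficit_count)                # bank 0 has missed 3 turns
print(sched.step(None))                   # an idle slot lets bank 0 catch up
```

In this model a larger `max_deficit` lets the user run longer bursts on one bank before an idle slot is forced, which mirrors the slide's remark that the maximum allowed deficit controls burst tolerance.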
12
Necessary Refresh Overhead for any Algorithm: Intuition, X=1
At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
A total of BR inequalities to ensure cells are refreshed in time
Interestingly, only two of these inequalities matter
   − The one corresponding to the oldest cell
   − The one corresponding to the oldest “youngest cell in each bank”
13
Necessary Refresh Overhead for any Algorithm: Derivation, X=1
How much can the adversary age the oldest cell?
   − Current age is at least BR-1
   − Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
How much can the adversary age the oldest “youngest cell in each bank”?
   − Current age is at least B-1
   − Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
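The two inequalities bound how long the window Y can be when X = 1, and therefore how small the overhead 1/Y can get for any scheduler. A minimal sketch of that calculation (the function name is illustrative; the first example uses B = 16, R = 128, W = 2500, values that appear elsewhere in the talk):

```python
def max_window_x1(banks: int, rows: int, w: int) -> int:
    """Largest Y satisfying both necessary conditions for X = 1 (0 if none exists)."""
    y_oldest = w - (banks * rows - 1)        # from (BR - 1) + Y <= W
    y_youngest = (w - (banks - 1)) // rows   # from (B - 1) + Y*R <= W
    return max(0, min(y_oldest, y_youngest))


y = max_window_x1(banks=16, rows=128, w=2500)
print(y, f"overhead >= {1 / y:.2%}")   # 19 overhead >= 5.26%

# Single bank with R = 64 and W = 4050: the bound is close to R/W.
y1 = max_window_x1(banks=1, rows=64, w=4050)
print(y1, f"overhead >= {1 / y1:.2%}")
```

For large W the second condition dominates and Y is roughly W/R, so the minimum overhead approaches R/W.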
14
Optimality for Versatile Refresh Overhead: Results, X=1
Necessity: Result for any Algorithm
Sufficiency: Result for VR Algorithm (with parameter X)
Nearly Optimal Refresh For X=1
15
Performance Guarantees of Versatile Refresh Algorithm
[Figure: Worst-case Refresh Overhead (X/Y) versus Cell Retention Time (W). Overhead axis marks: 0, R/W, 1/B, 1. Retention axis marks: RB and W_c = RB + B-1. Curves shown for increasing X.]
“Bad” Region with High Overhead
Near-optimal Refresh Overhead for X = 1
Refresh Overhead ~ R/W, for W large
Why Would We Ever Use Large X?
16
Why Would We Ever Use Large X?
Because of Burst Tolerance (large X → large Y - X)
   − If memory accesses are bursty, refreshes can be hidden
There is a Critical Value of X for Max Burst Tolerance
Example: B = 16, R = 128, W = 2500
17
Calculations for Customer ASIC ++
R = 1024, B = 6 Banks, W = 18.2 us @ 100C = 6825 cycles @ 375 MHz

                               Theoretical    In Practice
Total Bandwidth                375 MHz        375 MHz
Versatile Refresh Formula      6825 > 257N
Versatile Refresh Constraint   1 in 26.55     1 in 26
Data-path                      360 MHz        360 MHz
Refresh                        14.12 MHz      14.42 MHz
Extra Bandwidth for CPU        0.88 MHz       0.58 MHz

(++ Note that these numbers have been sanitized)
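The rows of this table are linked by simple arithmetic, sketched below. Reading N in "6825 > 257N" as the window length Y (one idle slot in every N) is an assumption, as are the function and variable names; the coefficient 257 is taken from the slide as given.

```python
def refresh_budget(total_mhz: float, datapath_mhz: float, window: float):
    """Return (refresh bandwidth, leftover bandwidth) for 1 idle slot per `window` slots."""
    refresh_mhz = total_mhz / window
    return refresh_mhz, total_mhz - datapath_mhz - refresh_mhz


w_cycles, coeff = 6825, 257
n_theory = w_cycles / coeff    # largest N allowed by 6825 > 257N, about 26.55
n_practice = int(n_theory)     # an implementable whole-number window: 26

for label, n in (("theoretical", n_theory), ("in practice", n_practice)):
    refresh, extra = refresh_budget(375, 360, n)
    print(f"{label}: refresh {refresh:.2f} MHz, extra {extra:.2f} MHz")
```

Rounding the window down to a whole number of slots costs about 0.3 MHz of refresh bandwidth, which comes out of the bandwidth left over for the CPU.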
18
Versatile Refresh Enhancement
Enhancement: No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
   − Any idle slot is a no-conflict slot, but not vice versa
   − For VR, no-conflict slots are as good as idle slots.
Observation: This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
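The distinction between idle slots and no-conflict slots can be made concrete with a small classifier. This is a sketch with illustrative names, assuming a single port, where each timeslot is described by the accessed bank (or `None` when idle) and the bank the scheduler wants to refresh.

```python
from typing import List, Optional


def is_idle(accessed: Optional[int]) -> bool:
    """An idle slot carries no memory access at all."""
    return accessed is None


def is_no_conflict(accessed: Optional[int], refresh_target: int) -> bool:
    """A no-conflict slot leaves the scheduler's target bank free, idle or not."""
    return accessed != refresh_target


trace: List[Optional[int]] = [0, 1, None, 0]  # accessed bank per slot, None = idle
target = 0                                     # bank the scheduler wants to refresh

print([is_idle(a) for a in trace])                 # [False, False, True, False]
print([is_no_conflict(a, target) for a in trace])  # [False, True, True, False]
```

The second slot shows why the enhancement helps: the user is accessing bank 1, so the refresh of bank 0 can proceed without costing any bandwidth.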
19
Fully Enhanced Versatile Refresh Algorithm
[Figure: memory banks 1..B, each with a refresh pointer (RP 1 … RP B); a next refresh bank pointer; a deficit register (pointer, count); a max register (count = 3 in the example); and an Enforcer Module (User Logic), labeled "X idles in Y timeslots" and "No conflict feedback"]
Repeat for Multiple Memory Ports (M)
20
Simulation: Synthetic Statistical Workload
Parameter Alpha Controls Degree of Temporal Locality
   − alpha ~ 0 → always read from bank 1 (adversarial)
   − alpha ~ 1 → read from random banks (benign)
VR with X = 4: Min worst-case overhead (best for adversarial)
VR with X = 128: Max burst tolerance (best for benign)
Refresh Overhead has Disappeared Completely!
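The slide gives only the two endpoints of the workload model. One simple way to interpolate between them is sketched below; the mixing rule (with probability alpha pick a uniformly random bank, otherwise read the first bank) is an assumption and may differ from the workload used in the talk. Bank index 0 stands for the slide's bank 1.

```python
import random
from typing import List


def synthetic_accesses(alpha: float, banks: int, n: int, seed: int = 0) -> List[int]:
    """Generate n bank accesses; alpha = 0 is adversarial, alpha = 1 is uniformly random."""
    rng = random.Random(seed)  # seeded for repeatable experiments
    return [rng.randrange(banks) if rng.random() < alpha else 0 for _ in range(n)]


adversarial = synthetic_accesses(alpha=0.0, banks=16, n=1000)
benign = synthetic_accesses(alpha=1.0, banks=16, n=1000)

print(set(adversarial))       # {0}: every access hits the same bank
print(len(set(benign)) > 1)   # True: accesses spread over many banks
```

With a benign trace most slots leave the refresh target free, which is the situation in which refresh overhead can vanish.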
21
Conclusion
With Versatile Refresh A Designer Can …
1. Exactly Calculate Available Memory Bandwidth
   − For any eDRAM macro with B banks, R Rows, M memory ports
   − For any characteristics of Temperature, W = T_ref and Clock speed
2. Achieve Optimal Worst-case Memory Bandwidth
3. Design for Large Burst Tolerance
4. Potentially Eliminate Back-pressure
   − Simplify associated complex design and verification
5. Maximize Best-case Memory Bandwidth
6. Avail of a Formally Verified VR Controller
   − On a suitably reduced memory instance