Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer,


1 Versatile Refresh: Low Complexity Refresh Scheduling for High–throughput Multi-banked eDRAM
Mohammed Alizadeh, Adel Javanmard, Da Chuang, Sundar Iyer, Yi Lu
(alizade, adelj)@stanford.edu, (dachuang, sundaes)@memoir-systems.com, yilu4@illinois.edu
ACM Sigmetrics/Performance 2012

2 What Is Embedded DRAM?
1. 2nd Most Common Embedded Memory
   Consists of 1 Transistor, 1 Capacitor (1T1C) cell
   2X-3X denser than SRAM
   2X-4X slower than SRAM
2. Supported by Key ASIC and IP Vendors
   IBM, TSMC, NEC, Mosys, ST
3. Used in a Number of Applications
   Servers, Networking, Storage, Gaming, Mobile
4. Industry Examples
   IBM's P7
   Sony Playstations, Nintendo GameCube, Wii
   Apple iPhone, Microsoft Zune HD, Xbox 360
   Cisco Catalyst 3K-10K
[Figure: eDRAM 1T1C memory cell, showing the Data and Select lines and the storage capacitor.]

3 Problem: eDRAM Refresh Causes Memory Bandwidth Loss
DRAM Capacitor has Finite Retention Time (W = T_ref)
Example: W = 18 us @ 100C = 4050 cycles @ 225 MHz
Example: R = 64 rows
All 64 rows will lose data in 4050 cycles!
Solution: Periodic Refresh. Reserve Refresh Cycles for Every Cell in Memory
Causes Bandwidth Loss = R/W = 64 rows / 4050 cycles ~ 1.58%
[Figure: Bank 1 with R rows, an R/W port, and a refresh port.]
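The arithmetic on this slide can be checked with a short sketch. The function names are illustrative and do not come from the presentation; the only inputs are the slide's own example values (18 us, 225 MHz, 64 rows).

```python
def retention_cycles(retention_us: float, clock_mhz: float) -> float:
    """Convert a retention time in microseconds to clock cycles."""
    # 1 us at 1 MHz is exactly one cycle, so the units cancel directly.
    return retention_us * clock_mhz


def periodic_refresh_loss(rows: int, window_cycles: float) -> float:
    """Fraction of cycles reserved when every row is refreshed once per window."""
    return rows / window_cycles


w = retention_cycles(18, 225)        # slide example: 18 us at 225 MHz
loss = periodic_refresh_loss(64, w)  # slide example: R = 64 rows
print(f"W = {w:.0f} cycles, loss = {loss:.2%}")
```

The same two functions show how a slower clock or a shorter retention time shrinks W in cycles and raises the loss.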

4 Trend: Higher Density Multi-banked Macros (Mb/mm^2)
(1) More Rows are Packed Together and Need to be Refreshed
(2) More Banks are Packed Together and Need to be Refreshed
(3) Smaller Capacitor with Lower Geometry → Smaller W
(4) Smaller W with Higher Temperature
(5) Low Clock Speed Mode Decreases 'Clock-time' to Refresh
Shared Refresh and R/W Ports: Shared Circuitry to Conserve Area
Periodic Refresh Bandwidth Loss ~ RB/W (Note: W ≥ RB)
Periodic Refresh Does not Scale with Larger Macros, Geometry & Low Power Modes
[Figure: B memory banks of R rows each, sharing M refresh and R/W ports.]

5 Examples of Periodic Refresh with Multi-banked Macros
M = 1 (Ports)

Mode                 W (@ 100C)              R Rows  B Banks  Periodic Refresh Loss
Normal (250 MHz)     18.2 us = 4050 cycles   64      8        12.6%
Normal (250 MHz)     18.2 us = 4050 cycles   128     8        25%
Low Power (150 MHz)  18.2 us = 2699 cycles   64      8        18.9%
Low Power (150 MHz)  18.2 us = 2699 cycles   128     8        38%

M = Number of Memory Ports, B = Number of Banks, R = Number of Rows, W = Cell Retention Time
The Problem is Only Getting Worse Over Time …
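As a rough check, the loss column follows from the multi-bank estimate RB/W, using the cycle counts exactly as listed on the slide. This is a minimal sketch; the function name and the `cases` list are illustrative only, and the results match the slide's percentages up to rounding.

```python
def multibank_refresh_loss(rows: int, banks: int, window_cycles: int) -> float:
    """Periodic refresh loss ~ R*B / W for banks that share a refresh port."""
    if window_cycles < rows * banks:
        raise ValueError("W must be at least R*B for periodic refresh to keep up")
    return rows * banks / window_cycles


# (mode, W in cycles, R rows, B banks), taken from the slide's table
cases = [
    ("Normal", 4050, 64, 8),
    ("Normal", 4050, 128, 8),
    ("Low Power", 2699, 64, 8),
    ("Low Power", 2699, 128, 8),
]
losses = [multibank_refresh_loss(r, b, w) for _, w, r, b in cases]
for (mode, w, r, b), loss in zip(cases, losses):
    print(f"{mode:<10} W={w} R={r} B={b} loss={loss:.1%}")
```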

6 Vendor Solution: Concurrent Refresh
Concurrent Refresh ++ : Refresh a Bank Which is Not Being Concurrently Accessed
++ T. Kirihata et al., An 800-MHz embedded DRAM with a concurrent refresh mode. IEEE Journal of Solid-State Circuits, 40(6):1377–1387, June 2005.
[Figure: B memory banks of R rows each, with M R/W ports and a concurrent refresh port.]

7 How is Concurrent Refresh Used Today?
Standard Observation: N-1 out of N Banks Get Refreshed for Any Pattern
Concurrent Refresh Overhead is Proportional to 1 bank
Concurrent Refresh Overhead = R/W = 64 rows / 4050 cycles = ~1.58%
Our Observation: This is Incorrect! In Some Cases, Refresh Overhead can be Very Bad, Close to 100% for ANY Concurrent Refresh Scheduler
[Figure: B memory banks with per-bank row pointers (RP), a next concurrent refresh pointer, the accessed bank, and a deficit register (bank, count) that tracks non-refreshed bank(s).]

8 Goals of Our Work: An Industry Outlook
Design a Concurrent Refresh Scheduler that can
1. Provide Deterministic Memory Performance Guarantees
   − Maximize Memory Throughput (Optimality)
2. Be Universally Applicable
   − For any eDRAM macro with B banks, R Rows, M memory ports
   − For any characteristics of cell retention time W ++, and Clock speed
3. Maximize Memory Burst Tolerance
4. Have Low Implementation Overhead
++ Note that W is itself a function of temperature, process, and the micro-architecture of the eDRAM

9 Problem Formulation
We consider a general class of algorithms that require X refresh (idle) timeslots in every Y consecutive timeslots.
Sliding Window Constraint: Supports X idle cycles in any (t, t+Y)
Sliding Window Constraint Gives Maximum Flexibility for Handling Bursts, and When to Provide Idle Cycles
[Figure: timeline contrasting the Fixed TDM Constraint (refresh slots within consecutive Refresh Windows 1, 2, 3, 4, ...) with the Sliding Window Constraint (refresh slots within any refresh window).]
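The two constraints can be made concrete with a small checker. This is a sketch under stated assumptions: "X idle timeslots in every Y consecutive timeslots" is read as "at least X", the fixed TDM windows are taken to be aligned blocks of Y slots, and the function names are illustrative.

```python
from typing import Sequence


def satisfies_sliding_window(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if every run of y consecutive timeslots has at least x idle slots."""
    if len(idle) < y:
        return True  # no complete window to check
    count = sum(idle[:y])
    if count < x:
        return False
    for t in range(y, len(idle)):
        # Slide the window one slot: add the new slot, drop the oldest.
        count += idle[t] - idle[t - y]
        if count < x:
            return False
    return True


def satisfies_fixed_tdm(idle: Sequence[bool], x: int, y: int) -> bool:
    """True if each aligned window [k*y, (k+1)*y) has at least x idle slots."""
    full_windows = len(idle) // y
    return all(sum(idle[k * y:(k + 1) * y]) >= x for k in range(full_windows))


# Idle at the start of window 1 and the end of window 2 (Y = 4, X = 1).
uneven = [True, False, False, False, False, False, False, True]
# Idle exactly once every 4 slots.
regular = [False, False, False, True, False, False, False, True]
print(satisfies_fixed_tdm(uneven, 1, 4), satisfies_sliding_window(uneven, 1, 4))
print(satisfies_fixed_tdm(regular, 1, 4), satisfies_sliding_window(regular, 1, 4))
```

The `uneven` pattern meets the aligned-window check but leaves a run of Y slots with no idle, so it fails the sliding-window check.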

10 Key Performance Metrics
Refresh Overhead = X / Y
   Memory bandwidth wasted on refresh
Burst Tolerance = Y - X
   Maximum number of consecutive memory accesses without interruption for refresh
We'll Consider the Simple Case When the User is Required to Send X = 1 Idle in Y Cycles, and M = 1
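A minimal sketch of the two metrics follows. The function names and the sample values X = 1, Y = 26 are chosen only for illustration.

```python
from fractions import Fraction


def refresh_overhead(x: int, y: int) -> Fraction:
    """Share of timeslots given up to refresh: X / Y."""
    return Fraction(x, y)


def burst_tolerance(x: int, y: int) -> int:
    """Longest run of back-to-back accesses between refresh idles: Y - X."""
    return y - x


overhead = refresh_overhead(1, 26)
burst = burst_tolerance(1, 26)
print(f"overhead = {float(overhead):.2%}, burst tolerance = {burst} accesses")
```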

11 Our Solution: Versatile Refresh Algorithm
Bank with deficit has priority for refresh.
Maximum Allowed Deficit Register Controls Burst Tolerance (Y)
[Figure: B memory banks with per-bank row pointers (RP 1 ... RP B), a next concurrent refresh pointer, a deficit register (pointer, count), and a max register (count).]
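The slide names the pieces (per-bank row pointers, a refresh pointer, a deficit register, a max register) but not the exact policy, so the following is only a rough sketch for one memory port. The round-robin order, the way a skipped bank is recorded, and the exception raised at the deficit limit are all assumptions made for illustration; the class and method names are hypothetical.

```python
from typing import List, Optional, Tuple


class VersatileRefreshSketch:
    """Illustrative refresh scheduler with a deficit register (assumed policy)."""

    def __init__(self, banks: int, rows: int, max_deficit: int) -> None:
        if banks < 2:
            raise ValueError("need at least two banks to refresh around an access")
        self.banks = banks
        self.rows = rows
        self.max_deficit = max_deficit            # assumed role of the max register
        self.row_ptr: List[int] = [0] * banks     # next row to refresh in each bank
        self.bank_ptr = 0                         # round-robin refresh pointer
        self.deficit_bank: Optional[int] = None   # bank that missed its turn
        self.deficit = 0                          # number of missed turns

    def _refresh(self, bank: int) -> Tuple[int, int]:
        row = self.row_ptr[bank]
        self.row_ptr[bank] = (row + 1) % self.rows
        return bank, row

    def step(self, accessed_bank: Optional[int]) -> Tuple[int, int]:
        """Advance one timeslot; accessed_bank is None for an idle slot."""
        # A bank with a deficit has priority whenever it is not being accessed.
        if self.deficit > 0 and self.deficit_bank != accessed_bank:
            bank = self.deficit_bank
            self.deficit -= 1
            if self.deficit == 0:
                self.deficit_bank = None
            return self._refresh(bank)
        bank = self.bank_ptr
        if bank == accessed_bank:
            # Conflict: record the missed turn and refresh the next bank instead.
            if self.deficit >= self.max_deficit:
                raise RuntimeError("deficit limit reached; an idle slot was required")
            self.deficit_bank = bank
            self.deficit += 1
            bank = (bank + 1) % self.banks
        self.bank_ptr = (bank + 1) % self.banks
        return self._refresh(bank)


sched = VersatileRefreshSketch(banks=4, rows=2, max_deficit=3)
# Access bank 0, then two idle slots, then access bank 3.
trace = [sched.step(accessed) for accessed in (0, None, None, 3)]
print(trace)  # list of (bank, row) pairs refreshed in each timeslot
```

In this sketch the first idle slot after a conflict goes to the bank that missed its turn, which is one way to read "bank with deficit has priority".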

12 Necessary Refresh Overhead for any Algorithm: Intuition, X=1
At each time the BR memory cells have distinct ages ≥ (0, …, BR-1)
An adversary keeps reading from a particular bank; idle slots are needed to refresh cells in that bank.
A total of BR inequalities to ensure cells are refreshed in time
Interestingly, only two of these inequalities matter
   The one corresponding to the oldest cell
   The one corresponding to the oldest "youngest cell in each bank"

13 Necessary Refresh Overhead for any Algorithm: Derivation, X=1
How much can the adversary age the oldest cell?
   Current age is at least BR-1
   Must wait for at least 1 idle before it is picked up: (BR-1) + Y ≤ W
How much can the adversary age the oldest "youngest cell in each bank"?
   Current age is at least B-1
   Must wait for at least R idles before it is picked up: (B-1) + YR ≤ W
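Taken together, the two inequalities cap how large Y can be for X = 1, and therefore bound the refresh overhead 1/Y from below. The sketch below evaluates that cap; the function name is illustrative, and the sample values B = 16, R = 128, W = 2500 are the ones used as an example later in the deck. The result is a necessary bound for any algorithm, not a statement of what a particular scheduler achieves.

```python
def max_window_x1(banks: int, rows: int, retention: int) -> int:
    """Largest Y permitted by both necessary conditions for X = 1 (0 if none)."""
    oldest_cell = retention - (banks * rows - 1)          # (BR-1) + Y <= W
    oldest_youngest = (retention - (banks - 1)) // rows   # (B-1) + Y*R <= W
    return max(0, min(oldest_cell, oldest_youngest))


y_max = max_window_x1(banks=16, rows=128, retention=2500)
print(f"Y <= {y_max}, so overhead 1/Y >= {1 / y_max:.2%}")
```

For these values the second inequality is the binding one: (2500 - 15) // 128 = 19, while the first allows Y up to 453.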

14 Optimality for Versatile Refresh Overhead: Results, X=1
Necessity: Result for any Algorithm
Sufficiency: Result for VR Algorithm (with parameter X):
Nearly Optimal Refresh For X=1

15 Performance Guarantees of Versatile Refresh Algorithm
Refresh Overhead ~ R/W, for W large
Near-optimal Refresh Overhead for X = 1
Why Would We Ever Use Large X?
[Figure: Worst-case Refresh Overhead (X/Y, from 0 to 1, with 1/B and R/W marked) versus Cell Retention Time (W, with RB and W_c = RB + B-1 marked), showing a "Bad" Region with High Overhead and curves for Increasing X.]

16 Why Would We Ever Use Large X?
Because of Burst Tolerance (large X → large Y - X)
   If memory accesses are bursty, refreshes can be hidden
There is a Critical Value of X for Max Burst Tolerance
Example: B = 16, R = 128, W = 2500

17 Calculations for Customer ASIC ++
R = 1024, B = 6 Banks, W = 18.2 us @ 100C = 6825 cycles @ 375 MHz

                              Theoretical   In Practice
Total Bandwidth               375 MHz
Versatile Refresh Formula     6825 > 257N
Versatile Refresh Constraint  1 in 26.55    1 in 26
Data-path                     360 MHz
Refresh                       14.12 MHz     14.42 MHz
Extra Bandwidth for CPU       0.88 MHz      0.58 MHz

(++ Note that these numbers have been sanitized)
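The refresh and leftover bandwidth rows follow from the "1 in Y" constraint by simple division. The sketch below reproduces them under the assumption that the 375 MHz total and 360 MHz data-path figures apply to both columns; the function name is illustrative.

```python
from typing import Tuple


def bandwidth_budget(total_mhz: float, datapath_mhz: float,
                     window_y: float) -> Tuple[float, float]:
    """Return (refresh MHz, leftover MHz) when 1 slot in every window_y is idle."""
    refresh = total_mhz / window_y
    return refresh, total_mhz - datapath_mhz - refresh


theory = bandwidth_budget(375, 360, 26.55)   # "1 in 26.55"
practice = bandwidth_budget(375, 360, 26)    # "1 in 26"
print(f"theoretical: refresh {theory[0]:.2f} MHz, extra {theory[1]:.2f} MHz")
print(f"in practice: refresh {practice[0]:.2f} MHz, extra {practice[1]:.2f} MHz")
```

Rounding Y down to an integer costs about 0.3 MHz of refresh bandwidth, which comes out of the CPU's share.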

18 Versatile Refresh Enhancement
Enhancement: No-conflict slot: A timeslot where the bank the VR scheduler wants to refresh is not being accessed.
   Any idle slot is a no-conflict slot; but not vice versa
   For VR, no-conflict slots are as good as idle slots.
Observation: This allows lower refresh overhead (possibly zero) for non-adversarial memory access patterns
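The distinction between idle and no-conflict slots can be stated as two predicates. This is a minimal sketch for a single memory port; representing an idle slot as `None` and the function names are assumptions for illustration.

```python
from typing import Optional


def is_idle(accessed_bank: Optional[int]) -> bool:
    """An idle slot has no memory access at all."""
    return accessed_bank is None


def is_no_conflict(accessed_bank: Optional[int], refresh_bank: int) -> bool:
    """A no-conflict slot: the bank the scheduler wants to refresh is not accessed."""
    return accessed_bank != refresh_bank


# The scheduler wants to refresh bank 2.
print(is_no_conflict(None, 2))  # idle slot
print(is_no_conflict(1, 2))     # access to a different bank
print(is_no_conflict(2, 2))     # access to the same bank
```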

19 Fully Enhanced Versatile Refresh Algorithm
Repeat for Multiple Memory Ports (M)
[Figure: B memory banks with per-bank row pointers (RP 1 ... RP B), a next refresh bank pointer, a deficit register (pointer, count), and a max register (count), together with an Enforcer Module (User Logic) labeled "No conflict feedback" and "X idles in Y timeslots".]

20 Simulation: Synthetic Statistical Workload
Parameter Alpha Controls Degree of Temporal Locality
   alpha ~ 0 → always read from bank 1 (adversarial)
   alpha ~ 1 → read from random banks (benign)
VR with X = 4: Min worst-case overhead (best for adversarial)
VR with X = 128: Max burst tolerance (best for benign)
Refresh Overhead has Disappeared Completely!
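The slide gives only the two extremes of alpha, not the distribution in between. One simple model that matches both extremes is sketched below: each access goes to a uniformly random bank with probability alpha and to a fixed "hot" bank otherwise. This model, the function name, and the seed parameter are assumptions for illustration, not the workload used in the presentation.

```python
import random
from typing import List


def synthetic_accesses(alpha: float, banks: int, length: int,
                       seed: int = 0) -> List[int]:
    """Generate a bank-access trace under the assumed alpha-mixture model."""
    rng = random.Random(seed)
    hot_bank = 0  # "bank 1" on the slide, using 0-based indexing here
    return [
        rng.randrange(banks) if rng.random() < alpha else hot_bank
        for _ in range(length)
    ]


adversarial = synthetic_accesses(alpha=0.0, banks=8, length=1000)
benign = synthetic_accesses(alpha=1.0, banks=8, length=1000)
print(len(set(adversarial)), len(set(benign)))  # distinct banks touched
```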

21 Conclusion
With Versatile Refresh A Designer Can …
1. Exactly Calculate Available Memory Bandwidth
   − For any eDRAM macro with B banks, R Rows, M memory ports
   − For any characteristics of Temperature, W = T_ref and Clock speed
2. Achieve Optimal Worst-case Memory Bandwidth
3. Design for Large Burst Tolerance
4. Potentially Eliminate Back-pressure
   − Simplify associated complex design and verification
5. Maximize Best-case Memory Bandwidth
6. Avail of a Formally Verified VR Controller
   − On a suitably reduced memory instance

