Handling the Problems and Opportunities Posed by Multiple On-Chip Memory Controllers
Manu Awasthi, David Nellans, Kshitij Sudan, Rajeev Balasubramonian, Al Davis
University of Utah
Takeaway
- Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  - NUMA memory hierarchies across multiple sockets
  - Intelligent data mapping required to reduce average memory access delay
- Hardware-software co-design approach required for efficient data placement
  - Minimal software involvement
- Data placement needs to be aware of system parameters
  - Row-buffer hit rates, queuing delays, physical proximity, etc.
NUMA - Today
[Diagram: conceptual representation of a four-socket Nehalem machine. Each socket has four cores and an on-chip memory controller (MC) connected to local DIMMs (DRAM) over a memory channel; sockets are linked by the QPI interconnect.]
NUMA - Future
[Diagram: a future CMP with 16 cores, a shared L2 cache, and four on-chip memory controllers (MC 1-4), with DIMMs (DRAM) attached to the MCs over memory channels and cores communicating with MCs over the on-chip interconnect.]
Local Memory Access
Accessing local memory is fast!
[Diagram: a core sends an address to its own socket's MC and the data is returned from the local DIMM.]
Problem 1 - Remote Memory Access
Data for Core N can be anywhere!
[Diagram: the address is routed over QPI to a remote socket's MC and the data is returned from that socket's DIMM, adding interconnect and remote-access latency.]
Memory Access Stream – Single Core
[Diagram: the memory controller request queue holds requests that almost all come from one program on CPU 0.]
A single core executes only a handful of context-switched programs, so the request stream retains spatio-temporal locality that the MC can exploit.
Problem 2 - Memory Access Stream - CMPs
[Diagram: the memory controller request queue holds requests from many different programs running on different CPUs.]
Memory accesses from different cores get interleaved, leading to a loss of spatio-temporal locality.
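To see why the interleaving hurts, the short sketch below (illustrative only, not from the paper) models a single open row per DRAM bank and compares a sequential stream from one program against a round-robin interleaving of eight program streams; the 8 KB row size and 64-byte stride are assumed values.

```python
ROW_SIZE = 8 * 1024  # bytes mapped to one DRAM row (an assumed value)

def row_buffer_hit_rate(stream):
    """Fraction of requests that hit the currently open row of a single bank."""
    open_row, hits = None, 0
    for addr in stream:
        row = addr // ROW_SIZE
        if row == open_row:
            hits += 1
        open_row = row
    return hits / len(stream)

def program_stream(base, n, stride=64):
    """One program walking sequentially through its own address region."""
    return [base + i * stride for i in range(n)]

# Single core: requests from one program arrive back-to-back at the MC.
single = program_stream(0, 8_000)

# CMP: the MC sees requests from 8 programs interleaved round-robin.
streams = [program_stream(core * (1 << 30), 1_000) for core in range(8)]
interleaved = [req for group in zip(*streams) for req in group]

print(f"single-core hit rate: {row_buffer_hit_rate(single):.2f}")
print(f"interleaved hit rate: {row_buffer_hit_rate(interleaved):.2f}")
```

Under these assumptions the single-program stream hits the open row almost every time, while the interleaved stream almost never does.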
Problem 3 – Increased Overheads for Memory Accesses
[Chart: queuing delays increase sharply when moving from 1 core/1 thread to 16 cores/16 threads.]
Problem 4 – Pin Limitations
[Diagram: 16-core CMPs drawn with 8 and with 16 on-chip MCs.]
- Pin bandwidth is limited: the number of MCs cannot keep growing with core count
- A small number of MCs will have to handle all traffic
Problems Summary - I
- Pin limitations imply an increase in queuing delay
  - Almost 8x increase in queuing delays from single core/one thread to 16 cores/16 threads
- Multi-core implies an increase in row-buffer interference
  - Increasingly randomized memory access stream
  - Row-buffer hit rates are bound to go down
- Longer on- and off-chip wire delays imply an increase in NUMA factor
  - The NUMA factor is already around 1.5 today
Problems Summary - II
DRAM access time in systems with multiple on-chip MCs is governed by:
- Distance between the requesting core and the responding MC
- Load on the on-chip interconnect
- Average queuing delay at the responding MC
- Bank and rank contention at the target DIMM
- Row-buffer hit rate at the responding MC
Bottom line: intelligent management of data is required.
Adaptive First Touch Policy
Basic idea: assign each new virtual page to a DRAM (physical) page belonging to the MC j that minimizes the cost function

  cost_j = α × load_j + β × rowhits_j + λ × distance_j

where load_j is a measure of queuing delay, rowhits_j a measure of locality at the DRAM (row-buffer hits), and distance_j a measure of physical proximity. The constants α, β, and λ can be made programmable.
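As a concrete illustration, here is a minimal sketch of how a first-touch handler might evaluate this cost function. The per-MC dictionaries and their fields are hypothetical stand-ins for hardware counters, and the weights are the values listed on the methodology slide.

```python
# Weights from the methodology slide (alpha, beta, lambda = 10, 20, 100).
ALPHA, BETA, LAMBDA = 10, 20, 100

def aft_cost(mc, core):
    """Cost of placing a newly touched page at memory controller `mc` for `core`.
    Mirrors cost_j = alpha*load_j + beta*rowhits_j + lambda*distance_j from the
    slide; how the rowhits term is normalized is not spelled out here."""
    return (ALPHA * mc["load"]                 # queuing delay at this MC
            + BETA * mc["rowhits"]             # row-buffer locality term
            + LAMBDA * mc["distance"][core])   # hops from requesting core to MC

def first_touch_placement(mcs, core):
    """On a first touch, pick the MC index that minimizes the cost function."""
    return min(range(len(mcs)), key=lambda j: aft_cost(mcs[j], core))

# Hypothetical per-MC statistics; a real system would read these from counters.
mcs = [
    {"load": 40, "rowhits": 5, "distance": {0: 1, 1: 3}},
    {"load": 10, "rowhits": 2, "distance": {0: 4, 1: 2}},
]
print(first_touch_placement(mcs, core=0))   # index of the cheapest MC
```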
Dynamic Page Migration Policy
- Programs change phases!
  - They can completely stop touching new pages
  - They can change the frequency of access to a subset of pages
- This leads to imbalance in MC accesses
  - For long-running programs with varying working sets, AFT can leave some MCs overloaded
- Solution: dynamically migrate pages between MCs at runtime to decrease the imbalance
Dynamic Page Migration Policy
[Diagram: MC 3 is heavily loaded (the donor MC) and MC 2 is lightly loaded. N pages are selected at the donor, a recipient MC is selected, and the N pages are copied from the donor to the recipient MC.]
Dynamic Page Migration Policy - Challenges
- Selecting the recipient MC
  - Move pages to the MC with the least value of the cost function
    cost_k = Λ × distance_k + Γ × rowhits_k
  - Move pages to a physically proximal MC and minimize interference at the recipient MC
- Selecting N, the number of pages to migrate
  - Empirically select the best possible value
  - Can also be made programmable
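A rough sketch of one migration decision under these rules follows. The choice of donor by load, the hottest-pages heuristic, and the reading of distance_k as the donor-to-recipient hop count are assumptions, since the slide does not spell them out; the weights are the Λ and Γ values from the methodology slide.

```python
N_MIGRATE = 64                 # pages moved per decision (an assumed, tunable value)
CAP_LAMBDA, GAMMA = 100, 100   # Lambda and Gamma from the methodology slide

def recipient_cost(mc, donor):
    """cost_k = Lambda*distance_k + Gamma*rowhits_k, as on the slide; distance_k
    is assumed here to be the hop count between candidate MC k and the donor."""
    return CAP_LAMBDA * mc["distance"][donor] + GAMMA * mc["rowhits"]

def plan_migration(mcs):
    """Pick the most loaded MC as donor, the cheapest other MC as recipient,
    and (as an assumption) the donor's N most frequently accessed pages."""
    donor = max(range(len(mcs)), key=lambda j: mcs[j]["load"])
    recipient = min((j for j in range(len(mcs)) if j != donor),
                    key=lambda j: recipient_cost(mcs[j], donor))
    hot_first = sorted(mcs[donor]["pages"],
                       key=lambda p: p["accesses"], reverse=True)
    return donor, recipient, [p["vpn"] for p in hot_first[:N_MIGRATE]]
```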
Dynamic Page Migration Policy - Overheads
- Pages are physically copied to new addresses
  - The original address mapping has to be invalidated
  - Cache lines belonging to copied pages must be invalidated
- Copying pages can block resources, leading to unnecessary stalls
- Immediate TLB invalidates can cause misses even though the data is still present at the old location
- Solution: lazy copying, essentially a delayed write-back
Issues with TLB Invalidates
[Diagram: the OS copies pages A and B from the donor MC to the recipient MC and issues TLB invalidates right away; a subsequent read of the migrated page (A' -> A) stalls until the copy completes.]
Lazy Copying
[Diagram: the pages being copied (A, B) are first made read-only and their dirty cache lines are flushed; while the copy from the donor MC to the recipient MC is in flight, reads (A' -> A) are still served from the old location; once the copy completes, the TLBs are updated and later reads go to the new location.]
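The sketch below captures the lazy-copying idea as per-page bookkeeping; the class and state names are illustrative, not the paper's implementation.

```python
from enum import Enum, auto

class CopyState(Enum):
    STABLE    = auto()   # page lives at one MC, accessed normally
    COPYING   = auto()   # copy in flight; old mapping still serves accesses
    COMPLETED = auto()   # copy finished; TLBs updated to the new address

class MigratingPage:
    """Illustrative per-page bookkeeping for lazy copying."""
    def __init__(self, vpn, old_ppn):
        self.vpn, self.old_ppn, self.new_ppn = vpn, old_ppn, None
        self.state = CopyState.STABLE

    def start_copy(self, new_ppn):
        # The page is marked read-only and its dirty cache lines are flushed
        # before the background copy begins; readers are not stalled.
        self.new_ppn = new_ppn
        self.state = CopyState.COPYING

    def translate(self):
        # While the copy is in flight, keep translating to the old physical
        # page instead of stalling: the "delayed write-back" idea.
        return self.new_ppn if self.state is CopyState.COMPLETED else self.old_ppn

    def finish_copy(self):
        # Only now are TLB entries updated; later accesses go to the new MC.
        self.state = CopyState.COMPLETED
```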
Methodology
- Simics-based simulation platform; DRAMSim-based DRAM timing; DRAM energy figures from CACTI 6.5
- Baseline: assign pages to the closest MC
- CPU: 16-core out-of-order CMP, 3 GHz
- L1 inst. and data caches: private, 32 KB, 2-way, 1-cycle access
- L2 unified cache: shared, 2 MB, 8-way, 4x4 S-NUCA, 3-cycle bank access
- Total DRAM capacity: 4 GB
- DIMM configuration: 8 DIMMs, 1 rank/DIMM, 64-bit channel, 8 devices/DIMM
- α, β, λ, Λ, Γ: 10, 20, 100, 100, 100
Results - Throughput
[Chart: throughput improvement over the baseline - AFT: 17.1%, dynamic page migration: 34.8%.]
Results – DRAM Locality
[Chart: DRAM locality improvement - AFT: 16.6%, dynamic page migration: 22.7%. The standard deviation goes down, indicating increased fairness.]
Results – Reasons for Benefits
[Chart: breakdown of where the benefits come from.]
Sensitivity Studies
- Lazy copying helps a little: an average 3.2% improvement over migration without lazy copying
- Terms/variables in the cost function: very sensitive to load and row-buffer hit rates, much less so to distance
- Cost of TLB shootdowns: negligible, since they are fairly uncommon
- Physical placement of MCs (center or periphery): most workloads are agnostic to placement
Summary
- Multiple on-chip MCs will be common in future CMPs, with multiple cores sharing one MC
  - Intelligent data mapping will be needed to reduce average memory access delay
- Adaptive First Touch policy
  - Increases performance by 17.1%
  - Decreases DRAM energy consumption by 14.1%
- Dynamic page migration, an improvement on AFT
  - Further 17.7% improvement over AFT, 34.8% over the baseline
  - Increases energy consumption by 5.2%
Thank You
http://www.cs.utah.edu/arch-research