Published by Sherman Vincent Arnold. Modified over 9 years ago.
1
Memory System Performance in a NUMA Multicore Multiprocessor
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich
2
Summary
NUMA multicore systems are unfair to local memory accesses
Local execution is sometimes suboptimal
3
Outline
NUMA multicores: how it happened
Experimental evaluation: Intel Nehalem
Bandwidth sharing model
The next generation: Intel Westmere
4
NUMA multicores: how it happened
First generation: SMP
[Diagram labels: cores 0–3 and 4–7, BusC, Northbridge, MC, DRAM memory]
5
NUMA multicores: how it happened
Next generation: NUMA
[Diagram labels: cores 0–3 and 4–7, BusC, Northbridge, MC, DRAM memory, IC]
6
NUMA multicores: how it happened
Next generation: NUMA
[Diagram labels: cores 0–3 and 4–7, MC, DRAM memory, IC]
8
Bandwidth sharing
Frequent scenario: bandwidth shared between cores
Sharing model for the Intel Nehalem
[Diagram labels: cores 0–3 and 4–7, MC, DRAM memory, IC]
10
Evaluation system
Intel Nehalem E5520
2 x 4 cores
8 MB level 3 cache
12 GB DDR3 RAM
5.86 GT/s QPI
[Diagram labels: Processor 0 and Processor 1, cores 0–3 and 4–7, Level 3 cache, Global Queue, MC, QPI, DRAM memory]
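The QPI transfer rate can be turned into a rough peak-bandwidth figure. The sketch below assumes 2 bytes of payload per transfer in each direction, which is the usual figure quoted for a QPI link but is not stated on the slide; the function name `qpi_gbs_per_direction` is made up for illustration.

```c
#include <assert.h>

/* Assumption (not from the slides): a QPI link carries 16 data bits,
 * i.e. 2 bytes, per transfer in each direction. */
#define QPI_BYTES_PER_TRANSFER 2.0

/* Theoretical peak bandwidth of one QPI link in GB/s, per direction,
 * given its transfer rate in GT/s. */
static double qpi_gbs_per_direction(double gigatransfers_per_s)
{
    assert(gigatransfers_per_s > 0.0);
    return gigatransfers_per_s * QPI_BYTES_PER_TRANSFER;
}
```

Under this assumption, the 5.86 GT/s link of the evaluation system peaks at 11.72 GB/s per direction. This is an upper bound on what remote clones can draw across the interconnect, not a measured value.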
11
Bandwidth sharing: local accesses
[Diagram labels: Processor 0 and Processor 1, cores 0–3 and 4–7, Level 3 cache, Global Queue, MC, QPI, DRAM memory]
12
Bandwidth sharing: remote accesses
[Diagram labels: Processor 0 and Processor 1, cores 0–3 and 4–7, Level 3 cache, Global Queue, MC, QPI, DRAM memory]
13
Bandwidth sharing: combined accesses
[Diagram labels: Processor 0 and Processor 1, cores 0–3 and 4–7, Level 3 cache, Global Queue, MC, QPI, DRAM memory]
14
Mechanism to arbitrate between different types of memory accesses
We look at fairness of the Global Queue:
– local memory accesses
– remote memory accesses
– combined memory accesses
15
Benchmark program
STREAM triad:
    for (i = 0; i < SIZE; i++) {
        a[i] = b[i] + SCALAR * c[i];
    }
Multiple co-executing triad clones
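A minimal sketch of how one triad pass relates to a bandwidth number. It assumes `double` elements (the slide does not give the element type) and counts two reads plus one write per element, as STREAM does for triad. The helper names `triad_bytes` and `bandwidth_gbs` are hypothetical, and the timing itself is left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* One pass of the triad kernel over n elements. */
static void triad(double *a, const double *b, const double *c,
                  double scalar, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + scalar * c[i];
}

/* Bytes moved by one pass: b[i] and c[i] are read, a[i] is written. */
static double triad_bytes(size_t n)
{
    return 3.0 * (double)sizeof(double) * (double)n;
}

/* Convert bytes moved in a measured interval to GB/s (10^9 bytes/s). */
static double bandwidth_gbs(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}
```

A clone would time repeated calls to `triad` and report `bandwidth_gbs(passes * triad_bytes(n), elapsed)`. The arrays need to be much larger than the level 3 cache for the result to reflect DRAM traffic.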
16
Multi-clone experiments
All memory allocated on Processor 0
Local clones: run on Processor 0
Remote clones: run on Processor 1
Example benchmark configurations: (2L, 0R), (0L, 3R), (2L, 3R)
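The (xL, yR) notation can be sketched as a mapping from clone index to core. This assumes cores 0–3 belong to Processor 0 and cores 4–7 to Processor 1, as the diagrams suggest. The struct and function names are hypothetical, and the actual thread pinning and memory placement (for example through the operating system's NUMA API) are left out.

```c
#include <assert.h>

enum { CORES_PER_PROCESSOR = 4 };

/* A benchmark configuration (xL, yR): x local and y remote clones. */
struct config {
    int local;   /* clones on Processor 0, next to the memory */
    int remote;  /* clones on Processor 1, reaching memory over QPI */
};

/* Core on which clone k runs: local clones fill cores 0..3 first,
 * remote clones then fill cores 4..7. */
static int core_for_clone(struct config cfg, int k)
{
    assert(cfg.local >= 0 && cfg.local <= CORES_PER_PROCESSOR);
    assert(cfg.remote >= 0 && cfg.remote <= CORES_PER_PROCESSOR);
    assert(k >= 0 && k < cfg.local + cfg.remote);
    if (k < cfg.local)
        return k;
    return CORES_PER_PROCESSOR + (k - cfg.local);
}

/* A core is remote if it sits on the processor that does not own the memory. */
static int is_remote_core(int core)
{
    return core >= CORES_PER_PROCESSOR;
}
```

For (2L, 3R) this places the clones on cores 0, 1, 4, 5, and 6.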
17
GQ fairness: local accesses
[Figure: Total bandwidth [GB/s]]
18
GQ fairness: remote accesses
[Figure: Total bandwidth [GB/s]]
19
Global Queue fairness
The Global Queue is fair when there are only local or only remote accesses in the system
What about combined accesses?
20
GQ fairness: combined accesses
Execute clones in all possible configurations
[Grid: # local clones (0–4) by # remote clones (0–4); example cell: (2L, 3R)]
22
GQ fairness: combined accesses
[Figure: Total bandwidth [GB/s]]
24
Combined accesses
[Figure: Total bandwidth [GB/s]]
25
Combined accesses
In configuration (4L, 1R), the remote clone gets 30% more bandwidth than a local clone
Remote execution can be better than local
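A rough worked example of what the 30% figure implies for per-clone shares. It assumes, beyond what the slide states, that all local clones receive equal bandwidth and that the clones together account for the whole measured total. The names `clone_shares` and `remote_to_local` are made up for illustration.

```c
#include <assert.h>

/* Fraction of the total bandwidth received by each kind of clone. */
struct shares {
    double per_local;
    double per_remote;
};

/* Split the total so that each remote clone gets remote_to_local times
 * the bandwidth of a local clone (1.3 for "30% more"). */
static struct shares clone_shares(int n_local, int n_remote,
                                  double remote_to_local)
{
    assert(n_local >= 0 && n_remote >= 0 && n_local + n_remote > 0);
    assert(remote_to_local > 0.0);
    /* Measure the total in units of one local clone's bandwidth. */
    double units = n_local + remote_to_local * n_remote;
    struct shares s = { 1.0 / units, remote_to_local / units };
    return s;
}
```

With `clone_shares(4, 1, 1.3)`, each local clone gets about 18.9% of the total and the remote clone about 24.5%, so the single remote clone is ahead even though it crosses the interconnect.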
27
Bandwidth sharing model
[Diagram labels: cores 0–3 and 4–7, Level 3 cache, Global Queue, IMC, QPI, DRAM memory, clones (C)]
28
Sharing factor ( )
Characterizes the fairness of the Global Queue
Dependence of sharing factor on contention?
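The transcript does not preserve the formula for the sharing factor, so the sketch below is only one plausible reading, not the authors' definition: it assumes the factor is the fraction of the total bandwidth that goes to local clones. The function name `local_share` and any inputs passed to it are illustrative, not measurements from the talk.

```c
#include <assert.h>

/* ASSUMED definition, not taken from the slides: the share of the
 * combined bandwidth that is served to local clones. */
static double local_share(double local_gbs, double remote_gbs)
{
    assert(local_gbs >= 0.0 && remote_gbs >= 0.0);
    assert(local_gbs + remote_gbs > 0.0);
    return local_gbs / (local_gbs + remote_gbs);
}
```

Under this reading, a value of 0.5 means local and remote traffic split the bandwidth evenly, and smaller values mean the Global Queue gives local accesses less than half.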
29
Contention affects sharing factor
[Diagram labels: DRAM, Processor 0, clones (C), QPI, contenders]
30
Contention affects sharing factor
[Figure: Sharing factor ( )]
31
Combined accesses
[Figure: Total bandwidth [GB/s]]
32
Contention affects sharing factor
Sharing factor decreases with contention
With local contention, remote execution becomes more favorable
34
The next generation
Intel Westmere X5680
2 x 6 cores
12 MB level 3 cache
144 GB DDR3 RAM
6.4 GT/s QPI
[Diagram labels: cores 0–B, Level 3 cache, Global Queue, IMC, QPI, DRAM memory]
35
The next generation
[Figure: Total bandwidth [GB/s]]
36
Conclusions
Optimizing for data locality can be suboptimal
Applications:
– OS scheduling (see ISMM’11 paper)
– data placement and computation scheduling
37
Thank you! Questions?