Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich
NUMA multicores
[Figure: two processors (Processor 0 and Processor 1), each with its cores, a shared cache, a memory controller (MC), and locally attached DRAM, connected by an interconnect (IC).]
NUMA multicores
Two problems. First, NUMA: interconnect overhead.
[Figure: processes A and B with their memories MA and MB; remote memory accesses cross the interconnect (IC).]
NUMA multicores
Two problems:
– NUMA: interconnect overhead
– multicore: cache contention
[Figure: processes A and B sharing a cache while their memories MA and MB reside in DRAM.]
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
Multi-clone experiments: memory behavior of unrelated programs
Intel Xeon E5520
4 clones of soplex (SPEC CPU2006):
– local clones
– remote clones
[Figure: the clones run on cores of Processor 0 and Processor 1 while their memory resides in one processor's DRAM, so local clones access it directly and remote clones through the interconnect.]
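As background, the sketch below shows how such a placement could be reproduced: the process is pinned to a core with sched_setaffinity(2) and its memory is placed on a chosen NUMA node with libnuma. This is not code from the talk; the core and node numbers and the buffer size are illustrative assumptions.

```c
/* Sketch: pin the current process to one core and place its heap memory
 * on a chosen NUMA node, mimicking a "local" or "remote" clone.
 * Core/node numbers are illustrative; link with -lnuma.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* Pin this process to core 0 (attached to processor/node 0). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* "Local clone": allocate on node 0; use node 1 for a "remote clone". */
    size_t bytes = 256UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 1, bytes);          /* touch the pages so they are placed */

    /* ... run the memory-bound kernel here ... */

    numa_free(buf, bytes);
    return 0;
}
```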
[Figure: five placements of the four clones across the two processors, varying how many run on the processor that holds their memory; local bandwidth is 100%, 80%, 57%, 32%, and 0% across the five schedules.]
Performance of schedules
Which is the best schedule?
Baseline: single-program execution mode
Execution time
[Chart: slowdown relative to the baseline for local clones, remote clones, and the average, for each schedule.]
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Figure: processes A, B, C, and D are mapped to the processor that holds their memory (MA, MB, MC, MD), so their memory accesses stay local.]
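A minimal sketch of what a maximum-local mapping step could look like: each process is placed on a core of the node that holds its memory, spilling to the other node only when the home node is full. The process list and the two-node topology constants are illustrative assumptions; the slide does not spell out the exact N-MASS policy.

```c
/* Sketch of a maximum-local mapping step: place each process on the
 * processor (NUMA node) that holds its memory, filling cores in order.
 * Process data and the two-node topology are illustrative assumptions.
 */
#include <stdio.h>

#define NODES 2
#define CORES_PER_NODE 4

struct proc {
    const char *name;
    int home_node;      /* node where the process' memory is allocated */
    int assigned_core;  /* filled in by the mapping step */
};

static void maximum_local_map(struct proc *p, int n)
{
    int next_core[NODES] = { 0 };

    for (int i = 0; i < n; i++) {
        int node = p[i].home_node;
        if (next_core[node] < CORES_PER_NODE) {
            /* A free core on the home node: keep all accesses local. */
            p[i].assigned_core = node * CORES_PER_NODE + next_core[node]++;
        } else {
            /* Home node full: spill to the other node (remote memory). */
            int other = 1 - node;
            p[i].assigned_core = other * CORES_PER_NODE + next_core[other]++;
        }
    }
}

int main(void)
{
    struct proc ps[] = {
        { "A", 0, -1 }, { "B", 0, -1 }, { "C", 1, -1 }, { "D", 1, -1 },
    };
    maximum_local_map(ps, 4);
    for (int i = 0; i < 4; i++)
        printf("%s -> core %d\n", ps[i].name, ps[i].assigned_core);
    return 0;
}
```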
Default OS scheduling
[Figure: a placement produced by the default OS scheduler, with processes B, A, D on one processor and C on the other, independent of where their memories MA–MD reside.]
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 2: Cache-aware refinement
In an SMP:
[Figure sequence: processes A–D and their memories MA–MD are rearranged between the two caches; a chart shows the resulting performance degradation and NUMA penalty for A–D.]
Step 2: Cache-aware refinement
In a NUMA system:
[Figure sequence: alternative placements of A–D across the two processors; a chart shows each process' performance degradation, its NUMA penalty, and the NUMA allowance.]
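A minimal sketch of a cache-aware refinement pass on top of the maximum-local mapping: if one cache is noticeably more loaded than the other, the process with the smallest NUMA penalty is migrated to the other processor, but only if that penalty stays within an allowance. The function names, data, and allowance value are illustrative assumptions; the exact N-MASS refinement rule is not reproduced here.

```c
/* Sketch of a cache-aware refinement pass: balance cache pressure by
 * migrating the process that suffers least from running remotely,
 * provided its NUMA penalty stays within an allowance.
 */
#include <stdio.h>

struct proc {
    const char *name;
    int node;            /* processor the process currently runs on   */
    double mpki;         /* cache misses per 1000 instructions        */
    double numa_penalty; /* slowdown factor if run remotely (>= 1.0)  */
};

static double cache_pressure(const struct proc *p, int n, int node)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        if (p[i].node == node)
            sum += p[i].mpki;
    return sum;
}

static void refine(struct proc *p, int n, double allowance)
{
    double p0 = cache_pressure(p, n, 0);
    double p1 = cache_pressure(p, n, 1);
    int from = (p0 > p1) ? 0 : 1;

    /* Pick the process on the loaded node that suffers least remotely. */
    int best = -1;
    for (int i = 0; i < n; i++)
        if (p[i].node == from &&
            (best < 0 || p[i].numa_penalty < p[best].numa_penalty))
            best = i;

    if (best >= 0 && p[best].numa_penalty <= allowance) {
        p[best].node = 1 - from;  /* migrate: trade locality for cache space */
        printf("migrating %s to node %d\n", p[best].name, p[best].node);
    }
}

int main(void)
{
    struct proc ps[] = {
        { "A", 0, 12.0, 1.6 }, { "B", 0, 10.0, 1.1 },
        { "C", 1,  0.5, 1.4 }, { "D", 1,  0.3, 1.3 },
    };
    refine(ps, 4, 1.2);           /* tolerate at most a 1.2x NUMA penalty */
    return 0;
}
```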
Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure:
– local processes: misses per kilo-instruction (MPKI)
– remote processes: MPKI × NUMA penalty
[Chart: NUMA penalty.]
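Read literally, the two contributions on this slide can be combined into a per-cache pressure term; the formula below is one way to write it. The weighting of remote processes by their NUMA penalty follows the slide, while the summation form itself is an assumption.

\[
\text{pressure}(c) \;=\; \sum_{i \in \text{local}(c)} \text{MPKI}_i \;+\; \sum_{j \in \text{remote}(c)} \text{MPKI}_j \cdot \text{penalty}_j
\]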
Implementation
User-mode extension to the Linux scheduler
Performance metrics:
– hardware performance counter feedback
– NUMA penalty: perfect information from program traces, or an estimate based on MPKI
All memory for a process is allocated on one processor.
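Hardware performance counter feedback of the kind mentioned here can be obtained on Linux through perf_event_open(2); whether the actual N-MASS prototype used this interface is not stated in the talk, so the sketch below is only an assumed mechanism for deriving MPKI from counters.

```c
/* Sketch: obtain the MPKI of the current process from hardware
 * performance counters via perf_event_open(2). Event choice and scaling
 * are simplified; a real scheduler extension would sample periodically.
 */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = 0: this process, cpu = -1: any CPU */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int miss_fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    int inst_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (miss_fd < 0 || inst_fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(miss_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(miss_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the workload under measurement runs here ... */
    volatile double x = 0;
    for (long i = 0; i < 10000000; i++) x += i;

    ioctl(miss_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0, insts = 0;
    if (read(miss_fd, &misses, sizeof(misses)) != sizeof(misses) ||
        read(inst_fd, &insts, sizeof(insts)) != sizeof(insts))
        return 1;

    if (insts > 0)
        printf("MPKI = %.3f\n", 1000.0 * (double)misses / (double)insts);
    return 0;
}
```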
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
Workloads
SPEC CPU2006 subset
11 multi-program workloads (WL1–WL11):
– 4-program workloads (WL1–WL9)
– 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the benchmarks, ranging from CPU-bound to memory-bound.]
Memory allocation setup
Where the memory of each process is allocated influences performance.
Controlled setup: memory allocation maps
Memory allocation maps
[Figure: allocation map 0000 places the memories MA, MB, MC, and MD of all four processes on Processor 0.]
Memory allocation maps
[Figure: allocation map 0000 versus 0011; map 0011 places MA and MB on Processor 0 and MC and MD on Processor 1.]
Memory allocation maps
[Figure: map 0000 is unbalanced (all memory on one processor); map 0011 is balanced (memory split between the processors).]
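One way to realize such an allocation map in a controlled setup is to read one digit per process and allocate that process' memory on the indicated node. The interpretation of the digits and the sizes are assumptions for illustration; numa_alloc_onnode is a real libnuma call.

```c
/* Sketch: realize an "allocation map" such as 0011 by allocating each
 * process' memory on the node given by its digit (one process per digit).
 * The map string and sizes are illustrative.  Link with -lnuma.
 */
#include <numa.h>
#include <stdio.h>

/* Allocate `bytes` for process number `idx` according to `map`. */
static void *alloc_by_map(const char *map, int idx, size_t bytes)
{
    int node = map[idx] - '0';           /* '0' -> node 0, '1' -> node 1 */
    return numa_alloc_onnode(bytes, node);
}

int main(void)
{
    if (numa_available() < 0) return 1;

    const char *map = "0011";             /* balanced map from the slides */
    size_t bytes = 64UL * 1024 * 1024;

    for (int i = 0; i < 4; i++) {
        void *mem = alloc_by_map(map, i, bytes);
        printf("process %c: memory on node %c (%p)\n", 'A' + i, map[i], mem);
        numa_free(mem, bytes);
    }
    return 0;
}
```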
Evaluation
Baseline: Linux average
– the Linux scheduler is non-deterministic
– so the baseline is the average performance degradation over all possible cases
Compared against: N-MASS with perfect NUMA penalty information
WL9: Linux average
[Chart: average slowdown relative to single-program mode.]
WL9: N-MASS
[Chart: average slowdown relative to single-program mode.]
WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode.]
N-MASS performance
N-MASS reduces performance degradation by up to 22%.
Which factor is more important: interconnect overhead or cache contention?
Compare:
– maximum-local
– N-MASS (maximum-local + cache-aware refinement step)
Data-locality vs. cache balancing (WL9)
[Chart: performance improvement relative to the Linux average.]
Data-locality vs. cache balancing (WL1)
[Chart: performance improvement relative to the Linux average.]
Data locality vs. cache balancing
Data locality is more important than cache balancing.
Cache balancing gives performance benefits mostly with unbalanced allocation maps.
What if information about the NUMA penalty is not available?
Estimating the NUMA penalty
The NUMA penalty is not directly measurable.
Estimate: fit a linear regression onto MPKI data.
[Chart: NUMA penalty.]
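A minimal sketch of such an estimate, assuming an ordinary least-squares fit of measured (MPKI, penalty) pairs; the calibration data below is made up, and the slide does not specify the exact regression used.

```c
/* Sketch: estimate the NUMA penalty of a program from its MPKI by
 * fitting a least-squares line through (MPKI, penalty) points measured
 * for reference benchmarks.  The sample data is hypothetical.
 */
#include <stdio.h>

struct sample { double mpki, penalty; };

/* Ordinary least squares: penalty ~= a * mpki + b. */
static void fit_line(const struct sample *s, int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += s[i].mpki;
        sy  += s[i].penalty;
        sxx += s[i].mpki * s[i].mpki;
        sxy += s[i].mpki * s[i].penalty;
    }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *a * sx) / n;
}

int main(void)
{
    /* Hypothetical calibration points: (MPKI, measured NUMA penalty). */
    struct sample cal[] = {
        { 0.5, 1.02 }, { 2.0, 1.08 }, { 8.0, 1.25 }, { 20.0, 1.60 },
    };
    double a, b;
    fit_line(cal, 4, &a, &b);

    double mpki = 12.0;                        /* observed at run time */
    printf("estimated NUMA penalty: %.2f\n", a * mpki + b);
    return 0;
}
```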
Estimate-based N-MASS: performance
[Chart: performance improvement relative to the Linux average.]
Conclusions
N-MASS: a NUMA-multicore-aware scheduler
Data locality optimizations are more beneficial than cache contention avoidance.
Better performance metrics are needed for scheduling.
Thank you! Questions?