
1 Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross
Department of Computer Science, ETH Zurich

2 NUMA multicores
[Diagram: two processors (Processor 0 and Processor 1), each with cores, a cache, and a memory controller (MC) attached to local DRAM; the processors are linked by an interconnect (IC).]

3 NUMA multicores
Two problems. First, NUMA: interconnect overhead.
[Diagram: processes A and B run on Processor 0; A's memory (MA) is in local DRAM, while B's memory (MB) is in Processor 1's DRAM, so B's accesses cross the interconnect (IC).]

4 NUMA multicores
Two problems. First, NUMA: interconnect overhead. Second, multicore: cache contention.
[Diagram: as before, but A and B also share Processor 0's cache, adding contention on top of B's remote accesses.]

5 Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
  - N-MASS evaluation

6 Multi-clone experiments
- Intel Xeon E5520
- 4 clones of soplex (SPEC CPU2006): local clones and remote clones
- Captures the memory behavior of unrelated programs
[Diagram: the clones' memory (M) is allocated in Processor 0's DRAM; clones (C) run on cores of both processors, so clones on Processor 1 reach their data over the interconnect (IC).]
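
For readers who want to reproduce this kind of setup, here is a minimal sketch of how the clone placement could be scripted on Linux with numactl. The core IDs, the memory node, and the soplex invocation (./soplex ref.mps) are assumptions for illustration, not details from the talk.

    import subprocess

    MEM_NODE = 0            # all clones' memory lives on processor 0's DRAM
    LOCAL_CORES = [1, 3]    # two cores on processor 0 (IDs assumed)
    REMOTE_CORES = [4, 6]   # two cores on processor 1 (IDs assumed)

    procs = []
    for core in LOCAL_CORES + REMOTE_CORES:
        # Pin each clone to one core and bind all of its memory to node 0,
        # so the clones on processor 1 reach their data over the interconnect.
        cmd = ["numactl", "--physcpubind=%d" % core,
               "--membind=%d" % MEM_NODE, "./soplex", "ref.mps"]
        procs.append(subprocess.Popen(cmd))

    for p in procs:
        p.wait()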

7 [Diagram: the five possible schedules (1-5) of the four clones. As clones move from the processor holding their memory to the remote processor, the fraction of accesses served locally falls: local bandwidth 100%, 80%, 57%, 32%, and 0%.]

8 Performance of schedules
Which is the best schedule? Baseline: single-program execution mode.
[Diagram: a single clone (C) running alone with its memory (M) local.]
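
As a concrete reading of "slowdown relative to baseline": divide a clone's execution time in a multi-clone schedule by its time in single-program mode. A one-line helper expressing that interpretation (not the paper's code):

    def slowdown(t_shared, t_alone):
        # Slowdown relative to single-program execution mode:
        # 1.0 means no slowdown, 1.3 means 30% longer execution time.
        return t_shared / t_alone

    print(slowdown(130.0, 100.0))  # 1.3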

9 Execution time
[Chart: slowdown relative to the baseline for each schedule, broken down into local clones, remote clones, and the average.]

10 Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
  - N-MASS evaluation

11 N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
- Step 1: maximum-local mapping
- Step 2: cache-aware refinement

12 Step 1: Maximum-local mapping
[Diagram: processes A, B, C, and D are each placed on the processor whose DRAM holds their memory (MA, MB, MC, MD).]
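
A minimal sketch of what Step 1 amounts to, assuming each process has a known home node (the processor holding its memory) and each node exposes a list of free cores. The function and variable names are illustrative, not the paper's implementation.

    def maximum_local_mapping(processes, home_node, free_cores):
        # Step 1: put every process on the processor that holds its memory,
        # as long as that processor still has a free core; overflow spills
        # onto whichever other node has room.
        mapping = {}
        for p in processes:
            node = home_node[p]
            if not free_cores[node]:
                node = next(n for n, cores in free_cores.items() if cores)
            mapping[p] = (node, free_cores[node].pop())
        return mapping

    # Example: memories of A-D all live on node 0, two cores per node.
    placement = maximum_local_mapping(
        ["A", "B", "C", "D"],
        {"A": 0, "B": 0, "C": 0, "D": 0},
        {0: [0, 2], 1: [1, 3]})
    print(placement)  # A and B stay local; C and D overflow to node 1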

13 Default OS scheduling
[Diagram: the default scheduler spreads A, B, C, and D across both processors without regard to where their memories lie, so some processes access their memory remotely.]

14 N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
- Step 1: maximum-local mapping
- Step 2: cache-aware refinement

15 Step 2: Cache-aware refinement
In an SMP:
[Diagram: A and B share Processor 0's cache; C and D share Processor 1's cache.]

16 Step 2: Cache-aware refinement
In an SMP:
[Diagram: the placement is adjusted; a process is moved to the other cache to spread cache pressure.]

17 Step 2: Cache-aware refinement
In an SMP:
[Chart: performance degradation caused by A, B, C, and D; the refinement separates the most aggressive processes onto different caches.]

18 Step 2: Cache-aware refinement
In a NUMA:
[Diagram: the four processes as placed by Step 1, each on the processor that holds its memory.]

19 Step 2: Cache-aware refinement
In a NUMA:
[Diagram: a process is moved to the other processor to relieve cache pressure; its memory stays behind, so its accesses become remote.]

20 Step 2: Cache-aware refinement
In a NUMA: remote execution is acceptable only while the NUMA penalty stays within a NUMA allowance.
[Chart: performance degradation of A, B, C, and D with the NUMA allowance marked.]
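
A hedged sketch of the idea behind Step 2 in the NUMA case: move the most aggressive processes off the crowded cache, but only if the NUMA penalty each would pay for running away from its memory stays within the allowance. "Pressure" here anticipates the cache-pressure metric defined on the next slide; the balancing loop and threshold test are illustrative, not the paper's exact algorithm.

    def cache_aware_refinement(mapping, pressure, numa_penalty, allowance):
        # mapping: process -> node chosen by Step 1
        # pressure: process -> cache pressure it exerts
        # numa_penalty: process -> slowdown it suffers when run remotely
        load = {n: 0.0 for n in set(mapping.values())}
        for p, n in mapping.items():
            load[n] += pressure[p]
        hot = max(load, key=load.get)
        cold = min(load, key=load.get)
        # Consider the most aggressive processes on the hot cache first.
        for p in sorted((q for q in mapping if mapping[q] == hot),
                        key=lambda q: pressure[q], reverse=True):
            if load[hot] <= load[cold]:
                break                     # caches balanced, stop migrating
            if numa_penalty[p] <= allowance:
                mapping[p] = cold         # p now runs remotely from its memory
                load[hot] -= pressure[p]
                load[cold] += pressure[p]
        return mapping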

21 Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. Cache pressure: for local processes, the miss rate in misses per 1000 instructions (MPKI); for remote processes, MPKI x NUMA penalty
[Chart: NUMA penalty.]
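
In code form, the slide's cache-pressure rule could read as follows; the weighting of remote processes' MPKI by the NUMA penalty comes straight from the slide, while the function signature is invented.

    def cache_pressure(mpki, numa_penalty, runs_locally):
        # Local process: pressure is its cache miss rate (misses per 1000
        # instructions). Remote process: the same misses also cross the
        # interconnect, so MPKI is weighted by the NUMA penalty.
        return mpki if runs_locally else mpki * numa_penalty

    print(cache_pressure(10.0, 1.3, runs_locally=False))  # 13.0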

22 Implementation
- User-mode extension to the Linux scheduler
- Performance metrics: hardware performance counter feedback; NUMA penalty from perfect information (program traces) or an estimate based on MPKI
- All memory for a process is allocated on one processor

23 Outline
- NUMA: experimental evaluation
- Scheduling
  - N-MASS
  - N-MASS evaluation

24 Workloads
- SPEC CPU2006 subset
- 11 multi-program workloads (WL1–WL11): 4-program workloads (WL1–WL9) and 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the selected programs, ordered from CPU-bound to memory-bound.]

25 Memory allocation setup
Where the memory of each process is allocated influences performance.
Controlled setup: memory allocation maps.

26 Memory allocation maps
[Diagram: allocation map 0000. The memories of all four processes (MA, MB, MC, MD) reside in Processor 0's DRAM.]

27 Memory allocation maps
[Diagram: allocation map 0000 places all four memories in Processor 0's DRAM; allocation map 0011 places MA and MB on Processor 0 and MC and MD on Processor 1.]

28 Memory allocation maps
[Diagram: allocation map 0000 (unbalanced) versus allocation map 0011 (balanced).]
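
One way to read the 4-bit allocation maps, assuming bit i names the processor that holds process i's memory; this is an interpretation of the slides, not code from the paper.

    def parse_allocation_map(bits):
        # "0000" puts every memory on processor 0 (unbalanced);
        # "0011" splits the memories evenly across the two nodes (balanced).
        nodes = [int(b) for b in bits]
        balanced = nodes.count(0) == nodes.count(1)
        return nodes, balanced

    print(parse_allocation_map("0000"))  # ([0, 0, 0, 0], False)
    print(parse_allocation_map("0011"))  # ([0, 0, 1, 1], True)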

29 Evaluation
- Baseline: Linux average. The Linux scheduler is non-deterministic, so we take the average performance degradation over all possible cases.
- N-MASS runs with perfect NUMA penalty information.

30 WL9: Linux average
[Chart: average slowdown relative to single-program mode.]

31 WL9: N-MASS
[Chart: average slowdown relative to single-program mode.]

32 WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode.]

33 N-MASS performance
- N-MASS reduces performance degradation by up to 22%.
- Which factor matters more: interconnect overhead or cache contention?
- Compare maximum-local alone with N-MASS (maximum-local + cache-aware refinement step).

34 Data-locality vs. cache balancing (WL9)
[Chart: performance improvement relative to the Linux average.]

35 Data-locality vs. cache balancing (WL1)
[Chart: performance improvement relative to the Linux average.]

36 Data locality vs. cache balancing
- Data locality is more important than cache balancing.
- Cache balancing gives performance benefits mostly with unbalanced allocation maps.
- What if information about the NUMA penalty is not available?

37 Estimating NUMA penalty
The NUMA penalty is not directly measurable. Estimate: fit a linear regression onto the MPKI data.
[Chart: NUMA penalty versus MPKI with the fitted regression line.]
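
A minimal version of such an estimate using numpy's least-squares polynomial fit; the sample (MPKI, penalty) pairs below are invented placeholders, not the paper's measurements.

    import numpy as np

    # Measured (MPKI, NUMA penalty) pairs from profiled programs; the
    # numbers here are illustrative, the real values come from traces.
    mpki = np.array([0.5, 2.0, 5.0, 12.0, 20.0])
    penalty = np.array([1.01, 1.05, 1.12, 1.25, 1.40])

    # Fit penalty ~ a * MPKI + b, then use the line to estimate the
    # penalty of a process for which only the MPKI is known.
    a, b = np.polyfit(mpki, penalty, 1)
    print(a * 8.0 + b)  # estimated NUMA penalty at 8 misses per kilo-instruction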

38 Estimate-based N-MASS: performance
[Chart: performance improvement relative to the Linux average.]

39 Conclusions
- N-MASS: a NUMA- and multicore-aware scheduler.
- Data-locality optimizations are more beneficial than cache-contention avoidance.
- Better performance metrics are needed for scheduling.

40 Thank you! Questions?

