Memory Management in NUMA Multicore Systems: Trapped between Cache Contention and Interconnect Overhead
Zoltan Majo and Thomas R. Gross, Department of Computer Science, ETH Zurich
NUMA multicores
[Figure: two processors (Processor 0 and Processor 1), each with its cores, a shared cache, a memory controller (MC), and locally attached DRAM, connected by an interconnect (IC).]
NUMA multicores
Two problems. First, NUMA: interconnect overhead.
[Figure: processes A and B with their memories MA and MB; remote memory accesses cross the interconnect (IC).]
NUMA multicores
Two problems:
– NUMA: interconnect overhead
– multicore: cache contention
[Figure: processes A and B sharing a cache while their memories MA and MB reside in DRAM.]
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
Multi-clone experiments: memory behavior of unrelated programs
Intel Xeon E5520
4 clones of soplex (SPEC CPU2006):
– local clones
– remote clones
[Figure: the clones run on cores of Processor 0 and Processor 1 while their memory resides in one processor's DRAM, so local clones access it directly and remote clones through the interconnect.]
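As background, the sketch below shows how such a placement could be reproduced: the process is pinned to a core with sched_setaffinity(2) and its memory is placed on a chosen NUMA node with libnuma. This is not code from the talk; the core and node numbers and the buffer size are illustrative assumptions.

```c
/* Sketch: pin the current process to one core and place its heap memory
 * on a chosen NUMA node, mimicking a "local" or "remote" clone.
 * Core/node numbers are illustrative; link with -lnuma.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma not available\n");
        return 1;
    }

    /* Pin this process to core 0 (attached to processor/node 0). */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* "Local clone": allocate on node 0; use node 1 for a "remote clone". */
    size_t bytes = 256UL * 1024 * 1024;
    void *buf = numa_alloc_onnode(bytes, 0);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }
    memset(buf, 1, bytes);          /* touch the pages so they are placed */

    /* ... run the memory-bound kernel here ... */

    numa_free(buf, bytes);
    return 0;
}
```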
[Figure: five placements of the four clones across the two processors, varying how many run on the processor that holds their memory; local bandwidth is 100%, 80%, 57%, 32%, and 0% across the five schedules.]
Performance of schedules
Which is the best schedule?
Baseline: single-program execution mode
Execution time
[Chart: slowdown relative to the baseline for local clones, remote clones, and the average, for each schedule.]
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 1: Maximum-local mapping
[Figure: processes A, B, C, and D are mapped to the processor that holds their memory (MA, MB, MC, MD), so their memory accesses stay local.]
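A minimal sketch of what a maximum-local mapping step could look like: each process is placed on a core of the node that holds its memory, spilling to the other node only when the home node is full. The process list and the two-node topology constants are illustrative assumptions; the slide does not spell out the exact N-MASS policy.

```c
/* Sketch of a maximum-local mapping step: place each process on the
 * processor (NUMA node) that holds its memory, filling cores in order.
 * Process data and the two-node topology are illustrative assumptions.
 */
#include <stdio.h>

#define NODES 2
#define CORES_PER_NODE 4

struct proc {
    const char *name;
    int home_node;      /* node where the process' memory is allocated */
    int assigned_core;  /* filled in by the mapping step */
};

static void maximum_local_map(struct proc *p, int n)
{
    int next_core[NODES] = { 0 };

    for (int i = 0; i < n; i++) {
        int node = p[i].home_node;
        if (next_core[node] < CORES_PER_NODE) {
            /* A free core on the home node: keep all accesses local. */
            p[i].assigned_core = node * CORES_PER_NODE + next_core[node]++;
        } else {
            /* Home node full: spill to the other node (remote memory). */
            int other = 1 - node;
            p[i].assigned_core = other * CORES_PER_NODE + next_core[other]++;
        }
    }
}

int main(void)
{
    struct proc ps[] = {
        { "A", 0, -1 }, { "B", 0, -1 }, { "C", 1, -1 }, { "D", 1, -1 },
    };
    maximum_local_map(ps, 4);
    for (int i = 0; i < 4; i++)
        printf("%s -> core %d\n", ps[i].name, ps[i].assigned_core);
    return 0;
}
```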
Default OS scheduling
[Figure: a placement produced by the default OS scheduler, with processes B, A, D on one processor and C on the other, independent of where their memories MA–MD reside.]
N-MASS (NUMA-Multicore-Aware Scheduling Scheme)
Two steps:
– Step 1: maximum-local mapping
– Step 2: cache-aware refinement
Step 2: Cache-aware refinement
In an SMP:
[Figure sequence: processes A–D and their memories MA–MD are rearranged between the two caches; a chart shows the resulting performance degradation and NUMA penalty for A–D.]
Step 2: Cache-aware refinement
In a NUMA system:
[Figure sequence: alternative placements of A–D across the two processors; a chart shows each process' performance degradation, its NUMA penalty, and the NUMA allowance.]
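A minimal sketch of a cache-aware refinement pass on top of the maximum-local mapping: if one cache is noticeably more loaded than the other, the process with the smallest NUMA penalty is migrated to the other processor, but only if that penalty stays within an allowance. The function names, data, and allowance value are illustrative assumptions; the exact N-MASS refinement rule is not reproduced here.

```c
/* Sketch of a cache-aware refinement pass: balance cache pressure by
 * migrating the process that suffers least from running remotely,
 * provided its NUMA penalty stays within an allowance.
 */
#include <stdio.h>

struct proc {
    const char *name;
    int node;            /* processor the process currently runs on   */
    double mpki;         /* cache misses per 1000 instructions        */
    double numa_penalty; /* slowdown factor if run remotely (>= 1.0)  */
};

static double cache_pressure(const struct proc *p, int n, int node)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        if (p[i].node == node)
            sum += p[i].mpki;
    return sum;
}

static void refine(struct proc *p, int n, double allowance)
{
    double p0 = cache_pressure(p, n, 0);
    double p1 = cache_pressure(p, n, 1);
    int from = (p0 > p1) ? 0 : 1;

    /* Pick the process on the loaded node that suffers least remotely. */
    int best = -1;
    for (int i = 0; i < n; i++)
        if (p[i].node == from &&
            (best < 0 || p[i].numa_penalty < p[best].numa_penalty))
            best = i;

    if (best >= 0 && p[best].numa_penalty <= allowance) {
        p[best].node = 1 - from;  /* migrate: trade locality for cache space */
        printf("migrating %s to node %d\n", p[best].name, p[best].node);
    }
}

int main(void)
{
    struct proc ps[] = {
        { "A", 0, 12.0, 1.6 }, { "B", 0, 10.0, 1.1 },
        { "C", 1,  0.5, 1.4 }, { "D", 1,  0.3, 1.3 },
    };
    refine(ps, 4, 1.2);           /* tolerate at most a 1.2x NUMA penalty */
    return 0;
}
```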
Performance factors
Two factors cause performance degradation:
1. NUMA penalty: slowdown due to remote memory access
2. cache pressure:
– local processes: misses per kilo-instruction (MPKI)
– remote processes: MPKI × NUMA penalty
[Chart: NUMA penalty.]
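Read literally, the two contributions on this slide can be combined into a per-cache pressure term; the formula below is one way to write it. The weighting of remote processes by their NUMA penalty follows the slide, while the summation form itself is an assumption.

\[
\text{pressure}(c) \;=\; \sum_{i \in \text{local}(c)} \text{MPKI}_i \;+\; \sum_{j \in \text{remote}(c)} \text{MPKI}_j \cdot \text{penalty}_j
\]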
Implementation
User-mode extension to the Linux scheduler
Performance metrics:
– hardware performance counter feedback
– NUMA penalty: perfect information from program traces, or an estimate based on MPKI
All memory for a process is allocated on one processor.
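Hardware performance counter feedback of the kind mentioned here can be obtained on Linux through perf_event_open(2); whether the actual N-MASS prototype used this interface is not stated in the talk, so the sketch below is only an assumed mechanism for deriving MPKI from counters.

```c
/* Sketch: obtain the MPKI of the current process from hardware
 * performance counters via perf_event_open(2). Event choice and scaling
 * are simplified; a real scheduler extension would sample periodically.
 */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static int open_counter(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = config;
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    /* pid = 0: this process, cpu = -1: any CPU */
    return (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int miss_fd = open_counter(PERF_COUNT_HW_CACHE_MISSES);
    int inst_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
    if (miss_fd < 0 || inst_fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(miss_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(miss_fd, PERF_EVENT_IOC_ENABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the workload under measurement runs here ... */
    volatile double x = 0;
    for (long i = 0; i < 10000000; i++) x += i;

    ioctl(miss_fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(inst_fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0, insts = 0;
    if (read(miss_fd, &misses, sizeof(misses)) != sizeof(misses) ||
        read(inst_fd, &insts, sizeof(insts)) != sizeof(insts))
        return 1;

    if (insts > 0)
        printf("MPKI = %.3f\n", 1000.0 * (double)misses / (double)insts);
    return 0;
}
```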
Outline
NUMA: experimental evaluation
Scheduling
– N-MASS
– N-MASS evaluation
Workloads
SPEC CPU2006 subset
11 multi-program workloads (WL1–WL11):
– 4-program workloads (WL1–WL9)
– 8-program workloads (WL10, WL11)
[Chart: NUMA penalty of the benchmarks, ranging from CPU-bound to memory-bound.]
Memory allocation setup
Where the memory of each process is allocated influences performance.
Controlled setup: memory allocation maps
Memory allocation maps
[Figure: allocation map 0000 places the memories MA, MB, MC, and MD of all four processes on Processor 0.]
Memory allocation maps
[Figure: allocation map 0000 versus 0011; map 0011 places MA and MB on Processor 0 and MC and MD on Processor 1.]
Memory allocation maps
[Figure: map 0000 is unbalanced (all memory on one processor); map 0011 is balanced (memory split between the processors).]
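One way to realize such an allocation map in a controlled setup is to read one digit per process and allocate that process' memory on the indicated node. The interpretation of the digits and the sizes are assumptions for illustration; numa_alloc_onnode is a real libnuma call.

```c
/* Sketch: realize an "allocation map" such as 0011 by allocating each
 * process' memory on the node given by its digit (one process per digit).
 * The map string and sizes are illustrative.  Link with -lnuma.
 */
#include <numa.h>
#include <stdio.h>

/* Allocate `bytes` for process number `idx` according to `map`. */
static void *alloc_by_map(const char *map, int idx, size_t bytes)
{
    int node = map[idx] - '0';           /* '0' -> node 0, '1' -> node 1 */
    return numa_alloc_onnode(bytes, node);
}

int main(void)
{
    if (numa_available() < 0) return 1;

    const char *map = "0011";             /* balanced map from the slides */
    size_t bytes = 64UL * 1024 * 1024;

    for (int i = 0; i < 4; i++) {
        void *mem = alloc_by_map(map, i, bytes);
        printf("process %c: memory on node %c (%p)\n", 'A' + i, map[i], mem);
        numa_free(mem, bytes);
    }
    return 0;
}
```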
Evaluation
Baseline: Linux average
– the Linux scheduler is non-deterministic
– so the baseline is the average performance degradation over all possible cases
Compared against: N-MASS with perfect NUMA penalty information
WL9: Linux average
[Chart: average slowdown relative to single-program mode.]
WL9: N-MASS
[Chart: average slowdown relative to single-program mode.]
WL1: Linux average and N-MASS
[Chart: average slowdown relative to single-program mode.]
N-MASS performance
N-MASS reduces performance degradation by up to 22%.
Which factor is more important: interconnect overhead or cache contention?
Compare:
– maximum-local
– N-MASS (maximum-local + cache-aware refinement step)
Data-locality vs. cache balancing (WL9)
[Chart: performance improvement relative to the Linux average.]
Data-locality vs. cache balancing (WL1)
[Chart: performance improvement relative to the Linux average.]
Data locality vs. cache balancing
Data locality is more important than cache balancing.
Cache balancing gives performance benefits mostly with unbalanced allocation maps.
What if information about the NUMA penalty is not available?
Estimating the NUMA penalty
The NUMA penalty is not directly measurable.
Estimate: fit a linear regression onto MPKI data.
[Chart: NUMA penalty.]
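A minimal sketch of such an estimate, assuming an ordinary least-squares fit of measured (MPKI, penalty) pairs; the calibration data below is made up, and the slide does not specify the exact regression used.

```c
/* Sketch: estimate the NUMA penalty of a program from its MPKI by
 * fitting a least-squares line through (MPKI, penalty) points measured
 * for reference benchmarks.  The sample data is hypothetical.
 */
#include <stdio.h>

struct sample { double mpki, penalty; };

/* Ordinary least squares: penalty ~= a * mpki + b. */
static void fit_line(const struct sample *s, int n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += s[i].mpki;
        sy  += s[i].penalty;
        sxx += s[i].mpki * s[i].mpki;
        sxy += s[i].mpki * s[i].penalty;
    }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *a * sx) / n;
}

int main(void)
{
    /* Hypothetical calibration points: (MPKI, measured NUMA penalty). */
    struct sample cal[] = {
        { 0.5, 1.02 }, { 2.0, 1.08 }, { 8.0, 1.25 }, { 20.0, 1.60 },
    };
    double a, b;
    fit_line(cal, 4, &a, &b);

    double mpki = 12.0;                        /* observed at run time */
    printf("estimated NUMA penalty: %.2f\n", a * mpki + b);
    return 0;
}
```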
Estimate-based N-MASS: performance
[Chart: performance improvement relative to the Linux average.]
Conclusions
N-MASS: a NUMA-multicore-aware scheduler
Data locality optimizations are more beneficial than cache contention avoidance.
Better performance metrics are needed for scheduling.
Thank you! Questions?