(Mis)Understanding the NUMA Memory System Performance of Multithreaded Workloads
Zoltán Majó, Thomas R. Gross
Department of Computer Science, ETH Zurich, Switzerland
NUMA-multicore memory system
[Diagram: two processors, each with eight cores sharing a last-level cache, a memory controller (MC), an interconnect (IC), and locally attached DRAM; a thread T accesses the memory system.]
LOCAL_CACHE: 38 cycles
LOCAL_DRAM: 190 cycles
REMOTE_CACHE: 186 cycles
REMOTE_DRAM: 310 cycles
All data based on experimental evaluation of Intel Nehalem (Hackenberg [MICRO '09], Molka [PACT '09])
Experimental setup
Three benchmark programs from PARSEC: streamcluster, ferret, and dedup
Increased input sizes: more pressure on the memory system
Intel Westmere: 4 processors, 32 cores
3 execution scenarios:
w/o NUMA: Sequential
w/o NUMA: Parallel (8 cores / 1 processor)
w/ NUMA: Parallel (32 cores / 4 processors)
Execution scenarios
[Diagram: four processors (0-3), each with eight cores, a last-level cache, a memory controller (MC), an interconnect (IC), and local DRAM; threads T run on all 32 cores.]
Parallel performance
[Chart: parallel speedup of streamcluster, ferret, and dedup.]
CPU cycle breakdown
dedup: good scaling (26X); streamcluster: poor scaling (11X)
[Chart: CPU cycle breakdown for the benchmarks.]
Outline
Introduction
Performance analysis
  Data locality
  Prefetcher effectiveness
Source-level optimizations
Performance evaluation
Conclusions
Data locality
Page placement policy
Commonly used policy: first-touch (default in Linux)
Measurement: data locality of the benchmarks
Data locality = Remote memory references / Total memory references [%]
Read transfers measured at the processor's uncore
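To illustrate the first-touch policy mentioned above, here is a minimal sketch (an assumption-laden illustration, not code from the talk): the calling thread is pinned to a core, a freshly mapped buffer is touched from that thread, and move_pages() is then asked which NUMA node the page landed on; under first-touch it should be the toucher's local node. Compile with -lnuma.

    #define _GNU_SOURCE
    #include <numaif.h>    /* move_pages (link with -lnuma) */
    #include <sched.h>     /* sched_setaffinity, sched_getcpu */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Pin the calling thread to core 0 (illustrative choice). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof set, &set);

        /* Map memory without touching it: no physical pages are placed yet. */
        size_t sz = 4096 * 1024;
        char *buf = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* First touch: writing from this pinned thread makes the kernel back
         * the pages with frames from this core's local NUMA node. */
        memset(buf, 1, sz);

        /* Ask where the first page ended up (nodes == NULL: query only). */
        void *page = buf;
        int   node = -1;
        move_pages(0, 1, &page, NULL, &node, 0);
        printf("core %d touched the page; it resides on node %d\n",
               sched_getcpu(), node);

        munmap(buf, sz);
        return 0;
    }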
NUMA-multicore memory system
[Diagram: the memory system extended to four processors; a thread T's accesses may be served from LOCAL_CACHE, LOCAL_DRAM, REMOTE_CACHE, or REMOTE_DRAM.]
Data locality
[Chart: data locality of the benchmarks.]
Inter-processor data sharing
Cause of data sharing:
streamcluster: data points to be clustered
ferret and dedup: in-memory databases
Prefetcher performance
Experiment: run each benchmark with the prefetcher on/off and compare performance
Causes of prefetcher inefficiency:
ferret and dedup: hash-based memory access patterns
streamcluster: random shuffling
streamcluster: random shuffling

    while (input = read_data_points()) {
        clusters = process(input);
    }

Randomly shuffle data points to increase the probability that each point is compared to each cluster.
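A minimal sketch of what a pointer-based shuffle looks like, assuming a simplified Point layout (the identifiers Point, coord, and shuffle_pointers are illustrative, not streamcluster's actual code): only the point descriptors are permuted with a Fisher-Yates pass, while the coordinate blocks they reference stay in their original allocation order, which is what makes the later traversal hard to prefetch.

    #include <stdlib.h>

    /* Illustrative point descriptor: refers to a block inside one large,
     * contiguous coordinate array. */
    typedef struct {
        float *coord;   /* this point's coordinates */
        int    dim;     /* number of dimensions */
    } Point;

    /* Pointer-based shuffle (Fisher-Yates): permutes only the descriptors.
     * The coordinate blocks keep their original memory order, so iterating
     * over points[] afterwards touches the coordinate data in a scattered,
     * prefetcher-unfriendly pattern. */
    void shuffle_pointers(Point *points, size_t n)
    {
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % (i + 1);
            Point tmp = points[i];
            points[i] = points[j];
            points[j] = tmp;
        }
    }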
streamcluster: prefetcher effectiveness
[Diagram: original data layout (before shuffling): points A-H and their coordinates in matching, contiguous order, accessed by threads T0 and T1.]
streamcluster: prefetcher effectiveness
[Diagram: data layout after the pointer-based shuffle: the point descriptors are permuted while the coordinates A-H keep their original order, so a thread's traversal of the coordinate data is no longer sequential.]
Outline
Introduction
Performance analysis
  Data locality
  Prefetcher effectiveness
Source-level optimizations
  Prefetching
  Data locality
Performance evaluation
Conclusions
streamcluster: optimizing prefetching
Copy-based shuffle
Performance improvement over pointer-based shuffle:
Westmere: 12%
Nehalem: 60%
[Diagram: after the copy-based shuffle, the coordinates are copied into the shuffled point order (G B C H F E A D), so points and coordinates are again laid out in matching order for threads T0 and T1.]
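As the counterpart to the pointer-based sketch above, a minimal copy-based shuffle under the same illustrative Point layout (identifiers are assumptions, not streamcluster's code): after the descriptors have been permuted, each point's coordinates are copied into a fresh contiguous buffer in the shuffled order, so later passes read the coordinate data sequentially and the hardware prefetcher stays effective.

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        float *coord;
        int    dim;
    } Point;

    /* Copy-based shuffle: pack the coordinates into a new buffer in the
     * (already shuffled) descriptor order and redirect the pointers.
     * Returns the packed buffer so the caller can free the old storage. */
    float *compact_coordinates(Point *points, size_t n, int dim)
    {
        float *packed = malloc(n * (size_t)dim * sizeof *packed);
        if (!packed)
            return NULL;
        for (size_t i = 0; i < n; i++) {
            float *dst = packed + i * (size_t)dim;
            memcpy(dst, points[i].coord, (size_t)dim * sizeof *dst);
            points[i].coord = dst;   /* descriptor now refers to the packed copy */
        }
        return packed;
    }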
Data locality optimizations
Control the mapping of data and computations (a combined sketch follows this list):
1. Data placement
Supported by numa_alloc(), move_pages()
First-touch: also OK if the data is accessed at a single processor
Interleaved page placement: reduces interconnect contention [Lachaize et al. USENIX ATC '12, Dashti et al. ASPLOS '13]
2. Computation scheduling
Threads: affinity scheduling, supported by sched_setaffinity()
Loop parallelism: rely on OpenMP static loop scheduling
Pipeline parallelism: locality-aware task dispatch
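A minimal sketch of both knobs together, using the Linux interfaces named above: libnuma (numa_alloc_onnode(), numa_alloc_interleaved(), move_pages()) for placement and sched_setaffinity() for scheduling. Buffer sizes, node numbers, and core numbers are illustrative assumptions, and error handling is abbreviated. Compile with -lnuma.

    #define _GNU_SOURCE
    #include <numa.h>      /* numa_alloc_onnode, numa_alloc_interleaved (-lnuma) */
    #include <numaif.h>    /* move_pages */
    #include <sched.h>     /* cpu_set_t, sched_setaffinity */
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        size_t sz = 64UL * 1024 * 1024;       /* illustrative buffer size */

        /* 1. Data placement: put one buffer on node 0, interleave another
         *    across all nodes to spread memory-controller and interconnect load. */
        double *local = numa_alloc_onnode(sz, 0);
        double *intl  = numa_alloc_interleaved(sz);

        local[0] = 1.0;   /* touch so the first page is actually allocated */

        /* move_pages() with nodes == NULL only queries placement; with a
         * node array it migrates the listed pages. */
        void *page = local;
        int   status = -1;
        move_pages(0, 1, &page, NULL, &status, 0);
        printf("first page of 'local' resides on node %d\n", status);

        /* 2. Computation scheduling: pin the calling thread to core 0, which
         *    on this illustrative layout belongs to node 0, so the thread
         *    computes next to the data placed for it. */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof set, &set);

        /* ... parallel computation on 'local' and 'intl' would go here ... */

        numa_free(local, sz);
        numa_free(intl, sz);
        return 0;
    }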
streamcluster
[Diagram: the shuffled points and their coordinates (G C B H F E A D) are split in two, one half placed at Processor 0 and the other at Processor 1; threads T0 and T1 execute at the processor where their half of the data is placed.]
ferret
[Diagram: the six-stage pipeline (Stage 1: Input, Stage 2: Segment, Stage 3: Extract, Stage 4: Index, Stage 5: Rank, Stage 6: Output), with the Index stage accessing the image database; the Index threads execute at both Proc. 0 and Proc. 1.]
ferret
[Diagram: the Index stage is duplicated into Index' and Index''; the image database is split, one part placed at Proc. 0 and the other at Proc. 1, and each Index replica executes at the processor where its part of the database is placed.]
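To make locality-aware task dispatch for a duplicated pipeline stage concrete, here is a minimal sketch under assumed conditions (two replicas, node i belonging to processor i, eight cores per processor, and a hypothetical owner_of() mapping); it illustrates the general idea, not ferret's actual dispatcher. Each replica's database partition is allocated on one processor's node, its worker is pinned to that processor, and work items are routed to the replica that owns the data they need. Compile with -lnuma -lpthread.

    #define _GNU_SOURCE
    #include <numa.h>        /* numa_alloc_onnode (link with -lnuma)   */
    #include <pthread.h>     /* pthread_create, pthread_setaffinity_np */
    #include <sched.h>       /* cpu_set_t                              */
    #include <stdlib.h>

    #define NREPLICAS 2      /* assumption: one Index replica per processor */

    typedef struct {
        int       node;      /* NUMA node of this replica's processor       */
        int       cpu;       /* one core of that processor (assumed layout) */
        void     *partition; /* database partition placed on 'node'         */
        size_t    part_size;
        pthread_t worker;
    } replica_t;

    static replica_t replicas[NREPLICAS];

    /* Worker for one replica: pin to its processor, then serve queries
     * against the co-located database partition. */
    static void *index_worker(void *arg)
    {
        replica_t *r = arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(r->cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
        /* ... dequeue work items and search r->partition ... */
        return NULL;
    }

    /* Hypothetical ownership function: which replica holds the data
     * a given work item needs. */
    static int owner_of(long item_key) { return (int)(item_key % NREPLICAS); }

    /* Locality-aware dispatch: route the item to the replica whose
     * partition (and processor) it belongs to. */
    static void dispatch(long item_key)
    {
        replica_t *r = &replicas[owner_of(item_key)];
        (void)r;   /* ... enqueue the item on r's work queue ... */
    }

    int main(void)
    {
        if (numa_available() < 0)
            return 1;
        for (int i = 0; i < NREPLICAS; i++) {
            replicas[i].node      = i;        /* assumption: node i = processor i  */
            replicas[i].cpu       = i * 8;    /* assumption: 8 cores per processor */
            replicas[i].part_size = 64UL * 1024 * 1024;
            replicas[i].partition = numa_alloc_onnode(replicas[i].part_size, i);
            pthread_create(&replicas[i].worker, NULL, index_worker, &replicas[i]);
        }
        dispatch(42);                         /* example work item */
        for (int i = 0; i < NREPLICAS; i++)
            pthread_join(replicas[i].worker, NULL);
        return 0;
    }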
Performance evaluation
Two parameters with major effect on NUMA performance:
Data placement
Schedule of computations
Execution scenario: schedule / placement
Scenario 1: default / FT
Schedule: default
Placement: first-touch (FT)
default / FT
[Diagram, built up over three slides: four processors with local DRAM; threads T run on all cores, and with the default schedule and first-touch placement the data pages D end up spread across the four DRAMs wherever they were first touched.]
Performance evaluation
Execution scenario: schedule / placement
Change placement relative to Scenario 1 (default / FT):
Scenario 2: default / INTL
Schedule: default
Placement: interleaved (INTL)
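Switching from first-touch to interleaved placement without touching the schedule can be done from outside the program (for example, numactl --interleave=all ./app) or, as sketched below, by setting an interleaved allocation policy inside the process with libnuma before the data is allocated. This is a minimal illustration under assumed sizes, not the setup used in the talk. Compile with -lnuma.

    #include <numa.h>     /* numa_set_interleave_mask, numa_alloc (-lnuma) */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support\n");
            return 1;
        }

        /* Interleave subsequent policy-based allocations across all nodes,
         * page by page, so no single DRAM or interconnect link carries the
         * whole load; the thread schedule is left untouched. */
        numa_set_interleave_mask(numa_all_nodes_ptr);

        size_t sz = 256UL * 1024 * 1024;       /* illustrative size */
        char *buf = numa_alloc(sz);            /* pages spread across all nodes */
        memset(buf, 0, sz);

        /* ... run the unchanged parallel computation on buf ... */

        numa_free(buf, sz);
        return 0;
    }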
default / FT vs. default / INTL
[Diagram: the thread schedule is unchanged; the data pages D are now interleaved across the four processors' DRAMs.]
Performance evaluation
Execution scenario: schedule / placement
Change schedule relative to Scenario 2 (default / INTL):
Scenario 3: NUMA / INTL
Schedule: NUMA-aware
Placement: interleaved (INTL)
default / INTL vs. NUMA / INTL
[Diagram: the data pages D remain interleaved across the four DRAMs; the threads T are now placed according to the NUMA-aware schedule.]
Performance evaluation
Execution scenario: schedule / placement
Change placement relative to Scenario 3 (NUMA / INTL):
Scenario 4: NUMA / NUMA
Schedule: NUMA-aware
Placement: NUMA-aware (NA)
NUMA / NUMA
[Diagram: both the thread schedule and the placement of the data pages D are NUMA-aware, so threads and their data are co-located at the same processor.]
Performance evaluation: ferret
[Charts, one per scenario (default / FT, default / INTL, NUMA / INTL, NUMA / NUMA): uncore transfers [x 10^9] and improvement over default / FT, shown next to the corresponding placement/schedule diagram.]
Performance evaluation (cont'd)
[Charts: the corresponding results for streamcluster and dedup.]
Data locality optimizations: summary
Data locality: better results than merely avoiding interconnect contention
Interleaved placement: easy to control
Data locality: lack of tools for implementing the optimizations
Other options:
Data replication
Automatic data migration
Performance evaluation: ferret
[Chart: uncore transfers [x 10^9] and improvement over default (FT).]
Conclusions
Details matter:
Prefetcher efficiency
Data locality
Substantial improvements: streamcluster 214%, ferret 59%, dedup 17%
Benchmarking using NUMA-multicores is far from easy
Two aspects to consider: data placement and computation scheduling
Appreciate memory system details to avoid misconceptions
Limited support for understanding hardware bottlenecks
Thank you for your attention!