Matching Memory Access Patterns and Data Placement for NUMA Systems
Zoltán Majó and Thomas R. Gross
Computer Science Department, ETH Zurich, Switzerland
Non-uniform memory architecture

[Diagram: two processors, each with four cores (Processor 0: Cores 0-3, Processor 1: Cores 4-7), each with a memory controller (MC) attached to its local DRAM, and an interconnect (IC) linking the processors.]
Non-uniform memory architecture

- Local memory accesses: bandwidth 10.1 GB/s, latency 190 cycles
- Remote memory accesses: bandwidth 6.3 GB/s, latency 310 cycles
- Key to good performance: data locality (see the sketch below)

All data based on experimental evaluation of the Intel Xeon 5500 (Hackenberg [MICRO '09], Molka [PACT '09])
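Not part of the talk: a minimal C sketch, assuming a two-node Linux machine with libnuma installed (compile with -lnuma), showing how a buffer can be placed on a chosen node so a thread pinned to node 0 experiences exactly the local/remote asymmetry quantified above.

    /* Local vs. remote NUMA placement, a minimal sketch using libnuma. */
    #include <numa.h>
    #include <stdio.h>
    #include <string.h>

    #define SIZE (64 * 1024 * 1024)

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "libnuma not available\n");
            return 1;
        }
        numa_run_on_node(0);                        /* pin this thread to node 0 */
        char *local  = numa_alloc_onnode(SIZE, 0);  /* DRAM attached to node 0 */
        char *remote = numa_alloc_onnode(SIZE, 1);  /* DRAM attached to node 1 */
        memset(local, 1, SIZE);   /* served by the local memory controller */
        memset(remote, 1, SIZE);  /* every access crosses the interconnect */
        numa_free(local, SIZE);
        numa_free(remote, SIZE);
        return 0;
    }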
Data locality in multithreaded programs

[Chart: remote memory references as a fraction of total memory references [%] per benchmark.]
Outline

- Automatic page placement
- Memory access patterns of matrix-based computations
- Matching memory access patterns and data placement
- Evaluation
- Conclusions
Automatic page placement

- Current OS support for NUMA: first-touch page placement (see the sketch below)
  - Often a high number of remote accesses
- Data address profiling
  - Supported in hardware on many architectures
- Profile-based page placement
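To make first-touch concrete, here is a hedged OpenMP sketch (not from the slides; compile with -fopenmp). Linux allocates a physical page on the node of the thread that first writes it, so initializing with the same parallel schedule as the compute loop leaves each page local to its user.

    #include <stdlib.h>

    #define NX 4096
    #define NY 4096

    int main(void) {
        double *m = malloc((size_t)NX * NY * sizeof *m);

        /* Pages are physically allocated here, on first touch, not in malloc. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < NX; i++)
            for (int j = 0; j < NY; j++)
                m[(size_t)i * NY + j] = 0.0;   /* page lands on the toucher's node */

        free(m);
        return 0;
    }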
Profile-based page placement

Based on the work of Marathe et al. [JPDC 2010, PPoPP 2006]

[Diagram: the profile records that page P0 is accessed 1000 times by thread T0 and page P1 is accessed 3000 times by thread T1; each page is then placed in the DRAM of the processor running the thread that accesses it most (P0 at Processor 0, P1 at Processor 1).]
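The profiling itself is hardware-specific, but the placement step can be illustrated with the Linux move_pages(2) system call from libnuma. This is a hedged sketch of the idea, not the authors' implementation (compile with -lnuma):

    #include <numaif.h>   /* move_pages, MPOL_MF_MOVE */

    /* Move the page holding `addr` to `target_node`; returns 0 on success.
       E.g. place_page(p0, 0); place_page(p1, 1); after reading the profile. */
    static int place_page(void *addr, int target_node) {
        void *pages[1] = { addr };
        int nodes[1]  = { target_node };
        int status[1];
        /* pid 0 = calling process; MPOL_MF_MOVE moves pages owned by us. */
        return (int)move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
    }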
Automatic page placement

- Compare: first-touch and profile-based page placement
- Machine: 2-processor, 8-core Intel Xeon E5520
- Subset of the NAS Parallel Benchmarks: programs with a high fraction of remote accesses
- 8 threads with a fixed thread-to-core mapping
Profile-based page placement

[Chart: comparison of first-touch and profile-based page placement for the selected benchmarks.]
Inter-processor data sharing

[Diagram: as before, P0 is accessed 1000 times by T0 and P1 is accessed 3000 times by T1, but page P2 is accessed 4000 times by T0 and 5000 times by T1; P2 is inter-processor shared, so no single placement makes all of its accesses local.]
Inter-processor data sharing

[Chart: shared heap as a fraction of total heap [%] per benchmark, shown together with the performance improvement [%] of profile-based placement.]
Automatic page placement

- Profile-based page placement is often ineffective
- Reason: inter-processor data sharing
- Inter-processor data sharing is a program property
- Detailed look at program memory access patterns:
  - Loop-parallel programs with OpenMP-like parallelization
  - Matrix processing
  - NAS BT
Matrix processing

Process matrix m[NX][NY] sequentially:

    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
Matrix processing

Process m x-wise parallel: the outer loop is parallelized, so each of the threads T0-T7 gets a contiguous block of rows.

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
Thread scheduling

Remember: fixed thread-to-core mapping.

[Diagram: threads T0-T3 run on the cores of Processor 0, threads T4-T7 on the cores of Processor 1.]
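With GCC's OpenMP runtime the mapping can be fixed via the GOMP_CPU_AFFINITY="0-7" environment variable, or explicitly. A hedged sketch of the explicit variant (assuming a numbering where cores 0-3 sit on processor 0 and cores 4-7 on processor 1, as on the slides):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <omp.h>

    /* Pin each OpenMP thread to the core with its own number, so
       T0-T3 end up on processor 0 and T4-T7 on processor 1. */
    static void pin_threads(void) {
        #pragma omp parallel
        {
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(omp_get_thread_num(), &set);    /* core id = thread id */
            pthread_setaffinity_np(pthread_self(), sizeof set, &set);
        }
    }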
Matrix processing

Process m x-wise parallel: rows 0 .. NX/2 - 1 are accessed by T0-T3 and allocated at Processor 0; rows NX/2 .. NX - 1 are accessed by T4-T7 and allocated at Processor 1.

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
Matrix processing

Process m y-wise parallel: the inner loop is parallelized, so each thread gets a contiguous block of columns. Now the left half of the columns is accessed by T0-T3 (and belongs at Processor 0) and the right half by T4-T7 (and belongs at Processor 1).

    for (i = 0; i < NX; i++)
      #pragma omp parallel for
      for (j = 0; j < NY; j++)
        // access m[i][j]
Example: NAS BT

The same matrix is processed x-wise and y-wise in every time-step iteration:

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

[Diagram: the row partitioning of x_wise and the column partitioning of y_wise overlaid on m[NX][NY]. The top-left quadrant is accessed by T0-T3 in both phases and can be allocated at Processor 0, and the bottom-right quadrant by T4-T7 and can be allocated at Processor 1; for the other two quadrants an appropriate allocation is not possible.]

Result:
- Inter-processor shared heap: 35%
- Remote accesses: 19%
Solution?

1. Adjust data placement
   - High overhead of runtime data migration cancels the benefit
2. Adjust iteration scheduling
   - Limited by data dependences
3. Adjust data placement and iteration scheduling together
API

- Library for data placement
  - Set of common data distributions
- Affinity-aware loop iteration scheduling
- Extension to the GCC OpenMP implementation
- Example use case: NAS BT
Use-case: NAS BT

- Remember: BT has two incompatible access patterns
  - Repeated x-wise and y-wise access to the same data
- Idea: choose a data placement that accommodates both access patterns (see the sketch below)

[Diagram: blocked-exclusive data placement: the matrix is split into blocks, some allocated at Processor 0 and some at Processor 1.]
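block_exclusive_distr() and distribute_to() are the talk's library API; their internals are not shown, so the following is only a guess at what such a distribution could do, pushing each page to the node owning the quadrant it falls in via move_pages(2) (assumes rows span whole pages; compile with -lnuma):

    #include <numaif.h>   /* move_pages, MPOL_MF_MOVE */
    #include <stdint.h>
    #include <unistd.h>

    /* Hypothetical re-implementation: top-left and bottom-right quadrants
       go to node 0, top-right and bottom-left to node 1. */
    static void distribute_block_exclusive(double *m, long nx, long ny) {
        long page = sysconf(_SC_PAGESIZE);
        uintptr_t start = ((uintptr_t)m + page - 1) & ~(uintptr_t)(page - 1);
        char *end = (char *)(m + nx * ny);
        for (char *p = (char *)start; p < end; p += page) {
            long idx = (p - (char *)m) / (long)sizeof(double);
            long i = idx / ny, j = idx % ny;   /* element at the page start */
            int node = ((i < nx / 2) == (j < ny / 2)) ? 0 : 1;
            void *pages[1] = { p };
            int nodes[1] = { node }, status[1];
            move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE);
        }
    }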
Use-case: NAS BT

The matrix is distributed once, before the time-step loop (sizeof(m[0]) / 2, half a row, is the block width):

    distr_t *distr;
    distr = block_exclusive_distr(m, sizeof(m), sizeof(m[0]) / 2);
    distribute_to(distr);

    for (t = 0; t < TMAX; t++) {
      x_wise();
      y_wise();
    }

where x_wise() initially contains the x-wise parallel loop:

    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]
x_wise()

Matrix processed in two steps:

- Step 1: left half (columns 0 .. NY/2 - 1), all accesses local
- Step 2: right half (columns NY/2 .. NY - 1), all accesses local

[Diagram: in the left half, the upper rows are allocated at Processor 0 and the lower rows at Processor 1; in the right half, the upper rows are allocated at Processor 1 and the lower rows at Processor 0.]
Use-case: NAS BT

Inside x_wise(), the single parallel loop is split into the two steps:

    // Step 1: left half
    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY/2; j++)
        // access m[i][j]

    // Step 2: right half
    #pragma omp parallel for
    for (i = 0; i < NX; i++)
      for (j = NY/2; j < NY; j++)
        // access m[i][j]
Use-case: NAS BT

Scheduling clauses are added so that each step runs its threads on the processor holding the data:

    // Step 1: left half
    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY/2; j++)
        // access m[i][j]

    // Step 2: right half
    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = NY/2; j < NY; j++)
        // access m[i][j]
Matrix processing

Process m x-wise parallel with schedule(static): thread Tk gets the k-th contiguous block of rows.

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

    T0: m[0 .. NX/8 - 1][*]
    T1: m[NX/8 .. 2*NX/8 - 1][*]
    T2: m[2*NX/8 .. 3*NX/8 - 1][*]
    T3: m[3*NX/8 .. 4*NX/8 - 1][*]
    T4: m[4*NX/8 .. 5*NX/8 - 1][*]
    T5: m[5*NX/8 .. 6*NX/8 - 1][*]
    T6: m[6*NX/8 .. 7*NX/8 - 1][*]
    T7: m[7*NX/8 .. NX - 1][*]
static vs. static-inverse

    #pragma omp parallel for schedule(static)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

With schedule(static), thread Tk is assigned the k-th row block: T0 gets m[0 .. NX/8 - 1][*], T1 gets m[NX/8 .. 2*NX/8 - 1][*], ..., T7 gets m[7*NX/8 .. NX - 1][*].

    #pragma omp parallel for schedule(static-inverse)
    for (i = 0; i < NX; i++)
      for (j = 0; j < NY; j++)
        // access m[i][j]

With schedule(static-inverse), the block-to-thread assignment is reversed: T0 gets m[7*NX/8 .. NX - 1][*], ..., T7 gets m[0 .. NX/8 - 1][*]. In step 2 of x_wise() this puts T4-T7 on the upper rows and T0-T3 on the lower rows, matching the blocked-exclusive placement.
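schedule(static-inverse) exists only in the talk's modified GCC runtime; with a stock compiler the same effect can be approximated by computing the reversed block bounds by hand. A hedged sketch (compile with -fopenmp; the loop body is a placeholder):

    #include <omp.h>

    /* Each thread t processes the row block that schedule(static) would
       give to thread T-1-t, i.e. the assignment is reversed. */
    void process_rows_inverse(double *m, long nx, long ny) {
        #pragma omp parallel
        {
            int T = omp_get_num_threads();
            int mirror = T - 1 - omp_get_thread_num();
            long lo = nx * mirror / T;
            long hi = nx * (mirror + 1) / T;
            for (long i = lo; i < hi; i++)
                for (long j = 0; j < ny; j++)
                    m[i * ny + j] += 1.0;   /* stand-in for the real work */
        }
    }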
y_wise()

Matrix processed in two steps (see the sketch below):

- Step 1: upper half (rows 0 .. NX/2 - 1), all accesses local: T0-T3 process the left columns (allocated at Processor 0), T4-T7 the right columns (allocated at Processor 1)
- Step 2: lower half (rows NX/2 .. NX - 1), all accesses local: the assignment is reversed, so T4-T7 process the left columns (allocated at Processor 1) and T0-T3 the right columns (allocated at Processor 0)
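The slides expand only x_wise(); this is a hedged reconstruction of y_wise() in the same spirit. Step 2 emulates static-inverse by hand so the sketch compiles with stock OpenMP; NX, NY, and the update are placeholders.

    #include <omp.h>

    #define NX 1024
    #define NY 1024
    static double m[NX][NY];

    void y_wise(void) {
        /* Step 1: upper half. schedule(static) hands the left columns to
           T0-T3 (data at Processor 0) and the right columns to T4-T7. */
        for (int i = 0; i < NX / 2; i++) {
            #pragma omp parallel for schedule(static)
            for (int j = 0; j < NY; j++)
                m[i][j] += 1.0;             /* stand-in for the real work */
        }
        /* Step 2: lower half, with the column assignment reversed so
           that T4-T7 take the left columns (data at Processor 1). */
        for (int i = NX / 2; i < NX; i++) {
            #pragma omp parallel
            {
                int T = omp_get_num_threads();
                int mirror = T - 1 - omp_get_thread_num();
                long lo = (long)NY * mirror / T;
                long hi = (long)NY * (mirror + 1) / T;
                for (long j = lo; j < hi; j++)
                    m[i][j] += 1.0;
            }
        }
    }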
Outline

- Profile-based page placement
- Memory access patterns
- Matching data distribution and iteration scheduling
- Evaluation
- Conclusions
Evaluation

[Chart: performance improvement over first-touch [%] for each benchmark.]
Scalability

Machine: 4-processor, 32-core Intel Xeon E7-4830

[Chart: performance improvement over first-touch [%] for each benchmark.]
Conclusions

- Automatic data placement is (still) limited
  - Alternating memory access patterns
  - Inter-processor data sharing
- Match memory access patterns and data placement
- Simple API: a practical solution that works today
- Ample opportunities for further improvement
Thank you for your attention!