1
IFS Benchmark with Federation Switch
John Hague, IBM
2
Introduction
Federation has dramatically improved Power4 p690 communication, so:
– Measure Federation performance with Small Pages and Large Pages using a simulation program
– Compare Federation and pre-Federation (Colony) performance of IFS
– Compare Federation performance of IFS with and without Large Pages and Memory Affinity
– Examine IFS communication using MPI profiling
3
Colony v Federation
Colony (hpca)
– 1.3 GHz 32-processor p690s
– Four 8-processor Affinity LPARs per p690 (needed to get communication performance)
– Two 180 MB/s adapters per LPAR
Federation (hpcu)
– 1.7 GHz p690s
– One 32-processor LPAR per p690
– Memory and MPI MCM Affinity: MPI task and memory from the same MCM (slightly better than binding each task to a specific processor)
– Two 2-link 1.2 GB/s Federation adapters per p690 (four 1.2 GB/s links per node)
4
IFS Communication: transpositions
1. MPI Alltoall in all rows simultaneously (mostly shared memory)
2. MPI Alltoall in all columns simultaneously (see the sketch below)
[Diagram: layout of MPI tasks 0–31 across nodes, showing the row and column groupings]
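Below is a minimal sketch (not IFS code) of this two-phase pattern, using MPI_Comm_split to build row and column communicators and MPI_Alltoall within each; the task-grid shape (nprow = 8), the message size and the row-major task placement are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, nprow = 8, npcol;       /* assumed 2-D task grid; size must be a multiple of nprow */
    MPI_Comm row_comm, col_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    npcol = size / nprow;

    /* Tasks in the same row (placed on the same node) form one communicator;
       tasks in the same column (spread across nodes) form another.           */
    MPI_Comm_split(MPI_COMM_WORLD, rank / npcol, rank, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, rank % npcol, rank, &col_comm);

    int count = 100000;                     /* doubles per partner, illustrative */
    double *sbuf  = calloc((size_t)count * npcol, sizeof(double));
    double *rbuf  = calloc((size_t)count * npcol, sizeof(double));
    double *sbuf2 = calloc((size_t)count * nprow, sizeof(double));
    double *rbuf2 = calloc((size_t)count * nprow, sizeof(double));

    /* Phase 1: alltoall in all rows simultaneously (mostly shared memory) */
    MPI_Alltoall(sbuf,  count, MPI_DOUBLE, rbuf,  count, MPI_DOUBLE, row_comm);
    /* Phase 2: alltoall in all columns simultaneously (over the switch)   */
    MPI_Alltoall(sbuf2, count, MPI_DOUBLE, rbuf2, count, MPI_DOUBLE, col_comm);

    MPI_Finalize();
    return 0;
}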
5
Simulation of transpositions
– All transpositions in a “row” use shared memory
– All transpositions in a “column” use the switch
– Number of MPI tasks per node varied, but all processors used by means of OpenMP threads
– Bandwidth measured for MPI_Sendrecv calls; buffers allocated and filled by threads between each call (see the sketch below)
– Large Pages give the best switch performance with the current switch software
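A minimal sketch of this kind of measurement, assuming paired tasks, a 4 MB message and 50 repetitions (all illustrative choices, not the actual simulation program):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const long nbytes = 4L << 20;            /* 4 MB per message, illustrative */
    const int  nrep   = 50;
    int rank;
    char *sbuf = malloc(nbytes), *rbuf = malloc(nbytes);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int partner = rank ^ 1;                  /* pair tasks 0-1, 2-3, ...       */

    MPI_Barrier(MPI_COMM_WORLD);
    double tcomm = 0.0;
    for (int it = 0; it < nrep; it++) {
        /* OpenMP threads (re)fill the send buffer between calls, so that
           all processors on the node are kept busy, as described above.  */
        #pragma omp parallel for
        for (long i = 0; i < nbytes; i++)
            sbuf[i] = (char)(i + it);

        double t0 = MPI_Wtime();
        MPI_Sendrecv(sbuf, (int)nbytes, MPI_BYTE, partner, 0,
                     rbuf, (int)nbytes, MPI_BYTE, partner, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        tcomm += MPI_Wtime() - t0;
    }

    /* Bytes sent plus bytes received, per task */
    printf("task %d: %.1f MB/s\n", rank, 2.0 * nrep * nbytes / tcomm / 1.0e6);

    MPI_Finalize();
    return 0;
}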
6
“Transposition” Bandwidth per link (8 nodes, 4 links/node, 8 tasks/node, 4 threads/task, 2 tasks/link)
SP = Small Pages; LP = Large Pages
7
“Transposition” Bandwidth per link (8 nodes, 4 links/node)
Multiple threads ensure all processors are used
8
hpcu v hpca with IFS
Benchmark jobs (provided 3 years ago):
– Same executable used on hpcu and hpca
– 256 processors used
– All jobs run with MPI profiling (and barriers before data exchange)

Job                 Procs     Grid Points   hpca (s)   hpcu (s)   Speedup
T399                10x1_4        213988       5828       3810      1.52
T799                16x8_2        843532       9907       5527      1.79
4D-Var T511/T255    16x8_2                     4869       2737      1.78
9
IFS Speedups: hpcu v hpca
LP = Large Pages; SP = Small Pages; MA = Memory Affinity
10
LP/SP & MA/noMA CPU comparison
11
LP/SP & MA/noMA Comms comparison
12
Percentage Communication: hpca v hpcu
[Chart: percentage of run time spent in communication on hpca and hpcu]
13
Extra Memory needed by Large Pages
Large Pages are allocated in real memory in segments of 256 MB (see the toy calculation below).
– MPI_INIT:
  80 MB which may not be used;
  MP_BUFFER_MEM (default 64 MB) can be reduced;
  MPI_BUFFER_ALLOCATE needs memory which may not be used
– OpenMP threads:
  stack allocated with XLSMPOPTS="stack=…" may not be used
– Fragmentation:
  memory is "wasted"
– Last 256 MB segment:
  only a small part of it may be used
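Much of the overhead comes from rounding each large-page region up to whole 256 MB segments. A toy calculation of that rounding (the 80 MB figure is the MPI_INIT example above; everything else is illustrative):

#include <stdio.h>

int main(void)
{
    const double seg_mb       = 256.0;   /* large-page segment size (MB)        */
    const double requested_mb = 80.0;    /* e.g. the MPI_INIT space quoted above */

    int    segments = (int)((requested_mb + seg_mb - 1.0) / seg_mb);
    double real_mb  = segments * seg_mb;

    printf("%.0f MB requested -> %d segment(s) = %.0f MB of real memory "
           "(%.0f MB of the last segment unused)\n",
           requested_mb, segments, real_mb, real_mb - requested_mb);
    return 0;
}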
14
mpi_profile
Examine IFS communication using MPI profiling:
– Use libmpiprof.a
– Calls and MB/s rate for each type of call: overall, and for each higher-level subroutine
– Histogram of block size for each type of call
(A sketch of how such a profiler can be built follows.)
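libmpiprof.a is the profiling library actually used here. Purely as an illustration of how per-routine call counts, byte totals and times can be gathered, below is a minimal wrapper sketch built on the standard PMPI interface (it intercepts MPI_Send only):

#include <mpi.h>
#include <stdio.h>

static long   send_calls = 0;
static double send_bytes = 0.0, send_time = 0.0;

/* Intercept MPI_Send; the real work is done by PMPI_Send. */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int    size;
    double t0 = MPI_Wtime();
    int    rc = PMPI_Send(buf, count, type, dest, tag, comm);

    send_time += MPI_Wtime() - t0;
    MPI_Type_size(type, &size);
    send_calls++;
    send_bytes += (double)count * size;
    return rc;
}

/* Print the totals when the application shuts MPI down. */
int MPI_Finalize(void)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("task %d  MPI_Send: %ld calls, avg %.1f bytes, %.3f s\n",
           rank, send_calls,
           send_calls ? send_bytes / send_calls : 0.0, send_time);
    return PMPI_Finalize();
}

Linking such wrappers ahead of the MPI library yields tables like those on the next slides; the per-subroutine breakdown and block-size histograms could be collected the same way by noting the active high-level routine and bucketing count*size.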
15
mpi_profile for T799
128 MPI tasks, 2 threads; WALL time = 5495 sec

MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
MPI_Send           49784       52733.2      2625.3        7.873
MPI_Bsend           6171      454107.3      2802.3        1.331
MPI_Isend          84524     1469867.4    124239.1        1.202
MPI_Recv           91940     1332252.1    122487.3      359.547
MPI_Waitall        75884           0.0         0.0       59.772
MPI_Bcast            362          26.6         0.0        0.028
MPI_Barrier         9451           0.0         0.0      436.818
TOTAL                                                   866.574

Barrier time indicates load imbalance.
16
mpi_profile for 4D-Var min0
128 MPI tasks, 2 threads; WALL time = 1218 sec

MPI Routine       #calls    avg. bytes      Mbytes    time(sec)
MPI_Send           43995        7222.9       317.8        1.033
MPI_Bsend          38473       13898.4       534.7        0.843
MPI_Isend         326703      168598.3     55081.6        6.368
MPI_Recv          432364      127061.8     54936.9      220.877
MPI_Waitall       276222           0.0         0.0       23.166
MPI_Bcast            288      374491.7       107.9        0.490
MPI_Barrier        27062           0.0         0.0       94.168
MPI_Allgatherv       466      285958.8       133.3       26.250
MPI_Allreduce       1325          73.2         0.1        1.027
TOTAL                                                   374.223

Barrier time indicates load imbalance.
17
MPI Profiles for send/recv
18
mpi_profiles for recv/send

                                  Avg MB    MB/s per task
                                            hpca     hpcu
T799 (4 tasks per link)
  trltom (inter node)               1.84      35      224
  trltog (shrd memory)              4.00     116      890
  slcomm2 (halo)                    0.66      65      363
4D-Var min0 (4 tasks per link)
  trltom (inter node)               0.16       7      160
  trltog (shrd memory)              0.37      34       90
  slcomm2 (halo)                    0.08       8      222
19
Conclusions
Speedups of hpcu over hpca:

Large Pages   Memory Affinity   Speedup
     N               N          1.32 – 1.60
     Y               N          1.43 – 1.62
     N               Y          1.47 – 1.78
     Y               Y          1.52 – 1.85

Best environment variables:
– MPI.network=ccc0 (instead of cccs)
– MEMORY_AFFINITY=yes
– MP_AFFINITY=MCM  ! with new pvmd
– MP_BULK_MIN_MSG_SIZE=50000
– LDR_CNTRL="LARGE_PAGE_DATA=Y" – don't use, else system calls in LP very slow
– MP_EAGER_LIMIT=64K
20
hpca v hpcu

                              -------- Time (s) -------   ------- Speedup ------      %
         I/O*    LP   Aff     Total     CPU     Comms     Total    CPU    Comms     Comms
min0:
  hpca   ***     N    N        2499    1408      1091                                43.6
  hpcu   H+/22   N    N        1502    1119       383      1.66    1.26    2.85      25.5
         H+/21   N    Y        1321     951       370      1.89    1.48    2.95      28.0
         H+/20   Y    N        1444    1165       279      1.73    1.21    3.91      19.3
         H+/19   Y    Y        1229     962       267      2.03    1.46    4.08      21.7
min1:
  hpca   ***     N    N        1649    1065       584                                43.6
  hpcu   H+/22   N    N        1033     825       208      1.60    1.29    2.81      20.1
         H+/21   N    Y         948     734       214      1.74    1.45    2.73      22.5
         H+/15   Y    N        1019     856       163      1.62    1.24    3.58      16.0
         H+/19   Y    Y         914     765       149      1.80    1.39    3.91      16.3
23
Conclusions
Memory Affinity with binding:
– Program binds to MOD(task_id*nthrds+thrd_id, 32) (see the sketch below), or
– use the new /usr/lpp/ppe.poe/bin/pmdv4
– How to bind if the whole node is not used?
– Try VSRAC code from Montpellier
– Bind adapter link to MCM?
Large Pages:
– Advantage: LP needed for best communication B/W with current software
– Disadvantages: uses extra memory (4 GB more per node in 4D-Var min1); LoadLeveler scheduling
– Prototype switch software indicates Large Pages not necessary
Collective Communication:
– To be investigated
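A sketch of the binding rule quoted above, assuming it is applied per OpenMP thread with the AIX bindprocessor() call and 32 logical CPUs per node (the exact mechanism used in the runs is not shown here):

#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <sys/processor.h>   /* AIX: bindprocessor(), BINDTHREAD */
#include <sys/thread.h>      /* AIX: thread_self()               */

int main(int argc, char **argv)
{
    int task_id;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &task_id);

    #pragma omp parallel
    {
        int thrd_id = omp_get_thread_num();
        int nthrds  = omp_get_num_threads();

        /* MOD(task_id*nthrds+thrd_id, 32): spread the threads of
           consecutive tasks over the 32 processors of a p690 node. */
        int cpu = (task_id * nthrds + thrd_id) % 32;

        if (bindprocessor(BINDTHREAD, thread_self(), cpu) != 0)
            perror("bindprocessor");
    }

    /* ... rest of the application runs with threads pinned ... */
    MPI_Finalize();
    return 0;
}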
24
Linux compared to Power4 for IFS
Linux (run by Peter Mayes):
– Opteron, 2 GHz, 2 CPUs/node, 6 GB/node, Myrinet switch
– Portland Group compiler; compiler flags: -O3 -Mvect=sse
– No code optimisation or OpenMP
– Linux 1: 1 CPU/node, Myrinet IP
– Linux 1A: 1 CPU/node, Myrinet GM
– Linux 2: using 2 CPUs/node
IBM Power4:
– MPI (intra-node shared memory) and OpenMP
– Compiler flags: -O3 -qstrict
– hpca: 1.3 GHz p690, 8 CPUs/node, 8 GB/node, Colony switch
– hpcu: 1.7 GHz p690, 32 CPUs/node, 32 GB/node, Federation switch
25
Linux compared to Power4