User Experiences on the Heterogeneous TACC IBM Power4 System

Avi Purkayastha, Kent Milfeld, Chona Guiang
Texas Advanced Computing Center, University of Texas at Austin

ScicomP 8, Minneapolis, MN, August 4-8, 2003
Outline

Architectural Overview
– TACC heterogeneous Power4 system
– Fundamental differences and similarities of the TACC Power4 nodes
Resource Allocation Policies
Scheduling
– Simple and advanced job scripts
– Pre-processing LoadLeveler with a job filter
Performance Analysis
– STREAM and NPB benchmarks
– Finite difference and molecular dynamics applications
– MPI bandwidth performance
Conclusions
TACC Cache/Micro-architecture Features

L1:     32 KB data, 2-way assoc. (write-through); 64 KB instruction, direct-mapped
L2:     1.44 MB (unified), 8-way assoc.
L3:     32 MB, 8-way assoc.
Memory: 32 GB/node (p690H), 128 GB/node (p690T), 8 GB/node (p655H)

Line sizes: 128/128/4x128 bytes for L1/L2/L3
Comparison of TACC Power4 Nodes

– All nodes have the same processor speed but different memory configurations: the p690H and p655H have 2 GB/processor, while the p690T has 4 GB/processor.
– Only the p690T has dual-core processors, so its cores share the L2 cache; the other nodes have a dedicated L2 cache per processor.
– The p655 nodes have PCI-X adapters while the other nodes have PCI adapters, so the former achieve higher message-passing throughput.
– Global address snooping is absent on the p655s, which provides roughly a 10% performance improvement over the p690s.
TACC Power4 System: longhorn.tacc.utexas.edu

– P655 HPC: 32 nodes, 4-way SMP, 8 GB/node (128 processors, 256 GB total)
– P690 HPC: 3 nodes, 16-way SMP, 32 GB/node (48 processors, 96 GB total)
– P690 Turbo: 1 node, 32-way SMP, 128 GB (32 processors, 128 GB total)
– Login node: 13-way, 16 GB; GPFS nodes: 3 x 1-way, 6 GB (16 processors, 22 GB total)
TACC Power4 System: Interconnect

[Diagram: the longhorn login, GPFS, p690, and p655 nodes are connected through a dual-plane IBM SP Switch2, with 32 ports on each plane.]
TACC Power4 System: Storage

[Diagram: node-local and shared filesystems. Each p690 has a 36 GB local /scratch; each p655 has an 18 GB local /scratch. Shared across the SP Switch2 are /home (1/4 TB), /work (4.5 TB), and /archive for archival storage.]
LoadLeveler Batch Facility

– Used to execute batch parallel jobs.
– POE options: use environment variables for LoadLeveler scheduling
  – Adapter specification
  – MPI parameters
  – Number of nodes
  – Class (priority)
  – Consumable resources
– Simple (PBS-like) job scripts are also provided, to address users migrating from clusters.
Job Filter

The TACC Power4 system is heterogeneous:
– some nodes have large memories
– some have faster communication throughput
– some have dual cores, with different processor counts
– cluster users' expectations must also be addressed

Part of scheduling is simply categorizing job requests into classes. The problem with LoadLeveler is that a job cannot be changed from one class to another once submitted; hence a filter has evolved. The filter also optimizes resource allocation and scheduling policies, with emphasis on application performance.

Flow: job submission -> job filter -> one of the queues { LH13, LH16, LH32, LH4 } -> scheduler determines priority and releases jobs for execution.
POE: Simple Job Script I (MPI example)

#!/bin/csh
...
job type = parallel
tasks    = 16
memory   = 1000
walltime = 00:30:00
class    = normal
queue

poe a.out
POE: Simple Job Script II (OpenMP example)

#!/bin/csh
...
job type = parallel
threads  = 16
memory   = 1000
walltime = 00:30:00
class    = normal
queue

setenv OMP_NUM_THREADS 16
poe a.out
POE: Advanced Job Script (MPI example across nodes)

#!/bin/csh
...
resources   = ConsumableCpus(1) ConsumableMemory(1500mb)
network.MPI = csss,not_shared,us
node        = 4
class       = normal
queue

setenv MP_SHARED_MEMORY true
poe a.out
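For reference, a minimal sketch of an MPI program of the kind these scripts launch as a.out (illustrative only, not taken from the original slides; any real application takes its place):

/* Minimal MPI program of the sort the scripts above launch as "a.out".
 * Illustrative sketch; compile with the system's MPI compiler wrapper. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    printf("task %d of %d\n", rank, size);   /* one line per MPI task */

    MPI_Finalize();
    return 0;
}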
TACC Power4 System Filter Logic (MPI)

[Flowchart: the filter derives totals from the user input and routes the job through a decision matrix.]

Definitions: N = nodes, CpT = CPUs/task, TpN = tasks/node, MpT = memory/task.
Derived values: C = N * CpT * TpN (total CPUs), M = TpN * MpT.

Checks include memory limits, a walltime limit that is a function of node count, and shared vs. non-shared (csss) adapter use. Jobs with C > 32 are removed; the remaining jobs are routed by C and M to the LH4, LH16, or LH32 queues, as sketched below.
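A minimal sketch of this routing logic, written here in C for illustration (hypothetical: the slides do not give the filter's implementation language, and the 2 GB/task boundary for small jobs is an assumption based on the per-processor memory figures):

/* Hypothetical sketch of the filter's MPI routing, reconstructed from
 * the decision matrix above; not the actual filter code. */
#include <stdio.h>

/* N = nodes, CpT = CPUs/task, TpN = tasks/node, MpT = memory/task (GB) */
const char *route_mpi_job(int N, int CpT, int TpN, double MpT)
{
    int C = N * CpT * TpN;            /* derived value: total CPUs */

    if (C > 32)  return "REJECT";     /* slide: C > 32 removed */
    if (C >= 17) return "LH32";       /* 32 >= C >= 17 */
    if (C >= 5)  return "LH16";       /* 16 >= C >= 5  */
    /* small jobs: an assumed 2 GB/task boundary (matching the p655/p690H
       2 GB/processor figure) decides between LH4 and LH16 */
    return (MpT < 2.0) ? "LH4" : "LH16";
}

int main(void)
{
    /* e.g., a job like the advanced script above: 4 nodes, 1 CPU/task,
       4 tasks/node, 1.5 GB/task -> C = 16, routed to LH16 */
    printf("%s\n", route_mpi_job(4, 1, 4, 1.5));
    return 0;
}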
TACC Power4 System Filter Logic (OMP)

[Flowchart: same structure as the MPI filter, specialized to single-node threaded jobs (N = 1, TpN = 1, CpT > 1).]

Definitions: N = nodes, CpT = CPUs/task, TpN = tasks/node, MpT = memory/task.
Derived values: C = CpT (total CPUs), M = MpT / C (memory per CPU).

After memory and walltime checks (non-shared, csss), jobs are routed by C and M: roughly 1 <= C <= 4 to LH4, 5 <= C <= 16 to LH16, and 17 <= C <= 32 to LH32, with memory-per-CPU bounds (M < 2, 2 < M < 4) deciding borderline cases.
Batch Resource Limits

– serial (front end): 8 hours, up to 2 jobs
– development (front end): 12 GB, 2 hours, up to 8 CPUs
– normal/high (some default examples):
  – <= 8 GB, < 4 CPUs (LH16)
  – <= 8 GB, 4 CPUs (LH4)
  – > 32 GB, 5 <= CPUs <= 16 (LH32)
  – for the various other combination possibilities, see the User Guide
– dedicated (by special request only)
Application Performance

– Behavior of compute-intensive and memory-intensive scientific applications on the different nodes; examples are STREAM, the NPB, etc.
– Behavior of different kinds of MPI functionality on the different nodes; examples include MPI ping-pong and all-to-all Send/Recvs.
Scientific Applications

SM: Stommel model of ocean circulation; solves a 2-D partial differential equation.
– Uses finite difference approximations for the derivatives on a discretized domain (timed for a constant number of Jacobi iterations; see the sketch after this slide).
– Parallel version uses domain decomposition on a 1K x 1K grid.
– Memory-intensive application; nearest-neighbor communication.

MD: Molecular dynamics of a solid argon lattice.
– Uses the Verlet algorithm for propagation (displacements and velocities).
– Calculation is run for 1 picosecond of simulated time.
– Compute-intensive application; global communications.
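To illustrate why the Stommel kernel is memory-bound, here is a minimal sketch of one Jacobi sweep over the discretized domain (an assumed five-point stencil; the actual model's forcing and boundary treatment are not reproduced):

/* Sketch of one Jacobi iteration of the kind timed in the Stommel
 * benchmark: a five-point stencil over a 1K x 1K grid. Each point reads
 * four neighbors and writes one result, so the loop streams far more
 * data than it computes -- hence memory-intensive. */
#define N 1024

void jacobi_sweep(double (*unew)[N], double (*u)[N],
                  double (*f)[N], double dx2)
{
    for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
            unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j] +
                                 u[i][j-1] + u[i][j+1] - dx2 * f[i][j]);
}

int main(void)
{
    static double u[N][N], unew[N][N], f[N][N];   /* 8 MB per array */
    jacobi_sweep(unew, u, f, 1.0 / (N * N));      /* one timed sweep */
    return 0;
}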
Scientific Applications II

[Table: Stommel and MD runtimes (secs) on the lh4, lh16, and lh32 node types.]

– The lh16 architecture is best suited for a memory-intensive application combined with nearest-neighbor communication.
– L2 cache sharing is most ill-suited for a memory-intensive application.
NAS Parallel Benchmarks

[Charts: Class B and Class C runtimes (secs) for the NPB kernels mg, lu, ft, is, ep, bt, and cg on the lh4, lh16, and lh32 node types.]
STREAM Benchmarks

[Table: Copy, Scale, Add, and Triad bandwidths for the p655, p690_H, and p690_T.]

Results courtesy of John McCalpin, STREAM web-site.
STREAM Benchmarks (continued)

– Results for the p655 used large pages and threads; results for the p690s used small pages and threads.
– The tested p655 system has 64 GB of memory, while the TACC p655s have 8 GB; the tested p690 system has 128 GB of memory, while the TACC p690s have 32 GB.
– Systems with more memory and CPUs can prefetch streaming data more effectively, so applications with STREAM-like kernels should perform better on the p690s than the p655s.
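The STREAM numbers come from simple vector loops; a minimal sketch of the four kernels follows (not McCalpin's instrumented source, and with the timing harness omitted). Each iteration moves two or three doubles per arithmetic operation, so sustained memory bandwidth, not peak flop rate, determines the result:

/* Sketch of the four STREAM kernels: Copy, Scale, Add, Triad. */
#include <stdlib.h>

#define N (8 * 1024 * 1024)   /* 64 MB per array: larger than the 32 MB L3 */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double q = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    for (long i = 0; i < N; i++) a[i] = b[i];            /* Copy  */
    for (long i = 0; i < N; i++) a[i] = q * b[i];        /* Scale */
    for (long i = 0; i < N; i++) a[i] = b[i] + c[i];     /* Add   */
    for (long i = 0; i < N; i++) a[i] = b[i] + q * c[i]; /* Triad */

    free(a); free(b); free(c);
    return 0;
}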
MPI On-node Performance

[Tables: ping-pong bandwidth and bisection bandwidth versus message size (1 KB to 2 MB) on the P690 Turbo, P690 HPC, and P655 HPC.]
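The ping-pong measurement follows a standard pattern; here is a minimal sketch (the 2 MB message size and the repetition count are illustrative assumptions, not the benchmark's actual parameters):

/* Minimal MPI ping-pong sketch of the kind behind the numbers above:
 * rank 0 bounces a buffer off rank 1; bandwidth = total bytes moved
 * divided by elapsed time. Illustrative, not the original benchmark. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int nbytes = 2 * 1024 * 1024;   /* assumed 2 MB message */
    const int reps   = 100;               /* assumed repetition count */
    char *buf = malloc(nbytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("bandwidth: %.1f MB/s\n",
               2.0 * nbytes * reps / (t1 - t0) / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}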
MPI Off-node Performance

[Table: sustained off-node bandwidth measurements over the 1M-4M message range for the P655 and P690.]

– Cruiser adapter on the p655s vs. Corsair on the p690s.
Thoughts and Comments

– The p690T is best suited for large-memory, threaded (OpenMP) applications.
– Applications such as FD codes, which typically combine nearest-neighbor communication with large memory requirements, are best suited for p690H-type nodes.
– Large distributed MPI jobs are best suited for the p655 nodes, as they are the most balanced nodes.
– Latency-sensitive but small MPI jobs are better suited to a p690H node than to the slower interconnect of the p655s.
– In general, the p690s are more limited by the slower interconnect than helped by shared memory; exceptions include FD and Linpack.