Seaborg Code Scalability Project
Richard Gerber, NERSC User Services
NERSC NUG Meeting, 5/29/03
NERSC Scaling Objectives
NERSC wants to promote higher-concurrency jobs. To this end, NERSC has:
– Reconfigured the LoadLeveler job scheduler to favor large jobs
– Implemented a large-job reimbursement program
– Provided users with assistance on their codes
– Begun a detailed study of a number of selected projects
Code Scalability Project
About 20 large user projects were chosen for NERSC to study more closely. Each is assigned to a staff member from NERSC User Services. The study will:
– Interview users to determine why they run their jobs the way they do
– Collect scaling information for major codes
– Identify classes of codes that scale well or poorly
– Identify bottlenecks to scaling
– Analyze cost/benefit for large-concurrency jobs
– Note lessons learned and tips for scaling well
Current Usage of Seaborg
We can examine current job statistics on Seaborg to check:
– User behavior (how jobs are run)
– Queue wait times
We can also look at the results of the large-job reimbursement program to see how it influenced the way users ran jobs.
Job Distribution, 3/3/2003–5/26/2003, regular charge class
Connect Time Usage Distribution, 3/3/2003–5/26/2003, regular charge class
Queue Wait Times, 3/3/2003–5/26/2003, regular charge class
Processor Time/Wait Ratio, 3/3/2003–5/26/2003, regular charge class
Current Usage Summary
– Users run many small jobs
– However, 55% of computing time is spent on jobs that use more than 16 nodes (256 processors)
– And 45% of computing time is used by jobs running on 32+ nodes (512+ CPUs)
– Current queue policy favors large jobs; it is not a barrier to running on many nodes
Factors that May Affect Scaling
Why aren’t even more jobs run at high concurrency? Are any of the following bottlenecks to scaling?
– Algorithmic issues
– Coding effort needed
– MPP cost per amount of science achieved
– Any remaining scheduling / job turnaround issues
– Other?
Hints from Reimbursement
– During April, NERSC reimbursed a number of projects for jobs using 64+ nodes
– Time was set aside to let users investigate the scaling performance of their codes
– Some projects made great use of the program, showing that they would run at high concurrency if given free time
Reimbursement Usage
Run-time percentage using 64+ nodes (examples):

Project PI    Oct.–March    April
Toussaint     36%           59%
Ryne          19%           56%
Cohen         17%           48%
Held           0%           78%
Borrill        8%           64%

Batchelor went from 0% to 66% of time running on 128+ nodes (2,048 CPUs).
Project Activity
Many projects are working with their User Services Group contacts on:
– Characterizing scaling performance
– Profiling codes
– Parallel I/O strategies
– Enhancing code for high concurrency
– Compiler and runtime bug fixes and optimizations
Examples: Batchelor (Jaeger), Ryne (Qiang, Adelmann), Vahalla, Toussaint, Mezzacappa (Swesty, Strayer, Blondin), Butalov, Guzdar (Swisdak), Spong
Project Example 1
Qiang’s (Ryne) BeamBeam3D beam dynamics code; written in Fortran
– Poor scaling noted on N3E compared to N3
– We performed many scaling runs and noticed very bad performance when using 16 tasks/node
– Tracked the problem to a routine making heavy use of the RANDOM_NUMBER intrinsic (a timing sketch follows below)
– Identified a runtime problem with IBM’s default threading of RANDOM_NUMBER
– Found an undocumented setting that improved performance dramatically; reported to IBM
– Identified one run strategy that minimized execution time and another that minimized cost
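For reference, here is a minimal, hypothetical timing sketch of the kind of isolated test that exposes this sort of problem; it is not BeamBeam3D code. It assumes an MPI environment like Seaborg's with the XL Fortran runtime, and it assumes the intrinthds option that appears in the run-time table later in this talk is set outside the program (for example, through the XLFRTEOPTS environment variable in the job script) rather than in the source.

```fortran
! Minimal, hypothetical sketch (not BeamBeam3D): isolate the RANDOM_NUMBER
! intrinsic so that runs with different tasks-per-node settings, with and
! without the runtime option, can be compared directly.
program time_random_number
  implicit none
  include 'mpif.h'
  integer, parameter :: n = 10000000            ! random values generated per task
  double precision, allocatable :: x(:)
  double precision :: t0, t1, dt, tmax
  integer :: ierr, rank

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate(x(n))
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  t0 = MPI_Wtime()
  call random_number(x)                         ! fill the whole array in one call
  t1 = MPI_Wtime()
  dt = t1 - t0

  ! Report the slowest task: contention inside the intrinsic shows up as a
  ! large gap between the 8-tasks-per-node and 16-tasks-per-node runs.
  call MPI_Reduce(dt, tmax, 1, MPI_DOUBLE_PRECISION, MPI_MAX, 0, &
                  MPI_COMM_WORLD, ierr)
  if (rank == 0) print *, 'max RANDOM_NUMBER time (s):', tmax

  deallocate(x)
  call MPI_Finalize(ierr)
end program time_random_number
```

Comparing this kernel at 8 versus 16 tasks per node, with and without the runtime setting, separates the intrinsic's threading behavior from the rest of the application.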
BeamBeam3D Scaling
BeamBeam Run Time
[Table: run time by number of tasks and tasks per node, with and without the default intrinthds=1 setting]
MPP Charges
[Table: MPP charges by number of nodes and number of tasks]
BeamBeam Summary
– Found a fix for the runtime performance problem
– Reported it to IBM; seeking clarification and documentation
– Identified the run configuration that solved the problem the fastest
– Identified the cheapest job configuration
– Quantified the MPP cost for various configurations (see the cost sketch below)
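As a rough sketch of why the fastest configuration and the cheapest one can differ, assuming a charging model in which a job is charged for all 16 processors on every node it reserves for the full wallclock time (that model is an assumption here, not something stated on the slide):

```latex
% Assumed charge model (hedged): charge scales with reserved nodes times wallclock time.
\[
  C \;\propto\; N_{\mathrm{nodes}} \times 16 \times t_{\mathrm{wall}}
\]
% Consequence: doubling the node count lowers the charge only if the run then
% finishes in less than half the wallclock time, i.e. only if parallel
% efficiency across the doubling stays above roughly 50%.
```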
Project Example 2
Adelmann’s (Ryne) PARSEC code
– 3D self-consistent iterative field solver and particle code for studying accelerator beam dynamics; written in C++
– Scales extremely well to 4,096 processors, but Mflop/s performance is disappointing
– Migrating from KCC to xlC; found a fatal xlC compiler bug; pushing IBM for a fix so the code can be optimized with the IBM compiler
– Using HPMlib profiling calls, found that a large amount of run time is spent in integer-only stenciling routines, which naturally gives low Mflop/s (an instrumentation sketch follows below)
– Have recently identified possible load-balancing problems; working to resolve them
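The profiling approach mentioned above follows the usual libhpm pattern: bracket a suspect region with start/stop calls and read the per-section hardware-counter report. PARSEC itself is C++; the sketch below uses the Fortran bindings to stay in one language with the earlier example, and the binding names (f_hpminit, f_hpmstart, f_hpmstop, f_hpmterminate) are assumed from IBM's HPM Toolkit. The stencil loop is a stand-in, not PARSEC code.

```fortran
! Minimal sketch of libhpm instrumentation: the counter report for the
! bracketed section shows how much floating-point work it actually does.
program hpm_sketch
  implicit none
  integer, parameter :: nx = 1000000
  integer, allocatable :: grid(:), work(:)
  integer :: i

  call f_hpminit(0, 'hpm_sketch')               ! task id, program name

  allocate(grid(nx), work(nx))
  grid = 1

  call f_hpmstart(1, 'integer stencil')         ! section id, label
  do i = 2, nx - 1
     work(i) = grid(i-1) + grid(i) + grid(i+1)  ! integer-only work: ~0 Mflop/s by design
  end do
  call f_hpmstop(1)

  call f_hpmterminate(0)                        ! write the per-section counter report
  print *, work(nx/2)                           ! keep the loop from being optimized away
end program hpm_sketch
```

A section like this reporting long run time but near-zero Mflop/s is the signature of integer-dominated work, which is why low Mflop/s alone does not mean the code is inefficient.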
In Conclusion
This work is underway. We don’t expect to be able to characterize every code we are studying, but we hope to survey a number of algorithms and scientific applications. A draft report is scheduled for July.