Seaborg Code Scalability Project
Richard Gerber, NERSC User Services
NERSC NUG Meeting, 5/29/03


NERSC Scaling Objectives
NERSC wants to promote higher-concurrency jobs. To this end, NERSC has:
– Reconfigured the LoadLeveler job scheduler to favor large jobs
– Implemented a large-job reimbursement program
– Provided users assistance with their codes
– Begun a detailed study of a number of selected projects

Code Scalability Project
About 20 large user projects were chosen for NERSC to study more closely. Each is assigned to a staff member from NERSC User Services. The study will:
– Interview users to determine why they run their jobs the way they do
– Collect scaling information for major codes
– Identify classes of codes that scale well or poorly
– Identify bottlenecks to scaling
– Analyze cost/benefit for large-concurrency jobs
– Note lessons learned and tips for scaling well

Current Usage of Seaborg
We can examine current job statistics on Seaborg to check:
– User behavior (how jobs are run)
– Queue wait times
We can also look at the results of the large-job reimbursement program to see how it influenced the way users ran jobs.

Job Distribution (3/3/2003-5/26/2003, regular charge class) [chart]

Connect Time Usage Distribution (3/3/2003-5/26/2003, regular charge class) [chart]

Queue Wait Times (3/3/2003-5/26/2003, regular charge class) [chart]

Processor Time/Wait Ratio (3/3/2003-5/26/2003, regular charge class) [chart]

Current Usage Summary
– Users run many small jobs
– However, 55% of computing time is spent running jobs that use more than 16 nodes (256 processors)
– And 45% of computing time is used by jobs running on 32+ nodes (512+ CPUs)
– Current queue policy favors large jobs; it is not a barrier to running on many nodes

Factors that May Affect Scaling
Why aren't even more jobs run at high concurrency? Are any of the following bottlenecks to scaling?
– Algorithmic issues
– Coding effort needed
– MPP cost per amount of science achieved
– Any remaining scheduling / job turnaround issues
– Other?

Hints from Reimbursement
– During April, NERSC reimbursed a number of projects for jobs using 64+ nodes
– Time was set aside to let users investigate the scaling performance of their codes
– Some projects made great use of the program, showing that they would run at high concurrency if given free time

Reimbursement Usage
Run-time percentage using 64+ nodes (examples):

Project PI    Oct.-March    April
Toussaint     36%           59%
Ryne          19%           56%
Cohen         17%           48%
Held           0%           78%
Borrill        8%           64%

Batchelor went from 0% to 66% of time running on 128+ nodes (2,048 CPUs).

Project Activity
Many projects are working with their User Services Group contacts on:
– Characterizing scaling performance
– Profiling codes
– Parallel I/O strategies
– Enhancing code for high concurrency
– Compiler and runtime bug fixes and optimizations
Examples: Batchelor (Jaeger), Ryne (Qiang, Adelmann), Vahalla, Toussaint, Mezzacappa (Swesty, Strayer, Blondin), Butalov, Guzdar (Swisdak), Spong

Project Example 1
Qiang's (Ryne) BeamBeam3D beam dynamics code, written in Fortran
– Poor scaling noted on N3E compared to N3
– We performed many scaling runs and noticed very bad performance using 16 tasks/node
– Tracked the problem to a routine making heavy use of the RANDOM_NUMBER intrinsic
– Identified a runtime problem with IBM's default threading of RANDOM_NUMBER
– Found an undocumented setting that improved performance dramatically; reported it to IBM
– Identified a run strategy that minimized execution time, and another that minimized cost

BeamBeam3D Scaling [chart]

BeamBeam Run Time
[Table: run times for various combinations of tasks per node and total number of tasks; paired values compare the intrinthds runtime setting with the default (intrinthds=1). The numeric row and column headers were lost in transcription.]

MPP Charges
[Table: MPP charges for the same combinations of nodes and tasks, with paired values as in the previous table. The numeric row and column headers were lost in transcription.]

BeamBeam Summary
– Found a fix for the runtime performance problem
– Reported it to IBM; seeking clarification and documentation
– Identified the run configuration that solved the problem fastest
– Identified the cheapest job
– Quantified MPP cost for various configurations (the fastest-versus-cheapest tradeoff is sketched below)
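An illustrative aside, not part of the original slides: the fastest-versus-cheapest distinction can be captured in a few lines. The minimal Python sketch below assumes Seaborg-style node charging, in which a job is charged for all 16 CPUs on every node it holds for the full wall-clock time; the configurations and timings in it are hypothetical placeholders, not the measured BeamBeam3D numbers.

# Hypothetical sketch: compare the "fastest" and "cheapest" run configurations.
# Assumes charge = nodes * 16 CPUs * wall-clock hours (Seaborg-style node charging);
# the timings are made-up placeholders, not measured BeamBeam3D data.

CPUS_PER_NODE = 16

# (nodes, tasks per node) -> wall-clock time in hours (hypothetical)
timings = {
    (16, 16): 1.00,   # fully packed nodes
    (32, 16): 0.55,
    (32, 8):  0.60,   # same node count, fewer tasks per node
    (64, 16): 0.35,
}

def charge(nodes, hours):
    """MPP charge when a job pays for every CPU on every node it holds."""
    return nodes * CPUS_PER_NODE * hours

fastest = min(timings, key=lambda cfg: timings[cfg])
cheapest = min(timings, key=lambda cfg: charge(cfg[0], timings[cfg]))

for (nodes, tpn), hours in sorted(timings.items()):
    print(f"{nodes:3d} nodes x {tpn:2d} tasks/node: {hours:4.2f} h, "
          f"charge = {charge(nodes, hours):6.1f} CPU-hours")

print("Fastest configuration:", fastest)
print("Cheapest configuration:", cheapest)

On a machine charged per node, spreading work over more nodes (or leaving CPUs idle on each node) can cut wall-clock time while raising the charge, which is why the fastest and the cheapest configurations are generally not the same.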

Project Example 2
Adelmann's (Ryne) PARSEC code
– 3D self-consistent iterative field solver and particle code for studying accelerator beam dynamics; written in C++
– Scales extremely well to 4,096 processors, but Mflops/s performance is disappointing
– Migrating from KCC to xlC; found a fatal xlC compiler bug; pushing IBM for a fix so the code can be optimized with the IBM compiler
– Using HPMlib profiling calls, found that a large amount of run time is spent in integer-only stenciling routines, which naturally gives low Mflops/s
– Have recently identified possible poor load balancing; working to resolve (a generic imbalance check is sketched below)
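A generic illustration, not PARSEC's actual diagnostics: a common first check for load balance is to compare the slowest task's time per step with the average over all tasks. The Python sketch below uses made-up per-task timings; the function name and the numbers are hypothetical.

# Minimal sketch of a common load-imbalance metric: slowest task over average task.
# The per-task timings are made-up placeholders, not PARSEC measurements.

def load_imbalance(task_times):
    """Return (imbalance factor, wasted fraction) for a list of per-task run times."""
    slowest = max(task_times)
    average = sum(task_times) / len(task_times)
    imbalance = slowest / average            # 1.0 means perfectly balanced
    wasted = 1.0 - average / slowest         # fraction of CPU time spent waiting
    return imbalance, wasted

# Example: 8 tasks, one of which carries roughly twice the work of the others
times = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 19.5]
factor, wasted = load_imbalance(times)
print(f"Imbalance factor: {factor:.2f}, wasted time: {wasted:.0%}")

An imbalance factor well above 1.0 (about 1.7 in this example) means most tasks spend a large part of each iteration waiting for the slowest one, so per-task tuning alone will not recover the lost time.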

In Conclusion
This work is underway. We don't expect to be able to characterize every code we are studying, but we hope to survey a number of algorithms and scientific applications. A draft report is scheduled for July.