Review of NCAR
Al Kellie, SCD Director
November 01, 2001
Outline of Presentation
- Introduction to UCAR / NCAR / SCD
- Overview of divisional activities
  - Research data sets (Worley)
  - Mass Storage System (Harano)
  - Extracting model performance (Hammond)
  - Visualization & Earth System Grid (Middleton)
- Computing RFP (ARCS)
Outline of Presentation
- INTRODUCTION
- Overview of three divisional aspects
- Computing RFP (ARCS)
University Corporation for Atmospheric Research (organization chart dated 12/07/98)
- President: Richard Anthes; Member Institutions; Board of Trustees
- Finance & Administration: Katy Schmoll, VP
- Corporate Affairs: Jack Fellows, VP
- UCAR Programs: Jack Fellows, Director
  - Constellation Observing System for Meteorology Ionosphere Climate (COSMIC): Bill Kuo
  - Cooperative Program for Operational Meteorology, Education and Training (COMET): Timothy Spangler
  - GPS Science and Technology Program (GST): Randolph Ware
  - Unidata: David Fulker
  - Visiting Scientists Programs (VSP): Meg Austin
  - Joint Office for Science Support (JOSS): Karyn Sawyer
  - Digital Library for Earth System Science (DLESE): Mary Marlino
  - Information Infrastructure Technology & Applications (IITA): Richard Chinman
- NCAR: Tim Killeen, Director
  - Scientific Computing Division (SCD): Al Kellie
  - Atmospheric Chemistry Division (ACD): Daniel McKenna
  - Atmospheric Technology Division (ATD): David Carlson
  - Advanced Study Program (ASP): Al Cooper
  - Climate & Global Dynamics Division (CGD): Maurice Blackmon
  - Mesoscale & Microscale Meteorological Division (MMM): Robert Gall
  - Research Applications Programs (RAP): Brant Foote
  - Environmental & Societal Impacts Group (ESIG): Robert Harriss
  - High Altitude Observatory (HAO): Michael Knölker
NCAR Organization
- UCAR Board of Trustees
- UCAR: Rick Anthes
- NCAR: Tim Killeen
  - Associate Director: Steve Dickson
  - ISS: K. Kelly
  - B&P: R. Brasher
  - Atmospheric Chemistry: Dan McKenna
  - Atmospheric Technology: Dave Carlson
  - Climate & Global Dynamics: Maurice Blackmon
  - Mesoscale & Microscale Meteorology: Bob Gall
  - High Altitude Observatory: Michael Knölker
  - Research Applications: Brant Foote
  - Scientific Computing: Al Kellie
  - ESIG: Bob Harriss
  - ASP: Al Cooper
NCAR at a Glance
- 41 years; 850 staff, including 135 scientists
- $128M budget for FY2001
- 9 divisions and programs
- Research tools, facilities, and visitor programs for the NSF and university communities
Total FY2001 funding: $128M
NCAR Peer-Reviewed Publications
NCAR Visitors
Where did SCD come from?
- The 1959 "Blue Book": "There are four compelling reasons for establishing a National Institute for Atmospheric Research"
- Reason 2: "The requirement for facilities and technological assistance beyond those that can properly be made available at individual universities"
SCD Mission: Enable the best atmospheric and related research, no matter where the investigator is located, through the provision of high-performance computing technologies and related services.
SCIENTIFIC COMPUTING DIVISION
- Director's Office: Al Kellie, Director (12); budget: Base $24,874, UCAR $4,027, Outside $2,020, Overhead $1,063
- High Performance Systems, Gene Harano (13): Supercomputer Systems; Mass Storage Systems
- Computational Science, Steve Hammond (8): Algorithmic Software Development; Model Performance Research; Science Collaboration Frameworks; Standards & Benchmarking
- Data Support, Roy Jenne (9): Data Archives; Data Catalogs
- User Support Section, Ginger Caldwell (21): User Assistance; Training/Outreach/Consulting; Digital Information; Distributed Servers & Workstations; Allocations & Account Management
- Operations and Infrastructure Support, Aaron Andersen (18): Operations Room; Facility Management & Reporting; Database Applications; Site Licenses
- Network Engineering & Telecommunications, Marla Meehl (25): LAN/MAN/WAN; Dial-up Access; Network Infrastructure
- Visualization & Enabling Technologies, Don Middleton (12): Data Access; Data Analysis; Visualization
Computing Services for Research
- SCD operates two distinct computational facilities: one for climate simulation and one for the university community.
- Governance of these SCD resources is in the hands of the users, through two external allocation committees.
- Both facilities leverage a common infrastructure for access, networking, data storage & analysis, research data sets, and support services, including software development and consulting.
Climate Simulation Laboratory (CSL)
- The CSL is a national, multi-agency, special-use computing facility for climate system modeling in support of the U.S. Global Change Research Program (USGCRP).
- It serves priority projects that require very large amounts of computer time.
- CSL resources are available to individual U.S. researchers, with a preference for research teams, regardless of sponsorship.
- An inter-agency panel selects the projects that use the CSL.
Community Facility
- The Community Facility is used primarily by university-based NSF grantees and NCAR scientists.
- Community resources are allocated evenly between NCAR and the university community.
  - NCAR resources are allocated by the NCAR Director to the various NCAR divisions.
  - University resources are allocated by the SCD Advisory Panel.
- Open to areas of atmospheric and related sciences.
Distribution of Compute Resources
History of Supercomputing at NCAR (timeline chart)
- CDC 3600, CDC 6600, CDC 7600, Cray 1-A S/N 3, Cray Y-MP/2, Cray 1-A S/N 14, TMC CM2/8192, Cray X-MP/4, Cray Y-MP/8, Cray C90/16, Cray T3D/64, TMC CM5/32, IBM RS/6000 Cluster, IBM SP1/8, CCC Cray 3/, Cray Y-MP/8I, Cray T3D/128, Cray J90/16, Cray J90/20, Cray J90se/24, HP SPP-2000/64, SGI Origin2000/128, Beowulf/16, IBM SP/64, IBM SP/604, Compaq ES40/36 Cluster, IBM SP/32, IBM SP/296, IBM SP/
- Chart legend: non-production machines; production machines; currently in production
Chart: STK 9940 tape drives, 2001 (#4, #5)
NCAR Wide Area Connectivity
- OC3 (155 Mbps) to the Front Range GigaPop (FRGP); OC12 (622 Mbps) on 1/1/2002
  - OC3 to AT&T commodity Internet
  - OC3 to C&W commodity Internet
  - OC3 to Abilene (OC12 on 1/1/2002)
  - OC3 to the vBNS+
- OC12 (622 Mbps) to the University of Colorado at Boulder: intra-site research and back-up link to the FRGP
- OC12 to NOAA/NIST in Boulder: intra-site research and UUNET commodity Internet
- Dark-fiber metropolitan area network at GigE (1000 Mbps) to other NCAR campus sites
TeraGrid Wide Area Network (map)
- Sites: NCSA/UIUC, ANL, UIC, StarLight / Northwestern Univ, Illinois Inst of Tech, Univ of Chicago, Indianapolis (Abilene NOC), Los Angeles, San Diego, Urbana, Denver
- StarLight: international optical peering point; I-WIRE; multiple carrier hubs
- Links: DTF backbone; OC-48 (2.5 Gb/s, Abilene); multiple 10 GbE (Qwest); multiple 10 GbE (I-WIRE dark fiber)
- Legend: solid lines in place and/or available by October 2001; dashed I-WIRE lines planned for summer 2002
ARCS Synopsis Credit: Tom Engel
ARCS RFP Overview: Best-Value Procurement
- Technical evaluation
- Delivery schedule
- Production disruption
- Allocation-ready state
- Infrastructure
- Maintenance
- Cost impact, i.e., existing equipment
- Past performance of bidders
- Business proposal review
- Other considerations: invitation to partner
ARCS Procurement
- Production-level system
  - Availability, robust batch capacity, operational sustainability and support
  - Integrated software engineering and development environment
- High-performance execution of existing applications
- Additionally, an environment conducive to development of next-generation models
Workload Profile Context
- Jobs using > 32 nodes: 0.4% of workload; average 44 nodes, or 176 PEs
- Jobs using < 32 nodes: 99.6% of workload; average 6 nodes, or 24 PEs
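As a quick consistency check (illustrative arithmetic only), both averages correspond to the 4-processor Winterhawk-2 nodes that make up most of blackforest (see the roadmap slide later in this deck):

$$\frac{176\ \text{PEs}}{44\ \text{nodes}} = 4\ \text{PEs/node},\qquad \frac{24\ \text{PEs}}{6\ \text{nodes}} = 4\ \text{PEs/node}.$$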
ARCS – The Goal
- A production-level, high-performance computing system providing both capability and capacity computing
- A stable and upwardly compatible system architecture, user environment, and software engineering & development environment
- Initial equipment: at least double the current capacity at NCAR
- Long term: achieve 1 TFLOPs sustained by 2005
ARCS – The Process
- SCD began drafting technical requirements Feb 2000
- RFP process (including scientific representatives from NCAR divisions, UCAR Contracts, and an external review panel) formally began Mar 2000; RFP released Nov 2000
- Offeror proposal reviews, BAFOs, and supplemental proposals Jan-May 2001
- Technical evaluations, performance projections, risk assessment, etc. Feb-Jun 2001
- SCD recommendation for negotiations 21 Jun; NCAR/UCAR acceptance of the recommendation 25 Jun
- Negotiations Jul; technical Ts&Cs completed 14 Aug
- Contract submitted to the NSF 01 Oct
- NSF approval 5 Oct
- Joint press release the week of SC01
ARCS RFP Technical Attributes
- Hardware (processors, nodes, memory, disk, interconnect, network, HIPPI)
- Software (OS, user environment, filesystems, batch subsystem)
- System administration, resource management, user limits, accounting, network/HIPPI, security
- Documentation & training
- System maintenance & support services
- Facilities (power, cooling, space)
Major Requirements
- Critical resource ratios:
  - Disk: 6 Bytes/peak-FLOP; 64+ MB/sec single-stream and 2+ GB/sec bandwidth, sustainable
  - Memory: 0.4 Bytes/peak-FLOP
- "Full-featured" product set (cluster-aware compilers, debuggers, performance tools, administrative tools, monitoring)
- Hardware & software stability
- Hardware & software vendor support & responsiveness (on-site, call center, development organization, escalation procedures)
- Resource allocation (processor(s), node(s), memory, disk; user limits & disk quotas)
- Batch subsystem and NCAR job scheduler (BPS)
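As a rough illustration of what these ratios imply (a back-of-the-envelope check, interpreting "Bytes/peak-FLOP" as bytes per peak FLOP/s and using the 2.0 TFLOPs blackforest figure quoted later in this deck):

$$\text{disk} \ge 6 \times 2.0\times10^{12}\ \text{B} = 12\ \text{TB},\qquad \text{memory} \ge 0.4 \times 2.0\times10^{12}\ \text{B} = 0.8\ \text{TB}.$$

The upgraded blackforest figures on the roadmap slide (10.5 TB of GPFS disk, 0.73 TB of memory) sit close to these targets.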
ARCS – Benchmarks (1)
- Kernels (Hammond, Harkness, Loft)
  - Single processor: COPY, IA, XPOSE, SHAL, RADABS, ELEFUNT, STREAMC
  - Multi-processor shared memory: PSTREAM
  - Message-passing performance: XPAIR, BISECT, XGLOB, COMMS[1,2,3], STRIDED[1,2], SYNCH, ALLGATHER
- Parallel shared-memory applications
  - CCM (T42 30-day & T170 1-day): CGD, Rosinski
  - WRF prototype (b_wave 5-day): MMM, Michalakes
(A sketch of a COPY-style kernel follows the benchmark list below.)
ARCS – Benchmarks (2)
- Parallel (MPI & hybrid) models
  - CCM (T42 30-day & T170 1-day): CGD, Rosinski
  - MM5 3.3 (t3a 6-hr & "large" 1-hr): MMM, Michalakes
  - POP 1.0 (medium & large): CGD, Craig
  - MHD3D (medium & large): HAO, Fox
  - MOZART2 (medium & large): ACD, Walters
  - PCM 1.2 (T42): CGD, Craig
  - WRF prototype (b_wave 5-day): MMM, Michalakes
- System tests
  - HIPPI: SCD, Merrill
  - I/O-tester: SCD, Anderson
  - Network: SCD, Mitchell
  - Batch workload (SCD, Engel) includes: 2 I/O-tester, 4 hybrid MM5 3.3 large, 2 hybrid MM5 3.3 t3a, 2 POP 1.0 (medium & large), CCM T170, MOZART2 medium, PCM 1.2 T42, 2 MHD3D (medium & large), WRF prototype
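For readers unfamiliar with the kernel names above, here is a minimal sketch of what a single-processor COPY/STREAM-style memory-bandwidth kernel looks like. This is an illustration only, not the actual ARCS benchmark source; the array size, repeat count, and output format are arbitrary choices.

```c
/*
 * Minimal sketch of a single-processor memory-bandwidth COPY kernel,
 * in the spirit of the STREAM-style tests named above (COPY, STREAMC).
 * Illustration only: not the actual ARCS benchmark source.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (20 * 1000 * 1000)  /* elements; large enough to spill out of cache */
#define NTIMES 10                  /* repetitions to smooth out timer noise */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    for (long i = 0; i < N; i++)
        a[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int k = 0; k < NTIMES; k++)
        for (long i = 0; i < N; i++)
            b[i] = a[i];            /* the COPY operation being timed */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 2.0 * sizeof(double) * N * NTIMES;  /* one read + one write per element */

    /* Reading b here keeps an optimizing compiler from discarding the copy loop. */
    printf("check value: %f\n", b[N / 2]);
    printf("COPY bandwidth: %.1f MB/s\n", bytes / secs / 1e6);

    free(a);
    free(b);
    return 0;
}
```

Compiled with the platform's optimizing compiler and run on a dedicated node, such a kernel reports a sustained copy rate; the real kernel suite covers many more access and communication patterns than this single loop.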
Risks
- Vendor ability to meet commitments
  - Hardware (processor architecture, clock speed boosts, memory architecture)
  - Software (OS, filesystems, processor-aware compilers/libraries, third-party tools)
- Service, support, responsiveness
- Vendor stability (product set, financial)
- Vendor promises vs. reality
Past Performance
- Hardware & software
  - SCD/NCAR experience
  - Other customers' experience
- "Missed promises"
  - Vendor X: ~2-year slip, product line changes
  - Vendor Y: on target
  - Vendor Z: ~1.5-year slip, product line changes
Other Considerations
- "Blue Light" project: invitation to develop models for an exploratory supercomputer
  - Invitation to a development partnership
  - Offer of an industrial partnership
  - 256 TFLOPs peak, 8 TB memory, 200 TB disk on 64K nodes; true MPP with torus interconnect
  - Node: 64 GFLOPs, 128 MB memory, 32 kB L1 cache, 4 MB L2 cache
  - Columbia, LLNL, SDSC, Oak Ridge
ARCS Award
- IBM was chosen to supply the NCAR Advanced Research Computing System (ARCS), which will exceed the articulated purpose and goals
- A world-class system providing reliable production supercomputing to the NCAR Community and the Climate Simulation Laboratory
- A phased introduction of new, state-of-the-art computational, storage, and communications technologies through the life of the contract (3-5 years)
- First equipment delivered Friday, 5 October
ARCS Timetable

3-Year Contract
- Oct 2001: blackforest upgrade; nodes: Winterhawk-2 & Nighthawk; processor: 375 MHz POWER3-II
- Sep 2002: bluesky with Colony Switch; nodes: Regatta; processor: ~1.35 GHz POWER4
- Sep-Dec 2003: Federation Switch upgrade (blackforest removed after Federation acceptance)

2-Year Extension Option
- Sep-Dec 2004: bluesky upgrade; nodes: Armada; processor: ~2.0 GHz POWER4-GP
ARCS Capacities (total disk capacity in TB, total memory in TB, and peak TFLOPs, new and total)

3-Year Contract
- Oct 2001: blackforest upgrade; peak TFLOPs (total): 2.0
- Sep 2002: bluesky with Colony Switch; peak TFLOPs (total): 6.81+
- Sep-Dec 2003: Federation Switch upgrade

2-Year Extension Option
- Sep-Dec 2004: bluesky upgrade; peak TFLOPs (total): 8.75+

"+" denotes a minimum; negotiated capability commitments may require installation of additional capacity.
ARCS Commitments
- Minimum model capability commitments:
  - blackforest upgrade: 1.0x (defines 'x')
  - bluesky: 3.1x
  - bluesky upgrade: 4.6x
  - Failure to meet these commitments will result in IBM installing additional computational capacity
- Improved user-environment functionality, support, and problem-resolution response
- Early access to new hardware & software technologies
- NCAR participation in IBM's "Blue Light" exploratory supercomputer project (PFLOPs)
Proposed Equipment - IBM

ARO+60 vs. Sep 2002:
- Nodes: 164 WH2/4, 5 NH2/ | POWER4 MI SMP/8
- Processor: 375 MHz POWER3 | 1.35 GHz POWER4
- Interconnect: TBMX, 180 MB/s, 22 usec | Colony/NH2 Adapter†, 345 MB/s, 17 usec
- Peak TF: … | …
- Mem (TB): … | …
- Disk (TB): … | …

System software: PSSP/AIX, JFS/GPFS, LoadLeveler
† Federation switch (2400 MB/s, 4 usec) option in 2H03
ARCS Roadmap

Oct '01 (blackforest upgrade):
- blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; POWER3-II/375 MHz; 315 WH2/4pe + 3 NH2/16pe; 512 MB memory/pe

Oct '02 (bluesky installation):
- blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; POWER3-II/375 MHz; 315 WH2/4pe (NH2 to bluesky); 512 MB memory/pe
- bluesky: 4.8+ TFLOPs peak; 2.8 TB memory; 21 TB GPFS disk; Colony switch; POWER4/~1.35 GHz plus 3 NH2/16pe (POWER3); P4 node/pe counts TBD; ~2.0 GB memory/pe

Oct '03 (Federation upgrade):
- blackforest: 2.0 TFLOPs peak; 0.73 TB memory; 10.5 TB GPFS disk; TBMX switch; POWER3-II/375 MHz; 315 WH2/4pe; 512 MB memory/pe
- bluesky: 4.8+ TFLOPs peak; 2.8 TB memory; 21 TB GPFS disk; Federation switch (NH2 removed); POWER4/~1.35 GHz; P4 node/pe counts TBD; ~2.0 GB memory/pe

Oct '04 (bluesky upgrade):
- bluesky: … TFLOPs peak; 3.8 TB memory; 65 TB GPFS disk; Federation switch; POWER4-GP/~2.0 GHz; P4 node/pe counts TBD; ~3.0 GB memory/pe

"TFLOP Option": SCD will likely augment bluesky with additional POWER4 nodes when blackforest is decommissioned.
Thank you all for attending CAS 2001
See you all in 2003!