

GSFC NCCS NCCS User Forum 25 September 2008

Agenda
∙ Welcome & Introduction: Phil Webster, CISTO Chief; Scott Wallace, CSC PM
∙ Current System Status: Fred Reitz, Operations Manager
∙ Compute Capabilities at the NCCS: Dan Duffy, Lead Architect
∙ Questions and Comments: Phil Webster, CISTO Chief
∙ User Services Updates: Bill Ward, User Services Lead
∙ SMD Allocations: Sally Stemwedel, HEC Allocation Specialist

NCCS Support Services
∙ HEC Compute
  ▶ Large scale HEC computing cluster and on-line storage
  ▶ Comprehensive toolsets for job scheduling and monitoring
∙ Data Archival and Stewardship
  ▶ Large capacity storage
  ▶ Tools to manage and protect data
  ▶ Data migration support
∙ User Services
  ▶ Help Desk
  ▶ Account/Allocation support
  ▶ Computational science support
  ▶ User teleconferences
  ▶ Training & tutorials
∙ Analysis & Visualization
  ▶ Interactive analysis environment
  ▶ Software tools for image display
  ▶ Easy access to data archive
  ▶ Specialized visualization support
∙ Data Sharing
  ▶ Capability to share data & results
  ▶ Supports community-based development
  ▶ Facilitates data distribution and publishing
∙ Code Development & Collaboration
  ▶ Code repository for collaboration
  ▶ Environment for code development and test
  ▶ Code porting and optimization support
  ▶ Web based tools
∙ High Speed Networks
  ▶ Internal high speed interconnects for HEC components
  ▶ High-bandwidth to NCCS for GSFC users
  ▶ Multi-gigabit network supports on-demand data transfers
∙ DATA
  ▶ Global file system enables data access for full range of modeling and analysis activities

Resource Growth at the NCCS

NCCS Staff Transitions
∙ New Govt. Lead Architect: Dan Duffy (301)
∙ New CSC Lead Architect: Jim McElvaney (301)
∙ New User Services Lead: Bill Ward (301)
∙ New HPC System Administrator: Bill Woodford (301)

Key Accomplishments
∙ SLES10 upgrade
∙ GPFS 3.2 upgrade
∙ Integrated SCU3
∙ Data Portal storage migration
∙ Transition from Explore to Discover

Integration of Discover SCU4
∙ SCU4 to be connected to Discover 10/1 (date firm)
∙ Staff testing and scalability 10/1-10/5 (dates approximate)
∙ Opportunity for interested users to run large jobs 10/6-10/13 (dates approximate)

Explore Utilization Past Year
[Chart; values shown: 67.0%, 69.0%, 85.6%, 73.0%; 613,975 CPU hours]

Explore Availability
May through August availability
∙ 13 outages
  ▶ 8 unscheduled
    ◆ 7 hardware failures
    ◆ 1 human error
  ▶ 5 scheduled
∙ 65 hours total downtime
  ▶ 32.8 unscheduled
  ▶ 32.2 scheduled
Longest outages
∙ 8/13 - Hardware failure, 19.9 hrs
  ▶ E2 hardware issues after scheduled power down for facilities maintenance
  ▶ System left down until normal business hours for vendor repair
  ▶ Replaced/unseated PCI bus, IB connection
∙ 8/13 - Hardware maintenance, hrs
  ▶ Scheduled outage
  ▶ Electrical work - facility upgrade
[Chart annotations: 8/13 - Hardware failure; 8/13 - Facilities maintenance; 5/14 - Hardware maintenance]

Explore Queue Expansion Factor
Expansion factor = (Queue Wait Time + Run Time) / Run Time
Weighted over all queues for all jobs (Background and Test queues excluded)
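
The expansion factor defined on this slide can be sketched in a few lines of Python. The slide does not say how the weighting "over all queues for all jobs" is done, so the aggregate below is an assumption: it divides total (wait + run) time by total run time, which is the same as a run-time-weighted mean of the per-job factors. The function names and the sample job list are hypothetical.

```python
from typing import Iterable, Tuple


def expansion_factor(wait_hours: float, run_hours: float) -> float:
    """Expansion factor for one job: (queue wait + run time) / run time."""
    if run_hours <= 0:
        raise ValueError("run time must be positive")
    return (wait_hours + run_hours) / run_hours


def aggregate_expansion_factor(jobs: Iterable[Tuple[float, float]]) -> float:
    """Aggregate over (wait_hours, run_hours) pairs.

    ASSUMPTION: the weighting is by run time, i.e. total (wait + run)
    divided by total run. The slide does not specify the weighting.
    """
    total_wait = 0.0
    total_run = 0.0
    for wait_hours, run_hours in jobs:
        total_wait += wait_hours
        total_run += run_hours
    if total_run <= 0:
        raise ValueError("total run time must be positive")
    return (total_wait + total_run) / total_run


if __name__ == "__main__":
    # Hypothetical jobs: (hours waiting in queue, hours running).
    sample_jobs = [(0.5, 1.0), (0.0, 4.0), (2.0, 2.0)]
    for wait, run in sample_jobs:
        print(wait, run, expansion_factor(wait, run))
    print("aggregate:", aggregate_expansion_factor(sample_jobs))
```

A factor of 1.0 means a job started as soon as it was submitted; larger values mean a larger share of turnaround time was spent waiting in the queue.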

Discover Utilization Past Year
[Chart; values shown: 81.1%, 76.1%, 58.7%, 69.2%; 1,320,683 CPU hours]

Discover Utilization
[Chart annotation: SCU3 cores added]

Discover Availability
May through August availability
∙ 11 outages
  ▶ 6 unscheduled
    ◆ 2 hardware failures
    ◆ 2 software failures
    ◆ 3 extended maintenance windows
  ▶ 5 scheduled
∙ 89.9 hours total downtime
  ▶ 35.1 unscheduled
  ▶ 54.8 scheduled
Longest outages
∙ 7/10 - SLES 10 upgrade, 36 hrs
  ▶ 15 hours planned
  ▶ 21 hour extension
∙ 8/20 - Connect SCU3, 15 hrs
  ▶ Scheduled outage
∙ 8/13 - Facilities maintenance, 10.6 hrs
  ▶ Electrical work - facility upgrade
[Chart annotations: 7/10 - Extended SLES10 upgrade window; 7/10 - SLES10 upgrade; 8/20 - Connect SCU3 to cluster; 8/13 - Facilities electrical upgrade]
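
The downtime figures on the Explore and Discover availability slides can be turned into availability fractions with simple arithmetic. This is a sketch, not a figure from the presentation: it assumes the reporting period is the four full calendar months May 1 through August 31 (123 days), which the slides do not state, and the function name is hypothetical.

```python
# ASSUMPTION: "May through August" means May 1 through August 31 inclusive.
HOURS_MAY_THROUGH_AUGUST = (31 + 30 + 31 + 31) * 24  # 123 days


def availability(downtime_hours: float,
                 period_hours: float = HOURS_MAY_THROUGH_AUGUST) -> float:
    """Fraction of the period during which the system was up."""
    if period_hours <= 0:
        raise ValueError("period must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must lie within the period")
    return 1.0 - downtime_hours / period_hours


# Downtime hours as (unscheduled, scheduled), taken from the slides.
DOWNTIME = {
    "Explore": (32.8, 32.2),
    "Discover": (35.1, 54.8),
}

if __name__ == "__main__":
    for system, (unscheduled, scheduled) in DOWNTIME.items():
        overall = availability(unscheduled + scheduled)
        unscheduled_only = availability(unscheduled)
        print(f"{system}: overall {overall:.2%}, "
              f"counting only unscheduled outages {unscheduled_only:.2%}")
```

Reporting both numbers separates planned maintenance (such as the SLES 10 upgrade) from failures that users could not plan around.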

Discover Queue Expansion Factor
Expansion factor = (Queue Wait Time + Run Time) / Run Time
Weighted over all queues for all jobs (Background and Test queues excluded)

Current Issues on Discover: InfiniBand Subnet Manager
∙ Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes.
∙ Outcome: Job failures due to node removal
∙ Status: Modified several subnet manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred; admins monitoring.

Current Issues on Discover: PBS Hangs
∙ Symptom: PBS server experiencing 3-minute hangs several times per day
∙ Outcome: PBS-related commands (qsub, qstat, etc.) hang
∙ Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.

Current Issues on Discover: Problems with PBS -V Option
∙ Symptom: Jobs with large environments not starting
∙ Outcome: Jobs placed on hold by PBS
∙ Status: Investigating with Altair (vendor). In the interim, requested users not pass full environment via -V, instead use -v or define necessary variables within job scripts.
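
As an illustration of the interim advice, the sketch below builds a qsub command line that passes only selected variables with -v (a comma-separated list of name=value pairs) instead of exporting the whole environment with -V. The helper name, the variable names, and the script name are hypothetical; nothing here is an NCCS-provided tool, and the command is only assembled, not submitted.

```python
from typing import List, Mapping, Sequence


def build_qsub_command(script: str,
                       names: Sequence[str],
                       env: Mapping[str, str]) -> List[str]:
    """Return a qsub argument list that exports only `names` via -v.

    Values containing commas are rejected because the -v list itself
    is comma-separated.
    """
    pairs = []
    for name in names:
        if name not in env:
            raise KeyError(f"{name} is not set in the given environment")
        value = env[name]
        if "," in value:
            raise ValueError(f"{name} contains a comma; set it in the job script")
        pairs.append(f"{name}={value}")
    return ["qsub", "-v", ",".join(pairs), script]


if __name__ == "__main__":
    # Hypothetical environment and job script, for illustration only.
    example_env = {"EXPID": "run01", "NX": "8", "UNRELATED": "not passed"}
    command = build_qsub_command("job.pbs", ["EXPID", "NX"], example_env)
    print(" ".join(command))
```

The other option named on the slide, defining the variables inside the job script itself, avoids the command-line list entirely.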

Current Issues on Discover: Problem with PBS and LDAP
∙ Symptom: Intermittent PBS failures while communicating with LDAP server.
∙ Outcome: Jobs rejected with bad UID error due to failed lookup
∙ Status: LDAP configuration changes to improve information caching and reduce queries to LDAP server; significantly reduced problem frequency. Still investigating with Altair.

Future Enhancements
∙ Discover Cluster
  ▶ Hardware platform - SCU4 10/1/2008
  ▶ Additional storage
∙ Discover PBS Select Changes
  ▶ Syntax changes to streamline job resource requests
∙ Data Portal
  ▶ Hardware platform
∙ DMF
  ▶ Hardware platform

Overall Acquisition Planning Schedule 2008 - 2009
[Gantt chart, January 2008 through 2009. Rows: Facilities (E100), Storage Upgrade, Compute Upgrade. Labeled activities: Stage 1: Power & Cooling Upgrade; Stage 2: Cooling Upgrade; Write RFP; Issue RFP, Evaluate Responses, Purchase; Delivery & Integration; Write RFP; Issue RFP, Evaluate Responses, Purchase; Delivery; Stage 1: Integration & Acceptance; Stage 2: Integration & Acceptance; Explore Decommissioned. Marker: "We are here!"]

What does this schedule mean to you?
Expect some outages - Please be patient
[Timeline chart, January 2008 through 2009. Rows: Storage Upgrade, Compute Upgrade, Discover Mods. Labeled milestones: Additional Storage On-line; Stage 1 Compute Capability Available for Users; GPFS 3.2 Upgrade (not RDMA); SLES 10 Software Stack Upgrade; Stage 2 Compute Capability Available for Users; Decommission Explore. Status labels: DONE, Delayed]

Cubed Sphere Finite Volume Dynamic Core Benchmark
∙ Non-hydrostatic, 10 km resolution
∙ Most computationally intensive benchmark
∙ Discover Reference Timings
  ▶ 216 cores (6x6) - 6,466s
  ▶ 288 cores (6x8) - 4,879s
  ▶ 384 cores (8x8) - 3,200s
∙ All runs made using ALL cores on a node.
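
The reference timings above can be read as a strong-scaling series. The sketch below computes speedup and parallel efficiency relative to the 216-core run. Treating 216 cores as the baseline is an assumption made here for illustration (the slide reports only raw timings), and the function name is hypothetical.

```python
from typing import Dict, Tuple

# Discover reference timings from the slide: cores -> wall-clock seconds.
REFERENCE_TIMINGS: Dict[int, float] = {216: 6466.0, 288: 4879.0, 384: 3200.0}


def scaling(timings: Dict[int, float], base_cores: int) -> Dict[int, Tuple[float, float]]:
    """Return {cores: (speedup, efficiency)} relative to `base_cores`.

    speedup    = T(base) / T(cores)
    efficiency = speedup / (cores / base_cores); 1.0 is ideal scaling.
    """
    base_time = timings[base_cores]
    results = {}
    for cores, seconds in sorted(timings.items()):
        speedup = base_time / seconds
        efficiency = speedup / (cores / base_cores)
        results[cores] = (speedup, efficiency)
    return results


if __name__ == "__main__":
    # ASSUMPTION: the 216-core run is used as the scaling baseline.
    for cores, (speedup, efficiency) in scaling(REFERENCE_TIMINGS, 216).items():
        print(f"{cores} cores: speedup {speedup:.3f}, efficiency {efficiency:.3f}")
```

With these inputs the 384-core run comes out above 1.0 efficiency relative to the baseline, so the three timings are not a simple linear-scaling series; the slide gives no explanation, and none is assumed here.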

Near Future
∙ Additional storage to be added to the cluster
  ▶ 240 TB raw
  ▶ By the end of the calendar year
  ▶ RDMA pushed into next year
∙ Potentially one additional scalable unit
  ▶ Same as the new IBM units
  ▶ By the end of the calendar year
∙ Small IBM Cell application development testing environment
  ▶ 2 to 3 months

SMD Allocation Policy Revisions 8/1/08
∙ 1-year allocations only during regular spring and fall cycles
  ▶ Fall e-Books deadline 9/20 for November 1 awards
  ▶ Spring e-Books deadline 3/20 for May 1 awards
∙ Projects started off-cycle
  ▶ Must have support of HQ Program Manager to start off-cycle
  ▶ Will get limited allocation expiring at next regular cycle award date
∙ Increases over 10% of award or 100K processor-hours during award period need support of funding manager; to request
∙ Projects using up allocation faster than anticipated are encouraged to submit for next regular cycle.
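
Two of the rules above are mechanical enough to sketch in code. This is one reading of the slide, not official policy logic: it assumes that exceeding either threshold (10% of the award or 100K processor-hours) triggers the funding-manager requirement, and that an off-cycle allocation expires on the first May 1 or November 1 strictly after its start date. Both function names are hypothetical.

```python
from datetime import date


def needs_funding_manager_support(award_hours: int, increase_hours: int) -> bool:
    """True if a requested increase exceeds either threshold on the slide.

    ASSUMPTION: either condition alone is enough (more than 10% of the
    award, or more than 100K processor-hours).
    """
    over_ten_percent = increase_hours * 10 > award_hours
    over_100k = increase_hours > 100_000
    return over_ten_percent or over_100k


def next_award_date(start: date) -> date:
    """First regular award date (May 1 or November 1) strictly after `start`."""
    candidates = (
        date(start.year, 5, 1),
        date(start.year, 11, 1),
        date(start.year + 1, 5, 1),
    )
    for candidate in candidates:
        if candidate > start:
            return candidate
    raise AssertionError("unreachable: next year's May 1 is always later")


if __name__ == "__main__":
    print(needs_funding_manager_support(500_000, 40_000))
    print(needs_funding_manager_support(2_000_000, 150_000))
    print(next_award_date(date(2008, 9, 25)))
```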

Questions about Allocations?
∙ Allocation POC: Sally Stemwedel, HEC Allocation Specialist, (301)
∙ SMD allocation procedure and e-Books submission link

Explore Will Be Decommissioned
∙ It is a leased system
∙ e1, e2, e3 must be returned to vendor
∙ Palm will remain
∙ Palm will be repurposed
∙ Users must move to Discover

Transition to Discover - Phases
∙ PI notified
∙ Users on Discover
∙ Code migrated
∙ Data accessible
∙ Code acceptably tuned
∙ Performing production runs

Transition to Discover - Status
[Three status slides; figures not captured in the transcript]

Accessing Discover Nodes
∙ We are in the process of making the PBS select statement simpler and more streamlined
∙ Keep doing what you are doing until we publish something better
∙ For most folks, changes will not break what you are using now

Discover Compilers
∙ comp/intel preferred
∙ comp/intel if intel-10 doesn't work
∙ comp/intel only if absolutely necessary

MPI on Discover
∙ mpi/scali-5
  ▶ Not supported on new nodes
  ▶ -l select= :scali=true to get it
∙ mpi/impi
  ▶ Slower startup than OpenMPI
  ▶ Catches up later (anecdotal)
  ▶ Self-tuning feature still under evaluation
∙ mpi/openmpi-1.2.5/intel-10
  ▶ Does not pass user environment
  ▶ Faster startup due to built-in PBS support

Data Migration Facility Transition
∙ DMF hosts Dirac/Newmintz (SGI Origin 3800s) to be replaced by parts of Palm (SGI Altix)
∙ Actual cutover Q1 CY09
∙ Impacts to Dirac users:
  ▶ Source code must be recompiled
  ▶ Some COTS must be relicensed
  ▶ Other COTS must be rehosted

Accounts for Foreign Nationals
∙ Codes 240, 600, and 700 have established a well-defined process for creating NCCS accounts for foreign nationals
∙ Several candidate users have been navigated through the process
∙ Prospective users from designated countries must go to NASA HQ
∙ Process will be posted on the web very soon
  m?topic=visitors.foreign

Feedback
∙ Now - Voice your …
  ▶ Praises?
  ▶ Complaints?
∙ Later - NCCS Support
  ▶ (301)
∙ Later - USG Lead (me!)
  ▶ (301)

Open Discussion
∙ Questions
∙ Comments