
NCCS User Forum 24 March 2009

NCCS Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead

NCCS Key Accomplishments
Incorporation of SCU4 processors into the general queue pool
Acquisition of the analysis system

NCCS Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead

NCCS Key Accomplishments
SCU4 processors added to the general queue pool on Discover
SAN implementation
Improved data sharing between Discover and the Data Portal
RAID 6 implementation

NCCS Discover Utilization, Past Year by Month
9/4/08 – SCU3 (2064 cores added)
2/4/09 – SCU4 (544 cores moved from test queue)
2/19/09 – SCU4 (240 cores moved from test queue)
2/27/09 – SCU4 (1280 cores moved from test queue)

NCCS Discover Utilization, Past Quarter by Week
2/4/09 – SCU4 (544 cores moved from test queue)
2/19/09 – SCU4 (240 cores moved from test queue)
2/27/09 – SCU4 (1280 cores moved from test queue)

NCCS Discover CPU Consumption, Past 6 Months (CPU Hours)
9/4/08 – SCU3 (2064 cores added)
2/4/09 – SCU4 (544 cores moved from test queue)
2/19/09 – SCU4 (240 cores moved from test queue)
2/27/09 – SCU4 (1280 cores moved from test queue)

NCCS Discover Queue Expansion Factor, December – February
Expansion factor = (Eligible Time + Run Time) / Run Time
Weighted over all queues for all jobs (Background and Test queues excluded)
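As a worked example with illustrative numbers (not taken from the chart): a job that waits 6 hours in the eligible state and then runs for 2 hours has an expansion factor of (6 + 2) / 2 = 4, while a factor of 1.0 means jobs start with no queue wait.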

NCCS Discover Job Analysis – February 2009

NCCS Discover Availability
December through February availability: 4 outages
– 2 unscheduled (0 hardware failures, 1 user error, 1 extended maintenance window)
– 2 scheduled
11.7 hours total downtime
– 1.2 hours unscheduled
– 10.5 hours scheduled
Outages
– 2/11 – Maintenance (InfiniBand and GPFS upgrades, node reprovisioning), 10.5 hours – scheduled outage plus extension
– 11/12 – SPOOL filled due to user error, 45 minutes
– 1/6 – Network line card replacement, 30 minutes – scheduled outage

NCCS Current Issues on Discover: GPFS Hangs
Symptom: GPFS hangs resulting from users running nodes out of memory.
Impact: Users cannot log in or use the filesystem; system admins must reboot affected nodes.
Status: Implemented additional monitoring and reporting tools.

NCCS Current Issues on Discover: Problems with PBS –V
Symptom: Jobs with large environments do not start.
Impact: Jobs are placed on hold by PBS.
Status: Awaiting the PBS 10.0 upgrade. In the interim, do not use –V to pass the full environment; instead use –v to pass only the variables the job needs, or define the necessary variables within the job script.
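A minimal sketch of the interim workaround (the variable names, paths, and resource requests are illustrative, not an NCCS-prescribed template):

    #PBS -l walltime=01:00:00
    ## Avoid "#PBS -V" (exporting the full login environment can keep large jobs from starting).
    ## Pass only the variables the job needs:
    #PBS -v MODEL_CONFIG,RUN_ID

    # ...or define the needed variables inside the script itself:
    export MODEL_CONFIG=$HOME/config/run42.rc   # illustrative path
    mpirun ./my_model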

NCCS Future Enhancements
Discover Cluster
– Hardware platform
– Additional storage
Data Portal
– Hardware platform
Analysis environment
– Hardware platform
DMF
– Hardware platform
– Additional disk cache

NCCS Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead

NCCS FY09 Operating Plan: Breakdown of Major Initiatives
Analysis System Integration
– Large-scale disk and interactive analysis nodes
– Pioneer users in April; full production in June
FY09 Cluster Upgrade
– Two scalable compute units (approximately 4K cores)
– Additional 40 TF of Intel Nehalem processors
– To be completed by July (subject to vendor availability of equipment)
Data Portal
– Enhance services within the data portal to serve IPCC and other data to the Earth System Grid (ESG) and PCMDI
– Actively looking for partners
– To be completed by the end of FY09
Data Management
– Concept of operations still being worked out
– Actively looking for partners
– Plan is to have some capability based on iRODS rolled out by the end of FY09
DMF Migration from Irix to Linux
– Move DMF equipment out of S100 into E100
– SGI is dropping support for DMF on Irix; the Palm (SGI Linux) system will be re-used as the new DMF server
– To be completed by June

NCCS Representative Architecture (diagram)
Compute: existing Discover (65 TF), FY09 upgrade (~40 TF), future upgrades TBD
Storage: GPFS disk subsystems (~1.3 PB); archive with disk (~300 TB) and tape (~8 PB)
Service nodes: login, data gateways, viz, Data Portal, GPFS I/O nodes, direct-connect GPFS nodes
Internal services: management servers, license servers, GPFS management, PBS servers, data management, other services
Analysis system
NCCS LAN (1 GbE and 10 GbE)
Legend distinguishes existing components, those planned for FY09, and future plans

NCCS Benefits of the Representative Architecture
Breakout of services
– Separate highly available login, data mover, and visualization service nodes
– These can remain available even while upgrades are occurring elsewhere in the cluster
Data Mover Service: these service nodes allow
– Data to be moved between the Discover cluster and the archive
– Data within GPFS to be served to the Data Portal
WAN-accessible nodes within the compute cluster
– Users have requested that nodes within compute jobs have access to the network
– The NCCS is currently configuring network-accessible nodes that can be scheduled in PBS jobs so users can run sentinel-type processes, easily move data via NFS mounts, etc.
Internal services run on dedicated nodes
– Allows the vertical components of the architecture to go up and down independently
– Critical services are run in a high-availability mode
– Can even allow licenses to be served outside the NCCS

NCCS Analysis Requirements
Phase 1: Reproduce current SGI capabilities
– Fast access to all GPFS and archive file systems
– FORTRAN, C, IDL, GrADS, Matlab, Quads, LATS4D, Python
– Visibility and easy access to post data to the Data Portal
– Interactive display of analysis results
Beyond Phase 1:
– Develop client/server capabilities: extend analytic functions to the user's workstation; subsetting functions
– In-line and interactive visualization: synchronize analysis with model execution; see the intermediate data as they are being generated; generate images for display back to the user's workstation; capture and store images during execution for later analysis

NCCS Analysis System Technical Solution (diagram)
DMF archive and archive file systems, with a large staging area to minimize data recall from the archive; additional storage
GPFS I/O servers (4 MDS & 16 NSD) on a Fibre Channel SAN with multiple interfaces
Analysis nodes (16 cores, 256 GB each) on InfiniBand
20 direct GPFS I/O connections, ~3 GB/sec per node
Discover compute cluster; IP over IB (single stream: MB/sec; aggregate: ~600 GB/sec)
NFS, bbftp, scp (single stream: MB/sec; aggregate: GB/sec)
Large network pipes; 10 GbE LAN

NCCS Analysis System Technical Details
8 IBM x3950 nodes
– 4-socket, quad-core (16 cores per server, 128 cores total)
– Intel Dunnington E7440, 2.4 GHz cores with 1,066 MHz FSB
– 256 GB memory (16 GB/core)
– 10 GbE network interface
– Can be configured as a single system image of up to 4 servers (64 cores and 1 TB of RAM)
GPFS file system
– Direct-connect I/O servers
– ~3 GB/sec per analysis node
– Analysis nodes will see ALL GPFS file systems, including the nobackup areas currently in use; no need to "move" data into the analysis system
Additional disk capacity
– 2 x DDN S2A9900 SATA disk subsystems
– ~900 TB raw capacity
– Total of ~6 GB/sec throughput

NCCS Analysis System Timeline
1 April 2009: Pioneer/early access users
– If you would like to be one of the first, please let us know: contact User Services and provide some details about what you may need.
1 May 2009: Analysis system in production
– Continued support for analysis users migrating off of Dirac.
1 June 2009: Dirac transition
– Dirac no longer used for analysis.
– Migrate DMF from Irix to Linux.

NCCS Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead

NCCS What Happened to My Ticket?
First, it comes to me at USG for aging…
Then, if it makes it into FootPrints, it will be eaten by trolls…

NCCS Ticket Closure Percentiles for the Past Quarter

NCCS Issue: Commands to Access DMF
Implementation of dmget and dmput
Status: resolved
– Enabled on Discover login nodes
– Performance has been stable since installation on 11 Dec 08
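Typical dmget/dmput usage looks like the following sketch (the archive paths are illustrative):

    # Stage archived files back to disk before reading them
    dmget /archive/u/username/run42/*.nc

    # Migrate a finished file back to tape; -r also releases the disk copy
    dmput -r /archive/u/username/run42/output.tar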

NCCS Issue: Parallel Jobs > 1500 CPUs
Many jobs won't run at > 1500 CPUs
Status: resolved
– Requires a different version of the DAPL library
– Since this is not the officially supported version, it is not the default

NCCS Issue: Enabling Sentinel Jobs
Need the capability to run a "sentinel" subjob that watches a main parallel compute subjob within a single PBS job
Status: in process
– Requires an NFS mount of Data Portal file systems on Discover gateway nodes (done!)
– Requires some special PBS usage to specify how subjobs will land on nodes (see the sketch below)
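One possible shape for such a job, offered only as an illustrative sketch (the chunk sizes, script names, and launcher details are assumptions, not the NCCS-recommended recipe):

    ## Request one 1-core chunk for the sentinel plus 16 8-core chunks for the model
    #PBS -l select=1:ncpus=1+16:ncpus=8
    #PBS -l walltime=12:00:00

    # Start the sentinel on the first allocated vnode (index 0) in the background
    pbsdsh -n 0 $HOME/bin/sentinel.sh &

    # Launch the parallel compute subjob; depending on the MPI stack, the host
    # list may need to exclude the sentinel node
    mpirun -np 128 ./my_model
    wait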

NCCS Issue: Poor Interactive Response
Slow interactive response on Discover
Status: under investigation
– Router line card replaced
– Automatic monitoring instituted to promptly detect future problems
– Seems to happen when filesystem usage is heavy (anecdotal)

NCCS Issue: Getting Jobs into Execution
Long wait for queued jobs before launching
Reasons
– SCALI=TRUE is restrictive
– Per-user and per-project limits on the number of eligible jobs (use qstat –is)
– Scheduling policy: first-fit on the job list ordered by queue priority and queue time
Status: under investigation
– Individual job priorities available in PBS v10 may help with this

NCCS Use of Project Shared Space
Please begin using "$SHARE" instead of "/share", since the shared space may move
For the same reason, avoid soft links that explicitly point to "/share/…" (illustrated below)
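An illustrative before-and-after (the project and file names are made up):

    # Fragile: hard-codes the current mount point
    ln -s /share/projectX/restarts $HOME/restarts

    # Preferred: reference $SHARE so scripts keep working if the space moves
    export RESTART_DIR=$SHARE/projectX/restarts
    cp $RESTART_DIR/restart.nc .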

NCCS Dirac Filesystems
Dirac's disks are being repurposed as primary archive cache
Hence, the SGI file systems on Dirac will be going away
Users will need to migrate data off of the SGI home, nobackup, and share file systems
Contact User Services if you need assistance
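One hedged sketch of a migration command (the host name and paths are illustrative; check with User Services for the recommended method and destination):

    # From a Discover login node, pull a directory from Dirac into your nobackup area
    scp -rp dirac:/u/username/old_results /discover/nobackup/username/old_results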

NCCS Integrated Performance Monitor (IPM)
Provides
– A short report of resource consumption, and
– A longer web-based presentation
Requires
– Low runtime overhead (2%-5%)
– Linking with the MPI wrapper library (your job)
– A newer version of the OS for complete statistics (our job)
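A hedged sketch of what the relinking step typically looks like (the library location and post-processing command are assumptions for illustration; the exact procedure on Discover may differ):

    # Relink the application against the IPM MPI wrapper library
    mpif90 -o my_model my_model.o solver.o -L$IPM_HOME/lib -lipm

    # Run as usual; IPM prints a short resource-consumption banner at the end
    # of the run and writes an XML profile that can be rendered as the longer
    # web-based report (e.g., with ipm_parse -html)
    mpirun -np 64 ./my_model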

NCCS IPM Sample Output

NCCS IPM Sample Output

NCCS Access to Analysis System
Pioneer access scheduled for 1 April
All Dirac analysis users are welcome as pioneers
Initially, no charge against your allocation
If you have no allocation in e-Books, contact USG and we will resolve it

NCCS Future User Forums
The next three NCCS User Forums
– 23 June, 22 Sep, 8 Dec
– All on Tuesday
– All 2:00-3:30 PM
– All in Building 33, Room H114
Published
– On
– On GSFC-CAL-NCCS-Users

NCCS Agenda
Welcome & Introduction – Phil Webster, CISTO Chief
Current System Status – Fred Reitz, Operations Manager
NCCS Compute Capabilities – Dan Duffy, Lead Architect
Questions and Comments – Phil Webster, CISTO Chief
User Services Updates – Bill Ward, User Services Lead

NCCS Feedback
Now – Open discussion to voice your…
– Praises
– Complaints
– Suggestions
Later to NCCS Support
– (301)
Later to USG Lead
– (301)