User Forum NASA Center for Climate Simulation High Performance Science July 22, 2014.

Presentation transcript:

User Forum NASA Center for Climate Simulation High Performance Science July 22, 2014

Agenda
Introduction
Hardware Updates & Procurements
User Survey
Archive
Operations and User Services Updates
Questions and Answers
NCCS User Forum July 22, 2014

Staff Additions
Welcome to New Members of the NCCS Team: Jordan Robertson, George Britzolakis, Dan’l Pierce, Steve Ambrose
Welcome to Summer Interns: Mira Holford, Winston Zhou, Caitlin Ross, Joseph Clamp
Posters presented on Thursday, July 31st, B28 Atrium

Recent Accomplishments: Systems and Operations
Hosted full-day Allinea workshop (MAP, DDT) (Mar 2014)
Integration Efforts
  – Nature Run Storage on Discover: 7,200 TB RAW disk (Nov 2013)
  – JIBB Upgrades: ~40 TF SandyBridge and ~400 TB RAW disk (Feb-Apr 2014)
  – ESGF data node on new SandyBridge node with 10 Gbps (Feb 2014)
  – “Authorization to Operate” (ATO) completed and signed for 3 more years (Apr/May 2014)
  – Migration out of 9 legacy Tape Libraries (June 2014)
Discover Cluster Efforts
  – SLURM migration (October 2013)
  – IB Fabric congestion reduction: cable replacements and configuration changes
Archive Growth and Policy Recommendations Study (June 2014)
Pre-ABoVE on proof-of-concept NCCS High Performance Science Cloud (ongoing)

Recent Accomplishments: Campaigns and Special Support
Field Campaigns
  – DISCOVER-AQ Fall 2013
  – HS3 Summer/Fall 2013
  – ATTREX (Guam) Winter 2014
  – IPHEX 2014 (Smoky Mountains) May/June 2014
  – DISCOVER-AQ FRAPPE (Colorado) Ongoing 2014
Upcoming Field Campaigns
  – ARISE and HS
Other Special Support:
  – SMAP Level 4 Root Zone and Carbon product generation support
  – DSCOVR EPIC processing (ongoing)
  – GEOS-5 two-year, 7-km Nature Run
  – MERRA2
  – ABoVE
[Image] View from the NASA ER-2 during an IPHEX 2014 flight, May 24, 2014 (image credit: NASA)
[Image] NU-WRF’s outer (9-km) domain forecast for 1100 EDT April 29, 2014, depicting simulated radar reflectivity, sea level pressure, and wind vectors. When compared with operational models for this forecast, NU-WRF better simulated diminished precipitation over the IPHEX 2014 study region.

GSFC-Wide Chilled Water Outage (Cooling for NCCS Hardware), July 2014
Center-wide chilled water outage July 8 (began 19:41) due to a lightning strike in Building 24 that affected the West Campus pumps
NCCS Facilities team arrived on site shortly after to assess the situation
Upon realizing that the chilled water would be out for an indefinite amount of time, the operations team began bringing down all HPC systems
Users were notified as quickly as possible
Room temperatures rose rapidly and exceeded 120 F within a short time period before the systems were shut down
FMD addressed power issues and started pumps
The pumps were started back up several hours after the event
It took several hours for the water to reach normal operating temperatures
It took several hours for the rooms to reach normal operating temperatures
Operations team began restoring service early July 9th
Discover available July 9th at 17:10 (without SCU8)
Archive available July 11th at 19:00 (after significant disk rebuilds)
NCCS lessons learned held on July 15th

Hardware Updates and Procurements Dan Duffy, HPC Lead and NCCS Lead Architect

FY14-FY15 Cluster Upgrade
Combined funding from FY14 and FY15
  – Taking advantage of new Intel processors: double the floating point operations over SandyBridge
  – Decommission SCU7 (Westmeres)
Scalable Unit 10
  – Target to effectively double the NCCS compute capability
  – 128 GB of RAM per node with FDR IB (56 Gbps) or greater
  – Benchmarks used in procurement include GEOS5 and WRF
Target delivery date ~Oct 2014

FY14 NCCS Wide File System
Augment storage along with the cluster upgrade
  – Targeting about 10 PB or more (depends on cost)
Creation of an NCCS wide file system
  – Separate from GPFS
      Available even when there are issues with GPFS
  – Possible NFS solution (exploring options)
      Many applications will benefit from client side caching
  – Move home directories and other file systems into this storage solution
  – Accessible by all Discover nodes (including compute) and Archive
  – Will provide data to portal services (just like GPFS)
Procurement
  – To be released early August
  – Target installation late fall 2014

Archive Upgrades
Increased DMF License Capacity (45 PB)
Tape Storage Area Network (SAN)
  – Upgraded switch capacity and speeds (16 Gbps)
20 New Tape Drives
  – Capable of 8 TB per tape
  – To be installed in August 2014
Migration of Tapes to new Drives (constant)
Archive capacity planning study (more on this later)

Nature Run Storage – Installed Late Fall 2013
Integrated 7,200 TB RAW disk capacity for the GMAO Nature Run
2-year Nature Run at 7.5 km resolution
  – Completed
3-month Nature Run at 3.5 km resolution
  – Just starting
Will generate about 4 PB of data (compressed)
All data to be publicly accessible

Discover Storage Breakdown (June Operational Analysis)
[Chart] Nature Run storage (s1062) on new filesystem (dnb03): was rapidly growing, leveled off, then clean-up after completion of run (another to start soon)

Hyperwall Monitors Installed June 2014
Upgraded 4-year old monitors
  – 15 high resolution monitors
  – New mounting mechanism
Next Steps
  – Content being updated for HD
  – Servers to be updated in 2015
Please feel free to request scheduling of the wall for:
  – Presentations
  – Tours
  – Family
  – School groups
[Image] Lori Perkins (Science Visualization Studio) describes a visualization of aerosols simulated by GEOS-5 and displayed on the new Visualization Wall in the NCCS’s Data Exploration Theater. (Photo credit: Jarrett Cohen, CISTO/GST. Aerosol image provided by Bill Putman, Global Modeling and Assimilation Office, GSFC Code 610.1)
To schedule the wall, contact Heidi Dewan or NCCS User Services.

JIBB Upgrade – Early 2014
Doubled the Compute Capacity to ~77 TF Peak
  – Additional 120 Compute Nodes
  – 1,920 cores; 39 TF
  – 2.6 GHz Intel SandyBridge with 64 GB of RAM
  – Fourteen Data Rate (FDR) InfiniBand network (56 Gbps) in a 2-to-1 blocking fabric
Doubled Storage Capacity to ~800 TB
2 New Login Nodes
Nature Run mounted on login nodes
  – Exploring options to extend the Nature Run to the compute nodes
Upgraded to approximately double the computational and storage capacity. Received funding through NOAA from the Hurricane Sandy Relief bill.
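The 39 TF quoted for the added nodes is consistent with the usual peak estimate of cores × clock rate × floating point operations per cycle. The sketch below reproduces that arithmetic from the slide's numbers. The factor of 8 double-precision operations per cycle per core (AVX on Sandy Bridge) is an assumption, not something stated on the slide, and the function name is illustrative.

```python
def peak_teraflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in TF: cores x clock (GHz) x FLOPs per cycle / 1000."""
    return cores * clock_ghz * flops_per_cycle / 1000.0

# From the slide: 120 additional nodes, 1,920 cores, 2.6 GHz.
cores_per_node = 1920 // 120

# Assumption (not on the slide): 8 double-precision FLOPs per cycle per core.
added_tf = peak_teraflops(cores=1920, clock_ghz=2.6, flops_per_cycle=8)

print(f"{cores_per_node} cores per node, {added_tf:.1f} TF peak for the added nodes")
```

Under that assumption the added nodes come to just under 40 TF, which matches both the "39 TF" here and the "~40 TF" quoted in the accomplishments slide.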

NCCS User Survey Results & Responses Al Settell, CSC Program Manager, CISTO-SCTS

Comparison to 2012 Survey
Area
Overall satisfaction with NCCS
High Performance Computing
Computing for Analysis
Long Term Storage (Archive)
Short Term Storage (Local disk)
Data Transfer to/from NCCS
Data Transfer within NCCS: 4.04
Data Publication/Distribution: 4.00
Help Desk
Account Management
Allocation Management
Applications Support
Documentation
Training
Communicating with Users
Tools to visualize scientific data
Developing and testing code

Results by Service Area - Performance

Results by Service Area - Importance

Results by Service Area - Performance (P) Minus Importance (I)
[Chart, axis running from I > P to I < P] Service areas as listed: Short Term Storage (Local disk), Documentation, Computing for Analysis, Data Transfer to/from NCCS, High Performance Computing, Communicating with Users, Long Term Storage (Archive), Developing and testing code, Help Desk, Tools to visualize scientific data, Allocation Management, Account Management, Training, Data Transfer within NCCS, Data Publication/Distribution, Applications Support
Focus on the areas where the importance is much greater than the performance.
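The performance-minus-importance ranking can be sketched as a simple gap calculation. The scores below are hypothetical placeholders on a 1-5 scale, since the survey's actual values did not survive in this transcript; only the service-area names come from the slide.

```python
# Hypothetical scores (1-5 scale); the real survey values are not in the transcript.
scores = {
    "Short Term Storage (Local disk)": {"performance": 3.8, "importance": 4.6},
    "Documentation": {"performance": 3.7, "importance": 4.2},
    "Training": {"performance": 3.9, "importance": 3.6},
}

def gaps_sorted(scores):
    """Return (area, P - I) pairs, most negative gap (importance exceeds performance) first."""
    gaps = [
        (area, round(s["performance"] - s["importance"], 2))
        for area, s in scores.items()
    ]
    return sorted(gaps, key=lambda pair: pair[1])

ranked = gaps_sorted(scores)
for area, gap in ranked:
    print(f"{area}: {gap:+.2f}")
```

Areas at the top of the ranking are the ones where importance most exceeds performance, which is where the slide suggests focusing.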

Themes – Based on Scores and User Comments
Communications
  – Improved documentation/support, e.g., more examples in primer
  – User Notification improvement (more timely and consistent notifications)
  – Ticketing system improvements
Discover
  – Longer running jobs
  – More scratch space
  – Process improvements, e.g., quicker response to requests for increased disk
Archive
  – Improved reliability and data restore timeliness
  – Performance
New Services
  – Remote visualization
  – Remote GUI-interactive improvements
  – Expanded licensing

Action Plan
Communications
  – Created Communications and Marketing Plan
  – Website and virtual presence improvements
  – Business process improvements for notifications
Discover
  – Longer running jobs via SLURM Quality of Service (QoS)
  – NCCS center-wide file system
  – Business process improvements for disk requests
Archive
  – Archive Study and Planning Improvements (ongoing)
  – Storage Area Network (SAN) and Tape Drive Upgrades
  – More is coming
New Services
  – Remote visualization servers and software being delivered in near future
  – Explore remote desktop capabilities to improve GUI interactive response on Discover
  – Tracking license usage and “denials” of license for better capacity planning

NCCS Archive Tom Schardt

Archive Capacity Planning Study
Archive capacity planning study was completed in June 2014
  – A person from outside the NCCS was commissioned for the study
The study took into account
  – Current architecture
  – Growth projections
  – Options for performance improvements
  – Specific and general suggestions
  – Projected growth and budget forecasts

Projected Growth

Noted Areas of Concern
Thrashing of archive file systems (using archive as scratch)
Data does not remain resident very long on the disk cache
The large number of small files in DMF causes problems
Large amounts of files/data are stored and never recalled
Constant migration of data to newer tape media puts a load on the system above and beyond the load from users
Tape libraries are almost full; new libraries are very expensive and take up large amounts of space in the computer room
Overall cost to maintain growth
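As an illustration of the small-files concern, a user could check how many small files a directory tree holds before sending it to the archive. This is a minimal sketch, not an NCCS tool: the function name and the 1 MiB default threshold are assumptions chosen for the example.

```python
import os

def small_file_summary(root: str, threshold_bytes: int = 1024 * 1024):
    """Count regular files under root and how many are smaller than threshold_bytes.

    Symbolic links are skipped. The default threshold is an illustrative
    assumption, not a published archive limit.
    """
    total = 0
    small = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            total += 1
            if os.path.getsize(path) < threshold_bytes:
                small += 1
    return total, small
```

Calling `small_file_summary("my_run_output")` on a hypothetical output directory returns a `(total, small)` pair. A high share of small files suggests bundling them (for example with `tar`) before archiving, a common practice that the slide itself does not prescribe.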

Areas Under Consideration Based on the Study
Perform a full analysis of the archive solution, including the following
  – Policies
  – Architecture
  – Budget
  – Performance Improvements
  – Hardware
  – Operations
  – Functionality
  – User Advisory Group
Identify improvements, prioritize, and implement
  – Not a lengthy process

Capping the Growth of the Archive
Quotas are needed to control the growth of the archive and therefore maintain budgetary constraints
Additional policies under consideration include
  – Data expiration
  – Other (TBD)
These are under preliminary evaluation
  – Communication and coordination with the users is critical to the successful implementation of any policy
User Advisory Group
  – The NCCS is looking for users who would like to take part in an advisory group on archive changes
  – This group would be a start on an overall NCCS Advisory Group for all services
  – If you are interested, please let us know

NCCS Operations & User Services Update Ellen Salmon

Building 33 “Office Hours” for NCCS Technical User Services Staff
An NCCS representative typically holds Wednesday office hours in Building 33, room C116
Purpose is to provide face-to-face technical user support to assist in
  – Troubleshooting (e.g., steps to minimize swapping)
  – Optimizing code
  – Optimizing use of NCCS resources
  – Facilitating NCCS responses to user requests
Schedule for next four weeks:
  – 7/23: George Britzolakis
  – 7/30: Hamid Oloso
  – 8/6: Denis Nadeau
  – 8/13: Eric Winter
Feel free to stop by with questions and problems

New Batch Job Capabilities via “Native” SLURM
New capabilities coming, via “native” SLURM
For example, Quality of Service (qos), which can enable many features, e.g.:
  – Longer job wall-time
  – Users must request to be enabled (contact NCCS support)
  – More to come
These advanced features are available via “native” SLURM and will not be “back-ported” to the PBS wrapper.
Watch for the Brown Bag Seminar on July 31 about converting PBS scripts to “native” SLURM.

Upcoming Brown Bag Seminar
How to Convert Your Discover Job Scripts from PBS to SLURM
  – July 31, 2014, 12 noon, Bldg. 33, H118
  – Review of issues & techniques involved when migrating PBS job scripts to “native” SLURM scripts.
  – “Native” SLURM scripts allow use of advanced features like Quality of Service (qos).
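To give a flavor of what such a conversion involves, the sketch below translates a few common PBS directives into their `#SBATCH` equivalents and optionally adds a `--qos` line. It is an illustrative partial mapping, not the NCCS procedure or the seminar's content: resource requests such as `#PBS -l select=...` vary by site and are flagged for manual conversion, and the QoS name used in the example is hypothetical.

```python
import re
from typing import Optional

# Partial mapping of PBS flags to Slurm sbatch long options.
SIMPLE_FLAGS = {
    "-N": "--job-name",
    "-q": "--partition",
    "-A": "--account",
    "-o": "--output",
    "-e": "--error",
}

def convert_line(line: str) -> str:
    """Translate one '#PBS' directive; leave every other line unchanged."""
    stripped = line.strip()
    if not stripped.startswith("#PBS"):
        return line
    parts = stripped.split(None, 2)  # ["#PBS", flag, value]
    if len(parts) == 3:
        _, flag, value = parts
        if flag in SIMPLE_FLAGS:
            return f"#SBATCH {SIMPLE_FLAGS[flag]}={value}"
        match = re.fullmatch(r"walltime=(\d+:\d{2}:\d{2})", value)
        if flag == "-l" and match:
            return f"#SBATCH --time={match.group(1)}"
    # Anything not covered above is left for the user to convert by hand.
    return f"# TODO convert manually: {stripped}"

def convert_script(text: str, qos: Optional[str] = None) -> str:
    """Convert a whole script; optionally request a QoS (name is site-specific)."""
    lines = [convert_line(line) for line in text.splitlines()]
    if qos:
        # Keep the shebang first; #SBATCH lines must precede executable commands.
        insert_at = 1 if lines and lines[0].startswith("#!") else 0
        lines.insert(insert_at, f"#SBATCH --qos={qos}")
    return "\n".join(lines)

example = """#!/bin/bash
#PBS -N run1
#PBS -l walltime=12:00:00
#PBS -l select=2:ncpus=16
mpirun ./model"""

print(convert_script(example, qos="long"))  # "long" is a hypothetical QoS name
```

The design keeps unknown directives visible as comments rather than silently dropping them, so nothing is lost when a script uses options outside this small table.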

Miscellaneous
Discover to Data Portal 10 GbE Connections upgrade
  – Formerly 4 by 1 GbE and now 2 by 10 GbE (5x improvement)
Dali
  – Monitoring Page
      Idea to create a page for Dali nodes similar to the NCCS job monitor
      Will assess the feasibility and potential implementation
  – Load Balance/Round Robin
      No load balance currently; round robin login across the different Dali nodes
      Will assess the feasibility of load balancing
Tour of the NCCS (individuals, groups, school groups, family)
  – Please schedule through Heidi Dewan and/or send an email to NCCS User Services
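Two points on this slide lend themselves to a short sketch: the 5x figure is the ratio of aggregate link bandwidth, and round robin differs from load balancing in that it ignores how busy each node is. The host names and load values below are hypothetical, and the selection functions only illustrate the two policies rather than how NCCS logins are actually routed.

```python
from itertools import cycle

# Aggregate bandwidth in Gbps: 4 x 1 GbE before, 2 x 10 GbE after.
speedup = (2 * 10) / (4 * 1)

nodes = ["dali01", "dali02", "dali03"]  # hypothetical host names

def round_robin(nodes):
    """Hand out nodes in a fixed rotation, regardless of their current load."""
    return cycle(nodes)

def least_loaded(load_by_node):
    """Pick the node with the lowest reported load (ties broken by name)."""
    return min(sorted(load_by_node), key=lambda node: load_by_node[node])

rotation = round_robin(nodes)
first_four = [next(rotation) for _ in range(4)]

# Hypothetical load averages: round robin might still send a user to dali01.
choice = least_loaded({"dali01": 7.5, "dali02": 0.4, "dali03": 0.4})

print(speedup, first_four, choice)
```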

Questions & Answers NCCS User Services:

Contact Information
NCCS User Services:
Thank you