1 NCCS User Forum 11 December 2008

2 Agenda
∙ Welcome & Introduction – Phil Webster, CISTO Chief
∙ Current System Status – Fred Reitz, Operations Manager
∙ NCCS Compute Capabilities – Dan Duffy, Lead Architect
∙ User Services Updates – Bill Ward, User Services Lead
∙ Questions and Comments – Phil Webster, CISTO Chief

3 Agenda
∙ Welcome & Introduction – Phil Webster, CISTO Chief
∙ Current System Status – Fred Reitz, Operations Manager
∙ NCCS Compute Capabilities – Dan Duffy, Lead Architect
∙ User Services Updates – Bill Ward, User Services Lead
∙ Questions and Comments – Phil Webster, CISTO Chief

4 Key Accomplishments
∙ SCU4 added to Discover and currently running in “pioneer” mode
∙ Explore decommissioned and removed
∙ Discover filesystems converted to GPFS 3.2 native mode

5 Discover Utilization – Past Year
[Utilization chart: 64.4%, 67.1%, and 73.3% utilization figures; 1,320,683 and 2,446,365 CPU hours; annotation marking when the SCU3 cores were added]

6 Discover Utilization

7 Discover Queue Expansion Factor
∙ Expansion Factor = (Eligible Time + Run Time) / Run Time
∙ Weighted over all jobs in all queues (Background and Test queues excluded)
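
Restating the slide's metric as a formula (the per-job weighting scheme is not spelled out on the slide):

```latex
\text{Expansion Factor} = \frac{T_{\text{eligible}} + T_{\text{run}}}{T_{\text{run}}}
```

A value of 1.0 means jobs start as soon as they become eligible; larger values indicate longer queue waits relative to run time.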

8 Discover Availability
∙ September through November availability
▶ 13 outages
◆ 9 unscheduled (0 hardware failures, 7 software failures, 2 extended maintenance windows)
◆ 4 scheduled
▶ 104.3 hours total downtime (68.3 unscheduled, 36.0 scheduled)
∙ Longest outages
▶ 11/28-29 – GPFS hang, 21 hrs
▶ 11/12 – Electrical maintenance, Discover reprovisioning, 18 hrs (scheduled outage)
▶ 10/1 – SCU4 integration, 11.5 hrs (scheduled outage plus extension)
▶ 9/2-3 – Subnet Manager hang, 11.3 hrs
▶ 11/6 – GPFS hang, 10.9 hrs
[Timeline chart labeling individual outages: GPFS hangs, Subnet Manager hangs and maintenance, SCU4 integration and switch reconfiguration, electrical maintenance/Discover reprovisioning]

9 Current Issues on Discover: GPFS Hangs
∙ Symptom: GPFS hangs resulting from users running nodes out of memory
∙ Impact: Users cannot log in or use the filesystem; system admins must reboot the affected nodes
∙ Status: Implemented additional monitoring and reporting tools

10 Current Issues on Discover: Problems with the PBS -V Option
∙ Symptom: Jobs with large environments not starting
∙ Impact: Jobs placed on hold by PBS
∙ Status: Consulting with Altair. In the interim, don't use -V to pass the full environment; instead use -v to pass selected variables, or define the necessary variables within the job script.
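
A minimal job-script sketch of the suggested workaround, assuming a csh-style PBS script; the variable names, paths, and executable are placeholders, not taken from the slide:

```csh
#!/bin/csh
#PBS -l walltime=1:00:00
# Instead of "qsub -V job.csh" (exports the entire login environment and can
# keep the job from starting), pass only the variables the job actually needs:
#   qsub -v RUN_DIR,EXPERIMENT job.csh
# ...or simply set them inside the script:
setenv RUN_DIR /discover/nobackup/jdoe/run01    # placeholder path
setenv EXPERIMENT test_case                     # placeholder value

cd $RUN_DIR
mpirun ./model.x                                # placeholder executable
```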

11 Resolved Issues on Discover: InfiniBand Subnet Manager
∙ Symptom: Working nodes erroneously removed from GPFS following InfiniBand subnet problems with other nodes
∙ Impact: Job failures due to node removal
∙ Status: Modified several Subnet Manager configuration parameters on 9/17 based on IBM recommendations. Problem has not recurred.

12 Resolved Issues on Discover: PBS Hangs
∙ Symptom: PBS server experiencing 3-minute hangs several times per day
∙ Impact: PBS-related commands (qsub, qstat, etc.) hang
∙ Status: Identified periodic use of two communication ports also used for hardware management functions. Implemented a workaround on 9/17 to prevent conflicting use of these ports. No further occurrences.

13 Resolved Issues on Discover: Intermittent NFS Problems
∙ Symptom: Inability to access archive filesystems
∙ Impact: Hung commands and sessions when attempting to access $ARCHIVE
∙ Status: Identified a hardware problem with the Force10 E600 network switch. Implemented a workaround and replaced the line card. No further occurrences.

14 Future Enhancements
∙ Discover cluster
▶ Hardware platform
▶ Additional storage
∙ Data Portal
▶ Hardware platform
∙ Analysis environment
▶ Hardware platform
∙ DMF
▶ Hardware platform

15 Agenda
∙ Welcome & Introduction – Phil Webster, CISTO Chief
∙ Current System Status – Fred Reitz, Operations Manager
∙ NCCS Compute Capabilities – Dan Duffy, Lead Architect
∙ User Services Updates – Bill Ward, User Services Lead
∙ Questions and Comments – Phil Webster, CISTO Chief

16 Very High Level of What to Expect in FY09
[FY09 timeline chart, October through September]
∙ Major Initiatives
▶ Discover SW Stack Upgrade
▶ Cluster Upgrade (Nehalem)
▶ Analysis System
▶ DMF from IRIX to Linux
▶ Data Management Initiative
▶ New Tape Drives
∙ Other Activities
▶ Discover FC and Disk Addition
▶ Additional Discover Disk
▶ Continued Scalability Testing
▶ Delivery of IBM Cell

17 Adapting the Overall Architecture
∙ Services will have
▶ More independent SW stacks
▶ Consistent user environment
▶ Fast access to the GPFS file systems
▶ Large additional disk capacity for longer storage of files within GPFS
∙ This will result in
▶ Fewer downtimes
▶ Rolling outages (not everything at once)

18 Conceptual Architecture Diagram
[Diagram: Discover batch cluster (Base, SCU1–SCU4, Viz) and the FY09 compute upgrade (Nehalem), interactive Analysis Nodes, and the Data Portal, each with their own GPFS I/O servers on InfiniBand; DMF archive; all tied together by SAN connections and a 10 GbE LAN]

19 What is the Analysis Environment?
∙ Initial technical implementation plan
▶ Large shared-memory nodes (at least 256 GB)
◆ 16-core nodes with 16 GB/core
▶ Interactive (not batch); direct logins
▶ Fast access to GPFS
▶ 10 GbE network connectivity
▶ Software stack consistent with Discover
▶ Independent of the compute stack (coupled only by GPFS)
∙ Additional storage, dedicated to analysis, for staging data from the archive
∙ Visibility and easy access to the archive and data portal (NFS)

20 Excited about Intel Nehalem
∙ Quick specs
▶ Core i7, 45 nm
▶ 731 million transistors per quad-core
▶ 2.66 GHz to 2.93 GHz
▶ Private L1 (32 KB) and L2 (256 KB) caches per core
▶ Shared L3 cache (up to 8 MB) across all the cores
▶ 1,066 MHz DDR3 memory (3 channels per processor)
∙ Important features
▶ Intel QuickPath Interconnect
▶ Turbo Boost
▶ Hyper-Threading
∙ Learn more at:
▶ http://www.intel.com/technology/architecture-silicon/next-gen/index.htm
▶ http://en.wikipedia.org/wiki/Nehalem_(microarchitecture)

21 Nehalem versus Harpertown
∙ Single-thread improvement (will vary by application)
∙ Larger cache, with the 8 MB L3 shared across all cores
∙ Memory-to-processor bandwidth dramatically increased over Harpertown
▶ Initial measurements have shown a 3 to 4x memory-to-processor bandwidth increase

22 Agenda
∙ Welcome & Introduction – Phil Webster, CISTO Chief
∙ Current System Status – Fred Reitz, Operations Manager
∙ NCCS Compute Capabilities – Dan Duffy, Lead Architect
∙ User Services Updates – Bill Ward, User Services Lead
∙ Questions and Comments – Phil Webster, CISTO Chief

23 Issues from Last User Forum: Shared Project Space
∙ Implementation of shared project space on Discover
∙ Status: resolved
▶ Available for projects by request
▶ Accessible via the $SHARE environment variable (correct usage)
▶ Accessible directly via /share (deprecated)
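
For example (the project directory name below is only a placeholder):

```csh
# Preferred: reference shared project space through the environment variable
cd $SHARE/myproject        # placeholder project name
# Deprecated: hard-coding the /share path
# cd /share/myproject
```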

24 Issues from Last User Forum: Increase Queue Limits
∙ Increase CPU & time limits in queues
∙ Status: resolved

Queue          Priority   Max CPUs   Max Hours
test           101        2064       12
general_hi     80         512        24
debug          70         32         1
general_long   55         256        24
general        50         256        12
general_small  50         16         12
background     1          256        4
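
A job request has to fit within the limits of the queue it targets. As an illustration only (the resource amounts and script contents are placeholders, and node core counts vary across Discover's SCUs):

```csh
#!/bin/csh
#PBS -q general                  # 256-CPU / 12-hour limits per the table above
#PBS -l select=32:ncpus=8        # 256 CPUs total, within the queue limit
#PBS -l walltime=12:00:00
mpirun ./model.x                 # placeholder executable
```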

25 Issues from Last User Forum: Commands to Access DMF
∙ Implementation of dmget and dmput
∙ Status: test version ready to be enabled on Discover login nodes
▶ Reason for delay was that dmget on non-DMF-managed files would hang
▶ There may still be stability issues
▶ E-mail will be sent soon notifying users of availability
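
Typical usage of the standard DMF commands once they are enabled (the file paths are placeholders):

```csh
# Recall archived files from tape to disk before reading them
dmget $ARCHIVE/run01/output*.nc       # placeholder path
# Release the on-disk copy when finished (the data remains on tape)
dmput -r $ARCHIVE/run01/output*.nc
```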

26 Issues from Last User Forum: Enabling Sentinel Jobs
∙ Running a “sentinel” subjob to watch a main parallel compute subjob within a single PBS job
∙ Status: under investigation
▶ Requires an NFS mount of the data portal file system on Discover gateway nodes
▶ Requires some special PBS usage to specify how subjobs will land on nodes
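
Since this is still under investigation, the following is only a speculative sketch of how such a job might be laid out under PBS Pro; the chunk specification, script names, and process counts are all assumptions, not an NCCS recipe:

```csh
#!/bin/csh
#PBS -l select=1:ncpus=1+8:ncpus=8   # one small chunk for the sentinel plus compute chunks (illustrative)
#PBS -l walltime=12:00:00
# Start the sentinel on the first allocated node; it periodically checks the
# model's output (e.g., staging files toward the data portal) while the main
# computation runs.
./sentinel_watcher.csh &             # placeholder monitoring script
# Run the main parallel compute subjob
mpirun -np 64 ./model.x              # placeholder executable
wait
```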

27 Other Issues: Poor Interactive Response
∙ Slow interactive response on Discover
∙ Status: resolved
▶ Router line card replaced
▶ Automatic monitoring instituted to promptly detect future problems

28 Other Issues: Parallel Jobs > ~300-400 CPUs
∙ Some users experiencing problems running jobs larger than ~300-400 CPUs on Discover
∙ Status: resolved
▶ “stacksize unlimited” needed in the .cshrc file
▶ Intel MPI passes the environment along, including settings from shell startup files
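
Per the slide, the fix amounts to a single line in the shell startup file, which the MPI processes then pick up:

```csh
# ~/.cshrc
limit stacksize unlimited
```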

29 Other Issues: Parallel Jobs > 1500 CPUs
∙ Many jobs won't run at > 1500 CPUs
∙ Status: under investigation
▶ Some simple jobs will run
▶ NCCS consulting with IBM and Intel to resolve the issue
▶ Software upgrades probably required
▶ Solution may fix slow Intel MPI startup

30 Other Issues: Visibility of the Archive
∙ Visibility of the archive from Discover
∙ Current status
▶ Compute/viz nodes don't have external network connections
▶ “Hard” NFS mounts guarantee data integrity, but if there is an NFS hang, the node hangs
▶ Login/gateway nodes may use a “soft” NFS mount, but with a risk of data corruption
▶ bbftp or scp (to Dirac) preferred over cp when copying data
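
For moving data to or from the archive, transfers like the following are preferred over cp across an NFS mount (the file names and destination directory are placeholders):

```csh
# Copy a result file to the archive host (Dirac) with scp
scp results.tar dirac:/archive/u/jdoe/run01/      # placeholder destination
# ...or with bbftp, which generally performs better for large files
bbftp -e "put results.tar /archive/u/jdoe/run01/results.tar" dirac
```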

31 DMF Transition
∙ Dirac due to be replaced in Q2 CY09
▶ Interactive host for GrADS, IDL, Matlab, etc.
▶ Much larger memory
▶ GPFS shared with Discover
▶ Significant increase in GPFS storage
∙ Impacts to Dirac users:
▶ Source code must be recompiled
▶ COTS software must be relicensed/rehosted
∙ Old Dirac remains up until the migration is complete

32 Help Us Help You
∙ Don't use the PBS -V option (jobs hang with the error “too many failed attempts to start”)
∙ Direct stdout and stderr to specific files, or you will fill up the PBS spool directory
∙ Use an interactive batch session instead of an interactive session on a login node
∙ If you suspect your job is crashing nodes, call us before running again
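
A quick sketch of the first three points (the paths and resource requests are placeholders):

```csh
# In the job script: send stdout/stderr to files in your own space rather
# than letting them accumulate in the PBS spool directory.
#PBS -o /discover/nobackup/jdoe/logs/run01.out   # placeholder path
#PBS -e /discover/nobackup/jdoe/logs/run01.err

# For interactive work, request an interactive batch session on a compute
# node instead of working on a shared login node:
#   qsub -I -l select=1:ncpus=8 -l walltime=2:00:00
```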

33 Help Us Help You (continued)
∙ Try to be specific when reporting problems, for example:
▶ If the archive is broken, specify the symptoms
▶ If files are inaccessible or can't be recalled, please send us the file names

34 Plans
∙ Implement a better scheduling policy
∙ Implement integrated job performance monitoring
∙ Implement better job metrics reporting
∙ Or…

35 Feedback
∙ Now – voice your…
▶ Praises?
▶ Complaints?
▶ Suggestions?
∙ Later – NCCS Support
▶ support@nccs.nasa.gov
▶ (301) 286-9120
∙ Later – USG Lead (me!)
▶ William.A.Ward@nasa.gov
▶ (301) 286-2954

36 Agenda
∙ Welcome & Introduction – Phil Webster, CISTO Chief
∙ Current System Status – Fred Reitz, Operations Manager
∙ NCCS Compute Capabilities – Dan Duffy, Lead Architect
∙ User Services Updates – Bill Ward, User Services Lead
∙ Questions and Comments – Phil Webster, CISTO Chief

37 Open Discussion
∙ Questions
∙ Comments

