1
Cray XT3 Experience So Far: Horizon Grows Bigger
Richard Alexander
24 January 2006
raa@cscs.ch
2
Support Team
Dominik Ulmer
Hussein Harake
Richard Alexander
Davide Tacchella
Neil Stringfellow
Francesco Benvenuto
Claudio Redaelli
+ Cray Support
3
Short History
Cray handover on 8 July 05
Early users
–CSCS staff
–PSI staff
–Very early adopters
Initially marked by great instability
4
What is an XT3?
Massively parallel, Opteron-based, uniprocessor (UP) nodes
Catamount job launcher on compute nodes
–One executable
–No sockets, no fork, no shared memory (see the sketch below)
SuSE Linux on “service nodes”
–I/O nodes, login nodes, system db, boot, “yod or pbs-mom” nodes
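Below is a minimal sketch of what the Catamount programming model implies in practice: a single MPI executable with no fork(), sockets, or shared memory. This is generic, illustrative MPI C, not taken from any actual CSCS code.

#include <mpi.h>
#include <stdio.h>

/* A Catamount compute-node job is one statically linked executable;
 * all communication goes through MPI (over Portals underneath).
 * POSIX IPC such as fork(), sockets, and SysV shared memory is not
 * available on the compute nodes. */
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}

A job like this is launched from a service node with yod (or via PBS), not executed directly on a compute node.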
5
Current Configuration
12 cabinets (racks), approximately 96 compute nodes per rack
4 login nodes behind a DNS round-robin name
Software maintenance workstation (SMW)
SuSE Linux with 2.4 kernel
PBS Pro
TotalView
Portland Group compilers 6.0.5
6
Changes
6 UNICOS/lc upgrades; today 1.03.14 (1.3)
Multiple hardware upgrades
–Most important was raising the SeaStar ASIC voltage from 1.5 V to 1.6 V
Multiple software upgrades
–Multiple versions of Lustre (CFS), the high-performance file system
Multiple firmware upgrades (“portals”)
We have been up for 1 week at a time!
7
Acceptance Period
Began: 21 Nov 05
Functional tests
Performance tests
Stability test for 30 days (7am - 6pm)
8
Cray XT3: Commissioning & Acceptance
Acceptance completed 22 December 2005
Put into production (i.e. opened to all users) 18 January; slow ramp-up planned
Rebooted almost daily during acceptance, less often afterwards
Current usage:
–>75% node utilization
–Jobs using 64-128 nodes are typical
Two groups tell us they have done science they could not do before.
9
Cray XT3 Use Model: Fit the Work to the Hardware
Palu, 1100 compute nodes: batch only
Gele, 84 compute nodes: new users
–compile, debug, scale
Test system, 56 nodes: test environment
–Delivery planned March 06
10
Open Issues and Next Steps
High-speed network not 100% stable
High-speed file system Lustre young and immature
Current batch system limited in functionality
Bugs lead to intermittent node failures
11
High-Speed Network
Stability improved significantly in the last month
Stability still not satisfactory; Cray analyzing problems
Most errors affect/abort single jobs
Occasional errors require a reboot (~1/week?)
No high-speed network => nothing works
12
File System Lustre
Genuine parallel file system
Great performance
Very young and immature feature set
When Lustre is unhealthy (~once or twice a month) the system must be rebooted
Cray continuously improving Lustre; a deadline for a stable version is hard to predict
Real Lustre errors versus HSN problems: difficult to differentiate
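As an illustration of why a parallel file system matters here, the sketch below shows a common file-per-process write pattern: each MPI rank writes its own file, so aggregate bandwidth can scale across Lustre's storage targets. The path /lustre/scratch and the 1 MiB buffer size are hypothetical.

#include <mpi.h>
#include <stdio.h>

static char buf[1 << 20];   /* 1 MiB of zeroes per rank (assumed size) */

int main(int argc, char **argv) {
    int rank;
    char path[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One independent output stream per rank; on a parallel file system
     * these can be serviced by different storage targets concurrently. */
    snprintf(path, sizeof path, "/lustre/scratch/out.%04d", rank);
    FILE *f = fopen(path, "wb");
    if (f != NULL) {
        fwrite(buf, 1, sizeof buf, f);
        fclose(f);
    }

    MPI_Finalize();
    return 0;
}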
13
Job Scheduling
PBS Pro batch system
Current PBS cannot implement desired management policies (priority to large jobs, back filling, reservations)
Cray is delivering a “special” version of PBS with a TCL scheduler
–Will implement priorities and back filling (see the sketch below)
–Testing begins February ‘06
Roadmap for reservations still t.b.d.
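For readers unfamiliar with back filling, the sketch below shows the core idea in plain C: a lower-priority job may jump the queue only if it fits in the currently idle nodes and is guaranteed to finish before the blocked top-priority job's reserved start time. This is a toy model with made-up numbers, not the TCL scheduler Cray is delivering.

#include <stdio.h>

typedef struct {
    const char *name;
    int nodes;       /* nodes requested */
    int walltime;    /* requested run time in minutes */
} Job;

int main(void) {
    int free_nodes   = 300;  /* nodes idle right now (assumed) */
    int shadow_start = 60;   /* minutes until the blocked top job can start */

    /* Priority-ordered queue; q[0] is too big to start immediately and
     * holds a reservation at shadow_start. */
    Job q[] = { {"huge", 900, 240}, {"mid", 256, 90}, {"small", 64, 30} };
    int n = sizeof q / sizeof q[0];

    for (int i = 1; i < n; i++) {
        /* Back-fill rule: start early only if the job fits in the idle
         * nodes AND finishes before the reservation, so the top job is
         * never delayed. */
        if (q[i].nodes <= free_nodes && q[i].walltime <= shadow_start) {
            printf("back-filling %s (%d nodes, %d min)\n",
                   q[i].name, q[i].nodes, q[i].walltime);
            free_nodes -= q[i].nodes;
        } else {
            printf("%s must wait for the reservation\n", q[i].name);
        }
    }
    return 0;
}

With these numbers, "mid" must wait (its 90-minute request would overrun the 60-minute window) while "small" back-fills immediately.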
14
Problems of the Day: Intermittent Node Failures
3-15 times per day nodes die due to several bugs
Cray working on each of these bugs
–Extremely difficult to diagnose
Currently nodes stay down until the next machine reboot
Single-node reboot available March - April 2006??
15
Near-Term Future
Single-node reboot
PBS scheduler: a beginning
Upgrade gele, the novice-user platform
Bring up the smallest system for systems work
Try out dual core; upgrade some of the system
Test Linux on compute nodes
16
Summary
System in production and available for development work
System gaining maturity, but still far from SX-5 or IBM Power4 standards
Main current open issues:
–Parallel file system maturity
–Scheduling system
–Node failures
More interruptions than we would like!