1 Tier1 Report HEPSysMan @ Cambridge 23rd October 2006 Martin Bly

2 Overview
Tier-1
Hardware changes
Services

3 RAL Tier-1
RAL hosts the UK WLCG Tier-1
– Funded via the GridPP2 project from PPARC
– Supports WLCG and UK Particle Physics users and collaborators
VOs:
– LHC: Atlas, CMS, LHCb, Alice (plus dteam, ops)
– Babar, CDF, D0, H1, Zeus
– bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, …
Other experiments:
– Mice, SNO, UKQCD
– Theory users, …

4 Staff / Finance
Bid to PPARC for the 'GridPP3' project
– For the exploitation phase of the LHC
– September 2007 to March 2011
– Increase in staff and hardware resources
– Result expected early 2007
Tier-1 is recruiting
– 2 x systems admins, 1 x hardware technician
– 1 x grid deployment
– Replacement for Steve Traylen to head the grid deployment and user support group
CCLRC internal reorganisation
– Business Units: the Tier-1 service is run by the E-Science department, which is now part of the Facilities Business Unit (FBU)

5 New building
Funding approved for a new computer centre building
– 3 floors: computer rooms on the ground floor, offices above
– 240 m2 low power density room
  Tape robots, disk servers, etc
  Minimum heat density 1.0 kW/m2, rising to 1.6 kW/m2 by 2012
– 490 m2 high power density room
  Servers, CPU farms, HPC clusters
  Minimum heat density 1.8 kW/m2, rising to 2.8 kW/m2 by 2012
– UPS computer room
  8 racks + 3 telecoms racks
  UPS system to provide continuous power of 400 A / 92 kVA three-phase for equipment, plus power to air conditioning (total approx 800 A / 184 kVA)
– Overall
  Space for 300 racks (+ robots, telecoms)
  Power: 2700 kVA initially, max 5000 kVA by 2012 (including air-con)
  UPS capacity to meet an estimated 1000 A / 250 kVA for 15-20 minutes, for clean shutdown of specific hardware / surviving short breaks
– Shared with HPC and other CCLRC computing facilities
– Planned to be ready by summer 2008

6 Hardware changes
FY05/06 capacity procurement, March 06
– 52 x 1U twin dual-core AMD Opteron 270 units
  Tyan 2882 motherboard
  4 GB RAM, 250 GB SATA HDD, dual 1 Gb NIC
  208 job slots, 200 kSI2K
  Commissioned May 06, running well
– 21 x 5U 24-bay disk servers
  168 TB (210 TB) data capacity (see the note below)
  Areca 1170 PCI-X 24-port controller
  22 x 400 GB (500 GB) SATA data drives, RAID 6
  2 x 250 GB SATA system drives, RAID 1
  4 GB RAM, dual 1 Gb NIC
  Commissioning delayed (more below)
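A plausible reading of the two capacity figures (the slides do not spell the arithmetic out): with 22 data drives in RAID 6, two drives' worth of space goes to parity, leaving 20 usable drives per server, so 21 x 20 x 400 GB ≈ 168 TB, or 21 x 20 x 500 GB ≈ 210 TB once the 500 GB replacement drives mentioned under Storage commissioning are fitted.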

7 Hardware changes (2)
FY06/07 capacity procurements
– 47 x 3U 16-bay disk servers: 282 TB data capacity
  3ware 9550SX-16ML PCI-X 16-port SATA RAID controller
  14 x 500 GB SATA data drives, RAID 5
  2 x 250 GB SATA system drives, RAID 1
  Twin dual-core Opteron 275 CPUs, 4 GB RAM, dual 1 Gb NIC
  Delivery expected October 06
– 64 x 1U twin dual-core Intel Woodcrest 5130 units (550 kSI2K)
  4 GB RAM, 250 GB SATA HDD, dual 1 Gb NIC
  Delivery expected November 06
Upcoming in FY06/07:
– Further 210 TB disk capacity expected December 06 (same spec as above)
– High availability systems with UPS: redundant PSUs, hot-swap paired HDDs, etc
– AFS replacement
– Enhancement to Oracle services (disk arrays or RAC servers)

8 Hardware changes (3)
SL8500 tape robot
– Expanded from 6,000 to 10,000 slots
– 10 drives shared between all users of the service
– Additional 3 x T10K tape drives for PP
– More when the CASTOR service is working
STK Powderhorn
– Decommissioned and removed

9 Storage commissioning
Problems with the March 06 procurement:
– WD4000YR drives on Areca 1170, RAID 6
  Many instances of multiple drive dropouts
  Unwarranted drive dropouts followed by re-integration of the same drive
– Drive electronics (ASIC) on the 4000YR (400 GB) units changed with no change of model designation
  We got the updated units
– Firmware updates to the Areca cards did not solve the issues
– WD5000YS (500 GB) units swapped in by WD
  Fixes most issues, but…
– Status data and logs from the drives show several additional problems
  Testing under high load to gather statistics (see the sketch below)
– Production further delayed
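The slides do not show how the drive statistics were gathered; a minimal sketch of one way to do it with smartmontools, assuming the drives are visible to smartctl as plain SATA devices. The device names and chosen attributes are illustrative, and drives sitting behind a hardware RAID controller may need controller-specific smartctl options.

```python
#!/usr/bin/env python
"""Sketch: collect a few SMART counters from each data drive.
Device names and attribute choices are illustrative only."""
import subprocess

ATTRS = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "UDMA_CRC_Error_Count")

def smart_attrs(device):
    """Return {attribute_name: raw_value} for the attributes of interest."""
    out = subprocess.check_output(["smartctl", "-A", device], text=True)
    values = {}
    for line in out.splitlines():
        fields = line.split()
        # Attribute rows: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[1] in ATTRS:
            values[fields[1]] = fields[9]
    return values

if __name__ == "__main__":
    for dev in ["/dev/sd%s" % c for c in "abcdefgh"]:  # hypothetical device list
        try:
            print(dev, smart_attrs(dev))
        except (OSError, subprocess.CalledProcessError) as err:
            print(dev, "query failed:", err)
```

Run periodically under load, the collected counters can be compared over time to see whether dropouts correlate with media errors, pending sectors or interface CRC errors.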

10 Air-con issues
Setup
– 13 x 80 kW units in the lower machine room; several paired units work together
Several 'hot' days (for the UK) in July
– Sunday: dumped ~70 jobs
  Alarm system failed to notify operators
  Pre-emptive automatic shutdown not triggered (a watchdog sketch follows this slide)
  Ambient air temperature reached >35C, machine exhaust temperature >50C!
  HPC services not so lucky
– Mid week 1: problems over two days
  Attempts to cut load by suspending batch services to protect data services
  Forced to dump 270 jobs
– Mid week 2: 2 hot days predicted
  Pre-emptive shutdown of batch services in the lower machine room
  No jobs lost, data services remained available
Problem
– High ambient air temperature tripped the high-pressure cut-outs in the refrigerant gas circuits
– Cascade failure as individual air-con units work harder
– Loss of control of machine room temperature
Solutions
– Sprinklers under the units: successful, but banned due to Health and Safety concerns
– Up-rated refrigerant gas pressure settings to cope with higher ambient air temperatures
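The slides do not describe how the pre-emptive automatic shutdown was implemented. A minimal sketch of one way such a watchdog could work, assuming a readable ambient-temperature sensor file and Torque's pbsnodes command; the sensor path, threshold and node names are hypothetical.

```python
#!/usr/bin/env python
"""Hypothetical ambient-temperature watchdog: drain the batch farm before
the machine room overheats. Sensor path, threshold and node names are
illustrative, not taken from the slides."""
import subprocess
import time

SENSOR = "/var/run/machine-room/ambient_temp"        # hypothetical sensor readout, degrees C
THRESHOLD_C = 30.0                                    # illustrative trip point
BATCH_NODES = ["lcg%04d" % n for n in range(1, 209)]  # hypothetical worker node names

def ambient_temperature():
    with open(SENSOR) as f:
        return float(f.read().strip())

def drain_batch_farm():
    # 'pbsnodes -o <node>' marks a Torque worker offline: running jobs finish,
    # no new jobs start, and data services are left untouched.
    for node in BATCH_NODES:
        subprocess.call(["pbsnodes", "-o", node])

if __name__ == "__main__":
    while True:
        if ambient_temperature() > THRESHOLD_C:
            drain_batch_farm()
            break
        time.sleep(60)
```

Draining rather than killing the farm matches the "mid week 2" outcome above: batch capacity is sacrificed pre-emptively while data services stay up and no jobs are lost.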

11 Operating systems
Grid services, batch workers, service machines
– SL3, mainly 3.0.3, 3.0.5, 4.2; all ix86
– SL4 before Xmas; considering x86_64
Disk storage
– SL4 migration in progress
Tape systems
– AIX: caches
– Solaris: controller
– SL3/4: CASTOR systems, newer caches
Oracle systems
– RHEL 3/4
Batch system
– Torque/MAUI
– Fair-shares, with allocations set by the User Board (an illustrative sketch follows)
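The slides do not spell out the fair-share mechanism. As a rough illustration of how MAUI-style fair-share typically behaves (recent usage is decayed over a set of windows and compared against a target share); the window count, decay factor, usage figures and target are all made up for the example.

```python
#!/usr/bin/env python
"""Illustration of decayed fair-share usage as used by MAUI-style schedulers.
All numbers are hypothetical."""

FS_DEPTH = 7      # number of fair-share windows remembered (e.g. 7 days)
FS_DECAY = 0.8    # weight applied per window as usage ages

def effective_usage(per_window_usage):
    """Combine recent usage windows, oldest weighted least."""
    return sum(u * FS_DECAY ** age for age, u in enumerate(per_window_usage[:FS_DEPTH]))

# Example: a VO's CPU-hours in each of the last 7 windows, newest first,
# and the whole farm's CPU-hours in the same windows.
vo_usage = [1200, 900, 400, 0, 300, 800, 100]
farm_usage = [5000, 4800, 4500, 4000, 4200, 5100, 4700]

share = effective_usage(vo_usage) / effective_usage(farm_usage)
target = 0.25   # hypothetical allocation set by the User Board
print("effective share %.2f vs target %.2f -> priority %s"
      % (share, target, "boosted" if share < target else "reduced"))
```

The scheduler then raises or lowers job priority for each VO depending on whether its effective share is below or above the agreed allocation.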

12 Databases
3D project
– Participating since the early days
  Single Oracle server for testing: successful
– Production service
  2 x Oracle RAC clusters
  Two servers per RAC (redundant PSUs, hot-swap RAID 1 system drives)
  Single SATA/FC data array
  Some transfer rate issues
  UPS to come

13 Storage Resource Management
dCache
– Performance issues
  LAN performance very good
  WAN performance and tuning problems
– Stability issues
– Now better: increased the number of open file descriptors and the number of logins allowed (see the note below)
ADS
– In-house system, many years old
– Will remain for some legacy services
CASTOR2
– Will replace both dCache disk and tape SRMs for the major data services
– Will replace T1 access to the existing ADS services
– Pre-production service for CMS
– LSF used for transfer scheduling
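The slides do not say how the descriptor limit was raised; for a long-running service this is normally set in its startup environment or in /etc/security/limits.conf. A minimal sketch just for inspecting the limit a process actually inherits on a Linux host (not the dCache configuration itself):

```python
#!/usr/bin/env python
"""Check the open-file-descriptor limit a service would inherit.
Raising the limit permanently is done in the init script or
/etc/security/limits.conf; this only inspects and adjusts the current process."""
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open file descriptors: soft=%d hard=%d" % (soft, hard))

# An unprivileged process may raise its own soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print("raised soft limit to", resource.getrlimit(resource.RLIMIT_NOFILE)[0])
```

Checking this on the pool and door nodes makes it easy to confirm that a raised limit really reached the running service rather than only the login shell.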

14 Monitoring
Nagios
– Production service implemented
– 3 servers (1 master + 2 slaves)
– Almost all systems covered (600+)
– Replacing SURE
– Call-out facilities to be added (a plugin sketch follows)
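Nagios checks are external plugins that report state through their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) and a one-line message. A minimal sketch of such a plugin checking load average; the thresholds are illustrative and it is not taken from the RAL setup.

```python
#!/usr/bin/env python
"""Minimal Nagios-style plugin: the exit code carries the status.
Thresholds are hypothetical."""
import os
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LOAD_WARN, LOAD_CRIT = 8.0, 16.0   # illustrative 1-minute load thresholds

def main():
    try:
        load1 = os.getloadavg()[0]
    except OSError:
        print("UNKNOWN - cannot read load average")
        return UNKNOWN
    if load1 >= LOAD_CRIT:
        print("CRITICAL - load %.2f" % load1)
        return CRITICAL
    if load1 >= LOAD_WARN:
        print("WARNING - load %.2f" % load1)
        return WARNING
    print("OK - load %.2f" % load1)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```

Call-out can then be layered on top by attaching notification commands (pager, SMS or email gateways) to CRITICAL results in the Nagios configuration.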

15 Networking
All systems have 1 Gb/s connections
– Except the oldest fraction of the batch farm
10 Gb/s links almost everywhere
– 10 Gb/s backbone within the Tier-1
  Complete November 06
  Nortel 5530/5510 stacks
– 10 Gb/s link to the RAL site backbone
  10 Gb/s backbone links at RAL expected end November 06
  10 Gb/s link to the RAL Tier-2
– 10 Gb/s link to the UK academic network, SuperJanet5 (SJ5)
  Expected in production by end of November 06
Firewall still an issue
– Planned bypass for Tier-1 data traffic as part of the RAL SJ5 and RAL backbone connectivity developments
– 10 Gb/s OPN link to CERN active since September 06
  Using a pre-production SJ5 circuit
  Production status at SJ5 handover

16 Security
Notified of an intrusion at Imperial College London
Searched logs (a search sketch follows this slide)
– Unauthorised use of an account from the suspect source
– Evidence of harvesting password maps
– No attempt to conceal activity
– Unauthorised access to other sites
– No evidence of root compromise
Notified the sites concerned
– Incident widespread
Passwords changed
– All inactive accounts disabled
Cleanup
– Changed NIS to use a shadow password map
– Reinstalled all interactive systems
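The slides do not show the actual log search. A minimal sketch of the idea (scanning sshd entries in the system auth log for accepted logins from a given address); the log path, suspect address and line format are purely illustrative.

```python
#!/usr/bin/env python
"""Search an auth log for accepted ssh logins from a suspect host.
Log path and suspect address are illustrative only."""
import re

AUTH_LOG = "/var/log/secure"   # typical location on RHEL/SL; may differ per host
SUSPECT = "192.0.2.17"         # documentation-range address, hypothetical

# Typical sshd line: "Accepted password for alice from 192.0.2.17 port 4022 ssh2"
PATTERN = re.compile(r"Accepted \S+ for (\S+) from (\S+)")

with open(AUTH_LOG) as log:
    for line in log:
        m = PATTERN.search(line)
        if m and m.group(2) == SUSPECT:
            print(line.rstrip())
```

Run across the interactive and grid front-end machines, the same pattern identifies which accounts the suspect source reached, which feeds directly into the password-change and account-disabling steps above.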

17 Questions?

