Published by Adela McDowell. Modified over 9 years ago.
1
RAL Tier1/A Site Report
Martin Bly
HEPiX – Brookhaven National Laboratory, 18-20 October 2004
2
Overview
Introduction
Hardware
Software
Security
3
RAL Tier1/A
RAL the Tier 1 centre in the UK
  – Supports all VOs but priority to ATLAS, CMS, LHCb
  – LCG Core site
Babar collaboration Tier A
Support for other experiments:
  – D0, H1, SNO, UKQCD, MINOS, Zeus, Theory, …
Various test environments for grid projects
4
Pre-Grid Upgrade
[Figure: plot covering 1 July 2000 to 1 October 2000]
5
Post-GRID Upgrade
[Figure: GRID Load, 21-28 July]
Full again in 8 hours!
6
LCG in Production
Since June, Tier1 LCG service has evolved to become a full scale production facility
  – Sort of sneaked up on us! Gradual change from test/development environment to full scale production.
  – Availability and reliability of the LCG service are now a high priority for RAL staff.
  – Now the largest single CPU resource at RAL
7
GRID Production
8
Hardware
Main Farms: 884 CPUs, approx 880kSI2K
  – 312 CPUs x P3 @ 1.4GHz
  – 160 CPUs x P4/Xeon @ 2.66GHz, HT off
  – 512 CPUs x P4/Xeon @ 2.8GHz, HT off
Disk: approx 226TB
  – 52 x 800GB R5 IDE/SCSI arrays
  – 22 x 2TB R5 IDE/SCSI arrays
  – 40 x 4TB R5 EonStor SATA/SCSI arrays
Tape:
  – 6000 slot Powderhorn Silo, 200GB/tape, 8 drives.
Misc:
  – SUN disk servers, AIX (AFS cell)
  – 140 CPUs x P3 @ 1GHz
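As a rough cross-check, the itemised lines on this slide can be tallied against the headline totals. This is only a sketch: the variable names are illustrative, decimal units (1TB = 1000GB) are assumed, and the slide does not say whether its "approx" totals are raw or usable figures or whether every listed item counts towards them.

```python
# Tally of the itemised figures from the Hardware slide.
# Assumptions (not stated on the slide): decimal units, and that each
# listed line is meant to count towards the headline totals.

main_farm_cpus = {
    "P3 @ 1.4GHz": 312,
    "P4/Xeon @ 2.66GHz": 160,
    "P4/Xeon @ 2.8GHz": 512,
}

# (number of arrays, capacity per array in GB)
disk_arrays = {
    "800GB R5 IDE/SCSI": (52, 800),
    "2TB R5 IDE/SCSI": (22, 2000),
    "4TB R5 EonStor SATA/SCSI": (40, 4000),
}

tape_slots = 6000
tape_gb_per_cartridge = 200

cpu_total = sum(main_farm_cpus.values())
disk_total_gb = sum(count * gb for count, gb in disk_arrays.values())
tape_total_gb = tape_slots * tape_gb_per_cartridge

print(f"Itemised main-farm CPUs: {cpu_total}")
print(f"Itemised disk capacity:  {disk_total_gb / 1000:.1f} TB")
print(f"Silo capacity with every slot filled: {tape_total_gb / 1000:.0f} TB")
```

With the figures as transcribed, the itemised CPUs sum to 984 and the arrays to 245.6TB, against headline values of 884 CPUs and approx 226TB. The slide gives no breakdown that explains the difference.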
9
Hardware Issues
CPU and disks delivered June 16
CPU units:
  – 6 in 256 failed under testing: memory, motherboard
  – Installed into production after ~4 weeks
Disk systems:
  – Riser cards failing. Looks to be the batch.
  – Issues with EonStor firmware: fixes from vendor
  – Into production about now
10
Enhancements
FY 2004/05 CPU/disk procurement starting shortly
  – expect lower volume of CPU and disk
  – CPU technology: Xeon/Opteron
  – Disk technology: SATA/SCSI, SATA/FC, …
Sun systems services and data migrating to SL3
  – mail, NIS -> SL3
  – data -> RH7.3, SL3
  – Due Xmas ’04.
AFS cell migration to SL3/OpenAFS
Investigating SANs, iSCSI, SAS
11
Environment
Farms dispersed over three machine rooms
Extra temporary air conditioning capacity for summer
  – Actually survived with it mostly idle!
New air conditioning for lower machine room (A5L), independent from main building air-con system.
  – 5 units, 400kW; arrives November
Extra power distribution (but not new power)
All new rack kit to be located in A5L, shared with other high availability services (HPC etc).
Issues:
  – New Nocona chips use more power and create more heat
  – Rack weight on raised floors: latest kit is around 8 tonnes
  – Air con unit weight + power
13
Network
Site link: 2.5Gb/s to TVN
Site backbone @ 1Gb/s.
Tier1/A backbone @ 1Gb/s on Summit 7i and 3Com switches.
  – Latest purchases have single or dual 1Gb/s NIC
  – All batch workers connected @ 100Mb/s to 3Com fan-out switches with 1Gb/s uplink
  – Disk servers connected @ 1Gb/s to backbone switches
Upgrades
  – All new hardware to have 1Gb/s NIC
  – Upgrade CPU rack network switches where necessary to 1Gb/s fan-out
  – New backbone switches: stackable units with 40Gb/s interlink and, where possible, with 10Gb/s upgrade path to site router
Joining UKLight network
  – 10Gb/s
  – Fewer hops to HEP sites
  – Multiple Gb/s links to Tier1/A
14
Software
Transition to SL3
Farms:
  – Scientific Linux 3 (Fermi)
    – Babar batch, prototype frontend
  – RedHat 7.n
    – 7.3: LCG batch, Tier1 batch, frontend systems
    – 7.2: Babar frontend systems
Servers:
  – SL3
    – Systems services (mail, NIS, loggers, scheduler)
  – RedHat 7.2/7.3
    – Disk servers (custom kernels)
  – Fedora Core
    – Consoles, personal desktops
  – Solaris 2.6, 8, 9
    – SUN systems
  – AIX
    – AFS cell
15
Software Issues
SL3
  – Easy to install with PXE/Kickstart
  – Migration of Babar community from RH 7.3 batch service smooth after installation validated by Babar for batch work
  – Batch system using Torque/Maui versions from LCG rebuilt for SL3, with some local patches to config parameters (more jobs, more classes). Stable.
RedHat 7.n
  – Security a big concern (!)
    – Speed of patching
    – Custom kernels a problem
Enterprise (RHEL, SL)
  – Disk i/o (both read and write) performance not as good as can be achieved with RH 7.n (9). (SL, 2.4.21-15.0.n)
    – Need to test the more recent kernels
  – NFS, LVM and Megaraid controllers don’t mix!
16
Projects
Quattor
  – Ongoing preparation for implementation
Infrastructure data challenge
  – Joining effort to test high speed / high availability / high bandwidth data transfers to simulate LCG requirements
RSS news service
dCache
  – disk pool manager with SRM combined
  – Software complex to configure
    – Multiple layers: difficult to drill down to find exactly why a problem has occurred, somewhat sensitive to hardware/system configurations
  – Working test deployment
    – 1 head node, 2 pool nodes
  – Next steps: create a multi-terabyte instance for CMS in LCG
17
Security
Firewall at RAL is default Deny inbound
  – Keeps many but not all badguys™ out
  – Specific hosts have inbound Permit for specific ports
    – Sets of rules for LCG components (CE, SE, RB etc) or services (AFS)
  – Outbound: generally open, port 80 via cache
  – X11 port was open but not to Tier1/A (closed 1997!)
    – Now closed site-wide as of 8th Oct
The badguys™ still get in…
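The default-deny inbound policy with per-host permits can be pictured with a small model. This is a sketch of the idea only, not RAL's rule set: the host names, the port numbers attached to them, and the function name `inbound_allowed` are all hypothetical.

```python
# Toy model of a default-deny inbound policy with per-host permits.
# Host names and their permitted ports are hypothetical examples.
from typing import Dict, Set

INBOUND_PERMITS: Dict[str, Set[int]] = {
    "ce.example.org": {2119},   # hypothetical permit for a compute element
    "se.example.org": {2811},   # hypothetical permit for a storage element
}

def inbound_allowed(host: str, port: int) -> bool:
    """Return True only if an explicit permit exists; everything else is denied."""
    return port in INBOUND_PERMITS.get(host, set())

print(inbound_allowed("ce.example.org", 2119))     # explicit permit
print(inbound_allowed("ce.example.org", 6000))     # X11 port: no permit, denied
print(inbound_allowed("worker.example.org", 22))   # host not listed, denied
```

The point of the design is that forgetting a rule fails closed: a host or port with no entry is denied rather than exposed.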
18
Recent Incident (1)
Keyboard logger installed at a remote site A exposes password of account at remote site B
Access to exposed@siteB
  – Scans account known_hosts for possible targets
exposed@siteB has ssh keys unprotected by a pass-phrase
  – Unchallenged access to any account@host on list in known_hosts on which unprotected public key installed
  – !”£$%^&*#@;¬?>|
19
Recent Incident (2)
Aug 26 at 23:05 BST, Badguy™ uses unprotected key of compromised account at remote site B to enter two systems at RAL, both RedHat 7.2 systems.
Downloads custom IRC bot based on Energy Mech
  – Contains a klogd binary which is the IRC bot
Possibly tries for privilege escalation
Installs IRC bot (klogd), attempting to usurp the system klogd or possibly other rogue klogds.
  – Fails to kill system klogd.
  – Two klogd now running: the system one owned by root and the badguy™ version owned by the compromised user.
At some time later the directory containing the bot code (/tmp/.mc) is deleted.
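The tell-tale here, a second klogd owned by an ordinary user, is the kind of thing a simple check can flag. The sketch below is an illustration rather than the procedure used at RAL: the function names are made up, and the live scan assumes a Linux kernel recent enough to provide /proc/<pid>/comm, which systems of that era did not have.

```python
# Sketch: flag processes that use a system daemon's name but do not run as root.
# The live scan is Linux-only (reads /proc); the check itself is a pure function.
import os
from typing import Iterable, List, Tuple

Proc = Tuple[int, str, int]  # (pid, command name, owner uid)

def suspicious_daemons(procs: Iterable[Proc], name: str = "klogd") -> List[Proc]:
    """Return processes called `name` whose owner is not root (uid 0)."""
    return [p for p in procs if p[1] == name and p[2] != 0]

def scan_proc() -> List[Proc]:
    """Collect (pid, comm, uid) for every process visible under /proc."""
    found = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as fh:
                comm = fh.read().strip()
            uid = os.stat(f"/proc/{entry}").st_uid
        except OSError:
            continue  # process exited while we were looking
        found.append((int(entry), comm, uid))
    return found

if __name__ == "__main__" and os.path.isdir("/proc"):
    for pid, comm, uid in suspicious_daemons(scan_proc()):
        print(f"pid {pid}: {comm} owned by uid {uid}")
```

A check like this only catches a bot that keeps the daemon's name while running unprivileged, as happened here; it would not notice one that had gained root.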
20
Recent Incident (3)
Oct 7, am: legitimate remote IRC server admins, who monitor for suspicious activity, tell us that a system has been exhibiting it.
Systems removed from network and forensic investigation begins
Dump of bot/klogd process shows 4800+ hosts listed; it appears system was part of an IRC network
  – Badguy™ bot/klogd listens on ports tcp:8181 and udp:34058
  – Contacts IRC servers at 4 addresses (port 6667), as "XzIbIt"
Firewall logs show relatively small amount of traffic from affected host
No trace of root exploits
Second host was a user frontend system: no evidence of any IRC activity or root compromise
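A listener on an unexpected port such as tcp:8181 is visible in the kernel's socket table. The sketch below parses text in the /proc/net/tcp layout and reports listening ports missing from an expected list. The sample table, the expected list and the function names are assumptions for illustration; UDP sockets live in /proc/net/udp and are not covered.

```python
# Sketch: list TCP ports in LISTEN state from /proc/net/tcp-style text and
# report any that are not on an expected list. The expected list is hypothetical.
from typing import Set

LISTEN_STATE = "0A"  # TCP_LISTEN in the kernel's /proc/net/tcp encoding

def listening_ports(proc_net_tcp: str) -> Set[int]:
    """Parse /proc/net/tcp content and return local ports that are listening."""
    ports = set()
    for line in proc_net_tcp.splitlines()[1:]:   # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != LISTEN_STATE:
            continue
        _addr, port_hex = fields[1].rsplit(":", 1)
        ports.add(int(port_hex, 16))
    return ports

def unexpected(ports: Set[int], expected: Set[int]) -> Set[int]:
    """Ports that are listening but were not anticipated."""
    return ports - expected

# Made-up sample: sshd on 22, a listener on 8181 (0x1FF5), one established session.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1234 1
   1: 00000000:1FF5 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1005        0 5678 1
   2: 0100007F:0016 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 9012 1
"""

print(sorted(unexpected(listening_ports(SAMPLE), expected={22})))  # prints [8181]
```

On a live Linux host the same function can be fed the contents of /proc/net/tcp, with the expected set taken from whatever services the machine is meant to run.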
21
Lessons
Unprotected ssh keys are bad news
  – If it is unprotected on your system then all keys owned everywhere by that user are likely unprotected too
  – Use ssh-agent or similar
  – There are still .netrc files in use for production userids
Communication
  – Lack of news from upstream sites a disappointment
    – If we had been told of exploit at the remote site and the time frames involved we would have found the IRC bot within hours
Protect infrastructure from user accessible hosts
  – Firewalling
Staff time: 2-3 staff weeks
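One way to act on the first lesson is to audit private key files for a missing pass-phrase. The sketch below inspects only the text of a key file and recognises the common PEM headers plus the openssh-key-v1 container; the function name is hypothetical and anything it does not recognise is reported as unknown rather than guessed.

```python
# Sketch: decide whether a private key file's text looks passphrase-protected.
# Returns True (protected), False (unprotected) or None (format not recognised).
import base64
import struct
from typing import Optional

OPENSSH_MAGIC = b"openssh-key-v1\x00"

def is_passphrase_protected(key_text: str) -> Optional[bool]:
    lines = [ln.strip() for ln in key_text.strip().splitlines()]
    if not lines or not lines[0].startswith("-----BEGIN "):
        return None
    header = lines[0]
    if header == "-----BEGIN ENCRYPTED PRIVATE KEY-----":   # PKCS#8, encrypted
        return True
    if header == "-----BEGIN PRIVATE KEY-----":             # PKCS#8, plain
        return False
    if header == "-----BEGIN OPENSSH PRIVATE KEY-----":
        try:
            blob = base64.b64decode("".join(lines[1:-1]))
        except ValueError:
            return None
        offset = len(OPENSSH_MAGIC)
        if not blob.startswith(OPENSSH_MAGIC) or len(blob) < offset + 4:
            return None
        # The magic is followed by a length-prefixed cipher name; "none" = no passphrase.
        (name_len,) = struct.unpack(">I", blob[offset:offset + 4])
        cipher = blob[offset + 4:offset + 4 + name_len]
        return cipher != b"none"
    if header.endswith(" PRIVATE KEY-----"):                # legacy RSA/DSA/EC PEM
        return any(ln.startswith("Proc-Type:") and "ENCRYPTED" in ln for ln in lines)
    return None
```

Applied to each file under a user's ~/.ssh directory, a result of False marks a key that would give the unchallenged access described in the incident. Keys protected by a pass-phrase can still be used conveniently through ssh-agent.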