Tier1 Status Report – Andrew Sansum, GridPP12, 1 February 2005

Overview
- Hardware configuration/utilisation
- That old SRM story – dCache deployment
- Network developments
- Security stuff

Production Objectives
[Timeline chart: milestones across Q1-Q4, including SC2, SC3 and LHCb]

Hardware
CPU
- 500 dual-processor Intel PIII and Xeon servers, mainly rack-mounted (about 884 KSI2K), plus about 100 older systems.
Disk service – mainly standard configuration
- Commodity disk service based on a core of 57 Linux servers
- External IDE & SATA/SCSI RAID arrays (Accusys and Infortrend)
- ATA and SATA drives (1500 spinning drives, plus another 500 old SCSI drives)
- About 220 TB of disk (last tranche deployed in October)
- Cheap and (fairly) cheerful
Tape service
- STK Powderhorn 9310 silo with 9940B drives. Max capacity 1 PB at present, but capable of 3 PB by 2007.
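For scale, a back-of-envelope sketch using only the figures quoted above; the derived per-node and per-server averages are illustrative, not measured values:

```python
# Back-of-envelope averages derived from the numbers quoted on this slide.
cpu_nodes = 500          # dual-processor PIII/Xeon rack-mount servers
total_ksi2k = 884        # approximate farm capacity (KSI2K)
disk_servers = 57        # commodity Linux disk servers
total_disk_tb = 220      # approximate usable disk (TB)
spinning_drives = 1500   # ATA/SATA drives behind the RAID arrays

print(f"~{total_ksi2k / cpu_nodes:.2f} KSI2K per CPU node")
print(f"~{total_disk_tb / disk_servers:.1f} TB per disk server")
print(f"~{spinning_drives / disk_servers:.0f} drives per disk server")
```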

LCG in September

LCG Load

BaBar Tier-A

Last 7 Days
ACTIVE: BaBar, D0, LHCb, SNO, H1, ATLAS, ZEUS

dCache
Motivation:
- Needed SRM access to disk (and tape)
- We have 60+ disk servers (140+ filesystems) – needed disk pool management
- dCache was the only plausible candidate

History (Norse saga) of dCache at RAL
Mid 2003
- We deployed a non-grid version for CMS. It was never used in production.
End of 2003 / start of 2004
- RAL offered to package a production-quality dCache.
- Stalled due to bugs and holidays – went back to the developers and LCG developers.
September 2004
- Redeployed dCache into the LCG system for the CMS and DTeam VOs.
- dCache deployed within the JRA1 testing infrastructure for gLite I/O daemon testing.
January 2005
- Still working with CMS to resolve interoperation issues, partly due to hybrid grid/non-grid use.
- Prototype tape back end written.

dCache Deployment
- dCache deployed as a production service (also test, JRA1, developer1 and developer2(?) instances)
- Now available in production for ATLAS, CMS, LHCb and DTeam (17 TB now configured – 4 TB used)
- Reliability good – but load is light
- Will use dCache (preferably the production instance) as the interface to Service Challenge 2
- Work underway to provide a tape back end; a prototype is already operational. This will be the production SRM to tape at least until after the July service challenge
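By way of illustration, the sketch below shows how an experiment client might copy a file out of an SRM endpoint such as this one, simply shelling out to the standard srmcp client. The host name, port and pnfs path are hypothetical placeholders, not the actual RAL endpoint:

```python
# Minimal sketch: copy a file out of a dCache SRM endpoint with the srmcp client.
# The host, port and path below are hypothetical placeholders.
import subprocess

SRM_URL = "srm://dcache.example.ac.uk:8443/pnfs/example.ac.uk/data/atlas/test.root"
LOCAL = "file:////tmp/test.root"

def srm_copy(src: str, dst: str) -> None:
    """Invoke srmcp and raise if the transfer fails."""
    subprocess.run(["srmcp", src, dst], check=True)

if __name__ == "__main__":
    srm_copy(SRM_URL, LOCAL)
```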

Current Deployment at RAL

Original Plan for Service Challenges
[Diagram: experiments feed the production dCache over SJ4, and a test dCache(?) over UKLIGHT, for the service challenge technology]

dCache for Service Challenges
[Diagram: a dCache head node with GridFTP doors in front of the disk servers, connected both to SJ4 for the experiments and to UKLIGHT for the service challenge]

Network
Recent upgrade to the Tier-1 network:
- Beginning to put in place a new generation of network infrastructure
- Low-cost solution based on commodity hardware
- 10 Gigabit ready
- Able to meet the needs of the forthcoming service challenges and of increasing production data flows (see the sizing sketch below)
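The trunking arithmetic behind the "N*1 Gbit" uplinks in the following diagrams can be made concrete with a small sketch; the 500 MB/s target and the 80% link efficiency are illustrative assumptions, not figures from this talk:

```python
# How many 1 Gbit links are needed to sustain a given transfer rate?
# The 500 MB/s target and 80% efficiency are illustrative assumptions.
import math

def links_needed(target_mb_s: float, link_gbit: float = 1.0, efficiency: float = 0.8) -> int:
    """Number of `link_gbit` Gbit/s links needed for `target_mb_s` MB/s,
    derating each link by `efficiency` for protocol and framing overheads."""
    usable_mb_s = link_gbit * 1000 / 8 * efficiency   # Gbit/s -> MB/s, derated
    return math.ceil(target_mb_s / usable_mb_s)

print(links_needed(500))   # e.g. 5 x 1 Gbit links at 80% efficiency
```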

September
[Diagram: disk and CPU racks on two Summit 7i switches, with a 1 Gbit uplink to the site router]

Now (Production)
[Diagram: disk + CPU on a Nortel 5510 stack (80 Gbit), N x 1 Gbit uplinks to the Summit 7i, 1 Gbit per link to the site router]

Soon (Lightpath)
[Diagram: disk + CPU on the Nortel 5510 stack (80 Gbit), N x 1 Gbit to the Summit 7i and 1 Gbit per link to the site router as before, plus a dual-attach UKLIGHT connection (10 Gb, 2 x 1 Gbit)]

Next (Production)
[Diagram: disk + CPU on the Nortel 5510 stack (80 Gbit), N x 10 Gbit to a 10 Gigabit switch, 10 Gb to a new site router serving the RAL site, 1 Gbit per link to the nodes]

Machine Room Upgrade
Large mainframe cooling (cooking) infrastructure: ~540 KW
- Substantial overhaul now completed with good performance gains, but we were close to maximum by mid-summer; temporary chillers over August (funded by CCLRC)
- Substantial additional capacity ramp-up planned for the Tier-1 (and other e-Science) services
November (annual) air-conditioning RPM shutdown
- Major upgrade – new (independent) cooling system (400 KW+) – funded by CCLRC
Also: profiling power distribution, new access control system. Worrying about power stability (brownout and blackout in the last quarter).

MSS Stress Testing
- Preparation for SC3 (and beyond) underway (Tim Folkes); work started in August.
- Motivation: service load has historically been rather low. Look for gotchas and review known limitations.
- Stress test – part of the way through the process – just a taster here:
  - Measure performance
  - Fix trivial limitations
  - Repeat
  - Buy more hardware
  - Repeat

Datastore Hardware Layout (Thursday, 04 November 2004)
[Diagram: production and test ADS tape systems – STK silo with 9940 tape drives on two Brocade FC switches (ADS_switch_1 and ADS_switch_2, four drives each); AIX dataservers ermintrude, florence, zebedee and dougal; catalogue server brian (flfsys) with catalogue and cache arrays; test hosts basil (test dataserver) and mchenry1 (test flfsys); Redhat hosts ADS0CNTR (counter), ADS0PT01 (pathtape) and ADS0SB01 (SRB interface); dylan (AIX import/export) and buxton (SunOS ACSLS). User access is via SRB, ADS tape and pathtape/sysreq commands; links shown include physical FC/SCSI connections, sysreq UDP commands, VTP and SRB data transfers, and STK ACSLS control. Sysreq, VTP and ACSLS connections shown to dougal also apply to the other dataserver machines.]

Atlas Datastore Architecture (28 Feb, B. Strong)
[Diagram: software architecture of the Atlas Datastore. Components shown include: flfsys (+libflf) on the Catalogue Server (brian), taking user, farm, admin, tape and import/export commands over sysreq; Farm Server processes flfstk, tapeserv and flfscan, with data transfer via libvtp; flfdoexp and flfdoback (+libflf), the datastore script and SE recycling (+libflf); the Robot Server (buxton) with cellmgr, SSI/CSI/LMU and the ACSLS API for mount/dismount control of the tape robot; the Pathtape Server (rusty) with servesys, resolving long and short pathtape names (sysreq); the I/E Server (dylan) with importexport and libvtp; flfaio with STK and IBM tape drives, tape copies A/B/C and an ACSLS cache disk; flfqryoff (a copy of the flfsys code) against the backup catalogue and stats.]

Catalogue Manipulation

Write Performance – Single Server Test

Conclusions
- Have found a number of easily fixable bugs
- Have found some less easily fixable architecture issues
- Have a much better understanding of the limitations of the architecture
- Estimate suggests 60-80 MB/s to tape now. Buy more/faster disk and try again.
- Current drives are good for 240 MB/s peak – actual performance is likely to be limited by the ratio of drive (read+write) time to (load+unload+seek) time (see the model sketched below)
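A minimal sketch of that ratio argument, modelling effective single-drive throughput as file size divided by (transfer time + per-mount overhead). The ~30 MB/s streaming rate, the 90 s overhead and the file sizes are assumptions for illustration (30 MB/s per drive would be consistent with the 240 MB/s aggregate peak quoted above if eight drives are assumed):

```python
# Simple model of effective tape-drive throughput: the useful rate falls as the
# per-mount overhead (load + seek + unload) grows relative to transfer time.
# Streaming rate, overhead and file sizes are illustrative assumptions.

def effective_rate(file_mb: float, peak_mb_s: float = 30.0, overhead_s: float = 90.0) -> float:
    """Effective MB/s for writing one file of `file_mb` MB, given the drive's
    streaming rate and a fixed per-mount overhead in seconds."""
    transfer_s = file_mb / peak_mb_s
    return file_mb / (transfer_s + overhead_s)

for size in (100, 1000, 10000):   # file sizes in MB
    print(f"{size:>6} MB file -> {effective_rate(size):5.1f} MB/s effective")
```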

Security Incident
- 26 August: an X11 scan acquires a userid/password at an upstream site
- Hacker logs on to the upstream site and snoops known_hosts
- Ssh to a Tier-1 front-end host using an unencrypted private key from the upstream site
- Upstream site responds – but does not notify RAL
- Hacker loads an IRC bot on the Tier-1 and registers at a remote IRC server (for command/control/monitoring)
- Tries a root exploit (fails); attempts logins to downstream sites (fails – we think)
- 7th October: the Tier-1 is notified of the incident by the IRC service and begins its response; sites involved globally

Objectives
- Comply with site security policy – disconnect … etc.
  - Will disconnect hosts promptly once an active intrusion is detected
- Comply with external security policies (e.g. LCG)
  - Notification
- Protect downstream sites by notification and pro-active disconnection
  - Identify involved sites
  - Establish direct contacts with upstream/downstream sites
- Minimise service outage
  - Eradicate the infestation

Roles
- Incident controller – manage the information flood; deal with external contacts
- Log trawler – hunt for contacts
- Hunter/killer(s) – search for compromised hosts
- Detailed forensics – understand what happened on compromised hosts
- End user contacts – get in touch with users, confirm usage patterns, terminate IDs

Chronology
- 09:30, 7th October: the RAL network group forwarded a complaint from Undernet suggesting unauthorised connections from Tier-1 hosts to Undernet.
- 10:00: initial investigation suggests unauthorised activity on csfmove02; csfmove02 is physically disconnected from the network. By now 5 Tier-1 staff plus the e-Science security officer are 100% engaged on the incident, with additional support from the CCLRC network group and the site security officer. Effort remained at this level for several days. BaBar support staff at RAL were also active tracking down unexplained activity.
- 10:07: request made to the site firewall admin for firewall logs of all contacts with the suspected hostile IRC servers.
- 10:37: the firewall admin provides an initial report, confirming unexplained current outbound activity from csfmove02, but no other nodes involved.
- 11:29: BaBar report that the bfactory account password was common to the following additional IDs: bbdatsrv and babartst.
- 11:31: Steve completes a rootkit check – no hosts found – although with possible false positives on Redhat 7.2 which we are uneasy about.
- By 11:40: preliminary investigations at RAL had concluded that an unauthorised access had taken place on host csfmove02 (a data mover node), which in turn was connected outbound to an IRC service. At this point we notified the security mailing lists (hepix-security, lcg-security, hepsysman).
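A hedged sketch of the kind of log trawling described above – scanning an outbound-connection log for contacts with suspected IRC servers. The log path, line format and port list are assumptions for illustration, not the actual CCLRC firewall tooling:

```python
# Sketch of a log trawl: flag outbound connections to common IRC ports.
# The log path, its line format and the port list are illustrative assumptions.
import re

SUSPECT_PORTS = {6660, 6661, 6662, 6663, 6664, 6665, 6666, 6667, 7000}
LOG_FILE = "/var/log/firewall/outbound.log"   # hypothetical path

# Assumed line format: "<date> <time> <src_host> -> <dst_ip>:<dst_port> <action>"
LINE_RE = re.compile(r"^(\S+ \S+) (\S+) -> (\d+\.\d+\.\d+\.\d+):(\d+)")

def suspicious_lines(path: str):
    """Yield (src_host, dst_ip, dst_port, timestamp) for matches on suspect ports."""
    with open(path) as log:
        for line in log:
            m = LINE_RE.match(line)
            if m and int(m.group(4)) in SUSPECT_PORTS:
                yield m.group(2), m.group(3), int(m.group(4)), m.group(1)

if __name__ == "__main__":
    for host, ip, port, when in suspicious_lines(LOG_FILE):
        print(f"{when}: {host} contacted {ip}:{port} (possible IRC)")
```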

Security Events

Security Summary
- The intrusion took 1-2 staff-months of CCLRC effort to investigate: 6 staff full-time for 3 days (5 Tier-1 plus the e-Science security officer), working long hours. Also involved:
  - Networking group
  - CCLRC site security
  - BaBar support
  - Other sites
- Prompt notification of the incident by the upstream site would have substantially reduced the size and complexity of the investigation.
- The good standard of patching on the Tier-1 minimised the spread of the incident internally (but we were lucky).
- We can no longer trust who logged-on users are – many userids (globally) are probably compromised.

Conclusions
- A period of consolidation
- User demand continues to fluctuate, but an increasing number of experiments are able to use LCG
- Good progress on SRM to disk (dCache)
- Making progress with SRM to tape
- Having an SRM isn't enough – it has to meet the needs of the experiments
- Expect the focus to shift (somewhat) towards the service challenges