RAL Tier1/A Report, HepSysMan - July 2004, Martin Bly / Andrew Sansum

1 RAL Tier1/A Report, HepSysMan - July 2004, Martin Bly / Andrew Sansum

2 Overview
Hardware
Network
Experiences / Challenges
Management issues
Slide footer: Martin Bly, RAL Tier1/A, 1/2 July 2004

3 Tier1 in GRIDPP2 (2004-2007)
The Tier-1 Centre will provide GRIDPP2 with a large computing resource of a scale and quality that can be categorised as an LCG Regional Computing Centre
January 2004: GRIDPP2 confirm RAL to be host for Tier1 Service
  – GRIDPP2 to commence September 2004
Tier1 Hardware budget:
  – £2.3M over 3 years
Staff
  – Increase from 12.1 to 16.5 by September

4 Current Tier1 Hardware
CPU
  – 350 dual-processor Intel PIII and Xeon servers, mainly rack mounts
  – About 400 KSI2K
  – RedHat 7.3
  – P2/450 tower units decommissioned April 04
  – RH72 and Solaris batch services to be phased out this year
Disk Service: mainly “standard” configuration
  – Dual-processor server
  – Dual-channel SCSI interconnect
  – External IDE/SCSI RAID arrays (Accusys and Infortrend)
  – ATA drives (mainly Maxtor)
  – About 80TB disk
  – Cheap and (fairly) cheerful
Tape Service
  – STK Powderhorn 9310 silo with 8 9940B drives

5 New Hardware
256 x dual Xeon HT/2800MHz@533MHz
  – 2GB RAM (32 with 4GB RAM), 120GB HDD, 1Gb NIC: 8 racks
20 disk servers with two 4TB IDE/SCSI arrays: 5 racks
  – Infortrend EonStore A16U-G1A units, each with 16 x WD 250GB SATA HDD: 4TB/array raw capacity
  – Servers: dual Xeon HT/2800MHz@533MHz, 2GB RAM, dual 120GB SATA system disks, dual 1Gb/s NIC
  – 160TB raw, ~140TB available (RAID5)
Delivered June 15th, now running commissioning tests
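The disk figures on this slide can be checked with a few lines of arithmetic. The sketch below is not from the talk: the server, array and drive counts are taken from the slide, but the RAID5 layout inside each array is not stated, so the `parity_drives` and `hot_spares` parameters are assumptions used only to show which layouts give which totals.

```python
# Capacity check for the new disk servers (decimal units: 1TB = 1000GB).
# Counts below come from the slide; the RAID layout does not.
SERVERS = 20
ARRAYS_PER_SERVER = 2
DRIVES_PER_ARRAY = 16
DRIVE_GB = 250


def raw_tb():
    """Raw capacity across all arrays, in TB."""
    return SERVERS * ARRAYS_PER_SERVER * DRIVES_PER_ARRAY * DRIVE_GB / 1000


def usable_tb(parity_drives=1, hot_spares=0):
    """Usable capacity if each array loses some drives to parity and spares.

    parity_drives and hot_spares are assumed values, not taken from the slide.
    """
    data_drives = DRIVES_PER_ARRAY - parity_drives - hot_spares
    return SERVERS * ARRAYS_PER_SERVER * data_drives * DRIVE_GB / 1000


print(raw_tb())                 # 160.0
print(usable_tb())              # 150.0 (one parity drive per array)
print(usable_tb(hot_spares=1))  # 140.0 (one parity drive plus one spare)
```

The raw total matches the 160TB on the slide. The ~140TB available figure is consistent with one parity drive plus one hot spare per array, but file system overhead or binary units could account for the difference equally well.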

7 Next Procurement
Need in production by January 2005
  – Original schedule of December delivery seems late
  – Will have to start very soon
  – Less chance for testing / new technology
Exact proportions not agreed, but …
  – 400 KSI2K (300-400 CPUs)
  – 160TB disk
  – 120TB tape??
  – Network infrastructure?
  – Core servers (H/A??)
  – RedHat?
Long range plan needs reviewing; also need long range experiment requirements so as to plan environment updates.

8 CPU Capacity

9 Tier1 Disk Capacity (TB)

10 High Impact Systems
Looking at replacement hardware for high impact systems:
  – /home/csf, /rutherford file systems
  – MySQL servers
  – AFS cell
  – Front end / UI hosts
  – Data movers
  – NIS master, Mail server
Replacing mix of Solaris, Tru64 Unix and AIX servers with Linux: consolidation of expertise
Migrate AFS to OpenAFS and then Kerberos 5 (K5).

11 Network (diagram; labels only): Firewall, Site Router, SuperJanet, Rest of Site, Site Routable Network, Production Subnet, Test Subnet, Production VLAN, Test VLAN, Servers, Workers, Test network (eg MBNG), Server

12 Network (diagram; labels only): Firewall, Site Router, Tier1 Network, SuperJanet, Rest of Site, Production VLAN, Test VLAN, Servers, Workers, Test network (eg MBNG), Server

13 UKLight
Connection to RAL in September
Funded to end 2005, after which probably merges with SuperJanet 5
2.5Gb/s now -> 10Gb/s from 2006
Effectively dedicated light path to CERN
Probably not for Tier1 production but suitable for LCG Data challenges etc, building experience for SuperJanet upgrade.
UKLight -> Starlight

14 Forthcoming Challenges
Simplify service – less “duplication”
Improve storage management
Deploy new Fabric Management
RedHat Enterprise 3 upgrade
Network upgrade/reconfigure????
Another procurement/install
Meet challenge of LCG – professionalism
LCG Data Challenges …

15 Clean up Spaghetti Diagram
How to phase out “Classic” service..
Simplify Interfaces: Less GRIDS
“More is not always better”

16 Storage: Plus and Minus
ATA and SATA drives                     2.5% failure per annum - OK
External RAID arrays                    Good architecture, choose well
SCSI interconnect                       Surprisingly unreliable: change
Ext2 file system                        OK – but need journal: XFS?
Linux O/S                               Move to Enterprise 3
NFS/Xrootd/http/gridftp/bbftp/srb/….    Must have SRM
NO SAN                                  Need SAN (Fibre or iSCSI …)
No management layer                     Need virtualisation/DCACHE..
NO HSM                                  ????
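To put the 2.5% per annum failure figure in operational terms, the sketch below turns it into expected drive replacements per year. It is an illustration only: it reads the figure as a per-drive annual rate, assumes drives fail independently, and the drive counts passed in (16 per array, 640 in total for forty such arrays) are examples rather than numbers quoted on this slide.

```python
# Rough failure arithmetic, assuming independent drive failures.
ANNUAL_FAILURE_RATE = 0.025  # the slide's 2.5% per annum, read as per-drive


def expected_failures(drives, rate=ANNUAL_FAILURE_RATE):
    """Expected number of drive failures in one year."""
    return drives * rate


def p_any_failure(drives, rate=ANNUAL_FAILURE_RATE):
    """Probability that at least one of `drives` fails within a year."""
    return 1.0 - (1.0 - rate) ** drives


# Example populations (assumed): one 16-drive array, and 40 such arrays.
for drives in (16, 640):
    print(drives, expected_failures(drives), round(p_any_failure(drives), 3))
```

Under these assumptions a single 16-drive array has roughly a one in three chance of losing a drive in a given year, which is one way to read why the rate is rated "OK" only when the arrays themselves provide redundancy.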

17 Benchmarking
Work by George Prassas on various systems including a 3ware/SATA RAID5 system.
Tuning gains extra performance on RH variants
Performance of RHEL3 NFS servers and disk I/O not special despite tuning, compared with RH73
Considering buying SPEC suite to benchmark everything.

18 Fabric Management
Currently run:
  – Kickstart: cascading config files, implementing PXE
  – SURE exception monitoring
  – Automate: automatic interventions
Running out of steam with old systems …
  – “Only” 800 systems, but many, many flavours
  – Evaluating Quattor: no obvious alternatives, probably deploy
  – Less convinced by Lemon: bit early; running Nagios in parallel

19 Yum / Yumit
Kickstart scripts now use Yum to bootstrap systems to latest updates
Post-install config now uses Yum wherever possible for local additions
Yumit:
  – Nodes use Yum to check their status every night and report to central database
  – Web interface to show farm status
  – Easy to see which nodes need updating.
Machine ownership tagging, port monitoring project
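The nightly check described for Yumit could look something like the sketch below. This is not Yumit's code, which the slides do not show: the function names, the report format and the sample package lines are all invented for illustration. The only yum behaviour relied on is that `yum check-update` lists pending packages as name.arch, version and repository, and exits with status 100 when updates are available.

```python
import json
import socket
import subprocess


def parse_check_update(output):
    """Extract pending updates from `yum check-update` text.

    Simplified: keeps lines of exactly three fields whose first field
    looks like name.arch; wrapped or obsoletes lines are ignored.
    """
    updates = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 3 and "." in fields[0]:
            updates.append(
                {"package": fields[0], "version": fields[1], "repo": fields[2]}
            )
    return updates


def build_report(hostname, output):
    """Build the record a node would send to the central database (format assumed)."""
    updates = parse_check_update(output)
    return {"host": hostname, "pending": len(updates), "updates": updates}


def collect():
    """Run yum on this node; exit status 0 = up to date, 100 = updates pending."""
    proc = subprocess.run(
        ["yum", "-q", "check-update"], capture_output=True, text=True
    )
    if proc.returncode not in (0, 100):
        raise RuntimeError("yum check-update failed: " + proc.stderr.strip())
    return build_report(socket.gethostname(), proc.stdout)


# Made-up sample output, so the parser can be tried without yum installed.
SAMPLE = """
kernel.i686     2.4.20-31.9    updates
openssl.i386    0.9.6b-36      updates
"""
print(json.dumps(build_report("node001", SAMPLE), indent=2))
```

A cron job on each node could call `collect()` and post the result to whatever store backs the web interface; a node with `pending` greater than zero is one that needs updating.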

20 Futures
Storage Architectures
  – iSCSI, Fibre, dCache
  – Need to be more sophisticated to allow reallocation of available space
CPUs
  – Xeon, Opteron, Itanium, Intel 64bit x86 architecture
Network
  – Higher speed interconnect, iSCSI

21 Conclusions
After several years of relative stability must start re-engineering many Tier1 components.
Must start to rationalise – support limited set of interfaces, operating systems, testbeds … simplify so we can do less better
LCG becoming a big driver
  – Service commitments
  – Increase resilience and availability
  – Data challenges and move to steady state
Major reality check in 2007!

