NorthGrid Status, Alessandra Forti, GridPP24, RHUL, 15 April 2010
Outline
– APEL pies
– Lancaster status
– Liverpool status
– Manchester status
– Sheffield
– Conclusions
APEL pies (1-3): accounting pie charts
Lancaster
– All WNs moved to the tarball installation
– Moving all nodes to SL5 solved the sub-cluster problems
– Deployed and decommissioned a test SCAS
  – Will install glexec when users demand it
– In the middle of deploying a CREAM CE
– Finished tendering for the HEC facility
  – Will give us access to 2500 cores
  – Extra 280 TB of storage
  – The shared facility has Roger Jones as director, so we have a strong voice for GridPP interests
Lancaster
– Older storage nodes are being re-tasked
– Tarball WNs are working well, but YAIM is suboptimal for configuring them
– Maui continues to be weird for us
  – Jobs blocking other jobs
  – Confused by multiple queues
  – Jobs don't use their reservations when they are blocked
– Problems trying to use the same NFS server for experiment software and tarballs
  – The two areas have now been split (see the sketch below)
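The NFS split is the kind of thing that can be verified with a quick check on a worker node. Below is a minimal sketch of such a check, assuming both areas are NFS mounts; the paths are hypothetical examples, not Lancaster's actual mount points.

```python
#!/usr/bin/env python
# Sketch: confirm the experiment-software area and the tarball WN area are
# served by different NFS servers. Paths below are hypothetical examples.
import os

SW_AREA = "/exp-soft/atlas"       # hypothetical experiment software area
TARBALL_AREA = "/opt/glite-wn"    # hypothetical tarball WN install area

def mount_source(path):
    """Return the mount source (e.g. 'server:/export') holding 'path',
    using the longest matching mount point listed in /proc/mounts."""
    path = os.path.realpath(path)
    best_point, best_source = "", None
    with open("/proc/mounts") as mounts:
        for line in mounts:
            source, point = line.split()[:2]
            if (path == point or path.startswith(point.rstrip("/") + "/")) \
                    and len(point) > len(best_point):
                best_point, best_source = point, source
    return best_source

def server_of(source):
    """Extract the server part of an NFS source like 'server:/export'."""
    return source.split(":", 1)[0] if source and ":" in source else source

if __name__ == "__main__":
    sw, tb = mount_source(SW_AREA), mount_source(TARBALL_AREA)
    print("experiment software from:", sw)
    print("tarball WN area from:    ", tb)
    print("split OK" if server_of(sw) != server_of(tb) else "still on the same server")
```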
Liverpool
– What we did (what we were supposed to do)
  – Major hardware procurement: a 48 TB storage unit with a 4 Gbit bonded link; 7x4x8 units = 224 cores, 3 GB memory, 2x1 TB disks
  – Scrapped some 32-bit nodes
  – CREAM test CE running
– Other things we did
  – General guide to capacity publishing (see the sketch after this list)
  – Horizontal job allocation
  – Improved use of VMs
  – Grid use of slack local HEP nodes
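Capacity publishing largely comes down to converting a measured HEP-SPEC06 benchmark into the SpecInt2000-based values advertised in the GLUE 1.3 schema, using the WLCG conversion of 1 HEP-SPEC06 = 250 SI2K. The sketch below shows that arithmetic only; the per-core benchmark figure is an illustrative example, not Liverpool's measured value.

```python
#!/usr/bin/env python
# Sketch of the arithmetic behind capacity publishing: per-core HEP-SPEC06
# converted to the SpecInt2000-based GlueHostBenchmarkSI00 value, plus the
# total capacity of a homogeneous sub-cluster. Example numbers only.

HS06_TO_SI2K = 250.0          # agreed WLCG conversion factor

def benchmark_si00(hs06_per_core):
    """Per-core SpecInt2000 value to publish for a homogeneous sub-cluster."""
    return int(round(hs06_per_core * HS06_TO_SI2K))

def total_hs06(cores, hs06_per_core):
    """Total compute capacity of the sub-cluster in HEP-SPEC06."""
    return cores * hs06_per_core

if __name__ == "__main__":
    cores = 7 * 4 * 8                      # 224 cores, as in the new purchase
    hs06_per_core = 8.0                    # illustrative benchmark result
    print("GlueSubClusterLogicalCPUs  :", cores)
    print("GlueHostBenchmarkSI00      :", benchmark_si00(hs06_per_core))
    print("Total capacity (HEP-SPEC06):", total_hs06(cores, hs06_per_core))
```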
Liverpool
– Things in progress
  – Put CREAM in the GOCDB (ready)
  – Scrap all 32-bit nodes (gradually)
  – Production runs on the central computing cluster (another department is involved)
– Problems
  – Obsolete equipment
  – WMS/ICE fault at RAL
– What's next
  – Install/deploy the newly procured storage and CPU hardware
  – Achieve production runs on the central computing cluster
Manchester
– Since last time
  – Upgraded WNs to SL5
  – Eliminated all dCache setup from the nodes
  – RAID0 on the internal disks
  – Increased the scratch area
  – Unified the two DPM instances
    – 106 TB, of which 84 dedicated to ATLAS
  – Upgraded to
  – Changed the network configuration of the data servers
  – Installed a squid cache
  – Installed a CREAM CE (still in test phase)
  – Last HammerCloud test in March: 99% efficiency
Manchester
– Major UK site in ATLAS production: 2nd or 3rd after RAL and Glasgow
– Last HammerCloud test in March had 99% efficiency
– 80 TB almost empty
  – Not many jobs
  – But from the stats of the past few days real user jobs also seem fine: 96% (see the sketch below)
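For reference, "efficiency" in these reports can mean either the job success fraction or the CPU/walltime ratio, and the two are computed differently. The sketch below shows both calculations on made-up job records, not real Manchester data; which metric a given figure refers to should be checked against the source report.

```python
# Sketch of the two efficiency numbers usually quoted for a site:
# the job success fraction and the CPU/walltime efficiency.
# The job records below are illustrative only.

def success_fraction(outcomes):
    """Fraction of finished jobs that completed successfully."""
    done = sum(1 for o in outcomes if o == "completed")
    return done / float(len(outcomes))

def cpu_efficiency(jobs):
    """Total CPU time over total wallclock time for (cpu, wall) pairs."""
    cpu = sum(c for c, w in jobs)
    wall = sum(w for c, w in jobs)
    return cpu / float(wall)

if __name__ == "__main__":
    outcomes = ["completed"] * 99 + ["failed"]   # example: 99% success
    timings = [(3500.0, 3600.0)] * 10            # example: ~97% CPU efficiency
    print("success fraction:", success_fraction(outcomes))
    print("cpu efficiency  :", cpu_efficiency(timings))
```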
Manchester
– Tender
  – European tender submitted 15/9/2009
  – Vendors' replies should be in on 16/04/2010 (in two days)
  – Additional GridPP3 money can be added
    – Included a clause for an increased budget
  – Minimum requirements: 4400 HEPSPEC / 240 TB
    – Can be exceeded
    – Buying only nodes
  – Talking to the University about Green funding to replace what we can't otherwise replace
    – Not easy
Sheffield
– Storage upgrade
  – Storage moved to Physics: 24/7 access
  – All nodes running SL5 and DPM
  – 4x25 TB disk pools: 2 TB disks, RAID5, 4 cores (rough capacity arithmetic in the sketch below)
  – Memory will be upgraded to 8 GB on all nodes
  – 95% reserved for ATLAS
  – XFS crashed; the problem was solved with an additional kernel module
  – Software server: 1 TB (RAID1)
  – Squid server
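As a rough cross-check of the quoted pool sizes: RAID5 gives (n - 1) x disk size of usable space, so a ~25 TB pool built from 2 TB disks corresponds to roughly 13-14 disks per array. The disk counts in the sketch below are inferred, not taken from the slide.

```python
# Rough cross-check of the pool sizes: RAID5 usable capacity is
# (n_disks - 1) * disk_size. The disk counts are inferred from the quoted
# ~25 TB per pool with 2 TB disks; they are not stated on the slide.

def raid5_usable(n_disks, disk_tb):
    """Usable capacity of a single RAID5 array in TB (one disk of parity)."""
    return (n_disks - 1) * disk_tb

if __name__ == "__main__":
    disk_tb = 2.0
    for n_disks in (13, 14):
        print("%d x %.0f TB disks -> %.0f TB usable"
              % (n_disks, disk_tb, raid5_usable(n_disks, disk_tb)))
    # 4 pools of ~25 TB each give the ~100 TB currently reserved for ATLAS.
```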
Sheffield
– Worker nodes
  – 200 old 2.4 GHz, 2 GB, SL5
  – 72 TB of local disk per 2 cores
  – lcg-CE and MONBOX on SL4
  – An additional 32 amp ring has been added
  – Fibre link between CICS and Physics
– Availability
  – 97-98% since January 2008
  – 94.5% efficiency in ATLAS
Sheffield Plans
– Additional storage
  – 20 TB, bringing the total to 120 TB for ATLAS
– Cluster integration
  – Local HEP and UKI-NORTHGRID-SHEF-HEP will have joint WNs
  – 128 CPUs + 72 new nodes ???
  – Torque server from the local cluster and lcg-CE from the grid cluster
  – Needs 2 days of downtime; waiting for ATLAS approval
  – CREAM CE installed, waiting to complete the cluster integration