1
NorthGrid Status
Alessandra Forti
GridPP24, RHUL, 15 April 2010
2
Outline
APEL pies
Lancaster status
Liverpool status
Manchester status
Sheffield status
Conclusions
3
APEL pie (1)
4
APEL pie (2)
5
APEL pie (3)
6
Lancaster
All WNs moved to the tarball installation
Moving all nodes to SL5 solved the sub-cluster problems
Deployed and decommissioned a test SCAS
– Will install glexec when users demand it
In the middle of deploying a CREAM CE
Finished tendering for the HEC facility
– Will give us access to 2500 cores
– Extra 280 TB of storage
– The shared facility has Roger Jones as director, so we have a strong voice for GridPP interests
7
Lancaster
Older storage nodes are being re-tasked
Tarball WNs are working well, but YAIM is suboptimal for configuring them
Maui continues to be weird for us
– Jobs blocking other jobs
– Confused by multiple queues
– Jobs don't use their reservations when they are blocked
Problems trying to use the same NFS server for experiment software and tarballs
– They have now been split
8
Liverpool
What we did (and were supposed to do)
– Major hardware procurement
  48 TB unit with a 4 Gbit bonded link
  7x4x8 units = 224 cores, 3 GB memory, 2x1 TB disks
– Scrapped some 32-bit nodes
– CREAM test CE running
Other things we did
– General guide to capacity publishing (unit sketch below)
– Horizontal job allocation
– Improved use of VMs
– Grid use of slack local HEP nodes
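The capacity-publishing guide itself is not part of this transcript, but the kind of unit arithmetic it deals with (vendor decimal terabytes vs binary units, per-chassis core counts) can be sketched minimally. The figures below reuse the procurement numbers quoted above; the helper name and choice of units are illustrative assumptions, not taken from the Liverpool guide.

```python
# Illustrative sketch only: unit arithmetic of the sort a capacity-publishing
# guide has to spell out. Numbers reuse the procurement figures quoted above;
# nothing here is taken from the actual Liverpool guide.
TB = 10**12   # vendor (decimal) terabyte, in bytes
GiB = 2**30   # binary gibibyte, in bytes
TiB = 2**40   # binary tebibyte, in bytes

def decimal_tb_to_binary(tb: float) -> dict:
    """Express a vendor decimal-TB figure in binary units."""
    return {"GiB": tb * TB / GiB, "TiB": tb * TB / TiB}

if __name__ == "__main__":
    unit_48tb = decimal_tb_to_binary(48)   # the 48 TB storage unit above
    print({k: round(v, 1) for k, v in unit_48tb.items()})
    print("cores:", 7 * 4 * 8)             # 7x4x8 units -> 224 cores
```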
9
Liverpool
Things in progress
– Put CREAM in the GOCDB (ready)
– Scrap all 32-bit nodes (gradually)
– Production runs on the central computing cluster (other departments involved)
Problems
– Obsolete equipment
– WMS/ICE fault at RAL
What's next
– Install/deploy the newly procured storage and CPU hardware
– Achieve production runs on the central computing cluster
10
Manchester
Since last time
– Upgraded WNs to SL5
– Eliminated all dCache setup from the nodes
– RAID0 on the internal disks (sketch below)
  Increased scratch area
– Unified the two DPM instances
  106 TB, of which 84 TB dedicated to ATLAS
  Upgraded to 1.7.2
– Changed the network configuration of the data servers
– Installed a squid cache
– Installed a CREAM CE (still in test phase)
– Last HC test in March: 99% efficiency
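A side note on the RAID0 item: striping the worker nodes' internal disks exposes their combined capacity as scratch space, which is where the increase comes from. The disk sizes in the sketch below are hypothetical; the slide does not give them.

```python
# Why RAID0 on the internal disks increases the scratch area: striping exposes
# the combined capacity, whereas mirroring (or using a single disk) would not.
# The disk sizes are assumed for illustration; the slide does not state them.
disk_gb = [500, 500]            # hypothetical pair of internal disks per WN

raid0_scratch = sum(disk_gb)    # striping: capacities add up
single_disk = max(disk_gb)      # baseline: scratch confined to one disk

print(f"RAID0 scratch: {raid0_scratch} GB vs single-disk: {single_disk} GB")
```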
11
Manchester
Major UK site in ATLAS production: 2nd or 3rd after RAL and Glasgow
Last HC test in March had 99% efficiency (worked example below)
80 TB almost empty
– Not many jobs
– But from the stats of the past few days real users also seem fine: 96%
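The 99% and 96% figures are job efficiencies; a percentage like this is typically derived as completed jobs over completed plus failed. Below is a minimal sketch with invented job counts, not the real HammerCloud or user-job statistics.

```python
# Minimal sketch of how a job-efficiency percentage such as the 99% / 96%
# quoted above is typically computed. The job counts are invented for
# illustration; they are not the real HammerCloud or user-job numbers.
def efficiency(completed: int, failed: int) -> float:
    """Percentage of jobs that finished successfully."""
    total = completed + failed
    return 100.0 * completed / total if total else 0.0

if __name__ == "__main__":
    print(f"{efficiency(completed=990, failed=10):.1f}%")   # -> 99.0%
    print(f"{efficiency(completed=960, failed=40):.1f}%")   # -> 96.0%
```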
12
Manchester
Tender
– European tender submitted 15/09/2009
– Vendor replies should be in by 16/04/2010 (in two days)
– Additional GridPP3 money can be added
  Included a clause for an increased budget
– Minimum requirements: 4400 HEPSPEC / 240 TB (sizing sketch below)
  Can be exceeded
  Buying only nodes
– Talking to the University about Green funding to replace what we can't replace
  Not easy
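To put the 4400 HEPSPEC / 240 TB minimum in context, here is a back-of-the-envelope sizing sketch. The per-node HEP-SPEC06 score and per-server usable capacity are assumed values for illustration only; they are not figures from the tender.

```python
import math

# Back-of-the-envelope sizing for the tender minimum quoted above. The
# per-node benchmark score and per-server usable capacity are assumptions
# made for illustration; they are not taken from the tender documents.
HEPSPEC_REQUIRED = 4400            # minimum compute requirement (HEP-SPEC06)
STORAGE_REQUIRED_TB = 240          # minimum storage requirement (TB)

ASSUMED_HEPSPEC_PER_NODE = 80      # hypothetical score for one 8-core node
ASSUMED_TB_PER_DISK_SERVER = 24    # hypothetical usable TB per disk server

cpu_nodes = math.ceil(HEPSPEC_REQUIRED / ASSUMED_HEPSPEC_PER_NODE)
disk_servers = math.ceil(STORAGE_REQUIRED_TB / ASSUMED_TB_PER_DISK_SERVER)

print(f"~{cpu_nodes} worker nodes and ~{disk_servers} disk servers "
      f"under the assumed per-unit figures")
```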
13
Sheffield
Storage upgrade
– Storage moved to physics: 24/7 access
– All nodes running SL5, DPM 1.7.3
– 4x25 TB disk pools, 2 TB disks, RAID5, 4 cores (capacity check below)
– Memory will be upgraded to 8 GB on all nodes
– 95% reserved for ATLAS
– XFS crashed; problem solved with an additional kernel module
– Software server: 1 TB (RAID1)
– Squid server
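A quick capacity check on the disk pools: in RAID5 one disk's worth of space goes to parity, so a ~25 TB usable pool of 2 TB disks implies roughly 13-14 disks per pool. The disk counts below are inferred for illustration, not stated on the slide.

```python
# Quick RAID5 arithmetic for the storage figures above. The exact number of
# disks per pool is not stated on the slide; this only shows how usable
# capacity relates to disk count when one disk's worth goes to parity.
DISK_TB = 2          # 2 TB disks, as quoted
TARGET_POOL_TB = 25  # quoted usable pool size

def raid5_usable(n_disks: int, disk_tb: float = DISK_TB) -> float:
    """Usable capacity of an n-disk RAID5 set (one disk's worth of parity)."""
    return (n_disks - 1) * disk_tb

if __name__ == "__main__":
    for n in (13, 14):
        print(f"{n} disks -> {raid5_usable(n):.0f} TB usable")
    # Either 13 or 14 disks per pool is consistent with ~25 TB before
    # filesystem overhead; 4 such pools give the ~100 TB total.
```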
14
Sheffield
Worker nodes
– 200 old 2.4 GHz, 2 GB, SL5
– 72 TB of local disk per 2 cores
– lcg-CE and MONBOX on SL4
– An additional 32 amp ring has been added
– Fiber link between CICS and physics
Availability
– 97-98% since January 2008
– 94.5% efficiency in ATLAS
15
Sheffield
Plans
Additional storage
– 20 TB, bringing the total to 120 TB for ATLAS
Cluster integration
– Local HEP and UKI-NORTHGRID-SHEF-HEP will have joint WNs
– 128 CPUs + 72 new nodes ???
– Torque server from the local cluster and lcg-CE from the grid cluster
– Needs 2 days of downtime; waiting for ATLAS approval
– CREAM CE installed; waiting to complete cluster integration