Liverpool HEP - Site Report, June 2010
John Bland, Robert Fay

Staff Status
- No changes to technical staff since last year:
  - Two full-time HEP system administrators: John Bland, Robert Fay
  - One full-time Grid administrator (0.75 FTE): Steve Jones, started September 2008
  - Not enough!
- Mike Houlden has retired and David Hutchcroft has taken over as the academic in charge of computing

Current Hardware - Users
Desktops
- ~100 desktops: Scientific Linux 4.3, Windows XP, legacy systems
- Minimum spec of 3GHz P4, 1GB RAM + TFT monitor
- Hardware upgrades this summer: SL5.x, Windows 7
Laptops
- ~60 laptops: mixed architecture, Windows+VM, MacOS+VM, netbooks
Printers
- Samsung and Brother desktop printers
- Various OKI and HP model group printers
- Recently replaced an aging HP LaserJet 4200 with an HP LaserJet P4015X

Current Hardware – Tier 3 Batch
Tier3 Batch Farm
- Software repository (0.5TB), storage (3TB scratch, 13TB bulk)
- medium32, short queues consist of 40 32-bit SL4 nodes (3GHz P4, 1GB/core)
- medium64, short64 queues consist of 9 64-bit SL5 nodes (2xL5420, 2GB/core)
  - 2 of the 9 SL5 nodes can also be used interactively
- 5 older interactive nodes (dual 32-bit Xeon 2.4GHz, 2GB/core)
- Using Torque/PBS/Maui with fairshares
- Used for general, short analysis jobs (an example submission sketch follows below)
- Grid jobs now also run opportunistically on this cluster
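
A minimal sketch of how a short local analysis job might be submitted to one of these queues through Torque's qsub. The queue names (short64, medium64, medium32, short) come from the slide; the wrapper itself, the resource requests and the analysis script name are illustrative assumptions, not the site's actual submission tooling.

    #!/usr/bin/env python
    # Hedged sketch: submit a short analysis job to the Tier3 farm via Torque's qsub.
    # Queue names come from the site report; walltime/memory requests and the
    # script path are illustrative assumptions.
    import subprocess

    def submit(script, queue="short64", walltime="01:00:00", mem="2gb"):
        """Submit a batch script to the given Torque queue and return the job id."""
        cmd = [
            "qsub",
            "-q", queue,                              # e.g. short64 / medium64 / medium32 / short
            "-l", "walltime=%s,mem=%s" % (walltime, mem),
            script,
        ]
        # qsub prints the new job id on stdout
        return subprocess.check_output(cmd).decode().strip()

    if __name__ == "__main__":
        print(submit("run_analysis.sh"))              # hypothetical analysis script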

Current Hardware – Servers
- ~40 core servers (HEP+Tier2)
- ~60 gigabit switches
- 1 high-density Force10 switch
- Console access via KVMoIP (when it works)
LCG Servers
- SE: 8-core Xeon 2.66GHz, 10GB RAM, RAID10 array
  - Unstable under SL4, crashes triggered by mysqldumps
  - Temporarily replaced with alternate hardware
  - Testing shows it appears to be stable under SL5
- CEs, SE, UI, MON all SL4, gLite 3.1
- BDII SL5, gLite 3.2
- VMware Server being used for some servers and for testing
  - MON, BDII, CE+Torque for the 64-bit cluster, CREAM CE all run as VMs

Current Hardware – Nodes
MAP2 Cluster
- Still going!
- Originally a 24-rack (960-node) Dell PowerEdge 650 cluster
- Nodes are 3GHz P4, 1-1.5GB RAM, 120GB disk – 5.32 HEPSPEC06
- 2 racks (80 nodes) shared with other departments
- 18 racks (~700 nodes) primarily for LCG jobs
- 1 rack (40 nodes) for general purpose local batch processing
- 3 racks retired (Dell nodes replaced with other hardware)
- Each rack has two 24-port gigabit switches, 2Gbit/s uplink
- All racks connected into VLANs via the Force10 managed switch/router
  - 1 VLAN per rack, all traffic Layer 3
- Still repairing/retiring broken nodes on a weekly basis
- But…

Current Hardware – Nodes
MAP2 Cluster (continued)
- Its days are hopefully numbered
- Internal agreement to fund replacement from energy savings
- Proposed replacement will be 72 E5620 CPUs (288 cores) or equivalent
- Power consumption will go from up to ~140kW to ~10.5kW (peak) – see the rough arithmetic below
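
A rough back-of-the-envelope check of those power figures. Only the ~140kW and ~10.5kW totals and the node/CPU counts come from the report; the per-node wattages below are illustrative guesses chosen to be consistent with those totals, not measured site values.

    # Rough sanity check of the quoted power figures.
    # Assumed per-node wattages are illustrative, not measured at the site.
    OLD_NODES = 960               # original MAP2 size (Dell PowerEdge 650, 3GHz P4)
    OLD_WATTS_PER_NODE = 145      # assumption: loaded P4 node at roughly this draw

    NEW_SOCKETS = 72              # proposed replacement: 72 E5620 CPUs
    NEW_NODES = NEW_SOCKETS // 2  # dual-socket nodes -> 36 nodes, 288 cores
    NEW_WATTS_PER_NODE = 290      # assumption: dual-E5620 node at peak

    print("old: ~%.0f kW" % (OLD_NODES * OLD_WATTS_PER_NODE / 1000.0))  # ~139 kW
    print("new: ~%.1f kW" % (NEW_NODES * NEW_WATTS_PER_NODE / 1000.0))  # ~10.4 kW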

Current Hardware – Nodes
New cluster
- Started with 7 dual L5420 nodes (56 cores with HEPSPEC )
- With the last round of GridPP funding added 7 of:
  - SuperMicro SYS-6026TT-TF quad-board 2U chassis
  - Dual 1400W redundant PSUs
  - 4 x SuperMicro X8DDT-F motherboards
  - 8 x Intel E5620 Xeon CPUs
  - 96GB RAM
  - 8 x 1TB enterprise SATA drives
- 224 cores total

Current Hardware – Nodes

Storage
RAID
- Majority of file stores using RAID6; a few legacy RAID5+HS
- Mix of 3ware and Areca SATA controllers
- Adaptec SAS controller for grid software
- Scientific Linux 4.3, newer systems on SL5.x
- Arrays monitored with 3ware/Areca web software and alerts
  - Now tied in with Nagios as well (see the plugin sketch below)
- Software RAID1 system disks on all new servers/nodes
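
A minimal sketch of the kind of Nagios check that can wrap a vendor RAID CLI, following the standard Nagios plugin exit-code convention (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN). The tw_cli invocation and the strings matched in its output are assumptions for illustration; the site's actual checks may rely on the vendor web tools or different parsing.

    #!/usr/bin/env python
    # Hedged sketch of a Nagios RAID check wrapping a vendor CLI.
    # The command and output parsing below are assumptions; adapt to the
    # controller (3ware/Areca/Adaptec) actually installed.
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # standard Nagios exit codes

    def main():
        try:
            out = subprocess.check_output(["tw_cli", "/c0", "show", "unitstatus"])
        except (OSError, subprocess.CalledProcessError) as err:
            print("RAID UNKNOWN - could not query controller: %s" % err)
            return UNKNOWN

        text = out.decode("ascii", "replace")
        if "DEGRADED" in text or "REBUILDING" in text:
            print("RAID WARNING - array degraded or rebuilding")
            return WARNING
        if "OK" not in text:
            print("RAID CRITICAL - no healthy unit reported")
            return CRITICAL
        print("RAID OK - all units report OK")
        return OK

    if __name__ == "__main__":
        sys.exit(main())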

Storage
File stores
- 13TB general purpose hepstore for bulk storage
- 3TB scratch area for batch output (RAID10)
- 2.5TB high performance SAS array for grid/local software
- Sundry servers for user/server backups, group storage etc.
- 270TB RAID6 for the LCG storage element (Tier2 + Tier3 storage/access via RFIO/GridFTP)

Storage - Grid
- Now running head node + 12 DPM pool nodes, ~270TB of storage
- This is combined LCG and local data
- Using DPM as a cluster filesystem for Tier2 and Tier3
  - Local ATLASLIVERPOOLDISK space token
- Still watching Lustre-based SEs

Joining Clusters
- High performance central Liverpool Computing Services Department (CSD) cluster
  - Physics has a share of the CPU time
- Decided to use it as a second cluster for the Liverpool Tier2
  - Extra cost was a second CE node (£2k)
  - Plus line rental for 10Gb fibre between machine rooms
- Liverpool HEP attained NGS Associate status
  - Can take NGS-submitted jobs from traditional CSD serial job users
  - Sharing load across both clusters more efficiently
- Compute cluster in CSD, service/storage nodes in Physics

Joining Clusters
CSD nodes
- Dual quad-core AMD Opteron 2356 CPUs, 16GB RAM, HEPSPEC
- OS was SuSE 10.3, moved to RHEL5 in February
- Using tarball WN + extra software on NFS (no root access to nodes)
- Needed a high performance central software server
  - Using SAS 15K drives and a 10G link
  - Capacity upgrade required for local systems anyway (ATLAS!)
  - Copes very well with ~800 nodes, apart from jobs that compile code
  - NFS overhead on file lookup is the major bottleneck
- Very close to going live once network troubles are sorted
- Relying on remote administration is frustrating at times
- CSD also short-staffed, struggling with hardware issues on the new cluster

HEP Network topology
[Diagram: Force10 core switch linking the WAN, HEP offices, HEP servers, Tier2 servers, Tier2 nodes, the CSD cluster and Research VLANs; 1500 and 9000 MTU segments; link speeds ranging from 1G to 10G, with 2G x 24 to the node racks]

Configuration and deployment
- Kickstart used for OS installation and basic post-install
  - Previously used for desktops only
  - Now used with PXE boot for automated grid node installs
- Puppet used for post-kickstart node installation (glite-WN, YAIM etc)
  - Also used for keeping systems up to date and rolling out packages
  - And used on desktops for software and mount points
- Custom local testnode script to periodically check node health and software status
  - Nodes put offline/online automatically (see the sketch below)
- Keep local YUM repo mirrors, updated when required, so no surprise updates (being careful of gLite generic repos)
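
A minimal sketch of the offline/online step of such a health-check script, assuming Torque's pbsnodes is used to take failing nodes out of the batch system. The health_ok() test and the node list are placeholders; the real testnode script will differ.

    #!/usr/bin/env python
    # Hedged sketch: mark unhealthy worker nodes offline in Torque, bring healthy
    # ones back. health_ok() stands in for the real per-node checks (disk space,
    # software area mounted, etc.).
    import subprocess

    NODES = ["node001", "node002"]        # placeholder node list

    def health_ok(node):
        """Placeholder: return True if the node passes all local health checks."""
        return subprocess.call(["ssh", node, "true"]) == 0   # illustrative check only

    def set_offline(node, reason):
        # pbsnodes -o marks a node offline; -N attaches a note explaining why
        subprocess.check_call(["pbsnodes", "-o", node, "-N", reason])

    def set_online(node):
        # pbsnodes -c clears the offline state; -N "" clears the note
        subprocess.check_call(["pbsnodes", "-c", node, "-N", ""])

    for node in NODES:
        if health_ok(node):
            set_online(node)
        else:
            set_offline(node, "failed automatic health check")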

Monitoring
- Ganglia on all worker nodes and servers
  - Use MonAMI with Ganglia on the CE, SE and pool nodes
  - Torque/Maui stats, DPM/MySQL stats, RFIO/GridFTP connections (custom metrics can also be pushed in directly; see the sketch below)
- Nagios monitoring all servers and nodes
  - Continually increasing number of service checks
  - Increasing number of local scripts and hacks for alerts and ticketing
- Cacti used to monitor building switches
  - Throughput and error readings
- Ntop monitors the core Force10 switch, but still unreliable
- sFlowTrend tracks total throughput and biggest users, stable
- LanTopolog tracks MAC addresses and building network topology
- arpwatch monitors ARP traffic (changing IP/MAC address pairings)
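
Where MonAMI does not already provide a metric, a value can be pushed into Ganglia from a cron script with the standard gmetric tool. A minimal sketch, assuming gmetric is on the path; the metric name is illustrative and the load average is just a convenient example value (Ganglia already collects load by default).

    #!/usr/bin/env python
    # Hedged sketch: publish a custom metric into Ganglia via the gmetric CLI.
    # Metric name and value source are illustrative only.
    import subprocess

    def publish(name, value, units="", value_type="float"):
        """Push a single value into Ganglia using gmetric."""
        subprocess.check_call([
            "gmetric",
            "--name", name,
            "--value", str(value),
            "--type", value_type,   # gmetric types include uint32, float, string...
            "--units", units,
        ])

    if __name__ == "__main__":
        # Illustrative example: publish this node's 1-minute load average
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
        publish("custom_load_1min", load1)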

Monitoring - MonAMI

Monitoring - Cacti
Cacti Weathermap

Security
Network security
- University firewall filters off-campus traffic
- Local HEP firewalls to filter on-campus traffic
- Monitoring of LAN devices (and blocking of MAC addresses on the switch)
- Single SSH gateway, Denyhosts
- Snort and BASE (need to refine rules to be useful, too many alerts)
Physical security
- Secure cluster room with swipe-card access
- Laptop cable locks (some laptops stolen from the building in the past)
- Promoting use of encryption for sensitive data
- Parts of the HEP building are publicly accessible
Logging
- Server system logs backed up daily, stored for 1 year
- Auditing logged MAC addresses to find rogue devices (see the sketch below)
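
A minimal sketch of the kind of MAC-address audit that can be run against arpwatch's database, assuming the default arp.dat location and its whitespace-separated "MAC IP timestamp [hostname]" layout, plus a hypothetical locally maintained whitelist file. Paths and formats are assumptions and may differ on the actual systems.

    #!/usr/bin/env python
    # Hedged sketch: flag MAC addresses seen by arpwatch that are not on a local
    # whitelist of known devices. File locations and formats are assumptions.
    ARP_DAT = "/var/lib/arpwatch/arp.dat"       # assumed arpwatch database path
    WHITELIST = "/etc/hep/known_macs.txt"       # hypothetical local whitelist

    def load_whitelist(path):
        with open(path) as f:
            return set(line.split()[0].lower() for line in f
                       if line.strip() and not line.startswith("#"))

    def seen_devices(path):
        # Each arp.dat line is roughly: <mac> <ip> <timestamp> [hostname]
        with open(path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 2:
                    yield fields[0].lower(), fields[1]

    known = load_whitelist(WHITELIST)
    for mac, ip in seen_devices(ARP_DAT):
        if mac not in known:
            print("rogue device? %s at %s" % (mac, ip))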

Plans and Issues
Replace MAP-2
- Installation of new nodes shouldn't be a problem
- Getting rid of several hundred Dell PowerEdge 650s is more of a challenge
- Still need to think of a name for the new cluster
Possibility of rewiring the network in the Physics building
- Computing Services want to rewire and take over network management for offices
- But there's asbestos in the building, so maybe they don't
IPv6
- IANA pool of IPv4 addresses predicted to be exhausted by late 2010 / early 2011
- Need to be ready to switch at some point… (see the dual-stack sketch below)
- Would remove any NAT issues!
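
Being ready for IPv6 largely means making sure services can listen on both protocols. A minimal dual-stack listener sketch using only the Python standard library; the port number is arbitrary and this is illustrative, not one of the site's services.

    #!/usr/bin/env python
    # Hedged sketch: a dual-stack TCP listener that accepts IPv4 and IPv6 clients
    # on the same socket. Port number is arbitrary.
    import socket

    sock = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Clear IPV6_V6ONLY so IPv4 clients appear as IPv4-mapped IPv6 addresses
    sock.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    sock.bind(("::", 8080))          # "::" = all IPv6 (and mapped IPv4) addresses
    sock.listen(5)
    print("listening on [::]:8080 for both IPv4 and IPv6")

    conn, addr = sock.accept()
    print("connection from", addr)   # e.g. ('::ffff:192.0.2.1', 51234, 0, 0)
    conn.close()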

Conclusion
- New kit in, older kit soon (?) to be replaced
- We're just about keeping on top of things with the resources we have