Liverpool HEP - Site Report
June 2008
Robert Fay, John Bland
Staff Status
  One member of staff left in the past year:
    Paul Trepka, left March 2008
  Two full time HEP system administrators
    John Bland, Robert Fay
  One full time Grid administrator currently being hired
    *Closing date for applications was Friday 13th, 15 applications received
  One part time hardware technician
    Dave Muskett
Current Hardware
  Desktops
    ~100 Desktops: Scientific Linux 4.3, Windows XP
    Minimum spec of 2GHz x86, 1GB RAM + TFT Monitor
  Laptops
    ~60 Laptops: Mixed architectures, specs and OSes.
  Batch Farm
    Software repository (0.7TB), storage (1.3TB)
    Old batch queue has 10 SL3 dual 800MHz P3s with 1GB RAM
    Medium, short queues consist of 40 SL4 MAP-2 nodes (3GHz P4s)
    5 interactive nodes (dual Xeon 2.4GHz)
    Using Torque/PBS
    Used for general analysis jobs
Current Hardware - continued
  Matrix
    1 dual 2.40GHz Xeon, 1GB RAM
    6TB RAID array
    Used for CDF batch analysis and data storage
  HEP Servers
    *4 core servers
    User file store + bulk storage via NFS (Samba front end for Windows)
    Web (Apache), email (Sendmail) and database (MySQL)
    User authentication via NIS (+Samba for Windows)
    Dual Xeon 2.40GHz shell server and ssh server
    Core servers have a failover spare
Current Hardware - continued
  LCG Servers
    CE, SE upgraded to new hardware:
      CE now 8-core Xeon 2GHz, 8GB RAM
      SE now 4-core Xeon 2.33GHz, 8GB RAM, RAID 10 array
    CE, SE, UI all SL4, gLite 3.1
    Mon still SL3, gLite 3.0
    BDII SL4, gLite 3.0
Current Hardware - continued
  MAP2 Cluster
    24 rack (960 node) (Dell PowerEdge 650) cluster
    4 racks (280 nodes) shared with other departments
    Each node has 3GHz P4, 1GB RAM, 120GB local storage
    19 racks (680 nodes) primarily for LCG jobs (5 racks currently allocated for local ATLAS/T2K/Cockcroft batch processing)
    1 rack (40 nodes) for general purpose local batch processing
    Front end machines for ATLAS, T2K, Cockcroft
    Each rack has two 24 port gigabit switches
    All racks connected into VLANs via Force10 managed switch
Storage
  RAID
    All file stores are using at least RAID5. Newer servers using RAID6.
    All RAID arrays using 3ware 7xxx/9xxx controllers on Scientific Linux 4.3.
    Arrays monitored with 3ware 3DM2 software.
  File stores
    New user and critical software store, RAID6+HS, 2.25TB
    ~10TB general purpose hepstores for bulk storage
    1.4TB + 0.7TB batchstore+batchsoft for the Batch farm cluster
    1.4TB hepdata for backups
    37TB RAID6 for LCG storage element
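The trade-off between the RAID levels mentioned above (RAID5, RAID6, RAID6 with a hot spare, RAID 10) can be made concrete with a little arithmetic. The following is a minimal sketch only: the function name, the drive counts and the drive sizes are hypothetical and are not taken from the slides, which do not give the actual array layouts.

```python
def usable_capacity_tb(level, drives, drive_tb, hot_spares=0):
    """Return the usable capacity (TB) of a simple single-array RAID layout.

    Hot spares are idle drives and contribute no capacity. The drive counts
    passed in are illustrative assumptions, not the site's real configuration.
    """
    active = drives - hot_spares
    if level == "RAID5":
        if active < 3:
            raise ValueError("RAID5 needs at least 3 active drives")
        return (active - 1) * drive_tb      # one drive's worth of parity
    if level == "RAID6":
        if active < 4:
            raise ValueError("RAID6 needs at least 4 active drives")
        return (active - 2) * drive_tb      # two drives' worth of parity
    if level == "RAID10":
        if active < 4 or active % 2:
            raise ValueError("RAID10 needs an even number (>= 4) of active drives")
        return (active // 2) * drive_tb     # every drive is mirrored
    raise ValueError(f"unsupported RAID level: {level}")


# Hypothetical example: 12 x 0.25TB drives as RAID6 with one hot spare.
print(usable_capacity_tb("RAID6", 12, 0.25, hot_spares=1))
```

With the same twelve hypothetical drives, RAID5 without a spare would give more space but would survive only a single drive failure, which is the usual argument for moving newer servers to RAID6.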
Storage (continued)
  3ware Problems!
    3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.
    3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.
    3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.
  Leads to total loss of data access until system is rebooted.
  Sometimes leads to data corruption at array level.
  Seen under iozone load, normal production load, due to drive failure.
  Anyone else seen this?
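For anyone wanting to check their own logs for the same symptoms, a small script can pick these messages out and count them by error code. This is a sketch under the assumption that the driver messages always follow the shape of the three lines quoted above (possibly with a syslog prefix in front); the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the three messages quoted on the slide; the optional
# "AEN: " covers asynchronous event notifications from the controller.
PATTERN = re.compile(
    r"3w-9xxx: (?P<host>scsi\d+): (?:AEN: )?(?P<severity>WARNING|ERROR): "
    r"\((?P<code>0x[0-9A-Fa-f]+:0x[0-9A-Fa-f]+)\): (?P<message>.*)"
)


def parse_3ware_line(line):
    """Return a dict of fields for a 3w-9xxx message, or None if no match."""
    match = PATTERN.search(line)
    return match.groupdict() if match else None


def summarise(lines):
    """Count matching messages by (severity, code)."""
    counts = Counter()
    for line in lines:
        entry = parse_3ware_line(line)
        if entry:
            counts[(entry["severity"], entry["code"])] += 1
    return counts


sample = [
    "3w-9xxx: scsi0: WARNING: (0x06:0x0037): Character ioctl (0x108) timed out, resetting card.",
    "3w-9xxx: scsi0: ERROR: (0x06:0x001F): Microcontroller not ready during reset sequence.",
    "3w-9xxx: scsi0: AEN: ERROR: (0x04:0x005F): Cache synchronization failed; some data lost:unit=0.",
]
for key, count in summarise(sample).items():
    print(key, count)
```

Feeding it the contents of a kernel log file (for example line by line from an open file object) would show how often each code appears before a lock-up.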
Network Topology
  [Diagram. Labels: Force10 Gigabit Switch, WAN, firewall, LCG servers, MAP2, Offices, Servers. Legend: 2GB VLAN, 1GB link]
Network (continued)
  Core Force10 E600 managed switch.
  Now have 450 gigabit ports (240 at line rate)
  Used as central departmental switch, using VLANs
  Increased bandwidth to WAN using link aggregation to 2-3Gbit/s
  Increased departmental backbone to 2Gbit/s
  Added departmental firewall/gateway
  Network intrusion monitoring with snort
  Most office PCs and laptops are on internal private network
  Building network infrastructure is creaking - needs rewiring, old cheap hubs and switches need replacing
Security & Monitoring
  Security
    Logwatch (looking to develop filters to reduce noise)
    University firewall + local firewall + network monitoring (snort)
    Secure server room with swipe card access
  Monitoring
    Core network traffic usage monitored with ntop and cacti (all traffic to be monitored after network upgrade)
    Use sysstat on core servers for recording system statistics
    Rolling out system monitoring on all servers and worker nodes, using SNMP, Ganglia, Cacti, and Nagios
    Hardware temperature monitors on water cooled racks, to be supplemented by software monitoring on nodes via SNMP.
    Still investigating other environment monitoring solutions.
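As an illustration of how node temperature monitoring could feed into Nagios, the sketch below classifies a temperature reading using the standard Nagios plugin return codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds and the function name are assumptions chosen for the example, and the reading is passed in directly; fetching it over SNMP is left out.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_temperature(temp_c, warn_c=30.0, crit_c=40.0):
    """Map a temperature in Celsius to a (return code, status line) pair.

    warn_c and crit_c are hypothetical thresholds, not site values.
    Pass None when no reading could be obtained.
    """
    if temp_c is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if temp_c >= crit_c:
        return CRITICAL, f"TEMP CRITICAL - {temp_c:.1f}C"
    if temp_c >= warn_c:
        return WARNING, f"TEMP WARNING - {temp_c:.1f}C"
    return OK, f"TEMP OK - {temp_c:.1f}C"


code, text = check_temperature(33.5)
print(code, text)
```

A real plugin would print the status line and exit with the return code, so that Nagios can pick up the state of each node.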
System Management
  Puppet used for configuration management
  Dotproject used for general helpdesk
  RT integrated with Nagios for system management
    -Nagios automatically creates/updates tickets on acknowledgement
    -Each RT ticket serves as a record for an individual system
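The "one ticket per system" idea can be sketched as follows. This is an in-memory model only, assuming that an acknowledgement carries a host name, a service name and a comment; it does not use the RT or Nagios APIs, and all class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    """A ticket acting as the running record for one system."""
    ticket_id: int
    host: str
    history: list = field(default_factory=list)


class TicketStore:
    """Keeps exactly one ticket per host (hypothetical stand-in for RT)."""

    def __init__(self):
        self._by_host = {}
        self._next_id = 1

    def on_acknowledgement(self, host, service, comment):
        """Create the host's ticket if needed, then append the event to it."""
        ticket = self._by_host.get(host)
        if ticket is None:
            ticket = Ticket(self._next_id, host)
            self._by_host[host] = ticket
            self._next_id += 1
        ticket.history.append(f"{service}: {comment}")
        return ticket


store = TicketStore()
store.on_acknowledgement("node001", "disk", "drive failed, replacement ordered")
ticket = store.on_acknowledgement("node001", "disk", "drive replaced")
print(ticket.ticket_id, ticket.history)
```

Because later acknowledgements for the same host append to the existing ticket rather than opening a new one, each ticket builds up into a maintenance history for that machine.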
Plans
  Additional storage for the Grid
    GridPP3 funded
    Will be approx. 60? TB
    May switch from dCache to DPM
  Upgrades to local batch farm
    Plans to purchase several multi-core (most likely 8-core) nodes
  Collaboration with local Computing Services Department
    Share of their newly commissioned multi-core cluster available