Published by Alice Andrews. Modified over 9 years ago.
1
Winnie Lacesso Bristol Site Report June 2009
2
Staff & Users
Departmental Physics / Networks: JP Melot, Neil Laws (Microsoft); Rhys Morris (Astrophysics support)
Particle Physics: Winnie Lacesso, Rhys Morris (0.2)
About 40 PP staff & students
Desktops: fewer than 10 - Lx, MS, 2 x iMac
Laptops: 40 or so, mainly Mac (~16), XP (~15), Lx (SL4/5, FC)
STAFF CHANGES:
  Yves Coppens (SouthGrid Technical Support) left
  Jon Wakelin (0.5 Particle Physics support: GPFS, StoRM) left
  Dr Bob Cregan joined as HPC Storage Admin - will help with StoRM & GPFS
3
Servers
About 10 non-LCG servers (was 20): consolidated/retired 10 in 1 year!!
Win2003: fileserving (480GB); considering Unix/Samba replacement
Win2K AFS (IBM TransArc 3.6) (230GB): have Unix server ready, no time to get to it & Win2K server keeps working...
Most servers = SL4/5: NFS (1, was 5), PBS batch (3), compute (~3), subversion/elog, mediawiki, infrastructure (web, DHCP, kickstart)
4
UBristol HPC: PP usage
Was 30 jobslots, now up to 90 on SL4 HPC cluster (2GB RAM/core)
Not yet using SL5 HPC cluster (only 1GB RAM/core)
Jon W was instrumental in getting CE & SE up and running!
5
RAID Grief, SCSI Aggro
DPM has 2 x RAID arrays attached. The 16-bay array slid into a broken/faulty state after commissioning & 2 years' work. Months of grief + debugging.
  Aug 5 10:30:17 lcgse01 kernel: SCSI error : return code = 0x10000
  Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223
  Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395
  Aug 5 10:30:17 lcgse01 kernel: lost page write due to I/O error on sdf1
  Aug 5 10:30:37 lcgse01 kernel: scsi1:0:2:0: Attempting to abort cmd ebdd0e00: 0x28 0x0 0x89 0xbf
  Aug 5 10:30:37 lcgse01 kernel: scsi1: At time of recovery, card was not paused
  Aug 5 10:30:37 lcgse01 kernel: >>>>>>>>>>>>>>>>>> Dump Card State Begins<<<<<<<<<<<<<<<<<
  Aug 5 10:30:37 lcgse01 kernel: scsi1: Dumping Card State at program address 0x26 Mode 0x33
  Aug 5 10:30:37 lcgse01 kernel: Card was paused
Replaced SCSI controller (Adaptec, for LSI) - no difference
Vendor agreed & sent replacement Dec 2008; installed Jan 2009
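When debugging drags on for months, it can help to tally I/O errors per device rather than read the log by eye. The following is only a minimal sketch, not the procedure used at Bristol: the function name `count_io_errors` and the regular expression are illustrative assumptions, written to match the `end_request: I/O error` line format in the excerpt above.

```python
import re
from collections import Counter

# Assumed pattern, modelled on the excerpt's line:
# "... kernel: end_request: I/O error, dev sdf, sector 787223"
IO_ERROR = re.compile(r"kernel: end_request: I/O error, dev (\w+), sector (\d+)")


def count_io_errors(lines):
    """Return a Counter mapping device name -> number of end_request I/O errors."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines copied from the log excerpt in the slide.
sample = [
    "Aug 5 10:30:17 lcgse01 kernel: SCSI error : return code = 0x10000",
    "Aug 5 10:30:17 lcgse01 kernel: end_request: I/O error, dev sdf, sector 787223",
    "Aug 5 10:30:17 lcgse01 kernel: Buffer I/O error on device sdf1, logical block 98395",
]

print(count_io_errors(sample))
```

In practice the lines would come from the host's syslog file (its location depends on the syslog configuration), and a device that dominates the tally is a candidate for closer inspection.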
6
Shoulder that Load!
7
StoRM SE, GPFS
New hardware for HPC CE & StoRM SE, also gridftp server & new MON (syslog, Nagios, etc): X7DBU, Xeon E5405, with 2GB RAM/core
HPC CE working well except for GPFS timeouts - patchy OPS SAM failures
Problems with StoRM - GPFS multiclustering not yet working, rfio permission problems (ACLs??) - thought Jon left it in working order but guess not...
New Storage Admin (Bob Cregan) will help get GPFS multiclustering working
Good performance on new hardware!
8
Security
User laptops frequently go offsite (home, CERN, RAL), come back & reconnect to internal network.
No (detected) incidents, even from users with root/admin access on laptops.
One laptop lost - student forgot bag at bus stop. Not there on return.
Fortunately, USB backup disk kept in different location.
Moral of story: carry USB backup disk separate from laptop.
Ongoing scary ssh-linux incident: no intrusions detected here so far
9
Issues
Upcoming/pending work:
  Ongoing: new servers replacing old - servers waiting
  VMs will replace existing web/svn/elog/wiki server, existing SL3 MON, & probably others
Recent/ongoing problems:
  UPS needs rearranging - some important servers not on UPS
  Workload really increased since Yves & Jon left
  A/C failure May 2009 - A/C being replaced (before it gets too hot, we hope)