1
SouthGrid Technical Meeting
Pete Gronbech: 26th August 2005, Oxford
2
Present
Pete Gronbech – Oxford
Ian Stokes-Rees – Oxford
Chris Brew – RAL PPD
Santanu Das – Cambridge
Yves Coppens – Birmingham
3
Agenda
Chat
10:30 Coffee
Pete + Others
1pm Lunch
Interactive Workshop!!
3:15pm Coffee
Finish
4
SouthGrid Member Institutions
Oxford
RAL PPD
Cambridge
Birmingham
Bristol
HP-Bristol
Warwick
5
Stability, Throughput and Involvement
The last quarter has been a good, stable period for SouthGrid
Addition of Bristol PP
4 out of 5 sites already upgraded to 2_6_0
Large involvement in the Biomed DC
8
Monitoring
http://www.gridpp.ac.uk/ganglia/
http://map.gridpp.ac.uk/
http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi (configure view UKI)
http://www.physics.ox.ac.uk/users/gronbech/gridmon.htm
Dave Kant's helpful doc is linked in the minutes of a tbsupport meeting
http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridpp_view.php
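As an illustration of the kind of ad-hoc check that sits behind these Ganglia pages, the sketch below polls a gmond daemon, which in a default configuration publishes its cluster state as XML on TCP port 8649, and prints a per-host load summary. This is a minimal sketch only: the hostname is a placeholder and the metric names assume a standard gmond setup, not anything specific to the SouthGrid sites.

```python
# Minimal sketch: poll a Ganglia gmond daemon and summarise load per host.
# The hostname below is a placeholder; gmond dumps its cluster state as XML
# on TCP port 8649 in a default configuration.
import socket
import xml.etree.ElementTree as ET

GMOND_HOST = "ce.example.ac.uk"   # placeholder head node running gmond
GMOND_PORT = 8649                 # default gmond XML port

def fetch_gmond_xml(host, port):
    """Read the full XML dump that gmond sends on connect."""
    chunks = []
    with socket.create_connection((host, port), timeout=10) as sock:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

def summarise(xml_bytes):
    """Print one line per host with the load_one and cpu_num metrics."""
    root = ET.fromstring(xml_bytes)
    for host in root.iter("HOST"):
        metrics = {m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")}
        print(host.get("NAME"),
              "load_one=" + metrics.get("load_one", "?"),
              "cpu_num=" + metrics.get("cpu_num", "?"))

if __name__ == "__main__":
    summarise(fetch_gmond_xml(GMOND_HOST, GMOND_PORT))
```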
12
Ganglia mods for Oxford August 2005
13
Status at RAL PPD
SL3 cluster on LCG 2.6.0
CPUs: 11 x 2.4 GHz, 33 x 2.8 GHz
–100% Dedicated to LCG
0.7 TB Storage
–100% Dedicated to LCG
Configured 6.4 TB of IDE RAID disks for use by dCache
5 systems to be used for the Pre-Production testbed
14
RAL 2
dCache installation
Pre-Production?
Upgrade to 2_6_0 report
–R-GMA mon node: early yaim did not work for the upgrade (only fresh installs)
–Problems with connector order (the Tomcat openssl connector came before the insecure connector; if no cert was present this had to be fixed by hand); the latest release of yaim and CERT is OK
–yum name changes caused some problems; the perl api rpm needs deleting
15
Status at Cambridge
Currently LCG 2.6.0 on SL3
CPUs: 42 x 2.8 GHz (extra nodes: only 2/10 any good)
–100% Dedicated to LCG
2 TB Storage (have 3 TB but only 2 TB available)
–100% Dedicated to LCG
Condor Batch System
Lack of Condor support from LCG teams
16
Cambridge 2
CamGrid – LCG interaction
–All nodes would need the LCG WN software in order to make them available everywhere
–The GridPP CE hosts the central manager
–Nodes have 2 IPs: one Cambridge private and one LCG public IP
–condor user1 and condor user2
–ATLAS jobs not working: the software is installed but not verified, due to the above problems
Condor Issues
Monitoring / Accounting
–Ganglia installed, nearly ready; need to inform A McNab
Upgrade (2_6_0) report
–Same rpm problems
–R-GMA fixes from Yves
–Tomcat
–Overall quite easy compared with previous releases
17
Status at Bristol
Status
–Yves and Pete installed SL304 and LCG-2_4_0 and the site went live on July 5th 2005. Yves upgraded to 2_6_0 in the last week of July as part of pre-release testing.
Existing resources
–80-CPU BaBar farm moved to Birmingham
–GridPP nodes plus local cluster nodes used to bring the site online. The local cluster still needs to be integrated.
New resources
–Funding now confirmed for a large University investment in hardware
–Includes CPU, high-quality and scratch disk resources
Humans
–New system manager post (RG) should be in place
–New SouthGrid support / development post (GridPP / HP) being filled
–HP have moved the ia64 machines on to cancer research due to lack of use by LCG
18
Status at Birmingham
Currently SL3 with LCG-2_6_0
CPUs: 24 x 2.0 GHz Xeon (+48 local nodes which could in principle be used, but…)
–100% LCG
1.8 TB Classic SE
–100% LCG
BaBar farm moving to SL3 and Bristol integrated, but not yet on LCG
19
Birmingham 2
BaBar cluster expansion
LCG-2_6_0 early testing in July
Involvement in the Pre-Production Grid
Installation of DPM?
–How to migrate data
–or just close the old SE
Integration of local users vs grid users
20
Status at Oxford
Currently LCG 2.4.0 on SL304
All 74 CPUs running since ~June 20th
CPUs: 80 x 2.8 GHz
–100% LCG
1.5 TB Storage – the second 1.5 TB will be brought online as DPM or dCache
–100% LCG
Some further air conditioning problems, now resolved for Room 650; second rack in an overheating basement
Heavy use by Biomed during their DC
Plan to give local users access
21
Oxford 2
Need to upgrade to 2_6_0 next week
Early testing of 2_6_0 in July on tbce01
Integration with the PP cluster to give local access to grid queues
22
Security
Best practices link: https://www.gridpp.ac.uk/deployment/security/index.html
Wiki entry: http://goc.grid.sinica.edu.tw/gocwiki/AdministrationFaq
iptables?? – Birmingham to share their setup on the SouthGrid web pages
Completed
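No firewall recipe was circulated with these slides, so the snippet below is only an illustrative sketch of the kind of check a site admin might run while reviewing iptables rules: it tests whether a few commonly quoted LCG-2 service ports (e.g. 2119 for the Globus gatekeeper, 2811 for GridFTP control) are reachable on a node. The hostname and the exact port list are placeholder assumptions, not a statement of what Birmingham's setup actually opens.

```python
# Illustrative sketch only: check which of a few well-known LCG-2 service
# ports are reachable on a node, e.g. before/after tightening iptables rules.
# Hostname and port list are placeholders, not a recommended firewall policy.
import socket

NODE = "ce.example.ac.uk"      # placeholder grid node
PORTS = {
    2119: "Globus gatekeeper (GRAM)",
    2135: "MDS / GRIS",
    2811: "GridFTP control",
    8443: "Tomcat https (e.g. R-GMA / web services)",
}

def port_open(host, port, timeout=3):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for port, label in sorted(PORTS.items()):
        state = "open" if port_open(NODE, port) else "closed/filtered"
        print(f"{NODE}:{port:<5} {label:<40} {state}")
```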
23
Action Plan for Bristol
Plan to visit on June 9th to install an installation server
–DHCP server
–NFS copies of SL (local mirror)
–PXE boot setup etc.
Second visit to reinstall head nodes with SL304 and LCG-2_4_0, plus some worker nodes
BaBar cluster to go to Birmingham
–Fergus, Chris, Yves to liaise
Completed
24
Action plan for SouthGrid
Ensure all sites are up to date for GridPP14 (Oxford)
SRM installations
SC4 preparations
LHC DC awareness
25
Grid site wiki
http://www.gridsite.org/wiki/main_page
Data management: http://www.physics.gla.ac.uk/gridpp/data management
26
LCG Deployment Schedule
27
Overall Schedule (Raw-ish)
[Timeline chart: ALICE, ATLAS, CMS and LHCb activities across September – December]
28
Service Challenge 4 – SC4
SC4 starts April 2006
SC4 ends with the deployment of the FULL PRODUCTION SERVICE
Deadline for component (production) delivery: end January 2006
Adds further complexity over SC3:
–Additional components and services
–Analysis use cases
–SRM 2.1 features required by LHC experiments
–All Tier2s (and Tier1s…) at full service level
–Anything that dropped off the list for SC3…
–Services oriented at analysis and end users
–What implications for the sites?
Analysis farms:
–Batch-like analysis at some sites (no major impact on sites)
–Large-scale parallel interactive analysis farms at major sites
–(100 PCs + 10 TB storage) x N
User community:
–No longer a small (<5) team of production users
–20–30 work groups of 15–25 people
–Large (100s – 1000s) numbers of users worldwide
29
SC4 Timeline
September 2005: first SC4 workshop(?) – 3rd week of September proposed
January 31st 2006: basic components delivered and in place
February / March: integration testing
February: SC4 planning workshop at CHEP (w/e before)
March 31st 2006: integration testing successfully completed
April 2006: throughput tests
May 1st 2006: Service Phase starts (note compressed schedule!)
September 1st 2006: initial LHC Service in stable operation
Summer 2007: first LHC event data