Installing and Running a CMS T3 Using OSG Software - UCR
Bill Strossman

Overview of Cluster
- 1 head node
- 4 storage nodes (top, bottom, strange, charm)
- 4 Apple Xserve RAID boxes
  - 2 RAID 0 arrays of 7 x 500 GB drives each
  - The arrays form a 7 TB logical volume (RAID 00?)
- 10 compute nodes
- Warewulf 2.4 clustering software (now 2.6)
- SL 3.0.5 installed initially (now SL 4.5, x86_64)
  - 32-bit compatibility libraries and compilers installed
- Compute nodes get their system from an image file via TFTP
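
Since the compute nodes pull their system image from the head node over TFTP, a quick sanity check before (re)booting nodes is something like the sketch below. The service names and the /tftpboot path are assumptions about a typical SL/Warewulf setup, not this cluster's exact configuration:

    # confirm the PXE/TFTP boot chain on the head node
    chkconfig --list | grep -E 'dhcpd|tftp'   # tftp normally runs out of xinetd on SL
    service dhcpd status                      # DHCP hands the nodes their boot parameters
    ls -lh /tftpboot/                         # assumed TFTP root holding the node boot images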

Head Node
- 2 dual-core AMD Opteron 275 CPUs
- 4 GB RAM
- 2 x 250 GB drives, mirrored
- 3 x 1 Gb Ethernet ports

Storage and Compute Nodes
Storage nodes:
- 2 AMD Opteron 250 CPUs
- 4 GB RAM
- 1 x 250 GB disk
- 2 Apple Fibre Channel ports
- 2 x 1 Gb Ethernet ports
- 7 TB of externally attached FC disk storage (Apple Xserve RAID)
Compute nodes:
- 2 AMD Opteron 275 CPUs
- 4 GB RAM
- 2 x 250 GB disks
- 2 x 1 Gb Ethernet ports

UCR-HEP

Close-up

Challenges
- Hardware failures
- Installation of OSG software
  - Compute Element
  - GUMS
  - Squid
  - Condor
  - OSG Client
- Installation of CMS software
  - CMSSW
  - PhEDEx
- Operations and issues

Hardware Failures
- Many, many bad RAM sticks
  - At least one-third of the RAM has been replaced
  - Vendor acknowledges getting a bad batch
- Drive failures
  - 2 in Apple Xserve RAID boxes
  - 1 in a storage node
- Fans
  - 5 power supply fans
  - 2 CPU fans
- Miscellaneous
  - Fibre Channel controller in an Apple Xserve RAID
  - Heat-related incidents in an Apple Xserve RAID box

Installation of OSG Software
Compute Element:
- Started at OSG 0.4.0, now at 0.8.0
- Had a lot of help with the first install
- Upgrades have been fairly smooth; had to make changes to site-local-conf.xml and storage.xml, among others (more on this later)
GUMS:
- Was a nightmare to get going at first
- All notices about adding a new VO, etc. only list instructions for sites using a gridmap file
- Dies on rare occasions, leading to an inability to run grid jobs and "red" SAM results
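
One quick end-to-end check that the CE and its GUMS mapping are healthy (and that a "red" SAM result is not simply an authorization problem) is to run a trivial grid job from any machine with the OSG client installed. The hostname below is a placeholder:

    voms-proxy-init -voms cms                                    # obtain a VOMS proxy for the CMS VO
    globus-job-run ce.example.edu/jobmanager-fork /bin/hostname  # exercises GSI authentication and the GUMS mapping
    globus-job-run ce.example.edu/jobmanager-condor /bin/date    # exercises the hand-off to the Condor batch system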

Installation of OSG Software (continued)
Squid:
- Very easy to get up and running
- Documentation is good
Condor:
- Had a lot of trouble getting it to work with both internal and external networks at first
- Upgrades are fairly easy, as configuration files can be copied over for the most part
- Does not handle group quotas in a desirable way
OSG Client:
- Easy to install and configure
- Need to configure Condor for this as well
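
A simple way to confirm that requests really go through the squid is to fetch a URL via the proxy and look for squid's cache headers. The host name is a placeholder, the port is squid's default, and any URL the proxy is allowed to fetch will do:

    export http_proxy=http://squid.example.edu:3128
    wget -S -O /dev/null http://frontier.cern.ch/ 2>&1 | grep -iE 'X-Cache|Via'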

Installation of CMS Software
PhEDEx:
- A nightmare to get going initially
  - 32/64-bit incompatibilities (we were x86_64)
  - SL3 vs. SL4 (we were running SL4)
  - Documentation was sketchy; much better now
  - Had a lot of help to finally get it going
- Upgrading from 2.5 to 3.0 was pretty painless
  - Needed to modify storage.xml, ConfigPart.Download, Config.Prod, etc. for srmv2 and the new site name
  - Had to get new database roles and update DBParam
- Had to request a link to UCSD to retrieve data that was only stored there

Installation of CMS Software (continued)
CMSSW:
- Easy to install once you match up the architecture
- Not so easy to get working properly
- After installing a few versions by hand, I jumped at the opportunity to have Bockjoo Kim install and maintain the many versions using a grid job
- We do not have a Storage Element (yet), so it is necessary to maintain site-local-conf.xml and storage.xml versions separate from PhEDEx's for stage-out to UCSD

Software Layout (storage nodes were a boon!)
- top: CE; Condor submit host for OSG users
- bottom: GUMS; PhEDEx
- charm: Squid
- strange: OSG Client; Condor submit host for local users

Operations and Issues
PhEDEx:
- VOMS proxy vs. grid proxy
  - Grid proxies can be set not to expire for several months
  - Some sites accept a VOMS proxy only, which has a maximum lifetime of 8 days
  - Need a good, reliable way to automate renewal of the VOMS proxy (modify and use the provided myProxy server configuration file?); see the sketch after this slide
- Need to make sure that enough space exists for the desired dataset
  - It has been necessary to move things around and create soft links to maintain the directory structure
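
A minimal cron-based sketch of the renewal idea, assuming the credential used by PhEDEx can create a proxy non-interactively (e.g. a passphrase-less key); a myProxy-based renewal, as suggested above, is the more robust route. The paths and schedule are placeholders:

    # crontab entry for the phedex account: refresh the VOMS proxy every Monday,
    # requesting the 8-day (192-hour) maximum so the agents never run with a stale proxy
    0 3 * * 1  voms-proxy-init -voms cms -valid 192:00 -out /home/phedex/gridcert/proxy.cert >> /home/phedex/logs/proxy.log 2>&1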

Operations and Issues (continued)
OSG software:
- GUMS must be restarted occasionally
- Renew host and service certificates annually
- Update CRLs (now automatic if enabled)
Getting our site to be fully "green":
- Needed to obtain the file oneEvt.root in order to pass the SAM "mc" test
- Needed to obtain the 24 QCD ROOT files in order to pass the SAM "analysis" test
Making sure that it stays green:
- It can be difficult to enforce group quotas in Condor, so I modified a script (from UCSD) to assign a group based on the mapped userid; see the sketch after this slide
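
The group-assignment idea is roughly the following hedged sketch (not the actual UCSD script): choose a Condor accounting group from the Unix account the grid user was mapped to, and stamp it on the job before it reaches the negotiator. The pool-account prefixes, group names, and the $SUBMIT_DESC variable are all placeholders:

    # choose an accounting group from the mapped userid and append it to the job description
    case "$USER" in
      cmsprod*) GROUP="group_cmsprod" ;;   # production pool accounts (assumed prefix)
      uscms*)   GROUP="group_cmsuser" ;;   # analysis pool accounts (assumed prefix)
      *)        GROUP="group_other"   ;;
    esac
    echo "+AccountingGroup = \"${GROUP}.${USER}\"" >> "$SUBMIT_DESC"   # hypothetical submit-description file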

Operations and Issues (continued)
Other:
- Non-OSG upgrades
  - SL 3.0.5 to SL 4.5
  - Warewulf 2.4 to Warewulf 2.6
- Regularly check that:
  - Disk volumes are not full
  - All nodes are up and all NFS mounts are intact
  - All fans are operational
  - System logs do not contain critical error messages
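
Most of these checks (apart from the fans) are easy to script; a simple sketch, with node names and the 90% threshold as placeholders, might look like:

    #!/bin/bash
    # disk volumes: warn above 90% use
    df -P | awk '0+$5 > 90 {print "volume nearly full:", $6, $5}'
    # nodes up: ping each compute/storage node once
    for n in node01 node02 node03; do
        ping -c 1 -w 2 "$n" > /dev/null || echo "node not responding: $n"
    done
    # NFS mounts intact: a broken mount shows up here (a truly stale one may hang the stat)
    mount -t nfs | awk '{print $3}' | while read -r m; do
        stat "$m" > /dev/null 2>&1 || echo "problem with NFS mount: $m"
    done
    # system logs: recent critical messages
    grep -iE 'error|fail|critical' /var/log/messages | tail -n 20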

Upgrade Notes
OSG:
- mv osgce osgce-0.6; mkdir osgce
- Update pacman, if necessary
- Follow the instructions on the Compute Element Install twiki
- Rename the new condor_config and copy the old one over
- Certs, gsi-authz.conf, and prima-authz.conf are physically located in /etc/grid-security, so no need to worry about them
- vdt-control --on
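
The same sequence as a commented shell sketch: the osgce directory name follows this site's layout, and the condor_config location inside the new install is an assumption; the pacman update and the twiki steps remain manual:

    mv osgce osgce-0.6 && mkdir osgce                            # keep the old 0.6 install for rollback/reference
    cd osgce
    # ...update pacman if needed and run the install per the Compute Element Install twiki...
    mv condor/etc/condor_config condor/etc/condor_config.dist   # set aside the freshly installed config (assumed path)
    cp ../osgce-0.6/condor/etc/condor_config condor/etc/        # carry over the site-tuned Condor configuration
    # host/service certs, gsi-authz.conf, and prima-authz.conf live in /etc/grid-security and are untouched
    vdt-control --on                                             # re-enable the VDT-managed services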

Upgrade Notes (continued)
PhEDEx (from 2.x to 3.x):
- mv /home/phedex /home/phedex-old
- mkdir /home/phedex
- chown -R phedex.phedex /home/phedex
- Follow the instructions at the Site Deployment link under Documentation on the PhEDEx home page
- Copy over and modify the configuration files mentioned earlier

Future Plans
- dCache Storage Element
  - Head node and two storage nodes
  - Two Xserve RAID boxes
- Move from Warewulf to Perceus
- SL 5.x (?)
- OSG 1.0
- myProxy server to automate renewal of the PhEDEx VOMS proxy (?)

Acknowledgements
- Terrence Martin (UCSD)
- Frank Wuerthwein (UCSD)
- Brian Bockelman (UNL)
- Bockjoo Kim (UF)
- Burt Holzman (FNAL)
- Robert Clare (UCR)
- OSG Operations staff
- PhEDEx HyperNews contributors
- UCR Network Operations
- Bob Grant (UCR Comp. & Comm.)