Production Coordination Staff Retreat July 21, 2010 Dan Fraser – Production Coordinator.

Slides:



Advertisements
Similar presentations
Dec 14, 20061/10 VO Services Project – Status Report Gabriele Garzoglio VO Services Project WBS Dec 14, 2006 OSG Executive Board Meeting Gabriele Garzoglio.
Advertisements

OSG Technology Plans OSG Staff Meeting July 2012 Brian Bockelman.
 Contributing >30% of throughput to ATLAS and CMS in Worldwide LHC Computing Grid  Reliant on production and advanced networking from ESNET, LHCNET and.
Campus High Throughput Computing (HTC) Infrastructures (aka Campus Grids) Dan Fraser OSG Production Coordinator Campus Grids Lead.
Jan 2010 Current OSG Efforts and Status, Grid Deployment Board, Jan 12 th 2010 OSG has weekly Operations and Production Meetings including US ATLAS and.
OSG Area Coordinators Meeting Operations Rob Quick 2/22/2012.
MyOSG: A user-centric information resource for OSG infrastructure data sources Arvind Gopu, Soichi Hayashi, Rob Quick Open Science Grid Operations Center.
Open Science Grid Software Stack, Virtual Data Toolkit and Interoperability Activities D. Olson, LBNL for the OSG International.
Key Project Drivers - FY11 Ruth Pordes, June 15th 2010.
Rsv-control Marco Mambelli – Site Coordination meeting October 1, 2009.
OSG Public Storage and iRODS
OSG Operations Rob Quick July 10th, 2012 OSG Staff Retreat.
OSG Security Review Mine Altunay June 19, June 19, Security Overview Current Initiatives  Incident response procedure – top priority (WBS.
OSG Operations and Interoperations Rob Quick Open Science Grid Operations Center - Indiana University EGEE Operations Meeting Stockholm, Sweden - 14 June.
OSG Services at Tier2 Centers Rob Gardner University of Chicago WLCG Tier2 Workshop CERN June 12-14, 2006.
OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.
Integration and Sites Rob Gardner Area Coordinators Meeting 12/4/08.
OSG Middleware Roadmap Rob Gardner University of Chicago OSG / EGEE Operations Workshop CERN June 19-20, 2006.
Publication and Protection of Site Sensitive Information in Grids Shreyas Cholia NERSC Division, Lawrence Berkeley Lab Open Source Grid.
May 8, 20071/15 VO Services Project – Status Report Gabriele Garzoglio VO Services Project – Status Report Overview and Plans May 8, 2007 Computing Division,
G RID M IDDLEWARE AND S ECURITY Suchandra Thapa Computation Institute University of Chicago.
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
Overview of Monitoring and Information Systems in OSG MWGS08 - September 18, Chicago Marco Mambelli - University of Chicago
OSG Area Coordinators Meeting Proposal Chander Sehgal Fermilab
OSG Area Coordinator’s Report: Workload Management April 20 th, 2011 Maxim Potekhin BNL
OSG Software and Operations Plans Rob Quick OSG Operations Coordinator Alain Roy OSG Software Coordinator.
OSG Project Manager Report for OSG Council Meeting OSG Project Manager Report for OSG Council Meeting October 14, 2008 Chander Sehgal.
WLCG Service Report ~~~ WLCG Management Board, 1 st September
Mar 28, 20071/18 The OSG Resource Selection Service (ReSS) Gabriele Garzoglio OSG Resource Selection Service (ReSS) Don Petravick for Gabriele Garzoglio.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 8/15/2012.
OSG Production Report OSG Area Coordinator’s Meeting Aug 12, 2010 Dan Fraser.
Job and Data Accounting on the Open Science Grid Ruth Pordes, Fermilab with thanks to Brian Bockelman, Philippe Canal, Chris Green, Rob Quick.
Towards a Global Service Registry for the World-Wide LHC Computing Grid Maria ALANDES, Laurence FIELD, Alessandro DI GIROLAMO CERN IT Department CHEP 2013.
OSG Tier 3 support Marco Mambelli - OSG Tier 3 Dan Fraser - OSG Tier 3 liaison Tanya Levshina - OSG.
July 25, 20071/21 OSG Information Services Gabriele Garzoglio, Rob Quick, Chris Green OSG Information Services, VO Monitoring Services and Resource Selection.
Grid Operations Lessons Learned Rob Quick Open Science Grid Operations Center - Indiana University.
OSG Production Report OSG Area Coordinator’s Meeting Nov 17, 2010 Dan Fraser.
Grid Deployment Enabling Grids for E-sciencE BDII 2171 LDAP 2172 LDAP 2173 LDAP 2170 Port Fwd Update DB & Modify DB 2170 Port.
6/23/2005 R. GARDNER OSG Baseline Services 1 OSG Baseline Services In my talk I’d like to discuss two questions:  What capabilities are we aiming for.
Site Validation Session Report Co-Chairs: Piotr Nyczyk, CERN IT/GD Leigh Grundhoefer, IU / OSG Notes from Judy Novak WLCG-OSG-EGEE Workshop CERN, June.
Production Coordination Area VO Meeting Feb 11, 2009 Dan Fraser – Production Coordinator.
Experiment Support CERN IT Department CH-1211 Geneva 23 Switzerland t DBES GGUS Ticket review T1 Service Coordination Meeting 2010/10/28.
The OSG and Grid Operations Center Rob Quick Open Science Grid Operations Center - Indiana University ATLAS Tier 2-Tier 3 Meeting Bloomington, Indiana.
Jan 2010 OSG Update Grid Deployment Board, Feb 10 th 2010 Now having daily attendance at the WLCG daily operations meeting. Helping in ensuring tickets.
RSV: OSG Grid Fabric Monitoring and Interoperation with WLCG Monitoring Systems Rob Quick, Arvind Gopu, and Soichi Hayashi Computing in High Energy and.
Production Oct 31, 2012 Dan Fraser. Current Production Focus Transition to RPMs 52(44) sites using RPM based installs 52(44) sites using RPM based installs.
OSG Area Report Production – Operations – Campus Grids Jan 11, 2011 Dan Fraser.
Auditing Project Architecture VERY HIGH LEVEL Tanya Levshina.
CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.
Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside.
MND review. Main directions of work  Development and support of the Experiment Dashboard Applications - Data management monitoring - Job processing monitoring.
OSG Report for DOE/NSF Joint Oversight Group U.S. Large Hadron Collider Program OSG Report for DOE/NSF Joint Oversight Group U.S. Large Hadron Collider.
OSG Storage VDT Support and Troubleshooting Concerns Tanya Levshina.
OSG Area Coordinator’s Report: Workload Management March 25 th, 2010 Maxim Potekhin BNL
WLCG Information System Use Cases Review WLCG Operations Coordination Meeting 18 th June 2015 Maria Alandes IT/SDC.
Area Coordinator Report for Operations Rob Quick 4/10/2008.
OSG Area Report Production – Operations – Campus Grids June 19, 2012 Dan Fraser Rob Quick.
Operations Area Coordinator Report. 31 Jan Overview Operations Current Initiatives  RSV Version 2  New Probes, Easier Configuration, Improved.
SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.
OSG Status and Rob Gardner University of Chicago US ATLAS Tier2 Meeting Harvard University, August 17-18, 2006.
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
OSG Area Coordinator’s Report: Workload Management June 3 rd, 2010 Maxim Potekhin BNL
OSG Area Coordinators Meeting Security Team Report Mine Altunay 8/15/2012.
Parag Mhashilkar Computing Division, Fermilab.  Status  Effort Spent  Operations & Support  Phase II: Reasons for Closing the Project  Phase II:
RSV: OSG Grid Monitoring and User Customizable Views Rob Quick, Arvind Gopu, and Soichi Hayashi High Performance Distributed Computing Location: Munich,
OSG Facility Miron Livny OSG Facility Coordinator and PI University of Wisconsin-Madison Open Science Grid Scientific Advisory Group Meeting June 12th.
Open Science Grid Configuring RSV OSG Resource & Service Validation Thomas Wang Grid Operations Center (OSG-GOC) Indiana University.
Grid Colombia Workshop with OSG Week 2 Startup Rob Gardner University of Chicago October 26, 2009.
OSG User Group August 14, Progress since last meeting OSG Users meeting at BNL (Jun 16-17) –Core Discussions on: Workload Management; Security.
Presentation transcript:

Production Coordination Staff Retreat July 21, 2010 Dan Fraser – Production Coordinator

Production WBS Identify & resolve OSG issues Prod calls, look at usage patterns, metrics … Prod calls, look at usage patterns, metrics … T3 Liaison Docs, Security, Xrootd, WLCG client, Coord w/ Hiro, Asoka… Manage relationships & raise issues with the ET when necessary T1/T2/T3 Admins – identify requirements T1/T2/T3 Admins – identify requirements Assist the ET team w/Operations, Sites, VOs, & Education/training Lead & Coordinate plans to improve the OSG facility Glide-in effort (paper, VO transitions, testing format) Glide-in effort (paper, VO transitions, testing format)

Some Production Examples… Effort from the entire team SE-only solution for Atlas T3s SE-only solution for Atlas T3s ITIL like processes for Operations ITIL like processes for Operations Updated process for CA testing prior to production Updated process for CA testing prior to production CEMON issues (hanging the BDII) CEMON issues (hanging the BDII) CERN BDII not reporting (RG data limit exceeded) CERN BDII not reporting (RG data limit exceeded) CERN BDII not a high priority (pushed but no mvmnt) CERN BDII not a high priority (pushed but no mvmnt) Transitioning sites to use the new Gratia collector address (in progress) Transitioning sites to use the new Gratia collector address (in progress) Urgent security updates for sites running Condor/Gratia Urgent security updates for sites running Condor/Gratia Addressing VO Issues: Addressing VO Issues: LIGO Production running well (Rob E.) Between Rank #1 & #2 on OSG Between Rank #1 & #2 on OSG SBGRID reaching new peaks (~7000 parallel jobs)

A View from the Production Coordinator What are the biggest problems in OSG? Supporting VO’s Is difficult Supporting VO’s Is difficult Lowering the barrier for scaling across sites Site differences often require site-by-site investigation Site differences often require site-by-site investigation Effort in progress to understand this (Dan, Abhishek) Big win possible with Glide-ins Big win possible with Glide-ins New paper comparing job submission strategies How to get opportunistic storage Current method is to talk to each site… Current method is to talk to each site… New strategies being explored (Tanya, Brian, Dan) New strategies being explored (Tanya, Brian, Dan)

OSG Health Monitoring All links now on the production page me me Usage Charts Weekly Calls OSG Data movement Job/Error ratios DOE display showing last 24 hours and much more … and much more …

Solving Production Problems Solving problems is a TEAM sport The weekly production call has key people from all the teams that are needed to solve problems CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics Problems accurately prioritized and channeled to the correct avenue Sometimes solved on the call. Forewarning to prepare for upcoming issues.

Example Problems Handling of job pre-emption (LIGO / D0) VO Package Validation probe needed GIP “truth in advertising” GIP “truth in advertising” LIGO switch to GT2 and also Condor-G job submission Condor scaling limits in GridMon (Atlas) Globus LSF gatekeeper bug (D0/CMS) Security Drill successes (for T1) Gratia probe introduction & ITB testing

Example Issues cont. STEP09 monitoring (partially successful) IceCube management of opportunistic storage Gratia file transfer data catch up Transition from VORS to myOSG New location for RSV probes and ability to update from the “production” cache Also, ensure that config_OSG does not update the probes automatically Also, ensure that config_OSG does not update the probes automatically Root Cause Analysis of CMS BDII outage

Example Issues cont. Plan to localize data transfer information and upload summary transfer packets. Globus memory leak was causing frequent reboots at BNL. Site name mapping problem to enable different names internal to OSG. OIM display difference (http vs https) Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Example Issues cont. Condor problem with directory creation in a multiple gateway scenario. (Nebraska) Gratia collector problem with handling records that accumulate faster than they can be processed. LIGO/Pegasus transition to use BDII data instead of central probe data. LIGO/Pegasus transition to use BDII data instead of central probe data.