Production Coordination Area VO Meeting Feb 11, 2009 Dan Fraser – Production Coordinator.

Slides:



Advertisements
Similar presentations
Applications Area Issues RWL Jones GridPP13 – 5 th June 2005.
Advertisements

Dec 14, 20061/10 VO Services Project – Status Report Gabriele Garzoglio VO Services Project WBS Dec 14, 2006 OSG Executive Board Meeting Gabriele Garzoglio.
 Contributing >30% of throughput to ATLAS and CMS in Worldwide LHC Computing Grid  Reliant on production and advanced networking from ESNET, LHCNET and.
Campus High Throughput Computing (HTC) Infrastructures (aka Campus Grids) Dan Fraser OSG Production Coordinator Campus Grids Lead.
Jan 2010 Current OSG Efforts and Status, Grid Deployment Board, Jan 12 th 2010 OSG has weekly Operations and Production Meetings including US ATLAS and.
OSG Area Coordinators Meeting Operations Rob Quick 2/22/2012.
MyOSG: A user-centric information resource for OSG infrastructure data sources Arvind Gopu, Soichi Hayashi, Rob Quick Open Science Grid Operations Center.
Open Science Grid Software Stack, Virtual Data Toolkit and Interoperability Activities D. Olson, LBNL for the OSG International.
Key Project Drivers - FY11 Ruth Pordes, June 15th 2010.
Rsv-control Marco Mambelli – Site Coordination meeting October 1, 2009.
OSG Public Storage and iRODS
OSG Area Coordinators Campus Infrastructures Update Dan Fraser Miha Ahronovitz, Jaime Frey, Rob Gardner, Brooklin Gore, Marco Mambelli, Todd Tannenbaum,
OSG Operations and Interoperations Rob Quick Open Science Grid Operations Center - Indiana University EGEE Operations Meeting Stockholm, Sweden - 14 June.
OSG Services at Tier2 Centers Rob Gardner University of Chicago WLCG Tier2 Workshop CERN June 12-14, 2006.
OSG Site Provide one or more of the following capabilities: – access to local computational resources using a batch queue – interactive access to local.
Integration and Sites Rob Gardner Area Coordinators Meeting 12/4/08.
OSG Middleware Roadmap Rob Gardner University of Chicago OSG / EGEE Operations Workshop CERN June 19-20, 2006.
Publication and Protection of Site Sensitive Information in Grids Shreyas Cholia NERSC Division, Lawrence Berkeley Lab Open Source Grid.
May 8, 20071/15 VO Services Project – Status Report Gabriele Garzoglio VO Services Project – Status Report Overview and Plans May 8, 2007 Computing Division,
G RID M IDDLEWARE AND S ECURITY Suchandra Thapa Computation Institute University of Chicago.
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
Overview of Monitoring and Information Systems in OSG MWGS08 - September 18, Chicago Marco Mambelli - University of Chicago
OSG Area Coordinator’s Report: Workload Management April 20 th, 2011 Maxim Potekhin BNL
OSG Software and Operations Plans Rob Quick OSG Operations Coordinator Alain Roy OSG Software Coordinator.
Mar 28, 20071/9 VO Services Project Gabriele Garzoglio The VO Services Project Don Petravick for Gabriele Garzoglio Computing Division, Fermilab ISGC 2007.
Production Coordination Staff Retreat July 21, 2010 Dan Fraser – Production Coordinator.
May 11, 20091/17 VO Services Project – Stakeholders’ Meeting Gabriele Garzoglio VO Services Project Stakeholders’ Meeting May 11, 2009 Gabriele Garzoglio.
OSG Area Coordinators Meeting Security Team Report Mine Altunay 8/15/2012.
OSG Production Report OSG Area Coordinator’s Meeting Aug 12, 2010 Dan Fraser.
Turning Software Projects into Production Solutions Dan Fraser, PhD Production Coordinator Open Science Grid OU Supercomputing Symposium October 2009.
Job and Data Accounting on the Open Science Grid Ruth Pordes, Fermilab with thanks to Brian Bockelman, Philippe Canal, Chris Green, Rob Quick.
Towards a Global Service Registry for the World-Wide LHC Computing Grid Maria ALANDES, Laurence FIELD, Alessandro DI GIROLAMO CERN IT Department CHEP 2013.
OSG Tier 3 support Marco Mambelli - OSG Tier 3 Dan Fraser - OSG Tier 3 liaison Tanya Levshina - OSG.
OSG PKI Transition: Transition Phase Report Von Welch OSG PKI Transition Lead Indiana University Center for Applied Cybersecurity Research.
Grid Operations Lessons Learned Rob Quick Open Science Grid Operations Center - Indiana University.
OSG Production Report OSG Area Coordinator’s Meeting Nov 17, 2010 Dan Fraser.
Grid Deployment Enabling Grids for E-sciencE BDII 2171 LDAP 2172 LDAP 2173 LDAP 2170 Port Fwd Update DB & Modify DB 2170 Port.
6/23/2005 R. GARDNER OSG Baseline Services 1 OSG Baseline Services In my talk I’d like to discuss two questions:  What capabilities are we aiming for.
Site Validation Session Report Co-Chairs: Piotr Nyczyk, CERN IT/GD Leigh Grundhoefer, IU / OSG Notes from Judy Novak WLCG-OSG-EGEE Workshop CERN, June.
Andrew McNab - Manchester HEP - 17 September 2002 UK Testbed Deployment Aim of this talk is to the answer the questions: –“How much of the Testbed has.
The OSG and Grid Operations Center Rob Quick Open Science Grid Operations Center - Indiana University ATLAS Tier 2-Tier 3 Meeting Bloomington, Indiana.
Jan 2010 OSG Update Grid Deployment Board, Feb 10 th 2010 Now having daily attendance at the WLCG daily operations meeting. Helping in ensuring tickets.
SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.
RSV: OSG Grid Fabric Monitoring and Interoperation with WLCG Monitoring Systems Rob Quick, Arvind Gopu, and Soichi Hayashi Computing in High Energy and.
Production Oct 31, 2012 Dan Fraser. Current Production Focus Transition to RPMs 52(44) sites using RPM based installs 52(44) sites using RPM based installs.
OSG Area Report Production – Operations – Campus Grids Jan 11, 2011 Dan Fraser.
CERN IT Department CH-1211 Geneva 23 Switzerland t A proposal for improving Job Reliability Monitoring GDB 2 nd April 2008.
OSG Site Admin Workshop - Mar 2008Using gLExec to improve security1 OSG Site Administrators Workshop Using gLExec to improve security of Grid jobs by Alain.
ATP Future Directions Availability of historical information for grid resources: It is necessary to store the history of grid resources as these resources.
OSG Report for DOE/NSF Joint Oversight Group U.S. Large Hadron Collider Program OSG Report for DOE/NSF Joint Oversight Group U.S. Large Hadron Collider.
OSG Area Coordinator’s Report: Workload Management March 25 th, 2010 Maxim Potekhin BNL
OSG Area Coordinator’s Report: Workload Management October 6 th, 2010 Maxim Potekhin BNL
OSG Area Coordinator’s Report: Workload Management August 20 th, 2009 Maxim Potekhin BNL
WLCG Information System Use Cases Review WLCG Operations Coordination Meeting 18 th June 2015 Maria Alandes IT/SDC.
Area Coordinator Report for Operations Rob Quick 4/10/2008.
OSG Area Report Production – Operations – Campus Grids June 19, 2012 Dan Fraser Rob Quick.
Operations Area Coordinator Report. 31 Jan Overview Operations Current Initiatives  RSV Version 2  New Probes, Easier Configuration, Improved.
OSG PKI Transition Impact on CMS. Impact on End User After March , DOEGrids CA will stop issuing or renewing certificates. If a user is entitled.
OSG Status and Rob Gardner University of Chicago US ATLAS Tier2 Meeting Harvard University, August 17-18, 2006.
WLCG Accounting Task Force Update Julia Andreeva CERN GDB, 8 th of June,
OSG Area Coordinator’s Report: Workload Management June 3 rd, 2010 Maxim Potekhin BNL
OSG Area Coordinators Meeting Security Team Report Mine Altunay 8/15/2012.
Parag Mhashilkar Computing Division, Fermilab.  Status  Effort Spent  Operations & Support  Phase II: Reasons for Closing the Project  Phase II:
RSV: OSG Grid Monitoring and User Customizable Views Rob Quick, Arvind Gopu, and Soichi Hayashi High Performance Distributed Computing Location: Munich,
OSG Facility Miron Livny OSG Facility Coordinator and PI University of Wisconsin-Madison Open Science Grid Scientific Advisory Group Meeting June 12th.
Open Science Grid Configuring RSV OSG Resource & Service Validation Thomas Wang Grid Operations Center (OSG-GOC) Indiana University.
Why you should care about glexec OSG Site Administrator’s Meeting Written by Igor Sfiligoi Presented by Alain Roy Hint: It’s about security.
Grid Colombia Workshop with OSG Week 2 Startup Rob Gardner University of Chicago October 26, 2009.
OSG User Group August 14, Progress since last meeting OSG Users meeting at BNL (Jun 16-17) –Core Discussions on: Workload Management; Security.
POW MND section.
Presentation transcript:

Production Coordination Area VO Meeting Feb 11, 2009 Dan Fraser – Production Coordinator

OSG Health Monitoring Overview Weekly Calls Data movement Job/Error ratios nGraphs

A New Visual for DOE DOE Display Created by Brian Bockelman & Soichi Hayashi

Some Production Examples… Effort from the entire team CERN BDII stale entries problem fixed CERN BDII stale entries problem fixed GGUS-GOC Ticket automation GGUS-GOC Ticket automation LIGO Production at an all time high (Rob E.) LIGO Production at an all time high (Rob E.) Currently running at Rank #1 on OSG Looking into added stress to some NFS systems SBGRID now at ~3000 parallel jobs SBGRID now at ~3000 parallel jobs Zero length CRL problem resolved Zero length CRL problem resolved …

A View from the Production Coordinator What are the biggest problems in OSG? Production is currently stable Production is currently stable Need to keep it that way! (limit changes) Supporting VO’s I difficult Supporting VO’s I difficult How to get opportunistic storage Current method is to talk to each site… Current method is to talk to each site… New strategies being explored (Tanya, Brian, Dan) New strategies being explored (Tanya, Brian, Dan) Lowering the barrier for scaling across sites Site differences often require site-by-site investigation Site differences often require site-by-site investigation OSG operated Pilot capability could be a big win!! OSG operated Pilot capability could be a big win!!

Job count (2 weeks)

Wasted%20Hours%20By%20User%20and%20Site

asted%20Hours%20By%20User%20and%20Site

Solving Production Problems Solving problems is a TEAM sport The weekly production call has key people from all the teams that are needed to solve problems CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics Problems accurately prioritized and channeled to the correct avenue Sometimes solved on the call. Forewarning to prepare for upcoming issues.

Example Problems Handling of job pre-emption (LIGO / D0) VO Package Validation probe needed GIP “truth in advertising” GIP “truth in advertising” LIGO switch to GT2 and also Condor-G job submission Condor scaling limits in GridMon (Atlas) Globus LSF gatekeeper bug (D0/CMS) Security Drill successes (for T1) Gratia probe introduction & ITB testing

Example Issues cont. STEP09 monitoring (partially successful) IceCube management of opportunistic storage Gratia file transfer data catch up Transition from VORS to myOSG New location for RSV probes and ability to update from the “production” cache Also, ensure that config_OSG does not update the probes automatically Also, ensure that config_OSG does not update the probes automatically Root Cause Analysis of CMS BDII outage

Example Issues cont. Plan to localize data transfer information and upload summary transfer packets. Globus memory leak was causing frequent reboots at BNL. Site name mapping problem to enable different names internal to OSG. OIM display difference (http vs https) Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Example Issues cont. Condor problem with directory creation in a multiple gateway scenario. (Nebraska) Gratia collector problem with handling records that accumulate faster than they can be processed. LIGO/Pegasus transition to use BDII data instead of central probe data. LIGO/Pegasus transition to use BDII data instead of central probe data.