OSG Production Report OSG Area Coordinator’s Meeting Aug 12, 2010 Dan Fraser.


Some Production Examples…
- Effort from the entire team
  - Gratia problem Root Cause Analysis
    - Still discovering failure modes
    - New alarms to detect rate errors (several iterations)
    - Updating SLA in process
    - T1 sites using new Gratia Transfer collector
  - SE-only solution for Atlas T3s
  - CERN BDII stopped reporting (RG data limit exceeded)
    - CERN BDII not a high priority (pushed but no movement)
    - OSG Operations tests detecting & notifying CERN of CERN BDII failures
  - ITIL like processes for Operations
    - Updated process being designed for BDII management (recent failure in BDII management)
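The slide does not say what the new rate alarms actually check. One possible reading, sketched here purely as an illustration, is an alarm that fires when the number of accounting records received in an hour drops well below the recent average. The function name `rate_alarms`, the three-hour window, the 50% threshold, and the sample counts are all assumptions made for this example and are not taken from Gratia or OSG Operations.

```python
def rate_alarms(hourly_counts, window=3, min_fraction=0.5):
    """Return indices of hours whose record count falls below
    min_fraction of the mean of the preceding `window` hours.

    Hypothetical helper: the window and threshold are illustrative only.
    """
    alarms = []
    for i in range(window, len(hourly_counts)):
        # Baseline is the plain average of the trailing window.
        baseline = sum(hourly_counts[i - window:i]) / window
        if baseline > 0 and hourly_counts[i] < min_fraction * baseline:
            alarms.append(i)
    return alarms


# Made-up hourly record counts with one sharp drop at index 4.
counts = [1000, 980, 1020, 990, 300, 950]
print(rate_alarms(counts))
```

A trailing-window baseline is only one choice; the "several iterations" noted on the slide suggest the real thresholds needed tuning, which a fixed fraction like this would also require.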

More Production Examples…
- Effort from the entire team
  - Updated process for CA testing prior to production
  - CEMON issues (hanging the BDII)
  - Transitioned sites to use the new Gratia collector address
  - Urgent security updates for sites running Condor/Gratia

Updated View from Production
- VOs are getting very effective at using OSG
  - Dzero hitting new production capabilities
  - LIGO often at rank #1
  - SBGrid was hitting 7,000 simultaneous jobs
  - New VOs being encouraged to use pilots
    - OSG will support these in the UCSD Factory
- Opportunistic storage is the #1 problem
  - A very difficult problem
  - Will take time to fix, some development ideas being worked on (draft requirements just completed)
- Working with Atlas to make more opportunistic cycles available. (T1, Indiana T2, Illinois T3…)
  - Atlas SWT2 looking Good

The End
Slides from some past presentations below

OSG Health Monitoring
- All links now on the production page
  - Usage Charts
  - Weekly Calls
  - OSG Data movement
  - Job/Error ratios
  - DOE display showing last 24 hours
  - … and much more …

Solving Production Problems
- Solving problems is a TEAM sport
- The weekly production call has key people from all the teams that are needed to solve problems
  - CMS, Atlas, LIGO, VOs, Engage, Integration, Sites, STG, Security, Operations, Metrics
- Problems accurately prioritized and channeled to the correct avenue
- Sometimes solved on the call.
- Forewarning to prepare for upcoming issues.

Example Problems
- Handling of job pre-emption (LIGO / D0)
- VO Package Validation probe needed
- GIP “truth in advertising”
- LIGO switch to GT2 and also Condor-G job submission
- Condor scaling limits in GridMon (Atlas)
- Globus LSF gatekeeper bug (D0/CMS)
- Security Drill successes (for T1)
- Gratia probe introduction & ITB testing

Example Issues cont.
- STEP09 monitoring (partially successful)
- IceCube management of opportunistic storage
- Gratia file transfer data catch up
- Transition from VORS to myOSG
- New location for RSV probes and ability to update from the “production” cache
  - Also, ensure that config_OSG does not update the probes automatically
- Root Cause Analysis of CMS BDII outage

Example Issues cont.
- Plan to localize data transfer information and upload summary transfer packets.
- Globus memory leak was causing frequent reboots at BNL.
- Site name mapping problem to enable different names internal to OSG.
- OIM display difference (http vs https)
- Site admin meeting & materials prep to help sites upgrade to OSG 1.2.

Example Issues cont.
- Condor problem with directory creation in a multiple gateway scenario. (Nebraska)
- Gratia collector problem with handling records that accumulate faster than they can be processed.
- LIGO/Pegasus transition to use BDII data instead of central probe data.
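The Gratia collector item describes a general queueing situation: when records arrive faster than they are processed, the backlog grows without bound. The sketch below is a toy model of that effect, not Gratia code. The function name `backlog_over_time` and all rates are hypothetical values chosen for illustration.

```python
def backlog_over_time(arrivals_per_tick, capacity_per_tick, ticks):
    """Return the number of unprocessed records after each tick,
    for a fixed arrival rate and a fixed processing capacity.

    Toy model only; real collectors have bursty, variable rates.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick                 # new records arrive
        backlog -= min(backlog, capacity_per_tick)   # process as many as capacity allows
        history.append(backlog)
    return history


# Arrivals exceed capacity: the backlog grows by 20 records every tick.
print(backlog_over_time(arrivals_per_tick=120, capacity_per_tick=100, ticks=5))

# Capacity exceeds arrivals: the backlog stays at zero.
print(backlog_over_time(arrivals_per_tick=80, capacity_per_tick=100, ticks=5))
```

In this model the only ways to stop the growth are to raise processing capacity or reduce the arrival rate, for example by sending summarized records.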