Supporting Grid Environments


Supporting Grid Environments Leigh Grundhoefer Indiana University leighg@indiana.edu Thank you for inviting me to discuss operations issues for grid environments. I have heard a lot of good presentations.

Agenda Introduction Grid3 environment Operations model and implementation Conclusions 9 December 2018 leighg@indiana.edu

Grid Support What is the structure for the support? What kind of infrastructure? Definition of “instrumentation” software Deployment policies and procedures Error-handling methods To reduce duplication of effort, integrate grid support with a variable set of existing resource-provider support mechanisms, interfacing support staff with grid experts

Integrating grid support (diagram): facility operations and support at the NOC link grid operations with the security czar, the network engineers (“network gods”), the system administrators, and the resources themselves.

Grid Operations Mission Deploy, maintain, and operate a grid environment as a NOC manages an inter-network, providing a single point of operations for configuration support; monitoring of status and usage (current and historical); problem management; support for users, developers, and systems administrators; provision of grid services; security incident response; and maintenance of grid information repositories. Proposed areas of research: access control and policy (security); trouble ticket system (problem coordination); configuration and information services; health and status monitoring; experiment scheduling

Agenda Introduction Grid3 environment Operations model and implementation Conclusions

Grid3: an application grid laboratory CERN LHC: US ATLAS testbeds & data challenges CERN LHC: US CMS testbeds & data challenges end-to-end HENP applications virtual data research - Virtual Data Toolkit virtual data grid laboratory - Software/Operations/Facilities

Grid3 Overview Grid environment built from core Globus 2 and Condor middleware, as delivered through the Virtual Data Toolkit (VDT) and added to a compute cluster or storage resource. Multi-VO based security (Virtual Organization Membership Service) No shell access to grid resources, no grid-based privileged access Monitoring instrumentation and service metrics defined by the Project Plan Currently 32 sites with opportunistic use of ~3200 CPUs Delivering the US LHC Data Challenges

Integrated Monitoring Framework Globus Meta Directory System (LDAP directory) MonALISA, Monitoring Agents using a Large Integrated Services Architecture (pub/sub) MonALISA repository (WS/WAP) Ganglia performance monitoring (multicast/hierarchical) Job Monitoring System at the Advanced Center for Distributed Computing (non-invasive archive) The Grid Site Status Cataloging System at the iGOC (human/automatically managed DB) Our instrumentation resides in the monitoring framework, which is partially displayed here. (The Buffalo CCR job monitoring is a separate but equal framework.)
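The catalog side of such a framework boils down to merging per-site status reports from several independent monitors. A minimal sketch, with all source and field names hypothetical (the real framework used MDS, MonALISA, Ganglia, and GridCat as listed above):

```python
# Sketch: merge per-site status reports from several monitoring
# sources into one catalog entry per site, keeping the most recent
# report. All names here are hypothetical, not Grid3's actual schema.
from dataclasses import dataclass

@dataclass
class Report:
    site: str       # site identifier
    source: str     # e.g. "mds", "monalisa", "ganglia"
    timestamp: int  # seconds since epoch
    status: str     # "ok" or "fail"

def build_catalog(reports):
    """Return {site: latest Report}, preferring newer timestamps."""
    catalog = {}
    for r in reports:
        current = catalog.get(r.site)
        if current is None or r.timestamp > current.timestamp:
            catalog[r.site] = r
    return catalog

reports = [
    Report("iu_atlas", "mds", 100, "ok"),
    Report("iu_atlas", "monalisa", 160, "fail"),
    Report("uc_grid3", "ganglia", 120, "ok"),
]
catalog = build_catalog(reports)
```

Keeping the newest report per site regardless of source is one reason redundant monitors help: a stale "ok" from one source is overridden by a fresher "fail" from another.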

Grid3 – Monitoring Snapshots Service monitoring: GridCat The GridCat catalog software is freely available for download; it assists operations in visually reviewing the status of the grid software at each of the grid sites. MonALISA

Grid3 – Monitoring Snapshots Job monitoring: ACDC job monitoring

Agenda Introduction Grid3 environment Grid Operations model and implementation Conclusions

Grid Operations Approach The Operations group sets up and maintains a cooperative grid community; facilitates work to and among responsible agents; has no direct control: uses notification with follow-ups; tunes services to the capabilities of the sites. Cooperative and mentoring principles are employed: identify the community vision, i.e. the Project Plan (anchor); use a participatory decision-making process (task force); make clear agreements (service descriptions and MOUs); make clear communication and conflict resolution a priority; hold weekly operations (problem-solving) and management teleconferences. The main point of this slide is that this is a cooperative, facilitated effort: the GOC facilitates but has no direct control.

Service Desk Activities A common face to collaboratively provided support Facilitate and support communications: direct email with site administrators and Grid users; web page resources; status reporting to mailing list Monitor status of Grid resources Coordinate and track: problems; changes (software updates, resource additions); security incidents; requests for assistance

Service Desk Activities (cont.) Provide reports Problem summaries, service desk activity Maintain the repository of support and process information User support, such as: How to join a VO How to get and maintain a cert How to run an application How to use monitoring tools Troubleshooting application failures Information about policies, etc.

Provisioning Create and maintain the grid-controlled software packages and cache Provide site software not supported through VDT Verify software compatibility Provide ease-of-installation tools Develop instructions on how to plug things together Provide site installation and configuration support End-to-end troubleshooting for resources Provide and maintain common Grid services such as VOMS, GIIS, RLS, archives, and monitoring systems
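The software-compatibility verification mentioned above can be as simple as comparing each site's reported middleware version against the supported set. A sketch; the site names and version strings below are made up, not actual Grid3 data:

```python
# Sketch: flag sites whose reported VDT version falls outside the
# supported set. Site names and version strings are made up.
SUPPORTED_VDT = {"1.1.14", "1.2.0"}

def incompatible_sites(site_versions):
    """Return sorted names of sites reporting an unsupported version."""
    return sorted(site for site, version in site_versions.items()
                  if version not in SUPPORTED_VDT)

sites = {"site_a": "1.2.0", "site_b": "1.0.6", "site_c": "1.1.14"}
bad = incompatible_sites(sites)
```

Sites flagged this way would then get the installation and configuration support the slide describes.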

Leveraging the NOC Global NOC at Indiana University The Global NOC provides 24x7 network engineering and operations services for research and education networks and international interconnections, including Internet2 Abilene, National LambdaRail, TransPAC and AMPATH networks, the STAR TAP and MANLAN layer 3 international exchange points, and the STAR LIGHT optical exchange. In addition, the Global NOC supports activities of the iVDGL Grid Operations Center and the REN-ISAC cybersecurity Watch Desk. By virtue of the R&E network, grid, and cybersecurity activities, the Global NOC possesses a unique and embracing view of R&E cyberinfrastructure.

Monitoring the GOC services (diagram): NOC monitoring (Nagios) checks the grid systems and services every 15 minutes; failures become trouble tickets (e.g. ticket 894), with contacts drawn from the NOC contact DB.
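That check-and-ticket cycle might look like the following sketch. The check functions, contact addresses, and ticket numbering are hypothetical (the real GOC used Nagios for the checks); ticket 894 is borrowed from the slide:

```python
# Sketch of the 15-minute check cycle: run each service check and
# open a trouble ticket for any failure, looking up the responsible
# contact in a contact DB. All names and addresses are hypothetical.
import itertools

CONTACT_DB = {"giis": "grid-admin@example.edu",
              "voms": "vo-admin@example.edu"}
_ticket_ids = itertools.count(894)  # ticket number seen in the slide

def run_checks(checks, tickets):
    """checks: {service: callable returning True if healthy}."""
    for service, check in checks.items():
        if not check():
            tickets.append({
                "id": next(_ticket_ids),
                "service": service,
                "contact": CONTACT_DB.get(service, "goc@example.edu"),
            })

tickets = []
run_checks({"giis": lambda: True, "voms": lambda: False}, tickets)
```

In the real setup the scheduler (Nagios) drives the checks; the point of the sketch is only the notification path from failed check to ticket to contact.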

http://www.ivdgl.org/grid3

Problem to Trouble Ticket Scope: a single resource / multiple resources; application-wide; VO-wide; grid-wide; operations resource / operations service. Severity: Critical, High, Elevated, Normal. Problem owner; problem contact; problem description
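The scope and severity fields on this slide suggest a ticket record along these lines. This is a sketch using the slide's values; the field names and class layout are assumptions, not the GOC's actual schema:

```python
# Sketch of a trouble-ticket record built from the scope and severity
# values on the slide; field names are assumptions, not the real schema.
from dataclasses import dataclass
from enum import Enum

class Scope(Enum):
    SINGLE_RESOURCE = "single resource"
    MULTIPLE_RESOURCES = "multiple resources"
    APPLICATION_WIDE = "application wide"
    VO_WIDE = "VO wide"
    GRID_WIDE = "grid wide"
    OPERATIONS_RESOURCE = "operations resource"
    OPERATIONS_SERVICE = "operations service"

class Severity(Enum):
    CRITICAL = 1
    HIGH = 2
    ELEVATED = 3
    NORMAL = 4

@dataclass
class TroubleTicket:
    scope: Scope
    severity: Severity
    owner: str        # problem owner
    contact: str      # problem contact
    description: str  # problem description

t = TroubleTicket(Scope.VO_WIDE, Severity.HIGH,
                  "iGOC", "admin@example.edu",
                  "VOMS server unreachable from two sites")
```

Encoding severity as an ordered enum lets the service desk sort a queue by urgency directly.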

Security/Incident Handling (diagram): the NOC monitors the grid catalog map; when a site fails the grid catalog test (run every 5 hours), a trouble ticket (e.g. ticket 854) is opened and routed through the contact DB to the grid experts and to VO support or the facility for the affected resource; GOC monitoring draws on GridCat and MonALISA, and security incidents follow the security/incident-handling path.

Reactive Support Workflow (diagram): users and administrators report application failures, planned outages, security problems, installation help, configuration assistance, identity management, and authorization problems to the GOC by email (igoc@ivdgl.org), web form, or telephone; each report becomes a trouble ticket (e.g. tickets 803, 823, 833, 843) routed via the contact DB to grid experts, web docs, developers, other support centers, or security/incident handling.

Analysis of Effort by Area ~800 tickets total. Issues relating to resource owners and providers: 60%. Special issues for Virtual Organizations (VOs): 20%. Issues relating to developers of applications and workflow environments (portals): 10%. Support to individuals using Grid resources: 10%. Developer issues include, e.g., “VDS doesn’t work” and how to get information in order to do their own resource brokering.
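On roughly 800 tickets, those percentages work out as follows (the total is approximate in the slide, so these counts are indicative only):

```python
# Rough ticket counts implied by the slide's percentages of ~800 tickets.
TOTAL = 800
SHARES = {
    "resource owners/providers": 0.60,
    "VO-specific issues": 0.20,
    "application/portal developers": 0.10,
    "individual users": 0.10,
}
counts = {area: round(TOTAL * share) for area, share in SHARES.items()}
```

So resource-owner and provider issues alone account for nearly 500 tickets, which matches the slide's emphasis on site support in the provisioning and service-desk activities.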

Agenda Introduction Grid3 environment Operations model and implementation Conclusions

Operations Enables Applications Provide operational services that give applications the “instruments” to: publish site policies and environment; know the status of grid middleware on sites; know the job queue for compute resources; know the status and load of grid resources; access historical monitoring information; manage grid services; keep apprised of security incidents in the collaboration

Lessons Learned Configuration management efforts in the development and deployment areas are rewarded many times over during production. A monitoring infrastructure provides a significant problem-solving advantage, especially redundant monitoring. Establishing clear communications between resource providers, users, and Virtual Organizations is hard.

More Lessons Learned Human interactions in grid building are costly. Keeping resource-provider requirements light led to heavy loads on the gatekeeper hosts (monitoring framework). A diverse set of resource configurations made exchanging job requirements difficult. Troubleshooting: efficiency for submitted jobs was not as high as we would like.

Upcoming Challenges Shared problem handling with application-centric and VO-centric support structures Ticket passing to and from other Grid environments Establishing a working monitoring framework for distributed storage resources and virtual data cataloging infrastructure

Thank You - Questions?