Monitoring the Earth System Grid with MDS4
Ann Chervenak, USC Information Sciences Institute
Jennifer M. Schopf, Laura Pearlman, Mei-Hui Su, Shishir Bharathi, Luca Cinquini, Mike D'Arcy, Neill Miller, David Bernholdt

Talk Outline
- Overview of the Earth System Grid
- Overview of Monitoring in the Globus Toolkit
- Globus Monitoring Services in ESG
  - Monitoring and Discovery System
  - Trigger Service
- Summary

The Earth System Grid: Turning Climate Datasets into Community Resources

The growing importance of climate simulation data
- DOE invests broadly in climate change research:
  - Development of climate models
  - Climate change simulation
  - Model intercomparisons
  - Observational programs
- Climate change research is increasingly data-intensive:
  - Analysis and intercomparison of simulations and observations from many sources
  - Data used by model developers, impacts analysts, and policymakers
[Figure: Results from the Parallel Climate Model (PCM) depicting wind vectors, surface pressure, sea surface temperature, and sea ice concentration. Prepared from data published in the ESG using the FERRET analysis tool by Gary Strand, NCAR.]
Slide courtesy of Dave Bernholdt, ORNL

Earth System Grid objectives
To support the infrastructural needs of the national and international climate community, ESG provides crucial technology to securely access, monitor, catalog, transport, and distribute data in today's grid computing environment.
[Figure: HPC hardware running climate models, ESG sites, and the ESG portal]
Slide courtesy of Dave Bernholdt, ORNL

ESG facts and figures

Main ESG Portal
- 146 TB of data at four locations
- 1,059 datasets; 958,072 files
- Includes the past 6 years of joint DOE/NSF climate modeling experiments
- 4,910 registered users
- Downloads to date: 30 TB; 106,572 files

IPCC AR4 ESG Portal
- 35 TB of data at one location
- 77,400 files
- Generated by a modeling campaign coordinated by the Intergovernmental Panel on Climate Change
- Model data from 13 countries
- 1,245 registered analysis projects
- Downloads to date: 245 TB; 914,400 files; 500 GB/day (average)

More than 300 scientific papers have been published to date based on analysis of IPCC AR4 data.
[Figures: Worldwide ESG user base; IPCC daily downloads (through 7/2/07)]
Slide courtesy of Dave Bernholdt, ORNL

ESG architecture and underlying technologies
- Climate data tools
  - Metadata catalog
  - NcML (metadata schema)
  - OPeNDAP-G (aggregation and subsetting)
- Data management
  - Data Mover Lite (DML)
  - Storage Resource Manager (SRM)
- Globus Toolkit
  - Globus security infrastructure
  - GridFTP
  - Monitoring and Discovery Services
  - Replica Location Service (RLS)
- Security
  - Access control
  - MyProxy
  - User registration
[Figure: First Generation ESG Architecture. The ESG web portal (search, browse, download, data publishing, monitoring services, climate metadata catalogs, usage metrics) is backed by RLS and SRM services and disk caches at NCAR, ORNL, LANL, and LBNL/NERSC; MSS and HPSS are tertiary data storage systems.]
Slide courtesy of Dave Bernholdt, ORNL

Evolving ESG to petascale
[Figure: ESG Data System Evolution, showing the ESG data archive growing from terabytes (CCSM, IPCC) in 2006 to petabytes (CCSM, IPCC, satellite, in situ, biogeochemistry, ecosystems)]
- 2006: central database; centralized curated data archive; time aggregation; distribution by file transport; no ESG responsibility for analysis; shopping-cart-oriented web portal
- Testbed: testbed data sharing; federated metadata; federated portals; unified user interface; selected server-side analysis; location independence; distributed aggregation; manual data sharing; manual publishing
- Petascale (adds to the testbed capabilities): full data sharing; synchronized federation of metadata and data; full suite of server-side analysis; model/observation integration; ESG embedded into desktop productivity tools; GIS integration; model intercomparison metrics; user support and life-cycle maintenance
Slide courtesy of Dave Bernholdt, ORNL

Architecture of the next-generation ESG
- Petascale data archives
- Broader geographical distribution of archives
  - across the United States
  - around the world
- Easy federation of sites
- Increased flexibility and robustness
[Figure: Second Generation ESG Architecture. A federated ESG deployment of gateways (CCES, IPCC, CCSM), each with web portal interfaces, applications, and data and metadata holdings; ESG nodes with online data distribution, deep archives, and CPU; clients including browsers, web portals, and remote applications (CDAT, NCL, Ferret, GIS, publishing, OPeNDAP, DML, modeling); local, remote, and web services interfaces; application components (data transfer, data publishing, search, analysis, visualization, post-processing, computation); cross-cutting concerns (security, logging, monitoring); workflow and orchestration.]
Slide courtesy of Dave Bernholdt, ORNL

The team and sponsors
- National Center for Atmospheric Research
- Los Alamos National Laboratory
- Argonne National Laboratory
- Oak Ridge National Laboratory
- USC Information Sciences Institute
- Lawrence Livermore National Laboratory / PCMDI
- Lawrence Berkeley National Laboratory
- National Oceanic & Atmospheric Administration / PMEL
[Map legend: climate data repository and ESG participant; ESG participant]
Slide courtesy of Dave Bernholdt, ORNL

Monitoring ESG
- ESG consists of heterogeneous components deployed across multiple administrative domains
- The climate community has come to depend on the ESG infrastructure as a critical resource
  - Failures of ESG components or services can disrupt the work of many scientists
  - Need to minimize infrastructure downtime
- Monitoring components to determine their current state and detect failures is essential
- Monitoring systems:
  - Collect, aggregate, and sometimes act upon data describing system state
  - Monitoring can help users make resource selection decisions and help administrators detect problems

GT4 Monitoring and Discovery System
- A Web service adhering to the Web Services Resource Framework (WSRF) standards
- Consists of two higher-level services:
  - The Index Service collects and publishes aggregated information about Grid resources
  - The Trigger Service collects resource information from the Index Service and performs actions when certain trigger conditions are met
- Information about resources is obtained from external components called information providers
  - Currently in ESG, these are simple scripts and programs that check the status of services (see the sketch below)
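To make the information-provider idea concrete, here is a minimal sketch, not the actual ESG provider code: it probes a service endpoint and emits a simple status record that an aggregator could publish. The host names, port numbers, and output format are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal sketch of an ESG-style "information provider": probe a service
# endpoint and report its status. Hosts, ports, and the output format are
# illustrative; real MDS4 providers emit XML consumed by the Index Service.
import socket
import time

SERVICES = [
    # (service name, host, port) -- hypothetical endpoints
    ("GridFTP", "gridftp.example.org", 2811),
    ("RLS", "rls.example.org", 39281),
]

def check_tcp(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def main():
    timestamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    for name, host, port in SERVICES:
        status = "UP" if check_tcp(host, port) else "DOWN"
        # A real provider would wrap this in the schema expected by MDS4.
        print(f"{timestamp} {name} {host}:{port} {status}")

if __name__ == "__main__":
    main()
```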

ESG Services Currently Monitored
- GridFTP server: NCAR
- OPeNDAP server: NCAR
- Web portal: NCAR
- HTTP data server: LANL, NCAR
- RLS servers: LANL, LBNL, NCAR, ORNL
- Storage Resource Managers: LBNL, NCAR, ORNL
- Hierarchical mass storage systems: LBNL, NCAR, ORNL

Monitoring Overall System Status
- Monitored data are collected in the MDS4 Index Service
- Information providers check resource status at a configured frequency (currently every 10 minutes) and report status to the Index Service
- The resource information in the Index Service is queried by the ESG web portal
- Used to generate an overall picture of the state of ESG resources, displayed on the ESG web portal page (a sketch of assembling such a summary follows)
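As a sketch of how a portal-side summary along these lines could be assembled, not ESG's actual portal code, the snippet below groups the latest status reports by site and service to produce an at-a-glance view; the record format is an assumption.

```python
# Sketch of building an overall status view from collected monitoring records.
# The record format (site, service, status) is an illustrative assumption.
from collections import defaultdict

# Latest reports as they might be pulled from an index of monitoring data.
reports = [
    ("NCAR", "GridFTP", "UP"),
    ("NCAR", "OPeNDAP", "UP"),
    ("ORNL", "RLS", "DOWN"),
    ("LBNL", "SRM", "UP"),
]

def overall_status(reports):
    """Group the latest per-service reports by site for display."""
    by_site = defaultdict(dict)
    for site, service, status in reports:
        by_site[site][service] = status
    return dict(by_site)

for site, services in overall_status(reports).items():
    line = ", ".join(f"{svc}: {st}" for svc, st in services.items())
    print(f"{site}: {line}")
```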

Trigger Actions Based on Monitoring Information
- The MDS4 Trigger Service periodically polls the Index Service
- Based on the current resource status, the Trigger Service determines whether specified trigger rules and conditions are satisfied
  - If so, it performs the specified action for each trigger
- Current action: the Trigger Service sends email to system administrators when services fail (see the sketch below)
  - Ideally, system failures can be detected and corrected by administrators before they affect the larger ESG community
- Future plans: include richer recovery operations as trigger actions, e.g., automatic restart of failed services
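The following is a minimal sketch of the polling-and-notification pattern described above, not the MDS4 Trigger Service itself: it reads service status records, applies a simple trigger condition, and emails administrators on failure. The status source, addresses, and SMTP host are assumptions for illustration.

```python
# Sketch of a trigger loop in the spirit of the MDS4 Trigger Service:
# poll current service status, and notify administrators when a service
# is reported as failed. All endpoints and addresses are hypothetical.
import smtplib
import time
from email.message import EmailMessage

ADMIN_ADDRESS = "esg-admins@example.org"   # assumed notification list
SMTP_HOST = "localhost"                    # assumed local mail relay
POLL_INTERVAL = 600                        # seconds, matching the 10-minute cycle

def get_service_status():
    """Placeholder for querying aggregated status (e.g., from an index service)."""
    return {"GridFTP@NCAR": "UP", "RLS@ORNL": "DOWN"}

def notify(service, status):
    msg = EmailMessage()
    # Putting the service name and failure type in the subject lets admins
    # filter messages quickly (a lesson noted later in this talk).
    msg["Subject"] = f"[ESG monitor] {service} reported {status}"
    msg["From"] = "esg-monitor@example.org"
    msg["To"] = ADMIN_ADDRESS
    msg.set_content(f"Service {service} is {status} as of {time.ctime()}.")
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

while True:
    for service, status in get_service_status().items():
        if status == "DOWN":          # the trigger condition
            notify(service, status)   # the trigger action
    time.sleep(POLL_INTERVAL)
```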

Example Monitoring Information (total error messages for May)
- Messages related to certificate and configuration problems at LANL: 38
- Failure messages due to a brief interruption in network service at ORNL on 5/13: 2
- HTTP data server failure at NCAR on 5/17: 1
- RLS failure at LLNL on 5/22: 1
- Simultaneous error messages for SRM services at NCAR, ORNL, and LBNL on 5/23: 3
- RLS failure at ORNL on 5/24: 1
- RLS failure at LBNL on 5/31: 1

Successes and Lessons Learned in ESG Monitoring
- Overview of current system state for users and system administrators
  - The ESG portal displays an overall picture of the current status of the ESG infrastructure
  - Gives users and administrators an understanding at a glance of which resources and services are currently available
- Failure notification
  - Failure messages from the Trigger Service have helped system administrators identify and quickly address failed components and services
  - Before the monitoring system was deployed, services would fail and might not be detected until a user tried to access an ESG dataset
  - The MDS4 deployment has enabled a unified interface and notification system across ESG resources

Successes and Lessons Learned in ESG Monitoring (cont.)
- More information was needed on failure types
  - An enhancement to MDS4 based on our experience: include additional information about the location and type of the failed service in the subject line of trigger notification messages
  - This allows message recipients to filter messages and quickly identify which services need attention

Successes and Lessons Learned in ESG Monitoring (cont.)
- Validation of new deployments
  - We sometimes make significant changes to the Grid infrastructure, e.g., modifying service configurations or deploying a new version of a component
  - These changes may produce a series of failure messages for particular classes of components over a period of days or weeks
  - Example: a pattern of failure messages for RLS servers that corresponded to a configuration problem related to updates among the services
  - Example: a series of SRM failure messages relating to a new feature that had unexpected behavior
  - Monitoring messages helped to identify problems with these newly deployed or reconfigured services
  - The absence of failure messages can, in part, validate a new configuration or deployment

Successes and Lessons Learned in ESG Monitoring (cont.)
- Failure deduction
  - The monitoring system can be used to deduce the reasons for complex failures: system-wide monitoring can detect a pattern of failures that occur close together in time and point to a problem at a different level of the system
  - Example: we used MDS4 to gain insight into why the ESG portal occasionally crashed due to a lack of available file descriptors, using the monitoring infrastructure to check file descriptor usage by the different services running on the portal (a sketch of such a check follows)
  - Example: failure messages indicated that SRMs at three different locations had failed simultaneously; because simultaneous independent failures are highly unlikely, we investigated and found a problem with a query expression in our monitoring software
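For the file-descriptor example above, a check along these lines could be run on a Linux host; this is a sketch, not the provider ESG actually deployed, and the process-name filter is an assumption.

```python
# Sketch of a file-descriptor usage check on Linux: count open descriptors
# per process by reading /proc/<pid>/fd. The name filter is illustrative.
import os

def fd_counts(name_filter=""):
    """Return {pid: open_fd_count} for processes whose command line matches."""
    counts = {}
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/cmdline", "rb") as f:
                cmdline = f.read().replace(b"\0", b" ").decode(errors="replace")
            if name_filter in cmdline:
                counts[int(pid)] = len(os.listdir(f"/proc/{pid}/fd"))
        except (FileNotFoundError, PermissionError):
            continue   # process exited or is not accessible
    return counts

# Example: descriptor usage of portal-related Java services (hypothetical filter).
print(fd_counts("java"))
```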

Successes and Lessons Learned in ESG Monitoring (cont.)
- Warning of certificate problems and imminent expirations
  - All ESG services at the LANL site reported failures simultaneously; the problem was expiration of the host certificate for the ESG node at that site
  - Downtime resulted while the problem was diagnosed and administrators requested and installed a new host certificate
  - To avoid such downtime in the future, we implemented additional information providers and triggers that check the expiration dates of host certificates on services where this information can be queried
  - The Trigger Service now informs system administrators when a certificate expiration is imminent (a sketch of such a check follows)
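As an illustration of the kind of certificate check described above, and not the actual ESG provider, the sketch below reads a host certificate's expiration date with the standard `openssl x509 -enddate` command and flags certificates that expire within a warning window. The certificate path and threshold are assumptions.

```python
# Sketch of a certificate-expiration check: parse the notAfter date from a
# host certificate and warn when expiration is within the threshold.
# The certificate path and 14-day threshold are illustrative choices.
import subprocess
from datetime import datetime, timedelta, timezone

CERT_PATH = "/etc/grid-security/hostcert.pem"   # assumed certificate location
WARN_WINDOW = timedelta(days=14)

def cert_not_after(path):
    """Return the certificate's notAfter timestamp as a timezone-aware datetime."""
    out = subprocess.run(
        ["openssl", "x509", "-enddate", "-noout", "-in", path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    # openssl prints e.g. "notAfter=May 23 12:00:00 2008 GMT"
    date_str = out.split("=", 1)[1]
    parsed = datetime.strptime(date_str, "%b %d %H:%M:%S %Y %Z")
    return parsed.replace(tzinfo=timezone.utc)

def check():
    expires = cert_not_after(CERT_PATH)
    remaining = expires - datetime.now(timezone.utc)
    if remaining < WARN_WINDOW:
        # A real deployment would publish this to the Index Service so the
        # Trigger Service can email administrators.
        print(f"WARNING: host certificate expires in {remaining.days} days ({expires})")
    else:
        print(f"OK: host certificate valid until {expires}")

if __name__ == "__main__":
    check()
```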

Successes and Lessons Learned in ESG Monitoring (cont.)
- Scheduled downtime
  - When a site has scheduled downtime for maintenance, there is no need to send failure messages to system administrators
  - We developed a simple mechanism that disables particular triggers for the specified downtime period
  - The monitoring infrastructure still collects information about service state during this period, but failure conditions do not trigger actions by the Trigger Service (see the sketch below)
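A minimal sketch of the downtime-suppression idea, assuming a simple schedule of per-service maintenance windows; ESG's actual configuration format is not shown here.

```python
# Sketch of suppressing trigger actions during scheduled downtime windows.
# Monitoring data are still collected; only the notification is skipped.
from datetime import datetime

# Hypothetical downtime schedule: service name -> (start, end) of maintenance.
DOWNTIME = {
    "SRM@NCAR": (datetime(2007, 6, 1, 8, 0), datetime(2007, 6, 1, 12, 0)),
}

def in_scheduled_downtime(service, now=None):
    now = now or datetime.now()
    window = DOWNTIME.get(service)
    return window is not None and window[0] <= now <= window[1]

def handle_failure(service, notify):
    """Record the failure, but only notify admins outside downtime windows."""
    print(f"recorded failure of {service}")      # state is still collected
    if not in_scheduled_downtime(service):
        notify(service)                           # trigger action otherwise suppressed

# Example usage:
# handle_failure("SRM@NCAR", notify=lambda s: print(f"email admins about {s}"))
```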

Acknowledgements
- ESG is funded by the US Department of Energy under the Scientific Discovery through Advanced Computing (SciDAC) program
- MDS is funded by the US National Science Foundation under the Office of Cyberinfrastructure
- The ESG team includes:
  - National Center for Atmospheric Research: Don Middleton, Luca Cinquini, Rob Markel, Peter Fox, Jose Garcia, and others
  - Lawrence Livermore National Laboratory: Dean Williams, Bob Drach, and others
  - Argonne National Laboratory: Veronika Nefedova, Ian Foster, Rachana Ananthakrishnan, Frank Siebenlist, and others
  - Lawrence Berkeley National Laboratory: Arie Shoshani, Alex Sim, and others
  - Oak Ridge National Laboratory: David Bernholdt, Meili Chen, and others
  - Los Alamos National Laboratory: Phillip Jones and others
  - USC Information Sciences Institute: Ann Chervenak, Robert Schuler, Shishir Bharathi, Mei-Hui Su
- The MDS team includes:
  - Argonne National Laboratory: Jen Schopf, Neill Miller
  - USC ISI: Laura Pearlman, Mike D'Arcy

More on Metadata

Metadata Services
- Metadata is information that describes data
- Metadata services allow scientists to
  - Record information about the creation, transformation, meaning, and quality of data items
  - Query for data items based on these descriptive attributes
- Accurate identification of desired data items is essential for correct analysis of experimental and simulation results
- In the past, scientists have largely relied on ad hoc methods (descriptive file and directory names, lab notebooks, etc.) to record information about data items
- However, these methods do not scale to terabyte and petabyte datasets consisting of millions of data items
- Extensible, reliable, high-performance metadata services are required to support registration and query of metadata

Presentation from SC2003 talk by Gurmeet Singh

Example: ESG Collection-Level Metadata Class Definitions
- Project
  - A project is an organized activity that produces data. The scope and duration of a project may vary, from a few datasets generated over several weeks or months to a multi-year project generating many terabytes. Typically a project will have one or more principal investigators and a single funding source.
  - A project may be associated with multiple ensembles, campaigns, and/or investigations. A project may be a subproject of another project.
  - Examples:
    - CMIP (Coupled Model Intercomparison Project)
    - CCSM (Community Climate System Model)
    - PCM (Parallel Climate Model)

ESG Collection-Level Metadata (cont.)
- Ensemble
  - An ensemble calculation is a set of closely related simulations, in which typically all aspects of the model configuration and boundary conditions are held constant while the initial conditions and/or external forcing are varied in a prescribed manner. Each set of initial conditions generates one or more datasets.
- Campaign
  - A campaign is a set of observational activities that share a common goal (e.g., observation of the ozone layer during the winter/spring months) and are related either geographically (e.g., a campaign at the South Pole) and/or temporally (e.g., measurements of rainfall at several observation stations during December 2003).
- Investigation
  - An investigation is an activity, within a project, that produces data. The scope of an investigation is narrower and more focused than that of the project. An investigation may be a simulation, experiment, observation, or analysis.

Example: ESG Collection-Level Metadata (Other Classes)
- Simulation
- Experiment
- Observation
- Analysis
- Dataset
- Service

Attributes of classes
- Project
  - Id: a unique identifier for the project
  - Name: a brief name for the project, intended for display in a browser, etc.
  - Topics: one or more keywords, qualified by an optional encoding, intended to be used by specialized search and discovery engines (see, for example, meters.html)
  - Persons: project participants and their respective roles
  - Description: a textual description of the project, intended to provide more in-depth information than the Name
  - Notes: additional, ad hoc information about the project
  - References: links or references to additional project information (web pages, publications, etc.)
  - Funding: funding agencies or sources
  - Rights: a description of the ownership of, and access conditions to, the data holdings of the project

- Ensemble
  - Id: a unique identifier for this ensemble
  - Name: a name for this ensemble
  - Description: a textual description of the ensemble, intended to provide more in-depth information than the Name
  - Notes: additional, ad hoc information about the ensemble
  - Persons: those responsible for the ensemble data
  - References: optional links or references to additional information (web pages, publications, etc.)
  - Rights: an optional description of the ownership of, and access conditions to, the data holdings of the ensemble, if different from those of the project
(A sketch of these classes in code follows.)
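Here is a minimal sketch of how the Project and Ensemble classes above might be expressed in code, using Python dataclasses; the field names follow the attribute lists, but the types, optionality, and the project/ensemble linkage are illustrative assumptions rather than ESG's actual schema.

```python
# Sketch of the Project and Ensemble metadata classes described above.
# Field names mirror the attribute lists; types and optionality are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Project:
    id: str                                            # unique identifier
    name: str                                          # brief display name
    topics: List[str] = field(default_factory=list)    # keywords for discovery
    persons: List[str] = field(default_factory=list)   # participants and roles
    description: str = ""
    notes: str = ""
    references: List[str] = field(default_factory=list)
    funding: List[str] = field(default_factory=list)
    rights: str = ""
    parent_project: Optional[str] = None               # a project may be a subproject

@dataclass
class Ensemble:
    id: str
    name: str
    project_id: str                                    # ensembles belong to a project
    description: str = ""
    notes: str = ""
    persons: List[str] = field(default_factory=list)
    references: List[str] = field(default_factory=list)
    rights: Optional[str] = None                       # only if different from the project

# Example usage:
# cmip = Project(id="cmip", name="CMIP (Coupled Model Intercomparison Project)")
```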

- A standard name is a description of a scientific quantity generated by a model run
- Standard names follow the CF standard name table and are hierarchical
- For example, 'Atmosphere' is a standard name category that includes more specific quantities such as 'Air Pressure':
  - Atmosphere
    - Air Pressure
    - ...
  - Carbon Cycle
    - Biomass Burning Carbon Flux
    - ...
  - Cloud
    - Air Pressure at Cloud Base
    - ...
  - Hydrology
    - Atmosphere Water Content
    - ...
  - Ocean
    - Baroclinic Eastward Sea Water Velocity
    - ...
  - Radiation
    - Atmosphere Net Rate of Absorption of Longwave Energy
    - ...
  - Sea Ice
    - Direction of Sea-Ice Velocity
    - ...
  - Surface
    - Canopy and Surface Water Amount
    - ...

Metadata Services in Practice…
- Generic metadata services have not proven to be very useful
  - MCS was used in the Pegasus workflow system to manage its metadata, provenance, etc.
  - Not widely used in science deployments
- Virtual organizations (scientists) agree on an appropriate metadata schema to describe their data
- They typically deploy a specialized metadata service
  - A relational database with indexes on domain-specific attributes to support common queries
  - RDF tuple services
- These provide faster, more targeted queries on the agreed metadata than a generic catalog (see the sketch below)
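To illustrate the "relational database with indexes on domain-specific attributes" approach, here is a minimal sketch using SQLite; the table layout, attribute names, and sample query are illustrative assumptions, not ESG's actual catalog schema.

```python
# Sketch of a specialized metadata catalog: a relational table whose columns
# are domain-specific attributes, with indexes to support common queries.
# The schema and example values are illustrative, not ESG's actual catalog.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dataset (
        id            TEXT PRIMARY KEY,
        project       TEXT,      -- e.g., CCSM, PCM, CMIP
        experiment    TEXT,
        standard_name TEXT,      -- CF standard name of the quantity
        start_year    INTEGER,
        end_year      INTEGER,
        location      TEXT       -- where the files are stored
    );
    -- Indexes on the attributes most often used in queries.
    CREATE INDEX idx_project ON dataset(project);
    CREATE INDEX idx_standard_name ON dataset(standard_name);
""")

conn.execute(
    "INSERT INTO dataset VALUES (?, ?, ?, ?, ?, ?, ?)",
    ("pcm.example.run1", "PCM", "example-run", "air_pressure", 1890, 1999, "NCAR MSS"),
)

# A typical domain-specific query: all PCM datasets containing air pressure.
rows = conn.execute(
    "SELECT id, location FROM dataset WHERE project = ? AND standard_name = ?",
    ("PCM", "air_pressure"),
).fetchall()
print(rows)
```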