Monitoring, Logging, and Accounting on the TeraGrid
David Hart, SDSC
MLA09, June 10, 2009

TeraGrid: A Supercomputing Grid

TeraGrid “grid-style” usage
Non-GRAM jobs: 1.1 million batch, 1.9 million Condor (~3 million total)
667,000 GRAM jobs
Upper bound on GRAM-related CPU-hours (actual probably much lower)
(Chart: jobs and CPU-hours)

TeraGrid’s Overall Usage Profile, 2008
While the number of jobs is stacked toward shorter, smaller jobs, the computation on TeraGrid is dominated by larger and/or longer jobs.
– 1.9 million 0-hr, 1-core jobs
(Charts: job counts and job charges versus job hours and core counts, in # of jobs and millions of NUs)

TeraGrid’s Evolution
Many TeraGrid functions have evolved independently, including monitoring, logging, and accounting.
Allocations and accounting were in place from the start.
– Leveraged capabilities from, and continued policies of, the PACI partnerships and the Supercomputer Centers program
– At the same time, always supported “roaming” and multi-resource allocations, in anticipation of a grid future
Monitoring has changed more significantly over the years, adapting to the growth and evolution of TeraGrid resources, governance, and technology.
– From an initial set of four homogeneous clusters to a federation of ~20 heterogeneous HPC systems

MONITORING/LOGGING

Inca: User-Level Grid Monitoring
– Enables consistent user-level testing across resources
– Easy to configure and maintain
– Easy to collect data from resources
– Archived results support troubleshooting
– Large variety of tests (~150)
– Comprehensive views of data

Inca Deployment on TeraGrid
Running since 2003
Total of ~2,400 tests running on 20 login nodes, 3 grid nodes, and 3 servers:
– Coordinated software and services
– Resource registration in information services
– Cross-site tests
– CA certificate and CRL checking
– TeraGrid GIG services
(Screenshot of Inca status pages for TeraGrid)

Inca Monitoring Benefits End Users
Tests resources and services used by LEAD:
– Pings the service every 3 minutes
– Verifies batch job submission every hour
– Automatically notifies admins of failures
– Shows a week of history in custom status pages
“Inca-reported errors mirror failures we’ve observed, and as they are addressed, we’ve noticed an improvement in TeraGrid’s stability.” – Suresh Marru (LEAD developer)
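
As a rough illustration of the ping-and-notify pattern described above (not actual Inca code), the sketch below polls a hypothetical service URL every three minutes and mails a hypothetical admin list on failure:

# Illustrative only -- not Inca code. The service URL and admin address are
# hypothetical placeholders.
import smtplib
import time
import urllib.request
from email.message import EmailMessage

SERVICE_URL = "https://lead.example.teragrid.org/ping"   # hypothetical
ADMIN_EMAIL = "admins@example.org"                       # hypothetical

def service_is_up(url, timeout=30):
    """Return True if the service answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def notify_admins(subject, body):
    """Send a plain-text alert through the local mail relay."""
    msg = EmailMessage()
    msg["Subject"], msg["From"], msg["To"] = subject, "monitor@example.org", ADMIN_EMAIL
    msg.set_content(body)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

while True:                      # poll every 3 minutes, as the slide describes
    if not service_is_up(SERVICE_URL):
        notify_admins("Service ping failed",
                      f"{SERVICE_URL} did not respond at {time.ctime()}")
    time.sleep(180)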

Inca’s status pages provide multiple levels of detail, in both current-status and historical views:
– Cumulative test status by resource
– Test status by package and resource
– Individual test result details
– Individual test histories and related test histories
– Resource status history
– Error history summary
– Weekly status report

QBETS: Batch queue prediction (Wolski, et al., UC Santa Barbara)
Wait-time prediction
– How long will it take my job to begin running?
Deadline prediction
– What is the chance my job will complete within a certain time?
Available via User Portal
– Portlet uses the QBETS Web service

System load, queue monitoring
TeraGrid Integrated Information Services
– Load: percent of nodes running jobs (public)
– Jobs: lists jobs in the batch queue (requires auth)
– Uses Apache 2, Tomcat, and Globus MDS4
Accessed via HTTP GET or wsrf-query
– Used by User Portal, Scheduling Services, etc.
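
A minimal sketch of pulling the public load view over HTTP GET, assuming a hypothetical endpoint URL and a simple XML response layout (the real deployment exposes WebMDS/wsrf-query interfaces whose exact paths are not shown here):

import urllib.request
import xml.etree.ElementTree as ET

INFO_URL = "http://info.example.teragrid.org/load"   # hypothetical endpoint

def fetch_load(url=INFO_URL):
    """Return {resource name: percent of nodes running jobs}."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        doc = ET.fromstring(resp.read())
    # Assumes each <Resource> element carries Name and PercentNodesBusy attributes.
    return {r.get("Name"): float(r.get("PercentNodesBusy", "0"))
            for r in doc.iter("Resource")}

if __name__ == "__main__":
    for name, pct in sorted(fetch_load().items()):
        print(f"{name:25s} {pct:5.1f}% of nodes busy")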

User-Oriented System Monitor
The TGUP System Monitor brings many pieces together:
– Integration with TeraGrid IIS, Inca, User News
– Up/down status from Inca
– System load and queues from IIS
– User News about outages
– Inca test results regarding access

Inter-site Data Movement
GridFTP transfers
– Counts and data volume
– 203 TB moved in April 2009
Data collected via Globus Listener
– Centralized Listener deployed for TeraGrid, used for GridFTP and WS-GRAM data
– Would like to expand to other services

Monitoring Grid Computing Use
GRAM job submission monitoring
– Part of monthly Globus reporting
– GRAM and pre-WS GRAM job monitoring
  Pre-WS GRAM jobs counted by monthly parsing of log files at RP sites
  GRAM jobs counted by Globus Listener
No current link between GRAM job records and accounting job records
– Likely to happen in Q4 2009
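
The pre-WS GRAM counting amounts to a monthly pass over job-manager logs at each RP site. A sketch of that idea follows; the log directory, file names, and the "job submitted" marker are assumptions for illustration, not the real gatekeeper log format:

import re
from collections import Counter
from pathlib import Path

LOG_DIR = Path("/var/log/globus")                 # hypothetical location
JOB_MARKER = re.compile(r"Job \S+ submitted")     # hypothetical marker line
MONTH = re.compile(r"^(\d{4}-\d{2})")             # assumes ISO-dated log lines

def count_jobs_by_month(log_dir=LOG_DIR):
    """Count marker lines per month across all job-manager logs."""
    counts = Counter()
    for logfile in log_dir.glob("gram-job-manager*.log"):
        for line in logfile.read_text(errors="replace").splitlines():
            if JOB_MARKER.search(line):
                m = MONTH.match(line)
                if m:
                    counts[m.group(1)] += 1
    return counts

if __name__ == "__main__":
    for month, n in sorted(count_jobs_by_month().items()):
        print(month, n)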

Inter-site Data Movement Performance
SpeedPage – speedpage.psc.teragrid.org
– Monitors GridFTP performance between TG endpoints
– Backend primarily Python, UI in PHP
– Pulls GridFTP endpoints from TG IIS
– Spawns globus-url-copy transfers (one per endpoint at a time), with striping, and records results in a DB twice per day
– Working toward full automation
  $SCRATCH locations now manually updated for each resource
Related: Lustre WANpage
– Monitors Lustre-WAN systems at Indiana U and PSC
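
A single SpeedPage-style measurement might look like the sketch below: time one striped globus-url-copy between two endpoints and derive MB/s. The endpoint URLs and test-file size are placeholders; the real SpeedPage pulls endpoints from the information services and stores results in a database:

import subprocess
import time

SRC = "gsiftp://gridftp.site-a.example.org/scratch/speedpage/testfile"  # hypothetical
DST = "gsiftp://gridftp.site-b.example.org/scratch/speedpage/testfile"  # hypothetical
TEST_FILE_BYTES = 1 * 1024**3          # assumes a 1 GB test file

def measure_transfer(src=SRC, dst=DST):
    """Run one striped transfer and return MB/s, or None on failure."""
    start = time.time()
    result = subprocess.run(["globus-url-copy", "-stripe", src, dst],
                            capture_output=True, text=True)
    elapsed = time.time() - start
    if result.returncode != 0:
        return None                      # record the failure instead
    return TEST_FILE_BYTES / elapsed / 1024**2

if __name__ == "__main__":
    rate = measure_transfer()
    print("transfer failed" if rate is None else f"{rate:.1f} MB/s")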

Network Monitoring
Monitoring TeraGrid’s dedicated network
– Network peak rates
– Daily transfer volume
– In the past, had active bandwidth testing, but not currently
Collected via SNMP
– 20-second polling
Uses RTG
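
In the spirit of the counter-based polling RTG performs, the sketch below reads a 64-bit interface octet counter via net-snmp's snmpget every 20 seconds and converts the delta to a rate. The router hostname, community string, and interface index are placeholders:

import subprocess
import time

HOST = "router.example.teragrid.org"     # hypothetical
COMMUNITY = "public"                     # hypothetical
OID = "1.3.6.1.2.1.31.1.1.1.6.1"         # IF-MIB ifHCInOctets, ifIndex 1 (assumed)

def read_octets():
    # -Oqv asks snmpget to print just the value; assumes a bare counter number.
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID],
        capture_output=True, text=True, check=True).stdout.strip()
    return int(out)

if __name__ == "__main__":
    prev = read_octets()
    while True:
        time.sleep(20)                   # 20-second polling interval
        cur = read_octets()
        print(f"{(cur - prev) * 8 / 20 / 1e6:.1f} Mbit/s inbound")
        prev = cur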

Quality Assurance Working Group
How to handle tough or recurring problems in a federated environment?
– Many cracks for things to fall into
– Arose from issues encountered by the LEAD Gateway
QA-WG established to improve reliability and availability of services
– Problems not solvable by a site admin, or problems that keep coming back, are escalated to the QA-WG
– The QA-WG prioritizes problems, works to resolve them, and, if need be, engages software packagers, developers, or other experts

Monitoring Needs
Don’t yet have centralized log collection for all services
– Site-by-site collection is painful and error-prone
– Looking to expand central logging via syslog and Globus Metrics
Have a handle now on what is being used; want to understand better by whom and why
– Knowing who is impacted by a failure, and how, helps prioritize resources and reliability efforts
– Defining usage modalities to characterize user behavior
– Integrating science gateway/Globus use and users into the accounting infrastructure
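
One way a service could feed a central collector via syslog, as suggested above, is Python's standard SysLogHandler; the collector hostname below is a placeholder:

import logging
import logging.handlers

# Ship log records over UDP to a hypothetical central syslog collector.
handler = logging.handlers.SysLogHandler(address=("syslog.example.teragrid.org", 514))
handler.setFormatter(logging.Formatter("gridftp-frontend: %(levelname)s %(message)s"))

log = logging.getLogger("central")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("transfer complete user=%s bytes=%d", "jdoe", 1048576)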

ALLOCATIONS/ACCOUNTING

Centralized Allocations Process
The allocations process has its origins at the start of the supercomputer centers program (circa 1985)
– Users submit requests, Startup or Research
  Startups reviewed continually
  Research requests reviewed quarterly by an external committee
– Allocations set usage limits for projects
  Projects have one “PI” and any number of authorized users
– Managed by the POPS system
  pops-submit for user submissions
  pops-review for the review process
  pops-admin for managing, handling, and awarding requests
  Originated from NCSA’s local request management system, dating back to 2001, for use across NSF-supported sites

AMIE
AMIE: Account Management Info Exchange
AMIE protocol
– Federates Allocations and Accounting information
  Establishes projects and allocations
  Establishes user accounts; all TeraGrid users must be on at least one allocation
  Manages usage reporting from RPs
– Asynchronous exchange of XML packets
– System tolerant of RP delays and independence
– Originated pre-TeraGrid by the NCSA Alliance
– AMIE/Gold interface developed by LONI
TeraGrid Central Database (TGCDB)
– Definitive TeraGrid data source
– Implements TeraGrid policies
– Creates AMIE “request” packets
(Diagram: POPS, TGCDB, RP 1 … RP n, AMIE, TG User Portal)
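
A toy illustration of the asynchronous XML-packet pattern follows. The element and attribute names are invented for illustration and are not the real AMIE schema:

import xml.etree.ElementTree as ET

def build_usage_packet(resource, username, charge, job_id):
    """Build one usage-report packet as an XML string (invented layout)."""
    pkt = ET.Element("packet", type="usage_report", version="1.0")
    job = ET.SubElement(pkt, "job", id=job_id)
    ET.SubElement(job, "resource").text = resource
    ET.SubElement(job, "user").text = username
    ET.SubElement(job, "charge").text = str(charge)
    return ET.tostring(pkt, encoding="unicode")

def handle_packet(xml_text):
    """Receiver side: parse a packet and dispatch on its type."""
    pkt = ET.fromstring(xml_text)
    if pkt.get("type") == "usage_report":
        job = pkt.find("job")
        return job.get("id"), job.findtext("user"), float(job.findtext("charge"))
    raise ValueError(f"unhandled packet type {pkt.get('type')}")

print(handle_packet(build_usage_packet("bigcluster.example.org", "jdoe", 42.5, "job-001")))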

TeraGrid Accounting Features
AMIE system independent of specific HPC resource
– One AMIE instance can support all resources at a site
– RPs retain control of HPC resource administration and can host both TG and non-TG activities
Usage based on job-level records sent from RPs
– Averaging ~35,000 records per week
Supports “grid” (multi-resource) allocations
– Allocations that are usable on sets of resources, at the user’s discretion
– TGCDB amasses usage and notifies RPs when an allocation is exceeded
– Usage can be normalized to a common “service unit”
Supports TeraGrid Single Sign-On (SSO)
– Distinguished Name synchronization across sites
– gx-map suite of utilities for managing grid-mapfiles
Supports non-HPC allocations
– Requests/allocations supported for storage resources
– Requests in POPS possible for Advanced User Support
– But accounting definitions and support are lagging
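
The normalization to a common service unit can be pictured as the sketch below; the per-resource conversion factors and allocation size are made-up examples, not actual TeraGrid NU rates:

# Hypothetical per-resource conversion factors from local CPU-hours to NUs.
NU_PER_CPU_HOUR = {
    "bigcluster.example.org": 30.0,
    "capability.example.org": 45.0,
}

def charge_job(balance_nu, resource, cpu_hours):
    """Deduct a job's normalized charge from the project's NU balance."""
    charge = cpu_hours * NU_PER_CPU_HOUR[resource]
    remaining = balance_nu - charge
    if remaining < 0:
        print("allocation exceeded -- notify RPs")   # TGCDB notifies RPs at this point
    return remaining

balance = 100_000.0                                   # hypothetical 100k NU award
balance = charge_job(balance, "bigcluster.example.org", 512 * 4.0)  # 512 cores x 4 hrs
print(f"{balance:,.1f} NUs remaining")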

User Views for Allocations/Accounting
– TeraGrid User Portal
– IIS-based profile service (under development)
– tgusage, a command-line utility

Staff Views for Allocations/Accounting
Web interface for staff to monitor TGCDB “ground truth”
– Used by the help desk to troubleshoot user issues
– Used to monitor accounting system functioning
– Custom metrics/reports for management needs
– Mostly Perl

Conclusions and Acknowledgments
Within TeraGrid, these efforts are coordinated across several programmatic areas:
– Monitoring/logging activities fall under the GIG’s Network, Operations, and Security activities
– Allocations/accounting is managed under User-Facing Projects & Core Services activities
– Information Services is managed under Software Integration activities
Thanks to Shava Smallen (SDSC), Von Welch (NCSA), JP Navarro (UC/ANL), and Robert Budden (PSC) for their contributions to this presentation.
Questions?