The CMS Grid Analysis Environment GAE (The CAIGEE Architecture)

Slides:



Advertisements
Similar presentations
Data Management Expert Panel - WP2. WP2 Overview.
Advertisements

EGEE-II INFSO-RI Enabling Grids for E-sciencE The gLite middleware distribution OSG Consortium Meeting Seattle,
High Performance Computing Course Notes Grid Computing.
CMS Applications Towards Requirements for Data Processing and Analysis on the Open Science Grid Greg Graham FNAL CD/CMS for OSG Deployment 16-Dec-2004.
CMS-ARDA Workshop 15/09/2003 CMS/LCG-0 architecture Many authors…
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
1 Grid services based architectures Growing consensus that Grid services is the right concept for building the computing grids; Recent ARDA work has provoked.
23rd. June 2003JJB, GAE Workshop1 GAE (Grid Analysis Environment) Overview of Caltech effort Slides for the Caltech GAE Workshop June 2003.
Other servers Java client, ROOT (analysis tool), IGUANA (CMS viz. tool), ROOT-CAVES client (analysis sharing tool), … any app that can make XML-RPC/SOAP.
A tool to enable CMS Distributed Analysis
Client/Server Architectures
©Ian Sommerville 2006Software Engineering, 8th edition. Chapter 12 Slide 1 Distributed Systems Architectures.
Korea Workshop May Grid Analysis Environment (GAE) (overview) Frank van Lingen (on behalf of the GAE.
Scalable Systems Software Center Resource Management and Accounting Working Group Face-to-Face Meeting October 10-11, 2002.
DOSAR Workshop, Sao Paulo, Brazil, September 16-17, 2005 LCG Tier 2 and DOSAR Pat Skubic OU.
Grid Workload Management Massimo Sgaravatto INFN Padova.
ILDG Middleware Status Chip Watson ILDG-6 Workshop May 12, 2005.
Web Services BOF This is a proposed new working group coming out of the Grid Computing Environments Research Group, as an outgrowth of their investigations.
November SC06 Tampa F.Fanzago CRAB a user-friendly tool for CMS distributed analysis Federica Fanzago INFN-PADOVA for CRAB team.
David Adams ATLAS ADA, ARDA and PPDG David Adams BNL June 28, 2004 PPDG Collaboration Meeting Williams Bay, Wisconsin.
CPT Demo May Build on SC03 Demo and extend it. Phase 1: Doing Root Analysis and add BOSS, Rendezvous, and Pool RLS catalog to analysis workflow.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
CHEP 2004 Grid Enabled Analysis: Prototype, Status and Results (on behalf of the GAE collaboration) Caltech, University of Florida, NUST, UBP Frank van.
US LHC OSG Technology Roadmap May 4-5th, 2005 Welcome. Thank you to Deirdre for the arrangements.
1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.
6/23/2005 R. GARDNER OSG Baseline Services 1 OSG Baseline Services In my talk I’d like to discuss two questions:  What capabilities are we aiming for.
INFSO-RI Enabling Grids for E-sciencE ARDA Experiment Dashboard Ricardo Rocha (ARDA – CERN) on behalf of the Dashboard Team.
Development of e-Science Application Portal on GAP WeiLong Ueng Academia Sinica Grid Computing
Korea Workshop May GAE CMS Analysis (Example) Michael Thomas (on behalf of the GAE group)
GRID ANATOMY Advanced Computing Concepts – Dr. Emmanuel Pilli.
David Foster LCG Project 12-March-02 Fabric Automation The Challenge of LHC Scale Fabrics LHC Computing Grid Workshop David Foster 12 th March 2002.
David Adams ATLAS ATLAS-ARDA strategy and priorities David Adams BNL October 21, 2004 ARDA Workshop.
1 A Scalable Distributed Data Management System for ATLAS David Cameron CERN CHEP 2006 Mumbai, India.
Distributed Physics Analysis Past, Present, and Future Kaushik De University of Texas at Arlington (ATLAS & D0 Collaborations) ICHEP’06, Moscow July 29,
D.Spiga, L.Servoli, L.Faina INFN & University of Perugia CRAB WorkFlow : CRAB: CMS Remote Analysis Builder A CMS specific tool written in python and developed.
David Adams ATLAS ADA: ATLAS Distributed Analysis David Adams BNL December 15, 2003 PPDG Collaboration Meeting LBL.
INFSO-RI Enabling Grids for E-sciencE File Transfer Software and Service SC3 Gavin McCance – JRA1 Data Management Cluster Service.
Breaking the frontiers of the Grid R. Graciani EGI TF 2012.
E-commerce Architecture Ayşe Başar Bener. Client Server Architecture E-commerce is based on client/ server architecture –Client processes requesting service.
SAM architecture EGEE 07 Service Availability Monitor for the LHC experiments Simone Campana, Alessandro Di Girolamo, Nicolò Magini, Patricia Mendez Lorenzo,
The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.
Grid Services for Digital Archive Tao-Sheng Chen Academia Sinica Computing Centre
System Software Laboratory Databases and the Grid by Paul Watson University of Newcastle Grid Computing: Making the Global Infrastructure a Reality June.
EDG Project Conference – Barcelona 13 May 2003 – n° 1 A.Fanfani INFN Bologna – CMS WP8 – Grid Planning in CMS Outline  CMS Data Challenges  CMS Production.
Distributed Systems Architectures Chapter 12. Objectives  To explain the advantages and disadvantages of different distributed systems architectures.
Distributed Systems Architectures. Topics covered l Client-server architectures l Distributed object architectures l Inter-organisational computing.
Outline Introduction and motivation, The architecture of Tycho,
Grid Computing: Running your Jobs around the World
OGF PGI – EDGI Security Use Case and Requirements
StoRM: a SRM solution for disk based storage systems
U.S. ATLAS Grid Production Experience
SuperComputing 2003 “The Great Academia / Industry Grid Debate” ?
JRA3 Introduction Åke Edlund EGEE Security Head
Introduction to Data Management in EGI
Testbed Software Test Plan Status
Grid Computing.
WP1 activity, achievements and plans
Short update on the latest gLite status
A Messaging Infrastructure for WLCG
LCG middleware and LHC experiments ARDA project
#01 Client/Server Computing
Monitoring of the infrastructure from the VO perspective
Patrick Dreher Research Scientist & Associate Director
Report on GLUE activities 5th EU-DataGRID Conference
Module 01 ETICS Overview ETICS Online Tutorials
The GENIUS portal and the GILDA t-Infrastructure
Federated Hierarchical Filter Grids
The Anatomy and The Physiology of the Grid
Status of Grids for HEP and HENP
#01 Client/Server Computing
Presentation transcript:

The CMS Grid Analysis Environment GAE (The CAIGEE Architecture) CMS Analysis – an Interactive Grid Enabled Environment See http://ultralight.caltech.edu/gaeweb

The Current Grid System used by CMS Outline The Current Grid System used by CMS The CMS Grid Analysis Environment being developed at Caltech, Florida and FNAL Summary

Current CMS distributed system Supports “traditional” and “Grid” productions OCTOPUS is the glue Grid systems use VDT and LCG services middleware provides AAA Information System and Information providers (including CE and SE) Data Management System and File Catalogues Workload Management System Monitoring CMS has developed (within OCTOPUS): Job preparation (splitting) tools (McRunJob and CMSProd) Job-level monitoring (BOSS) Dataset Metadata Catalogue and Data Provenance System (RefDB) Software packaging and distribution tools (DAR)

OCTOPUS Production System Dataset metadata Phys.Group asks for a new dataset JDL Grid (LCG) Scheduler LCG RLS Job metadata DAG job DAGMan (MOP) Chimera VDL Virtual Data Catalogue Planner DPE Production Manager defines assignments RefDB Computer farm shell scripts Data-level query Local Batch Manager Job level query BOSS DB McRunjob + plug-in CMSProd Site Manager starts an assignment Push data or info Pull info

Support for distributed analysis (I) CMS software framework (COBRA) supports creation: of Tags (datasets with direct pointers to selected events) of reclustered datasets (deep copies of events) of private datasets (redefinition of events) COBRA now uses POOL uses POOL catalogue to store part of the dataset metadata directly interfaced to a distributed file catalogue (RLS) supports automatic download of remote files at run-time GFAL-enabled version of COBRA to directly access data on SRM-based (local) Storage Elements Developing a system for private software distribution based on SCRAM Modifying OCTOPUS components to use grid AAA RefDB, BOSS

Support for distributed analysis (II) Job preparation RefDB has concepts of requests and assignments are equivalent to a high-level query on metadata optimized for production allows automatic preparation of jobs for a variety of environments CMS is extending the request object to support user queries on metadata if a common Dataset Metadata Catalogue is provided by LCG, CMS can migrate part of the functionality of RefDB otherwise continue to use the CMS specific implementation CMS is already using DAG’s on VDT. EDG (LCG-N) will support DAG submissions through the Resource Broker

CMS grid components and ARDA COBRA + Octopus/RefDB MDS Octopus LCG-VOMS COBRA/API Octopus/ McRunJob +ImpalaLite/ CMSProd +EDG-UI MonaLisa Grid-ICE POOL/ COBRA Octopus/RefDB EDG RB RLS LCG Replica Manager Octopus DAR/SCRAM PacMan VDT-EDT LCG Octopus/BOSS

The Grid Analysis Environment (GAE)

The Importance of the GAE The GAE is critical to the success or failure of physics, and Grid-based LHC applications: Where the physics gets done Where we learn how to collaborate globally on physics analyses

GAE Challenges The real “Grid Challenge” lies in the development of a physics analysis environment that integrates with the Grid systems To be used by diverse worldwide communities of physicists; sharing resources within and among global “Virtual Organizations” 100s - 1000s of tasks, with a wide range of demands for computing, data and network resources Needs a coherent set of priorities, taking Collaboration strategy and policy into account Needs Security To successfully construct and operate such a global system, we need To build Grid End-to-End Service and Grid Application Software Layers To Monitor, Track and Visualise the system end-to-end To Answer the question: How much automation is possible. How much is desirable? These challenges motivated the CAIGEE Project proposal, in Early 2002.

What does the GAE Address ? The full continuum of analysis tasks, from trivial (interactive) response to many days execution. It is especially focussed on: Low latency, Quick turnaround Unpredictable, sometimes chaotic system demands Smooth transition to longer, more “batch”-like tasks The seamless integration of existing analysis tools into the GAE The needs of CMS analysis, but with a strong intent to be useful in other LHC Grid environments. Describe interactive <-> batch continuum

GAE Development Strategy Build on the existing CS11 (PPDG Analysis group) requirements, HEPCAL (single user) use cases, & ARDA (Roadmap) documents Work with, and use results of existing efforts as much as possible Adapt and integrate tools in common use (like ROOT) in the GAE Support full spectrum of analysis tasks: from interactive to batch/production analysis Low latency, fast interactive tasks, together with longer running batch tasks. Design for, and show Scalability E.g. with Peer to Peer architectures Ensure fault tolerance No single point of failure, e.g. no centralised critical services Support Security Authentication and Authorization Build in Adaptability Exploit usage patterns and policies to optimize the overall system performance Support the full range of heavyweight and lightweight clients Clusters, Servers, Desktops, Browsers, Handhelds, WAP phones etc. Strongly favour implementations that are language and OS neutral

Work on the GAE General Scope: GAE portal (Clarens): GAE Monitoring GAE involves several US Tier2 sites and the FNAL Tier1 We are willing to actively participate in ARDA We have been working with the PPDG analysis group (CS11) on definition of the Grid Analysis services GAE portal (Clarens): Python and Java versions Many Clarens clients developed and deployed (e.g. ROOT, JAS, handhelds etc.) Clarens servers widely deployed UFL (service host for the Sphinx Grid Scheduler) CERN for CMS and ATLAS Fermilab UCSD NUST and NCP (Pakistan) Upcoming: FIU, UERJ (Rio), … GAE Monitoring We use MonALISA to monitor the Grid2003 and the global VRVS reflector network HEPCAL does not address the multi-user point of view of the GAE. Duplicate this slide. Add a list of collaborating groups (ufl, others?) in the middle

GAE System features Hierarchical Dynamic Distributed System Architecture* Dynamic service discovery Distributed Lookup Dynamic host discovery Distributed and automatic service lookup Better able to handle single node failure E.g. JINI today (working now!) Support for Peer to Peer Communications Hierarchical and Functional peer groupings Access to Result Sets, data and applications Web Services APIs Language and OS neutral Standards Based Emerging de-facto standard in academia and industry Full Architecture documentation See http://ultralight.caltech.edu/gaeweb/gae_services.pdf *Further info on the dynamic services architecture is available in: “Grid Computing 2002 – Making the Global Infrastructure a Reality”, Chapter 39

GAE Adopted Standards Access Web Services using HTTP , SOAP and XML-RPC Describe Web Services using WSDL Use X509/PKI based security via HTTP, HTTPS and GSI GAE will track the emerging OGSA standard with the goal of becoming compliant with it

Clarens Web Services Framework Clarens is a Portal system that provides a common infrastructure for deploying Grid-enabled web services. Features: Access control to services Session management Service discovery and invocation Virtual Organization management PKI-based security, including credential delegation Role in GAE: Connects clients (top) to Grid or Analysis applications Acts in concert with other Clarens servers to form a P2P network of service providers Two implementations: wide applicability Python/C using the Apache web server Java using Tomcat Servlets Client XML-RPC Web server SOAP/XML-RPC Clarens App. 1 App. 2 App. N Service discovery Grid Authentication RPC Call routing Session manager Plug-in manager HTTP HTTPS

GAE Architecture Analysis Clients talk standard protocols to the “Grid Services Web Server”, a.k.a. the Clarens data/services portal. Simple Web service API allows Analysis Clients (simple or complex) to operate in this architecture. Typical clients: ROOT, Web Browser, IGUANA, COJAC The Clarens portal hides the complexity of the Grid Services from the client, but can expose it in as much detail as req’d for e.g. monitoring. Key features: Global Scheduler, Catalogs, Monitoring, and Grid-wide Execution service. Grid Scheduler Catalogs Analysis Client Grid Services Web Server Execution Priority Manager Grid Wide Service Data Management Fully- Concrete Planner Abstract Virtual Replica Applications Monitoring Partially- Metadata HTTP, SOAP, XML/RPC Slide 7 - This diagram shows how the client access the grid services through a grid service host, and a rough idea of how the types of services interact. One key point from this slide is that the clients can be as smart or dumb as the application desires. They can let the grid service host do all of the work and ignore the details of the various services, or they can be more involved in the service flow. The motivator for hiding the grid services behind a grid service host is that it allows for simpler and easier to develop client application. The services are still available for smarter clients, however, so that application developers can have more control over the service flow if necessary. Clarens is used as the Grid Service host, providing XMLRPC, SOAP, and HTTP access to the grid services. Clarens also provides authentication and authorization to allow secure access to services. Factors motivating design choices: HTTP, SOAP, XMLRPC were chosen for transport/communication protocols due to their wide acceptance as standards. The heterogeneous and decentrally administered GAE must use open protocols for communication in order to be widely accepted. This lowers the barriers to entry when introducing new services and GAE grid sites.

GAE Architecture “Structured Peer-to-Peer” The GAE, based on Clarens and Web services, easily allows a “Peer-to-Peer” configuration to be built, with associated robustness and scalability features. Flexible: allows easy creation, use and management of highly complex VO structures. A typical Peer-to-Peer scheme would involve the Clarens servers acting as “Global Peers,” that broker GAE client requests among all the Clarens servers available worldwide. This slide shows that there is no single instance of a service; services are distributed across many service providing sites. The services at one site can interact with services at another site. Peer to peer is used to provide dynamic service discovery in order to prevent a single point of failure, and to allow easy integration of new services and service providers into the system. The term "super peer" is used to describe the natural tendency for the system to direct more requests to peers that have more resource availability (disk space, bandwidth, computing nodes); that is, not all peers are created equal and the more resource-affluent peers will tend to handle more of the grid-wide load than the resource-limited peers. P2P capabilities are being added to the Clarens service servers, as an addition to the existing security and service hosting capabilities. This makes Clarens a very key component of the CAIGEE architecture. Factors motivating design choices: P2P technologies are widely used for creating ad-hoc and dynamic groups of resource providers (file sharing). While still fairly young, P2P has proven that it is capable of scaling to the size that the GAE aims for.

Services in the GAE Architecture Analysis Client ROOT- Clarens Example Implementations Analysis Web Clients High-Level Web-service Grid Services Application Management VO Structure Data Discovery Request Servicing Work Flow Manager / Builder VO Management Service Metadata Service Monitoring Service Clarens MonALISA Shakar Lookup Service Virtual Data Service Request Steering Service Clarens Chimera Analysis Versioning Replica Location Service Request Scheduler Service CAVES Sphinx RLS Request Execution Service VDT-Client This is a breakdown of the various types of GAE services and existing implementations of the services. The implementations listed are only a sample of existing implementations. To encourage many implementations of the standard services, and to allow rapid prototyping and development of new services, language-neutral apis are necessary. Web Services provide the ability to define and implement such language-neutral APIs. Factors motivating software technologies: These service implementations were chosen because they already exist and have been used, even if in a limited manner. Newer implementations of the services will be investigated as they are created. Some of these services are native to the Clarens grid service host, while the others can be used as external plugins to the Clarens server. High-Level Web-service Grid Services Middleware Grid Services Data Service Grid Resource VO Mngmt. Service Replica Loc. Service Data Transfer Gatekeeper Monitoring Service Clarens Clarens VDT-Server

See http://ultralight.caltech.edu/gaeweb/gae_services.pdf GAE Service Flow The diagram depicts the sequence of operations and messages that occur during a Grid Enabled Analysis task execution in the GAE scheme: The sequence starts with the client communicating VO credentials to the Clarens server, and entrusting Clarens with the potentially complicated task of communicating with and marshalling the required Grid resources. Eventually the task execution results are returned, via Clarens, to the client from the Grid. During all this, the Monitoring service provides the user with constant feedback. This shows one possible service flow for end-to-end interactive analysis. Many other possible service flows can exist on the system. For example, a client application may be written to perform all computation locally, and could contact the Metadata Catalog and Replica Management Service directly to move data locally. In this scenario, the scheduler, steering service, and execution services are not used. While Clarens provides the VO and Lookup services natively, the other services are also made available Through the Clarens server as external plugins. Factors motivating design choices: This service flow and set of GAE services was motivated by the Caltech GAE workshop in June, 2003. It was also strongly motivated by the CS11 and HEPCAL documents. See http://ultralight.caltech.edu/gaeweb/gae_services.pdf

ARDA and GAE Ontology ARDA term GAE term Metadata Catalogue, Authentication, Authorization Job Monitor Monitor, Steering Grid Monitor Monitor Workload Management Scheduler Job Provenance Virtual Data Information Lookup Auditing File catalogue Replica Location and Selection ??? Workflow manager/builder Analysis versioning service Package Manager

Monitoring the Global GAE System MonALISA: http://monalisa.cacr.caltech.edu

Distributed Services for Grid ROOT Analysis Client Distributed Services for Grid Analysis Environment SC03 Demo Authentication Clarens Storage Elements Computing Elements VO management Iowa File Service VDT Resource MonALISA Monitoring Service Fermilab File Service VDT Resource Grid Florida File Service VDT Resource Chimera Virtual Data Service Caltech File Service VDT Resource Sphinx/VDT Execution Service Sphinx Scheduling Service RLS Replica Location Service

A Candidate GAE Desktop Allows simultaneous work on: Traditional analysis tools (e.g. ROOT) Software development Event displays MonALISA monitoring displays; Other “Grid Views” Job-progress Views Persistent collaboration (VRVS; shared windows) Online event or detector monitoring Web browsing, email Et cetera Four-screen Analysis Desktop 4 Flat Panels: 6400 X 1200 Driven by a single server and two graphics cards Cost: a few thousand dollars

Grid-Enabled Analysis Prototypes COJAC (via Web Services) Collaboration Analysis Desktop JASOnPDA (via Clarens) ROOT (via Clarens)

Upcoming GAE Developments Asynchronous service support using alternative transport protocols Higher level services to exploit distributed nature of resources Define standard APIs for similar services provided by different existing components (such as data catalogs) Implement a realistic end-to-end prototype based on the SC2003 GAE demonstration Make more existing components web-service enabled Asynchronous service access helps to enable more responsive clients.

Summary Distributed Production Grid Analysis Environment CMS is already using successfully LCG and VDT grid middleware to build its distributed production system A distributed batch-analysis system is being developed on top of the same LCG and VDT software CMS suggests that in the short term (6 months) ARDA extends the functionality of the LCG middleware to meet the architecture described in the ARDA document Grid Analysis Environment CMS asks that the GAE be adopted as the basis for the ARDA work on a system that supports interactive and batch analysis. Support for Interactive Analysis is the crucial goal in this context.