Metadata Services on the GRID

Presentation transcript:

Metadata Services on the GRID
Nuno Santos
ACAT’05, May 25th, 2005
Notes: Name. PhD student at Coimbra, working on ARDA. My presentation is about the work on metadata that we have been doing at ARDA.

Contents
- Metadata on the GRID
- ARDA-gLite Metadata Interface
- The ARDA Implementation
- Performance study: SOAP vs TCP Streaming
Notes: I’ll start by giving a brief overview of what metadata means to the GRID. Then, I’ll present the Metadata Interface developed by ARDA and gLite, which addresses the most common use cases for GRID metadata. I’ll continue by describing the prototype implemented by ARDA to validate this interface. I’ll finish by presenting the results of a performance study made using this prototype, where SOAP is compared with a traditional RPC protocol based on streaming.

Metadata on the GRID
- Metadata is data about data
- Metadata on the GRID
  - Mainly information about files
  - Other information necessary for running jobs
  - Usually living on DBs
- Need simple interface for Metadata access
  - Advantages
    - Easier to use by clients: no SQL, only metadata concepts
    - Common interface: clients don’t have to reinvent the wheel
  - Must be integrated in the File Catalogue
  - Also suitable for storing information about other resources
Notes: First, what is metadata? A common definition is that metadata is data about data. On the GRID, this is mainly information describing files that is necessary for running jobs, that is, file metadata. This information usually lives in databases, so in a way accessing metadata is mainly about accessing databases. But having clients go directly to the database is not the most convenient solution. It is better to have a simple interface for metadata access on the GRID. This interface should be defined in terms of metadata concepts, like entries, keys and values, instead of DB concepts. This has several advantages. It is easier for clients to use, since it exposes only metadata concepts and effectively hides the database. Having a simple interface that reveals DB functionality solves most of the problems. In effect, it is a simplified relational database interface.

ARDA-gLite Metadata Interface
- ARDA proposed an interface for Metadata access on the GRID
  - Designed jointly with the gLite/EGEE team
  - Incorporates feedback from GridPP
  - Endorsed by the EGEE standards committee (PTF)
  - Being implemented in gLite File Catalog (FiReMan)
- Interface concepts
  - Metadata: key-value pairs
  - Entry: entity to which metadata is attached
  - Attribute: holds information about an entry
  - Schema: a collection of attributes
  - Type: the type of an attribute (int, float, string, …)
  - Name/Key: the name of the attribute
  - Value: value of an entry's attribute
- Entries are associated with schemas
- Think of schemas as tables, attributes as columns, entries as rows
Notes: Metadata are key-value pairs.
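The interface concepts above can be pictured with a small data model. This is only an illustrative Python sketch, not the ARDA or gLite code: the class layout, the `add` and `set` helpers and the example schema `runs` are assumptions made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Attribute:
    """A named, typed slot, comparable to a table column."""
    name: str
    type: str  # e.g. "int", "float" or "string"


@dataclass
class Schema:
    """A collection of attributes, comparable to a table."""
    name: str
    attributes: Dict[str, Attribute] = field(default_factory=dict)

    def add(self, attribute: Attribute) -> None:
        self.attributes[attribute.name] = attribute


@dataclass
class Entry:
    """An entity carrying metadata, comparable to a table row."""
    name: str
    schema: Schema
    values: Dict[str, str] = field(default_factory=dict)

    def set(self, key: str, value: str) -> None:
        # Only keys declared in the entry's schema may hold a value.
        if key not in self.schema.attributes:
            raise KeyError(f"{key!r} is not an attribute of schema {self.schema.name!r}")
        self.values[key] = value


# Hypothetical example data: a schema for run files with two attributes.
runs = Schema("runs")
runs.add(Attribute("events", "int"))
runs.add(Attribute("detector", "string"))

entry = Entry("run_0001.dat", runs)
entry.set("events", "15000")
entry.set("detector", "LHCb")
```

Values are kept as strings here, and the declared type is carried only as a label; a real catalogue would validate or convert values against it.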

Interface Operations
- Schema management
  void createSchema(String schemaName, Attribute[] attributes)
  void dropSchema(String schemaName)
  void removeSchemaAttributes(String schemaName, String[] attributeNames)
  void addSchemaAttributes(String schemaName, Attribute[] attributes)
- Entry management
  void createEntry(MDEntry[] entries, String[] schemas)
  void removeEntry(String query)
  int setAttributes(String query, Attribute[] attributes)
  Attribute[] listAttributes(String entry)
Notes: Talk about the main types of tasks, and then go into the concrete operations. Update with the new interface.
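To make the schema and entry operations concrete, here is a toy in-memory catalogue in Python. It is a sketch under simplifying assumptions, not the interface itself: method names are Python-style renderings of the operations above, entries are addressed by name instead of by query string, and attributes are plain dictionaries.

```python
class ToyCatalog:
    """In-memory stand-in for a metadata catalogue (illustration only)."""

    def __init__(self):
        self.schemas = {}  # schema name -> {attribute name: type}
        self.entries = {}  # entry name -> (schema name, {attribute name: value})

    # --- schema management ---
    def create_schema(self, schema_name, attributes):
        if schema_name in self.schemas:
            raise ValueError(f"schema {schema_name!r} already exists")
        self.schemas[schema_name] = dict(attributes)

    def add_schema_attributes(self, schema_name, attributes):
        self.schemas[schema_name].update(attributes)

    def remove_schema_attributes(self, schema_name, attribute_names):
        for name in attribute_names:
            self.schemas[schema_name].pop(name, None)
            # Drop stored values for the removed attribute as well.
            for owner, values in self.entries.values():
                if owner == schema_name:
                    values.pop(name, None)

    def drop_schema(self, schema_name):
        del self.schemas[schema_name]
        self.entries = {
            entry: (owner, values)
            for entry, (owner, values) in self.entries.items()
            if owner != schema_name
        }

    # --- entry management ---
    def create_entry(self, entry, schema_name):
        if schema_name not in self.schemas:
            raise KeyError(f"unknown schema {schema_name!r}")
        self.entries[entry] = (schema_name, {})

    def remove_entry(self, entry):
        del self.entries[entry]

    def set_attributes(self, entry, values):
        schema_name, stored = self.entries[entry]
        unknown = set(values) - set(self.schemas[schema_name])
        if unknown:
            raise KeyError(f"attributes not in schema: {sorted(unknown)}")
        stored.update(values)
        return len(values)  # number of attributes written

    def list_attributes(self, entry):
        return dict(self.entries[entry][1])


# Hypothetical usage with made-up schema and entry names.
catalog = ToyCatalog()
catalog.create_schema("runs", {"events": "int"})
catalog.add_schema_attributes("runs", {"detector": "string"})
catalog.create_entry("run_0001.dat", "runs")
written = catalog.set_attributes("run_0001.dat", {"events": "15000", "detector": "LHCb"})
catalog.remove_schema_attributes("runs", ["detector"])
remaining = catalog.list_attributes("run_0001.dat")
```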

Interface Operations
- Searching and retrieving entries
  MDResult query(MDQuery query)
  MDResult nextQuery(String token, MDQuery query)
  void endQuery(String token)
  - Allows either stateful or stateless server implementations
  - Other query types: XPath
- Datatypes
  Attribute { String schema; String name; String type; String value }
  MDEntry { String entry; Attribute[] attributes }
  MDQuery { String query; String queryType }
  MDResult { MDEntry[] entries; String token; Boolean done }
Notes: Mention the stateful vs stateless nature of searching and retrieving entries. An implementation can return the answer in a single go, or send it in chunks, using either stateful or stateless servers.
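The query, nextQuery and endQuery calls imply a simple client loop: keep asking for the next chunk until the result is marked as done. The Python sketch below illustrates that loop against a toy stateful server. The server class, its chunk size and the use of plain strings for entries are assumptions for the illustration; the toy also ignores the query text and returns every entry.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MDResult:
    entries: List[str]
    token: str
    done: bool


class ToyMetadataServer:
    """Stateful toy server: remembers the read offset of each open query."""

    def __init__(self, entries, chunk_size=2):
        self._entries = list(entries)
        self._chunk_size = chunk_size
        self._sessions: Dict[str, int] = {}  # token -> next offset to send

    def _chunk(self, token):
        start = self._sessions[token]
        end = start + self._chunk_size
        done = end >= len(self._entries)
        if done:
            del self._sessions[token]  # session ends with the last chunk
        else:
            self._sessions[token] = end
        return MDResult(self._entries[start:end], token, done)

    def query(self, query):
        token = uuid.uuid4().hex
        self._sessions[token] = 0
        return self._chunk(token)

    def next_query(self, token, query):
        return self._chunk(token)

    def end_query(self, token):
        # Lets a client abandon a query early.
        self._sessions.pop(token, None)

    def open_sessions(self):
        return len(self._sessions)


def fetch_all(server, query):
    """Client side: collect chunks until the server reports done."""
    result = server.query(query)
    entries = list(result.entries)
    while not result.done:
        result = server.next_query(result.token, query)
        entries.extend(result.entries)
    return entries


server = ToyMetadataServer(["e1", "e2", "e3", "e4", "e5"], chunk_size=2)
all_entries = fetch_all(server, "events > 1000")
```

A stateless server could support the same loop by encoding the offset in the token itself instead of keeping a session table.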

ARDA Prototype
- Validate proposed interface
- Architecture: Metadata organized in a hierarchy
  - Schemas can contain sub-schemas
  - Can inherit attributes
  - Analogy to file system: Schema corresponds to Directory; Entry corresponds to File
- Stability with large responses
  - Send large responses in chunks
  - Otherwise preparing large responses could crash server
- Stateful server
  - DB → Server: data streamed using DB cursors
  - Server → Client: response sent in chunks
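The idea of streaming from the database with a cursor, so that a large response is never built in memory at once, can be sketched with Python's standard `sqlite3` module. This is an illustration only, not the prototype's code: the table layout, the row count and the chunk size are made up.

```python
import sqlite3


def stream_rows(conn, sql, chunk_size=100):
    """Yield query results in lists of at most chunk_size rows."""
    cursor = conn.execute(sql)
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows  # each chunk can be sent to the client before the next is read


# Hypothetical table with 250 entries, held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entries (name TEXT, events INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?)",
    [(f"run_{i:04d}.dat", i) for i in range(250)],
)

chunk_sizes = [len(chunk) for chunk in stream_rows(conn, "SELECT * FROM entries")]
```

Because only one chunk is held at a time, memory use on the server stays bounded by the chunk size rather than by the size of the full result.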

ARDA Implementation
- Backends
  - Currently: Oracle, PostgreSQL, SQLite
- Two frontends
  - TCP Streaming: chosen for performance
  - SOAP: formal requirement of EGEE
  - Compare SOAP with TCP Streaming
- Also implemented as standalone Python library
  - Data stored on filesystem

TCP Streaming Frontend
- Text based protocol (like SMTP, POP3, …)
- Data streamed to client in single connection
- Implementation
  - Server: C++, multiprocess
  - Clients: C++, Java, Python, Perl, Ruby
- Example exchange
  Client: listattr entry
  Server: 0 entry value1 value2 … <EOT>
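A client for a text protocol of this kind reads a status code and then consumes fields until the end-of-transmission marker. The Python sketch below shows the reading loop only. The slide does not give the exact wire format, so the framing used here is an assumption: one field per line, a leading numeric status where 0 means success, and a line containing `<EOT>` as terminator. An in-memory stream stands in for the socket.

```python
import io

EOT = "<EOT>"  # assumed terminator line; the real framing may differ


def read_response(stream):
    """Read one response: a status line, then fields up to the EOT marker."""
    status_line = stream.readline()
    if not status_line:
        raise ConnectionError("stream closed before a status line arrived")
    status = int(status_line.strip())

    fields = []
    for line in stream:
        line = line.rstrip("\n")
        if line == EOT:
            return status, fields
        fields.append(line)
    raise ConnectionError("stream ended before the EOT marker")


# Simulated server answer to "listattr entry", using the assumed framing.
wire = io.StringIO("0\nentry\nvalue1\nvalue2\n<EOT>\n")
status, fields = read_response(wire)
```

Since fields are consumed as they arrive, the client can start processing a long answer before the server has finished sending it.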

SOAP Frontend
- Most operations in interface implemented as simple SOAP calls
- query(): based on iterators
  - Initial request: create session
    - Open cursor on DB
    - Return initial chunk of data and session token
  - Subsequent requests
    - Client calls nextQuery() using session token
  - Termination: session closed when
    - End of data
    - Client calls endQuery()
    - Client timeout
- Implementations
  - Server: gSOAP (C++)
  - Clients: tested WSDL with gSOAP, ZSI (Python), AXIS (Java)

Current Uses of the ARDA prototype
- Evaluated by LHCb-bookkeeping
  - Migrated bookkeeping metadata to ARDA prototype
    - 20M entries, 15 GB
  - Feedback valuable in improving interface and fixing bugs
  - Interface found to be complete
  - ARDA prototype showing good scalability
- Ganga (LHCb, ATLAS)
  - User analysis job management system
  - Stores job status on ARDA prototype
  - Highly dynamic metadata

Performance Study
- SOAP increasingly used as standard protocol for GRID computing
  - Promising web services standard: interoperability
- Some potential weaknesses
  - XML encoding increases message size (4x to 10x typical)
  - XML processing is compute and memory intensive
- How significant are these weaknesses? What is the cost of using SOAP?
- ARDA metadata implementation ideal for comparing SOAP with a traditional RPC protocol

Benchmark Description
- Protocols
  - TCP-S: TCP Streaming
  - SOAP: clients with gSOAP (C++), Axis (Java) and ZSI (Python)
- Operations
  - ping: a null RPC
  - add: adds an entry
  - get: gets all attributes of an entry
  - get (bulk): gets all attributes of several entries in a single operation
- Entries
  - 60 attributes (ints, floats and strings)
  - 700 bytes on average
- HTTP Keepalive/Persistent connections
Notes: HTTP Keepalive increases HTTP performance and should improve SOAP performance. gSOAP supports Keepalive; Axis and ZSI don’t. TCP-S uses persistent TCP connections, for comparison with HTTP Keepalive. No work is done on the backend by ping.

SOAP Data Overhead
- Measure size overhead of XML encoding
- Ping
  - 1000 requests
  - Minimal payload: less than 5 bytes per request
  - SOAP overhead around 8 times
- Get attributes in bulk
  - Retrieve 1000 entries
  - Around 800KB of application data
  - Streaming in TCP
  - Iterators with SOAP: 4KB average SOAP packet payload
  - With keepalive
  - SOAP overhead around 2.5 times
- [Chart: total data transferred (in KB)]

SOAP Toolkits performance
- Test protocol performance
  - No work done on the backend
  - Switched 100Mbits LAN
- Language comparison
  - TCP-S with similar performance in all languages
  - SOAP performance varies strongly with toolkit
- Protocols comparison
  - Keepalive improves performance significantly
  - On Java and Python, SOAP is several times slower than TCP-S
- [Chart: 1000 pings]
Notes: Mention that TCP-S has a large initial overhead due to protocol negotiation. Mention that Axis and ZSI don’t support HTTP keepalive. Mention that it was hard to create an interoperable WSDL. Java is faster due to initial negotiation, which is less chatty than with C++.

Single client results (LAN)
- Compare performance of different operations
  - C++ clients (gSOAP)
- When backend must do work, differences between gSOAP and TCP-S are small
- Bulk operations very important for performance
  - getBulk 4x faster than get
- [Chart: 1000 pings/1000 entries]
Notes: The main difference is between keepalive and no keepalive, rather than between SOAP and TCP-S. Bulk operations are very important for performance.

Single client results (WAN)
- Client CERN, server Taiwan
  - ≈300 ms latency
- Results dominated by latency
  - Execution time at server irrelevant
- Large performance boost from latency hiding techniques:
  - keepalive: fewer TCP handshakes
  - bulk operations: fewer client/server interactions
- [Chart: 1000 pings/1000 entries]
Notes: Ping, add and get have to perform 1000 requests, with the same results regardless of the work being done on the server side. Keepalive avoids TCP handshakes. Bulk operations further improve results: with TCP-S the client sends only a single request and the server streams the answer; with SOAP the client sends fewer requests, since each answer from the server contains many entries (between 5 and 6). TCP-S has a large initial overhead due to protocol negotiation. SOAP is twice as fast as TCP-S without keepalive, since it does not have to negotiate the protocol. With keepalive, it makes no difference whether SOAP or TCP-S is used when making individual requests. With bulk operations,
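Why keepalive and bulk operations help so much on a WAN can be seen with a back-of-the-envelope latency model. The Python sketch below is not a reproduction of the measurements: it assumes that the ≈300 ms figure is a round-trip time, that every request costs one round trip, that each new TCP connection costs one extra round trip for the handshake, and that server time, transfer time and protocol negotiation are negligible. The figure of 5.5 entries per SOAP response is the midpoint of the "between 5 and 6" mentioned in the notes.

```python
import math

RTT_MS = 300  # assumed round-trip time between client and server


def total_latency_ms(requests, keepalive, rtt_ms=RTT_MS):
    """Estimate time spent waiting on the network for a run of requests."""
    # With keepalive one connection is reused; without it, each request
    # opens a new connection and pays the handshake again.
    handshakes = 1 if keepalive else requests
    return (requests + handshakes) * rtt_ms


entries = 1000

# Individual requests: one request per entry.
single_no_keepalive = total_latency_ms(entries, keepalive=False)
single_keepalive = total_latency_ms(entries, keepalive=True)

# Bulk over SOAP iterators: each response carries several entries.
soap_requests = math.ceil(entries / 5.5)
bulk_soap = total_latency_ms(soap_requests, keepalive=True)

# Bulk over TCP streaming: a single request, the server streams the answer.
bulk_tcp_streaming = total_latency_ms(1, keepalive=True)

for label, value in [
    ("single, no keepalive", single_no_keepalive),
    ("single, keepalive", single_keepalive),
    ("bulk, SOAP iterators", bulk_soap),
    ("bulk, TCP streaming", bulk_tcp_streaming),
]:
    print(f"{label:22s} {value / 1000:8.1f} s")
```

Under these assumptions the number of round trips is the only thing that matters, which is consistent with the slide's point that execution time at the server is irrelevant on a high-latency link.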

Scalability with Multiple Clients - Pings
- Measure scalability of protocols
  - Switched 100Mbits LAN
- TCP-S 3x faster than gSOAP (with keepalive)
- Poor performance without keepalive
  - Around 1000 ops/sec (both gSOAP and TCP-S)
- [Chart: 1000 pings]
Notes: The graph shows the average throughput of the server.

Scalability with Multiple Clients - getAttr
- Measure scalability with realistic payload
  - Switched 100Mbits LAN
  - All tests with keepalive
- Smaller difference between gSOAP and TCP-S
  - TCP-S 2x faster (1000 vs 500 entries/sec)
- Poor performance of non-bulk operations
  - 100 entries/sec
- [Chart: 1000 entries]

Conclusions
- A common Metadata Interface was developed by ARDA and gLite
  - Endorsed by the EGEE standards committee
- Interface validated by ARDA prototype
  - Prototype in use by LHCb (bookkeeping, Ganga) and ATLAS (Ganga)
- SOAP performance studied using ARDA implementation
  - Toolkit performance varies widely
  - Large SOAP overhead (over 100%)