1
Metadata Services on the GRID
Nuno Santos, ACAT’05, May 25th, 2005
Notes: Introduce myself: PhD student at Coimbra, working on ARDA. This presentation is about the metadata work that we have been doing at ARDA.
2
Contents
- Metadata on the GRID
- ARDA-gLite Metadata Interface
- The ARDA Implementation
- Performance study: SOAP vs TCP Streaming
Notes: I’ll start with a brief overview of what metadata means to the GRID. Then I’ll present the Metadata Interface developed by ARDA and gLite, which addresses the most common use cases for GRID metadata. I’ll continue by describing the prototype implemented by ARDA to validate this interface. I’ll finish by presenting the results of a performance study made using this prototype, where SOAP is compared with a traditional RPC protocol based on streaming.
3
Metadata on the GRID
- Metadata is data about data
- Metadata on the GRID
  - Mainly information about files
  - Other information necessary for running jobs
  - Usually living on DBs
- Need simple interface for Metadata access
- Advantages
  - Easier to use by clients: no SQL, only metadata concepts
  - Common interface: clients don’t have to reinvent the wheel
- Must be integrated in the File Catalogue
  - Also suitable for storing information about other resources
Notes: First, what is metadata? A common definition is that metadata is data about data. On the GRID, this is mainly file metadata: information describing files that is necessary for running jobs. This metadata usually lives in databases, so in a way accessing metadata is mainly about accessing databases. But having clients go directly to the database is not the most convenient solution. It is better to have a simple interface for metadata access on the GRID, defined in terms of metadata concepts such as entries, keys and values instead of DB concepts. This has several advantages. It is easier for clients to use, since it exposes only metadata concepts and effectively hides the database. A simple interface that exposes the DB functionality solves most of the problems: in effect, a simplified relational database interface.
4
ARDA-gLite Metadata Interface
- ARDA proposed an interface for Metadata access on the GRID
  - Designed jointly with the gLite/EGEE team
  - Incorporates feedback from GridPP
  - Endorsed by the EGEE standards committee (PTF)
  - Being implemented in gLite File Catalog (FiReMan)
- Interface concepts
  - Metadata: key-value pairs
  - Entry: entity to which metadata is attached
  - Attribute: holds information about an entry
  - Schema: a collection of attributes
  - Type: the type (int, float, string, …)
  - Name/Key: the name of the attribute
  - Value: value of an entry's attribute
- Entries are associated with schemas
  - Think of schemas as tables, attributes as columns, entries as rows
Notes: Metadata are key-value pairs.
5
Interface Operations
- Schema management
  - void createSchema(String schemaName, Attribute[] attributes)
  - void dropSchema(String schemaName)
  - void removeSchemaAttributes(String schemaName, String[] attributeNames)
  - void addSchemaAttributes(String schemaName, Attribute[] attributes)
- Entry management
  - void createEntry(MDEntry[] entries, String[] schemas)
  - void removeEntry(String query)
  - int setAttributes(String query, Attribute[] attributes)
  - Attribute[] listAttributes(String entry)
Notes: Talk about the main types of tasks, and then go into the concrete operations. Update with the new interface.
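The "schemas as tables, attributes as columns, entries as rows" analogy can be sketched on top of SQLite, one of the backends named in the talk. This is only an illustration under assumptions: the function names mirror the interface operations, but the type mapping, the `entry` key column and the simplified argument shapes are hypothetical and not taken from the ARDA implementation.

```python
import sqlite3

# Hypothetical mapping from interface types to SQLite column types.
SQL_TYPES = {"int": "INTEGER", "float": "REAL", "string": "TEXT"}


def create_schema(db, schema_name, attributes):
    """createSchema: one table per schema, one column per attribute."""
    # Identifiers are interpolated directly; a real server would validate them.
    columns = ", ".join(f'"{name}" {SQL_TYPES[typ]}' for name, typ in attributes)
    db.execute(f'CREATE TABLE "{schema_name}" (entry TEXT PRIMARY KEY, {columns})')


def create_entry(db, schema_name, entry, values):
    """createEntry: one row per entry."""
    names = ", ".join(f'"{n}"' for n in values)
    marks = ", ".join("?" for _ in values)
    db.execute(
        f'INSERT INTO "{schema_name}" (entry, {names}) VALUES (?, {marks})',
        [entry, *values.values()],
    )


def list_attributes(db, schema_name, entry):
    """listAttributes: return the entry's attributes as a name -> value dict."""
    cur = db.execute(f'SELECT * FROM "{schema_name}" WHERE entry = ?', [entry])
    row = cur.fetchone()
    if row is None:
        raise KeyError(entry)
    names = [d[0] for d in cur.description]
    return {n: v for n, v in zip(names, row) if n != "entry"}


db = sqlite3.connect(":memory:")
create_schema(db, "jobs", [("events", "int"), ("owner", "string")])
create_entry(db, "jobs", "run42", {"events": 1000, "owner": "lhcb"})
attrs = list_attributes(db, "jobs", "run42")
print(attrs)
```

The client sees only schema, entry and attribute names; the SQL stays behind the three functions, which is the point made on the "Metadata on the GRID" slide.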
6
Interface Operations
- Searching and retrieving entries
  - MDResult query(MDQuery query)
  - MDResult nextQuery(String token, MDQuery query)
  - void endQuery(String token)
  - Allows either stateful or stateless server implementations
  - Other query types: XPath
- Datatypes
  - Attribute { String schema; String name; String type; String value }
  - MDEntry { String entry; Attribute[] attributes }
  - MDQuery { String query; String queryType }
  - MDResult { MDEntry[] entries; String token; Boolean done }
Notes: Mention the stateful vs stateless nature of searching and retrieving entries. An implementation can return the answer in a single go, or send it in chunks, using either a stateful or a stateless server.
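The query/nextQuery/endQuery cycle can be sketched with the four datatypes above and a toy stateful server. The datatype fields follow the slide; everything else (the `ToyStatefulServer` class, the prefix-matching rule, the chunk size and the token format) is a hypothetical stand-in, not the ARDA server.

```python
from dataclasses import dataclass, field
from itertools import islice
from typing import Dict, Iterator, List, Optional


@dataclass
class Attribute:
    schema: str
    name: str
    type: str
    value: str


@dataclass
class MDEntry:
    entry: str
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class MDQuery:
    query: str
    queryType: str


@dataclass
class MDResult:
    entries: List[MDEntry]
    token: Optional[str]
    done: bool


class ToyStatefulServer:
    """Hypothetical in-memory server; keeps one iterator per open query."""

    def __init__(self, entries, chunk_size=2):
        self._entries = entries
        self._chunk_size = chunk_size
        self._sessions: Dict[str, Iterator[MDEntry]] = {}
        self._next_id = 0

    def _chunk(self, token):
        chunk = list(islice(self._sessions[token], self._chunk_size))
        done = len(chunk) < self._chunk_size  # a short chunk means end of data
        if done:
            del self._sessions[token]
        return MDResult(chunk, token, done)

    def query(self, q: MDQuery) -> MDResult:
        # Toy matching rule (an assumption): the query string is a name prefix.
        matches = (e for e in self._entries if e.entry.startswith(q.query))
        token = f"s{self._next_id}"
        self._next_id += 1
        self._sessions[token] = matches
        return self._chunk(token)

    def nextQuery(self, token: str, q: MDQuery) -> MDResult:
        return self._chunk(token)

    def endQuery(self, token: str) -> None:
        self._sessions.pop(token, None)


def fetch_all(server, q):
    """Client side: keep calling nextQuery until the server reports done."""
    result = server.query(q)
    entries = list(result.entries)
    while not result.done:
        result = server.nextQuery(result.token, q)
        entries.extend(result.entries)
    return entries


server = ToyStatefulServer([MDEntry(f"run{i}") for i in range(5)])
names = [e.entry for e in fetch_all(server, MDQuery("run", "prefix"))]
print(names)
```

A stateless server could serve the same client loop by encoding the resume position in the token instead of keeping a session table, which is presumably why `nextQuery` takes the query again.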
7
ARDA Prototype
- Validate proposed interface
- Architecture:
  - Metadata organized in a hierarchy
    - Schemas can contain sub-schemas
    - Can inherit attributes
    - Analogy to file system: Schema ↔ Directory; Entry ↔ File
  - Stability with large responses
    - Send large responses in chunks
    - Otherwise preparing large responses could crash server
  - Stateful server
    - DB → Server: data streamed using DB cursors
    - Server → Client: response sent in chunks
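The hierarchy with inherited attributes might look like the following sketch. The `Schema` class is hypothetical, and so is the rule that a sub-schema's own attribute wins over an inherited one with the same name; the slides do not say how such clashes are resolved.

```python
class Schema:
    """A schema behaves like a directory: it has a parent and a path."""

    def __init__(self, name, attributes=None, parent=None):
        self.name = name
        self.own = dict(attributes or {})  # attribute name -> type
        self.parent = parent

    def path(self):
        prefix = self.parent.path() if self.parent else ""
        return f"{prefix}/{self.name}"

    def attributes(self):
        # Walk up the hierarchy; own attributes override inherited ones
        # (an assumption made for this sketch).
        inherited = self.parent.attributes() if self.parent else {}
        return {**inherited, **self.own}


root = Schema("lhcb", {"owner": "string"})
mc = Schema("mc", {"events": "int"}, parent=root)
print(mc.path(), mc.attributes())
```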
8
ARDA Implementation
- Backends
  - Currently: Oracle, PostgreSQL, SQLite
- Two frontends
  - TCP Streaming: chosen for performance
  - SOAP: formal requirement of EGEE
  - Compare SOAP with TCP Streaming
- Also implemented as standalone Python library
  - Data stored on filesystem
9
TCP Streaming Frontend
- Text based protocol (like SMTP, POP3, …)
- Data streamed to client in single connection
- Implementation
  - Server: C++, multiprocess
  - Clients: C++, Java, Python, Perl, Ruby
- Example exchange
  Client: listattr entry
  Server: 0 entry value1 value2 … <EOT>
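A client for a text protocol of this kind reads until the end-of-transmission marker and then splits the answer. The sketch below makes several assumptions that the slide does not confirm: that `<EOT>` is the ASCII EOT byte, that the first line is a status code with 0 meaning success, and that the remaining items arrive one per line. An in-memory stream stands in for the socket.

```python
import io

EOT = b"\x04"  # assumption: "<EOT>" on the slide is the ASCII EOT byte


def read_response(stream):
    """Read bytes until EOT, then split into a status code and payload lines."""
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            raise ConnectionError("stream closed before EOT")
        if byte == EOT:
            break
        buf += byte
    lines = buf.decode("utf-8").splitlines()
    status, payload = int(lines[0]), lines[1:]
    return status, payload


# Stand-in for a socket: a possible server answer to "listattr entry".
wire = io.BytesIO(b"0\nentry\nvalue1\nvalue2\n" + EOT)
status, payload = read_response(wire)
print(status, payload)
```

Because the terminator is in-band, the server can stream an answer of any length over the single connection without announcing its size first.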
10
SOAP Frontend
- Most operations in interface implemented as simple SOAP calls
- query() based on iterators
  - Initial request: create session
    - Open cursor on DB
    - Return initial chunk of data and session token
  - Subsequent requests
    - Client calls nextQuery() using session token
  - Termination: session closed when
    - End of data
    - Client calls endQuery()
    - Client timeout
- Implementations
  - Server: gSOAP (C++)
  - Clients: tested WSDL with gSOAP, ZSI (Python), AXIS (Java)
11
Current Uses of the ARDA prototype
- Evaluated by LHCb-bookkeeping
  - Migrated bookkeeping metadata to ARDA prototype
  - 20M entries, 15 GB
  - Feedback valuable in improving interface and fixing bugs
  - Interface found to be complete
  - ARDA prototype showing good scalability
- Ganga (LHCb, ATLAS)
  - User analysis job management system
  - Stores job status on ARDA prototype
  - Highly dynamic metadata
12
Performance Study
- SOAP increasingly used as standard protocol for GRID computing
  - Promising web services standard: interoperability
- Some potential weaknesses
  - XML encoding increases message size (4x to 10x typical)
  - XML processing is compute and memory intensive
- How significant are these weaknesses? What is the cost of using SOAP?
- ARDA metadata implementation ideal for comparing SOAP with a traditional RPC protocol
13
Benchmark Description
- Protocols
  - TCP-S: TCP Streaming
  - SOAP: clients with gSOAP (C++), Axis (Java) and ZSI (Python)
- Operations
  - ping: a null RPC
  - add: adds an entry
  - get: gets all attributes of an entry
  - get (bulk): gets all attributes of several entries in a single operation
- Entries
  - 60 attributes (ints, floats and strings)
  - 700 bytes on average
- HTTP Keepalive/Persistent connections
Notes: HTTP Keepalive increases HTTP performance and should improve SOAP performance. gSOAP supports Keepalive; Axis and ZSI don’t. TCP-S uses persistent TCP connections to compare with HTTP Keepalive. No work is done on the backend by ping.
14
SOAP Data Overhead
- Measure size overhead of XML encoding
- Ping
  - 1000 requests
  - Minimal payload: less than 5 bytes per request
  - SOAP overhead around 8 times
- Get attributes in bulk
  - Retrieve 1000 entries
  - Around 800KB of application data
  - Streaming in TCP
  - Iterators with SOAP: 4KB average SOAP packet payload
  - With keepalive
  - SOAP overhead around 2.5 times
[Chart: total data transferred (in KB)]
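The bulk-get figures can be combined into a rough back-of-the-envelope calculation. It rests on two readings that the slide leaves open: that "overhead around 2.5 times" is the ratio of SOAP bytes on the wire to application data, and that the 4KB average payload per SOAP packet counts application data rather than XML.

```python
app_data_kb = 800      # application data for 1000 entries (from the slide)
soap_overhead = 2.5    # assumed: SOAP bytes on the wire / application bytes
chunk_payload_kb = 4   # average application payload per SOAP packet
entry_bytes = 700      # average entry size (from the benchmark description)

soap_total_kb = app_data_kb * soap_overhead                 # 2000.0
round_trips = app_data_kb / chunk_payload_kb                # 200.0
entries_per_chunk = chunk_payload_kb * 1000 / entry_bytes   # about 5.7

print(soap_total_kb, round_trips, round(entries_per_chunk, 1))
```

Under these readings each SOAP answer carries between 5 and 6 entries, so retrieving 1000 entries takes on the order of 200 request/response cycles, where TCP streaming needs a single request.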
15
SOAP Toolkits performance
Test protocol performance No work done on the backend Switched 100Mbits LAN Language comparison TCP-S with similar performance in all languages SOAP performance varies strongly with toolkit Protocols comparison Keepalive improves performance significantly On Java and Python, SOAP is several times slower than TCP-S 1000 pings Mention that TCP-S has a large initial overhead due to protocol negotiation Mention that Axis and ZSI don’t support HTTP keepalive. Mention that it was hard to create an interoperable WSDL. - Java is faster due to initial negotiation, less chatty negotiation than with C++
16
Single client results (LAN)
- Compare performance of different operations
  - C++ clients (gSOAP)
- When backend must do work, differences between gSOAP and TCP-S are small
- Bulk operations very important for performance
  - getBulk 4x faster than get
[Chart: 1000 pings / 1000 entries]
Notes: The main difference is between keepalive (KA) and no keepalive, rather than between SOAP and TCP-S. Bulk operations are very important for performance.
17
Single client results (WAN)
- Client CERN, server Taiwan
  - ≈300 ms latency
- Results dominated by latency
  - Execution time at server irrelevant
- Large performance boost from latency hiding techniques:
  - keepalive: fewer TCP handshakes
  - bulk operations: fewer client/server interactions
[Chart: 1000 pings / 1000 entries]
Notes: Ping, add and get each have to perform individual requests, so the results are the same regardless of the work being done on the server side. Keepalive avoids TCP handshakes. Bulk operations further improve results: with TCP-S there is only a single request from the client and the server streams the answer; with SOAP there are fewer requests from the client, since each answer from the server contains several entries (between 5 and 6). TCP-S has a large initial overhead due to protocol negotiation. Without keepalive, SOAP is twice as fast as TCP-S, since it does not have to negotiate the protocol. With keepalive, it makes no difference whether SOAP or TCP-S is used for individual requests. With bulk operations,
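Why latency dominates on the WAN can be illustrated with a deliberately crude model. All of it is assumption rather than measurement: the 300 ms figure is treated as a round-trip time, each request costs one round trip, a connection without keepalive pays one extra round trip for the TCP handshake, and server time, transfer time and the TCP-S protocol negotiation are ignored.

```python
import math

RTT_MS = 300              # assumed round-trip time CERN <-> Taiwan
OPERATIONS = 1000         # pings or entries, as in the benchmark
ENTRIES_PER_ANSWER = 5.5  # SOAP bulk answers hold between 5 and 6 entries


def total_ms(requests, keepalive):
    """Total time if every request waits for a full round trip."""
    per_request = RTT_MS if keepalive else 2 * RTT_MS  # extra RTT: handshake
    return requests * per_request


individual_no_ka = total_ms(OPERATIONS, keepalive=False)
individual_ka = total_ms(OPERATIONS, keepalive=True)
soap_bulk = total_ms(math.ceil(OPERATIONS / ENTRIES_PER_ANSWER), keepalive=True)
tcps_bulk = total_ms(1, keepalive=True)  # one request, server streams the rest

print(individual_no_ka, individual_ka, soap_bulk, tcps_bulk)
```

Even this simple model reproduces the ordering described in the notes: keepalive halves the time for individual requests, and bulk operations cut it by further orders of magnitude because the number of round trips drops.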
18
Scalability with Multiple Clients - Pings
- Measure scalability of protocols
  - Switched 100 Mbit/s LAN
- TCP-S 3x faster than gSOAP (with keepalive)
- Poor performance without keepalive
  - Around ops/sec (both gSOAP and TCP-S)
[Chart: 1000 pings]
Notes: The graph shows the average throughput of the server.
19
Scalability with Multiple Clients - getAttr
- Measure scalability with realistic payload
  - Switched 100 Mbit/s LAN
  - All tests with keepalive
- Smaller difference between gSOAP and TCP-S
  - TCP-S 2x faster (1000 vs 500 entries/sec)
- Poor performance of non-bulk operations
  - 100 entries/sec
[Chart: 1000 entries]
20
Conclusions
- A common Metadata Interface was developed by ARDA and gLite
  - Endorsed by the EGEE standards committee
- Interface validated by ARDA prototype
  - Prototype in use by LHCb (bookkeeping, Ganga) and ATLAS (Ganga)
- SOAP performance studied using ARDA implementation
  - Toolkit performance varies widely
  - Large SOAP overhead (over 100%)