Alice Off-line Week, February 24th, 2005 ARDA Experiences with Metadata Alice Off-line Week, February 24th, 2005 Nuno Santos Metadata Solutions of the experiments A Generic Interface to Metadata An ARDA Metadata Service Prototype SOAP: A good way to Access Metadata? Conclusions
Experience with Metadata ARDA tested several Metadata solutions from the experiments: LHCb Bookkeeping – XML-RPC with Oracle backend CMS: RefDB - PHP in front of MySQL, giving back XML tables Atlas: AMI - SOAP-Server in Java in front of MySQL gLite (Alien Metadata) - Perl in front of MySQL parsing command, streaming back text Tested performance, scalability and features.
Results taken on November 2004 AMI & RefDB (SOAP) Both AMI & RefDB ship responses in single XML package Results taken on November 2004 Poor scalability, both in number of concurrent clients and size of response. Problems with memory consumption at server – Requests too large to fit in memory. 20 40 60 80 100 120 140 time-to-complete [s] #connections Average response time 19.5Mb/conn. 5.5Mb/conn. 1.2Mb/conn. 200 300 400 500 600 700 800 512MB limit 10 20 30 40 50 60 80 100 120 140 Time to completion [s] Number of rows selected 150 30 Clients 5 1 Results on RefDB:J. Andreeva, J. Herrala
LHCb(XML-RPC) LHCb uses XML-RPC (predecessor of SOAP): Being also “one shot” query based, the solutions suffers from the same problems as the two based on SOAP W.-L. Ueng
gLite/Alien(Streaming) gLite/Alien streams responses to the perl implemented shell Errors Time to completion # Clients Time to Completion [s] Selecting 2.5k files of 10k 5 10 15 20 25 30 35 40 60 80 100 Streaming offers good scalability
Synthesis Generally, the scalability is poor Sending big responses in a single packet limits scalability Use of SOAP and XML-RPC worsens the problem Schema evolution not really supported RefDB and Alien don't do schema evolution at all. AMI, LHCb-Bookkeeping via admins adjusting tables. No common Metadata interface Experience with existing Software Propose Generic Interface & Prototype as Proof-Of-Concept
Design decisions Metadata organized as a hierarchy - Collect objects with shared attributes into collections Collections stored as a table - allows queries on SQL tables Analogy to file system: Collection <-> Directory Entry <-> File Abstract the backend – Allows supporting several backends. PostgreSQL, MySQL, Oracle, filesystem… Values restricted to ASCII strings Backend is unknown Scalability: Transfer large responses in chunks, streaming from DB Decreases server memory requirements, no need to store full response in memory Implement also a non-SOAP protocol Compare with SOAP. What is the performance price of SOAP?
Terminology Define common terms, first Metadata: Key-Value pairs Any data necessary to work on the grid not living in the files Entry: Entities to which metadata is attached Denoted by a string, format like file-path in Unix, wild-cards allowed Collection: Set of entries Collections are themselves entries, think of Directories Attribute: Name or key of piece of metadata Alphanumerical string with starting letter Value: Value of an entry's attribute Printable ASCII string Schema: Set of attributes of an entry Classifies types of entries: Collection's schema inherited by its entries Storage Type: How back end stores a value Back end may have different (SQL) datatypes than application Transfer Type: How values are transferred Values are transported as printeable ASCII
Interface I Interface designed in discussions with gLite-team, GridPP metadata working group and GAG. Accepted by PTF. Presented here as a C style API. Entry management: int addEntry(string entry, string type) int addEntries(list<string> entries, list<string> types) int removeEntries(string pattern) Schema management and metadata reading/writing: int addAttr(string collection, list<string> keys, list<string> types) int removeAttr(string collection, list<string> keys) int setAttr(string pattern, list<string> keys, list<string> values) int clearAttr(string pattern, string key) With R. Rocha(gLite), V. Pose, B. Koblitz
Interface: Retrieving data Bulk transfers to client - session handlers and iterators on the backend: int getAttr(string pattern, list<string>keys, Handler handler) int listAttr(string entry, Handler handler) int getNext(handle_t handle, DataChunk dc) int abort(handle_t handle) Searching for entries: int find(string pattern, string query, Handler &handle) int listEntries(string pattern, Handler &handle) Example query:'tracks > 10 and sin(p_angle) <0.5 and trigger & 2' With R. Rocha(gLite), V. Pose, B. Koblitz
Protocols implemented Two implementations: SOAP - Fulfil formal requirements on EGEE Server – C++ (gSOAP). Clients – Tested WSDL with C++ (gSOAP), Python (ZSI), Java (AXIS) TCP/IP streaming – For efficiency Text based protocol (like SMTP, POP3, and many others) Server – C++ Clients - C++, Java, Python, Perl, Ruby SSL for authentication and encryption implemented Client: getattr file(s) key1 key2… Server: 0 file value1 value2 …
TCP/IP Streaming Good experiences with ARDA prototype using streaming: Results from November 2004 Now very stable after tuning. Collaboration with LHCb - large scale test Database with 40M entries GANGA back end gLite(Alien) Time to Completion [s] # Clients 20 40 60 80 100 5 25 10 15 30 35 Crashes Selecting 2.5k entries of 10k ARDA Entries / s [1/s] Entries per directory/collection 5 10 15 20 25 30 1000 10000 100000 Attach MD (Alien) Create Entry (Alien) Create Entry (ARDA) Attach MD (ARDA) With V.Pose
TCP Streaming Streaming operations: Significant performance gains Open a transaction on the database (cursor) Stream the whole response to the client in a single TCP connection Significant performance gains Server throughput around 8 times higher
SOAP Toolkits What is the state of current SOAP toolkits? C++ - gSOAP Python – ZSI Java – Axis 1.2RC2 Ease of development? WSDL generation: From a .h file using gSOAP. After two days – a WSDL working fine with a gSOAP client. But not with other toolkits. From a Java interface using AXIS Resulting WSDL generated warnings when validated by other toolkits In the end - WSDL written manually Only known way of creating a WSDL that is accepted without warnings by toolkits being used. Ease of development? Client side: Not so bad with proper WSDLs and good toolkits 1 hour from WSDL to a working Java application using Axis 1.2RC2. What about performance? N. Santos
SOAP Toolkits performance Performance varies significantly gSOAP is fast AXIS is more than10 times slower ZSI is more than 20 times slower!! SOAP serialization performance is not too important for the client But essential for scalable servers. gSOAP is an interesting option.
SOAP vs TCP What is the overhead of a SOAP call compared to a TCP request? Around 6 times if using HTTP keepalive to improve SOAP. More than 10 times if no HTTP keepalive used. TCP connections are kept open between requets.
SOAP Iterators – Retrieving large results Iterators are used in SOAP as an alternative to streaming. Server throughput improved by 3 times But still far from TCP streaming
Conclusions on SOAP Tests show: SOAP performance is generally poor when compared to TCP streaming gSOAP is significantly faster than other toolkits. Iterators (with stateful server) helps considerably – results returned in small chunks. SOAP puts an additional load on server SOAP interoperability very problematic Writing an interoperable WSDL is hard Not all toolkits are mature
(discussed with gLite, GridPP, GAG, accepted by PTF) Conclusions Many problems understood studying metadata implementations of experiments Common requirements exist ARDA proposes generic interface to metadata on Grid: Retrieving/Updating of data Hierarchical view Schema discovery and management (discussed with gLite, GridPP, GAG, accepted by PTF) Prototype with SOAP & streaming front end built SOAP can be as fast as streaming in many cases SOAP toolkits still immature http://project-arda-dev.web.cern.ch/project-arda-dev/metadata/