Benchmarking XML Processors for Applications in Grid Web Services
Michael R. Head*, Madhusudhan Govindaraju*, Robert van Engelen**, Wei Zhang**
*Grid Computing Research Laboratory, Binghamton University (SUNY)
**Florida State University

GCRL Binghamton University

Outline
● Motivation
● XML Performance Obstacles
● Benchmark Suite
● Results for a Variety of XML Processors
● Recommendations and Conclusions
● Future Work

XML Defined
● Text based (usually UTF-8 encoded)
● Tree structured
● Language independent
● Generalized data format

Motivation from SOAP
● Generalized RPC mechanism
● Broad industrial support
● Web Services on the Grid
  – OGSA: Open Grid Services Architecture
  – WSRF: Web Services Resource Framework
● At bottom, SOAP depends on XML

XML Exclusive of SOAP
● General structured data format
● Becoming standard for many scientific datasets
  – HapMap: mapping genes
  – Protein Sequencing
  – NASA astronomical data
  – Many more instances

Benchmark Motivation
● Grid applications place a wide range of requirements on the communication substrate and data formats.
● Simple and straightforward implementations can have a severe performance impact.

XML Performance Limitations
● Compared to “legacy” formats
  – Text-based
    ● Lacks any “header blocks” (e.g., TCP headers), so a parser must scan every character to tokenize
    ● Numeric types take more space and conversion time
  – Lacks indexing
    ● Unable to quickly skip over fixed-length records

Array size: SOAP vs. Binary
[Figure] 5 times difference in size

CPU Usage when parsing doubles
[Figure] 90% of CPU time is being spent in floating point conversions

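The size and conversion costs described on the preceding slides can be illustrated with a minimal sketch. It assumes a hypothetical `<item>` element per value rather than a real SOAP array encoding, so the ratio it computes is not the one measured in the talk. All function names here are illustrative.

```python
import random
import struct

def encode_text(values):
    """Encode doubles as XML-like text, one hypothetical <item> element each."""
    return "".join(f"<item>{v!r}</item>" for v in values).encode("utf-8")

def encode_binary(values):
    """Encode the same doubles as packed 8-byte IEEE 754 values."""
    return struct.pack(f"<{len(values)}d", *values)

def decode_text(payload):
    """Naive decode: scan the text, then convert every number from its string form."""
    body = payload.decode("utf-8")
    parts = [p for p in body.split("</item>") if p]
    return [float(p[len("<item>"):]) for p in parts]  # text-to-double conversion

def decode_binary(payload):
    """Binary decode: fixed-length records, no character scanning or text conversion."""
    return list(struct.unpack(f"<{len(payload) // 8}d", payload))

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [rng.random() for _ in range(1000)]

text = encode_text(values)
binary = encode_binary(values)
size_ratio = len(text) / len(binary)
print(f"text: {len(text)} bytes, binary: {len(binary)} bytes, ratio: {size_ratio:.1f}")
```

Both decoders return the original values. The text path does so only after scanning every character and calling `float()` once per element, which is the conversion cost the CPU usage slide refers to.
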
Parsing Optimizations in Use
● Look-aside buffers/String caching [gsoap, XPP]
● Trie data structure with schema-specific parser
● One pass table-driven recursive descent parser [TDX]

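As a rough illustration of the trie idea, the sketch below builds a trie over element names that a schema is assumed to declare and matches a tag name in one left-to-right walk. This avoids comparing against every known name. It is not the implementation used by any parser named above; `TagTrie` and its methods are hypothetical.

```python
class TagTrie:
    """Trie over element names known in advance (e.g., taken from a schema)."""

    _END = None  # dictionary key marking the end of a complete name

    def __init__(self, names):
        self.root = {}
        for token_id, name in enumerate(names):
            node = self.root
            for ch in name:
                node = node.setdefault(ch, {})
            node[self._END] = token_id

    def match(self, text, start):
        """Match a known name at text[start].

        Returns (token_id, end_index) on success, or (None, start) if the
        name at this position is not one of the known names.
        """
        node = self.root
        i = start
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
        token = node.get(self._END)
        # The name must end at a delimiter, so "itemx" is not accepted as "item".
        at_boundary = i == len(text) or text[i] in " \t\r\n/>"
        if token is not None and at_boundary:
            return token, i
        return None, start


trie = TagTrie(["Envelope", "Body", "item"])
print(trie.match("<item>1.5</item>", 1))
```

The returned token id can index directly into a dispatch table, which is the kind of lookup a schema-specific or table-driven parser relies on.
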
Benchmark Suite
1) A chosen set of XML documents
  – Low level probes
  – Application-based benchmarks
2) A driver application for each XML processor
  – Runs the parser on the input, but does not act on the data
    ● Eliminates application-level performance differences
    ● One for each interface style (SAX/DOM)

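A driver of the kind described above might look like the following sketch. It uses the parsers in the Python standard library as stand-ins for the C and Java processors in the suite. The names `NullHandler`, `run_sax`, `run_dom`, and `benchmark` are assumptions for illustration, not part of the benchmark suite itself.

```python
import time
import xml.dom.minidom
import xml.sax


class NullHandler(xml.sax.ContentHandler):
    """SAX handler that receives every event but does nothing with it."""


def run_sax(data):
    # Event-style interface: the parser calls back into a no-op handler.
    xml.sax.parseString(data, NullHandler())


def run_dom(data):
    # Tree-style interface: the parser builds the tree, which is then discarded.
    xml.dom.minidom.parseString(data)


DRIVERS = {"sax": run_sax, "dom": run_dom}  # one driver per interface style


def benchmark(data, repetitions=100):
    """Return the mean seconds per parse for each driver."""
    results = {}
    for name, driver in DRIVERS.items():
        start = time.perf_counter()
        for _ in range(repetitions):
            driver(data)
        results[name] = (time.perf_counter() - start) / repetitions
    return results


sample = b'<?xml version="1.0"?><a/>'
timings = benchmark(sample, repetitions=10)
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per parse")
```

Because neither driver inspects the parsed data, any timing difference comes from the parser and its interface style rather than from application logic.
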
Benchmark Probes
● Overhead test
  – Minimal XML document (header plus one self-closing element)
● Buffering
  – Repeated use of xsi:type attributes
● Namespace management
  – Gratuitous use of xmlns attributes
● SOAP payloads

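The probe documents listed above could be generated along these lines. This is a sketch only: the element names, prefixes, and `urn:example:` namespace URIs are placeholders, and the actual probe files in the suite may differ.

```python
import xml.etree.ElementTree as ET

HEADER = '<?xml version="1.0" encoding="UTF-8"?>'
XSI = "http://www.w3.org/2001/XMLSchema-instance"
XSD = "http://www.w3.org/2001/XMLSchema"


def overhead_probe():
    """Minimal document: XML header plus one self-closing element."""
    return HEADER + "<a/>"


def xsi_type_probe(n):
    """Repeat the same xsi:type attribute on n elements (buffering probe)."""
    items = "".join(f'<item xsi:type="xsd:int">{i}</item>' for i in range(n))
    return f'{HEADER}<array xmlns:xsi="{XSI}" xmlns:xsd="{XSD}">{items}</array>'


def namespace_probe(n):
    """Declare a new prefix on each of n elements (namespace management probe)."""
    items = "".join(
        f'<p{i}:item xmlns:p{i}="urn:example:ns{i}"/>' for i in range(n)
    )
    return f"{HEADER}<root>{items}</root>"


# Check that each probe is well formed by parsing it once.
for doc in (overhead_probe(), xsi_type_probe(3), namespace_probe(3)):
    ET.fromstring(doc.encode("utf-8"))
```

Varying `n` scales the stress each probe applies while the document structure stays fixed.
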
Application Benchmarks
● Ptolemy Workflow documents (which Kepler uses)
● Genetic data files
  – (Large) files from the International HapMap Project
● Molecular data
● Mesh interface objects, event streams (WSMG)
● WS-Security documents
● Eager for more

Results – Latency Overhead

C Parsers: SOAP Payloads

C Parsers: Application-level tests

Java Parsers: SOAP Payloads

Java Parsers: Application-level tests

TDX Performance
[Figure] SOAP payload of array of strings

Recommendations
● When handling disparate XML formats or using different parsers, consider a pluggable XML handling mechanism
● Schema-specific parsing techniques (TDX, for example) are very promising when schemas are known in advance
● When considering designs for multi-core architectures, using TDX may be far faster than attempting to parallelize the other existing processors

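One way to read the first recommendation is as a small registry that lets an application choose a parser per document type at run time. The sketch below assumes Python standard-library parsers as the pluggable back ends. `register`, `PARSERS`, and `parse_with` are hypothetical names, not an API from the talk.

```python
import xml.dom.minidom
import xml.etree.ElementTree as ET

PARSERS = {}  # maps a short name to a parse function


def register(name):
    """Decorator that adds a parse function to the registry under `name`."""
    def decorator(func):
        PARSERS[name] = func
        return func
    return decorator


@register("dom")
def parse_dom(data):
    return xml.dom.minidom.parseString(data)


@register("etree")
def parse_etree(data):
    return ET.fromstring(data)


def parse_with(name, data):
    """Parse `data` with the back end registered under `name`."""
    try:
        parser = PARSERS[name]
    except KeyError:
        raise ValueError(f"no XML parser registered under {name!r}") from None
    return parser(data)


doc = b"<a><b/></a>"
print(parse_with("etree", doc).tag)
print(parse_with("dom", doc).documentElement.tagName)
```

A schema-specific parser could be registered the same way under its own name. The calling code stays unchanged when a faster back end becomes available for a given format.
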
Community Relations
● Publicly available benchmark suite
● Encourage vendors, users, developers to contribute additional XML parsers and sample files as necessary

Future Work
● Various techniques to parallelize XML processing
● Add new XML parser tests to the suite
  – Add more tests for existing parsers
● Include more sample files
● Update web site with current performance snapshots

Questions