Getting Data out of XML Documents Bálint Joó School of Physics University of Edinburgh May 02, 2003.

Contents
- In search of a simple API for accessing DOM
- The multiple tag problem
  - What is it?
  - Is it a problem for us?
  - How can we get around it?
- XPath
- What is easy to parse?
- Software: XPathReader package
- Conclusions

Motivation (Starting Points)
- Lack of free data-binding tools for C/C++
- Desire to read ILDG metadata documents and marshal application data
  => have to write our own tools
- Would like a simple API to get at document data
- Would like the same API to cope with ILDG metadata AND application data
- We got as far as reading into a DOM

Start With Simple Idea
- Consider a simple API with functions:
    push(tagname): select the tag with name tagname
    pop(): move up a level
    getType(tagname, result), where Type = string | float | double | int | bool
- Equivalent API: a directory-like structure with no absolute paths:
    cd(tagname) = push(tagname), cd(..) = pop()
- Simple data: no attributes, no namespaces, no empty elements

Example
    <foo>
      <bar>String</bar>
      <fred>5.0</fred>
    </foo>

    open("file.xml");
    push("foo");
    string bar;
    getString("bar", bar);
    double fred;
    getDouble("fred", fred);
    pop();
- So far so good: nice and simple
- Current UKQCD Schema has no attributes/namespaces
- Empty tags serve no purpose except as placeholders
- BUT soon we encounter...

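The Multiple Tag Problem
- Consider the following snippet:
    <size>
      <axis>
        <dimension>1</dimension>
        <length>16</length>
      </axis>
      <axis>
        <dimension>2</dimension>
        <length>16</length>
      </axis>
    </size>
- Let's try our API: push("size");
- But what does push("axis"); do?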

Multiple Tag Problem (cont'd)
- push("axis") could select in document order
- We could add an index to push("axis"): push("axis", 1), push("axis", 2)
- We could add an index attribute to <axis>, but then we'd need a mechanism to match the index attribute
- We could change the names of the <axis> tags
- We could put the different <axis> tags into different namespaces: effectively the same as adding an attribute
- We could try to match the <dimension> tag
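The indexed push("axis", 2) option can be sketched on a toy in-memory tree. This is only an illustration under assumptions: the Element and Cursor types and makeSizeDocument() are hypothetical and not part of any library from the talk, the child tag names dimension and length are assumed, and the index is taken to be 1-based in document order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy element (hypothetical): a tag name, text content and child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Hypothetical local cursor offering push/pop with an occurrence index.
// The tree must outlive the cursor and must not be modified while in use.
class Cursor {
public:
    explicit Cursor(const Element& document) { stack_.push_back(&document); }

    // Select the index-th (1-based, document order) child called tagname.
    // Returns false and leaves the cursor unchanged if there is no such child.
    bool push(const std::string& tagname, std::size_t index = 1) {
        assert(index >= 1);  // occurrence indices are assumed to be 1-based
        std::size_t seen = 0;
        for (const Element& child : stack_.back()->children) {
            if (child.name == tagname && ++seen == index) {
                stack_.push_back(&child);
                return true;
            }
        }
        return false;
    }

    // Move up one level; the document node itself is never popped.
    void pop() {
        if (stack_.size() > 1) stack_.pop_back();
    }

    // Copy the text of the first child called tagname into result.
    bool getString(const std::string& tagname, std::string& result) const {
        for (const Element& child : stack_.back()->children) {
            if (child.name == tagname) {
                result = child.text;
                return true;
            }
        }
        return false;
    }

private:
    std::vector<const Element*> stack_;  // path from the document node down
};

// Build a document node holding the two-axis <size> snippet (tag names assumed).
inline Element makeSizeDocument() {
    Element axis1{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}};
    Element axis2{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}};
    Element size{"size", "", {axis1, axis2}};
    return Element{"", "", {size}};
}
```

With this sketch, push("size") followed by push("axis", 2) lands on the second <axis>, and getString("dimension", d) then reads its dimension. A plain push("axis") falls back to the first occurrence, which is the implicit document-order behaviour.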

The Consequences
- Changing tag names for simplicity of parsing just seems wrong
- Matching the <dimension> tag is not possible without first selecting an <axis> in our scheme (locality)
- Adding attributes/namespaces complicates the API; this use of different namespaces would be philosophically wrong
- Adding an order-of-occurrence index to the API is cleanest: no need to change the Schema, instance documents, etc.
- Document ordering removes random access capability

In General
- For less simple (more general) XML documents, duplicate tags can be distinguished by:
  - Occurrence order
  - Name
  - Attributes
  - Content
  - Namespace
- An ideal, simple API should allow matching on all of these to interrogate any XML document

What about Locality?
- push(namespace, tagname, attributes, occurrence)
- getType(ns, tagname, attributes, occurrence, result)
- But NO local parser can match on element content: we need to open a tag based on the value of its content, BUT we can't get to the content without opening the tag
- Example (snippet values: 2 2 16 1 16): the <axis> elements need not appear in order of their <dimension> content
- Document order may not help here, yet the Schema document is still satisfied
- Would like to match on the <dimension> tag
- Need to abandon locality

Lesson
- In order to avoid ambiguity we must either
  - restrict the form of markup we deal with (force decisions onto our Schema writers), OR
  - complicate our API:
    - rely on tag ordering (either implicitly or explicitly)
    - introduce attributes (forcing a decision on Schema writers)
    - give up locality in the API

Global Queries: XPath
- Would like a nice way to encode:
  - tag name
  - attributes
  - order of occurrence
  - attribute/content matching predicates
- Can this be done? YES! Using XPath

XPath Axes
- XPath axes specify coordinates for the DOM, relative to the current node:
  - Parent axis: ..
  - Attribute axis: @
  - Child axis: ./
  - Following-sibling axis (no compact selector)
  - Preceding-sibling axis (no compact selector)
- Some axes can include more than one node, e.g. the ancestor axis: the parent and all its ancestors

XPath Selectors
- tagname: selects all children of the current node called tagname
- *: selects all children of the current node
- @name: selects all attribute nodes called name
- @*: selects all attribute nodes of the current node
- name[i]: selects the i-th occurrence of a child node called name
- ..: selects the parent of the current node
- //name: selects name with any set of ancestors

XPath Examples (each query is applied to the two-axis <size> snippet from The Multiple Tag Problem)
- XPath Query: /
- Selection: the document root

XPath Examples
- XPath Query: /size
- Selection: the <size> element

XPath Examples
- XPath Query: /size/axis OR /size/* OR //axis
- Selection: both <axis> elements

XPath Examples
- XPath Query: /size/axis[2] (query on order of occurrence) OR /size/axis[dimension="2"] (query on element content)
- Selection: the second <axis> element

XPath Examples
- XPath Query: /size/bj:axis
- Selection: <axis> children of <size> in the namespace bound to the prefix bj (namespaces are supported)

XPath Examples
- XPath Query: /size/axis[@index="2"]
- Selection: the <axis> whose index attribute equals 2 (attribute matching)
- Visit http://www.zvon.org/xxl/XPathTutorial for more...

XPath Notes
- Can return sets of nodes, not just a unique node
- Has more features: functions to turn query results into strings, numbers, booleans
- Encodes all the features we need
- C/C++ linkable XPath processors exist: Xerces, Xalan, libxml
- Solves all our reader API problems in a nice way

XPath Based Reader API
- Basic functions:
    open(file/stream);
    getType(xpath_string, result);
    getAttributeType(xpath_string, attributeName, result);
- Semantics: the xpath_string must identify a unique node
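The "must identify a unique node" semantics can be sketched without any XML library. The code below is a toy, not the XPathReader implementation and not libxml2: Element, selectNodes(), makeSizeDocument() and this getXPath() signature are all hypothetical, the tag names dimension and length are assumed, and only absolute paths built from "name" or "name[n]" steps are understood.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Toy element (hypothetical): a tag name, text content and child elements.
struct Element {
    std::string name;
    std::string text;
    std::vector<Element> children;
};

// Evaluate a tiny XPath subset: absolute paths whose steps are "name" or
// "name[n]". Predicates on content or attributes are NOT supported here.
std::vector<const Element*> selectNodes(const Element& document,
                                        const std::string& xpath) {
    assert(!xpath.empty() && xpath[0] == '/');  // absolute paths only
    std::vector<const Element*> current{&document};
    std::istringstream steps(xpath.substr(1));
    std::string step;
    while (std::getline(steps, step, '/')) {
        std::string name = step;
        std::size_t position = 0;  // 0 means "no positional predicate"
        const std::size_t open = step.find('[');
        if (open != std::string::npos) {
            name = step.substr(0, open);
            position = std::stoul(step.substr(open + 1));  // parsing stops at ']'
        }
        std::vector<const Element*> next;
        for (const Element* node : current) {
            std::size_t seen = 0;  // positions are counted per parent node
            for (const Element& child : node->children) {
                if (child.name != name) continue;
                ++seen;
                if (position == 0 || seen == position) next.push_back(&child);
            }
        }
        current = next;
    }
    return current;
}

// Read the content of the node identified by xpath into result.
// Throws unless the query matches exactly one node.
template <typename T>
void getXPath(const Element& document, const std::string& xpath, T& result) {
    const std::vector<const Element*> nodes = selectNodes(document, xpath);
    if (nodes.size() != 1) {
        throw std::runtime_error("XPath must identify a unique node: " + xpath);
    }
    std::istringstream in(nodes.front()->text);
    if (!(in >> result)) {
        throw std::runtime_error("cannot convert content at: " + xpath);
    }
}

// Build a document node holding the two-axis <size> snippet (tag names assumed).
inline Element makeSizeDocument() {
    Element axis1{"axis", "", {Element{"dimension", "1", {}}, Element{"length", "16", {}}}};
    Element axis2{"axis", "", {Element{"dimension", "2", {}}, Element{"length", "16", {}}}};
    Element size{"size", "", {axis1, axis2}};
    return Element{"", "", {size}};
}
```

In this sketch, getXPath(doc, "/size/axis[2]/dimension", value) succeeds because the path names one node, while "/size/axis/dimension" matches two nodes and is rejected with an exception. Rejecting ambiguous queries is what keeps the reader call as simple as the original getType idea.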

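What is Easy to Parse?
- Stylistic discussion on the Metadata mailing list
- One particular question: "How should we mark up things?"
- Chris' Way: values as element content (4, X 16, Y 16), e.g.
    <axis><name>X</name><length>16</length></axis>
    <axis><name>Y</name><length>16</length></axis>
- Tomoteru's Way: values as attributes, e.g.
    <x value="16"/>
- Known as the "Element vs. Attribute" debate in the XML world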

What is Easy to Parse?
- One claim is that the attribute way is perhaps easier to parse
- With XPath, both ways are easy to parse. To get the length of the x dimension:
  - Chris' Way:
      number(//size/axis[normalize-space(string(name))="X"]/length)
      getInt("//size/axis[normalize-space(string(name))=\"X\"]/length", intValue);
  - Tomoteru's Way:
      number(//size/x/@value)
      getIntAttribute("//size/x", "value", intValue);
- Chris' Way has a more complex query, but an equally simple API call

Element vs. Attribute Debate (aside)
- Looked on the Web
- Tomoteru's way is preferred in general by object modellers (e.g. database people):
  - Mark up most "atomic" data as attributes
  - Use tags to indicate "table structure"
- Chris' way is perhaps preferred by archivists or librarians (Go Kim!)
- Decide for yourself; a discussion is available at: http://www.oasis-open.org/cover/elementsAndAttrs.html
- Found no universally accepted best practice

Software: XPathReader
- Wrote software to implement the XPath Reader API in C++
- Wraps around the free libxml2 (C) library
- Uses overloading and templating
- Two classes:
  - BasicXPathReader: use XPath to get at basic C++ types (ints, std::strings, etc.)
  - XPathReader: allows reading of Complex Numbers and Arrays

XPathReader Class Public Members
    // open/close functions
    void open(istream& is);
    void close(void);

    // get value of attribute from node identified by XPath
    template <typename T>
    void getXPathAttribute(const string& xpath_to_node, const string& attribute_name, T& result);

    // get value of node identified by XPath
    template <typename T>
    void getXPath(const string& xpath, T& result);

    // count results of XPath query
    int countXPath(const string& xpath_query);

Complex Numbers and Arrays
- XPathReader library provides classes for Complex Numbers and Arrays:
    template <typename T> class TComplex { ... };
    template <typename T> class Array { ... };
- Can have Complex numbers of arrays, e.g. for storing real/imaginary parts of arrays: TComplex< Array<T> >
- Can also have Complex-es templated on strings; mathematically not sensible...

Complex Number Markup & Marshal
- Invented simple mark up:
    <cmpx>
      <re>real part</re>
      <im>imag part</im>
    </cmpx>
- Can maintain the API through C++ function overloading and recursion:
    template <typename T>
    void getXPath(const string& path, TComplex<T>& result) {
      getXPath(path + "/cmpx/re", result.real());
      getXPath(path + "/cmpx/im", result.imag());
    }
- Similar but slightly more involved for Array
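The overloading trick can be tried out in isolation. In the sketch below the XML document is replaced by a hypothetical FakeDocument (a map from path to text), and this TComplex is a minimal stand-in rather than the library's class. The extra doc parameter and the /propagator/mass path in the usage note are likewise assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parsed document: path -> text content.
typedef std::map<std::string, std::string> FakeDocument;

// Minimal stand-in for the library's TComplex (assumed interface:
// real() and imag() return writable references).
template <typename T>
class TComplex {
public:
    T& real() { return re_; }
    T& imag() { return im_; }

private:
    T re_{};
    T im_{};
};

// Base case: convert the text stored at path into a basic type.
template <typename T>
void getXPath(const FakeDocument& doc, const std::string& path, T& result) {
    assert(!path.empty());
    const FakeDocument::const_iterator it = doc.find(path);
    if (it == doc.end()) throw std::runtime_error("no node at: " + path);
    std::istringstream in(it->second);
    if (!(in >> result)) throw std::runtime_error("cannot convert: " + path);
}

// Overload for complex values: more specialised than the base case, so the
// compiler picks it for TComplex<T> and recurses into the parts.
template <typename T>
void getXPath(const FakeDocument& doc, const std::string& path,
              TComplex<T>& result) {
    getXPath(doc, path + "/cmpx/re", result.real());
    getXPath(doc, path + "/cmpx/im", result.imag());
}
```

If the fake document holds "1.5" at /propagator/mass/cmpx/re and "-2" at /propagator/mass/cmpx/im, then getXPath(doc, "/propagator/mass", z) with a TComplex<double> z fills both parts through the same call a plain double would use. The caller never spells out the cmpx/re and cmpx/im sub-paths.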

Array Markup
- Arrays were marked up as follows:
    <array sizeName="size" elemName="el" indexName="idx" indexStart="x">
      <size>N</size>
      <el idx="x">element[0]</el>
      <el idx="x+1">element[1]</el>
      ...
      <el idx="x+N-1">element[N-1]</el>
    </array>
- This is a general mark up, suitable for local parsers too

Array Mark-Up Example
    <array sizeName="num_dimensions" elemName="axis" indexName="dimension" indexStart="1">
      <num_dimensions>4</num_dimensions>
      ...
    </array>
- Minimally invasive:
  - Insert <array> tags
  - Copy <dimension> tag to attribute
- Easy to implement with an XSL transformation
- Working group needn't amend the current metadata schema for it

Conclusions
- Discussed API issues for parsing XML without full "data binding" tools
- Discussed the repeated tag problem
- Concluded that XPath is a simple and elegant way to solve the problem; hopefully convinced you too
- Discussed a C++ implementation of an XPathReader API
- Discussed how to parse compound data types
- Described markup for Complex Numbers and Arrays
- Suggest that the Complex and Array markup be standardised by the Metadata Working Group (but not necessarily that it be used in metadata documents), to assist sharing of data

References/Links
- XML, DOM, XPath: http://www.w3.org
- Tutorials (XPath/XSLT): http://www.zvon.org
- libxml2: http://www.xmlsoft.org
- Elements vs. Attributes (and other discussions): http://www.oasis-open.org/cover/elementsAndAttrs.html
- XPathReader software: send email to me (bj@ph.ed.ac.uk), or see the SciDAC CVS repository at JLAB (xpath_reader)
- SciDAC: http://www.lqcd.org
