XML Refresher Course
Bálint Joó
School of Physics, University of Edinburgh
May 02, 2003
Contents
- XML Documents
- Basic Structure
- Parsing via SAX
- Document Object Model (DOM)
- Basic Tree Representation
- DOM Node Types
- DOM Notes
- Conclusions
XML Documents
- Begin with a prologue
- A sequence of tags follows, which may enclose data (for example the text "Some stuff")
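As a minimal sketch of the layout described on this slide, the snippet below parses a small document made of a prologue followed by tags. The tag names `foo` and `bar` are hypothetical, and Python's standard `xml.dom.minidom` is used purely for illustration; it is not the parser used in the talk.

```python
from xml.dom import minidom

# Hypothetical document: the prologue (XML declaration) comes first,
# then a sequence of tags enclosing some data.
XML_TEXT = '<?xml version="1.0"?><foo><bar>Some stuff</bar></foo>'

doc = minidom.parseString(XML_TEXT)
root = doc.documentElement                  # the outermost tag, <foo>
bar = root.getElementsByTagName("bar")[0]   # the nested tag, <bar>
text = bar.firstChild.data                  # the data enclosed by <bar>

print(root.tagName, bar.tagName, text)
```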
Element Structure
- Elements have a name
- Elements must occur either as a pair of opening/closing tags, possibly containing data, or as an empty tag (no data)
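The pairing rule can be checked mechanically. The sketch below assumes a hypothetical element name `foo` and a helper `is_well_formed` introduced only for this illustration; it relies on the parser raising `ExpatError` when tags are not properly paired.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False


paired = is_well_formed("<foo>Data</foo>")      # opening/closing pair with data
empty = is_well_formed("<foo/>")                # empty tag, no data
unclosed = is_well_formed("<foo>Data")          # closing tag missing
mismatched = is_well_formed("<foo>Data</bar>")  # closing name differs

print(paired, empty, unclosed, mismatched)
```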
Attributes
- Elements can have one or more attributes
- Attributes are name/value pairs
- Attributes are simple: they have no sub-tags
- Attributes may have a special purpose (e.g. declaration): an xmlns:bj attribute declares the namespace bj
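A short sketch of attributes as name/value pairs, including a declaration attribute. The element name `lattice`, the attribute `size`, its value, and the URI are all placeholders chosen for this example; only the prefix `bj` comes from the slide.

```python
from xml.dom import minidom

# Hypothetical element with an ordinary attribute and a namespace declaration.
XML_TEXT = '<lattice xmlns:bj="http://example.org/bj" size="16"/>'

elem = minidom.parseString(XML_TEXT).documentElement

size = elem.getAttribute("size")              # ordinary name/value pair
bj_uri = elem.getAttribute("xmlns:bj")        # declaration of namespace bj
attrs = dict(elem.attributes.items())         # all pairs as a plain dict

print(size, bj_uri, sorted(attrs))
```

Attribute values are plain strings here, which reflects the point that attributes carry no sub-tags.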
Namespaces
- Allow reuse of tag names for different purposes
- Consist of a prefix and a URI
- Declared with the xmlns attribute
- Tags and attributes from a namespace are prefixed
- In some cases, attribute values may be prefixed
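To illustrate the prefix/URI pairing, the sketch below uses Python's `xml.etree.ElementTree`, which expands each prefixed name into `{URI}localname`. The URI and the prefixes `bj` and `other` are assumptions made for this example. It suggests that the prefix is only a local shorthand, while the URI is what identifies the namespace.

```python
import xml.etree.ElementTree as ET

URI = "http://example.org/bj"  # hypothetical namespace URI

# Same URI bound to two different prefixes, plus an unprefixed tag.
doc_a = ET.fromstring(f'<bj:foo xmlns:bj="{URI}"/>')
doc_b = ET.fromstring(f'<other:foo xmlns:other="{URI}"/>')
doc_c = ET.fromstring("<foo/>")

same_name = doc_a.tag == doc_b.tag    # prefixes differ, URI is the same
plain_differs = doc_a.tag != doc_c.tag  # unprefixed foo is a different name

print(doc_a.tag, doc_c.tag, same_name, plain_differs)
```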
Namespaces in QCDML
- Suppose the Metadata Working Group can't agree on a convention for a parameter: both UKQCD and SciDAC want to use the name beta, but with different meanings
- Define two namespaces: sciDac and ukqcd
- Can then have two distinct beta tags, each prefixed with its own namespace
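The beta scenario might look as follows. This is only a sketch: the enclosing `params` tag, both URIs, and the numeric values are made-up placeholders, not QCDML conventions. The helper `beta_in` is likewise hypothetical.

```python
from xml.dom import minidom

SCIDAC_URI = "http://example.org/sciDac"  # placeholder URI
UKQCD_URI = "http://example.org/ukqcd"    # placeholder URI

# Two tags share the local name "beta" but live in different namespaces.
XML_TEXT = (
    f'<params xmlns:sciDac="{SCIDAC_URI}" xmlns:ukqcd="{UKQCD_URI}">'
    "<sciDac:beta>5.7</sciDac:beta>"
    "<ukqcd:beta>2.85</ukqcd:beta>"
    "</params>"
)

doc = minidom.parseString(XML_TEXT)


def beta_in(namespace_uri):
    """Return the text of the beta tag belonging to the given namespace."""
    node = doc.getElementsByTagNameNS(namespace_uri, "beta")[0]
    return node.firstChild.data


scidac_beta = beta_in(SCIDAC_URI)
ukqcd_beta = beta_in(UKQCD_URI)

print(scidac_beta, ukqcd_beta)
```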
Parsing XML via SAX
- SAX: Simple API for XML
- Treats the XML document as a “program”
- SAX parsers provide hooks that let the user write an “interpreter” for the “program”
- Generally fast, with a small memory footprint
- BUT: writing interpreters is potentially burdensome and problem-specific
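A minimal sketch of such an "interpreter", written with Python's standard `xml.sax` module. The class name `TextCollector` and the tags `params`, `beta`, and `size` are hypothetical. The parser calls the three hook methods as it streams through the document, and the handler has to keep its own state, which is where the problem-specific effort goes.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Record the text of each leaf element, keyed by its path."""

    def __init__(self):
        super().__init__()
        self.path = []      # names of the currently open elements
        self.values = {}    # "a/b" -> text found inside <a><b>...</b></a>
        self._buffer = []   # text pieces seen since the last tag

    def startElement(self, name, attrs):
        self.path.append(name)
        self._buffer = []

    def characters(self, content):
        # May be called several times for one run of text.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if text:
            self.values["/".join(self.path)] = text
        self.path.pop()
        self._buffer = []


XML_TEXT = b'<?xml version="1.0"?><params><beta>5.7</beta><size>16</size></params>'

handler = TextCollector()
xml.sax.parseString(XML_TEXT, handler)
print(handler.values)
```

No tree is ever built: once a value has been recorded, the parser moves on, which is consistent with the small memory footprint noted on the slide.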
Document Object Model (DOM)
- DOM specifies a dynamic interface to XML documents
- Tree-based representation
- Various APIs for accessing the representation: traversing, searching, creating/updating
- We consider here only the tree representation (as it is closely related to XPath)
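The searching and creating/updating operations listed above can be sketched with Python's `xml.dom.minidom`, one of several DOM implementations. The document content and tag names are placeholders for this example.

```python
from xml.dom import minidom

doc = minidom.parseString("<params><beta>5.7</beta></params>")
root = doc.documentElement

# Searching: find an element by tag name.
beta = doc.getElementsByTagName("beta")[0]

# Updating: change the text held by the existing element.
beta.firstChild.data = "6.0"

# Creating: build a new element with a text child and attach it to the tree.
size = doc.createElement("size")
size.appendChild(doc.createTextNode("16"))
root.appendChild(size)

result = root.toxml()
print(result)
```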
DOM Trees
[Figure: DOM tree. The Document node has a root link to the Root Node. Nodes are connected by parent/child links to Child Nodes, and by next-sibling/previous-sibling links to Sibling Nodes (brother/sister).]
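The links in the tree picture correspond directly to DOM node properties. The sketch below, using `xml.dom.minidom` and hypothetical tags `root`, `a`, `b`, and `c`, follows the parent, child, and sibling links; the helper `children_in_order` is introduced only for this illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<root><a/><b/><c/></root>")
root = doc.documentElement   # reached from the Document via the root link
a, b, c = root.childNodes    # the three child nodes, in document order


def children_in_order(node):
    """Walk the children using only firstChild and nextSibling links."""
    names = []
    child = node.firstChild
    while child is not None:
        names.append(child.nodeName)
        child = child.nextSibling
    return names


order = children_in_order(root)
print(order)
print(b.parentNode is root, b.previousSibling is a, b.nextSibling is c)
```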
DOM Nodes
There are several types of Node. The most useful:
- Document
- Element: corresponds to an opening/closing tag pair or an empty tag
- Attribute: an attribute in an opening tag
- Text: the data in an element, or the value in an attribute
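The node types can be inspected through the `nodeType` property. This sketch uses `xml.dom.minidom` with a hypothetical element `foo` carrying an attribute `bar`; the dictionary `TYPE_NAMES` is just a lookup table for printing.

```python
from xml.dom import Node, minidom

doc = minidom.parseString('<foo bar="value">data</foo>')
elem = doc.documentElement           # the <foo> element
attr = elem.getAttributeNode("bar")  # the attribute bar="value"
text = elem.firstChild               # the data inside <foo>

TYPE_NAMES = {
    Node.DOCUMENT_NODE: "Document",
    Node.ELEMENT_NODE: "Element",
    Node.ATTRIBUTE_NODE: "Attribute",
    Node.TEXT_NODE: "Text",
}

kinds = [TYPE_NAMES[n.nodeType] for n in (doc, elem, attr, text)]
print(kinds, attr.value, text.data)
```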
DOM Notes
- DOM preserves document order (parent/child, previous/next sibling links)
- Getting documents into DOM is easy
  Using libxml: doc = xmlParseFile("foo.xml");
- Many free DOM parsers exist, even for C/C++: Apache Xerces, libxml
- Difficulty shifts to extracting data from DOM
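The contrast between easy loading and laborious extraction can be sketched as follows. Python's `minidom.parse` stands in here for the libxml call on the slide (it is a different library with a different API), and the file content, tag names, and the helper `child_text` are assumptions for this example.

```python
import os
import tempfile
from xml.dom import Node, minidom

XML_TEXT = """<?xml version="1.0"?>
<params>
  <beta>5.7</beta>
  <size>16</size>
</params>
"""


def child_text(parent, name):
    """Return the text of the first child element called name, or None."""
    for child in parent.childNodes:
        # Indentation appears as whitespace-only Text nodes, so filter by type.
        if child.nodeType == Node.ELEMENT_NODE and child.tagName == name:
            return "".join(
                n.data for n in child.childNodes if n.nodeType == Node.TEXT_NODE
            )
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "foo.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(XML_TEXT)
    doc = minidom.parse(path)  # loading the file is a single call

root = doc.documentElement
beta = child_text(root, "beta")
n_children = len(root.childNodes)  # counts whitespace Text nodes too

print(beta, n_children)
```

Loading takes one line, while even a single lookup needs a helper that filters node types, which is the shift in difficulty the slide points to.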
Conclusions
- This talk provided a basic introduction to XML document structure
- Discussed the DOM representation of XML
- Highlighted the need to define an Easy To Use API to query DOM objects
- What does Easy To Use mean?
- What is Easy To Parse?
- Stay tuned for Part 2...