XML Study-Session: Part III Parsing XML Documents
Objectives By completing this study session, you should be able to: Use the IBM XML4J Java XML parser. Describe the Document Object Model (DOM). Create a parsing application that displays, navigates, and modifies an XML document.
What is parsing? Parsing is the interpretation of text. The XML parser’s job is to load the document, check that it follows all necessary rules (at minimum, those for well-formedness), and build a document tree structure that can be passed on to the application. The application is any program (e.g. browser, reader, middleware) that acts upon the tree structure, processing the data it contains.
Overview of XML parsing Every XML application includes at least two pieces: an XML parser and an application to manipulate the parsed XML data.
[Fig. 1 (from “Building XML Applications,” St. Laurent and Cerami): XML Document -> XML Parser -> XML Application. The parser hands packets of parsed XML data to the application that manipulates the XML data.]
Types of parsers Validating vs. non-validating: A validating parser checks a document against a declared DTD; a non-validating parser checks only for well-formedness. Tree-based vs. event-driven interface: A parser with a tree-based interface reads the entire document and creates an internal tree representation of the data, which the application can then traverse. A standardized API for this interface is the W3C DOM. In the event-driven model, the parser reads through the document and signals each significant parsing event (e.g. start of document, start of element, end of element). Callback methods handle these events as they occur. This approach is used by the Simple API for XML (SAX).
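The event-driven model described above can be sketched as follows. This is a minimal illustration, not the XML4J sample code: it assumes the JAXP SAX parser and the SAX2 DefaultHandler class that ship with the JDK, and the class name SaxEventSketch and the inline sample document are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical class name; uses the JDK's built-in SAX parser, not XML4J.
class SaxEventSketch {

    /** Parses the XML text and returns the parsing events in the order they were signalled. */
    static List<String> events(String xml) {
        final List<String> log = new ArrayList<>();
        // Each overridden method is a callback the parser invokes as it reads the document.
        DefaultHandler handler = new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("start:" + qName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
            @Override public void endDocument() { log.add("endDocument"); }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return log;
    }

    public static void main(String[] args) {
        for (String event : events("<Pets><Pet><Name>Rover</Name></Pet></Pets>")) {
            System.out.println(event);
        }
    }
}
```

Note that no tree is built here: the application sees each event once, in document order, and must keep any state it needs itself.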
The IBM XML4J parser Open source Java parser developed by IBM and now available as part of the xml.apache.org project under the codename Xerces. The version 3.1.1 API supports DOM level 1 and SAX level 1. It can be downloaded as a .zip file from www.alphaworks.ibm.com/tech/xml4j. Ideal for standalone Java applications and for working with Java servlets.
Setting up your environment To use the classes in XML4J, you must set your Java CLASSPATH variable so that Java can locate the xerces.jar and xercesSamples.jar files.
To set the classpath in JCreator: Configure -> Options -> JDK Profiles -> select JDK version -> Edit -> Add Package -> add d:/xml4j/xerces.jar and d:/xml4j/xercesSamples.jar.
To run the project with command-line arguments: Project -> Project Settings -> JDK Tools -> Select tool type: Run Application -> select <Default> -> Edit -> Parameters -> set the “Prompt for main function argument” checkbox to ‘True’.
Understanding DOM The W3C DOM specifies an interface for treating a document as a tree of nodes. The Node interface, as defined in the Java DOM binding, has methods such as getChildNodes(), getFirstChild(), getNextSibling(), getParentNode(), getNodeType(), etc. Possible node types in DOM include: Element, Attribute, Comment, Text, CDATA section, Entity reference, Entity, Processing Instruction, Document, Document type, Document fragment, and Notation.
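As a rough illustration of the Node methods named above, the sketch below walks the children of a root element with getFirstChild() and getNextSibling() and reports each child's type via getNodeType(). It assumes the JDK's built-in JAXP DOM parser rather than XML4J; the class name NodeWalkSketch, the helper names, and the sample fragment are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; uses the JDK's built-in DOM parser, not XML4J.
class NodeWalkSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Maps a few of the DOM node-type constants to readable names. */
    static String typeName(Node node) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: return "element";
            case Node.TEXT_NODE:    return "text";
            case Node.COMMENT_NODE: return "comment";
            default:                return "other";
        }
    }

    /** Lists the root element's children as "nodeName:type", in document order. */
    static List<String> childSummary(String xml) {
        Node root = parse(xml).getDocumentElement();
        List<String> summary = new ArrayList<>();
        // Sibling traversal: start at the first child and follow getNextSibling() until null.
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            summary.add(child.getNodeName() + ":" + typeName(child));
        }
        return summary;
    }

    public static void main(String[] args) {
        System.out.println(childSummary("<Pet><Name>Rover</Name><!--note--><Age>3</Age></Pet>"));
    }
}
```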
Example: (petfile.xml)
<?xml version='1.0' encoding='UTF-8'?>
<Pets>
  <Pet ID='001' Registered='030801'>
    <Name>Rover</Name>
    <Age>3</Age>
    <Description Species='Dog'>Yellow colored Golden Retriever</Description>
  </Pet>
  <Pet ID='002' Registered='101100'>
    <Name>Ella</Name>
    <Age>1</Age>
    <Description Species='Tortoise'>Green and black shelled pond crawler</Description>
  </Pet>
</Pets>
Example DOM structure
Pets
  Pet (attributes: ID = 001, Registered = 030801)
    Name
      Rover
    Age
      3
    Description (attribute: Species = Dog)
      Yellow colored Golden Retriever
  Pet (second pet, not expanded)
Understanding DOM (contd.) In XML4J, the interfaces that make up the W3C DOM are in the org.w3c.dom package, and the DOM parser is the DOMParser class in the org.apache.xerces.parsers package. High-level constructs such as Element and Attr (attribute) in DOM extend the Node interface. So, for instance, an Attr object has its own methods such as getName() and getValue() and also inherits Node methods such as getNodeName(). Complete API documentation can be found online at http://xml.apache.org/apiDocs/index.html.
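The point that attribute objects are also nodes can be seen in a short sketch. It assumes the JDK's built-in JAXP DOM parser instead of XML4J; the class name AttrSketch and the sample element are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical class name; uses the JDK's built-in DOM parser, not XML4J.
class AttrSketch {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Describes one attribute of the root element using both Attr and Node methods. */
    static String describe(String xml, String attrName) {
        Element root = parse(xml).getDocumentElement();
        Attr attr = root.getAttributeNode(attrName);
        if (attr == null) {
            return "no such attribute";
        }
        // Attr extends Node, so the same object can be used through the Node interface.
        Node asNode = attr;
        return attr.getName() + "=" + attr.getValue()
                + " (node name " + asNode.getNodeName()
                + ", node type " + asNode.getNodeType() + ")";
    }

    public static void main(String[] args) {
        System.out.println(describe("<Pet ID='001' Registered='030801'/>", "ID"));
    }
}
```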
Creating a parser From the XML Reference page, download and view the FirstParser.java sample code. This program will parse an XML document (“customer.xml”, passed as a command-line argument) and display how many times a given element occurs in it (in this case, the number of <Customer> elements).
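FirstParser.java itself is not reproduced here. As a stand-in, the following sketch shows one way such a counting program might look. It assumes the JDK's built-in JAXP DOM parser rather than the XML4J DOMParser class, and the class name CountElementsSketch is hypothetical.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical stand-in for FirstParser.java; uses the JDK's DOM parser, not XML4J.
class CountElementsSketch {

    /** Parses the source into a DOM tree and counts the elements with the given tag name. */
    static int count(InputSource source, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(source);
            // getElementsByTagName searches the whole tree, not just direct children.
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java CountElementsSketch <file.xml>");
            return;
        }
        // Turn the file name into a URI so the parser can resolve it as a system ID.
        InputSource source = new InputSource(new File(args[0]).toURI().toString());
        System.out.println(count(source, "Customer") + " <Customer> elements");
    }
}
```

Run against customer.xml, this would print the number of <Customer> elements; the exact output format of the original sample may differ.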
Displaying a document From the XML Reference page, download and view the IndentingParser.java sample code. This program will parse and display an entire XML document (passed as a command-line argument) with proper indentation. Separate handler methods are used to handle the document (i.e. root) node, element nodes, attributes, CDATA sections, text nodes, and Processing Instruction nodes.
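The handler-per-node-type idea can be sketched as a recursive method that switches on getNodeType(). This is not the IndentingParser.java code: it assumes the JDK's built-in JAXP DOM parser, the class name IndentingPrinterSketch is hypothetical, and for brevity it drops whitespace-only text and does not escape special characters such as & or <.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for IndentingParser.java; uses the JDK's DOM parser, not XML4J.
class IndentingPrinterSketch {

    /** Parses the XML text and returns it re-rendered with two-space indentation. */
    static String render(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            print(doc, "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    private static void printChildren(Node node, String indent, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            print(c, indent, out);
        }
    }

    private static void print(Node node, String indent, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                printChildren(node, indent, out);
                break;
            case Node.ELEMENT_NODE: {
                out.append(indent).append('<').append(node.getNodeName());
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    out.append(' ').append(a.getNodeName())
                       .append("=\"").append(a.getNodeValue()).append('"');
                }
                out.append(">\n");
                printChildren(node, indent + "  ", out);
                out.append(indent).append("</").append(node.getNodeName()).append(">\n");
                break;
            }
            case Node.CDATA_SECTION_NODE:
                out.append(indent).append("<![CDATA[")
                   .append(node.getNodeValue()).append("]]>\n");
                break;
            case Node.TEXT_NODE: {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {          // skip whitespace-only text nodes
                    out.append(indent).append(text).append('\n');
                }
                break;
            }
            case Node.PROCESSING_INSTRUCTION_NODE:
                out.append(indent).append("<?").append(node.getNodeName())
                   .append(' ').append(node.getNodeValue()).append("?>\n");
                break;
            default:
                break;                          // other node types are ignored in this sketch
        }
    }

    public static void main(String[] args) {
        System.out.print(render("<Pets><Pet ID='001'><Name>Rover</Name></Pet></Pets>"));
    }
}
```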
Navigating a document From the XML Reference page, download and view the nav.java sample code. This program will parse the “meetings.xml” document and navigate the tree structure to locate the name of the third person. Note that the XML4J parser treats the whitespace used to indent the XML document as text nodes. We can tell the parser to ignore this whitespace by calling the parser method setIncludeIgnorableWhitespace with the value ‘false’.
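The effect of indentation on navigation can be illustrated with a small sketch. The structure of meetings.xml is not shown in these notes, so the sample document, the class name NavigateSketch, and the helper names are all hypothetical. The sketch uses the JDK's built-in JAXP DOM parser and, instead of setIncludeIgnorableWhitespace, simply skips non-element children while navigating. It also uses getTextContent(), which belongs to DOM Level 3 rather than DOM Level 1.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for nav.java; uses the JDK's DOM parser, not XML4J.
class NavigateSketch {

    static Element root(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Number of child nodes of the root, counting whitespace text nodes too. */
    static int rawChildCount(String xml) {
        return root(xml).getChildNodes().getLength();
    }

    /** Returns the n-th (1-based) child element of parent, skipping text nodes, or null. */
    static Element nthChildElement(Node parent, int n) {
        int seen = 0;
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE && ++seen == n) {
                return (Element) c;
            }
        }
        return null;
    }

    /** Text of the third child element of the root, or null if there are fewer than three. */
    static String thirdName(String xml) {
        Element third = nthChildElement(root(xml), 3);
        return third == null ? null : third.getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<people>\n  <person>Ann</person>\n  <person>Bob</person>\n"
                   + "  <person>Cy</person>\n</people>";
        System.out.println("child nodes of root: " + rawChildCount(xml));
        System.out.println("third person: " + thirdName(xml));
    }
}
```

Indexing the raw child list directly (for example, taking item 2) would land on a whitespace text node in this sample, which is why the navigation code counts element nodes only.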
Modifying a document From the XML Reference page, download and view the XMLWriter.java sample code. This program will parse an XML document (“customer.xml”, passed as a command-line argument) and modify it by adding a new <Middle_Name>XML</Middle_Name> element to every customer. The modified document tree is then written to a new file with the name “customer2.xml”.
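A modification of this kind might be sketched as follows. This is not the XMLWriter.java code: it assumes the JDK's built-in JAXP parser and an identity Transformer for output, the class name AddMiddleNameSketch is hypothetical, and appending <Middle_Name> as the last child of each <Customer> is an assumption, since the notes do not say where the sample inserts it.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for XMLWriter.java; uses the JDK's XML APIs, not XML4J.
class AddMiddleNameSketch {

    /** Adds <Middle_Name>XML</Middle_Name> to every <Customer> and returns the new document text. */
    static String addMiddleName(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            NodeList customers = doc.getElementsByTagName("Customer");
            for (int i = 0; i < customers.getLength(); i++) {
                // New nodes must be created by the document that will own them.
                Element middle = doc.createElement("Middle_Name");
                middle.appendChild(doc.createTextNode("XML"));
                customers.item(i).appendChild(middle);   // assumption: placed last
            }

            // Serialize the modified tree with an identity transformation.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not modify XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(addMiddleName(
                "<Customers><Customer><First_Name>Ann</First_Name></Customer></Customers>"));
    }
}
```

To write the result to a file such as customer2.xml instead of a string, the StreamResult could wrap a java.io.File in place of the StringWriter.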
Next session: Presenting XML Documents. Topics: stylesheets; writing your own XSL applications.