Java and XML
What is XML XML stands for eXtensible Markup Language. A markup language is used to provide information about a document. Tags are added to the document to provide the extra information. HTML tags tell a browser how to display the document. XML tags give a reader some idea what some of the data means.
Advantages of XML XML is text (Unicode) based. One XML document can be displayed differently in different media (HTML, video, CD, DVD); you only have to change the XML document in order to change all the rest. XML documents can be modularized. Parts can be reused.
Example of an HTML Document
<html> <head> <title>Example</title> </head> <body> <h1>This is an example of a page.</h1> <h2>Some information goes here.</h2> </body> </html>
Example of an XML Document
<address> <name>Alice Lee</name> … </address>
Difference Between HTML and XML HTML tags have a fixed meaning and browsers know what it is. XML tags are different for different applications, and users know what they mean. HTML tags are used for display. XML tags are used to describe documents and data.
XML Rules Tags are enclosed in angle brackets. Tags come in pairs with start-tags and end-tags. Tags must be properly nested. – <b><i>…</b></i> is not allowed. – <b><i>…</i></b> is. Tags that do not have end-tags must be terminated by a '/'. – <br /> is an HTML example.
More XML Rules Tags are case sensitive. – <address> is not the same as <Address>. Tag names may not begin with "xml" in any combination of cases; such names are reserved. Tags may not contain '<' or '&'. Tags follow Java naming conventions, except that a single colon and other characters such as '-' and '.' are allowed. They must begin with a letter or underscore and may not contain white space. Documents must have a single root element that encloses everything else in the document.
Well-Formed Documents An XML document is said to be well-formed if it follows all the rules. An XML parser is used to check that all the rules have been obeyed. Recent browsers such as Internet Explorer 5 and Netscape 7 come with XML parsers. Parsers are also available for free download over the Internet. One is Xerces, from the Apache open-source project. Java 1.4 also supports an open-source parser.
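A well-formedness check can be sketched with the JAXP parser that ships with the JDK. This is a minimal illustration, not the only way to do it: the class name WellFormedCheck and the sample strings are made up for the example, and the custom error handler is there only to keep the parser from printing its own diagnostics.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, named for this example only.
class WellFormedCheck {

    /** Returns true if the text parses as well-formed XML, false otherwise. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse problems instead of letting the parser print them.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // a rule was broken
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("parser could not be set up", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<address><name>Alice Lee</name></address>")); // true
        System.out.println(isWellFormed("<b><i>text</b></i>"));   // false: bad nesting
        System.out.println(isWellFormed("<Name>Alice</name>"));   // false: case mismatch
    }
}
```

Violations of the well-formedness rules are reported by the parser as fatal errors, which is why a single catch of SAXException is enough here.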
Expanded Example
<address> <name> <first>Alice</first> <last>Lee</last> </name> … </address>
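XML Files are Trees [Tree diagram: address is the root, with children name, phone, and birthday; name has children first and last; birthday has children year, month, and day.]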
Validity A well-formed document has a tree structure and obeys all the XML rules. A particular application may add more rules in either a DTD (document type definition) or in a schema. Many specialized DTDs and schemas have been created to describe particular areas. These range from disseminating news bulletins (RSS) to chemical formulas. DTDs were developed first, so they are not as comprehensive as schemas.
Document Type Definitions A DTD describes the tree structure of a document and something about its data. There are two data types, PCDATA and CDATA. – PCDATA is parsed character data. – CDATA is character data, not usually parsed. A DTD determines how many times a node may appear, and how child nodes are ordered.
DTD for address Example
Schemas Schemas are themselves XML documents. They were standardized after DTDs and provide more information about the document. They have a number of data types including string, decimal, integer, boolean, date, and time. They divide elements into simple and complex types. They also determine the tree structure and how many children a node may have.
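As a rough illustration of what schema data types buy you, the sketch below validates two small documents against a schema using the javax.xml.validation package (available from Java 5 onward, so later than the JAXP 1.2 discussed in these slides). The schema itself, the class name SchemaCheck, and the sample birthday values are all assumptions made up for the example; they are not the schema from the slides.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class SchemaCheck {

    // Made-up schema: an address holds a string name followed by a date birthday.
    static final String XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='address'>"
        + "<xs:complexType><xs:sequence>"
        + "<xs:element name='name' type='xs:string'/>"
        + "<xs:element name='birthday' type='xs:date'/>"
        + "</xs:sequence></xs:complexType>"
        + "</xs:element>"
        + "</xs:schema>";

    /** Compiles the schema text above into a reusable Schema object. */
    static Schema loadSchema() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema text is itself invalid", e);
        }
    }

    /** Returns true if the document is valid against the schema. */
    static boolean isValid(String xml) {
        try {
            loadSchema().newValidator()
                    .validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document broke a schema rule
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample values are invented for illustration.
        System.out.println(isValid(
            "<address><name>Alice Lee</name><birthday>2000-01-31</birthday></address>")); // true
        System.out.println(isValid(
            "<address><name>Alice Lee</name><birthday>January 31</birthday></address>")); // false
    }
}
```

The second document is well-formed but not valid: its tree structure is right, yet the birthday text is not a legal xs:date.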
Schema for First address Example
Parsers There are two principal models for parsers.
SAX (Simple API for XML): uses call-back methods, similar to javax listeners.
DOM (Document Object Model): creates a parse tree, which then requires a tree traversal.
DOM Parser
About DOM Stands for Document Object Model. A World Wide Web Consortium (W3C) standard. The standard is constantly adding new features – Level 3 Core was just released this month. We'll cover most of the basics. There's always more, and it's always changing.
DOM abstraction layer in Java -- architecture The factory returns a specific parser implementation, which in turn produces an org.w3c.dom.Document. The emphasis is on allowing vendors to supply their own DOM implementation without requiring changes to source code.
Sample Code
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
/* set some factory options here */
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(xmlFile);
The factory instance determines the parser implementation. It can be changed with a runtime system property. The JDK has a default; Xerces is much better. From the factory one obtains an instance of the parser. xmlFile can be a java.io.File, an InputStream, etc.
For reference: javax.xml.parsers.DocumentBuilderFactory, javax.xml.parsers.DocumentBuilder, org.w3c.dom.Document. Notice that the Document class comes from the W3C-specified bindings.
Validation Note that by default the parser will not validate against a schema or DTD. As of JAXP 1.2, Java provides a default parser that can handle most schema features. See next slide for details on how to set this up.
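One way to switch validation on is to call setValidating(true) on the factory and install an error handler that collects the problems. The sketch below does this for a DTD carried inside the document itself. The DTD, the class name DtdValidation, and the sample phone number are assumptions invented for the example; they are not the DTD from the earlier slide.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class, named for this example only.
class DtdValidation {

    // Made-up internal DTD: an address is a name followed by a phone.
    static final String PROLOG =
          "<?xml version='1.0'?>"
        + "<!DOCTYPE address ["
        + "<!ELEMENT address (name, phone)>"
        + "<!ELEMENT name (#PCDATA)>"
        + "<!ELEMENT phone (#PCDATA)>"
        + "]>";

    /** Parses PROLOG + body with validation on; returns the problems found. */
    static List<String> validate(String body) {
        final List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // off by default
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                // Validity violations arrive here; record them and keep parsing.
                public void error(SAXParseException e) { problems.add(e.getMessage()); }
                // Well-formedness violations arrive here; stop parsing.
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(PROLOG + body)));
        } catch (SAXException e) {
            problems.add(e.getMessage());
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        // Sample data is invented for illustration.
        System.out.println(validate(
            "<address><name>Alice Lee</name><phone>555-0100</phone></address>").isEmpty()); // true
        System.out.println(validate(
            "<address><phone>555-0100</phone></address>").isEmpty()); // false: name is missing
    }
}
```

Without the setValidating(true) call, both documents would parse without complaint, because both are well-formed.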
Document object Once a Document object is obtained, there is a rich API to manipulate it. The first call is usually Element root = doc.getDocumentElement(); This gets the root element of the Document as an instance of the Element interface. Note that Element extends Node and so has the methods getNodeType(), getNodeName(), getNodeValue(), and getChildNodes().
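A small sketch of that first call, assuming the document is supplied as a string rather than a file. The class name RootElementDemo and the helper parseRoot are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class RootElementDemo {

    /** Parses XML text and returns the root element of the resulting Document. */
    static Element parseRoot(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<address><name>Alice Lee</name></address>");
        System.out.println(root.getNodeName());                       // address
        System.out.println(root.getNodeType() == Node.ELEMENT_NODE);  // true
        System.out.println(root.getNodeValue());                      // null
        System.out.println(root.getChildNodes().getLength());         // 1
    }
}
```

Note that getNodeValue() is null for an element; the text "Alice Lee" lives in a Text node further down the tree.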
Types of Nodes Note that there are many types of Nodes (i.e. subinterfaces of Node): Attr, CDATASection, Comment, Document, DocumentFragment, DocumentType, Element, Entity, EntityReference, Notation, ProcessingInstruction, Text. Each of these has a special and non-obvious associated type, value, and name. The standards are language-neutral and are specified in the chart on the following slide. Important: keep this chart nearby when using DOM.
Node                   getNodeName()                getNodeValue()              getAttributes()  getNodeType()
Attr                   name of attribute            value of attribute          null             2
CDATASection           #cdata-section               CDATA content               null             4
Comment                #comment                     comment content             null             8
Document               #document                    null                        null             9
DocumentFragment       #document-fragment           null                        null             11
DocumentType           doc type name                null                        null             10
Element                tag name                     null                        NamedNodeMap     1
Entity                 entity name                  null                        null             6
EntityReference        name of entity referenced    null                        null             5
Notation               notation name                null                        null             12
ProcessingInstruction  target                       entire content string       null             7
Text                   #text                        actual text                 null             3
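The chart can be checked against a real tree by walking it and printing the type, name, and value of every node. The following is a minimal sketch; the class name NodeWalk and the tiny sample document are assumptions made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class NodeWalk {

    /** Appends one line per node: "type name value", indented by depth. */
    static void describe(Node node, String indent, StringBuilder out) {
        out.append(indent)
           .append(node.getNodeType()).append(' ')
           .append(node.getNodeName()).append(' ')
           .append(node.getNodeValue()).append('\n');
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            describe(children.item(i), indent + "  ", out);
        }
    }

    /** Parses the text and returns the description of the whole tree. */
    static String walk(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            describe(doc, "", out);
            return out.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(walk("<a><!--hi--><b>text</b></a>"));
        // 9 #document null
        //   1 a null
        //     8 #comment hi
        //     1 b null
        //       3 #text text
    }
}
```

The walk starts at the Document node itself rather than at the root element, which is why the first line shows type 9.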
Transformer Architecture
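Writing DOM to XML
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class WriteDOM {
    public static void main(String[] argv) throws Exception {
        // Parse the file named on the command line into a DOM tree
        File f = new File(argv[0]);
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document document = builder.parse(f);

        // A transformer created without a stylesheet copies its source unchanged
        TransformerFactory tFactory = TransformerFactory.newInstance();
        Transformer transformer = tFactory.newTransformer();
        DOMSource source = new DOMSource(document);
        StreamResult result = new StreamResult(System.out);
        transformer.transform(source, result);
    }
}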
Creating a DOM from scratch Sometimes you may want to create a DOM tree directly in memory. This is done with:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document document = builder.newDocument();
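An empty Document is only a starting point; elements and text are then created through the Document and attached to the tree. The sketch below builds a small address tree and serializes it with a Transformer. The class name BuildDom and the element layout are assumptions chosen for the example.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, named for this example only.
class BuildDom {

    /** Builds address/name in memory without parsing any input. */
    static Document build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element address = doc.createElement("address");
            doc.appendChild(address);                    // becomes the root element
            Element name = doc.createElement("name");
            name.appendChild(doc.createTextNode("Alice Lee"));
            address.appendChild(name);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the tree to a string, leaving out the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toXml(build()));
    }
}
```

Nodes must be created by the Document that will own them; createElement and createTextNode only make the node, and appendChild is what places it in the tree.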
Manipulating Nodes Once the root node is obtained, typical tree methods exist to manipulate other elements:
boolean node.hasChildNodes()
NodeList node.getChildNodes()
Node node.getNextSibling()
Node node.getParentNode()
String node.getNodeValue()
String node.getNodeName()
String node.getTextContent()
void node.setNodeValue(String nodeValue)
Node node.insertBefore(Node newChild, Node refChild)
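A few of these methods in use, as a sketch: a new element is inserted between two existing children and the result is read back by following sibling links. The class name NodeEdit, its two helpers, and the sample document are assumptions made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, named for this example only.
class NodeEdit {

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Lists the names of the direct children by walking the sibling chain. */
    static String childNames(Node parent) {
        StringBuilder names = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(n.getNodeName());
        }
        return names.toString();
    }

    public static void main(String[] args) {
        Document doc = parse("<address><name>Alice Lee</name><birthday/></address>");
        Element root = doc.getDocumentElement();

        // Insert a new phone element in front of the existing birthday element.
        Node birthday = root.getLastChild();
        Element phone = doc.createElement("phone");
        root.insertBefore(phone, birthday);

        System.out.println(childNames(root));                      // name,phone,birthday
        System.out.println(phone.getParentNode().getNodeName());   // address
        System.out.println(root.getFirstChild().getTextContent()); // Alice Lee
    }
}
```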
SAX Simple API for XML
About SAX SAX in Java is hosted on SourceForge. SAX is not a W3C standard. It originated purely in Java. Other languages have chosen to implement it in their own ways based on this prototype.
SAX vs DOM Please don't compare unrelated things: – SAX is an alternative to DOM, but realize that DOM is often built on top of SAX – SAX and DOM do not compete with JAXP – They do both compete with JAXB implementations
How a SAX parser works A SAX parser scans an XML stream on the fly and responds to certain parsing events as it encounters them. This is very different from digesting an entire XML document into memory. It is much faster and requires less memory. However, you need to reparse if you need to revisit data.
Obtaining a SAX parser Important classes:
javax.xml.parsers.SAXParserFactory;
javax.xml.parsers.SAXParser;
javax.xml.parsers.ParserConfigurationException;
//get the parser
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser = factory.newSAXParser();
//parse the document
saxParser.parse(new File(argv[0]), handler);
DefaultHandler Note that an event handler has to be passed to the SAX parser. The handler implements the interface org.xml.sax.ContentHandler. It is easier to extend the adapter org.xml.sax.helpers.DefaultHandler, which is also the handler type that SAXParser.parse() accepts.
Overriding Handler methods Most important methods to override:
– void startDocument(): called once when document parsing begins
– void endDocument(): called once when parsing ends
– void startElement(...): called each time an element start tag is encountered
– void endElement(...): called each time an element end tag is encountered
– void characters(...): called one or more times between startElement and endElement calls to deliver accumulated character data
startElement
public void startElement(
    String namespaceURI,  //if namespace assoc
    String sName,         //nonqualified (local) name
    String qName,         //qualified name
    Attributes attrs)     //list of attributes
Attribute info is obtained by querying the Attributes object.
Characters
public void characters(
    char buf[],   //buffer of chars accumulated
    int offset,   //begin element of chars
    int len)      //number of chars
Note that characters may be called more than once between a begin tag and its end tag. Also, mixed-content elements require careful handling.
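Putting the handler methods together, the sketch below extends DefaultHandler and collects the text of every name element. Because characters may be called several times for one element, the text is accumulated in a buffer and only read in endElement. The class name NameCollector, the element name it looks for, and the sample input are assumptions made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named for this example only.
class NameCollector extends DefaultHandler {

    private final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inName;

    @Override
    public void startElement(String namespaceURI, String sName,
                             String qName, Attributes attrs) {
        if (qName.equals("name")) {
            inName = true;
            text.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] buf, int offset, int len) {
        if (inName) {
            text.append(buf, offset, len); // may arrive in several pieces
        }
    }

    @Override
    public void endElement(String namespaceURI, String sName, String qName) {
        if (qName.equals("name")) {
            inName = false;
            names.add(text.toString());
        }
    }

    /** Runs a SAX parse over the text and returns the collected names. */
    static List<String> collect(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            NameCollector handler = new NameCollector();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.names;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(collect(
            "<book><address><name>Alice Lee</name></address>"
          + "<address><name>A. N. Other</name></address></book>"));
        // [Alice Lee, A. N. Other]
    }
}
```

The handler keeps only what it needs, so memory use does not grow with the size of the document beyond the list of names.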
Entity references Recall that entity references are special character sequences for referring to characters that have special meaning in XML syntax – '<' is written &lt; – '>' is written &gt; In SAX these are automatically converted and passed to the characters stream unless they are part of a CDATA section.
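The conversion can be observed with a handler that simply concatenates everything passed to characters. This is a minimal sketch; the class name EntityDemo and the sample strings are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
class EntityDemo {

    /** Returns all character data the SAX parser reports for the document. */
    static String textOf(String xml) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] buf, int offset, int len) {
                text.append(buf, offset, len);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Entity references are replaced before characters() sees them.
        System.out.println(textOf("<note>1 &lt; 2 &amp; 3 &gt; 2</note>")); // 1 < 2 & 3 > 2
        // Inside a CDATA section the same sequence is passed through literally.
        System.out.println(textOf("<note><![CDATA[&lt;]]></note>"));        // &lt;
    }
}
```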
Choosing a Parser Choosing your parser implementation – If no other factory class is specified, the default SAXParserFactory class is used. To use a different vendor's parser, you can change the value of the Java system property that points to it (it is a system property, not an operating-system environment variable). You can do that from the command line, like this: java -Djavax.xml.parsers.SAXParserFactory=yourFactoryHere... The factory name you specify must be a fully qualified class name (all package prefixes included). For more information, see the documentation of the newInstance() method of the SAXParserFactory class.
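To see which implementation a given JVM actually picks, one option is to print the class of the factory that newInstance() returns. A small sketch follows; the class name WhichParser is made up for the example, and the name printed depends on the JDK and on any system property in effect.

```java
import javax.xml.parsers.SAXParserFactory;

// Hypothetical helper class, named for this example only.
class WhichParser {

    /** Returns the fully qualified class name of the factory in use. */
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // Shows the override if -Djavax.xml.parsers.SAXParserFactory=... was given,
        // otherwise whatever the platform default resolves to.
        System.out.println("Property: "
                + System.getProperty("javax.xml.parsers.SAXParserFactory"));
        System.out.println("Factory:  " + factoryClassName());
    }
}
```

Running it once with and once without the -D option is a quick way to confirm that the override took effect.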