1 Processing XML with Java A comprehensive tutorial about XML processing with Java. XML tutorial of W3Schools.

Presentation transcript:

1 Processing XML with Java A comprehensive tutorial about XML processing with Java. XML tutorial of W3Schools.

2 Parsers What is a parser? A parser takes an input and a formal grammar and produces analyzed data: the structure(s) of the input, according to the atomic elements and their relationships (as described in the grammar)

3 XML-Parsing Standards We will consider two parsing methods that implement W3C standards for accessing XML: DOM - converts XML into a tree of objects; a “random access” protocol. SAX - a “serial access” protocol; event-driven parsing.

4 XML Examples

5 world.xml [Example document; markup lost in extraction. Text content: Israel, 6,199,008, Jerusalem, Ashdod, France, 60,424,213. Callouts: root element, validating DTD file, reference to an entity]

6 XML Tree Model element attribute simple content

7 world.dtd default value Open world.xml in your browser Check world2.xml for #PCDATA example As opposed to required parsed Not parsed

8 Namespaces sales.xml <forsale date="12/2/03" xmlns:xhtml=" DBI:.]]> <comment xmlns=" My favorite book! Callouts: “xhtml” namespace declaration; default namespace declaration; namespace overriding; (non-parsed) character data

9 sales.xml (same fragment) Namespace: “ ” Local name: “h1” Qualified name: “xhtml:h1”

10 sales.xml (same fragment) Namespace: “ ” Local name: “par” Qualified name: “par”

11 sales.xml (same fragment) Namespace: “” Local name: “title” Qualified name: “title”

12 sales.xml (same fragment) Namespace: “ ” Local name: “b” Qualified name: “xhtml:b”

13 DOM – Document Object Model

14 DOM Parser DOM = Document Object Model Parser creates a tree object out of the document User accesses data by traversing the tree -The tree and its traversal conform to a W3C standard The API allows for constructing, accessing and manipulating the structure and content of XML documents

15 world.xml [Example document repeated; markup lost in extraction. Text content: Israel, 6,199,008, Jerusalem, Ashdod, France, 60,424,213]

16 The DOM Tree

17 Using a DOM Tree [Figure: the DOM parser reads the XML file and builds a DOM tree in memory; the application works with the tree through the API]


19 Creating a DOM Tree A DOM tree is generated by a DocumentBuilder. The builder is generated by a factory, in order to be implementation independent. The factory is chosen according to the system configuration.
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse("world.xml");

20 Configuring the Factory The methods of the document-builder factory enable you to configure the properties of the document building. For example - factory.setValidating(true) - factory.setIgnoringComments(false) Read more about DocumentBuilderFactory Class, DocumentBuilder Class
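As a minimal sketch of how a factory setting changes the resulting tree, the example below parses the same small document twice, once keeping comments and once dropping them. The class name FactoryConfigSketch, the helper rootChildCount and the sample XML are made up for illustration; only the JAXP calls are standard.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original slides.
class FactoryConfigSketch {
    // Counts the children of the root element, with comments kept or dropped.
    static int rootChildCount(String xml, boolean ignoreComments) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringComments(ignoreComments);
            // Parse from a String; builder.parse(String) would treat it as a URI.
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<a><!-- note --><b/></a>";
        System.out.println("comments kept:    " + rootChildCount(xml, false));
        System.out.println("comments ignored: " + rootChildCount(xml, true));
    }
}
```

The factory must be configured before newDocumentBuilder() is called; the settings are copied into the builder at creation time.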

21 The Node Interface The nodes of the DOM tree include -a special root (denoted document) -element nodes -text nodes and CDATA sections -attributes -comments -and more... The Document interface returned by builder.parse(…) actually extends the Node interface. Every node in the DOM tree implements the Node interface. This is very unintuitive (some would even say this is a bloated design)

22 Interfaces in a DOM Tree [Figure: interfaces extending Node: DocumentFragment, Document, CharacterData (with Text, Comment and CDATASection beneath it), Attr, Element, DocumentType, Notation, Entity, EntityReference, ProcessingInstruction; plus the collection interfaces NodeList and NamedNodeMap] DocumentFragment: a light-weight fragment of the document; can hold several sub-trees. Figure as appears in “The XML Companion” - Neil Bradley. Run FragmentVsElement with 1st argument fragment/element

23 Interfaces in the DOM Tree [Figure: an example DOM tree whose nodes are labelled Document, Document Type, Element, Attribute, Text, Entity Reference and Comment]

24 Node Navigation Every node has a specific location in tree Node interface specifies methods for tree navigation - Node getFirstChild(); - Node getLastChild(); - Node getNextSibling(); - Node getPreviousSibling(); - Node getParentNode(); - NodeList getChildNodes(); - NamedNodeMap getAttributes()

25 Node Navigation (cont) getFirstChild() getPreviousSibling() getChildNodes() getNextSibling() getLastChild() getParentNode()
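The navigation methods above can be sketched as a simple sibling walk. The class NavigationSketch and its helpers parseRoot and childElementNames are hypothetical names chosen for this illustration, and the sample document is invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original slides.
class NavigationSketch {
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Walks the children with getFirstChild()/getNextSibling(), keeping elements only.
    static List<String> childElementNames(Element parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE) {
                names.add(c.getNodeName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Element root = parseRoot("<r><a/>text<b/></r>");
        System.out.println(childElementNames(root));
        // The text node sits between the two elements.
        System.out.println(root.getLastChild().getPreviousSibling().getNodeValue());
    }
}
```

Note that text nodes are siblings of elements, so a walk that expects only elements has to filter by node type.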

26 Node Properties Every node has -a type -a name -a value -attributes (a dubious design: not a very OO approach, reminiscent of simulating OO using unions in C). The roles of these properties differ according to the node types. Nodes of different types implement different interfaces (that extend Node). Only Element nodes have attributes; some would say this should have been a property of the Element derived class and not of Node. Very few nodes have both a significant name and a significant value

27 Names, Values and Attributes
Interface              nodeName                    nodeValue                 attributes
Attr                   name of attribute           value of attribute        null
CDATASection           "#cdata-section"            content of the section    null
Comment                "#comment"                  content of the comment    null
Document               "#document"                 null                      null
DocumentFragment       "#document-fragment"        null                      null
DocumentType           doc-type name               null                      null
Element                tag name                    null                      NamedNodeMap
Entity                 entity name                 null                      null
EntityReference        name of entity referenced   null                      null
Notation               notation name               null                      null
ProcessingInstruction  target                      entire content            null
Text                   "#text"                     content of the text node  null
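A few rows of the table can be checked directly. The sketch below prints nodeName and nodeValue for an element, a text node and a comment; NodeInfoSketch and its methods are hypothetical names, and the input document is invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original slides.
class NodeInfoSketch {
    static String describe(Node n) {
        return n.getNodeName() + "=" + n.getNodeValue();
    }

    // Describes the root element followed by each of its children.
    static List<String> describeRootAndChildren(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            List<String> out = new ArrayList<>();
            out.add(describe(root));
            for (Node c = root.getFirstChild(); c != null; c = c.getNextSibling()) {
                out.add(describe(c));
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Element: value is null; Text and Comment: fixed "#..." names.
        System.out.println(describeRootAndChildren("<r>t<!--c--></r>"));
    }
}
```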

28 Node Types - getNodeType() ELEMENT_NODE = 1 ATTRIBUTE_NODE = 2 TEXT_NODE = 3 CDATA_SECTION_NODE = 4 ENTITY_REFERENCE_NODE = 5 ENTITY_NODE = 6 PROCESSING_INSTRUCTION_NODE = 7 COMMENT_NODE = 8 DOCUMENT_NODE = 9 DOCUMENT_TYPE_NODE = 10 DOCUMENT_FRAGMENT_NODE = 11 NOTATION_NODE = 12
    if (myNode.getNodeType() == Node.ELEMENT_NODE) {
        // process node …
    }
Read more about Node Interface

29
    import org.w3c.dom.*;
    import javax.xml.parsers.*;

    public class EchoWithDom {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Drops ignorable whitespace, e.g. white space used for
            // indentation in elements without mixed content
            factory.setIgnoringElementContentWhitespace(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse("world.xml");
            new EchoWithDom().echo(doc);
        }

30
        private void echo(Node n) {
            print(n);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                NamedNodeMap atts = n.getAttributes();
                ++depth;
                for (int i = 0; i < atts.getLength(); i++)
                    echo(atts.item(i));
                --depth;
            }
            depth++;
            // Attribute nodes are not included among the children
            for (Node child = n.getFirstChild(); child != null;
                 child = child.getNextSibling())
                echo(child);
            depth--;
        }

31
        private int depth = 0;
        // Indexed by the numeric node type returned by getNodeType()
        private String[] NODE_TYPES = {
            "", "ELEMENT", "ATTRIBUTE", "TEXT", "CDATA", "ENTITY_REF",
            "ENTITY", "PROCESSING_INST", "COMMENT", "DOCUMENT",
            "DOCUMENT_TYPE", "DOCUMENT_FRAG", "NOTATION" };

        private void print(Node n) {
            for (int i = 0; i < depth; i++) System.out.print(" ");
            System.out.print(NODE_TYPES[n.getNodeType()] + ":");
            System.out.print("Name: " + n.getNodeName());
            System.out.print(" Value: " + n.getNodeValue() + "\n");
        }
    }
run EchoWithDom, pay attention to the default values

32 Another Example
    import javax.xml.parsers.*;
    import org.w3c.dom.*;

    public class WorldParser {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setIgnoringElementContentWhitespace(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse("world.xml");
            printCities(doc);
        }

33 Another Example (cont)
        public static void printCities(Document doc) {
            // getElementsByTagName searches within all descendants
            NodeList cities = doc.getElementsByTagName("city");
            for (int i = 0; i < cities.getLength(); ++i) {
                printCity((Element) cities.item(i));
            }
        }

        public static void printCity(Element city) {
            Node nameNode = city.getElementsByTagName("name").item(0);
            String cName = nameNode.getFirstChild().getNodeValue();
            System.out.println("Found City: " + cName);
        }
    }
run WorldParser

34 Normalizing the DOM Tree Normalizing a DOM Tree has two effects: -Combine adjacent textual nodes -Eliminate empty textual nodes (such nodes are typically created by node manipulation) To normalize, apply the normalize() method to the document element
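A small sketch of both effects, assuming an element built by hand with two adjacent text nodes and one empty text node. The class NormalizeSketch and the method demo are illustrative names only.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of the original slides.
class NormalizeSketch {
    // Returns "<children before> -> <children after>: <merged text>".
    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element p = doc.createElement("p");
            doc.appendChild(p);
            // Node manipulation leaves adjacent and empty text nodes behind.
            p.appendChild(doc.createTextNode("Hello, "));
            p.appendChild(doc.createTextNode(""));
            p.appendChild(doc.createTextNode("world"));
            int before = p.getChildNodes().getLength();
            p.normalize();
            int after = p.getChildNodes().getLength();
            return before + " -> " + after + ": " + p.getFirstChild().getNodeValue();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```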

35 Node Manipulation Children of a node in a DOM tree can be manipulated - added, edited, deleted, moved, copied, etc. To construct new nodes, use the methods of Document - createElement, createAttribute, createTextNode, createCDATASection etc. To manipulate a node, use the methods of Node - appendChild, insertBefore, removeChild, replaceChild, setNodeValue, cloneNode(boolean deep) etc.

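36 Node Manipulation (cont) [Figure: insertBefore places the new node before a reference node; replaceChild swaps an old node for a new one; cloneNode with deep = 'false' copies only the node, with deep = 'true' it copies the whole subtree] Figure as appears in “The XML Companion” - Neil Bradley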
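The manipulation methods can be sketched on a tiny hand-built tree. Everything here (the class ManipulationSketch, the element names list, a, b, c, d) is invented for illustration; the DOM calls themselves are the standard ones named on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical demo class, not part of the original slides.
class ManipulationSketch {
    static List<String> childNames(Node parent) {
        List<String> names = new ArrayList<>();
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            names.add(c.getNodeName());
        }
        return names;
    }

    static String demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element root = doc.createElement("list");
            doc.appendChild(root);
            Element a = doc.createElement("a");
            Element b = doc.createElement("b");
            Element c = doc.createElement("c");
            Element d = doc.createElement("d");
            root.appendChild(a);          // a
            root.appendChild(c);          // a, c
            root.insertBefore(b, c);      // a, b, c
            root.replaceChild(d, a);      // d, b, c  (new child first, old child second)
            root.removeChild(c);          // d, b
            Node shallow = root.cloneNode(false); // the element alone
            Node deep = root.cloneNode(true);     // the element with its subtree
            return childNames(root) + " shallow=" + shallow.getChildNodes().getLength()
                    + " deep=" + deep.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```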

37 SAX – Simple API for XML

38 SAX Parser SAX = Simple API for XML. XML is read sequentially. When a parsing event happens, the parser invokes the corresponding method of the corresponding handler. This is called event-driven programming. Most GUI programs are written using this paradigm. The handlers are the programmer’s implementations of standard Java APIs (i.e., interfaces and classes)

39-48 SAX events for world.xml [The slides step through the example document (text content: Israel, 6,199,008, Jerusalem, Ashdod, France, 60,424,213; markup lost in extraction), highlighting one parsing event at a time. Events in order: Start Document, Start Element, Start Element, Comment, Start Element, Characters, End Element, End Element, End Document]

49 SAX Parsers SAX Parser When you see the start of the document do … When you see the start of an element do … When you see the end of an element do ….

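50 SAX components: XMLReaderFactory - used to create a SAX parser. ContentHandler - handles document events: start tag, end tag, etc. ErrorHandler - handles parser errors. DTDHandler - handles the DTD. EntityResolver - handles entities.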

51 Creating a Parser The SAX interface is an accepted standard. There are many implementations by many vendors -The standard API does not include an actual implementation, but Sun provides one with the JDK. We would like to be able to change the implementation used, without changing any code in the program -How is this done?

52 Factory Design Pattern Have a “factory” class that creates the actual parsers - org.xml.sax.helpers.XMLReaderFactory The factory checks the configuration, mainly the value of a system property that specifies the implementation -Can be set outside the Java code: a configuration file, a command-line argument, etc. In order to change the implementation, simply change the system property. Read more about XMLReaderFactory Class

53 Creating a SAX Parser
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class EchoWithSax {
        public static void main(String[] args) throws Exception {
            // org.apache.xerces.parsers.SAXParser implements XMLReader
            System.setProperty("org.xml.sax.driver",
                               "org.apache.xerces.parsers.SAXParser");
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.parse("world.xml");
        }
    }
Read more about XMLReader Interface, Xerces SAXParser class

54 Implementing the Content Handler A SAX parser invokes methods such as startDocument, startElement and endElement of its content handler as it runs. In order to react to parsing events we must: -implement the ContentHandler interface -set the parser’s content handler with an instance of our ContentHandler implementation

55 ContentHandler Methods startDocument - parsing begins endDocument - parsing ends startElement - an opening tag is encountered endElement - a closing tag is encountered characters - text (character data) is encountered ignorableWhitespace - white spaces that should be ignored (according to the DTD) and more... Read more about ContentHandler Interface

56 The Default Handler The class DefaultHandler implements all handler interfaces (usually, in an empty manner) -i.e., ContentHandler, EntityResolver, DTDHandler, ErrorHandler An easy way to implement the ContentHandler interface is to extend DefaultHandler. Read more about DefaultHandler Class

57 A Content Handler Example
    import org.xml.sax.helpers.DefaultHandler;
    import org.xml.sax.*;

    public class EchoHandler extends DefaultHandler {
        int depth = 0;

        // Prints a line indented according to the current element depth
        public void print(String line) {
            for (int i = 0; i < depth; ++i) System.out.print(" ");
            System.out.println(line);
        }

58 A Content Handler Example
        public void startDocument() throws SAXException {
            print("BEGIN");
        }

        public void endDocument() throws SAXException {
            print("END");
        }

        // The Attributes interface is discussed later
        public void startElement(String ns, String localName, String qName,
                                 Attributes attrs) throws SAXException {
            print("Element " + qName + "{");
            ++depth;
            for (int i = 0; i < attrs.getLength(); ++i)
                print(attrs.getLocalName(i) + "=" + attrs.getValue(i));
        }

59 A Content Handler Example
        public void endElement(String ns, String lName, String qName)
                throws SAXException {
            --depth;
            print("}");
        }

        public void characters(char buf[], int offset, int len)
                throws SAXException {
            String s = new String(buf, offset, len).trim();
            ++depth;
            print(s);
            --depth;
        }
    }

60 Fixing The Parser
    import org.xml.sax.*;
    import org.xml.sax.helpers.*;

    public class EchoWithSax {
        public static void main(String[] args) throws Exception {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            // What would happen without this line?
            reader.setContentHandler(new EchoHandler());
            reader.parse("world.xml");
        }
    }
run EchoWithSax, then run EchoWithSax2

61 Empty Elements What do you think happens when the parser parses an empty element? run EchoWithSax3
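One way to answer the question is to record the events for a document that contains both forms of an empty element. This sketch obtains its parser from JAXP's SAXParserFactory rather than the XMLReaderFactory used on the slides (a substitution, since XMLReaderFactory is deprecated in recent JDKs); EmptyElementSketch and events are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides.
class EmptyElementSketch {
    // Records start/end element events in the order the parser reports them.
    static List<String> events(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attrs) {
                events.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end:" + qName);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return events;
    }

    public static void main(String[] args) {
        // <e/> and <e></e> produce the same pair of events.
        System.out.println(events("<r><e/><e></e></r>"));
    }
}
```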

62 Attributes Interface The Attributes interface provides access to all attributes of an element - getLength(), getQName(i), getValue(i), getType(i), getValue(qname), etc. The following are possible types for attributes: CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION There is no distinction between attributes that are specified explicitly and those that are supplied by the DTD (with a default value) run EchoWithSax and check the “capital” attribute, compare to the xml source. Read more about Attributes Interface
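The point about defaulted attributes can be sketched with an internal DTD subset that declares a default value. The document below is invented for the example (it is not the world.xml of the slides), and AttributesSketch and rootAttributes are hypothetical names; the parser comes from JAXP's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides.
class AttributesSketch {
    // Collects the attributes reported for the first (root) element.
    static Map<String, String> rootAttributes(String xml) {
        Map<String, String> result = new TreeMap<>();
        DefaultHandler handler = new DefaultHandler() {
            private boolean seenRoot = false;

            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attrs) {
                if (seenRoot) return;
                seenRoot = true;
                for (int i = 0; i < attrs.getLength(); i++) {
                    result.put(attrs.getQName(i), attrs.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        // "capital" is never written on <r>, yet it is reported like "name".
        String xml = "<!DOCTYPE r [<!ATTLIST r capital CDATA \"no\">]><r name=\"x\"/>";
        System.out.println(rootAttributes(xml));
    }
}
```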

63 ErrorHandler Interface We implement ErrorHandler to receive error events (similar to implementing ContentHandler) DefaultHandler implements ErrorHandler in an empty fashion, so we can extend it (as before) An ErrorHandler is registered with - reader.setErrorHandler(handler); Three methods: - void error(SAXParseException ex); - void fatalError(SAXParseException ex); - void warning(SAXParseException ex);

64 Parsing Errors Fatal errors prevent the parser from continuing to parse -For example, the document is not well formed, an unknown XML version is declared, etc. Errors (that is, recoverable ones) occur, for example, when the parser is validating and validity constraints are violated. Warnings occur when abnormal (yet legal) conditions are encountered -For example, an entity is declared twice in the DTD. Read more about ErrorHandler Interface
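A minimal sketch of an error handler that records which callback fired, run against a document that is not well formed. The parser is created through JAXP's SAXParserFactory instead of the XMLReaderFactory shown on the slides; ErrorHandlerSketch and reported are hypothetical names.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides.
class ErrorHandlerSketch {
    // Returns the names of the ErrorHandler callbacks invoked while parsing.
    static List<String> reported(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void warning(SAXParseException ex) {
                log.add("warning");
            }

            @Override
            public void error(SAXParseException ex) {
                log.add("error");
            }

            @Override
            public void fatalError(SAXParseException ex) throws SAXException {
                log.add("fatalError");
                throw ex; // parsing cannot continue after a fatal error
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException e) {
            // Expected for malformed input; the callback already logged it.
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(reported("<a><b></a>")); // mismatched tags
        System.out.println(reported("<a/>"));       // well formed
    }
}
```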

65 EntityResolver and DTDHandler The interface EntityResolver enables the programmer to specify a new source for translation of external entities (e.g. an external DTD). The interface DTDHandler enables the programmer to react to declarations of notations and unparsed entities inside the DTD; these usually appear with external non-XML resources and describe their type. Read more about EntityResolver Interface, DTDHandler Interface

66 Features and Properties SAX parsers can be configured by setting their features and properties Syntax: - reader.setFeature("feature-url", boolean) - reader.setProperty("property-url", Object) Standard feature URLs have the form: Standard property URLs have the form Using URLs is a (confusing) technique used to ensure unique names for non-standard features and properties added by different implementations. These URLs do not represent web addresses!

67 Feature/Property Examples Features: - namespaces - are namespaces supported? - validation - does the parser validate (against the declared DTD)? - Ignore the DTD? (specific to the Xerces implementation) Properties: - xml-string - the actual text that caused the current event (read-only with getProperty()) - lexical-handler - see the next slide... Read more about Features. Read more about Properties.
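As a sketch of setting a feature, the example below turns on the standard SAX namespaces feature, identified by the URI http://xml.org/sax/features/namespaces, and reads the namespace URI and local name of the root element. The XMLReader is obtained through JAXP here, and FeatureSketch, qualifiedRoot and the urn:demo namespace are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides.
class FeatureSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Returns "{namespace-uri}local-name" for the root element.
    static String qualifiedRoot(String xml) {
        StringBuilder out = new StringBuilder();
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            reader.setFeature(NAMESPACES, true); // must be set before parsing
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName, String qName,
                                         Attributes attrs) {
                    if (out.length() == 0) {
                        out.append("{").append(uri).append("}").append(localName);
                    }
                }
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(qualifiedRoot("<x:a xmlns:x=\"urn:demo\"/>"));
    }
}
```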

68 Lexical Events Lexical events have to do with the way that a document was written and not with its content. Examples: -A comment is a lexical event ( <!-- comment --> ) -The use of an entity is a lexical event ( &gt; ) These can be dealt with by implementing the LexicalHandler interface, and setting the lexical-handler property to an instance of the handler

69 LexicalHandler Methods comment(char[] ch, int start, int length) startCDATA() endCDATA() startEntity(java.lang.String name) endEntity(java.lang.String name) and more... Read more about LexicalHandler Interface
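A short sketch of receiving comment events. It extends org.xml.sax.ext.DefaultHandler2, which supplies empty LexicalHandler methods, and registers the handler under the standard property URI http://xml.org/sax/properties/lexical-handler. LexicalSketch and comments are hypothetical names, and the XMLReader is obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical demo class, not part of the original slides.
class LexicalSketch {
    static final String LEXICAL_HANDLER =
            "http://xml.org/sax/properties/lexical-handler";

    // Collects the text of every comment in the document.
    static List<String> comments(String xml) {
        List<String> found = new ArrayList<>();
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                found.add(new String(ch, start, length));
            }
        };
        try {
            XMLReader reader = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // Comments are only delivered once the lexical handler is registered.
            reader.setProperty(LEXICAL_HANDLER, handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(comments("<r><!-- hi --><a/><!--x--></r>"));
    }
}
```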

70 SAX vs. DOM

71 Parser Efficiency The DOM object built by DOM parsers is usually complicated (composed of many, many objects) and requires more memory storage than the XML file itself. -A lot of time is spent on construction before use. -For some very large documents, this may be impractical. SAX parsers store only local information that is encountered during the serial traversal and do not create many temporary objects during traversal. Hence, programming with SAX parsers is, in general, more efficient.

72 Programming using SAX is Difficult In some cases, programming with SAX might seem difficult at first glance: -How can we find, using a SAX parser, elements e1 with ancestor e2 ? -How can we find, using a SAX parser, elements e1 that have a descendant element e2 ? -How can we find the element e1 referenced by the IDREF attribute of e2 ? In other cases, using SAX can be more elegant.
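The first question can be handled by keeping state between events. A possible sketch, assuming a stack of currently open element names is enough for the task: AncestorSketch and countWithAncestor are hypothetical names, and the parser comes from JAXP's SAXParserFactory.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original slides.
class AncestorSketch {
    // Counts elements named `target` that have an ancestor named `ancestor`.
    static int countWithAncestor(String xml, String target, String ancestor) {
        int[] count = {0};
        Deque<String> open = new ArrayDeque<>(); // names of the open elements
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes attrs) {
                // Check before pushing, so an element is not its own ancestor.
                if (qName.equals(target) && open.contains(ancestor)) {
                    count[0]++;
                }
                open.push(qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                open.pop();
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        String xml = "<r><e1/><e2><x><e1/></x><e1/></e2></r>";
        System.out.println(countWithAncestor(xml, "e1", "e2"));
    }
}
```

The descendant and IDREF questions need more bookkeeping, since the answer for an element may not be known until later events arrive; this is the kind of case where a DOM tree is more comfortable.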

73 Node Navigation SAX parsers do not provide access to elements other than the one currently visited in the serial (DFS) traversal of the document In particular, -They do not read backwards -They do not enable access to elements by ID or name DOM parsers enable any traversal method Hence, using DOM parsers can be more comfortable

74 More DOM Advantages A DOM object is, in effect, compiled XML: you can save time and effort if you send and receive DOM objects instead of XML files -But DOM objects are generally larger than the source. DOM parsers provide a natural integration of XML reading and manipulating -e.g., “cut and paste” of XML fragments

75 Which should we use? DOM vs. SAX If your document is very large and you only need a few elements – use SAX If you need to manipulate (i.e., change) the XML – use DOM If you need to access the XML many times – use DOM (assuming the file is not too large) Depending on the task at hand, it might be easier for you to visualize it implemented using one of the APIs and not the other – so use it!