SDPL 2011: 3.4 Streaming API for XML (StAX)


3.4 Streaming API for XML (StAX)
• Could we process XML documents more conveniently than with SAX, and yet more efficiently?
• A: Yes, with the Streaming API for XML (StAX)
  – general introduction
  – an example
  – comparison with SAX

StAX: General
• Latest of the standard Java XML parser interfaces
  – Origin: the XMLPull API (A. Slominski, ~2000)
  – developed as a Java Community Process (JSR 173) led by BEA Systems (2003)
  – included in JAXP 1.4, in Java WSDP 1.6, and in Java SE 6 (JDK 1.6)
• An event-driven streaming API, like SAX
  – does not build an in-memory representation
• A "pull API"
  – lets the application ask for individual events
  – unlike a "push API" like SAX

Advantages of Pull Parsing
• A pull API provides events, on demand, from the chosen stream
  – can cancel parsing, say, after processing the header of a long message
  – can read multiple documents simultaneously
  – application-controlled access (~ iterator design pattern) is usually simpler than SAX-style callbacks (~ observer design pattern)
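
The "cancel parsing" advantage can be sketched as follows. This is a minimal illustration, not part of the original slides: the class name EarlyStop, the method firstSubject, and the element name subject are assumptions made for the example. The application simply stops asking for events once it has what it needs.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: pulls events only until the first <subject> is found.
class EarlyStop {

    // Returns the text of the first <subject> element, or null if there is none.
    static String firstSubject(String xml) throws XMLStreamException {
        XMLStreamReader xsr =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        try {
            while (xsr.hasNext()) {
                if (xsr.next() == XMLStreamConstants.START_ELEMENT
                        && "subject".equals(xsr.getLocalName())) {
                    // getElementText() reads the text content up to the end tag;
                    // after returning, no further events are requested.
                    return xsr.getElementText();
                }
            }
            return null;
        } finally {
            xsr.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String msg = "<message><header><subject>Hello</subject></header>"
                   + "<body><p>A very long body ...</p></body></message>";
        System.out.println(firstSubject(msg)); // prints "Hello"
    }
}
```

With a SAX parser the same effect typically requires throwing an exception from a callback, since the parser, not the application, drives the loop.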

Cursor and Iterator APIs
• StAX consists of two sets of APIs
  – (1) cursor APIs, and (2) iterator APIs
  – they differ in how parse events are represented
(1) Cursor API XMLStreamReader
  – lower-level
  – methods hasNext() and next() to scan events, represented as int constants START_DOCUMENT, START_ELEMENT, ...
  – access methods, depending on the current event type: getName(), getAttributeValue(..), getText(), ...
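
A small sketch of the cursor API in use, assuming a document held in a String. The class CursorDemo, the method describe, and the attribute name id are illustrative choices, not taken from the slides. Which access methods are legal depends on the current event type, so each case calls only the methods valid for that event.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class: describes each parse event as a short string.
class CursorDemo {

    static List<String> describe(String xml) throws XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        // Ask for adjacent character data to be reported as a single event.
        xif.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader xsr = xif.createXMLStreamReader(new StringReader(xml));

        List<String> out = new ArrayList<>();
        while (xsr.hasNext()) {
            switch (xsr.next()) {           // next() returns an int event code
                case XMLStreamConstants.START_ELEMENT:
                    // A null namespace argument means "do not check the namespace".
                    out.add("start " + xsr.getLocalName()
                            + " id=" + xsr.getAttributeValue(null, "id"));
                    break;
                case XMLStreamConstants.CHARACTERS:
                    out.add("text " + xsr.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    out.add("end " + xsr.getLocalName());
                    break;
                default:                    // END_DOCUMENT, comments, etc. are ignored
                    break;
            }
        }
        xsr.close();
        return out;
    }

    public static void main(String[] args) throws XMLStreamException {
        // Prints: start a id=1 / text Hi / end a (one per line)
        for (String line : describe("<a id=\"1\">Hi</a>")) {
            System.out.println(line);
        }
    }
}
```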

(2) Iterator API XMLEventReader
• XMLEventReader provides the contents of an XML document to the application using an event object iterator
• Parse events are represented as immutable XMLEvent objects
  – received using methods hasNext() and nextEvent()
  – event properties accessed through their methods
  – can be stored (if needed)
  – require more resources than the cursor API (see later)
• Event lookahead, without advancing in the stream, with XMLEventReader.peek() and XMLStreamReader.getEventType()
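
Event lookahead with peek() might be used as in the sketch below, which reports elements whose start tag is immediately followed by their end tag. The class PeekDemo and the method emptyElements are hypothetical names introduced for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical example class: finds elements with no content using one-event lookahead.
class PeekDemo {

    static List<String> emptyElements(String xml) throws XMLStreamException {
        XMLEventReader xer =
            XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> empty = new ArrayList<>();
        while (xer.hasNext()) {
            XMLEvent event = xer.nextEvent();   // consumes one event object
            if (event.isStartElement()) {
                XMLEvent next = xer.peek();     // looks ahead without consuming
                if (next != null && next.isEndElement()) {
                    empty.add(event.asStartElement().getName().getLocalPart());
                }
            }
        }
        xer.close();
        return empty;
    }

    public static void main(String[] args) throws XMLStreamException {
        // <b/> and <d></d> have no content; <c> contains text.
        System.out.println(emptyElements("<a><b/><c>x</c><d></d></a>")); // prints [b, d]
    }
}
```

Because XMLEvent objects are immutable, the event held in the variable event stays valid after peek() and further nextEvent() calls, which is what makes storing events safe.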

Writing APIs
• StAX is a bidirectional API
  – also allows writing XML data through an XMLStreamWriter or an XMLEventWriter
• Useful for "marshaling" data structures into XML
• Writers are not required to enforce well-formedness (not to mention validity)
  – they provide some support: escaping of reserved chars like & and <, and adding missing end-tags for unclosed elements
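
A minimal marshaling sketch with XMLStreamWriter, assuming the default writer implementation shipped with the JDK. The class WriterDemo, the method marshal, and the element names person and note are made up for the example. It relies on the two kinds of support mentioned above: escaping in writeCharacters(), and writeEndDocument() closing elements that are still open.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class: writes a tiny data structure as XML.
class WriterDemo {

    static String marshal(String name, String note) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLStreamWriter xsw = XMLOutputFactory.newInstance().createXMLStreamWriter(sw);
        xsw.writeStartElement("person");
        xsw.writeAttribute("name", name);
        xsw.writeStartElement("note");
        xsw.writeCharacters(note);  // reserved characters such as & and < are escaped
        xsw.writeEndDocument();     // closes the still-open <note> and <person>
        xsw.close();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(marshal("Ann", "a < b & c"));
    }
}
```

Nothing in this sketch stops the application from writing two root elements or an illegal element name; as the slide notes, the writer is not obliged to reject such output.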

Example of Using StAX (1/6)
• Use the StAX iterator interfaces to
  – fold element tag names to uppercase, and to
  – strip comments
• Outline:
  – Initialize
    » an XMLEventReader for the input document
    » an XMLEventWriter (for System.out)
    » an XMLEventFactory for creating modified StartElement and EndElement events
  – Use them to read all input events, and to write some of them, possibly modified

StAX example (2/6)
First import the relevant interfaces & classes:

    import java.io.*;
    import javax.xml.stream.*;
    import javax.xml.stream.events.*;
    import javax.xml.namespace.QName;

    public class capitalizeTags {
      public static void main(String[] args)
          throws FactoryConfigurationError, XMLStreamException, IOException {
        if (args.length != 1) System.exit(1);
        InputStream input = new FileInputStream(args[0]);

StAX example (3/6)
Initialize the XMLEventReader/Writer/Factory:

        XMLInputFactory xif = XMLInputFactory.newInstance();
        xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
        XMLEventReader xer = xif.createXMLEventReader(input);

        XMLOutputFactory xof = XMLOutputFactory.newInstance();
        XMLEventWriter xew = xof.createXMLEventWriter(System.out);

        XMLEventFactory xef = XMLEventFactory.newInstance();

StAX example (4/6)
• Iterate over the events of the InputStream:

        while (xer.hasNext()) {
          XMLEvent inEvent = xer.nextEvent();
          if (inEvent.isStartElement()) {
            StartElement se = (StartElement) inEvent;
            QName inQName = se.getName();
            String localName = inQName.getLocalPart();
            xew.add(xef.createStartElement(
                inQName.getPrefix(),
                inQName.getNamespaceURI(),
                localName.toUpperCase(),
                se.getAttributes(),
                se.getNamespaces()));

StAX example (5/6)
• Event iteration continues, to capitalize end tags:

          } else if (inEvent.isEndElement()) {
            EndElement ee = (EndElement) inEvent;
            QName inQName = ee.getName();
            String localName = inQName.getLocalPart();
            xew.add(xef.createEndElement(
                inQName.getPrefix(),
                inQName.getNamespaceURI(),
                localName.toUpperCase(),
                ee.getNamespaces()));

StAX example (6/6)
Output other events, except for comments; finish when input ends:

          } else if (inEvent.getEventType() != XMLStreamConstants.COMMENT) {
            xew.add(inEvent);
          }
        } // while (xer.hasNext())
        xer.close(); input.close();
        xew.flush(); xew.close();
      } // main()
    } // class capitalizeTags

Efficiency of Streaming APIs?
• An experiment of SAX vs StAX for scanning documents
• Task: Count and report the number of elements, attributes, character fragments, and total char length
• Inputs: Similar prose-oriented documents of different sizes
  – repeated fragments of the W3C XML Schema Rec (Part 1)
• Tested on OpenJDK (different updates), with
  – Red Hat Linux, 3 GHz Pentium, 1 GB RAM ("OLD")
  – 64-bit CentOS Linux 5, 2.93 GHz Intel Core 2 Duo, 4 GB RAM ("NEW")

Essentials of the SAX Solution
• Obtain and use a JAXP SAX parser:

    String docFile; // initialized from cmd line
    SAXParserFactory spf = SAXParserFactory.newInstance();
    spf.setValidating(validate); // from cmd option
    spf.setNamespaceAware(true);
    SAXParser sp = spf.newSAXParser();
    CountHandler ch = new CountHandler();
    sp.parse(new File(docFile), ch);
    ch.printResult(); // print the statistics

SAX Solution: CountHandler

    public static class CountHandler extends DefaultHandler {
      // Instance vars for statistics:
      int elemCount = 0, charFragCount = 0,
          totalCharLen = 0, attrCount = 0;

      // Called once per start tag: count the element and its attributes
      public void startElement(String nsURI, String locName,
                               String qName, Attributes atts) {
        elemCount++;
        attrCount += atts.getLength();
      }

      // Called once per character fragment
      public void characters(char[] buf, int start, int length) {
        charFragCount++;
        totalCharLen += length;
      }

      // printResult() is not shown on the slide
    }

Essentials of the StAX Solution
• First, initialize:

    XMLInputFactory xif = XMLInputFactory.newInstance();
    xif.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, true);
    InputStream input = new FileInputStream(docFile);
    int elemCount = 0, charFragCount = 0,
        totalCharLen = 0, attrCount = 0;

• Then parse the InputStream, using (a) the cursor API, or (b) the event iterator API

(a) StAX Cursor API Solution (1)

    XMLStreamReader xsr = xif.createXMLStreamReader(input);
    while (xsr.hasNext()) {
      int eventType = xsr.next();
      switch (eventType) {
        case XMLEvent.START_ELEMENT:
          elemCount++;
          attrCount += xsr.getAttributeCount();
          break;

(a) StAX Cursor API Solution (2)

        case XMLEvent.CHARACTERS:
          charFragCount++;
          totalCharLen += xsr.getTextLength();
          break;
        default: break;
      } // switch
    } // while (xsr.hasNext())
    xsr.close();
    input.close();

(b) StAX Iterator API Solution (1)

    XMLEventReader xer = xif.createXMLEventReader(input);
    while (xer.hasNext()) {
      XMLEvent event = xer.nextEvent();
      if (event.isStartElement()) {
        elemCount++;
        Iterator attrs = event.asStartElement().getAttributes();
        while (attrs.hasNext()) {
          attrs.next(); attrCount++;
        }
      } // if (event.isStartElement())

(b) StAX Iterator API Solution (2)

      if (event.isCharacters()) {
        charFragCount++;
        totalCharLen += ((Characters) event).getData().length();
      }
    } // while (xer.hasNext())
    xer.close();
    input.close();

Efficiency of SAX vs StAX
[chart]

Efficiency of SAX vs StAX (NEW)
[chart]

Observations
• The StAX cursor API is the most efficient
  – Overhead of XMLEvent objects makes the StAX iterator API some 50–80% slower
• On small documents SAX is ~ % slower than the StAX cursor API
• Overhead of DTD validation adds ~5–10% to SAX parsing time
• StAX loses its advantage with bigger documents:

Times on Larger Documents
[chart]
Why? Let's take a look at memory usage

Memory Usage of SAX vs StAX
[chart; annotation: < 6 MB]
StAX implementation has a memory leak! (Should get fixed in future releases)

Memory Usage of SAX vs StAX (NEW)
[chart]
Memory leak also in the SAX implementation!

Circumventing the Memory Leak
• The bug appears to be related to a DOCTYPE declaration with an external DTD
• Without a DOCTYPE declaration
  – In the first experiment, each API uses less than 6 MB
  – In the second experiment, the StAX Event objects still require increasing amounts of memory; see next

SAX vs StAX memory need (without DTD)
[chart]

Speed on documents without DTD
[chart]

Speed on documents without DTD (NEW)
[chart]

StAX: Summary
• Event-based streaming pull API for XML documents
• More convenient than SAX
  – and often more efficient, esp. the cursor API with small docs
• Also supports writing of XML data
• A potential substitute for SAX
  – NB: Sun Java Streaming XML Parser (in JDK 1.6) is non-validating (but the API allows validation, too)
  – once some implementation bugs (in JDK 1.6) get eliminated