1 4/13/01 CSE 121/131 Programming Spring 2001 Lecture Notes 7 A. Sahuguet & V.Tannen
2 4/13/01 Data on the Web, today: HTML... Primary Faculty Rajeev Alur Associate Professor, Computer and Information Science Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing....
3 4/13/01 Data on the Web, tomorrow: XML... Rajeev Alur Associate Professor Computer and Information Science Formal support for design and analysis of reactive, real-time, and hybrid systems. Hardware verification; Software engineering; Control of distributed multi-agent systems; Logic and concurrency theory; Distributed computing....
4 4/13/01 What is XML?
Like HTML, XML is a “document markup language”, i.e., a way to enrich text with tags and attributes.
HTML’s markup is about visual presentation, which makes it difficult for a program to manipulate the data in HTML.
XML’s markup is about the meaning of the information, which makes it easier for programs to manipulate XML.
Still, what we saw on the previous slide is an external format. Internally, XML is represented as trees.
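To make the "XML is a tree" point concrete, here is a minimal sketch that parses a small XML string with the JDK's built-in parser and prints the resulting tree as an indented outline. The element names (faculty, name, title) are assumptions chosen for illustration; they are not taken from the slides.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlTreeDemo {
    // Hypothetical sample record; the tag names are illustrative only.
    public static final String SAMPLE =
        "<faculty><name>Rajeev Alur</name><title>Associate Professor</title></faculty>";

    // Parse an XML string into a DOM tree (checked exceptions are wrapped to keep the sketch short).
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Render the subtree rooted at node as an outline: one node per line, two spaces per level.
    public static String outline(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) sb.append("  ");
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append('"').append(node.getNodeValue()).append('"').append('\n');
        } else {
            sb.append(node.getNodeName()).append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            sb.append(outline(children.item(i), depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(outline(parse(SAMPLE).getDocumentElement(), 0));
    }
}
```

The text between tags shows up as leaf nodes of its own, which is why "Rajeev Alur" appears one level below name in the outline.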
5 4/13/01 How XML overcomes some HTML limitations
Using XML, content providers can separate form and content.
[Diagram: XML content, combined with XSL stylesheets, is rendered as Wireless Markup Language, HTML, or HTML for Web-TV.]
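As a rough illustration of separating form and content, the sketch below applies two different XSL stylesheets to the same XML content using the JDK's javax.xml.transform API. The content, the element names, and both stylesheets are assumptions made up for this example; the plain-text rendering stands in for a second target format such as WML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class XslDemo {
    // Hypothetical content: it carries the data only, with no presentation.
    public static final String CONTENT = "<faculty><name>Rajeev Alur</name></faculty>";

    // Two hypothetical stylesheets: the same content, two presentations.
    public static final String AS_HTML = stylesheet("html", "<h1><xsl:value-of select='name'/></h1>");
    public static final String AS_TEXT = stylesheet("text", "Name: <xsl:value-of select='name'/>");

    // Wrap a template body for the faculty element in a minimal XSLT 1.0 stylesheet.
    private static String stylesheet(String method, String template) {
        return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='" + method + "' indent='no'/>"
             + "<xsl:template match='/faculty'>" + template + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Apply the stylesheet xsl to the document xml and return the result as a string.
    public static String transform(String xml, String xsl) {
        try {
            Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(CONTENT, AS_HTML));
        System.out.println(transform(CONTENT, AS_TEXT));
    }
}
```

Supporting a new device then means writing a new stylesheet; the XML content itself does not change.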
6 4/13/01 Wireless Applications
Hand-held devices have some constraints:
–small display
–narrowband network connection
–limited memory and computational resources
HTML is not suitable for delivering information to them, hence the need for a Wireless Markup Language (WML).
What WML offers:
–specific layout
–new metaphor (deck, cards)
–state management
–binary XML format to make data more concise
The same metaphor can be used for e-forms in various domains: interactive kiosks, medical forms, etc.
7 4/13/01 Manipulating XML documents
Manipulation
–parsing: reading, checking syntax, transforming into an internal format
–navigating
–modifying
Fortunately, XML comes with a standard API that offers all these features: the Document Object Model (DOM).
API: Application Programming Interface
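The three kinds of manipulation can be sketched in a few lines with the DOM implementation bundled with the JDK. The course document and its element names are hypothetical, and the sketch assumes the title element contains a single text node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class DomManipulation {
    // Parsing: read the text, check that it is well-formed, build the internal tree.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // Navigating and modifying: find the first title element, replace its text,
    // and return the old text. Assumes the title holds exactly one text node.
    public static String retitle(Document doc, String newTitle) {
        Element title = (Element) doc.getElementsByTagName("title").item(0);
        Node text = title.getFirstChild();
        String old = text.getNodeValue();
        text.setNodeValue(newTitle);
        return old;
    }

    public static void main(String[] args) {
        // Hypothetical document, made up for this example.
        Document doc = parse("<course><title>Programming</title></course>");
        System.out.println("old title: " + retitle(doc, "Programming, Spring 2001"));
        // Modifying also covers attributes.
        doc.getDocumentElement().setAttribute("number", "CSE 121/131");
        System.out.println("number: " + doc.getDocumentElement().getAttribute("number"));
    }
}
```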
8 4/13/01 DOM
“DOM provides a programmatic access to the content, structure and style of XML documents and allows languages such as Java to extract information from documents containing specific tags as if they were objects.” [Ardent’s white paper on XML]
–Platform-neutral API designed by W3C using CORBA/IDL
–Mapping to various programming languages (Java, C++, Perl, etc.)
–DOM is supported by all the major players
–DOM makes applications that process XML documents independent of the parser and of the internal representation
9 4/13/01 DOM overview
What DOM is doing
[Figure: table data with the rows “Shady Grove”, “Aeolian” and “Over the River, Charlie”, “Dorian”.]
10 4/13/01 The DOM API (overview)
Interface hierarchy
Node
–Attr
–CharacterData
  –Comment
  –Text
    –CDATASection
–Document
–Element
–Entity
NodeList
interface Document
–createAttribute(…)
–createCDATASection(…)
–createComment(…)
–createElement(…)
–createTextNode(…)
interface Node
–appendChild(…)
–getAttributes(…)
–getChildNodes(…)
interface Element
–getAttribute(name)
–getAttributeNode(name)
–getElementsByTagName(name)
The full API can be found at
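A short sketch of how these interfaces fit together: Document acts as the factory for nodes, Node.appendChild assembles the tree, and Element gives access to attributes and descendants. The Patent, Inventor and Title names and their values are hypothetical, and the JDK's DocumentBuilderFactory is used only as a convenient way to obtain an empty Document.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomBuildDemo {
    // Build a small document purely through the DOM API; all names and values are made up.
    public static Document build() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element patent = doc.createElement("Patent");       // Document.createElement
        doc.appendChild(patent);                             // Node.appendChild
        for (String name : new String[] {"Ada", "Grace"}) {
            Element inventor = doc.createElement("Inventor");
            inventor.setAttribute("First_Name", name);
            patent.appendChild(inventor);
        }
        Element title = doc.createElement("Title");
        title.appendChild(doc.createTextNode("A hypothetical patent")); // Document.createTextNode
        patent.appendChild(title);
        return doc;
    }

    // Read the First_Name attribute of every Inventor element, in document order.
    public static List<String> firstNames(Document doc) {
        NodeList inventors = doc.getElementsByTagName("Inventor");
        List<String> names = new ArrayList<>();
        for (int i = 0; i < inventors.getLength(); i++) {
            names.add(((Element) inventors.item(i)).getAttribute("First_Name"));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(firstNames(build()));
    }
}
```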
11 4/13/01 DOM in action
We take an HTML page from the IBM Patent server and we XML-ize it. From it, we want to extract some specific information, such as the names of the inventors.
Four ways to do it:
–Java DOM
–Java XQL
–Perl
–XML-QL (will return an XML document)
12 4/13/01 The Patent Example Converted using W4F
13 4/13/01 DOM with Java
    import com.ibm.xml.parser.*;
    import org.w3c.dom.*;
    import java.io.*;

    public class Test {
        public static void main(String args[]) throws Exception {
            // Parse the XML file named on the command line into a DOM tree.
            Parser parser = new Parser( args[0] );
            Document doc = parser.readStream( new FileInputStream( args[0] ));
            // Collect every Inventor element, wherever it occurs in the tree.
            NodeList nodes = doc.getElementsByTagName("Inventor");
            int n = nodes.getLength();
            for(int i=0; i<n; i++) {
                Element node = (Element) nodes.item(i);
                // Print the First_Name attribute of each inventor.
                String firstName = node.getAttribute("First_Name");
                System.out.println(firstName);
            }
        }
    }
14 4/13/01 DOM with Java and XQL (GMD, IBM)
    import de.gmd.ipsi.xql.*;
    import org.w3c.dom.*;
    import com.ibm.xml.parser.*;
    import java.io.*;

    public class XQLTest {
        public static void main(String args[]) throws Exception {
            // Parse the XML file named on the command line into a DOM tree.
            Parser parser = new Parser( args[0] );
            Document doc = parser.readStream( new FileInputStream( args[0] ));
            // Run the query: select all Inventor elements at any depth.
            XQLResult r = XQL.execute("//Inventor", doc );
            for(int i=0; i<r.getLength(); i++) {
                Element inventor = (Element) r.getItem(i);
                // Print the First_Name attribute of each inventor.
                String firstName = inventor.getAttribute("First_Name");
                System.out.println(firstName);
            }
        }
    }
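The query string `//Inventor` used above is also a valid XPath expression, so the same extraction can be sketched with the javax.xml.xpath package that ships with the JDK. This is a substitute for the GMD XQL engine, not a description of it, and the sample document in main is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathInventors {
    // Evaluate //Inventor against the XML text and collect the First_Name attributes.
    public static List<String> firstNames(String xml) {
        try {
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                "//Inventor", new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                names.add(((Element) hits.item(i)).getAttribute("First_Name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("query failed", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical patent record; the real document comes from the patent server.
        String xml = "<Patent><Inventor First_Name=\"Ada\"/><Inventor First_Name=\"Grace\"/></Patent>";
        for (String name : firstNames(xml)) {
            System.out.println(name);
        }
    }
}
```

Compared with the plain DOM version, the selection logic lives in the query string, so a more selective query needs no extra Java code.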
15 4/13/01 DOM with Perl
Extracting the names of the inventors from the IBM Patent database.
    #!/usr/bin/perl
    # Include the Perl package.
    use XML::DOM;
    # Instantiate a new parser and parse the source file.
    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parsefile ("patent.xml");
    # Get the list of nodes that correspond to Inventor elements.
    my $nodes = $doc->getElementsByTagName ("Inventor");
    my $n = $nodes->getLength;
    # For each node, extract the First_Name attribute and print it.
    for (my $i = 0; $i < $n; $i++) {
        my $node = $nodes->item ($i);
        my $first_name = $node->getAttribute ("First_Name");
        print $first_name, "\n";
    }
16 4/13/01 SAX, a low-level alternative to DOM
SAX
–Simple API for XML
–supported by most XML parsers
–event-driven parser
Instead of reading the entire file into memory and building a tree, SAX reads a stream of tokens and triggers events:
–startDocument
–startElement
–endElement
–endDocument
The programmer has to write a document handler that captures these events and does something with the tokens.
17 4/13/01 An Example of SAX
    import java.io.PrintWriter;
    import org.xml.sax.AttributeList;
    import org.xml.sax.DocumentHandler;

    public class OutputHandler implements DocumentHandler {
        private PrintWriter pw;

        public OutputHandler() { this.pw = new PrintWriter( System.out ); }
        public OutputHandler(PrintWriter pw) { this.pw = pw; }

        // Flush any buffered output.
        public String toString() { pw.flush(); return ""; }

        // Echo character data; only the slice [start, start + length) of ch is valid.
        public void characters(char[] ch, int start, int length) {
            pw.print(new String(ch, start, length));
        }

        public void startDocument() { pw.println(" "); }
        public void endDocument() { pw.println(" "); }

        // Print the opening tag together with its attributes.
        public void startElement(String name, AttributeList atts) {
            pw.print("<" + name);
            if (atts != null)
                for(int i = 0; i < atts.getLength(); ++i)
                    pw.print(" " + atts.getName(i) + "=\"" + atts.getValue(i) + "\"");
            pw.println(">");
        }

        // Print the matching closing tag.
        public void endElement(String name) { pw.println("</" + name + ">"); }

        /* to be continued … */
    }
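For comparison, here is a self-contained sketch that runs on a current JDK. It assumes the SAX2 DefaultHandler base class rather than the SAX1 DocumentHandler interface used on the slide, and it records the events in a list instead of printing tags, so that the order in which the parser fires them is visible. The sample document in main is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEventsDemo {
    // Records each event as a short string, in the order the parser fires it.
    static class EventLog extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override public void startDocument() { events.add("startDocument"); }
        @Override public void endDocument() { events.add("endDocument"); }

        @Override public void startElement(String uri, String localName, String qName, Attributes atts) {
            events.add("startElement " + qName);
        }

        @Override public void endElement(String uri, String localName, String qName) {
            events.add("endElement " + qName);
        }

        @Override public void characters(char[] ch, int start, int length) {
            // A parser may deliver one run of text in several chunks; whitespace-only chunks are skipped.
            String text = new String(ch, start, length).trim();
            if (!text.isEmpty()) events.add("characters " + text);
        }
    }

    // Parse the XML text and return the list of events that were fired.
    public static List<String> events(String xml) {
        try {
            EventLog log = new EventLog();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), log);
            return log.events;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        for (String event : events("<Patent><Inventor First_Name=\"Ada\"/></Patent>")) {
            System.out.println(event);
        }
    }
}
```

No tree is ever built: once an event has been handled, the parser keeps nothing about it, so any state the application needs has to live in the handler.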
18 4/13/01 SAX vs DOM
SAX
–does not keep the document in memory (great for stream-based processing)
–navigation in the document is clumsy
–does not allow updating an XML document
DOM
–permits updates
–offers the DOM API for navigation/construction
–requires the entire document to be stored in main memory
19 4/13/01 The Missing Link
There is only a “gentlemen’s agreement” between the application and its XML environment.
Why do we need to go beyond that?
–performance
–static guarantees (help to identify and control failures)
How do we create a tight contract between the application and its XML environment?
[Diagram: XML (input) → Application → XML (output)]
20 4/13/01 XML Binding
Requirements
–high-level specification for XML (e.g. DTD, XML-Schemas, UML, etc.)
–a mapping to your favorite programming language (e.g. Java)
–a compiler that will generate code (“stubs” that define an API)
(Same paradigm as CORBA/IDL or ODMG/ODL)
Sun’s Proposal:
[Diagram: XML spec. → compiler → stubs]
21 4/13/01 Generic (DOM/SAX) vs Domain Specific API
generic API
–generic parsing
–getElement(“order”)
–getAttribute(“date”)
–generic marshalling
–only runtime checks
domain specific API
–domain specific parsing
–get_order()
–get_date()
–domain specific marshalling
–both static and runtime checks
Instead of a generic API (e.g. SAX, DOM), the application will use a domain specific API generated from the specification.
Issues
–mapping XML “types” accurately to a programming language
–static checks vs runtime checks (some features from the specification cannot be checked statically)
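The contrast can be sketched with a hand-written class standing in for a generated stub. The Order class, its date and quantity attributes, and the method names are all hypothetical; a real binding compiler would derive them from the DTD or schema.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class BindingSketch {
    // Hand-written stand-in for a stub that a binding compiler would generate.
    public static final class Order {
        private final String date;
        private final int quantity;

        private Order(String date, int quantity) {
            this.date = date;
            this.quantity = quantity;
        }

        // Unmarshalling: the only place where tag and attribute names appear as strings.
        public static Order fromElement(Element e) {
            if (!"order".equals(e.getTagName())) {
                throw new IllegalArgumentException("expected <order>, got <" + e.getTagName() + ">");
            }
            String date = e.getAttribute("date");
            if (date.isEmpty()) {
                throw new IllegalArgumentException("order is missing its date attribute");
            }
            // The quantity is converted once here (a runtime check); callers then see an int.
            return new Order(date, Integer.parseInt(e.getAttribute("quantity")));
        }

        public String getDate() { return date; }
        public int getQuantity() { return quantity; }
    }

    // Generic parsing step shared by both styles.
    public static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Element root = parseRoot("<order date=\"2001-04-13\" quantity=\"3\"/>");
        // Generic API: a misspelled attribute name compiles and silently yields "".
        System.out.println(root.getAttribute("date"));
        // Domain specific API: a misspelled method name is rejected by the compiler.
        Order order = Order.fromElement(root);
        System.out.println(order.getDate() + " x " + order.getQuantity());
    }
}
```

The runtime checks do not disappear; they are concentrated in fromElement, while everything downstream of it is checked statically by the Java compiler.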
22 4/13/01 XML programming
Resources
–Java and XML, Brett McLaughlin, Mike Loukides
XML parsers (DOM/SAX)
–Apache
–Oracle
–Sun Project X
–Microsoft
XML-binding frameworks
–Oracle ClassGenerator
–Castor