XML: what is it?
- eXtensible Markup Language
- Standard for publishing and interchange on the web and "over the wire"
- Simpler version of SGML, adapted to the internet
What is it good for?
- Data exchange between businesses/enterprises: a common denominator regardless of source data format
- E-business
- Publishing of data
- Storage format for irregular data
- …
An Example
- Book: "What is the name of this book?", Raymond M. Smullyan, Penguin
- Book: "Principia Mathematica", B. Russell, A. Whitehead
XML lingo:
- tags (like brackets)
- elements (like complex objects): can be nested, may be empty
- rooted graph
- well-formed = brackets (tags) properly nested
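
The slide's markup is not reproduced above, so here is a minimal sketch of what "well-formed" means in practice. The tag names (bib, book, tit, auth, pub) are assumptions borrowed from the tree-model slide, and the helper is_well_formed is hypothetical. Python's standard parser rejects improperly nested tags with a ParseError.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the example; the tag names are not confirmed by the slide.
WELL_FORMED = """
<bib>
  <book>
    <tit>What is the name of this book?</tit>
    <auth>Raymond M. Smullyan</auth>
    <pub>Penguin</pub>
  </book>
  <book>
    <tit>Principia Mathematica</tit>
    <auth>B. Russell</auth>
    <auth>A. Whitehead</auth>
  </book>
</bib>
""".strip()

# Here </book> appears before </tit>, so the "brackets" are not properly nested.
NOT_WELL_FORMED = "<bib><book><tit>Principia Mathematica</book></tit></bib>"


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(WELL_FORMED)          # the single root element: rooted graph
titles = [t.text for t in root.iter("tit")]  # elements nested two levels down
```

Well-formedness is purely syntactic: the check says nothing about whether a book is allowed to have two authors or no publisher.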
Example revisited
- Book: "What is the name of this book?", Raymond M. Smullyan, Penguin
- Book: "Principia Mathematica", B. Russell, A. Whitehead
Note:
- Different kinds of attributes: ID, IDREF(S), others. All are just attributes nevertheless.
- Attributes vs. elements
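
A minimal sketch of how ID and IDREF(S) attributes behave, under stated assumptions. The attribute names id and cites, the ID values b1 and b2, and the citation itself are made up for illustration. Without a DTD the parser sees them as plain string attributes, so the lookup table has to be built by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical attributes: "id" plays the role of an ID, "cites" of an IDREFS.
DOC = """
<bib>
  <book id="b1" cites="b2">
    <tit>What is the name of this book?</tit>
    <auth>Raymond M. Smullyan</auth>
    <pub>Penguin</pub>
  </book>
  <book id="b2">
    <tit>Principia Mathematica</tit>
    <auth>B. Russell</auth>
    <auth>A. Whitehead</auth>
  </book>
</bib>
""".strip()

root = ET.fromstring(DOC)

# Index every element that carries an "id" attribute.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}


def resolve(element, attr):
    """Follow an IDREFS-style attribute (space-separated ids) to its targets."""
    return [by_id[ref] for ref in element.get(attr, "").split()]


cited_titles = [b.findtext("tit") for b in resolve(by_id["b1"], "cites")]
```

The same information could have been written as nested elements instead. That choice is the "attributes vs. elements" question.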
Some important properties
- Widely supported
- Self-descriptive; parsers available
- Flexible model (can control how much structure we put in the data)
- Intra-doc and inter-doc references
- Presentation/publishing separated from representation (XSL)
- Human readable. No need for proprietary formats.
- Many, many tools
Some predecessors
- SGML (document authoring standard): too heavyweight
- HTML: almost no capabilities for representation (as opposed to presentation)
- EDI (electronic data interchange): used by companies and banks (pre-XML "standard"); not human readable
Some database issues for XML
- How to model XML? (Trees with some funny cross-links.)
- How to query? (XPath, XQuery)
- How to store? (Relational, OO/OR, native)
- How to process XML data efficiently? (Devise new algorithms? Tweak old ones on, e.g., an RDBMS? RDBMS extenders?)
- XML authoring is easy: just publish and don't worry about schema ("schema later!"). This isn't always good. (Why?)
- What is an appropriate notion of schema?
Document Type Definitions (DTDs)
- A first-cut approximation to a schema for XML.
- An extended context-free grammar.
- Enforces only structure; leaves typing mostly out.
- Has many limitations, e.g.: ...
- When is an XML document valid w.r.t. a DTD?
- Exercise: devise an efficient algorithm for validity checking.
- Constructs: sequence, choice, quantifier; optional & mandatory obligations.
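
One way to approach the validity exercise, as a sketch rather than a reference solution: treat each element's content model as a regular expression over the sequence of its children's tag names. The DTD below is hypothetical (roughly bib → book*, book → tit, auth+, pub?, year?), and CONTENT_MODEL and is_valid are names introduced here for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD, one regex per element. Each child contributes "name " (name plus
# a space), so sequence is concatenation, choice is |, and the quantifiers are * + ?.
CONTENT_MODEL = {
    "bib": r"(book )*",                       # zero or more books
    "book": r"tit (auth )+(pub )?(year )?",   # mandatory tit and auth, optional pub/year
    "tit": r"", "auth": r"", "pub": r"", "year": r"",  # text only, no child elements
}


def is_valid(elem) -> bool:
    """Check elem and all its descendants against CONTENT_MODEL."""
    model = CONTENT_MODEL.get(elem.tag)
    if model is None:          # element not declared in the (assumed) DTD
        return False
    children = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)


VALID = ET.fromstring(
    "<bib><book><tit>What is the name of this book?</tit>"
    "<auth>Raymond M. Smullyan</auth><pub>Penguin</pub></book>"
    "<book><tit>Principia Mathematica</tit>"
    "<auth>B. Russell</auth><auth>A. Whitehead</auth></book></bib>"
)
NO_AUTHOR = ET.fromstring("<bib><book><tit>Principia Mathematica</tit></book></bib>")
WRONG_ORDER = ET.fromstring(
    "<bib><book><auth>B. Russell</auth><tit>Principia Mathematica</tit></book></bib>"
)
```

Python's re module is a backtracking matcher, so this sketch does not by itself answer the "efficient" part. Compiling each content model to a deterministic automaton and running it once over the child sequence gives time linear in the document size. The check also covers structure only, not the text inside tit or year.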
Why aren't DTDs good enough?
Useful for documents, but not enough for data:
- No support for structure sharing and reuse
- Object-oriented-like features not supported (recall: ID/IDREF(S) are purely syntactic)
- No support for data types. Can't validate your data!
- No support for keys (exception: ID, a single-attribute key) or foreign keys
- IDREFs are not typed (what if an IDREF from a book "points" to a nuclear reactor?)
- DTD does not conform to XML syntax!
XML Schema Highlights
- XML format?
- Support for basic data types? (integer, float, string, date, bool, etc.)
- Support for value-based constraints?
- Extensibility (e.g., users can define complex types)
- OO-like features? E.g., inheritance (extension or restriction)
- Keys & foreign keys
- Are references typed?
Example XML Schema …
Example XML
(Subset of) useful XML standards
- XPath/XPointer/XLink*: standards for linking to documents and to elements within docs
- XSL/XSLT*: presentation and transformation
- RDF: Resource Description Framework (meta-info such as ratings, categorizations, etc.); plays a pivotal role in the semantic web
- Namespaces: for resolving name clashes
- DOM: Document Object Model, for manipulating XML documents
- SAX: Simple API for XML parsing
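
To make the DOM/SAX distinction concrete, here is a small sketch using Python's standard xml.dom.minidom and xml.sax modules. The document string and the AuthorCollector class are illustrative assumptions. DOM builds the whole tree in memory and lets us navigate it. SAX streams through the document and calls back on each event.

```python
import xml.dom.minidom
import xml.sax

# Assumed sample document (tag names are illustrative).
DOC = (
    "<bib><book><tit>Principia Mathematica</tit>"
    "<auth>B. Russell</auth><auth>A. Whitehead</auth></book></bib>"
)

# DOM: parse everything, then query the in-memory tree.
dom = xml.dom.minidom.parseString(DOC)
dom_authors = [node.firstChild.data for node in dom.getElementsByTagName("auth")]


# SAX: no tree is built; we keep only the state we need between events.
class AuthorCollector(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.authors = []
        self._in_auth = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "auth":
            self._in_auth = True
            self._buf = []

    def characters(self, content):
        if self._in_auth:          # text may arrive in several chunks
            self._buf.append(content)

    def endElement(self, name):
        if name == "auth":
            self.authors.append("".join(self._buf))
            self._in_auth = False


handler = AuthorCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_authors = handler.authors
```

Both routes produce the same author list. The trade-off is memory and random access (DOM) versus a single pass with constant state (SAX).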
Tree model for XML Data
[Figure: tree with node labels bib, book, rev, lang, tit, auth, pub, year, date; leaf values "What is the name of this book?", "Raymond M. Smullyan", "Penguin", "Principia Mathematica", "B. Russell", "A. Whitehead", "1950", "May 15", "english"]
1. What is the semantics?
2. (When) is order important?
3. What kinds of queries would you like to pose?
4. How do you publish?
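
A sketch of the tree view, assuming a simplified version of the figure (only bib, book, tit, auth and pub, since the exact placement of rev, lang, year and date is not recoverable here). It lists every node by its root-to-node path, then poses two simple path queries using the XPath subset that ElementTree supports. The helper paths is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed, simplified document corresponding to part of the tree in the figure.
DOC = """<bib>
  <book><tit>What is the name of this book?</tit><auth>Raymond M. Smullyan</auth><pub>Penguin</pub></book>
  <book><tit>Principia Mathematica</tit><auth>B. Russell</auth><auth>A. Whitehead</auth></book>
</bib>"""

root = ET.fromstring(DOC)


def paths(elem, prefix=""):
    """Yield (path, text) for elem and its descendants in document order."""
    path = f"{prefix}/{elem.tag}"
    yield path, (elem.text or "").strip()
    for child in elem:
        yield from paths(child, path)


all_paths = list(paths(root))

# Query 1: all authors, returned in document order.
authors_in_order = [a.text for a in root.findall("./book/auth")]

# Query 2: titles of books having an auth child with the given text.
books_by_whitehead = [
    b.findtext("tit") for b in root.findall("./book[auth='A. Whitehead']")
]
```

The author list comes back in document order, which matters for co-author lists and is arguably irrelevant for the order of the books themselves.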
Representing relations
- Relation with columns Emp and Phone; rows for John, Mary, Mike
- As a tree: employees → tuple → (e, p), one tuple node per row
- Is the order of tuples or of attributes important?
- (How) can the various relational integrity constraints (ICs) be captured?
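
A minimal sketch of the encoding on this slide, using the employees/tuple/e/p tags from the figure. The phone numbers are placeholders invented for illustration, as are the helper names to_xml, to_relation and key_holds. It shows that two documents differing only in tuple order denote the same relation, and it checks one integrity constraint (e as a key) by hand.

```python
import xml.etree.ElementTree as ET

# Placeholder data: the phone values are made up, not taken from the slide.
rows = [("John", "555-0101"), ("Mary", "555-0102"), ("Mike", "555-0103")]


def to_xml(rows):
    """Encode a relation as employees/tuple/(e, p)."""
    root = ET.Element("employees")
    for emp, phone in rows:
        t = ET.SubElement(root, "tuple")
        ET.SubElement(t, "e").text = emp
        ET.SubElement(t, "p").text = phone
    return root


def to_relation(root):
    """Decode back to a set of tuples, discarding document order."""
    return {(t.findtext("e"), t.findtext("p")) for t in root.findall("tuple")}


def key_holds(root, key="e"):
    """Check that no two tuples share the same value for the key child."""
    values = [t.findtext(key) for t in root.findall("tuple")]
    return len(values) == len(set(values))


doc_a = to_xml(rows)
doc_b = to_xml(list(reversed(rows)))
same_relation = to_relation(doc_a) == to_relation(doc_b)
same_document = ET.tostring(doc_a, encoding="unicode") == ET.tostring(doc_b, encoding="unicode")
```

Here same_relation is true while same_document is false: the XML encoding carries an order that the relational model ignores. The key check has to be written by the application, since the document itself does not express the constraint.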
Relations vs. XML
- XML need not be flat.
- Schema/data distinction is blurred: the schema may be as large as the data and indeed comes with the data.
- Typing is not strict.
- Missing and repeating elements, e.g., multiple authors, missing pub/year.
- These distinctions raise challenges when we try to store XML in relations and query it. (Forward pointer.)
XML QLs next class.
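
To illustrate the "missing and repeating elements" point, here is a sketch of one naive way to flatten the book data into (tit, auth, pub) rows. It is only one of several possible mappings, the tag names are assumed, and flatten is a hypothetical helper. Repeating authors multiply rows and a missing publisher becomes a null.

```python
import xml.etree.ElementTree as ET

# Assumed sample: the first book has a publisher, the second has two authors and none.
DOC = """<bib>
  <book><tit>What is the name of this book?</tit><auth>Raymond M. Smullyan</auth><pub>Penguin</pub></book>
  <book><tit>Principia Mathematica</tit><auth>B. Russell</auth><auth>A. Whitehead</auth></book>
</bib>"""


def flatten(root):
    """Produce one (tit, auth, pub) row per author; None stands in for a missing value."""
    rows = []
    for book in root.findall("book"):
        tit = book.findtext("tit")
        pub = book.findtext("pub")                      # None when <pub> is absent
        authors = [a.text for a in book.findall("auth")] or [None]
        for auth in authors:
            rows.append((tit, auth, pub))
    return rows


flat_rows = flatten(ET.fromstring(DOC))
```

The title is repeated once per author and the author order survives only as row order, which a relation does not guarantee.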