about XML/Xquery/RDF 4/1
TEXT Structured (relational) Data XML Less Structure More Structure
HTML vs. XML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999 Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … “Self-describing” -Schema info is part of the data -Good for data exchange (albeit baroque for storage)
Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999 Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … HTML describes presentation XML describes content
Why are Database folks so excited about XML? XML is just a syntax for (self-describing) data This is still exciting because –No standard syntax for relational data –With XML, we can Translate any legacy data to XML Exchange data in XML format –Ship over the web, input to any application
XML machine accessible meaning This is what a web-page in natural language looks like for a machine Jim Hendler
XML machine accessible meaning CV name education work private XML allows “meaningful tags” to be added to parts of the text Jim Hendler
XML machine accessible meaning CV name education work private But to your machine, the tags look like this…. Jim Hendler
XML machine accessible meaning Schemas help…. …by relating common terms between documents Jim Hendler
But other people use other schemas CV name education work private > Someone else has one like this…. Jim Hendler
But other people use other schemas …which don’t fit in Moral: There is still need for ontology mapping.. Jim Hendler
The X-standards… XML: an on-the-wire representation for data –XQuery: a query language for XML –XML Schema: a schema description language for XML data RDF: a language for metadata description WSDL/SOAP/UDDI: languages for describing services
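XML Terminology tags: book, title, author, … start tag: <book>, end tag: </book> elements: <book>…</book>, <author>…</author> elements are nested empty element: a start tag immediately followed by its end tag, abbrv. as a single self-closing tag an XML document: single root element well-formed XML document: one whose tags match and nest properly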
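A minimal sketch of what "well formed" means in practice, assuming Python's standard xml.etree.ElementTree parser as the checker. The helper name is_well_formed and the sample strings are illustrative and not taken from the slides.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML: one root element, matching nested tags."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Illustrative inputs (assumed, not from the slides)
good = "<book><title>Foundations of Databases</title><author>Abiteboul</author></book>"
bad_nesting = "<book><title>Foundations of Databases</book></title>"  # tags overlap
two_roots = "<book/><book/>"                                          # no single root
empty_abbrev = "<book><price/></book>"                                # abbreviated empty element

for sample in (good, bad_nesting, two_roots, empty_abbrev):
    print(is_well_formed(sample), sample)
```

Note that this only checks well-formedness; it says nothing about whether the document conforms to a DTD or schema.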
More XML: Attributes Foundations of Databases Abiteboul … 1995 Attributes are single-valued --No guidance on when to use them
More XML: Oids and References Jane Mary John oids and references in XML are just syntax Object identifiers
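To illustrate why oids and references are "just syntax", here is a small sketch in Python. The names Jane, Mary and John come from the slide; the id values, the attribute names (id, mother, idref) and the element layout are assumptions, since the markup was lost from the slide text. The parser treats references as plain strings, so the application has to build its own index and follow them.

```python
import xml.etree.ElementTree as ET

# Assumed markup: ids and reference attributes are hypothetical
doc = ET.fromstring("""
<people>
  <person id="o555"><name>Jane</name></person>
  <person id="o456"><name>Mary</name><children idref="o123 o555"/></person>
  <person id="o123" mother="o456"><name>John</name></person>
</people>
""")

# The parser does not dereference anything: build the oid index by hand
by_id = {p.get("id"): p for p in doc.iter("person")}

# Follow a single-valued reference (John's mother)
john = by_id["o123"]
mother_name = by_id[john.get("mother")].findtext("name")

# Follow a multi-valued reference (Mary's children)
mary = by_id["o456"]
child_names = [by_id[ref].findtext("name")
               for ref in mary.find("children").get("idref").split()]

print(mother_name, child_names)
```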
XML vs. Relational Data XML is meant as a language that supports both Text and Structured Data –Conflicting demands... XML supports semi-structured data –In essence, the schema can be union of multiple schemas Easy to represent books with or without prices, books with any number of authors etc. XML supports free mixing of text and data –using the #PCDATA type XML is ordered (while relational data is unordered) TEXT Structured (relational) Data XML Less Structure More Structure
DTDs <!DOCTYPE paper [ ]> … Notice that a DTD is not in XML syntax… Semi-structured
XML Schemas More recent proposal (with XML syntax) unifies previous schema proposals generalizes DTDs uses XML syntax two documents: structure and datatypes – –
XML Schema
RDF: Meta-data Standard for Web birds, butterflies, snakes John Smith Good’ol semantic networks..?
Querying XML Requirements: –Need to handle lack of schema. We may not know much about the data, so we need to navigate the XML. –Need to support both “information retrieval” and “SQL-style” queries. Ordered vs. un-ordered XML –“Human readable” like SQL? Candidates –Many… based on conflicting requirements XSL: Makes IR folks happy XML-QL: Makes DB folks happy XQuery: W3C’s attempt to make everybody (un)happy
XQuery Resources XQuery 1.0: An XML Query Language –W3C Working Draft 20 December 2001 XML Query Use Cases –W3C Working Draft 20 December 2001 Microsoft.Net XQuery Language Demo –…hive.com/xquery/index.html –Supports querying on the documents described in the W3C Use Cases XQuery Tutorial by Fankhauser & Wadler –…user/wadler/papers/xquery-tutorial/xquery-tutorial.pdf
FLoWeR Expressions XQuery queries are made up of FLWR expressions that work on “paths” For binds a variable to nodes, one node at a time Let binds a variable to a whole sequence of nodes (typically used to compute aggregates) Where applies a formula to find matching elements Return constructs the output elements Path expressions are of the form: element//element/element[@attrib=value]
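The four FLWR clauses can be mimicked in a few lines of Python with the standard xml.etree.ElementTree module. This is only a sketch of the semantics, not XQuery itself. The sample document is modeled on the bibliography data used later in the slides; the year attribute values and the first two prices are assumptions, because the markup did not survive in the slide text.

```python
import xml.etree.ElementTree as ET

# Assumed sample data modeled on the slides' bibliography
BIB = """<bib>
  <book year="1994"><title>TCP/IP Illustrated</title>
    <publisher>Addison-Wesley</publisher><price>65.95</price></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title>
    <publisher>Addison-Wesley</publisher><price>65.95</price></book>
  <book year="2000"><title>Data on the Web</title>
    <publisher>Morgan Kaufmann Publishers</publisher><price>39.95</price></book>
</bib>"""
root = ET.fromstring(BIB)

# LET: bind a variable to a whole sequence, here to compute an aggregate
prices = [float(b.findtext("price")) for b in root.findall("book")]
avg_price = sum(prices) / len(prices)

result = ET.Element("bib")
for b in root.findall("book"):                      # FOR: one binding per node
    if (b.findtext("publisher") == "Addison-Wesley"
            and int(b.get("year")) > 1991):         # WHERE: filter the bindings
        out = ET.SubElement(result, "book", year=b.get("year"))  # RETURN: construct output
        ET.SubElement(out, "title").text = b.findtext("title")

titles = [t.text for t in result.iter("title")]

# A path expression with an attribute predicate, element/element[@attrib=value]
titles_1994 = [t.text for t in root.findall("./book[@year='1994']/title")]

print(titles, round(avg_price, 2), titles_1994)
```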
Comparison to SQL Look at the use case description on Xquery manual Supports all (?) SQL style queries (with different syntax of course) [default queries in the demo] Has support for –“construction”—outputting the answers in arbitrary XML formats (use case “XMP” ) –“path expressions” --- navigating the XML tree (use case “seq”) –Simple text queries [use case “text”] –Allows queries on “Tag” elements Removes the “data/meta-data” barrier in queries –For each book that has at least one author, list the title and first two authors, and an empty "et-al" element if the book has additional authors. [XMP use case 6]
DTD for
Example Query { for $b in /bib/book where $b/publisher = "Addison-Wesley" and > 1991 return { $b/title } } “For all books after 1991, return with Year changed from a tag to an attribute” TCP/IP Illustrated Advanced Programming in the Unix environment Result Query
Example Query (2) Return the books that cost more at amazon than fatbrain Let $amazon := document( Let $fatbrain := document( For $am in $amazon/books/book, $fat in $fatbrain/books/book Where $am/isbn = $fat/isbn and $am/price > $fat/price Return { $am/title, $am/price, $fat/price } Join
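The join above pairs every amazon book with every fatbrain book and keeps the pairs that agree on isbn. A hedged Python sketch of the same nested-loop evaluation follows. Both documents, their isbn values and their prices are made up for illustration, since the slide's document URLs are truncated.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the two documents named in the query
amazon = ET.fromstring("""<books>
  <book><isbn>111</isbn><title>Data on the Web</title><price>39.95</price></book>
  <book><isbn>222</isbn><title>TCP/IP Illustrated</title><price>65.95</price></book>
</books>""")
fatbrain = ET.fromstring("""<books>
  <book><isbn>111</isbn><title>Data on the Web</title><price>34.95</price></book>
  <book><isbn>222</isbn><title>TCP/IP Illustrated</title><price>69.95</price></book>
</books>""")

more_expensive_at_amazon = []
for am in amazon.findall("book"):            # For $am in $amazon/books/book,
    for fat in fatbrain.findall("book"):     #     $fat in $fatbrain/books/book
        am_price = float(am.findtext("price"))
        fat_price = float(fat.findtext("price"))
        # Where $am/isbn = $fat/isbn and $am/price > $fat/price
        if am.findtext("isbn") == fat.findtext("isbn") and am_price > fat_price:
            # Return title and both prices
            more_expensive_at_amazon.append((am.findtext("title"), am_price, fat_price))

print(more_expensive_at_amazon)
```

A real engine would likely index one side on isbn instead of scanning all pairs; the nested loop is kept here because it mirrors the query text directly.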
XML frenzy in the DB Community Now that XML is there, what can we do with it? –Convert all databases from Relational to XML? Or provide XML views of relational databases? –Develop theory of native XML databases? Or assume that XML data will be stored in relational databases.. –Issues: What sort of storage mechanisms? What sort of indices?
XML middleware for Databases XML adapters (middle-ware) received significant attention in DB community –SilkRoute (AT&T) –Xperanto (IBM) Issues: – Need to convert relational data into XML Tagging (easy) –Need to convert Xquery queries into equivalent SQL queries Trickier as Xquery supports schema querying
Don’t look beyond this..
Xquery Tutorial Craig Knoblock University of Southern California
References XQuery 1.0: An XML Query Language –W3C Working Draft 20 December 2001 XML Query Use Cases –W3C Working Draft 20 December 2001 Microsoft.Net Xquery Language Demo – –Supports querying on the documents described in the W3C Use Cases Xquery Tutorial by Fankhauser & Wadler – y-tutorial/ xquery-tutorial.pdf
DTD for
Data for TCP/IP Illustrated Stevens W. Addison-Wesley Advanced Programming in the Unix environment Stevens W. Addison-Wesley 65.95
Data for (cont.) Data on the Web Abiteboul Serge Buneman Peter Suciu Dan Morgan Kaufmann Publishers The Economics of Technology and Content for Digital TV Gerbarg Darcy CITI Kluwer Academic Publishers
Document References Document can either be referenced explicitly or in the default namespace In the Microsoft Demo –/Bib = document(" We will use /bib throughout, but you must use the expansion to run the demo In Theseus the document for xquery is passed as input
Projection Return the names of all authors of books /bib/book/author = Stevens W. Abiteboul Serge Buneman Peter Suciu Dan
Project (cont.) The same query can also be written as a for loop /bib/book/author = for $bk in /bib/book return for $aut in $bk/author return $aut = Stevens W. Abiteboul Serge Buneman Peter Suciu Dan
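The equivalence between the path expression and the nested for loops can be checked with a short Python sketch. The element names (author, last, first) and the document layout are assumptions modeled on the data slides, whose markup was lost.

```python
import xml.etree.ElementTree as ET

# Assumed sample data modeled on the slides' bibliography
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author></book>
</bib>"""
root = ET.fromstring(BIB)

# Path-expression form: /bib/book/author (root is the bib element)
path_result = [a.findtext("last") for a in root.findall("./book/author")]

# Loop form: for $bk in /bib/book return for $aut in $bk/author return $aut
loop_result = []
for bk in root.findall("book"):
    for aut in bk.findall("author"):
        loop_result.append(aut.findtext("last"))

print(path_result == loop_result, path_result)
```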
Selection Return the titles of all books published before 1997 < "1997"]/title = TCP/IP Illustrated Advanced Programming in the Unix environment
Selection (cont.) Return the titles of all books published before 1997 < "1997"]/title = for $bk in /bib/book where < "1997" return $bk/title = TCP/IP Illustrated Advanced Programming in the Unix environment
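A selection is a filter over the bindings of the for clause. The sketch below assumes the year is stored as an attribute on each book and uses assumed year values; neither detail survives in the slide text. It compares years as integers, which avoids relying on string ordering.

```python
import xml.etree.ElementTree as ET

# Assumed sample data: year as an attribute, values are illustrative
BIB = """<bib>
  <book year="1994"><title>TCP/IP Illustrated</title></book>
  <book year="1992"><title>Advanced Programming in the Unix environment</title></book>
  <book year="2000"><title>Data on the Web</title></book>
</bib>"""
root = ET.fromstring(BIB)

# for $bk in /bib/book where (year of $bk) < 1997 return $bk/title
before_1997 = [bk.findtext("title")
               for bk in root.findall("book")
               if int(bk.get("year")) < 1997]

print(before_1997)
```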
Selection (cont.) Return book with the title “Data on the Web” /bib/book[title = "Data on the Web"] = Data on the Web Abiteboul Serge Buneman Peter Suciu Dan Morgan Kaufmann Publishers 39.95
Selection (cont.) Return the price of the book “Data on the Web” /bib/book[title = "Data on the Web"]/price = How would you return the book with a price of $39.95?
Selection (cont.) Return the book with a price of $39.95 for $bk in /bib/book where $bk/price = " 39.95" return $bk = Data on the Web Abiteboul Serge Buneman Peter Suciu Dan Morgan Kaufmann Publishers 39.95
Construction Return year and title of all books published before 1997 for $bk in /bib/book where < "1997" return { $bk/title } = TCP/IP Illustrated Advanced Programming in the Unix environment
Grouping Return titles for each author for $author in distinct(/bib/book/author/last) return { /bib/book[author/last = $author]/title } = TCP/IP Illustrated Advanced Programming in the Unix environment Data on the Web …
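Grouping in this style is a loop over the distinct author names with a nested lookup of matching books for each name. A Python sketch of that evaluation, with element names and data assumed from the data slides:

```python
import xml.etree.ElementTree as ET

# Assumed sample data modeled on the slides' bibliography
BIB = """<bib>
  <book><title>TCP/IP Illustrated</title><author><last>Stevens</last></author></book>
  <book><title>Advanced Programming in the Unix environment</title>
    <author><last>Stevens</last></author></book>
  <book><title>Data on the Web</title>
    <author><last>Abiteboul</last></author>
    <author><last>Buneman</last></author>
    <author><last>Suciu</last></author></book>
</bib>"""
root = ET.fromstring(BIB)

# distinct(/bib/book/author/last), keeping first-seen order
last_names = [a.findtext("last") for a in root.findall("./book/author")]
distinct_authors = list(dict.fromkeys(last_names))

# For each author: /bib/book[author/last = $author]/title
grouped = {}
for author in distinct_authors:
    grouped[author] = [
        bk.findtext("title")
        for bk in root.findall("book")
        if author in [a.findtext("last") for a in bk.findall("author")]
    ]

for author, titles in grouped.items():
    print(author, titles)
```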
Join Return the books that cost more at amazon than fatbrain Let $amazon := document( Let $fatbrain := document( For $am in $amazon/books/book, $fat in $fatbrain/books/book Where $am/isbn = $fat/isbn and $am/price > $fat/price Return { $am/title, $am/price, $fat/price }
Example Query 1 { for $b in /bib/book where $b/publisher = "Addison-Wesley" and > 1991 return { $b/title } } What does this do?
Result Query 1 TCP/IP Illustrated Advanced Programming in the Unix environment
Example Query 2 { for $b in document(" $t in $b/title, $a in $b/author return { $t } { $a } }
Result Query 2 TCP/IP Illustrated Stevens Advanced Programming in the Unix environment Stevens Data on the Web Abiteboul Data on the Web Buneman Data on the Web Suciu
Example Query 3 { for $b in document(" $a in document(" where $b/title = $a/title return { $b/title } { $a/price/text() } { $b/price/text() } }
Result Query 3 TCP/IP Illustrated Advanced Programming in the Unix environment Data on the Web
Example Query 4 { for $b in document(" where $b/publisher = "Addison-Wesley" and > "1991" return { } { $b/title } sortby (title) }
Example Result 4 Advanced Programming in the Unix environment TCP/IP Illustrated
Impact of XML on Integration If and when all sources accept Xqueries and exchange data in XML format, then –Mediator can accept user queries in Xquery –Access sources using Xquery –Get data back in XML format –Merge results and send to user in XML format How about now? –Sources can use XML adapters (middle-ware)
Is XML standardization a magical solution for Integration? If all WEB sources standardize into XML format –Source access (wrapper generation issues) becomes easier to manage –BUT all other problems remain Still need to relate source (XML)schemas to mediator (XML)schema Still need to reason about source overlap, source access limitations etc. Still need to manage execution in the presence of source/network uncertainties
“Semantic Web” The LAV/GAV approaches assume that some human expert will do the actual schema mapping The “semantic-web” initiative attempts to automate schema mapping –Idea: Allow pages to write logical axioms relating their vocabulary (tags) to other external tags –Support automatic inference of relations between source and mediator schema using these rules DAML+OIL
Data Model
Which will have XML Syntax
Document Type Definition: DTD part of the original XML specification an XML document may have a DTD terminology for XML: –well-formed: if tags are correctly closed –valid: if it has a DTD and conforms to it validation is useful in data exchange
Notice that a DTD is not in XML syntax…
External DTD Internal Two ways to specify a DTD Hello, world! <!DOCTYPE greeting [ ]> Hello, world!
DTDs as Grammars <!DOCTYPE paper [ ]> <!DOCTYPE paper [ ]> …
Shortcomings of DTDs Useful for documents, but not so good for data: No support for structural re-use –Object-oriented-like structures aren’t supported No support for data types –Can’t do data validation Can have a single key item (ID), but: –No support for multi-attribute keys –No support for foreign keys (references to other keys) –No constraints on IDREFs (reference only a Section)
XML Schema In XML format Includes primitive data types (integers, strings, dates, etc.) Supports value-based constraints (integers > 100) User-definable structured types Inheritance (extension or restriction) Foreign keys Element-type reference constraints
XML Schemas DTD: Pre-specified tags How many different RDBMS Schemas are needed here?
Sample XML Schema …
@ssn Subtyping in XML Schema
DTDs as Schemas Not so well suited: impose unwanted constraints on order references cannot be constrained can be too vague: Union of schemas..?
XML Schemas recent proposal unifies previous schema proposals generalizes DTDs uses XML syntax two documents: structure and datatypes – –
Although DB folks have several beefs Give me the names of people who are Listed either as editor or author of a book
Differences between XML and SSD Pure SSD uses edge-labeled graphs as data model XML is ordered, SSD is not XML can mix text and elements: Making Java easier to type and easier to type Phil Wadler XML has lots of other stuff: entities, processing instructions, comments
XML vs. standard semi-structured data models Alan 42 { person: &o123 { name: “Alan”, age: 42, } } person name age person name age father … { person: { father: &o123 …} } similar on trees, different on graphs Node labeling Edge labeling
XML seen from (R)DBMS world RDBMS may want to “publish” data in XML [provide an XML view of their data] –“Tagging” the output –Support XML-based querying (which are then converted to SQL querying) Single XML-QL query may correspond to a set of SQL queries –E.g. Schema queries SilkRoute, Xperanto systems –Support XML-based updating Tukwila RDBMS can be used to provide an efficient storage for XML files –Efficient indexing/retrieval of path expressions
Other Important XML Standards XSL/XSLT*: –presentation and transformation standards RDF: –resource description framework (meta-info such as ratings, categorizations, etc.) Xpath/Xpointer/Xlink*: –standard for linking to documents and elements within Namespaces: –for resolving name clashes DOM: –Document Object Model for manipulating XML documents SAX: –Simple API for XML parsing
RDF (2/99) purpose: metadata for Web –help search engines syntax in XML semantics: edge-labeled graphs
RDF Metadata standard birds, butterflies, snakes John Smith
More RDF Examples
RDF Terminology subject object predicate statement
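An RDF statement is a (subject, predicate, object) triple, and a set of statements forms an edge-labeled graph. The sketch below models that in Python. The literal values come from the earlier RDF slide; the resource URI and the property names "description" and "author" are hypothetical, since the slide's markup did not survive.

```python
from typing import List, NamedTuple


class Statement(NamedTuple):
    """One RDF statement: an edge labeled `predicate` from subject to object."""
    subject: str
    predicate: str
    object: str


RESOURCE = "http://example.org/wildlife-page"  # hypothetical resource URI

statements = [
    Statement(RESOURCE, "description", "birds, butterflies, snakes"),
    Statement(RESOURCE, "author", "John Smith"),
]


def objects_of(subject: str, predicate: str) -> List[str]:
    """Follow all edges labeled `predicate` that leave `subject`."""
    return [s.object for s in statements
            if s.subject == subject and s.predicate == predicate]


print(objects_of(RESOURCE, "author"))
```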
More RDF: Containers bag, sequence, alternative s1 s2
RDF Containers (cont’d) Bag s1 s2 a rdf:type rdf_1 rdf_2
More RDF: Higher Order Statements “the author of says: ‘the topic of is environment’ “ environment topic says author RDF uses reification
XML Parsers traditional: return data structure (DOM?) event based: SAX (Simple API for XML) – –write handler for start tag and for end tag
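The two parser styles can be contrasted with Python's standard library: xml.dom.minidom builds a whole tree in memory, while xml.sax calls back into a handler for each start tag and end tag. The handler class name and the sample document below are illustrative.

```python
import xml.dom.minidom
import xml.sax

DOC = b"<book><title>Data on the Web</title></book>"


class TagLogger(xml.sax.ContentHandler):
    """Event-based style: record each start/end tag with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, self.depth))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("end", name, self.depth))


# SAX: no tree is built; the handler sees the document as a stream of events
handler = TagLogger()
xml.sax.parseString(DOC, handler)

# DOM: the parser returns a data structure that can be navigated afterwards
dom = xml.dom.minidom.parseString(DOC)
root_tag = dom.documentElement.tagName

print(handler.events, root_tag)
```

The event-based style keeps memory use independent of document size, at the cost of the application having to track its own context (here, just the depth).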
Need for Ontology standardization
XML Data Model does not exist Document Object Model (DOM): – (10/98) –class hierarchy (node, element, attribute,…) –objects have behavior –defines API to inspect/modify the document
Start of 4/9 lecture
Querying XML
XML Data Model (Graph) Issues: distinguish between attributes and sub-elements? Should we conserve order? Think of the labels as names of binary relations.
Need for XML querying human-readable documents to retrieve individual documents, to provide dynamic indexes, to perform context-sensitive searching, and to generate new documents. data-oriented documents to query (virtual) XML representations of databases, to transform data into new XML representations, and to integrate data from multiple heterogeneous data sources. mixed-model documents to perform queries on documents with embedded data, such as catalogs, patient health records, employment records, or business analysis documents.
Querying XML Requirements: –Query a graph, not a relation. –The result should be a graph (representing an XML document), not a relation. –No schema. –We may not know much about the data, so we need to navigate the XML.
W3C requirements The W3C Query Working Group has identified many technical requirements: at least one XML syntax; at least one human-readable syntax; must be declarative; must be protocol independent; must respect XML data model; must be namespace aware; must coordinate with XML Schema; must work even if schemas are unavailable; must support simple and complex datatypes; must support universal and existential quantifiers; must support operations on hierarchy and sequence of document structures; must combine information from multiple documents; must support aggregation; must be able to transform and to create XML structures; must be able to traverse ID references.
Query Languages XML-QL: Invented by DB folks –XML-QL is relational-complete (allows Joins) also supports path expressions Can extract as well as transform data into different formats (like XSL) –XML-QL is not in XML syntax XSL: can also be seen as a query language –Can transform data
XML-QL data model XML-QL works on an abstraction, called an XML graph, of the concrete XML document: comments and processing instructions are ignored; the relative order of elements is ignored; every node has an ID (autogenerated, if necessary); all leaves are character data. XML graphs are obtained from XML documents but are also generated by queries. A graph is mapped back into an XML document by choosing arbitrary orderings of element sequences. This abstraction is very similar to that from tables to relations: disregard the order of tuples and attributes.
Extracting Data by Query Matching data using element patterns. WHERE Addison-Wesley $t $a IN “ CONSTRUCT $a The “where” clause only specifies what must be in the pattern --the pattern can have other stuff besides what is listed in where
Constructing XML Data WHERE Addison-Wesley $t $a IN “ CONSTRUCT $a $t
Grouping with Nested Queries WHERE $t, Addison-Wesley CONTENT_AS $p IN “ CONSTRUCT $t WHERE $a IN $p CONSTRUCT $a ”
Joining Elements by Value (also integration) WHERE $f $l ELEMENT_AS $e IN “ $f $l IN “ y > 1995 CONSTRUCT $e Find all articles whose writers also published a book after Multiple queries That share values
Tag variables (schema queries) WHERE $t 1995 Smith IN " $e IN {author, editor} CONSTRUCT $t Smith $p matches book and article. $e matches author and editor. this saves us from writing four queries. This finds all publications in 1995 where Smith is either author or editor
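Tag variables let one query range over element names as well as element content. The Python sketch below imitates that by iterating over tags as data. The sample document, including the titles and the lastname and year element names, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data
BIB = """<bib>
  <book><title>Book A</title><year>1995</year>
    <author><lastname>Smith</lastname></author></book>
  <article><title>Article B</title><year>1995</year>
    <editor><lastname>Smith</lastname></editor></article>
  <book><title>Book C</title><year>1996</year>
    <author><lastname>Smith</lastname></author></book>
  <article><title>Article D</title><year>1995</year>
    <author><lastname>Jones</lastname></author></article>
</bib>"""
root = ET.fromstring(BIB)

matches = []
for p in root:                                   # $p: any publication tag (book, article)
    if p.findtext("year") != "1995":
        continue
    for e in p:                                  # $e: any child tag ...
        if e.tag in ("author", "editor") and e.findtext("lastname") == "Smith":
            # ... restricted to {author, editor}, as in "$e IN {author, editor}"
            matches.append((p.tag, e.tag, p.findtext("title")))

print(matches)
```

Without tag variables, the same result would need one query per (publication tag, role tag) combination, which is the four queries the slide mentions.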
Path Expressions WHERE $r Ford IN " CONSTRUCT $r WHERE $r IN " CONSTRUCT $r Matches any sequence of nodes all of which are labeled part (can substitute $ for part in the above…)
Due 30th April