1 XML and Databases
2 Outline (ambitious) Background: documents (SGML/HTML) and databases (structured and semistructured data); XML basics and Document Type Descriptors; XML query languages: XML-QL and XSL; XML additions: XLink, XPointer, RDF, SOX, XML-Data; Document Object Model (XML APIs)
3 Some Useful Articles “XML, Java, and the future of the web”; “XML and the Second-Generation Web”; articles/standards for XML, XSL, XML-QL
4 Background What’s the difference between the world of documents and information retrieval and databases and query interfaces?
5 Documents vs Databases
Document world
> plenty of small documents
> usually static
> implicit structure: section, paragraph, toc, ...
> tagging
> human friendly
> content: form/layout, annotation
> paradigms: “Save as”, WYSIWYG
> meta-data: author name, date, subject
Database world
> a few large databases
> usually dynamic
> explicit structure (schema)
> records
> machine friendly
> content: schema, data, methods
> paradigms: Atomicity, Consistency, Isolation, Durability
> meta-data: schema description
6 What to do with them Documents: editing, printing, spell-checking, counting words, retrieving (IR), searching. Databases: updating, cleaning, querying, composing/transforming.
7 HTML Publishing hypertext on the World Wide Web. Designed to describe how a Web browser should arrange text, images and push-buttons on a page. Easy to learn, but does not convey structure. Fixed tag set. [Example page with the text “Welcome to the XML course” and “Introduction”, annotated with the callouts: opening tag, text (PCDATA), closing tag, “bachelor” tag, attribute name, attribute value]
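8 The Structure of XML XML consists of tags and text. Tags come in pairs: an opening tag and a matching closing tag. They must be properly nested: an element opened inside another must be closed before the outer one is closed (good); overlapping tags are bad. (You can’t do that in HTML either.)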
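The nesting rule can be checked mechanically. A minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the date/day/month tags are hypothetical sample markup, not taken from the slides:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample markup (not from the slides).
good = "<date><day>25</day><month>8</month></date>"  # tags properly nested
bad = "<date><day>25</date></day>"                   # tags overlap

print(is_well_formed(good))
print(is_well_formed(bad))
```

The parser rejects the second string because the closing tags do not match the most recently opened element.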
9 XML text XML has only one “basic” type -- text. It is bounded by tags, e.g. The Big Sleep, enclosed in tags, is still text. XML text is called PCDATA (for parsed character data). It uses Unicode, and any character can be written as a numeric character reference, e.g. &#x05DE; for the Hebrew letter Mem. Later we shall see how new types are specified by XML-Data.
10 XML structure Nesting tags can be used to express various structures, e.g. a tuple (record): <person> <name>Malcolm Atchison</name> <tel>(215) …</tel> </person>
11 XML structure (cont.) We can represent a list by using the same tag repeatedly:...
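A small sketch of both structures, built with Python's standard xml.etree.ElementTree. The person, name and tel tags follow the tree shown on the "XML is tree-like" slide; the addresses wrapper, the address tag and the sample lines are hypothetical:

```python
import xml.etree.ElementTree as ET

# A tuple (record): one child element per field.
person = ET.Element("person")
ET.SubElement(person, "name").text = "Malcolm Atchison"
ET.SubElement(person, "tel").text = "(215) ..."  # number is elided in the source

# A list: the same tag repeated under one parent.
addresses = ET.Element("addresses")  # hypothetical wrapper tag
for line in ["line one", "line two", "line three"]:  # hypothetical values
    ET.SubElement(addresses, "address").text = line

record_xml = ET.tostring(person, encoding="unicode")
list_items = [a.text for a in addresses.findall("address")]

print(record_xml)
print(list_items)
```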
12 Terminology The segment of an XML document between an opening tag and its corresponding closing tag is called an element. [Figure: the person record (Malcolm Atchison, (215)) annotated with the callouts “element”, “not an element”, and “element, a sub-element of ...”]
13 XML is tree-like [Figure: tree with root person and children name (Malcolm Atchison) and tel ((215))] Semistructured data models typically put the labels on the edges.
14 Mixed Content An element may contain a mixture of sub-elements and PCDATA. [Example text: British Airways; World’s favorite airline] Data of this form is not typically generated from databases. It is needed for consistency with HTML.
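The tree view can be made concrete by walking a parsed document. A minimal sketch, assuming Python's xml.etree.ElementTree; the markup is a reconstruction from the tree labels (person, name, tel), and the telephone number stays elided as in the source:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<person><name>Malcolm Atchison</name><tel>(215) ...</tel></person>"
)


def outline(node, depth=0):
    """Return one indented line per tag and per non-empty text node."""
    lines = ["  " * depth + node.tag]
    text = (node.text or "").strip()
    if text:
        lines.append("  " * (depth + 1) + repr(text))
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(doc)
print("\n".join(tree_lines))
```

Here the labels sit on the nodes; an edge-labeled model would attach person, name and tel to the edges leading into those nodes instead.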
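A sketch of how mixed content looks to a program, assuming Python's xml.etree.ElementTree. Only the text comes from the slide; the motto and em tags, and the choice of which word is wrapped, are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names; a sub-element sits in the middle of the text.
motto = ET.fromstring("<motto>World's <em>favorite</em> airline</motto>")

em = motto[0]
# ElementTree splits mixed content into the parent's leading text,
# the child's own text, and the child's "tail" (text after its end tag).
parts = [motto.text, em.text, em.tail]
full_text = "".join(motto.itertext())

print(parts)
print(full_text)
```

The text is scattered over three places in the tree, which is one reason data of this form maps awkwardly onto database records.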
15 A Complete XML Document Malcolm Atchison (215)
16 Two ways of representing a DB projects(title, budget, managedBy); employees(name, ssn, age)
17 Project and Employee relations in XML Projects and employees are intermixed. [Example data, in document order: Pattern recognition, Joe, Joe, Sandra, Auto guided vehicle, Sandra, ...]
18 Project and Employee relations in XML (cont’d) Employees follow projects. [Example data, in document order: Pattern recognition, Joe, Auto guided vehicles, Sandra, ...; then Joe, Sandra, ...]
19 Project and Employee relations in XML (cont’d) Or without “separator” tags … [Example data, in document order: Pattern recognition, Joe, Auto guided vehicles, Sandra, ...; then Joe, Sandra, ...]
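The layouts can be generated from the same relational rows. A sketch under stated assumptions: Python's xml.etree.ElementTree; the db root follows the DOCTYPE used later in the deck; the project, employee, projects and employees tags and all budget, ssn and age values are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical rows; only the titles and names appear in the slides.
projects = [
    {"title": "Pattern recognition", "budget": "10000", "managedBy": "Joe"},
    {"title": "Auto guided vehicle", "budget": "70000", "managedBy": "Sandra"},
]
employees = [
    {"name": "Joe", "ssn": "000-00-0001", "age": "30"},
    {"name": "Sandra", "ssn": "000-00-0002", "age": "40"},
]


def row_to_element(tag, row):
    """Turn one relational row into an element with one child per column."""
    elem = ET.Element(tag)
    for column, value in row.items():
        ET.SubElement(elem, column).text = value
    return elem


def flat_layout():
    """Employees follow projects, with no separator tags."""
    db = ET.Element("db")
    for row in projects:
        db.append(row_to_element("project", row))
    for row in employees:
        db.append(row_to_element("employee", row))
    return db


def grouped_layout():
    """Rows grouped under <projects> and <employees> separator tags."""
    db = ET.Element("db")
    project_group = ET.SubElement(db, "projects")
    employee_group = ET.SubElement(db, "employees")
    for row in projects:
        project_group.append(row_to_element("project", row))
    for row in employees:
        employee_group.append(row_to_element("employee", row))
    return db


print(ET.tostring(flat_layout(), encoding="unicode"))
print(ET.tostring(grouped_layout(), encoding="unicode"))
```

The intermixed layout would simply append project and employee elements to db in any interleaved order.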
20 Attributes An (opening) tag may contain attributes. These are typically used to describe the content of an element. [Example text: cheese, fromage, branza, “A food made …”] The order of attributes in an element does not matter. XML elements are ordered.
21 Attributes (cont’d) Another common use for attributes is to express dimension or type. A document that obeys the “nested tags” rule and does not repeat an attribute within a tag is said to be well-formed.
22 When to use attributes It’s not always clear when to use attributes. [Example: the same record for F. MacNiel written two ways]
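The ordering rules can be observed directly. A minimal sketch, assuming Python's xml.etree.ElementTree; the entry and word tags and the language and kind attributes are hypothetical:

```python
import xml.etree.ElementTree as ET

# Same attributes written in a different order.
a = ET.fromstring('<word language="en" kind="noun">cheese</word>')
b = ET.fromstring('<word kind="noun" language="en">cheese</word>')
same_attributes = a.attrib == b.attrib  # dict comparison ignores order

# Same child elements written in a different order.
x = ET.fromstring("<entry><word>cheese</word><word>fromage</word></entry>")
y = ET.fromstring("<entry><word>fromage</word><word>cheese</word></entry>")
same_children = [w.text for w in x] == [w.text for w in y]

print(same_attributes)
print(same_children)
```

Attributes behave like an unordered map from names to values, while child elements form a sequence.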
23 XML Misc. Apart from elements and attributes, XML allows processing instructions and comments. A processing instruction is a statement of the form: <?target data?> A comment takes the following form: <!-- comment text --> i.e., enclose comments between <!-- and -->
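A short sketch of both constructs using Python's xml.etree.ElementTree, whose Comment and ProcessingInstruction factories create these nodes. The doc tag, the my-app target and its data are hypothetical:

```python
import xml.etree.ElementTree as ET

root = ET.Element("doc")  # hypothetical root tag
root.append(ET.Comment(" a comment "))
# Hypothetical target and data for the processing instruction.
root.append(ET.ProcessingInstruction("my-app", "do-something"))

out = ET.tostring(root, encoding="unicode")
print(out)

# With the default parser settings, comments and processing
# instructions are not kept in the parsed tree.
reparsed = ET.fromstring(out)
print(len(reparsed))
```

Neither construct is part of the element content, which is why a default parse of the serialized text yields a root with no children.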
24 Document Type Descriptors Imposing structure on XML documents
25 Document Type Descriptors Document Type Descriptors (DTDs; the XML specification calls them document type definitions) impose structure on an XML document. A DTD is related to a database schema, but only loosely -- hence the need for additional “typing” systems. The DTD is a syntactic specification.
26 Example: The Address Book [Example person entry: MacNiel, John; Dr. John MacNiel; 1234 Huron Street; Rome, OH; (321)] Annotations: exactly one name; at most one greeting; as many address lines as needed (in order); mixed telephones and faxes; as many as needed.
27 Specifying the structure
name          to specify a name element
greet?        to specify an optional (0 or 1) greet element
name,greet?   to specify a name followed by an optional greet
28 Specifying the structure (cont)
addr*         to specify 0 or more address lines
tel | fax     a tel or a fax element
(tel | fax)*  0 or more repeats of tel or fax
*             0 or more elements
29 Specifying the structure (cont) So the whole structure of a person entry is specified by name, greet?, addr*, (tel | fax)*, * This is known as a regular expression. Why is it important?
30 Regular Expressions Each regular expression determines a corresponding finite state automaton. Let’s start with a simpler example: name, addr*, This suggests a simple parsing program. [Figure: automaton with transitions labeled name and addr]
31 Another example name,address*,(tel | fax)*, * [Figure: automaton with transitions labeled name, address, tel and fax] Adding in the optional greet further complicates things.
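The automaton idea can be sketched as a transition table. This is an illustrative hand-built automaton for the surviving part of the content model, name, greet?, addr*, (tel | fax)*; the final item of the model is elided in the source and is left out here. State names are hypothetical:

```python
# (state, child tag) -> next state
TRANSITIONS = {
    ("start", "name"): "named",
    ("named", "greet"): "greeted",
    ("named", "addr"): "addrs",
    ("named", "tel"): "phones",
    ("named", "fax"): "phones",
    ("greeted", "addr"): "addrs",
    ("greeted", "tel"): "phones",
    ("greeted", "fax"): "phones",
    ("addrs", "addr"): "addrs",
    ("addrs", "tel"): "phones",
    ("addrs", "fax"): "phones",
    ("phones", "tel"): "phones",
    ("phones", "fax"): "phones",
}

# Everything after name is optional, so every state except "start" accepts.
ACCEPTING = {"named", "greeted", "addrs", "phones"}


def accepts(tags):
    """Run the automaton over a sequence of child tag names."""
    state = "start"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:  # no transition: the sequence is rejected
            return False
    return state in ACCEPTING


print(accepts(["name", "greet", "addr", "addr", "tel", "fax"]))
print(accepts(["name", "tel", "addr"]))
```

The optional greet is what adds the extra state and the duplicated outgoing transitions, which matches the remark that it complicates the picture.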
32 A DTD for the address book <!DOCTYPE addressbook [ <!ELEMENT person (name, greet?, address*, (fax | tel)*, *)> ]>
33 Our relational DB revisited projects(title, budget, managedBy); employees(name, ssn, age)
34 Two DTDs for the relational DB <!DOCTYPE db [... ]> <!DOCTYPE db [... ]>
35 Some things are hard to specify Each employee element is to contain name, age and ssn elements in some order. <!ELEMENT employee ( (name, age, ssn) | (age, ssn, name) | (ssn, name, age) |... )> Three fields already need 3! = 6 alternatives. Suppose there were many more fields!
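To see how quickly the "any order" content model grows, the alternatives can be generated. A small sketch using Python's itertools; the function name any_order_model is hypothetical:

```python
from itertools import permutations
from math import factorial


def any_order_model(fields):
    """Build a DTD content model listing every ordering of `fields`."""
    alternatives = ["(" + ", ".join(p) + ")" for p in permutations(fields)]
    return "( " + " | ".join(alternatives) + " )"


model = any_order_model(["name", "age", "ssn"])
print(model)

# The number of alternatives is n! for n fields.
for n in (3, 6, 10):
    print(n, factorial(n))
```

With n fields the model needs n! alternatives, so even ten fields would require several million of them.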
36 Summary of XML regular expressions
A          The tag A occurs
e1,e2      The expression e1 followed by e2
e*         0 or more occurrences of e
e?         Optional -- 0 or 1 occurrences
e+         1 or more occurrences
e1 | e2    either e1 or e2
(e)        grouping
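Because these operators are the usual regular-expression operators, a content model can be translated into an ordinary regex over the sequence of child tags. A rough sketch using Python's re module; model_to_regex and matches are hypothetical helpers, and the translation covers only element names and the operators in the table (no #PCDATA, EMPTY or ANY):

```python
import re


def model_to_regex(model):
    """Translate a DTD content model into a compiled Python regex.

    Each child tag is encoded as the tag name followed by ';'.
    """
    body = re.sub(r"\s+", "", model)   # whitespace is not significant
    body = body.replace("(", "(?:")    # grouping, without capturing
    # Wrap every element name so that ?, * and + apply to the whole name.
    body = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda m: "(?:" + re.escape(m.group()) + ";)",
        body,
    )
    body = body.replace(",", "")       # sequence is plain concatenation
    return re.compile(body)


def matches(model, tags):
    """Check a list of child tag names against a content model."""
    encoded = "".join(tag + ";" for tag in tags)
    return model_to_regex(model).fullmatch(encoded) is not None


person_model = "name, greet?, addr*, (tel | fax)*"
print(matches(person_model, ["name", "addr", "fax", "tel"]))
print(matches(person_model, ["name", "greet", "greet"]))
```

The regex engine then plays the role of the finite state automaton that the content model determines.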
37 Specifying attributes in the DTD <!ATTLIST height dimension CDATA #REQUIRED accuracy CDATA #IMPLIED > The dimension attribute is required; the accuracy attribute is optional. CDATA is the “type” of the attribute -- it means string.
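A sketch of what the #REQUIRED and #IMPLIED modifiers demand, written as a hand-rolled check because Python's xml.etree.ElementTree does not validate against DTDs. The ATTLIST table mirrors the declaration on the slide; the attribute values and element content are hypothetical:

```python
import xml.etree.ElementTree as ET

# Mirrors the slide: dimension is #REQUIRED, accuracy is #IMPLIED.
ATTLIST = {"height": {"dimension": "#REQUIRED", "accuracy": "#IMPLIED"}}


def attribute_errors(elem):
    """List violations of the attribute declarations for one element."""
    declared = ATTLIST.get(elem.tag, {})
    errors = []
    for name, default in declared.items():
        if default == "#REQUIRED" and name not in elem.attrib:
            errors.append(f"missing required attribute {name!r}")
    for name in elem.attrib:
        if name not in declared:
            errors.append(f"undeclared attribute {name!r}")
    return errors


# Hypothetical values; only the attribute names come from the slide.
ok = ET.fromstring('<height dimension="cm" accuracy="0.5">185</height>')
minimal = ET.fromstring('<height dimension="cm">185</height>')
missing = ET.fromstring('<height accuracy="0.5">185</height>')

for sample in (ok, minimal, missing):
    print(attribute_errors(sample))
```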
38 The DTD Language Default modifiers in DTD attributes:
39 The DTD Language Datatypes in DTD attributes:
40 Consistency of ID and IDREF attribute values
If an attribute is declared as ID
– the associated values must all be distinct (no confusion)
– ID is a poor cousin of a key in relational databases.
If an attribute is declared as IDREF
– the associated value must exist as the value of some ID attribute (no dangling “pointers”)
– IDREF is a poor cousin of a foreign key in relational databases.
Similarly for all the values of an IDREFS attribute
– An attribute of type IDREFS represents a space-separated list of references to valid IDs.
ID and IDREF attributes are not typed.
41 Specifying ID and IDREF attributes <!DOCTYPE family [ <!ATTLIST person id ID #REQUIRED mother IDREF #IMPLIED father IDREF #IMPLIED children IDREFS #IMPLIED> ]>
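The two consistency rules (distinct IDs, no dangling references) can be sketched as a hand-written check, since Python's xml.etree.ElementTree does not enforce them. The attribute names follow the ATTLIST above; check_references, the family root tag, and the id values are hypothetical:

```python
import xml.etree.ElementTree as ET


def check_references(root, id_attr="id",
                     idref_attrs=("mother", "father"),
                     idrefs_attrs=("children",)):
    """Report duplicate IDs and references that point at no ID."""
    errors = []
    seen = set()
    # Pass 1: collect IDs and flag duplicates.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                errors.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Pass 2: every IDREF, and every token of an IDREFS, must be a known ID.
    for elem in root.iter():
        refs = [elem.get(a) for a in idref_attrs if elem.get(a) is not None]
        for attr in idrefs_attrs:
            refs.extend((elem.get(attr) or "").split())
        for ref in refs:
            if ref not in seen:
                errors.append(f"dangling reference {ref!r}")
    return errors


# Hypothetical ids; the names are the ones used in the slides.
family = ET.fromstring("""
<family>
  <person id="jane" children="mary jack">Jane Doe</person>
  <person id="john" children="mary jack">John Doe</person>
  <person id="mary" mother="jane" father="john">Mary Doe</person>
  <person id="jack" mother="jane" father="john">Jack Doe</person>
</family>
""")

print(check_references(family))
```

Note that the check cannot tell whether mother actually points at a person; that is the sense in which ID and IDREF attributes are untyped.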
42 Some conforming data Jane Doe John Doe Mary Doe Jack Doe
43 An alternative specification <!DOCTYPE family [ ]>
44 The revised data Jane Doe John Doe...
45 The DTD Language Example: Sales Order Document “An order document is comprised of several sales orders. Each individual order has a number and it contains the customer information, the date when the order was received, and the items ordered. Each customer has a number, a name, street, city, state, and ZIP code. Each item has an item number, parts information and a quantity. The parts information contains a number, a description of the product and its unit price. The numbers should be treated as attributes.”
46 The DTD Language Example: Sales Order Document DTD
47 The DTD Language Example: Sales Order XML Document ABC Industries 123 Main St. Chicago IL Turkey wrench
48 A useful abbreviation When an element has empty content we can use <tag/> for <tag></tag>. For example: Jane Doe...
49 Schema.dtd <!DOCTYPE db [
50 Schema.dtd (cont’d) ]>
51 Connecting the document with its DTD In line: … ]>... Another file : A URL: <!DOCTYPE db SYSTEM "
52 Well-formed and Valid Documents Well-formed applies to any document (with or without a DTD): proper nesting of tags and unique attributes. Valid specifies that the document conforms to the DTD: it conforms to the regular expression grammar, the types of attributes are correct, and the constraints on references are satisfied.
53 DTDs vs. Schemas (or Types) By database (or programming language) standards, DTDs are rather weak specifications.
– Only one base type -- PCDATA
– No useful “abstractions”, e.g., sets
– IDREFs are untyped. You point to something, but you don’t know what!
– No constraints, e.g., child is inverse of parent
– No methods
– Tag definitions are global
Some of the XML extensions impose something like a schema or type on an XML document. We’ll see these later.
54 Lots of possibilities for schemas XML Schema (under W3C’s spotlight); XDR (Microsoft’s BizTalk); SOX (Schema for Object-Oriented XML); Schematron; DSD (AT&T Labs and BRICS); and more.
55 Some tools XML Authority _authority/index.htm XML Spy
56 Summary XML is a new data format. Its main virtues are widespread acceptance and the (important) ability to handle semistructured data (data without schema). DTDs provide some useful syntactic constraints on documents. As schemas they are weak. How to store large XML documents? How to query them? How to map between XML and other representations?