1 XML
Semistructured Data
Extensible Markup Language
Document Type Definitions
2 Semistructured Data
• Another data model, based on trees.
• Motivation: flexible representation of data.
  - Often, data comes from multiple sources with differences in notation, meaning, etc.
• Motivation: sharing of documents among systems and databases.
3 The Information-Integration Problem
• Related data exists in many places and could, in principle, work together.
• But different databases differ in:
  1. Model (relational, object-oriented?).
  2. Schema (normalized/unnormalized?).
  3. Terminology: are consultants employees? Retirees? Subcontractors?
  4. Conventions (meters versus feet?).
4 Example
• Every bar has a database.
  - One may use a relational DBMS; another keeps the menu in an MS-Word document.
  - One stores the phones of distributors, another does not.
  - One distinguishes ales from other beers, another doesn’t.
  - One counts beer inventory by bottles, another by cases.
5 Two Approaches to Integration
1. Warehousing: Make copies of the data sources at a central site and transform them to a common schema.
  - Reconstruct the data daily or weekly, but do not try to keep it more up-to-date than that.
2. Mediation: Create a view of all sources, as if they were integrated.
  - Answer a view query by translating it to the terminology of the sources and querying them.
6 Warehouse Diagram
[Figure: a Warehouse fed through Wrappers by Source 1 and Source 2.]
7 A Mediator
[Figure: a User query enters the Mediator, which sends Queries through Wrappers to Source 1 and Source 2; Results flow back along the same path.]
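The two integration approaches can be contrasted in a short Python sketch. Everything in it is hypothetical: the two source layouts, the wrapper functions, and the figure of 24 bottles per case are assumptions made for illustration, echoing the earlier example in which one bar counts inventory by bottles and another by cases.

```python
# Hypothetical sources: one bar counts Bud by bottles, another by cases.
SOURCE_1 = [{"beer": "Bud", "bottles": 48}]
SOURCE_2 = [{"beer_name": "Bud", "cases": 3}]
BOTTLES_PER_CASE = 24  # assumed conversion factor, not from the slides


def wrap_source_1():
    """Wrapper: translate source 1 rows into the common schema (beer, bottles)."""
    return [(row["beer"], row["bottles"]) for row in SOURCE_1]


def wrap_source_2():
    """Wrapper: rename the field and convert cases to bottles."""
    return [(row["beer_name"], row["cases"] * BOTTLES_PER_CASE) for row in SOURCE_2]


WRAPPERS = [wrap_source_1, wrap_source_2]


def build_warehouse():
    """Warehousing: copy all wrapped data to one place, to be rebuilt periodically."""
    return [row for wrapper in WRAPPERS for row in wrapper()]


def mediator_total(beer):
    """Mediation: answer the query by asking every wrapper at query time."""
    return sum(n for wrapper in WRAPPERS for name, n in wrapper() if name == beer)


warehouse = build_warehouse()      # snapshot; stale until the next rebuild
SOURCE_2[0]["cases"] = 4           # a source changes after the snapshot
warehouse_total = sum(n for name, n in warehouse if name == "Bud")
live_total = mediator_total("Bud")
```

The warehouse answers from its copy and so misses the later change, while the mediator pays the cost of contacting every source on each query.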
8 Graphs of Semistructured Data
• Nodes = objects.
• Labels on arcs (attributes, relationships).
• Atomic values at leaf nodes (nodes with no arcs out).
• Flexibility: no restriction on:
  - Labels out of a node.
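9 Example: Data Graph
[Figure: a data graph starting at a node named root. Arc labels: beer, bar, manf, servedAt, name, addr, prize, year, award. Leaf values: Bud, A.B., Gold, 1995, Maple, Joe’s, Miller. Callouts: "The bar object for Joe’s Bar" and "The beer object for Bud".]
Notice a new kind of data.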
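One way to hold such a graph in memory is as nested Python dictionaries, sketched below. The exact shape of the graph is an assumption reconstructed from the labels and values in the figure (for instance, that the prize object hangs off the Bud beer object and that Miller has only a name); the `leaves` helper is hypothetical.

```python
# Each object is a dict from arc label to a list of targets.
# Atomic values (strings, ints) are the leaves.
joes = {"name": ["Joe's"], "addr": ["Maple"]}
bud = {
    "name": ["Bud"],
    "manf": ["A.B."],
    "servedAt": [joes],                          # arc to the shared bar object
    "prize": [{"year": [1995], "award": ["Gold"]}],
}
miller = {"name": ["Miller"]}                    # fewer arcs: no fixed schema
root = {"beer": [bud, miller], "bar": [joes]}


def leaves(node, path=()):
    """Yield (label path, atomic value) pairs reachable from node.

    Assumes the graph has no cycles; a cyclic graph would need a visited set.
    """
    for label, targets in node.items():
        for target in targets:
            if isinstance(target, dict):
                yield from leaves(target, path + (label,))
            else:
                yield path + (label,), target


paths = list(leaves(root))
```

Because `joes` is reachable both through `bar` and through `beer`/`servedAt`, the structure is a graph rather than a strict tree, and two beer objects may carry different sets of labels.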
10 XML
• XML = Extensible Markup Language.
• While HTML uses tags for formatting (e.g., “italic”), XML uses tags for semantics (e.g., “this is an address”).
• Key idea: create tag sets for a domain (e.g., genomics), and translate all data into properly tagged XML documents.
11 Well-Formed and Valid XML
• Well-Formed XML allows you to invent your own tags.
  - Similar to labels in semistructured data.
• Valid XML involves a DTD (Document Type Definition), a grammar for tags.
12 Well-Formed XML
• Start the document with a declaration, surrounded by <?xml … ?>.
• Normal declaration is:
  <?xml version = "1.0" standalone = "yes" ?>
  - “Standalone” = “no DTD provided.”
• Balance of document is a root tag surrounding nested tags.
13 Tags
• Tags, as in HTML, are normally matched pairs, as <FOO> … </FOO>.
• Tags may be nested arbitrarily.
• XML tags are case sensitive.
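These rules can be tried out with Python's standard-library parser, which rejects text that is not well-formed. The helper name `is_well_formed` and the sample tag names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


matched = is_well_formed("<FOO><BAR>text</BAR></FOO>")     # properly nested pairs
unbalanced = is_well_formed("<FOO><BAR>text</FOO></BAR>")  # closed in the wrong order
wrong_case = is_well_formed("<FOO>text</foo>")             # FOO and foo are different tags
```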
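14 Example: Well-Formed XML
<?xml version = "1.0" standalone = "yes" ?>
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME>
      <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME>
      <PRICE>3.00</PRICE></BEER>
  </BAR>
  …
</BARS>
Callouts: <NAME>Joe’s Bar</NAME> is a NAME subobject; each <BEER> … </BEER> is a BEER subobject.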
15 XML and Semistructured Data
• Well-Formed XML with nested tags is exactly the same idea as trees of semistructured data.
• We shall see that XML also enables nontree structures, as does the semistructured data model.
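16 Example
• The XML document is:
[Figure: the document drawn as a tree. BARS is the root, with BAR children. The first BAR has a NAME leaf (Joe’s Bar) and BEER children, each with a NAME and a PRICE leaf (Bud, 2.50; Miller, 3.00). Further BAR subtrees are elided.]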
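The correspondence between nested tags and a tree can be seen by parsing the document and walking the result. This is a minimal sketch using Python's standard library; the document text assumes the tag names BARS, BAR, NAME, BEER, and PRICE from the example, keeps only the one bar the slides show, and the `outline` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="yes" ?>
<BARS>
  <BAR><NAME>Joe's Bar</NAME>
    <BEER><NAME>Bud</NAME><PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME><PRICE>3.00</PRICE></BEER>
  </BAR>
</BARS>"""

root = ET.fromstring(DOC)


def outline(elem, depth=0):
    """Return the element tree as indented lines, one line per node."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (": " + text if text else "")]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


tree_lines = outline(root)

# Leaves carry the atomic values; interior nodes only group their children.
prices = {beer.findtext("NAME"): float(beer.findtext("PRICE"))
          for beer in root.iter("BEER")}
```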
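17 DTD Structure
<!DOCTYPE <root tag> [
  <!ELEMENT <name> ( <components> )>
  ... more elements ...
]>
Here <root tag>, <name>, and <components> stand for the root tag, an element name, and that element's nested tags.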
18 DTD Elements
• The description of an element consists of its name (tag), and a parenthesized description of any nested tags.
  - Includes order of subtags and their multiplicity.
• Leaves (text elements) have #PCDATA (Parsed Character DATA) in place of nested tags.
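19 Example: DTD
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
• A BARS object has zero or more BAR’s nested within.
• A BAR has one NAME and one or more BEER subobjects.
• A BEER has a NAME and a PRICE.
• NAME and PRICE are text.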
20 Element Descriptions
• Subtags must appear in order shown.
• A tag may be followed by a symbol to indicate its multiplicity.
  - * = zero or more.
  - + = one or more.
  - ? = zero or one.
• Symbol | can connect alternative sequences of tags.
21 Example: Element Description
• A name is an optional title (e.g., “Prof.”), a first name, and a last name, in that order, or it is an IP address:
  <!ELEMENT NAME ( (TITLE?, FIRST, LAST) | IPADDR )>
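Element descriptions behave much like regular expressions over the sequence of child tags, and the sketch below makes that reading concrete. The functions `content_model_to_regex` and `children_match` are hypothetical helpers written for this illustration, not part of any XML library, and they handle only element names with the `,`, `|`, `*`, `+`, `?` operators (not #PCDATA, EMPTY, or ANY).

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model made of element names into a regex.

    A child sequence is encoded as "TAG;TAG;...", so each name becomes a
    group matching "NAME;". Commas (sequencing) and whitespace are dropped;
    the DTD symbols * + ? | ( ) carry the same meaning in the regex.
    """
    pattern = re.sub(r"[A-Za-z_][\w.-]*",
                     lambda m: "(?:" + re.escape(m.group(0)) + ";)",
                     model)
    pattern = re.sub(r"[\s,]+", "", pattern)
    return re.compile(pattern)


def children_match(model, child_tags):
    """True if the ordered list of child tags satisfies the content model."""
    encoded = "".join(tag + ";" for tag in child_tags)
    return content_model_to_regex(model).fullmatch(encoded) is not None


NAME_MODEL = "( (TITLE?, FIRST, LAST) | IPADDR )"

with_title = children_match(NAME_MODEL, ["TITLE", "FIRST", "LAST"])
without_title = children_match(NAME_MODEL, ["FIRST", "LAST"])
ip_only = children_match(NAME_MODEL, ["IPADDR"])
wrong_order = children_match(NAME_MODEL, ["LAST", "FIRST"])
both_alternatives = children_match(NAME_MODEL, ["FIRST", "LAST", "IPADDR"])
```

The first three sequences are accepted; the last two are rejected because subtags must appear in the order shown and `|` selects exactly one alternative.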
22 Use of DTD’s
1. Set standalone = "no".
2. Either:
  a) Include the DTD as a preamble of the XML document, or
  b) Follow DOCTYPE and the root tag by SYSTEM and a path to the file where the DTD can be found.
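23 Example (a)
<?xml version = "1.0" standalone = "no" ?>
The DTD:
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
The document:
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  …
</BARS>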
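Well-formedness and validity are separate checks, which the following sketch illustrates. Python's standard-library parser is non-validating: it reads a DTD preamble but does not enforce it, so a validating parser is needed to apply the grammar. Here the rule for BAR is checked by hand; the function `bar_is_valid` is a hypothetical helper, and the sample document deliberately violates the rule that a BAR has one or more BEER subobjects.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0" standalone="no" ?>
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*)>
  <!ELEMENT BAR (NAME, BEER+)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT BEER (NAME, PRICE)>
  <!ELEMENT PRICE (#PCDATA)>
]>
<BARS>
  <BAR><NAME>Joe's Bar</NAME></BAR>
</BARS>"""

# Parsing succeeds: the document is well-formed even though it is not valid.
root = ET.fromstring(DOC)


def bar_is_valid(bar):
    """Hand-written check of <!ELEMENT BAR (NAME, BEER+)>."""
    tags = [child.tag for child in bar]
    return (len(tags) >= 2
            and tags[0] == "NAME"
            and all(tag == "BEER" for tag in tags[1:]))


invalid_bars = [bar.findtext("NAME") for bar in root.iter("BAR")
                if not bar_is_valid(bar)]
```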
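24 Example (b)
• Assume the BARS DTD is in file bar.dtd.
<?xml version = "1.0" standalone = "no" ?>
<!DOCTYPE BARS SYSTEM "bar.dtd">
<BARS>
  <BAR><NAME>Joe’s Bar</NAME>
    <BEER><NAME>Bud</NAME> <PRICE>2.50</PRICE></BEER>
    <BEER><NAME>Miller</NAME> <PRICE>3.00</PRICE></BEER>
  </BAR>
  …
</BARS>
Callout on the DOCTYPE line: get the DTD from the file bar.dtd.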
25 Attributes
• Opening tags in XML can have attributes.
• In a DTD, <!ATTLIST E … > declares an attribute for element E, along with its datatype.
26 Example: Attributes
• Bars can have an attribute kind, a character string describing the bar.
  <!ATTLIST BAR kind CDATA #IMPLIED>
  - CDATA: character string type; no tags.
  - #IMPLIED: the attribute is optional; the opposite is #REQUIRED.
27 Example: Attribute Use
• In a document that allows BAR tags, we might see:
  <BAR kind = "…">
    <NAME>Akasaka</NAME>
    <BEER><NAME>Sapporo</NAME> … </BEER>
    …
  </BAR>
• Note attribute values are quoted.
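Reading attributes from a parsed document can be sketched as follows with Python's standard library. The value "Japanese" for the kind attribute and the second bar are made-up sample data, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# "Japanese" is a hypothetical attribute value chosen for this example.
DOC = """<BARS>
  <BAR kind="Japanese"><NAME>Akasaka</NAME></BAR>
  <BAR><NAME>Joe's Bar</NAME></BAR>
</BARS>"""

root = ET.fromstring(DOC)

# An optional attribute may simply be absent; get() then returns None.
kinds = {bar.findtext("NAME"): bar.get("kind") for bar in root.iter("BAR")}

# Attribute values must be quoted; an unquoted value is not well-formed.
try:
    ET.fromstring("<BAR kind=Japanese></BAR>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
```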
28 ID’s and IDREF’s
• Attributes can be pointers from one object to another.
  - Compare to HTML’s NAME = "foo" and HREF = "#foo".
• Allows the structure of an XML document to be a general graph, rather than just a tree.
29 Creating ID’s
• Give an element E an attribute A of type ID.
• When using tag <E> in an XML document, give its attribute A a unique value.
• Example: <E A = "…">
30 Creating IDREF’s
• To allow objects of type F to refer to another object with an ID attribute, give F an attribute of type IDREF.
• Or, let the attribute have type IDREFS, so the F-object can refer to any number of other objects.
31 Example: ID’s and IDREF’s
• Let’s redesign our BARS DTD to include both BAR and BEER subelements.
• Both bars and beers will have ID attributes called name.
• Bars have SELLS subobjects, consisting of a number (the price of one beer) and an IDREF theBeer leading to that beer.
• Beers have attribute soldBy, which is an IDREFS leading to all the bars that sell it.
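32 The DTD
<!DOCTYPE BARS [
  <!ELEMENT BARS (BAR*, BEER*)>
  <!ELEMENT BAR (SELLS+)>
    <!ATTLIST BAR name ID #REQUIRED>
  <!ELEMENT SELLS (#PCDATA)>
    <!ATTLIST SELLS theBeer IDREF #REQUIRED>
  <!ELEMENT BEER EMPTY>
    <!ATTLIST BEER name ID #REQUIRED>
    <!ATTLIST BEER soldBy IDREFS #IMPLIED>
]>
• Bar elements have name as an ID attribute and have one or more SELLS subelements.
• SELLS elements have a number (the price) and one reference to a beer.
• Beer elements have an ID attribute called name, and a soldBy attribute that is a set of Bar names.
• EMPTY is explained next.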
33 Example Document
…
<BEER name = "Bud" soldBy = "JoesBar SuesBar …"/>
…
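Following ID and IDREF pointers turns the parsed tree back into a graph. The sketch below does this by hand with Python's standard library, which does not resolve references itself. The sample document is an assumption built from the element and attribute names described above; the prices and the contents of SuesBar are invented, and `deref` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical document following the redesigned BARS structure.
DOC = """<BARS>
  <BAR name="JoesBar">
    <SELLS theBeer="Bud">2.50</SELLS>
    <SELLS theBeer="Miller">3.00</SELLS>
  </BAR>
  <BAR name="SuesBar">
    <SELLS theBeer="Bud">2.75</SELLS>
  </BAR>
  <BEER name="Bud" soldBy="JoesBar SuesBar"/>
  <BEER name="Miller" soldBy="JoesBar"/>
</BARS>"""

root = ET.fromstring(DOC)

# ID attributes: each name value must be unique across the whole document.
by_id = {}
for elem in root.iter():
    ident = elem.get("name")
    if ident is not None:
        if ident in by_id:
            raise ValueError("duplicate ID " + repr(ident))
        by_id[ident] = elem


def deref(idrefs):
    """Follow an IDREF or IDREFS value (whitespace-separated IDs) to elements."""
    return [by_id[ref] for ref in idrefs.split()]


# IDREFS: one beer points to all the bars that sell it.
bars_selling_bud = [bar.get("name") for bar in deref(by_id["Bud"].get("soldBy"))]

# IDREF: each SELLS element points to exactly one beer.
joes_prices = {deref(sells.get("theBeer"))[0].get("name"): float(sells.text)
               for sells in by_id["JoesBar"].iter("SELLS")}
```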
34 Empty Elements
• We can do all the work of an element in its attributes.
  - Like BEER in previous example.
• Another example: SELLS elements could have attribute price rather than a value that is a price.
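35 Example: Empty Element
• In the DTD, declare:
  <!ELEMENT SELLS EMPTY>
  <!ATTLIST SELLS theBeer IDREF #REQUIRED>
  <!ATTLIST SELLS price CDATA #REQUIRED>
• Example use:
  <SELLS theBeer = "Bud" price = "2.50"/>
• Note exception to “matching tags” rule.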
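An empty element written with the single-tag form carries the same information as an opening tag immediately followed by its closing tag. The sketch below checks this with Python's standard library; the attribute values are sample data and `is_empty` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<SELLS theBeer="Bud" price="2.50"/>')
long_form = ET.fromstring('<SELLS theBeer="Bud" price="2.50"></SELLS>')


def is_empty(elem):
    """True if the element has no subelements and no text content."""
    return len(elem) == 0 and not (elem.text or "").strip()


# Both spellings parse to the same tag and the same attributes.
same_content = (short_form.tag == long_form.tag
                and short_form.attrib == long_form.attrib)

# All the data lives in the attributes.
price = float(short_form.get("price"))
```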