From Semistructured Data to XML Dan Suciu AT&T Labs

How the Web is Today HTML documents all intended for human consumption many generated automatically by applications Easy to fetch any Web page, from any server, any platform

Limits of the Web Today applications cannot consume HTML HTML wrapper technology is brittle –screen scraping OO technology (CORBA) requires a controlled environment companies merge, form partnerships; need interoperability fast people are inventive: send data by fax!

Paradigm Shift on the Web new Web standard XML: –XML generated by applications –XML consumed by applications data exchange –across platforms: enterprise interoperability –across enterprises Web: from collection of documents to data and documents

Database Community Can Help query optimization, processing views, transformations data warehouses, data integration mediators, query rewriting secondary storage, indexes

But Needs a Paradigm Shift Too Web data differs from database data: –self-describing, schema-less –structure changes without notice –heterogeneous, deeply nested, irregular –documents and data mixed together designed by document, not db experts need Web data management

What This Tutorial is About what the database community has done –semistructured data model –query languages, schemas what the Web community has done: –data formats/models: XML, RDF –transformation language (XSL), schemas where they meet and where they differ

Outline Semistructured data and XML Query languages Schemas Systems issues Conclusions

Part 1 Semistructured Data and XML

Semistructured Data Origins: integration of heterogeneous sources data sources with non-rigid structure biological data Web data

The Semistructured Data Model Object Exchange Model (OEM) [Figure: edge-labeled graph rooted at Bib (&o1) with paper, book and paper objects (&o12, &o24, &o29); edges labeled references, author, title, year, http, publisher, page, firstname, lastname, first, last; callouts mark a complex object and an atomic object; atomic values include “Serge”, “Abiteboul”, 1997, “Victor”, “Vianu”]

Syntax for Semistructured Data Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &o243 “Victor”, lastname: &o206 “Vianu”}, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, pages: &o25 { first: &o64 122, last: &o92 133} } }

Syntax for Semistructured Data May omit oid’s: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } } }
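The oid-free syntax above maps naturally onto nested data in a programming language. The following is a minimal sketch, assuming a representation in which a complex object is a list of (label, child) edges; the helper name `children` is hypothetical and not part of OEM. A list is used rather than a dictionary because a label such as author may repeat.

```python
from typing import Union

# An OEM object is either an atomic value or a complex object.
# A complex object is a list of (label, child) edges; labels may repeat.
Atomic = Union[str, int]
OEM = Union[Atomic, list]

bib: OEM = [
    ("paper", [
        ("author", "Abiteboul"),
        ("author", [("firstname", "Victor"), ("lastname", "Vianu")]),
        ("title", "Regular path queries ..."),
        ("page", [("first", 122), ("last", 133)]),
    ]),
]

def children(obj: OEM, label: str) -> list:
    """Return every child reachable from obj over an edge with this label."""
    if not isinstance(obj, list):
        return []  # atomic objects have no outgoing edges
    return [child for (lbl, child) in obj if lbl == label]

paper = children(bib, "paper")[0]
authors = children(paper, "author")  # one atomic author, one complex author
```

This sketch covers trees only; shared objects and cycles would need explicit oids, as in the full syntax.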

Characteristics of Semistructured Data missing or additional attributes multiple attributes different types in different objects heterogeneous collections self-describing, irregular data, no a priori structure

Comparison with Relational Data { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } [Figure: the same name and phone data for “John”, “Sue” and “Dick”]

XML a W3C standard to complement HTML origins: structured text SGML motivation: –HTML describes presentation –XML describes content (2/98)

From HTML to XML HTML describes the presentation

HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteboul, Buneman, Suciu Morgan Kaufmann, 1999

XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content

XML Terminology tags: book, title, author, … start tag: <book>, end tag: </book> elements: <book>…</book>, <author>…</author> elements are nested empty element: <book></book> abbrv. <book/> an XML document: single root element well formed XML document: if it has matching tags
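Well-formedness as defined above (matching tags, a single root element) can be checked with any XML parser. A minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the sample strings are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML (matching tags, single root)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

good = "<book><title>Foundations of Databases</title><author>Abiteboul</author></book>"
bad_unclosed = "<book><title>Foundations of Databases</book>"  # title is never closed
bad_two_roots = "<book/><book/>"                                # two root elements
```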

More XML: Attributes Foundations of Databases Abiteboul … 1995 attributes are alternative ways to represent data

More XML: Oids and References Jane Mary John oids and references in XML are just syntax

XML Data Model does not exist Document Object Model (DOM) (10/98): –class hierarchy (node, element, attribute,…) –objects have behavior –defines API to inspect/modify the document

XML Parsers traditional: return data structure (DOM?) event based: SAX (Simple API for XML) –write handler for start tag and for end tag
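The event-based style can be illustrated with Python's standard `xml.sax` module. This is a minimal sketch; the handler class `TagLogger` and the sample document are assumptions made for illustration. The handler receives one callback per start tag and one per end tag and never builds a tree.

```python
import xml.sax

class TagLogger(xml.sax.ContentHandler):
    """Records start-tag and end-tag events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

handler = TagLogger()
xml.sax.parseString(b"<book><title>Data on the Web</title></book>", handler)
```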

XML Namespaces (1/99) name ::= [prefix:]localpart … 15 ….

XML Namespaces syntactic:, semantic: provide URL for schema … defined here

XML v.s. Semistructured Data both described best by a graph both are schema-less, self-describing

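Similarities and Differences XML element person with name Alan and age 42 corresponds to { person: &o123 { name: “Alan”, age: 42 } } a reference to it corresponds to { person: { father: &o123 …} } [Figure: the corresponding graphs with edges person, name, age, father] similar on trees, different on graphs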

More Differences XML is ordered, ssd is not XML can mix text and elements: Making Java easier to type and easier to type Phil Wadler XML has lots of other stuff: entities, processing instructions, comments

RDF (2/99) purpose: metadata for Web –help search engines syntax in XML semantics: edge-labeled graphs

RDF Syntax birds, butterflies, snakes John Smith

RDF Data Model [Figure: edge-labeled graph with edges about, author, firstname, lastname and values birds, butterflies, snakes, John, Smith] the RDF Data Model is very close to semistructured data

More RDF Examples [Figure: edge-labeled graph with edges about, author, firstname, lastname, author, related, author and values birds, butterflies, snakes, John, Smith, Joe, Doe]

birds, butterflies, snakes John Smith Joe Doe

RDF Terminology subject object predicate statement

More RDF: Containers bag, sequence, alternative s1 s2

RDF Containers (cont’d) Bag s1 s2 a rdf:type rdf_1 rdf_2

More RDF: Higher Order Statements “the author of says: ‘the topic of is environment’ “ environment topic says author RDF uses reification

Summary of Data Models semistructured data, XML, RDF data is self-describing, irregular schema embedded in the data

Part 2 Query Languages Semistructured data and XML Query languages Schemas Systems issues Conclusions

Query Languages: Motivation granularity of the HTML Web: one file granularity of Web data varies: –single data item: “get John’s salary” –entire database: “get all salaries” –aggregates: “get average salary” need query language to define granularity

Query Languages: Outline for semistructured data: –Lorel –UnQL –StruQL for XML: XML-QL a different paradigm –structural recursion –XSL

Lorel part of the Lore system (Stanford) adapts OQL to semistructured data example: select X.title from Bib.paper X where X.year > 1995 abbreviated to: select Bib.paper.title from Bib.paper where Bib.paper.year > 1995

Lorel v.s. OQL implicit coercions: 1995 to “1995” missing attributes –empty answer v.s. type error set-valued attributes –in X.year>1995, X may have several years regular path expressions (next)
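The three relaxations above can be sketched in Python for the query select X.title from Bib.paper X where X.year > 1995. This is not Lore's implementation; the data, the edge-list representation and the helper names `children`, `as_number` and `titles_after` are assumptions made for illustration.

```python
def children(obj, label):
    """Children of a complex object (list of (label, child) edges) by label."""
    if not isinstance(obj, list):
        return []
    return [c for (l, c) in obj if l == label]

def as_number(value):
    """Implicit coercion: 1995 and "1995" compare alike; None if not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def titles_after(bib, year):
    result = []
    for paper in children(bib, "paper"):
        # year is set-valued: the predicate holds if ANY year qualifies.
        years = [as_number(y) for y in children(paper, "year")]
        if any(y is not None and y > year for y in years):
            result.extend(children(paper, "title"))
    return result  # a paper without a year contributes nothing, no type error

bib = [
    ("paper", [("title", "A"), ("year", 1997)]),
    ("paper", [("title", "B"), ("year", "1998")]),               # year as string
    ("paper", [("title", "C")]),                                  # missing year
    ("paper", [("title", "D"), ("year", 1990), ("year", 1996)]),  # several years
    ("paper", [("title", "E"), ("year", 1994)]),
]
```

With this data, `titles_after(bib, 1995)` returns the titles A, B and D.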

Regular Path Expressions Useful for: syntactic substitute for inheritance: paper|book navigating partially known structures: lastname? transitive closure: reference+ select X.title from Bib.paper X, Bib.(paper|book) Y where Y.author.lastname? = “Ullman” and Y.reference+ X
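Of the three operators, transitive closure is the one that needs care on graphs, because the data may contain cycles. A minimal sketch of evaluating reference+ by breadth-first search follows; it handles only this one operator, not general regular path expressions, and the graph and the function name `reachable_plus` are hypothetical.

```python
from collections import deque

# Hypothetical data graph: oid -> list of (label, target oid) edges.
graph = {
    "p1": [("reference", "p2")],
    "p2": [("reference", "p3"), ("reference", "b1")],
    "p3": [("reference", "p1")],  # cycle back to p1
    "b1": [],
}

def reachable_plus(graph, start, label):
    """Nodes reachable from start over one or more edges with this label."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for lbl, target in graph.get(node, []):
            if lbl == label and target not in seen:
                seen.add(target)  # the seen set guarantees termination on cycles
                queue.append(target)
    return seen
```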

UnQL Unstructured Query Language patterns, templates, structural recursion patterns: select T where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995

UnQL: Templates select result: { fn: F, ln: L, pub: { title: T, year: Y }} where Bib.paper: { title: T, year: Y, journal: “TODS”} and Y > 1995 Result looks like: { result: { fn: “John”, ln: “Smith”, pub: { title: “P equals NP”, year: 2005}}, result: { fn: “Joe”, ln: “Doe”, pub: { title: “Errata to P=NP”, year: 2006}} … }

Skolem Functions Maier, 1986 –in OO systems Kifer et al, 1989 –F-logic Hull and Yoshikawa, 1990 –deductive db (ILOG) Papakonstantinou et al., 1996 –semistructured db (MSL) illustrate with Strudel (next)

Skolem Functions in StruQL Strudel: a Web Site Management System StruQL: its query language

Example: Bibliography Data { Bib: { paper: { author: “Jones”, author: “Smith”, title: “The Comma”, year: 1994 }, paper: { ….. } } }

Example: A Complex Web Site [Figure: site graph with nodes Root(), HomePage(“Smith”), HomePage(“Jones”), HomePage(“Mark”), YearPage(“Smith”, 1994), YearPage(“Smith”, 1996), YearPage(“Jones”, 1994), YearPage(“Jones”, 1998), YearPage(“Mark”, 1996), PubPage(“The Comma”), PubPage(“The Dot”); edges labeled person, yearentry, publication, title, author]

Example: Skolem Functions in StruQL where Root -> “Bib” -> X, X -> “paper” -> P, P -> “author” -> A, P -> “title” -> T, P -> “year” -> Y create Root(), HomePage(A), YearPage(A,Y), PubPage(P) link Root() -> “person” -> HomePage(A), HomePage(A) -> “yearentry” -> YearPage(A,Y), YearPage(A,Y) -> “publication” -> PubPage(P), PubPage(P) -> “author” -> HomePage(A), PubPage(P) -> “title” -> T
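The key property of a Skolem function is that the same function name applied to the same arguments always denotes the same node, so HomePage(“Smith”) is created once however many bindings mention Smith. A minimal Python sketch of the create and link clauses follows; it is not Strudel's implementation, and the input records, the paper ids P1 and P2, and the helper name `skolem` are assumptions.

```python
# Hypothetical source data, standing in for the bindings of A, T, Y, P.
papers = [
    {"id": "P1", "authors": ["Jones", "Smith"], "title": "The Comma", "year": 1994},
    {"id": "P2", "authors": ["Smith"], "title": "The Dot", "year": 1996},
]

nodes = set()
links = set()

def skolem(name, *args):
    """A Skolem term: same name and arguments always denote the same node."""
    node = (name,) + args
    nodes.add(node)  # adding an existing term to the set is a no-op
    return node

for p in papers:
    for a in p["authors"]:  # one binding per (paper, author) pair
        root = skolem("Root")
        home = skolem("HomePage", a)
        year_page = skolem("YearPage", a, p["year"])
        pub = skolem("PubPage", p["id"])
        links.add((root, "person", home))
        links.add((home, "yearentry", year_page))
        links.add((year_page, "publication", pub))
        links.add((pub, "author", home))
        links.add((pub, "title", p["title"]))
```

Because nodes and links are sets, repeated bindings collapse: three (paper, author) bindings yield a single Root() and only two home pages.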

XML-QL: A Query Language for XML (8/98) features: –regular path expressions –patterns, templates –Skolem Functions based on OEM data model

Pattern Matching in XML-QL where Morgan Kaufmann $a in “…” construct $a

Simple Constructors in XML-QL Note: abbreviates or or... where $a in “…” construct $a $l Result: Smith English; Smith Mandarin; Doe English

Skolem Functions in XML-QL where $a in “…” construct $a $l Result: Smith English Mandarin; Doe English

A Different Paradigm: Structural Recursion Data as sets with a union operator: {a:3, a:{b:”one”, c:5}, b:4} = {a:3} U {a:{b:”one”,c:5}} U {b:4}

Structural Recursion Example: retrieve all integers in the data f(T1 U T2) = f(T1) U f(T2) f({L: T}) = f(T) f({}) = {} f(V) = if isInt(V) then {result: V} else {} [Figure: a tree with edges a, a, b, b, c and leaves 3, “one”, 5, 4, and its result tree with result edges to 3, 5, 4] standard textbook programming on trees
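The four equations translate almost line by line into a recursive function. A minimal Python sketch, assuming a tree is a list of (label, subtree) edges and an atomic value is a leaf; the input is the example {a:3, a:{b:“one”, c:5}, b:4}.

```python
def f(tree):
    """Collect every integer leaf as a ("result", value) edge."""
    if isinstance(tree, list):            # a set of edges, i.e. a union of singletons
        out = []
        for _label, subtree in tree:      # f({L: T}) = f(T): the label is dropped
            out.extend(f(subtree))        # f(T1 U T2) = f(T1) U f(T2)
        return out                        # f({}) = {} when the list is empty
    # Atomic value V: keep it only if it is an integer.
    if isinstance(tree, int) and not isinstance(tree, bool):
        return [("result", tree)]
    return []

data = [("a", 3), ("a", [("b", "one"), ("c", 5)]), ("b", 4)]
results = f(data)
```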

Structural Recursion Example: increase all engine prices by 10% f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)} f({}) = {} f(V) = V g(T1 U T2) = g(T1) U g(T2) g({L: T}) = if L= price then {L:1.1*T} else {L: g(T)} g({}) = {} g(V) = V [Figure: input and output trees with edges engine, body, part, price]
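The two mutually dependent functions can be sketched in Python as follows. The function f copies the tree until it meets an engine edge, then hands over to g, which raises every price below that point. The sample data is hypothetical, and the sketch assumes every price edge leads to an atomic number.

```python
def f(tree):
    """Copy the tree; switch to g below any edge labeled engine."""
    if not isinstance(tree, list):
        return tree                                    # f(V) = V
    return [(label, g(sub) if label == "engine" else f(sub))
            for label, sub in tree]

def g(tree):
    """Inside an engine: multiply every price by 1.1, copy everything else."""
    if not isinstance(tree, list):
        return tree                                    # g(V) = V
    return [(label, 1.1 * sub if label == "price" else g(sub))
            for label, sub in tree]

# Hypothetical input: prices under engine change, prices under body do not.
car = [
    ("engine", [("part", [("price", 100)]), ("price", 2000)]),
    ("body", [("part", [("price", 50)])]),
]
out = f(car)
```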

XSL two W3C drafts (7/99): XSLT and XPath in commercial products (e.g. IE5.0) purpose: stylesheet specification language: –stylesheet: XML -> HTML –in general: XML -> XML

XSL Templates and Rules query = collection of template rules template rule = match pattern + template Retrieve all book titles:

XPath Expressions in Match Patterns
bib               matches a bib element
*                 matches any element
/                 matches the root element
/bib              matches a bib element under root
bib/paper         matches a paper in bib
bib//paper        matches a paper in bib, at any depth
//paper           matches a paper at any depth
paper|book        matches a paper or a book
@price            matches a price attribute
bib/book/@price   matches price attribute in book, in bib
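Several of these patterns can be tried with Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath: child steps, `*`, `.//` and attribute predicates, but not unions such as paper|book or attribute nodes as results. The document below is a hypothetical example; paths are evaluated relative to the bib root element.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<bib>"
    "<paper><title>P1</title></paper>"
    "<book price='55'><title>B1</title>"
    "<chapter><paper><title>P2</title></paper></chapter></book>"
    "</bib>"
)

direct_papers = doc.findall("paper")        # like bib/paper
all_papers = doc.findall(".//paper")        # like bib//paper
any_children = doc.findall("*")             # any element directly under bib
priced_books = doc.findall("book[@price]")  # books that carry a price attribute
```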

Flow Control in XSL

XSL is Structural Recursion Equivalent to: f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L = c then {C: T} else if L = b then {B: f(T)} else if L = a then {A: f(T)} else f(T) f({}) = {} f(V) = V XSL query = single function XSL query with modes = multiple functions

XSL and Structural Recursion XSL: trees only, may loop Structural Recursion: arbitrary graphs, always terminates stack overflow on IE 5.0: add the following rule:

Summary of Query Languages studied extensively in semistructured data some quite powerful features no standard for XML QL yet (WG soon) XSL available today (for stylesheets) XSL = structural recursion

Part 3 Schemas Semistructured data and XML Query languages Schemas Systems issues Conclusions

Schemas why ? –XML: to describe semantics –semistructured data: to improve processing what ? –semistructured data: foundational –XML: several concrete proposals here lies our interest

Schemas when ? –semistructured data, XML: a posteriori –RDBMS: a priori, to interpret binary data how ? –semistructured data: schema is independent –XML: schema is hardwired with the data

Outline schemas for semistructured data: –foundations –schema extraction schemas for XML: –DTD –XML-Schema –RDF-Schema

Schemas: An Example Some database: [Figure: data graph rooted at &r1 with objects &c1, &c2, &p1 to &p3, &s0 to &s10, &a1 to &a7; edge labels company, person, name, address, url, position, phone, employee, manages, c.e.o., works-for, description, procurement, salesrep, contact, task, eval; atomic values include “Widget”, “Trenton”, “Gadget”, “Paris”, “Smith”, “Manager”, “Jones”, “Dupont”, “Sales”, “on target”, “below target”]

Lower-Bound Schemas [Figure: schema graph with nodes Root, Company, Employee, string and edges labeled company, person, works-for, c.e.o., address, name, managed-by, name]

Upper Bound Schemas [Figure: schema graph with nodes Root, Company, Employee, string, Any and edges labeled company, person, works-for, c.e.o. | employee, name | address | url, managed-by, name | phone | position, description, -]

The Two Questions to Ask Conformance: does that data conform to this schema ? Classification: if so, then which objects belong to what classes ?

Graph Simulation Definition Two edge-labeled graphs G1, G2 A simulation is a relation R between nodes: if (x1, x2) in R, and (x1,a,y1) in G1, then there exists (x2,a,y2) in G2 (same label) s.t. (y1,y2) in R [Figure: x1 R x2 and y1 R y2, with an a-labeled edge from x1 to y1 in G1 and from x2 to y2 in G2] Note: a simulation can be efficiently computed [Henzinger et al. 1995]

Using Simulation Data graph D, schema S upper bound schema: –conformance: find simulation R from D to S –classification: check if (x,c) in R lower bound schema –conformance: find simulation R from S to D –classification: check if (c,x) in R [Buneman et al 1997]
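Conformance to an upper-bound schema can be sketched by computing the largest simulation from the data graph to the schema graph. The code below uses a naive fixpoint (start from all pairs, delete pairs that violate the definition); it is not the efficient algorithm cited above. The data graph, the schema, exact label matching (no wildcards or alternations) and the name `max_simulation` are assumptions.

```python
def max_simulation(g1, g2):
    """Largest simulation from g1 to g2.

    Each graph maps a node to a list of (label, target) edges.
    """
    rel = {(x1, x2) for x1 in g1 for x2 in g2}
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (a, y1) in g1[x1]:
                # x2 must match every a-edge of x1 with an a-edge to a related node.
                if not any(b == a and (y1, y2) in rel for (b, y2) in g2[x2]):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel

data = {
    "r": [("company", "c1"), ("person", "p1")],
    "c1": [("name", "s1")],
    "p1": [("name", "s2"), ("works-for", "c1")],
    "s1": [], "s2": [],
}
schema = {
    "Root": [("company", "Company"), ("person", "Employee")],
    "Company": [("name", "string")],
    "Employee": [("name", "string"), ("works-for", "Company"), ("phone", "string")],
    "string": [],
}

R = max_simulation(data, schema)
conforms = ("r", "Root") in R        # conformance: the roots are related
is_employee = ("p1", "Employee") in R  # classification: check (x, c) in R
```

For a lower-bound schema the same function is called with the arguments swapped, computing a simulation from the schema to the data.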

Example [Figure: three panels, Database, Lower Bound and Upper Bound, showing the example database next to its lower-bound and upper-bound schemas] simulation: efficient technique for checking conformance to schema

Application 1: Improve Secondary Storage Lower-bound schema [Figure: the lower-bound schema with nodes Root, Company, Employee, string; classes Company and Employee are highlighted] Store rest in overflow graph

Application 2: Query Optimization Upper-bound schema [Figure: schema graph rooted at Bib with paper and book; edges year, journal, title, address, author, zip, city, street, last name, first name; atomic types int and string] select X.title from Bib._ X where X.*.zip = “12345” is optimized to select X.title from Bib.book X where X.address.zip = “12345” [Fernandez, Suciu 1998]

Schema Extraction (From Data) Problem statement given data instance D find the “most specific” schema S for D In practice: S too large, need to relax [Nestorov et al. 1998]

Schema Extraction: Sample Data [Figure: data graph with objects &r, &c and &p1 to &p8; edges labeled company, employee, worksfor, manages, managedby]

Lower Bound Schema Extraction Root: &r Bosses: &p1, &p4, &p6 Regulars: &p2, &p3, &p5, &p7, &p8 Company: &c [Figure: schema graph over these classes with edges company, employee, manages, managedby, worksfor, employee]

Upper Bound Schema Extraction: Data Guides Root: &r Employees: &p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8 Bosses: &p1, &p4, &p6 Regulars: &p2, &p3, &p5, &p7, &p8 Company: &c [Figure: data guide over these classes with edges company, employee, manages, managedby, manages, managedby, worksfor]
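A data guide can be built by determinizing the data graph, in the style of the automata subset construction: each guide node is the set of data objects reached by some label path, so every label path of the data appears exactly once in the guide. A minimal Python sketch follows; the small graph is hypothetical and is not the sample data of the slides.

```python
def data_guide(graph, root):
    """Determinize the data graph.

    Returns a dict mapping each guide node (a frozenset of data oids)
    to a dict {label: guide node}.
    """
    start = frozenset([root])
    guide = {}
    todo = [start]
    while todo:
        state = todo.pop()
        if state in guide:
            continue  # this set of objects was already expanded
        by_label = {}
        for oid in state:
            for label, target in graph.get(oid, []):
                by_label.setdefault(label, set()).add(target)
        guide[state] = {lbl: frozenset(ts) for lbl, ts in by_label.items()}
        todo.extend(guide[state].values())
    return guide

# Hypothetical data: p1 manages p2 and p3; everybody works for company c.
graph = {
    "r": [("company", "c"), ("employee", "p1"), ("employee", "p2"), ("employee", "p3")],
    "c": [],
    "p1": [("manages", "p2"), ("manages", "p3"), ("worksfor", "c")],
    "p2": [("managedby", "p1"), ("worksfor", "c")],
    "p3": [("managedby", "p1"), ("worksfor", "c")],
}
guide = data_guide(graph, "r")
```

Here the guide has one node for all employees, one for the objects reached by manages and one for those reached by managedby, which mirrors the Employees, Regulars and Bosses classes above. In the worst case the construction is exponential in the size of the data.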

Schemas in XML Document Type Definition (DTD) XML Schema RDF Schema

Document Type Definition: DTD part of the original XML specification an XML document may have a DTD terminology for XML: –well-formed: if tags are correctly closed –valid: if it has a DTD and conforms to it validation is useful in data exchange

DTDs as Grammars <!DOCTYPE paper [ ]> …

DTDs as Schemas Not so well suited: impose unwanted constraints on order references cannot be constrained can be too vague:

XML Schemas very recent proposal unifies previous schema proposals generalizes DTDs uses XML syntax two documents: structure and datatypes

XML Schemas DTD:

RDF Schemas (3/99) object-oriented flavor

RDF Schemas recall RDF data: –resources –properties RDF schema: –classes –properties subject object predicate statement

RDF Schemas Data: My Honda 50000

RDF Schemas Schema:

RDF Schemas Truck MotorVehicle car001 type Class type subClassOf name miles My Honda 50000

RDF Schemas different from object-oriented systems: –OO: define a class by set of properties –RDF: define a property in terms of its classes metadata in RDF: –an RDF schema is itself described as RDF data

Summary of Schemas in SS data: –graph theoretic –data and schema are decoupled –used in data processing in XML –from grammar to object-oriented –schema wired with the data –emphasis on semantics for exchange

Part 4 Systems Issues Semistructured data and XML Query languages Schemas Systems issues Conclusions

Systems Issues servers mediators

Servers for Semistructured Data / XML storage index query evaluation [McHugh, Widom 1999]

XML Storage text file (XML) store in ternary relation use DTD to derive schema mine data to derive schema build special purpose repository (Lore)

XML Storage: Text File advantages –simple –less space than one thinks –reasonable clustering disadvantage –no updates –require special purpose query processor

Store XML in Ternary Relation [Florescu, Kossmann 1999] [Figure: data graph with objects &o1 to &o6, edges paper, title, author, year and values “The Calculus”, “…”, “1986”, stored in the tables Ref and Val]
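The idea of storing a document as edges can be sketched as follows: one table holds (source, label, target) triples, and a second holds the atomic values. This is a simplified sketch, not the cited storage scheme in full; the column layout and the function name `shred` are assumptions, and attributes, mixed content and sibling order are ignored.

```python
import itertools
import xml.etree.ElementTree as ET

def shred(xml_text):
    """Shred an XML document into Ref(source, label, target) and Val(node, value)."""
    counter = itertools.count(1)
    ref, val = [], []

    def visit(elem, oid):
        text = (elem.text or "").strip()
        if text:
            val.append((oid, text))       # leaf value of this node
        for child in elem:
            child_oid = f"o{next(counter)}"
            ref.append((oid, child.tag, child_oid))  # one edge per child element
            visit(child, child_oid)

    root = ET.fromstring(xml_text)
    root_oid = f"o{next(counter)}"        # the root has no incoming edge
    visit(root, root_oid)
    return ref, val

ref, val = shred(
    "<bib><paper><title>The Calculus</title>"
    "<author>...</author><year>1986</year></paper></bib>"
)
```

Path queries then become self-joins on Ref, which is what makes this layout generic but join-heavy.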

Use DTD to derive Schema DTD: ODMG classes: [Christophides et al. 1994, Shanmugasundaram et al. 1999] class Employee public type tuple (name:string, address:Address, project:List(Project)) class Address public type tuple (street:string, …)

Mine Data to Derive Schema paper author title year fn ln Paper1 Paper2 [Deutsch et al. 1999]

Indexing Semistructured Data coercions: 1995 v.s. “1995” regular path expressions –data guides [Goldman, Widom, 1997] –T-indexes [Milo, Suciu, 1999]

Indexing All Paths in the Data [Figure: three panels, Semistructured Data, Data Guide and T-Index, showing a data graph with edges labeled t, a, b, c, d and the two indexes built from it]

Mediators for Semistructured Data / XML XML = virtual view of Relational/OO/OR sources mediator = translation, integration issues: –query composition and rewriting [Papakonstantinou et al.] –limited source capabilities [Yerneni et al. 1999]

Example: An XML Mediator relational database: tables Store, SB, Book virtual XML view: n1 ... n2 ... …

Example: An XML Mediator specify mediator declaratively (a view): from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid construct Store.name Book.title

Example: An XML Mediator users ask XML-QL queries: –find stores that sell “The Calculus” where $n The Calculus construct $n

Example: An XML Mediator system composes query with view: from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus” construct Store.name
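The composed query is an ordinary relational join, so the source database can evaluate it directly and the mediator only has to wrap the answer in XML. A minimal sketch with Python's standard `sqlite3` follows; the column types, the sample rows and the output tag name store are assumptions, since the slides do not show them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Store (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Book  (bid INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE SB    (sid INTEGER, bid INTEGER);
    INSERT INTO Store VALUES (1, 'n1'), (2, 'n2');
    INSERT INTO Book  VALUES (10, 'The Calculus'), (11, 'The Comma');
    INSERT INTO SB    VALUES (1, 10), (1, 11), (2, 11);
""")

# View (the join) composed with the user query (the title selection).
composed = """
    SELECT Store.name
    FROM Store, SB, Book
    WHERE Store.sid = SB.sid AND SB.bid = Book.bid AND Book.title = ?
"""
names = [row[0] for row in conn.execute(composed, ("The Calculus",))]

# The mediator tags the relational answer; the tag name is hypothetical.
xml_result = "".join(f"<store>{n}</store>" for n in names)
```

The virtual XML view is never materialized: only the rows that answer the user query leave the database.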

Summary of Systems unclear today how XML will be used –materialized ? Need servers –virtual ? Need mediators most work is still ahead

Part 5 Conclusions Semistructured data and XML Query languages Schemas Systems issues Conclusions

Summary XML = what is out there semistructured data = what we can process paradigm shift, for both Web and db covered in tutorial: – data models, queries, schemas

Current and Future Technologies Web applications possible today: –export relational data to XML (e.g. Oracle) –import XML directly into applications Web applications in the future: –mediator technology (XML view) –store/process native XML data –compress XML –mine/analyze XML

Why This Is Cool for Database Researchers put to work what you teach in CS101 ! –tree traversals (structural recursion, XSL) –automata theory (DTD’s, path expressions) –graph theory (simulation) adapt old DB tricks to new kind of data save the trees: from fax to XML The End

Further Readings www.w3.org/XML www-db.stanford.edu/~widom www-rocq.inria.fr/~abiteboul db.cis.upenn.edu Abiteboul, Buneman, Suciu, Data on the Web: From Relational to Semistructured to XML, Morgan Kaufmann, 1999 (appears in October)