Managing XML and Semistructured Data, Part 1: Preliminaries, Motivation and Overview


1 Managing XML and Semistructured Data
Part 1: Preliminaries, Motivation and Overview
Acknowledgement: Part of the materials in this set of XML slides are extracted from Prof. Dan Suciu's course materials. Thanks for his permission to use them.
COMP630L Topics in DB Systems: Managing Web Data, Fall 2007, Dr Wilfred Ng

2 HTML, XML, SGML
• XML: a W3C standard to complement HTML
• Origins: structured text (SGML)
• Motivation:
  - HTML describes presentation
  - XML describes content
• W3C XML 1.0 Recommendation (version 2, 10/2000)

3 HTML
Bibliography
  Foundations of Databases. Abiteboul, Hull, Vianu. Addison Wesley, 1995
  Data on the Web. Abiteboul, Buneman, Suciu. Morgan Kaufmann, 1999
HTML describes the presentation

4 XML
The same bibliography with one element per field: Foundations…, Abiteboul, Hull, Vianu, Addison Wesley, 1995, …
XML describes the content

5 SGML and XML
• eXtensible Markup Language (XML)
• XML 1.0: a W3C Recommendation, 1998
• Related XML standards from W3C groups: XSL, XQuery, …
• Roots: SGML
  - Markup = encoding: possibilities for expressing information
  - Information modelling freedom, reusability, provability, validity
  - But a very nasty language
• SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form
• After the roots: a format for sharing data (on the Web)

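6 Basic XML Terminology
• tags: book, title, author, …
• start tag: <book>, end tag: </book>
• elements: everything from a start tag to its matching end tag, e.g. <book>…</book>, <author>…</author>
• elements are nested
• empty element: a start tag immediately followed by its end tag, abbreviated as a single self-closing tag
• an XML document: single root element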
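The terminology above can be checked on a small document. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element names follow the slide's tag list, while the sample values and the empty `<note/>` element are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names come from the slide's tag list,
# the values and the empty <note/> element are invented for illustration.
doc = """<book>
  <title>Data on the Web</title>
  <author>Abiteboul</author>
  <author>Buneman</author>
  <note/>
</book>"""

# Parsing raises ParseError unless the document has exactly one root element.
root = ET.fromstring(doc)

# Tags of the elements nested directly inside the root, in document order.
child_tags = [child.tag for child in root]

# An empty element has neither text nor child elements.
empty_tags = [c.tag for c in root if c.text is None and len(c) == 0]

print(root.tag, child_tags, empty_tags)
```

Appending a second top-level element to `doc` makes `ET.fromstring` fail, which is the "single root element" rule in practice.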

7 The Role of XML Data
• XML is designed for data exchange, not to replace relational or E/R data
• Sources of XML data:
  - Created manually with text editors: not really data
  - Generated automatically from relational data
  - Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical)
  - Stored/processed in native XML engines: very few applications need that today

8 XML Advantages for Web Data
• Over SGML:
  - Supported by mainstream browsers such as IE and Netscape
  - Standard stylesheet
  - Standard linking
• Over HTML:
  - Interchangeable
  - Searchable
  - Reusable
  - Enables automation

9 Why XML is of Interest to Us
• HTML cannot specify structure: its tags describe appearance only
• XML is a language syntax for data
  - Note: we have no language syntax for relational data
  - But XML is not relational: it is semistructured
• This is exciting because:
  - Can translate any data to XML
  - Can ship XML over the Web (HTTP)
  - Can input XML into any application
  - Thus: make data sharing and exchange on the Web possible!
• 40% annual growth and accelerating: publishers, government, education, …

10 Relational Data in XML
Table person:
  name | phone
  John | 3634
  Sue  | 6343
  Dick | 6363
XML: a tree with root person, one row child per tuple, and name and phone children under each row (leaves "John", 3634, "Sue", 6343, "Dick", 6363)
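As a rough illustration of this mapping, the sketch below turns the three tuples into the tree described on the slide, using Python's standard `xml.etree.ElementTree`. The tag names `person`, `row`, `name` and `phone` are taken from the slide; everything else is one possible way to write it.

```python
import xml.etree.ElementTree as ET

# The three tuples of the person table from the slide.
rows = [("John", 3634), ("Sue", 6343), ("Dick", 6363)]

# Root element for the relation, one <row> per tuple, one child per attribute.
person = ET.Element("person")
for name, phone in rows:
    row = ET.SubElement(person, "row")
    ET.SubElement(row, "name").text = name
    ET.SubElement(row, "phone").text = str(phone)

xml_text = ET.tostring(person, encoding="unicode")
print(xml_text)
```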

11 XML Data: more expressive
• Missing attributes: person John has phone 1234, person Joe has no phone!
• Could represent in a table with nulls:
  name | phone
  John | 1234
  Joe  | -

12 XML Data: more expressive
• Repeated attributes: person Mary has two phones!
• Impossible in tables:
  name | phone
  Mary | ???

13 XML Data: more expressive
• Attributes with different types in different objects, e.g. a person with name (John, Smith) and phone 1234: structured name!
• Nested collections (no 1NF)
• Heterogeneous collections: one parent element may contain children of different element types
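The three "more expressive" cases (missing, repeated and structured attributes) can sit side by side in one document. The following sketch assumes the tag names `persons`, `person`, `name`, `phone`, `first` and `last`; Mary's two phone numbers are invented, since the slides do not preserve them.

```python
import xml.etree.ElementTree as ET

# Person names come from the slides; Mary's phone numbers and all tag names
# are assumptions for illustration.
doc = """<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
  <person><name><first>John</first><last>Smith</last></name><phone>1234</phone></person>
</persons>"""

root = ET.fromstring(doc)

# Number of phone elements per person: 0 = missing, 2 = repeated.
phones_per_person = [len(p.findall("phone")) for p in root]

# True where the name element has child elements instead of plain text.
structured = [len(p.find("name")) > 0 for p in root]

print(phones_per_person, structured)
```

A single flat table cannot hold all four persons without nulls, extra rows or extra columns, which is the point the slides make.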

14 XML Data Sharing and Exchange
[Figure: applications and data sources (relational data, object-relational data, legacy data) exchange XML data over the Web (HTTP). Specific data management tasks: transform, integrate, warehouse.]

15 From Relational Data to XML Data
Table persons:
  name | phone
  John | 3634
  Sue  | 6343
  Dick | 6363
XML, the data tree and its document: root persons, one row child per tuple, and name and phone children under each row (leaves "John", 3634, "Sue", 6343, "Dick", 6363)

16 XML Data
• XML is self-describing
• Schema elements become part of the data
  - Relational schema: persons(name, phone)
  - In XML, the tags persons, name and phone are part of the data, and are repeated many times!
• Consequence: XML is much more flexible
• XML data is semistructured data

17 XML from/to Relational Data
• XML publishing: relational data → XML
• XML storage: XML → relational data
• Relational data is regular in its XML tree representation, but XML data can be non-relational!
[Figure: Supplier and Broker connected through a "Common XML Illusion" (? ?)]

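18 XML Publishing: sketch of the idea
[Figure: the Web application sends XPath/XQuery to the XML publishing layer and receives XML; the publishing layer sends SQL to the relational database and receives tuple streams.]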

19 XML Publishing
• Exporting the data is relatively easy: we do this already for HTML
• Translating XQuery → SQL is hard
• XML publishing systems:
  - Research: XPERANTO (IBM/DB2) and SilkRoute (AT&T Labs and UW; previously unavailable, now downloadable). Both translate XQuery → SQL.
  - Commercial: SQL Server, Oracle. Only XPath → SQL, and with restrictions.
• A middleware approach

20 XML Publishing
• Backend relational engine
• Relational schema:
  Student(sid, name, address)
  Course(cid, title, room)
  Enroll(sid, cid, grade)
• Schemas are often proprietary, but XML schemas are public
[Figure: E/R diagram with student, enroll, course]
• Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]

21 XML Publishing
XML view, grouped by course:
  course: Operating Systems, MGH084
    student: John, Seattle, 3.8
    …
  course: Database, EE045
    student: Mary, Shoreline, 3.9
    …
  …
• Group by courses: redundant representation of students
• Other representations are possible too: in general, tags need not be the same as the attribute names

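22 XML Publishing
• First thing to do: design the DTD (public for users)
• Second thing to do: develop an XML canonical view under the db root
  - For each relation {t1, …, tk} with schema R(A1, …, An), the view has an R element containing one row element per tuple ti
  - Each row has children A1, …, An holding the values ai1, …, ain (a11 … for the first row, ak1 … for the last)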
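The canonical view can be produced mechanically from any relation. The sketch below is an assumption-laden illustration, not SilkRoute's implementation: it uses an in-memory SQLite database, a hypothetical helper named `canonical_view`, and an invented key value `c1`; the layout db / relation / row / attribute follows the paths used on slide 23.

```python
import sqlite3
import xml.etree.ElementTree as ET

def canonical_view(conn, tables):
    """Build <db><R><row><A1>..</A1>..</row>..</R>..</db> for the given tables."""
    db = ET.Element("db")
    for table in tables:
        rel = ET.SubElement(db, table)
        # Table names are interpolated directly; they are trusted in this sketch.
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        for tup in cur:
            row = ET.SubElement(rel, "row")
            for col, val in zip(cols, tup):
                ET.SubElement(row, col).text = str(val)
    return db

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Course(cid, title, room)")
# 'c1' is an invented key; title and room come from slide 21.
conn.execute("INSERT INTO Course VALUES ('c1', 'Operating Systems', 'MGH084')")

view_text = ET.tostring(canonical_view(conn, ["Course"]), encoding="unicode")
print(view_text)
```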

23 Now we write an XQuery to export relational data → XML
  <xmlview>
  { FOR $x IN /db/Course/row
    RETURN
      <course>
        <title> { $x/title/text() } </title>
        <room> { $x/room/text() } </room>
        { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()],
              $z IN /db/Student/row[sid/text() = $y/sid/text()]
          RETURN
            <student>
              <name> { $z/name/text() } </name>
              <address> { $z/address/text() } </address>
              <grade> { $y/grade/text() } </grade>
            </student> }
      </course> }
  </xmlview>
Note: the result should conform to the DTD (see slide 20 for the relational schema; the generated result is shown in slide 21).
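To make the nesting of the export query concrete, here is a sketch of the same loops in Python over a tiny canonical view. It is only an illustration of the query's logic: the output tag names `xmlview`, `course`, `title`, `room`, `student`, `name`, `address` and `grade` are assumptions based on the paths in slides 23 and 24, and the `sid`/`cid` key values are invented.

```python
import xml.etree.ElementTree as ET

# Tiny canonical view; the sid/cid key values are invented for illustration.
db = ET.fromstring("""<db>
  <Student><row><sid>s1</sid><name>John</name><address>Seattle</address></row>
           <row><sid>s2</sid><name>Mary</name><address>Shoreline</address></row></Student>
  <Course><row><cid>c1</cid><title>Operating Systems</title><room>MGH084</room></row>
          <row><cid>c2</cid><title>Database</title><room>EE045</room></row></Course>
  <Enroll><row><sid>s1</sid><cid>c1</cid><grade>3.8</grade></row>
          <row><sid>s2</sid><cid>c2</cid><grade>3.9</grade></row></Enroll>
</db>""")

def child(parent, tag, text):
    """Append <tag>text</tag> to parent."""
    ET.SubElement(parent, tag).text = text

xmlview = ET.Element("xmlview")
for x in db.findall("Course/row"):              # FOR $x IN /db/Course/row
    course = ET.SubElement(xmlview, "course")
    child(course, "title", x.findtext("title"))
    child(course, "room", x.findtext("room"))
    for y in db.findall("Enroll/row"):          # $y: Enroll rows with the same cid
        if y.findtext("cid") != x.findtext("cid"):
            continue
        for z in db.findall("Student/row"):     # $z: Student rows with the same sid
            if z.findtext("sid") != y.findtext("sid"):
                continue
            student = ET.SubElement(course, "student")
            child(student, "name", z.findtext("name"))
            child(student, "address", z.findtext("address"))
            child(student, "grade", y.findtext("grade"))

# Student names grouped under each course title.
grouped = {c.findtext("title"): [s.findtext("name") for s in c.findall("student")]
           for c in xmlview}
print(grouped)
```

The nested loops are a naive join; a publishing system would push this work into SQL rather than iterate over the canonical view.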

24 XML Publishing
Query: find Mary's grade in Operating Systems
XQuery over public view:
  FOR $x IN /xmlview/course[title/text()="Operating Systems"],
      $y IN $x/student[name/text()="Mary"]
  RETURN $y/grade/text()
SQL over database schema:
  SELECT Enroll.grade
  FROM Student, Enroll, Course
  WHERE Student.name = 'Mary' AND Course.title = 'Operating Systems'
    AND Student.sid = Enroll.sid AND Enroll.cid = Course.cid
SilkRoute does this automatically
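The SQL side of this translation can be tried directly. The sketch below loads the schema from slide 20 into an in-memory SQLite database and runs the three-way join; the key values and Mary's Operating Systems grade of 3.5 are invented sample data, and the course title is spelled out in full so that it matches the stored row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student(sid, name, address);
CREATE TABLE Course(cid, title, room);
CREATE TABLE Enroll(sid, cid, grade);
-- Keys and the last Enroll row are invented sample data.
INSERT INTO Student VALUES ('s1', 'John', 'Seattle'), ('s2', 'Mary', 'Shoreline');
INSERT INTO Course VALUES ('c1', 'Operating Systems', 'MGH084'), ('c2', 'Database', 'EE045');
INSERT INTO Enroll VALUES ('s1', 'c1', 3.8), ('s2', 'c2', 3.9), ('s2', 'c1', 3.5);
""")

# SQL over the database schema: Mary's grade in Operating Systems.
grades = [g for (g,) in conn.execute("""
    SELECT Enroll.grade
    FROM Student, Enroll, Course
    WHERE Student.name = 'Mary' AND Course.title = 'Operating Systems'
      AND Student.sid = Enroll.sid AND Enroll.cid = Course.cid
""")]
print(grades)
```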

25 XML Publishing
How do we choose the output structure?
• Determined by agreement with our partners, or dictated by committees that define XML dialects (called applications) and produce DTDs
• The DTD is selective with respect to the underlying databases
• XML data is often nested, irregular, etc. That is why we need the xmlview query in slide 23 to do the transformation
• There is no agreed normal form for XML, though there is some work on it, e.g. [M. Arenas and L. Libkin, A Normal Form for XML Documents]

26 XML Storage
• Often the XML data is small and is parsed directly into the application (DOM API): file-based management (pros and cons?)
• Sometimes it is big, and we need to store it in a database
• Much harder than XML publishing (why?)
• A fundamental XML storage problem: how do we choose the schema of the database?
• Possible solutions:
  - Schema derived from DTD: structure mapping
  - Storing XML as a graph, such as an "Edge relation": model mapping
  - Native approach: native XML/SSD engine

27 XML Storage in a Relational DB
• Use generic schema [Florescu, Kossmann 1999]
• Use DTD to derive schema [Shanmugasundaram, et al. 1999]
• Use data mining to derive schema [Deutsch, Fernandez, Suciu 1999]
• Use the Path table [T. Amagasa, T. Shimura, S. Uemura 2001]

28 XML Storage: Edge Relation
• [Florescu, Kossmann 1999]
• Monet [Schmidt et al. WebDB 00]
• Structure-based approach
• Mapping the (DOM) tree to relations
• Use generic relational schemas (independent of the XML schema):
  Ref(source, label, dest)
  Val(node, value)

29 XML Storage: Edge Relation [Florescu, Kossmann 1999]
[Figure: data tree with nodes &o1 to &o6 and edge labels paper, title, author, year; leaf values "The Calculus", "…", "1986"; the tree is stored in the tables Ref and Val]
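One way to picture the Ref and Val tables is to shred a small tree by hand. The sketch below is an illustration under assumptions, not the algorithm from the cited paper: `shred` is a hypothetical helper, node ids are plain integers standing in for &o1, &o2, …, and the author values "A" and "B" are placeholders because the slide elides them.

```python
import itertools
import xml.etree.ElementTree as ET

def shred(root):
    """Return (ref, val): ref rows are (source, label, dest), val rows are (node, value)."""
    ids = itertools.count(1)
    ref, val = [], []

    def visit(elem, source):
        dest = next(ids)
        ref.append((source, elem.tag, dest))     # one edge per element
        if len(elem) == 0 and elem.text is not None:
            val.append((dest, elem.text))        # leaves carry the data values
        for c in elem:
            visit(c, dest)

    top = next(ids)   # id of the node above the root element (&o1 in the slide)
    visit(root, top)
    return ref, val

# Author values "A" and "B" are placeholders for the elided "…" on the slide.
paper = ET.fromstring(
    "<paper><title>The Calculus</title><author>A</author>"
    "<author>B</author><year>1986</year></paper>")
ref, val = shred(paper)
print(ref)
print(val)
```

Every document, whatever its tags, lands in the same two tables, which is what makes the schema generic.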

30 XML Storage: Edge Relation
• In practice we may need more tables for reference links and nodes:
  RefTag1(source, dest)
  RefTag2(source, dest)
  …
  IntVal(node, intVal)
  RealVal(node, realVal)
  …
RefTag1:
  Source | Dest
  &o2    | &o7
  &o5    | &o8
  &o6    | &o8
[Figure: data tree with nodes &o1 to &o8 and edge labels paper, title, author, year]

31 XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994]
[Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
• Basic idea: use the XML schema to derive the relational schema
• From the DTD, derive the relational schemas:
  Paper(pid, title, year)
  Author(aid, pid, firstName, lastName)

32 XML Storage: DTD to Schema
• Each element corresponds to a relation
• Each attribute of an element corresponds to a column of the relation
• Connect elements using foreign keys
• But the problem: fragmentation!
• Example: how many relations must be joined to find the address of an author?
  Paper(pid, title, year)
  Author(aid, pid, firstName, lastName)
  Address(addid, aid, country)
  City(cid, addid, postcode, cityname)
  Street(sid, addid, postcode, streetname)
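The fragmentation problem shows up as soon as the address is queried. The sketch below creates the five relations from the slide in an in-memory SQLite database and joins four of them to assemble one author's address; all rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Paper(pid, title, year);
CREATE TABLE Author(aid, pid, firstName, lastName);
CREATE TABLE Address(addid, aid, country);
CREATE TABLE City(cid, addid, postcode, cityname);
CREATE TABLE Street(sid, addid, postcode, streetname);
-- All rows below are invented sample data.
INSERT INTO Paper VALUES (1, 'A Paper', 1999);
INSERT INTO Author VALUES (10, 1, 'Ann', 'Lee');
INSERT INTO Address VALUES (100, 10, 'US');
INSERT INTO City VALUES (1000, 100, '98101', 'Seattle');
INSERT INTO Street VALUES (2000, 100, '98101', 'Pine St');
""")

# One logical "address of an author" needs Author, Address, City and Street.
address = conn.execute("""
    SELECT s.streetname, c.cityname, c.postcode, a.country
    FROM Author au
    JOIN Address a ON a.aid = au.aid
    JOIN City c    ON c.addid = a.addid
    JOIN Street s  ON s.addid = a.addid
    WHERE au.lastName = 'Lee'
""").fetchone()
print(address)
```

Restricting the lookup to authors of a given paper would add Paper as a fifth relation in the join.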

33 XML Storage: Path Relations
[T. Amagasa, T. Shimura, S. Uemura 2001], ACM TOIT 1(1)
• Store paths as strings (model-based approach)
• XPath expressions are evaluated with the SQL LIKE operator
• Additional information is kept for parent/child and ancestor/descendant relationships
• XRel: table schemas for paths, elements and values
• XParent: table schemas for elements, label paths and parents (alternative but similar)

34 XML Storage: Path Relations
Path table:
  pathID | Pathexpr
  1      | #/bib
  2      | #/bib#/paper
  3      | #/bib#/paper#/author
  4      | #/bib#/paper#/title
  5      | #/bib#/paper#/year
  6      | #/bib#/book#/author
  7      | #/bib#/book#/title
  8      | #/bib#/book#/publisher
• One entry for every path in the database
• Relatively small
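A path table like this lets simple XPath expressions be answered with string matching. The sketch below is written in the spirit of the path-based approach, not copied from XRel: the helpers `label_paths`, `xpath_to_like` and `match` are hypothetical, the translation handles only `/` and `//` steps, and the sample document is an assumption that mirrors the tags in the table.

```python
import sqlite3
import xml.etree.ElementTree as ET

def label_paths(elem, prefix=""):
    """Yield the '#/'-separated label path of every element, parents first."""
    path = f"{prefix}#/{elem.tag}"
    yield path
    for c in elem:
        yield from label_paths(c, path)

def xpath_to_like(xpath):
    """Translate an XPath with only / and // steps into a SQL LIKE pattern."""
    # '//' becomes '#%/' (any number of intermediate steps), '/' becomes '#/'.
    return xpath.replace("//", "\x00").replace("/", "#/").replace("\x00", "#%/")

# Assumed sample document with the same tags as the path table.
bib = ET.fromstring(
    "<bib><paper><author/><title/><year/></paper>"
    "<book><author/><title/><publisher/></book></bib>")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path(pathID INTEGER PRIMARY KEY, pathexpr TEXT UNIQUE)")
for p in label_paths(bib):
    conn.execute("INSERT OR IGNORE INTO Path(pathexpr) VALUES (?)", (p,))

def match(xpath):
    """Return the stored path expressions selected by the XPath."""
    cur = conn.execute(
        "SELECT pathexpr FROM Path WHERE pathexpr LIKE ? ORDER BY pathID",
        (xpath_to_like(xpath),))
    return [r[0] for r in cur]

print(match("//author"))
print(match("/bib//title"))
```

Because every step starts with `#/`, the pattern for `//title` cannot accidentally match a tag that merely ends in "title".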

35 XML Storage: Path Relations
Element table: Element(NodeID, PathID, Start, End, ParentID)
• Start and End are positions in the document
• One entry for every element in the database
• Relatively large
Val table:
  NodeID | Val
  3      | Smith
  4      | Vance
  5      | Tim
  6      | Wallace
  7      | The Best…
• One entry for every leaf in the database
• Relatively large
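The Start and End positions are what make ancestor/descendant tests cheap: an element is an ancestor of another exactly when its interval contains the other's. The sketch below illustrates this with a simple counter instead of real document offsets, which is an assumption of this example; `regions` and `is_ancestor` are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

def regions(root):
    """Assign each element a (start, end) interval from a depth-first walk."""
    counter = 0
    table = []   # rows of [nodeID, tag, start, end, parentID]

    def visit(elem, parent_id):
        nonlocal counter
        node_id = len(table) + 1
        counter += 1                      # position where the element opens
        row = [node_id, elem.tag, counter, None, parent_id]
        table.append(row)
        for c in elem:
            visit(c, node_id)
        counter += 1                      # position where the element closes
        row[3] = counter

    visit(root, None)
    return [tuple(r) for r in table]

def is_ancestor(a, d):
    """a is an ancestor of d iff a's interval strictly contains d's interval."""
    return a[2] < d[2] and d[3] < a[3]

rows = regions(ET.fromstring("<bib><paper><author/><title/></paper><book/></bib>"))
by_tag = {r[1]: r for r in rows}
print(rows)
```

In SQL the same test becomes a pair of comparisons on the Start and End columns, so no recursive traversal of parent links is needed.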

36 XRel Example: Path Table
• Contains all path strings: Path(pathID, pathexp)
Path table:
  PathID | PathExp
  0      | #/PLAY
  1      | #/PLAY#/ACT
  2      | #/PLAY#/ACT#/SCENE
  3      |
  4      | #/PLAY#/ACT#/SCENE#/TITLE
  5      | #/PLAY#/ACT#/SCENE#/SPEECH
  …      | …
[Figure: document tree with numbered nodes: root 1, PLAY 2, ACT 3, SCENE 4, id 5 ("000"), TITLE 6 ("Intro"), SPEECH 7, SPEAKER 8 ("CURIO"), TEXT 9 ("This is …"); further ACT and SCENE nodes elided]

37 XRel Example: Other Tables
• Element table: Element(docID, PathID, start, end, index, reindex)
• Attribute table: Attribute(docID, PathID, start, end, value), with the value "000" for the id attribute
• Text table: Text(docID, PathID, start, end, value), with the values "Intro", "CURIO", "This is …"
• Path table: Path(PathID, PathExp), e.g. 0 | #/PLAY

38 Storing XML as a Graph
• Every XML instance is a tree
• Hence we can store it as any graph, using an Edge table
• In addition we need a Value table to store the data values (#PCDATA)
Edge relation summary:
• Same relational schema for every XML document:
  Edge(Source, Tag, Dest)
  Value(Source, Val)
• Generic: works for every XML instance
• But inefficient:
  - Tags are repeated multiple times
  - Many joins are needed to reconstruct the data

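39 Storing XML as a Graph
[Figure: tree with root db, two book children and one publisher child. First book: title "Complete Guide to DB2", author "Chamberlin". Second book: title "Transaction Processing", authors "Bernstein", "Newcomer". Publisher: "Morgan Kaufman", state "CA".]
Edge:
  Source | Tag    | Dest
  0      | db     | 1
  1      | book   | 2
  2      | title  | 3
  2      | author | 4
  1      | book   | 5
  5      | title  | 6
  5      | author | 7
  ...
Value:
  Source | Val
  3      | Complete guide...
  4      | Chamberlin
  6      | ...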

40 Storing XML as a Graph
What happens to queries:
XQuery:
  FOR $x IN /db/book[author/text()="Chamberlin"]
  RETURN $x/title
SQL over the Edge and Value tables:
  SELECT vtitle.val
  FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
  WHERE xdb.source = 0 AND xdb.tag = 'db'
    AND xdb.dest = xbook.source AND xbook.tag = 'book'
    AND xbook.dest = xauthor.source AND xauthor.tag = 'author'
    AND xbook.dest = xtitle.source AND xtitle.tag = 'title'
    AND xauthor.dest = vauthor.source AND vauthor.val = 'Chamberlin'
    AND xtitle.dest = vtitle.source
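The cost of the generic schema is easiest to see by running the query. The sketch below loads the Edge rows from slide 39 into an in-memory SQLite database and evaluates the six-way join; the value column is named `val`, following the Value(Source, Val) schema on slide 38, and the values stored for nodes 6 and 7 are taken from the figure as an assumption about which node holds which string.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Edge(source, tag, dest);
CREATE TABLE Value(source, val);
INSERT INTO Edge VALUES (0, 'db', 1), (1, 'book', 2), (2, 'title', 3), (2, 'author', 4),
                        (1, 'book', 5), (5, 'title', 6), (5, 'author', 7);
-- Values for nodes 6 and 7 are assumed from the figure on slide 39.
INSERT INTO Value VALUES (3, 'Complete Guide to DB2'), (4, 'Chamberlin'),
                         (6, 'Transaction Processing'), (7, 'Bernstein');
""")

# Titles of books written by Chamberlin: four Edge aliases and two Value aliases.
titles = [t for (t,) in conn.execute("""
    SELECT vtitle.val
    FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
    WHERE xdb.source = 0 AND xdb.tag = 'db'
      AND xdb.dest = xbook.source AND xbook.tag = 'book'
      AND xbook.dest = xauthor.source AND xauthor.tag = 'author'
      AND xbook.dest = xtitle.source AND xtitle.tag = 'title'
      AND xauthor.dest = vauthor.source AND vauthor.val = 'Chamberlin'
      AND xtitle.dest = vtitle.source
""")]
print(titles)
```

Each extra step in the path expression adds another self-join on Edge, which is the inefficiency the summary on slide 38 refers to.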

41 XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
• Given:
  - One large XML data instance
  - No schema/DTD
  - Query workload
• Problem: find a "good" relational schema for it
• Notice: even when a DTD is present, it may be imprecise, e.g. when a person may have 1-3 phones, the DTD says only phone*

42 XML Storage: Data Mining to Schema
[Deutsch, Fernandez, Suciu 1999]
[Figure: data tree with paper, author, title, year, fn, ln, and two relations, Paper1 and Paper2, each with columns A, B, C, D]

43 Useful References
• Abiteboul, Buneman, Suciu, Data on the Web: from Relations to Semistructured Data and XML. For foundations
• W3C homepage. For current standards
• F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002
• XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
• Tatarinov et al., Storing and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002
• Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, 2001
• D. Florescu, D. Kossmann, Storing and Querying XML Data using an RDBMS, IEEE Data Engineering Bulletin 22(3), 1999
• J. Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999
• C. Zhang et al., On Supporting Containment Queries in Relational Database Management Systems, In Proc. of SIGMOD, 2001