Managing XML and Semistructured Data Part 1: Preliminaries, Motivation and Overview


1 Managing XML and Semistructured Data
Part 1: Preliminaries, Motivation and Overview
Acknowledgement: Part of the materials in this set of XML slides is extracted from Prof. Dan Suciu's course materials. Thanks to him for permission to use them.
COMP630L Topics in DB Systems: Managing Web Data, Fall 2007
Dr Wilfred Ng

2 HTML, XML, SGML
- XML is a W3C standard to complement HTML
- Origins: structured text (SGML)
- Motivation: HTML describes presentation; XML describes content
- http://www.w3.org/TR/2000/REC-xml-20001006 (XML 1.0, second edition, 10/2000)

3 HTML
Bibliography
Foundations of Databases. Abiteboul, Hull, Vianu. Addison Wesley, 1995
Data on the Web. Abiteboul, Buneman, Suciu. Morgan Kaufmann, 1999
HTML describes the presentation

4 XML
Foundations… Abiteboul Hull Vianu Addison Wesley 1995 …
XML describes the content

5 SGML and XML
- eXtensible Markup Language (XML)
- XML 1.0: a W3C Recommendation, 1998
- XML-related standards from W3C groups: XSL, XQuery, …
- Roots: SGML [http://www.w3.org/MarkUp/SGML/]
  - Markup = encoding: possibilities for expressing information
  - Information modelling freedom, reusability, provability, validity
  - But a very nasty language
- SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form
- After the roots: a format for sharing data (on the Web)

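6 Basic XML Terminology
- tags: book, title, author, …
- start tag: <book>, end tag: </book>
- elements: <book>…</book>, <author>…</author>
- elements are nested
- empty element: an element with no content; its start tag and end tag can be abbreviated to a single self-closing tag
- an XML document: single root element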
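The terminology above can be tried out with a small sketch using Python's standard xml.etree.ElementTree module. The document below is hypothetical: the tag names book, title and author follow the slide, while bibliography and note are assumed for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; <bibliography> and <note/> are assumed names.
DOC = """<bibliography>
  <book>
    <title>Foundations of Databases</title>
    <author>Abiteboul</author>
    <author>Hull</author>
    <author>Vianu</author>
    <note/>
  </book>
</bibliography>"""

root = ET.fromstring(DOC)      # a well-formed document has exactly one root element
book = root.find("book")       # elements nest: book is a child of bibliography
authors = [a.text for a in book.findall("author")]

note = book.find("note")       # <note/> abbreviates <note></note>
is_empty = note.text is None and len(note) == 0

print(root.tag, authors, is_empty)
```

Parsing fails with a ParseError if the text has more than one root element or mismatched start and end tags, which is a quick way to check well-formedness.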

7 The Role of XML Data
- XML is designed for data exchange, not to replace relational or E/R data
- Sources of XML data:
  - Created manually with text editors: not really data
  - Generated automatically from relational data
  - Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical)
  - Stored/processed in native XML engines: very few applications need that today

8 XML Advantages for Web Data
- Over SGML:
  - Supported by mainstream browsers such as IE and Netscape
  - Standard stylesheet
  - Standard linking
- Over HTML:
  - Interchangeable
  - Searchable
  - Reusable
  - Enables automation

9 Why XML is of Interest to Us
- HTML fails to meet structure specification: its tags are for appearance only
- XML is a language syntax for data
  - Note: we have no language syntax for relational data
  - But XML is not relational: it is semistructured
- This is exciting because we:
  - Can translate any data to XML
  - Can ship XML over the Web (HTTP)
  - Can input XML into any application
  - Thus: make data sharing and exchange on the Web possible!
- 40% annual growth and accelerating: publishers, government, education, …

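10 Relational Data in XML
Relational table person:
name | phone
John | 3634
Sue | 6343
Dick | 6363
XML: a person root with one row element per tuple, each row holding a name and a phone:
<person>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</person>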

11 XML Data: more expressive
- Missing attributes: John has phone 1234; Joe has no phone!
- Could represent in a table with nulls:
name | phone
John | 1234
Joe | -

12 XML Data: more expressive
- Repeated attributes: Mary has two phones, 2345 and 3456!
- Impossible in tables:
name | phone
Mary | 2345 | 3456 ???

13 XML Data: more expressive
- Attributes with different types in different objects: in "John Smith 1234" the name is itself structured!
- Nested collections (no 1NF)
- Heterogeneous collections: one collection contains elements of different kinds
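The three kinds of irregularity discussed on slides 11 to 13 (a missing phone, a repeated phone, a structured name) can be put side by side in one small sketch. The markup is an assumption, since the tags of the original examples did not survive: persons, person, name, phone, first and last are hypothetical element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the records mentioned on the slides.
DOC = """<persons>
  <person><name>John</name><phone>1234</phone></person>
  <person><name>Joe</name></person>
  <person><name>Mary</name><phone>2345</phone><phone>3456</phone></person>
  <person><name><first>John</first><last>Smith</last></name><phone>1234</phone></person>
</persons>"""

def describe(person):
    """Return (display name, list of phones) for one person element."""
    name = person.find("name")
    if len(name) == 0:
        display = name.text                                # plain-text name
    else:
        display = " ".join(child.text for child in name)   # structured name
    phones = [p.text for p in person.findall("phone")]     # zero, one or many
    return display, phones

records = [describe(p) for p in ET.fromstring(DOC)]
for record in records:
    print(record)
```

A single relational table would need nulls for Joe, could not hold Mary's second phone in one column, and would need a different column type for the structured name, which is the sense in which the data is semistructured.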

14 XML Data Sharing and Exchange
[Figure: applications backed by relational data, object-relational data and legacy data exchange XML Data over the WEB (HTTP). Specific data management tasks: Transform, Integrate, Warehouse.]

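15 From Relational Data to XML Data
Relational table persons:
name | phone
John | 3634
Sue | 6343
Dick | 6363
XML: data tree and its document, with a persons root, one row per tuple, and name and phone children:
<persons>
  <row> <name>John</name> <phone>3634</phone> </row>
  <row> <name>Sue</name> <phone>6343</phone> </row>
  <row> <name>Dick</name> <phone>6363</phone> </row>
</persons>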

16 XML Data
- XML is self-describing
- Schema elements become part of the data
  - Relational schema: persons(name, phone)
  - In XML, <persons>, <name>, <phone> are part of the data, and are repeated many times!
- Consequence: XML is much more flexible
- XML data is semistructured data

17 XML from/to Relational Data
- XML publishing: relational data → XML
- XML storage: XML → relational data
- Relational data is regular in its XML tree representation. But XML data can be non-relational!
[Figure: Supplier and Broker share a common XML illusion]

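18 XML Publishing: ideas sketch
[Figure: Relational Database, XML publishing, Web, Application. SQL and tuple streams flow between the relational database and XML publishing; XPath/XQuery and XML flow between XML publishing and the application over the Web.]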

19 XML Publishing
- Exporting the data is relatively easy: we do this already for HTML
- Translating XQuery → SQL is hard
- XML publishing systems:
  - Research: XPERANTO (IBM/DB2), SilkRoute (AT&T Labs and UW; the software was unavailable, now SilkRoute is downloadable): XQuery → SQL
  - Commercial: SQL Server, Oracle: only XPath → SQL, and with restrictions
- A middleware approach

20 XML Publishing
- Backend relational engine
- Relational schema:
  Student(sid, name, address)
  Course(cid, title, room)
  Enroll(sid, cid, grade)
- Schemas are often proprietary but XML schemas are public
[Figure: student, enroll, course]
- Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]

21 XML Publishing
course: Operating Systems, MGH084
  student: John, Seattle, 3.8
  …
course: Database, EE045
  student: Mary, Shoreline, 3.9
  …
…
- Group by courses: redundant representation of students
- Other representations possible too: tags may not be the same as attributes in general

22 XML Publishing
- First thing to do: design the DTD (public for users)
- Second thing to do: develop an XML canonical view under the db root; for each relation {t1, …, tk} with schema R(A1, …, An):
<R>
  <row> <A1>a11</A1> … </row>
  …
  <row> <A1>ak1</A1> … </row>
</R>
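The canonical view can be sketched in a few lines: every relation becomes an element under a db root, every tuple a row element, and every attribute a child of that row. This is only an illustration using Python's sqlite3 and xml.etree.ElementTree modules, not the implementation of any publishing system; the helper name canonical_view is hypothetical, and the persons data is taken from slide 15.

```python
import sqlite3
import xml.etree.ElementTree as ET

def canonical_view(conn, relations):
    """Build <db><R><row><A1>..</A1>..</row>..</R>..</db> for the listed tables."""
    db = ET.Element("db")
    for rel in relations:
        # Table names cannot be bound as parameters; they come from a trusted list here.
        cursor = conn.execute(f'SELECT * FROM "{rel}"')
        columns = [d[0] for d in cursor.description]
        rel_el = ET.SubElement(db, rel)
        for tup in cursor.fetchall():
            row = ET.SubElement(rel_el, "row")
            for col, value in zip(columns, tup):
                ET.SubElement(row, col).text = str(value)
    return db

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE persons(name, phone);
    INSERT INTO persons VALUES ('John', 3634), ('Sue', 6343), ('Dick', 6363);
""")

view = canonical_view(conn, ["persons"])
xml_text = ET.tostring(view, encoding="unicode")
print(xml_text)
```

Because the view is derived mechanically from the schema, paths such as /db/persons/row/name are known in advance, which is what lets a publishing query be written against it.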

23 Now we write an XQuery to export relational data → XML:
<xmlview>
{ FOR $x IN /db/Course/row
  RETURN
    <course>
      <title> { $x/title/text() } </title>
      <room> { $x/room/text() } </room>
      { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()],
            $z IN /db/Student/row[sid/text() = $y/sid/text()]
        RETURN
          <student>
            <name> { $z/name/text() } </name>
            <address> { $z/address/text() } </address>
            <grade> { $y/grade/text() } </grade>
          </student> }
    </course> }
</xmlview>
Note: the result should conform to the DTD (see slide 20 for the relational schema; the generated result is shown in slide 21).
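The nested FOR loops of the export query amount to an outer scan of Course and, for each course, a join of Enroll with Student. The sketch below spells that out with sqlite3 and ElementTree. It is not how SilkRoute works internally; the function name publish and the key values (sid, cid) are hypothetical, and the element names are assumed to mirror the column names.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student(sid, name, address);
    CREATE TABLE Course(cid, title, room);
    CREATE TABLE Enroll(sid, cid, grade);
    -- Key values are hypothetical; the other values follow slide 21.
    INSERT INTO Student VALUES (1, 'John', 'Seattle'), (2, 'Mary', 'Shoreline');
    INSERT INTO Course VALUES (10, 'Operating Systems', 'MGH084'), (20, 'Database', 'EE045');
    INSERT INTO Enroll VALUES (1, 10, '3.8'), (2, 20, '3.9');
""")

def publish(conn):
    """Group students by course, mirroring the nested FOR loops of the XQuery."""
    view = ET.Element("xmlview")
    courses = conn.execute("SELECT cid, title, room FROM Course ORDER BY cid").fetchall()
    for cid, title, room in courses:                     # outer FOR $x
        course = ET.SubElement(view, "course")
        ET.SubElement(course, "title").text = title
        ET.SubElement(course, "room").text = room
        enrolled = conn.execute(
            "SELECT s.name, s.address, e.grade "
            "FROM Enroll e JOIN Student s ON s.sid = e.sid "
            "WHERE e.cid = ? ORDER BY s.sid", (cid,)).fetchall()
        for name, address, grade in enrolled:            # inner FOR $y, $z
            student = ET.SubElement(course, "student")
            ET.SubElement(student, "name").text = name
            ET.SubElement(student, "address").text = address
            ET.SubElement(student, "grade").text = grade
    return view

view = publish(conn)
print(ET.tostring(view, encoding="unicode"))
```

Issuing one inner query per course is the simplest plan but not the only one; a single join sorted by course, regrouped while streaming, produces the same document with one round trip.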

24 XML Publishing
Query: find Mary's grade in Operating Systems
XQuery over public view:
FOR $x IN /xmlview/course[title/text()="Operating Systems"],
    $y IN $x/student[name/text()="Mary"]
RETURN $y/grade/text()
SQL over database schema:
SELECT Enroll.grade
FROM Student, Enroll, Course
WHERE Student.name = 'Mary' and Course.title = 'Operating Systems'
  and Student.sid = Enroll.sid and Enroll.cid = Course.cid
SilkRoute does this automatically

25 XML Publishing
How do we choose the output structure?
- Determined by agreement with our partners, or dictated by committees that define XML dialects (called applications) and generate DTDs
- The DTD is selective with respect to the underlying databases
- XML data is often nested, irregular, etc.; that is why we need the xmlview query in slide 23 to do the transformation
- No agreed normal forms for XML, but there is some work on it: [M. Arenas and L. Libkin, A Normal Form for XML Documents]

26 XML Storage
- Often the XML data is small and is parsed directly into the application (DOM API): file-based management (pros and cons?)
- Sometimes it is big, and we need to store it in a database
- Much harder than XML publishing (why?)
- A fundamental XML storage problem: how do we choose the schema of the database?
- Possible solutions:
  - Schema derived from DTD: structure mapping
  - Storing XML as a graph, such as an "Edge relation": model mapping
  - Native approach: native XML/SSD engine

27 XML Storage in a Relational DB
- Use generic schema [Florescu, Kossmann 1999]
- Use DTD to derive schema [Shanmugasundaram, et al. 1999]
- Use data mining to derive schema [Deutsch, Fernandez, Suciu 1999]
- Use the Path table [T. Amagasa, T. Shimura, S. Uemura 2001]

28 XML Storage: Edge Relation
- [Florescu, Kossmann 1999]
- Monet [Schmidt et al., WebDB 00]
- Structure-based approach
- Mapping tree to relations
- Use generic relational schemas (independent of the XML schema):
  Ref(source, label, dest)
  Val(node, value)
[Figure: DOM tree]

29 XML Storage: Edge Relation [Florescu, Kossmann 1999]
[Figure: a tree of objects &o1 to &o6 containing a paper with title, author and year edges, whose values are "The Calculus", "…" and "1986"; the edges are stored in Ref and the values in Val]
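Shredding a document into the generic Ref(source, label, dest) and Val(node, value) relations is a single tree walk, sketched here in plain Python. The function name shred is hypothetical, object identifiers are simply assigned in document order rather than copied from the figure, and the author value is a placeholder because the slide elides it.

```python
import itertools
import xml.etree.ElementTree as ET

def shred(root):
    """Return (ref, val) row lists for the generic Ref/Val schema."""
    ids = itertools.count(1)          # object ids handed out in document order
    ref, val = [], []

    def visit(elem, oid):
        text = (elem.text or "").strip()
        if text:
            val.append((oid, text))                   # Val(node, value)
        for child in elem:
            child_oid = next(ids)
            ref.append((oid, child.tag, child_oid))   # Ref(source, label, dest)
            visit(child, child_oid)

    visit(root, next(ids))
    return ref, val

# "A. Author" is a placeholder value, not taken from the slide.
doc = ET.fromstring(
    "<paper><title>The Calculus</title><author>A. Author</author><year>1986</year></paper>")
ref, val = shred(doc)
print(ref)
print(val)
```

The same two tables work for any document, which is the appeal of the approach; the cost is that every step of a path query becomes a self-join on Ref.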

30 XML Storage: Edge Relation
- In practice may need more tables for reference links and nodes:
  RefTag1(source, dest)
  RefTag2(source, dest)
  …
  IntVal(node, intVal)
  RealVal(node, realVal)
  …
RefTag1:
Source | Dest
&o2 | &o7
&o5 | &o8
&o6 | &o8
[Figure: the paper tree (title, author, year; &o1 to &o6) extended with nodes &o7 and &o8]

31 XML Storage: DTD to Schema
[Christophides, Abiteboul, Cluet, Scholl 1994] [Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]
- Basic idea: use the XML schema to derive the relational schema
- DTD:
- Relational schemas:
  Paper(pid, title, year)
  Author(aid, pid, firstName, lastName)

32 XML Storage: DTD to Schema
- Each element corresponds to a relation
- Each attribute of an element corresponds to a column of the relation
- Connect elements using foreign keys
- But the problem: fragmentation!
- Example: how many relations should be used for finding an address of an author?
  Paper(pid, title, year)
  Author(aid, pid, firstName, lastName)
  Address(addid, aid, country)
  City(cid, addid, postcode, cityname)
  Street(sid, addid, postcode, streetname)
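To make the fragmentation concrete, the sketch below loads the five relations into an in-memory SQLite database and retrieves one author's address. All data values are invented for illustration, and the join conditions assume that aid and addid are the foreign keys the schema suggests.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Paper(pid, title, year);
    CREATE TABLE Author(aid, pid, firstName, lastName);
    CREATE TABLE Address(addid, aid, country);
    CREATE TABLE City(cid, addid, postcode, cityname);
    CREATE TABLE Street(sid, addid, postcode, streetname);
    -- Invented sample data.
    INSERT INTO Paper VALUES (1, 'A Paper', 1999);
    INSERT INTO Author VALUES (1, 1, 'Ann', 'Example');
    INSERT INTO Address VALUES (1, 1, 'Exampleland');
    INSERT INTO City VALUES (1, 1, '00000', 'Sampletown');
    INSERT INTO Street VALUES (1, 1, '00000', 'Main Street');
""")

# One logical "address of an author" lookup touches four relations.
address = conn.execute("""
    SELECT st.streetname, c.cityname, c.postcode, ad.country
    FROM Author a
    JOIN Address ad ON ad.aid = a.aid
    JOIN City c ON c.addid = ad.addid
    JOIN Street st ON st.addid = ad.addid
    WHERE a.lastName = ?
""", ("Example",)).fetchone()
print(address)
```

Under these assumptions the answer to the slide's question is four relations and three joins, for data that sat in one small subtree of the original document.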

33 XML Storage: Path Relations
[T. Amagasa, T. Shimura, S. Uemura 2001] ACM TOIT 2001 1(1)
- Store paths as strings (model-based approach)
- XPath expressions become SQL LIKE predicates
- Additional information for parent/child and ancestor/descendant relationships
- XRel: table schemas for paths, elements and values
- XParent: table schemas for elements, label paths and parents (alternative but similar)

34 XML Storage: Path Relations
Path:
pathID | Pathexpr
1 | #/bib
2 | #/bib#/paper
3 | #/bib#/paper#/author
4 | #/bib#/paper#/title
5 | #/bib#/paper#/year
6 | #/bib#/book#/author
7 | #/bib#/book#/title
8 | #/bib#/book#/publisher
One entry for every path in the database. Relatively small.
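With paths stored as strings, a simple XPath of child (/) and descendant (//) steps can be matched with SQL LIKE. The sketch below assumes the translation rule "/" becomes "#/" and "//" becomes "#%/"; the function names are hypothetical, and the code is an illustration of the idea rather than the XRel implementation. Predicates, wildcards and tag names containing LIKE metacharacters are not handled.

```python
import sqlite3

def xpath_to_like(xpath):
    """Translate a path of / and // steps into a LIKE pattern over '#/'-delimited paths."""
    # Split on '//' first so the single-slash rule never touches a descendant step.
    return "#%/".join(part.replace("/", "#/") for part in xpath.split("//"))

# The Path table from the slide.
PATHS = [
    (1, "#/bib"), (2, "#/bib#/paper"), (3, "#/bib#/paper#/author"),
    (4, "#/bib#/paper#/title"), (5, "#/bib#/paper#/year"),
    (6, "#/bib#/book#/author"), (7, "#/bib#/book#/title"),
    (8, "#/bib#/book#/publisher"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Path(pathID INTEGER, pathexpr TEXT)")
conn.executemany("INSERT INTO Path VALUES (?, ?)", PATHS)

def matching_paths(xpath):
    """Return the pathIDs whose stored path string matches the XPath."""
    rows = conn.execute(
        "SELECT pathID FROM Path WHERE pathexpr LIKE ? ORDER BY pathID",
        (xpath_to_like(xpath),))
    return [r[0] for r in rows]

print(xpath_to_like("/bib//author"), matching_paths("/bib//author"))
print(xpath_to_like("//title"), matching_paths("//title"))
```

The "#" before each step is what keeps "//author" from matching a tag that merely ends in "author": the pattern requires a "/" immediately before the tag name.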

35 XML Storage: Path Relations
Element:
NodeID | PathID | Start | End | ParentID
1 | 1 | 0 | 1000 | -
2 | 2 | 5 | 200 | 1
3 | 3 | 8 | 20 | 2
4 | 3 | 21 | 30 | 2
5 | 3 | 31 | 100 | 2
6 | 3 | 101 | 150 | 2
7 | 4 | 151 | 180 | 2
8 | 2 | 300 | 500 | 1
...
One entry for every element in the database. Relatively large. Start and End are positions in the document.
Val:
NodeID | Val
3 | Smith
4 | Vance
5 | Tim
6 | Wallace
7 | The Best…
...
One entry for every leaf in the database. Relatively large.
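The Start and End positions make ancestor/descendant tests a pure comparison: one element is an ancestor of another exactly when its region strictly contains the other's. A minimal sketch over the Element rows shown on the slide follows; the function names are hypothetical.

```python
# Rows of the Element table as read off the slide:
# (node_id, path_id, start, end, parent_id)
ELEMENT = [
    (1, 1, 0, 1000, None),
    (2, 2, 5, 200, 1),
    (3, 3, 8, 20, 2),
    (4, 3, 21, 30, 2),
    (5, 3, 31, 100, 2),
    (6, 3, 101, 150, 2),
    (7, 4, 151, 180, 2),
    (8, 2, 300, 500, 1),
]

def is_ancestor(anc, desc):
    """True when anc's (start, end) region strictly contains desc's region."""
    return anc[2] < desc[2] and desc[3] < anc[3]

def descendants(rows, node_id):
    """All node ids whose region lies inside the region of node_id."""
    node = next(r for r in rows if r[0] == node_id)
    return [r[0] for r in rows if is_ancestor(node, r)]

def children(rows, node_id):
    """Direct children, read from the ParentID column."""
    return [r[0] for r in rows if r[4] == node_id]

print(descendants(ELEMENT, 2))
print(children(ELEMENT, 1))
```

In SQL the same containment test is a join condition on two inequalities, so a descendant step costs one join regardless of how deep the descendant lies.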

36 XRel Example: Path Table
- Contains all path strings: Path(pathID, pathexp)
Path Table:
PathID | PathExp
0 | #/PLAY
1 | #/PLAY#/ACT
2 | #/PLAY#/ACT#/SCENE
3 | #/PLAY#/ACT#/SCENE#/@id
4 | #/PLAY#/ACT#/SCENE#/TITLE
5 | #/PLAY#/ACT#/SCENE#/SPEECH
… | …
[Figure: document tree with numbered nodes: root 1, PLAY 2, ACT 3, ACT …, SCENE 4, id 5 ("000"), TITLE 6 ("Intro"), SPEECH 7, SPEAKER 8 ("CURIO"), TEXT 9 ("This is …")]

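37 XRel Example: Other Tables
- Example
Element Table:
docID | PathID | start | end | index | reindex
1 | 0 | 0 | … | 1 | 1
1 | 1 | 6 | … | 1 | 1
1 | 2 | 11 | … | 1 | 1
1 | 4 | 27 | 48 | 1 | 1
… | … | … | … | … | …
Attribute Table:
docID | PathID | start | end | value
1 | 3 | 21 | | "000"
Text Table:
docID | PathID | start | end | value
1 | 4 | 34 | 40 | "Intro"
… | … | … | … | …
Path Table:
PathID | PathExp
0 | #/PLAY
… | …
[Figure: the document text ("Intro", "CURIO", "This is …") annotated with region positions 0, 6, 11, 27, 48, 21, 49, 57, 90, 91]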

38 Storing XML as a Graph
- Every XML instance is a tree
- Hence we can store it as any graph, using an Edge table
- In addition we need a Value table to store the data values (#PCDATA)
Edge relation summary:
- Same relational schema for every XML document:
  Edge(Source, Tag, Dest)
  Value(Source, Val)
- Generic: works for every XML instance
- But inefficient:
  - Repeats tags multiple times
  - Needs many joins to reconstruct data

39 Storing XML as a Graph
[Figure: tree with nodes numbered 0 to 11: a db node with book and publisher children; edge labels title, author and state; leaf values "Complete Guide to DB2", "Chamberlin", "Transaction Processing", "Bernstein", "Newcomer", "Morgan Kaufman", "CA"]
Edge:
Source | Tag | Dest
0 | db | 1
1 | book | 2
2 | title | 3
2 | author | 4
1 | book | 5
5 | title | 6
5 | author | 7
...
Value:
Source | Val
3 | Complete guide...
4 | Chamberlin
6 | ...

40 Storing XML as a Graph
What happens to queries:
XQuery:
FOR $x IN /db/book[author/text()="Chamberlin"]
RETURN $x/title
SQL over the Edge and Value tables:
SELECT vtitle.val
FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle
WHERE xdb.source = 0 and xdb.tag = 'db'
  and xdb.dest = xbook.source and xbook.tag = 'book'
  and xbook.dest = xauthor.source and xauthor.tag = 'author'
  and xbook.dest = xtitle.source and xtitle.tag = 'title'
  and xauthor.dest = vauthor.source and vauthor.val = 'Chamberlin'
  and xtitle.dest = vtitle.source
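The six-way join can be run as written against a small in-memory SQLite database. The Edge rows below are the ones listed on slide 39; the Value rows for nodes 6 and 7 are an assumption read off the tree figure, since the table on the slide elides them, and the val column name follows the Value(Source, Val) schema of slide 38.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Edge(source, tag, dest);
    CREATE TABLE Value(source, val);
    INSERT INTO Edge VALUES
        (0, 'db', 1), (1, 'book', 2), (2, 'title', 3), (2, 'author', 4),
        (1, 'book', 5), (5, 'title', 6), (5, 'author', 7);
    -- Values for nodes 6 and 7 are assumed from the figure.
    INSERT INTO Value VALUES
        (3, 'Complete Guide to DB2'), (4, 'Chamberlin'),
        (6, 'Transaction Processing'), (7, 'Bernstein');
""")

# One Edge alias per path step, one Value alias per text comparison or result.
QUERY = """
    SELECT vtitle.val
    FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle,
         Value vauthor, Value vtitle
    WHERE xdb.source = 0 AND xdb.tag = 'db'
      AND xdb.dest = xbook.source AND xbook.tag = 'book'
      AND xbook.dest = xauthor.source AND xauthor.tag = 'author'
      AND xbook.dest = xtitle.source AND xtitle.tag = 'title'
      AND xauthor.dest = vauthor.source AND vauthor.val = ?
      AND xtitle.dest = vtitle.source
"""
titles = [row[0] for row in conn.execute(QUERY, ("Chamberlin",))]
print(titles)
```

A two-line XQuery turns into a join over six table instances, which illustrates the "many joins" cost of the generic Edge schema.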

41 XML Storage: Data Mining to Schema [Deutsch, Fernandez, Suciu 1999]
- Given:
  - One large XML data instance
  - No schema/DTD
  - Query workload
- Problem: find a "good" relational schema for it
- Notice: even when a DTD is present, it may be imprecise: e.g. when a person may have 1-3 phones, the DTD only says phone*

42 XML Storage: Data Mining to Schema [Deutsch, Fernandez, Suciu 1999]
[Figure: a paper tree with author, title, year, fn and ln nodes, mapped to tables Paper1 and Paper2 with columns A, B, C, D]

43 Useful References
- Abiteboul, Buneman, Suciu, Data on the Web: from Relations to Semistructured Data and XML. For foundations
- W3C homepage, www.w3.org. For current standards
- F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002
- XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002
- Tatarinov, Storing and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002
- Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, pp 110-141, 2001
- D. Florescu, D. Kossmann, Storing and Querying XML Data using an RDBMS, IEEE Data Engineering Bulletin 22(3), 1999
- J. Shanmugasundaram et al., Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999
- C. Zhang et al., On Supporting Containment Queries in Relational Database Management Systems, In Proc. of SIGMOD, 2001

