Download presentation
Presentation is loading. Please wait.
1
1 Managing XML and Semistructured Data Part 1: Preliminaries, Motivation and Overview Acknowledgement: Part of the materials in this set of XML slides are extracted from Prof. Dan Suciu’s course materials. Thanks for his permission of using them COMP630L Topics in DB Systems: Managing Web Data Fall, 2007 Dr Wilfred Ng
2
2 HTML XML SGML a W3C standard to complement HTML origins: structured text SGML motivation: HTML describes presentation XML describes content http://www.w3.org/TR/2000/REC-xml-20001006 (version 2, 10/2000) http://www.w3.org/TR/2000/REC-xml-20001006
3
3 HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteoul, Buneman, Suciu Morgan Kaufmann, 1999 Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteoul, Buneman, Suciu Morgan Kaufmann, 1999 HTML describes the presentation
4
4 XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content
5
5 SGML and XML eXtensible Markup Language - XML XML 1.0 – a recommendation from W3C, 1998 XML – related standards: W3C group – XSL, XQuery,… Roots: SGML [http://www.w3.org/MarkUp/SGML/]http://www.w3.org/MarkUp/SGML/ Markup = encoding: possibilities for expressing information Information modelling freedom, Reusability, provability, validity But a very nasty language SGML is an international standard for device-independent, system-independent methods of representing texts in electronic form After the roots: a format for sharing data (on the Web)
6
6 Basic XML Terminology tags: book, title, author, … start tag:, end tag: elements: …, … elements are nested empty element: abbrv. an XML document: single root element
7
7 The Role of XML Data XML is designed for data exchange, not to replace relational or E/R data Sources of XML data: Created manually with text editors: not really data Generated automatically from relational data Text files, replacing older data formats: Web server logs, scientific data (biological, astronomical) Stored/processed in native XML engines: very few applications need that today
8
8 XML Advantages for Web Data Over SGML Supported by mainstream browsers such as IE and Netscape Standard Stylesheet Standard linking Over HTML Interchangable Searchable Reusable Enables Automation
9
9 Why XML is of Interest to Us HTML fails to meet structure specification – tags for appearance only XML is a language syntax for data Note: we have no langauge syntax for relational data But XML is not relational: semistructured This is exciting because: Can translate any data to XML Can ship XML over the Web (HTTP) Can input XML into any application Thus: make data sharing and exchange on the Web possible! 40% annual growth and accelerating: publishers, government, education,…
10
10 Relational Data in XML John 3634 Sue 6343 Dick 6363 John 3634 Sue 6343 Dick 6363 row name phone “John”3634“Sue”“Dick”63436363 person XML: person
11
11 XML Data: more expressive Missing attributes: Could represent in a table with nulls John 1234 Joe John 1234 Joe no phone ! namephone John1234 Joe-
12
12 XML Data: more expressive Repeated attributes Impossible in tables: Mary 2345 3456 Mary 2345 3456 two phones ! namephone Mary23453456 ???
13
13 XML Data: more expressive Attributes with different types in different objects Nested collections (no 1NF) Heterogeneous collections: contains both s and s John Smith 1234 John Smith 1234 structured name !
14
14 XML Data Sharing and Exchange application relational data Transform Integrate Warehouse XML DataWEB (HTTP) application legacy data object-relational Specific data management tasks
15
15 From Relational Data to XML Data John 3634 Sue 6343 Dick 6363 John 3634 Sue 6343 Dick 6363 row name phone “John”3634“Sue”“Dick”63436363 persons XML – Data tree and its doc persons
16
16 XML Data XML is self-describing Schema elements become part of the data Relational schema: persons(name,phone) In XML,, are part of the data, and are repeated many times ! Consequence: XML is much more flexible XML data is semistructured data
17
17 XML from/to Relational Data XML publishing: relational data XML XML storage: XML relational data Relational data is regular in XML tree representation. But XML data can be non- relational! Supplier Common XML Illusion Broker ? ?
18
18 XML Publishing: ideas sketch Relational Database Application Web XML publishing Tuple streams XML SQL XPath/ XQuery
19
19 XML Publishing Exporting the data is relatively easier: we do this already for HTML Translating XQuery SQL is hard XML publishing systems: Research: Experanto (IBM/DB2), SilkRoute (AT&T Labs and UW, software unavailable) now SilkRoute downloadableExperanto SilkRoute downloadable XQuery SQL Commercial: SQL Server, Oracle Only XPath SQL and with restrictions A middle-ware approach
20
20 XML Publishing Backend relational engine Relational schema: Student(sid, name, address) Course(cid, title, room) Enroll(sid, cid, grade) Schemas are often proprietary but XML schemas are public studentcourse enroll Will follow the idea in SilkRoute [ACM TODS 27(4), 2002]SilkRoute
21
21 XML Publishing Operating Systems MGH084 John Seattle 3.8 … Database EE045 Mary Shoreline 3.9 … … Operating Systems MGH084 John Seattle 3.8 … Database EE045 Mary Shoreline 3.9 … … Group by courses: Redundant representation of students Other representations possible too: tags may not be the same as attributes in general
22
22 XML Publishing First thing to do: design the DTD (public for users) Second thing to do: develop an XML canonical view under db root for each relation {t1,…tk} with schema R(A1,…,An) a11 … a11 … ak1 … ak1
23
23 { FOR $x IN /db/Course/row RETURN { $x/title/text() } { $x/room/text() } { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row $z IN /db/Student/row[sid/text() = $y/sid/text()]/row RETURN { $z/name/text() } { $z/address/text() } { $y/grade/text() } } { FOR $x IN /db/Course/row RETURN { $x/title/text() } { $x/room/text() } { FOR $y IN /db/Enroll/row[cid/text() = $x/cid/text()]/row $z IN /db/Student/row[sid/text() = $y/sid/text()]/row RETURN { $z/name/text() } { $z/address/text() } { $y/grade/text() } } Now we write an XQuery to export relational data XML Note: result should conform to DTD (slide 20 for the relational Schema; the generated result is shown in slide 21.)
24
24 XML Publishing Query: find Mary’s grade in Operating Systems FOR $x IN /xmlview/course[title/text()=“Operating Systems”], $y IN $x/student/[name/text()=“Mary”] RETURN $y/grade/text() FOR $x IN /xmlview/course[title/text()=“Operating Systems”], $y IN $x/student/[name/text()=“Mary”] RETURN $y/grade/text() XQuery over public view SELECT Enroll.grade FROM Student, Enroll, Course WHERE Student.name=“Mary” and Course.title=“OS” and Student.sid = Enroll.sid and Enroll.cid = Course.cid SQL over database schema SilkRoute does this automatically
25
25 XML Publishing How do we choose the output structure ? Determined by agreement, with our partners, or dictated by committees in XML dialects (called applications) and generate DTDs The DTD is selective wrt the underlying databases XML Data is often nested, irregular, etc – that’s why we need the xmlview query in slide 23 to do the transformation No agreed normal forms for XML but a few work on it… [M. Arenas and L. Libkin, A Normal Form for XML Documents]
26
26 XML Storage Often the XML data is small and is parsed directly into the application (DOM API) – file based management (pros and cons?)DOM Sometimes it is big, and we need to store it in a database Much harder than XML publishing (why ?) A fundamental XML storage problem: How do we choose the schema of the database ? Possible solutions: Schema derived from DTD – structure mapping Storing XML as a graph such as “Edge relation” – model mapping Native Approach – native XML/SSD engine
27
27 XML Storage in a Relational DB Use generic schema [Florescu, Kossman 1999]FlorescuKossman Use DTD to derive schema [Shanmugasundaram, et al. 1999]Shanmugasundaram Use data mining to derive schema [Deutsch, Fernandez, Suciu 1999]DeutschFernandezSuciu Use the Path table [T.Amagasa, T.Shimura, S.Uemura 2001]T.AmagasaT.ShimuraS.Uemura
28
28 XML Storage: Edge Relation [Florescu, Kossman 1999]FlorescuKossman Monet [Schmidt et al. WebDB 00] Monet [Schmidt et al. WebDB 00] Structured-based approach Mapping tree to relations Use generic relational schemas (independent on the XML schema): Ref(source,label,dest) Val(node,value) Ref(source,label,dest) Val(node,value) DOM Tree
29
29 &o1 &o3 &o2 &o4&o5 paper title author year &o6 “The Calculus”“…” “1986” XML Storage: Edge Relation [Florescu, Kossman 1999]FlorescuKossman Ref Val
30
30 XML Storage: Edge Relation In practice may need more tables for reference links and nodes: RefTag1(source,dest) RefTag2(source,dest) … IntVal(node,intVal) RealVal(node,realVal) … RefTag1(source,dest) RefTag2(source,dest) … IntVal(node,intVal) RealVal(node,realVal) … RefTag1 SourceDest &o2&o7 &o5&o8 &o6&o8 &o1 &o3 &o2 &o4&o5 paper title author year &o6 &o7 &o8
31
31 XML Storage: DTD to Schema [Christophides, Abiteboul, Cluet, Scholl 1994]ChristophidesAbiteboulCluetScholl [Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]ShanmugasundaramTufteHeZhangDeWittNaughton Basic idea: use the XML schema to derive the relational schema DTD: Relational schemas: Paper(pid, title, year) Author(aid, pid, firstName, lastName) Paper(pid, title, year) Author(aid, pid, firstName, lastName)
32
32 XML Storage: DTD to Schema Each Element corresponds to a relation Each Attribute of Element corresponds to a column of relation Connect elements using foreign keys But the problems: fragmentation! Example: How many relations should be used for finding an address of an author? Paper(pid, title, year) Author(aid, pid, firstName, lastName) Address (addid, aid, country) City (cid, addid, postcode, cityname) Street (sid, addid, postcode, streetname) Paper(pid, title, year) Author(aid, pid, firstName, lastName) Address (addid, aid, country) City (cid, addid, postcode, cityname) Street (sid, addid, postcode, streetname)
33
33 XML Storage: Path Relations [T.Amagasa, T.Shimura, S.Uemura 2001]T.AmagasaT.ShimuraS.Uemura ACM TOIT 2001 1(1) Store paths as strings (model-based approach) XPath expressions become the SQL like operator Additional information for parent/child, ancestor/descendant relationship XRel: table schemas for paths, elements and val XRel XParent:table schemas for elements, labelpath, parent (alternative but similar)
34
34 XML Storage: Path Relations pathIDPathexpr 1#/bib 2#/bib#/paper 3#/bib#/paper#/author 4#/bib#/paper#/title 5#/bib#/paper#/year 6#/bib#/book#/author 7#/bib#/book#/title 8#/bib#/book#/publisher Path One entry for every path in the database Relatively small
35
35 XML Storage: Path Relations Node ID Path ID StartEnd Parent ID 1101000- 2252001 338202 4321302 53311002 631011502 741511802 823005001... Element One entry for every element in the database relatively large NodeIDVal 3Smith 4Vance 5Tim 6Wallace 7The Best… 63 74 82... Val One entry for every leaf in the database Relatively large Positions in doc
36
36 XRel Example: Path Table Contains all path string Path(pathID, pathexp) PathIDPathExp 0 #/PLAY 1 #/PLAY#/ACT 2 #/PLAY#/ACT#/SCENE 3 #/PLAY#/ACT#/SCENE#/@id 4 #/PLAY#/ACT#/SCENE#/TITLE 5 #/PLAY#/ACT#/SCENE#/SPEECH …… PLAY 2 ACT 3 ACT … TITLE 6 SPEECH 7 “Intro” “CURIO”“This is …” … SCENE 4 … id 5 “000” … SPEAKER 8 TEXT 9 root 1 Path Table
37
37 XRel Example: Other Tables Example docIDPathIDstartendindexreindex 100…11 116…11 1211…11 14274811 ……………… Element Table Attribute Table docIDPathIDstartendvalue 1321 “000” Intro CURIO This is … 0 6 11 2748 21 49 5790 91 Text Table docIDPathIDstartendvalue 143440“Intro” …………… PathIDPathExp 0 #/PLAY …… Path Table
38
38 Storing XML as a Graph Every XML instance is a tree Hence we can store it as any graph, using an Edge table In addition we need a Value table to store the data values (#PCDATA) Edge relation summary: Same relational schema for every XML document: Edge(Source, Tag, Dest) Value(Source, Val) Generic: works for every XML instance But inefficient: Repeat tags multiple times Need many joins to reconstruct data
39
39 Storing XML as a Graph db book publisher titleauthor titleauthor titlestate “Complete Guide to DB2” “Chamberlin”“Transaction Processing” “Bernstein”“Newcomer” “Morgan Kaufman” “CA” 1 2 3 4 5 67 8 9 10 11 0 SourceTagDest 0db1 1book2 2title3 2author4 1book5 5title6 5author7... SourceVal 3Complete guide... 4Chamberlin 6... Edge Value
40
40 Storing XML as a Graph What happens to queries: SELECT vtitle.value FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle WHERE xdb.source=0 and xdb.tag = ‘db’ and xdb.dest = xbook.source and xbook.tag = ‘book’ and xbook.dest = xauthor.source and xauthor.tag = ‘author’ and xbook.dest = xtitle.source and xtitle.tag = ‘title’ and xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and xtitle.dest = vtitle.source SELECT vtitle.value FROM Edge xdb, Edge xbook, Edge xauthor, Edge xtitle, Value vauthor, Value vtitle WHERE xdb.source=0 and xdb.tag = ‘db’ and xdb.dest = xbook.source and xbook.tag = ‘book’ and xbook.dest = xauthor.source and xauthor.tag = ‘author’ and xbook.dest = xtitle.source and xtitle.tag = ‘title’ and xauthor.dest = vauthor.source and vauthor.value = ‘Chamberin and xtitle.dest = vtitle.source FOR $x IN /db/book[author/text()=“Chamberlin”] RETURN $x/title FOR $x IN /db/book[author/text()=“Chamberlin”] RETURN $x/title
41
41 XML Storage: Data Mining to Schema [Deutsch, Fernandez, Suciu 1999]DeutschFernandezSuciu Given: One large XML data instance No schema/DTD Query workload Problem: find a “good” relational schema for it Notice: even when a DTD is present, it may be imprecise: E.g. when a person may have 1-3 phones: phone*
42
42 XML Storage: Data Mining to Schema paper author title year fn ln Paper1 Paper2 [Deutsch, Fernandez, Suciu 1999]DeutschFernandezSuciu A BC D A B C D
43
43 Useful References Data on the Web: from Relations, to Semistructured Data and XML, AbiteboulAbiteboul, Buneman, SuciuBunemanSuciu For foundations W3C homepage, www.w3.orgwww.w3.org For current standards F. Tian et al., The Design and Performance Evaluation of Alternative XML Storage Strategies, SIGMOD Record, 2002 XParent: An Efficient RDBMS-Based XML Database System, In Proc. of ICDE, 2002 Tatarinov, Stroring and Querying Ordered XML Using a Relational Database System, In Proc. of SIGMOD, 2002 Yoshikawa et al., XRel: A Path-Based Approach to Storage and Retrieval of XML Documents Using Relational Databases, ACM Trans. on Internet Technology, Vol. 1, No. 1, pp 110-141, 2001 D. Florescu, D. Kossman, Storing and Querying XML Data using an RDBMS. IEEE Data Engineering Bulletin 22(3), 1999 J. Shanmugasundaram et al. Relational Databases for Querying XML Documents: Limitations and Opportunities, In Proc. of VLDB, 1999 C. Zhang et al., On Supporting Containment Queries in Relational Databases Management Systems, In Proc. of SIGMOD, 2001
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.