Presentation is loading. Please wait.

Presentation is loading. Please wait.

Managing XML Data Dan Suciu University of Washington.

Similar presentations


Presentation on theme: "Managing XML Data Dan Suciu University of Washington."— Presentation transcript:

1 Managing XML Data Dan Suciu University of Washington

2 How the Web is Today HTML documents often generated by applications consumed by humans only easy access: across platforms, across organizations

3 No Application Interoperability HTML not understood by applications –screen scraping brittle database technology: client-server –still vendor specific companies merge, form partnerships; need interoperability fast people are inventive: send data by fax !

4 New Universal Data Exchange Format: XML A recommendation from the W3C XML = data XML generated by applications XML consumed by applications easy access: across platforms, organizations XML data management problems

5 Paradigm Shift on the Web from documents (HTML) to data (XML) from information retrieval to data management

6 Paradigm shift for databases from relational model to semistructured dat model from data processing to data/query translation from storage to transport

7 What This Tutorial is About what the database community has done –semistructured data model –query languages, schemas what the Web community has done: –data formats/models: XML –transformation language (XSL), schemas where they meet and where they differ

8 Outline Semistructured and XML data models Query languages Schemas Systems issues Conclusions

9 Part 1 Semistructured and XML Data Models

10 Semistructured Data Origins: integration of heterogeneous sources data sources with non-rigid structure biological data Web data

11 The Semistructured Data Model &o1 &o12&o24&o29 &o43 &96 &243 &206 &25 “Serge” “Abiteboul” 1997 “Victor” “Vianu” 122133 paper book paper references author title year http author title publisher author title page firstname lastname firstnamelastnamefirst last Bib Object Exchange Model (OEM) complex object atomic object [H. Garcia-Molina, Y. Papakonstantinou, D. Quass, A. Rajaraman, Y. Sagiv, J. Ullman, J. Widom, 1997]

12 Syntax for Semistructured Data Bib: &o1 { paper: &o12 { … }, book: &o24 { … }, paper: &o29 { author: &o52 “Abiteboul”, author: &o96 { firstname: &243 “Victor”, lastname: &o206 “Vianu”}, title: &o93 “Regular path queries with constraints”, references: &o12, references: &o24, pages: &o25 { first: &o64 122, last: &o92 133} }

13 Syntax for Semistructured Data May omit oid’s: { paper: { author: “Abiteboul”, author: { firstname: “Victor”, lastname: “Vianu”}, title: “Regular path queries …”, page: { first: 122, last: 133 } }

14 Characteristics of Semistructured Data missing or additional attributes multiple attributes different types in different objects heterogeneous collections self-describing, irregular data, no a priori structure

15 Comparison with Relational Data { row: { name: “John”, phone: 3634 }, row: { name: “Sue”, phone: 6343 }, row: { name: “Dick”, phone: 6363 } } row name phone “John”3634“Sue”“Dick”63436363

16 XML a W3C standard to complement HTML origins: structured text SGML motivation: –HTML describes presentation –XML describes content http://www.w3.org/TR/2000/REC-xml-20001006 (version 2, 10/2000)

17 From HTML to XML HTML describes the presentation

18 HTML Bibliography Foundations of Databases Abiteboul, Hull, Vianu Addison Wesley, 1995 Data on the Web Abiteoul, Buneman, Suciu Morgan Kaufmann, 1999

19 XML Foundations… Abiteboul Hull Vianu Addison Wesley 1995 … XML describes the content

20 XML Terminology tags: book, title, author, … start tag:, end tag: elements: …, … elements are nested empty element: abbrv. an XML document: single root element well formed XML document: if it has matching tags

21 More XML: Attributes Foundations of Databases Abiteboul … 1995 attributes are alternative ways to represent data

22 More XML: Oids and References Jane Mary John oids and references in XML are just syntax

23 XML Data Model Several competing models: Document Object Model (DOM): –http://www.w3.org/TR/2001/WD-DOM-Level-3-CMLS-20010209/ (2/2001) –class hierarchy (node, element, attribute,…) –objects have behavior –defines API to inspect/modify the document XSL data model Infoset –PSV (post schema validation) XML Query data model (next)

24 XML Query Data Model http://www.w3.org/TR/query-datamodel/ 2/2001 Describes XML as a tree, specialized nodes Uses a functional-style notation (think ML)

25 XML Query Data Model Node ::= DocNode | ElemNode | ValueNode | AttrNode | NSNode | PINode | CommentNode | InfoItemNode | RefNode

26 XML Query Data Model Element node (simplified definition): elemNode : (QNameValue, {AttrNode }, [ ElemNode | ValueNode])  ElemNode QNameValue = means “a tag name” {...} = means “set of...” [...] = means “list of...”

27 XML Query Data Model Reads: “give me a tag, a set of attributes, a list of elements/values, and I will return an element”

28 XML Query Data Model Example <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 book1= elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8]) price2 = attrNode(…) /* next */ currency3 = attrNode(…) title4 = elemNode(title, string9) … book1= elemNode(book, {price2, currency3}, [title4, author5, author6, author7, year8]) price2 = attrNode(…) /* next */ currency3 = attrNode(…) title4 = elemNode(title, string9) …

29 XML Query Data Model Attribute node: attrNode : (QNameValue, ValueNode)  AttrNode

30 XML Query Data Model Example <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 price2 = attrNode(price,string10) string10 = valueNode(…) /* next */ currency3 = attrNode(currency, string11) string11 = valueNode(…)

31 XML Query Data Model Value node: ValueNode = StringValue | BoolValue | FloatValue … stringValue : string  StringValue boolValue : boolean  BoolValue floatValue : float  FloatValue

32 XML Query Data Model Example <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 <book price = “55” currency = “USD”> Foundations … Abiteboul Hull Vianu 1995 price2 = attrNode(price,string10) string10 = valueNode(stringValue(“55”)) currency3 = attrNode(currency, string11) string11 = valueNode(stringValue(“USD”)) title4 = elemNode(title, string9) string9 = valueNode(stringValue(“Foundations…”)) price2 = attrNode(price,string10) string10 = valueNode(stringValue(“55”)) currency3 = attrNode(currency, string11) string11 = valueNode(stringValue(“USD”)) title4 = elemNode(title, string9) string9 = valueNode(stringValue(“Foundations…”))

33 XML Namespaces http://www.w3.org/TR/REC-xml-names (1/99) name ::= [prefix:]localpart … 15 …. … 15 ….

34 … … XML Namespaces syntactic:, semantic: provide URL for schema defined here

35 XML v.s. Semistructured Data both described best by a graph both are schema-less, self-describing

36 Similarities and Differences Alan 42 ab@com { person: &o123 { name: “Alan”, age: 42, email: “ab@com” } } person nameageemail Alan42ab@com person name age email Alan42ab@com father … { person: { father: &o123 …} } similar on trees, different on graphs

37 More Differences XML is ordered, ssd is not XML can mix text and elements: Making Java easier to type and easier to type Phil Wadler XML has lots of other stuff: entities, processing instructions, comments Very important: these differences make XML data management harder

38 Summary of Data Models semistructured data, XML data is self-describing, irregular schema embedded with the data

39 Part 2 Query Languages Semistructured and XML data models Query languages Schemas Systems issues Conclusions

40 Query Languages: Outline Path expressions: –XPath Declarative languages (query languages) –XML-QL –Quilt –XQuery Recursive languages –structural recursion –XSL

41 XPath http://www.w3.org/TR/xpath (11/99) Building block for other W3C standards: – XSL Transformations (XSLT) – XML Link (XLink) – XML Pointer (XPointer) – XML Query Was originally part of XSL

42 Example for XPath Queries Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998 Addison-Wesley Serge Abiteboul Rick Hull Victor Vianu Foundations of Databases 1995 Freeman Jeffrey D. Ullman Principles of Database and Knowledge Base Systems 1998

43 XPath: Simple Expressions /bib/book/year Result: 1995 1998 /bib/paper/year Result: empty (there were no papers)

44 XPath: Restricted Kleene Closure //author Result: Serge Abiteboul Rick Hull Victor Vianu Jeffrey D. Ullman /bib//first-name Result: Rick

45 Xpath: Text Nodes /bib/book/author/text() Result: Serge Abiteboul Jeffrey D. Ullman Rick Hull doesn’t appear because he has firstname, lastname

46 Xpath: Wildcard //author/* Result: Rick Hull * Matches any element

47 Xpath: Attribute Nodes /bib/book/@price Result: “55” @price means that price is has to be an attribute

48 Xpath: Qualifiers /bib/book/author[firstname] Result: Rick Hull

49 Xpath: More Qualifiers /bib/book/author[firstname][address[//zip][city]]/lastname Result: … …

50 Xpath: More Qualifiers /bib/book[@price < “60”] /bib/book[author/@age < “25”] /bib/book[author/text()]

51 Xpath: Summary bibmatches a bib element *matches any element /matches the root element /bibmatches a bib element under root bib/papermatches a paper in bib bib//papermatches a paper in bib, at any depth //papermatches a paper at any depth paper|bookmatches a paper or a book @pricematches a price attribute bib/book/@pricematches price attribute in book, in bib bib/book/[@price<“55”]/author/lastname matches…

52 Xpath: More Details An Xpath expression, p, establishes a relation between: –A context node, and –A node in the answer set In other words, p denotes a function: –S[p] : Nodes -> {Nodes} Examples: –author/firstname –. = self –.. = parent –part/*/*/subpart/../name = part/*/*[subpart]/name

53 The Root and the Root 1 2 bib is the “document element” The “root” is above bib /bib = returns the document element / = returns the root Why ? Because we may have comments before and after ; they become siblings of This is advanced xmlogy

54 Xpath: More Details We can navigate along 13 axes: ancestor ancestor-or-self attribute child descendant descendant-or-self following following-sibling namespace parent preceding preceding-sibling self

55 Xpath: More Details Examples: –child::author/child:lastname = author/lastname –child::author/descendant::zip = author//zip –child::author/parent::* = author/.. –child::author/attribute::age = author/@age

56 XML-QL: A Query Language for XML http://www.w3.org/TR/NOTE-xml-ql (8/98) Also: –[Deutsch, Fernandez, Florescu, Levy, Suciu, 99] features: –regular path expressions –patterns, templates –Skolem Functions based on OEM data model

57 Pattern Matching in XML-QL WHERE Morgan Kaufmann $a in “www.a.b.c/bib.xml” CONSTRUCT $a WHERE Morgan Kaufmann $a in “www.a.b.c/bib.xml” CONSTRUCT $a $a = a variable

58 Path Expressions in XML-QL WHERE Morgan Kaufmann $a in “www.a.b.c/bib.xml” CONSTRUCT $a WHERE Morgan Kaufmann $a in “www.a.b.c/bib.xml” CONSTRUCT $a Note: abbreviates or or... Here: we use Xpath expressions; Original XML-QL: used math syntax

59 Simple Constructors in XML- QL WHERE $a in “www.a.b.c/bib.xml” CONSTRUCT $a $l WHERE $a in “www.a.b.c/bib.xml” CONSTRUCT $a $l Smith English Smith Mandarin Doe English

60 Skolem Functions in XML-QL WHERE $a in “www.a.b.c/bib.xml” CONSTRUCT $a $l WHERE $a in “www.a.b.c/bib.xml” CONSTRUCT $a $l Smith English Mandarin Doe English

61 XQuery Based on Quilt (which is based on XML-QL) http://www.w3.org/TR/xquery/ 2/2001 Xquery = XML-QL – patterns – Skolem + aggregates + duplicates + based on XML data model

62 XQuery FOR $x in /bib/book WHERE $x/year > 1995 RETURN $x/title

63 XQuery FOR $x in expr -- binds $x to each value in the list expr LET $x = expr -- binds $x to the entire list expr –Useful for common subexpressions and for aggregations

64 XQuery FOR $p IN distinct(document("bib.xml")//publisher) LET $b := document("bib.xml")/book[publisher = $p] WHERE count($b) > 100 RETURN $p distinct = a function that eliminates duplicates count = a (aggregate) function that returns the number of elms

65 XQuery Summary: FOR-LET-WHERE-RETURN = FLWR FOR/LET Clauses WHERE Clause RETURN Clause List of tuples Instance of Xquery data model

66 Structural Recursion Data as sets with a union operator: {a:3, a:{b:”one”, c:5}, b:4} = {a:3} U {a:{b:”one”,c:5}} U {b:4}

67 Structural Recursion f(T1 U T2) = f(T1) U f(T2) f({L: T}) = f(T) f({}) = {} f(V) = if isInt(V) then {result: V} else {} f(T1 U T2) = f(T1) U f(T2) f({L: T}) = f(T) f({}) = {} f(V) = if isInt(V) then {result: V} else {} Example: retrieve all integers in the data a a b b c3 “one”5 4 result 354 standard textbook programming on trees

68 Structural Recursion Example: increase all engine prices by 10% f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)} f({}) = {} f(V) = V f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L= engine then {L: g(T)} else {L: f(T)} f({}) = {} f(V) = V g(T1 U T2) = g(T1) U g(T2) g({L: T}) = if L= price then {L:1.1*T} else {L: g(T)} g({}) = {} g(V) = V g(T1 U T2) = g(T1) U g(T2) g({L: T}) = if L= price then {L:1.1*T} else {L: g(T)} g({}) = {} g(V) = V enginebody partprice part price 100 1000 100 1000 enginebody partprice part price 110 1100 100 1000

69 XSL http://www.w3.org/TR/xslt.html 11/99 purpose: stylesheet specification language: –stylesheet: XML -> HTML –in general: XML -> XML

70 XSL Templates and Rules query = collection of template rules template rule = match pattern + template Retrieve all book titles:

71 Flow Control in XSL

72 1 2 3 4 1 2 3 4

73 XSL is Structural Recursion Equivalent to: f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t) f({}) = {} f(V) = V f(T1 U T2) = f(T1) U f(T2) f({L: T}) = if L= c then {C: t} else L= b then {B: f(t)} else L= a then {A: f(t)} else f(t) f({}) = {} f(V) = V XSL query = single function XSL query with modes = multiple function

74 XSL and Structural Recursion XSL: trees only may loop Structural Recursion: arbitrary graphs always terminates stack overflow on IE 5.0 add the following rule:

75 Summary of Query Languages Path navigation –Xpath: restricted regular path expressions Declarative languages: –XML-QL –Quilt/ Xquery Structural recursion: –UnQL –XSLT

76 Part 3 Schemas Semistructured and XML data models Query languages Schemas Systems issues Conclusions

77 Schemas why ? –XML: to describe semantics –semistructured data: to improve processing what ? –semistructured data: foundational –XML: several concrete proposals

78 Schemas when ? –semistructured data, XML: a posteriori –RDBMS: a priori, to interpret binary data how ? –semistructured data: schema is independent –XML: schema is hardwired with the data

79 Outline schemas for semistructured data: –foundations –schema extraction schemas for XML: –DTD –XML-Schema

80 Schemas: An Example &r1 &c1&c2 &s2&s3&s6&s7 &s10 company name address name url address “Widget” “Trenton” “Gadget” “www.gp.fr” “Paris” &p2&p1&p3 &s0&s1&s4&s5&s8&s9 person “Smith” name position name phone name position “Manager” “Jones” “5552121” “Dupont” “Sales” employee manages c.e.o. works-for c.e.o. &a1 &a2 &a3 &a4 &a5 &a6 &a7 description procurement salesrep contact task eval 1997 1998 “on target” “below target” Some database:

81 Lower-Bound Schemas Root Company Employee string company person works-for c.e.o. address name managed-by name

82 Upper Bound Schemas Root Company Employee string company person works-for c.e.o. | employee name | address | url managed-by name | phone | position Any description -

83 The Two Questions to Ask Conformance: does that data conform to this schema ? Classification: if so, then which objects belong to what classes ?

84 Graph Simulation Definition Two edge-labeled graphs G1, G2 A simulation is a relation R between nodes: if (x1, x2) in R, and (x1,a,y1) in G1, then exists (x2,a,y2) in G2 (same label) s.t. (y1,y2) in R x1x2 a R G1G2 y1 a R y2 Note: a simulation can be efficiently computed [Henzinger, et a. 1995]

85 Using Simulation Data graph D, schema S upper bound schema: –conformance: find simulation R from D to S –classification: check if (x,c) in R lower bound schema –conformance: find simulation R from S to D –classification: check if (c,x) in R [Buneman, Davidson, Fernandez, Suciu 1997]

86 Example &r1 &c1&c2 &s2&s3&s6&s7 &s10 company name address name url address “Widget” “Trenton” “Gadget” “www.gp.fr” “Paris” &p2&p1&p3 &s0&s1&s4&s5&s8&s9 person “Smith” name positionname phone name position “Manager” “Jones” “5552121” “Dupont” “Sales” employee manages c.e.o. works-for c.e.o. &a1 &a2 &a3 &a4 &a5 &a6 &a7 description procurement salesrep contact task eval 1997 1998 “on target” “below target” Root Company Employee string company person works-for c.e.o. address name managed-by name Root Company Employee string company person works-for c.e.o. | employee name | address | url managed-by name | phone | position Any description - DatabaseLower BoundUpper Bound simulation: efficient technique for checking conformance to schema

87 Application 1: Improve Secondary Storage Root Company Employee string company person works-for c.e.o. address name managed-by name Company Employee Store rest in overflow graph Lower-bound schema

88 Application 2: Query Optimization Bib paperbook year journal title intstring address author title zip city street last name first name string select X.title from Bib._ X where X.*.zip = “12345” select X.title from Bib._ X where X.*.zip = “12345” select X.title from Bib.book X where X.address.zip = “12345” select X.title from Bib.book X where X.address.zip = “12345” Upper-bound schema [Fernandez, Suciu 1998]

89 Schema Extraction (From Data) Problem statement given data instance D find the “most specific” schema S for D In practice: S too large, need to relax [Nestorov, Abiteboul, Motwani 1998]

90 Schema Extraction: Sample Data &r &p8&p1&p2&p3&p4&p5&p6&p7 &c company employee worksfor manages managedby manages managedby

91 Lower Bound Schema Extraction Root &r Bosses &p1,&p4,&p6 Regulars &p2,&p3,&p5,&p7,&p8 Company &c company employee manages managedby worksfor employee

92 Upper Bound Schema Extraction: Data Guides Root &r Employees &p1,&p1,&p3,P4 &p5,&p6,&p7,&p8 Bosses &p1,&p4,&p6 Regulars &p2,&p3,&p5,&p7,&p8 Company &c company employee manages managedby manages managedby worksfor

93 Schemas in XML Document Type Definition (DTD) XML Schema

94 Document Type Definition: DTD part of the original XML specification an XML document may have a DTD terminology for XML: –well-formed: if tags are correctly closed –valid: if it has a DTD and conforms to it validation is useful in data exchange

95 DTDs as Grammars <!DOCTYPE paper [ ]> <!DOCTYPE paper [ ]> …

96 DTDs as Schemas Not so well suited: impose unwanted constraints on order references cannot be constrained can be too vague:

97 XML Schemas http://www.w3.org/TR/xmlschema-1/ 10/2000 generalizes DTDs uses XML syntax two documents: structure and datatypes –http://www.w3.org/TR/xmlschema-1 –http://www.w3.org/TR/xmlschema-2 XML-Schema is very complex –often criticized –some alternative proposals

98 XML Schemas DTD:

99 Elements v.s. Types in XML Schema DTD:

100 Types: –Simple types (integers, strings,...) –Complex types (regular expressions, like in DTDs) Element-type-element alternation: –Root element has a complex type –That type is a regular expression of elements –Those elements have their complex types... –... –On the leaves we have simple types

101 Local and Global Types in XML Schema Local type: [define locally the person’s type] Global type: [define here the type ttt] Global types: can be reused in other elements

102 Local v.s. Global Elements in XML Schema Local element:... Global element:... Global elements: like in DTDs

103 Regular Expressions in XML Schema Recall the element-type-element alternation: [regular expression on elements] Regular expressions: A B C = A B C A B C = A | B | C A B C = (A B C).. = (...)*.. = (...)?

104 Summary of XML Schema Formal Expressive Power: –Can express precisely the regular tree languages (over unranked trees) Lots of other stuff –Some form of inheritance –A “null” value –Large collection of data types

105 Summary of Schemas in SS data: –graph theoretic –data and schema are decoupled –used in data processing in XML –from grammar to object-oriented –schema wired with the data –emphasis on semantics for exchange

106 Part 4 Systems Issues Semistructured and XML data models Query languages Schemas Systems issues Conclusions

107 Systems Issues XML publishing (relational  XML) XML storage (XML  relational) XML processing (XML  XML) XML compression (XML  ) XML indexing (XML  )

108 XML Publishing Today: legacy data –fragmented into many flat relations –3rd normal form –schema is proprietary XML data –nested –un-normalized –schema designed by agreement

109 XML Publishing XML = view of Relational/OO/OR sources –Virtual view= mediator based publishing –Materialized view= warehouse based publishing Two systems: –SilkRoute Virtual [Fernandez, Suciu, Tan, 2000] Materialized [Fernandez, Morishima, Suciu, 2001] –Experanto Virtual [Shanmugasundaram, Tufte, DeWitt, Naughton, Maier, 2000] Materalized [Shanmugasundaram, Shekita, Barr, Carey, Lindsay, Pirahesh, Reinwald, 2000]

110 XML Publishing in SilkRoute relational database: virtual XML view: n1... n2... … StoreSBBook

111 XML Publishing in SilkRoute specify mediator declaratively (a view): from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid construct Store.name Book.title from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid construct Store.name Book.title

112 XML Publishing in SilkRoute users ask XML-QL queries: –find stores who sell “The Calculus” where $n The Calculus construct $n where $n The Calculus construct $n

113 XML Publishing in SilkRoute system composes query with view: from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus” construct Store.name from Store, SB, Book where Store.sid=SB.sid and SB.bid=Book.bid and Book.title=“The Calculus” construct Store.name

114 XML Storage Text file (XML) Use generic schema –[Florescu, Kossman 1999] Use DTD to derive schema –[Shanmugasundaram, et al. 1999] Use data mining to derive schema –[Deutsch, Fernandez, Suciu 1999] Build special purpose repository

115 XML Storage: Text File advantages –simple –less space than database !!! –reasonable clustering disadvantage –no updates –require special purpose query processor

116 &o1 &o3 &o2 &o4&o5 paper title author year &o6 “The Calculus”“…” “1986” XML Stoarge: Ternary Relation [Florescu, Kossman 1999] Ref Val

117 XML Storage: DTD to Schema DTD: Relational schema: Employee(eid, name, address) Address(aid, eid, street, city, state, zip) [Christophides, Abiteboul, Cluet, Scholl 1994] [Shanmugasundaram, Tufte, He, Zhang, DeWitt, Naughton 1999]

118 XML Storage: Data Mining to Schema paper author title year fn ln Paper1 Paper2 [Deutsch, Fernandez, Suciu 1999]

119 Query Evaluation Stream XML processing –Xscan [Halevy, Ives] –Niagara [Shanmugasundaram, Tufte, DeWitt, Naughton, Maier 2000] [Chen, DeWitt, Tian, Wang, 2000] XML updates [Halevy, Ives, Tatarinov, Weld, 2001] Query optimization [McHugh, Widom 99]

120 XScan Store and query approach: –Store XML on disk, then index & query –Cannot amortize storage costs Streaming data approach (XScan): –Read & parse –Track nodes by ID Index XML graph structure

121 XScan Illustrated here: pattern matching XML-QL pattern: Becomes: $x = (root)/doc/person $n = $x/name $a = $x/address $s = $a/street $z = $a/zip $n $s $z $n $s $z

122 XScan Built a tree of binding tables: $x elm1 elm2... $n $a $n Saves joins Can free space when not needed

123 Compression: The Problem XML for exchange (space or time) but XML is verbose users prefer application specific formats: –Web Server Logs –EMBL –G2 is XML doomed to fail ?

124 An Example:Web Server Logs 202.239.238.16|GET / HTTP/1.0|text/html|200|1997/10/01-00:00:02|-|4478|-|-|http://www.net.jp/|Mozilla/3.1[ja](I) 202.239.238.16 GET / HTTP/1.0 text/html 200 1997/10/01-00:00:02 4478 http://www.net.jp/ Mozilla/3.1$[$ja$]$(I) 202.239.238.16 GET / HTTP/1.0 text/html 200 1997/10/01-00:00:02 4478 http://www.net.jp/ Mozilla/3.1$[$ja$]$(I) ASCII File 15.9 MB (gzipped 1.6MB): XML-ized inflates to 24.2 MB (gzipped 2.1MB):

125 XMill [Liefke and Suciu’2000] specialized compressor for XML data makes XML look “small” www.research.att.com/sw/tools/xmill

126 How Xmill Works: Three Ideas...... 202.239.238.16 GET / HTTP/1.0 text/html 200 … 202.239.238.16 GET / HTTP/1.0 text/html 200 … gzip Structuregzip Data =1.75MB +1...... 202.23.23.16 224.42.24.55 … 202.23.23.16 224.42.24.55 … gzip Structuregzip Data1 =1.33MB +2 GET / HTTP/1.0 GET / HTTP/1.1 … GET / HTTP/1.0 GET / HTTP/1.1 … gzip Data2 + 3 gzip Structure + gzip c1(Data1) + gzip c2(Data2) +... =0.82MB

127 XML Compression

128 Compression Tradeoff

129 Indexing Semistructured Data coercions: 1995 v.s. “1995” regular path expressions –data guides [Goldman, Widom, 1997] –T-indexes [Milo, Suciu, 1999]

130 Indexing All Paths in the Data 1 23456 78910111213 ttttt abacadaab Semistructured Data 1 2 3 4 5 6 7 8 10 12 137 13 911 t a b c d Data Guide 1 2 3 4 5 6 7 138 10 12 911 t a acd b T-Index

131 Summary of Systems XML processing –With relational engines –With XML engines Mapping XML to relations is still not well understood Indexing work is mostly missing

132 Part 5 Conclusions Semistructured and XML data models Query languages Schemas Systems issues Conclusions

133 Summary XML = what is out there semistructured data = what we can process paradigm shift, for both Web and db covered in tutorial: – data models, queries, schemas

134 Current and Future Technologies Web applications possible today: –export relational data to XML (e.g. Oracle) –import XML into applications (DOM) Web applications in the future: –mediator technology (XML view) –store/process native XML data –mine/analyze XML

135 Why This Is Cool for Database Researchers put to work what you teach in CS101 ! –tree traversals (structural recursion, XSL) –automata theory (DTD’s, path expressions) –graph theory (simulation) adapt old DB techniques to novel data save the trees: from fax to XML The End

136 Further Readings www. w3.org/XML http://data.cs.washington.edu/xml/ http://cm.bell-labs.com/cm/cs/who/wadler/xml/ http://www.cs.washington.edu/homes/suciu Abiteboul, Buneman, Suciu Data on the Web: From Relational to Semistructured to XML Morgan Kaufmann, 1999


Download ppt "Managing XML Data Dan Suciu University of Washington."

Similar presentations


Ads by Google