1 From XML Schema to Relations: A Cost-Based Approach to XML Storage
  Presented by Xinwan Bian and Danyu Wu, 02-21-02



2 Introduction
  Where should XML documents be stored?
    XML database
    Object-oriented database
    Object-relational database
    Relational database

3 Difficulties of Storing XML Documents in a Relational Database
  XML has a more complex tree structure than flat relational tables.
  XML contains richer data types.
  Stored XML data must integrate with legacy tables.

4 Different Approaches to Schema Mapping
  Fixed XML-to-relational mappings
  Commercial RDBMS utility tools
  The Bell Laboratories cost-based approach

5 LegoDB, an XML storage mapping system
  Three design principles:
    Cost-based search
    Logical/physical independence
    Reuse of existing technology

6 The Basic Approach of LegoDB
  Create a p-schema for the input XML schema.
  Obtain cost estimates from data statistics and an XQuery workload.
  Explore alternative storage configurations and select the lowest-cost mapping found.

7 Architecture of the Mapping Engine
  Inputs: XML Schema, XML data statistics, XQuery workload
  Components: Generate Physical Schema, Physical Schema Transformation, Query/Schema Translation, Query Optimizer
  Intermediate results: PS0, PSi, RSi, cost(SQi)
  Output: Optimal Configuration
  RSi: Relational Schema/Queries/Stats
  PSi: Physical Schema

8 Questions
  What are its advantages?
  What are its disadvantages?

9 Example of P-Schema Creation
  (a) Initial XML schema
    type Show =
      show [@type[String],
            title[String],
            year[Integer],
            reviews[String]*,
            …]
  (b) P-schema
    type Show =
      show [@type[String],
            title[String],
            year[Integer],
            Reviews*,
            …]
    type Reviews =
      reviews[String]
  (c) Relational tables
    TABLE Show (show_id INT, type STRING, year INT)
    TABLE Review (Review_id, review String, parent_show INT)

10 What's a P-Schema?
  A physical schema (p-schema) extends an XML schema in two significant ways:
    It contains data statistics.
    It can be easily mapped into relational tables.

11 Example of P-Schema with statistics
  type Show = show [ @type[ String ], year[ Integer ], title[ String ], Review* ]
  type Review = review[ String ]
  Annotated items: Scalar, String, *

12 Stratified Physical Types
  Scalar type    s  ::= Integer | String | Boolean
  Physical type  ps ::= ps
  Named type     nt ::= X                    (type name)
                      | nt | nt              (choice)
                      | ∅                    (empty)
                      | nt{n,m,# repetition
  Optional type  ot ::= nt                   (named type)
                      | s                    (optional scalar)
                      | L[ot]                (optional element)
                      | ot, ot               (optional sequence)
                      | ()                   (empty)
  Physical type  pt ::= nt                   (named type)
                      | ot{0,1}              (optional type)
                      | s                    (scalar)
                      | L[pt]                (element)
                      | pt, pt               (sequence)
                      | ()                   (empty)
  Schema item    si ::= type X = pt          (type declaration)
  Schema            ::= schema Sn = si, si, … end   (schema)

13 Mapping of p-schema to relations
  Create one relation R_T for each type name T.
  For each R_T, create a key that stores the node id.
  For each R_T, create a foreign key to every relation R_PT such that PT is a parent type of T.
  Create a column in R_T for each sub-element of T that is a physical type.
  If the data type is contained within an optional type, the corresponding column can be null.
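The mapping rules above can be sketched in Python. This is a minimal illustration under stated assumptions, not LegoDB's implementation: the `PType` class, the `to_ddl` function, the SQL column types, and the choice to treat `year` as optional are all hypothetical and exist only to show each rule producing one part of a table definition.

```python
from dataclasses import dataclass, field


@dataclass
class PType:
    """Hypothetical, simplified stand-in for a p-schema type definition."""
    name: str
    # One (column name, SQL type, optional?) triple per scalar sub-element.
    columns: list = field(default_factory=list)
    # Names of the parent types that contain this type.
    parents: list = field(default_factory=list)


def to_ddl(ptype: PType) -> str:
    """Build a CREATE TABLE statement following the mapping rules above."""
    # Rule: one relation per type name, with a key that stores the node id.
    cols = [f"{ptype.name}_id INT PRIMARY KEY"]
    # Rule: one column per scalar sub-element; optional ones may be NULL.
    for col, sql_type, optional in ptype.columns:
        cols.append(f"{col} {sql_type}" + ("" if optional else " NOT NULL"))
    # Rule: one foreign key per parent type.
    for parent in ptype.parents:
        cols.append(f"parent_{parent} INT REFERENCES {parent}({parent}_id)")
    return f"CREATE TABLE {ptype.name} (\n  " + ",\n  ".join(cols) + "\n);"


# Assumed example types; `year` is marked optional only to show a nullable column.
show = PType("Show", [("type", "VARCHAR(32)", False),
                      ("title", "VARCHAR(255)", False),
                      ("year", "INT", True)])
review = PType("Review", [("review", "TEXT", False)], parents=["Show"])

if __name__ == "__main__":
    print(to_ddl(show))
    print(to_ddl(review))
```

Running the script prints one table definition per type; the `Review` table carries a `parent_Show` foreign key because `Show` is its parent type.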

14 More details of p-schema to relational mappings

15 Schema Transformations
  Advantages of performing transformations at the XML Schema level:
    Much of the XML Schema semantics is not present in a given relational schema.
    Rewriting is more natural at the XML level.
    The framework is more easily extensible to other, non-relational stores.

16 Inlining/Outlining Transformation
  One can either associate a type name with a given nested element (outlining) or nest its definition directly within its parent element (inlining).
  Outlined:
    type TV = seasons[Integer], Description, Episode*
    type Description = description[String]
  =>
  Inlined:
    type TV = seasons[Integer], description[String], Episode*
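As a rough sketch of the two rewritings, the TV example can be modelled with a toy schema representation. Everything here is an assumption made for illustration: the dictionary encoding, the `("elem", name, scalar)`, `("ref", Type)` and `("ref*", Type)` tuples, and the function names `inline` and `outline`. The sketch also assumes the inlined type is referenced only once, by the given parent.

```python
def inline(schema, parent, child):
    """Replace the reference to `child` inside `parent` with child's definition.

    Assumes `child` is referenced only by `parent`, so its definition can be dropped.
    """
    new_schema = {name: list(items) for name, items in schema.items() if name != child}
    body = []
    for item in schema[parent]:
        if item == ("ref", child):
            body.extend(schema[child])  # nest the definition directly
        else:
            body.append(item)
    new_schema[parent] = body
    return new_schema


def outline(schema, parent, element, new_type):
    """Give the nested `element` of `parent` its own named type `new_type`."""
    new_schema = {name: list(items) for name, items in schema.items()}
    is_target = lambda item: item[0] == "elem" and item[1] == element
    new_schema[new_type] = [item for item in schema[parent] if is_target(item)]
    new_schema[parent] = [("ref", new_type) if is_target(item) else item
                          for item in schema[parent]]
    return new_schema


# The outlined form of the TV example.
outlined = {
    "TV": [("elem", "seasons", "Integer"), ("ref", "Description"), ("ref*", "Episode")],
    "Description": [("elem", "description", "String")],
}

inlined = inline(outlined, "TV", "Description")
round_trip = outline(inlined, "TV", "description", "Description")
```

In this toy model the two operations are inverses: outlining `description` from the inlined schema gives back the outlined one.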

17 Union Factorization/Distribution Transformation
  The first law: (a, (b | c)) == (a, b | a, c)
  Factorized:
    type Show =
      show [@type[String], title[String], year[Integer],
            Aka{1,10}, Review*, (Movie | TV)]
    type Movie =
      box_office[Integer], video_sales[Integer]
    type TV =
      seasons[Integer], description[String], Episode*
  =>
  Distributed:
    type Show =
      show [(@type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             box_office[Integer], video_sales[Integer])
          | (@type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             seasons[Integer], description[String], Episode*)]
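The first law can be illustrated with sequences modelled as Python lists. This is only a sketch: the list-of-strings encoding and the function names `distribute` and `factorize` are assumptions for the example, and `factorize` handles only a shared prefix, which is the case shown on the slide.

```python
def distribute(prefix, alternatives):
    """(a, (b | c)) -> (a, b | a, c): copy the shared prefix into every branch."""
    return [prefix + alt for alt in alternatives]


def factorize(sequences):
    """(a, b | a, c) -> (a, (b | c)): pull out the longest common prefix."""
    prefix = []
    for items in zip(*sequences):
        if all(item == items[0] for item in items):
            prefix.append(items[0])
        else:
            break
    return prefix, [seq[len(prefix):] for seq in sequences]


common = ["@type[String]", "title[String]", "year[Integer]", "Aka{1,10}", "Review*"]
movie = ["box_office[Integer]", "video_sales[Integer]"]
tv = ["seasons[Integer]", "description[String]", "Episode*"]

distributed = distribute(common, [movie, tv])
factored = factorize(distributed)
```

Here `distributed` holds the two full alternatives of the Show type, and `factored` recovers the shared fields together with the Movie and TV branches.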

18 Corresponding relational configuration (inlining example)
  Outlined:
    TABLE TV (TV_id INT, seasons String, parent_show)
    TABLE Description (Description_id INT, description String, parent_TV)
  =>
  Inlined:
    TABLE TV (TV_id INT, seasons String, description String, parent_Show)

19 Union Factorization/Distribution (continued)
  The second law: a[t1 | t2] == a[t1] | a[t2]
  Before:
    type Show =
      show [(@type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             box_office[Integer], video_sales[Integer])
          | (@type[String], title[String], year[Integer],
             Aka{1,10}, Review*,
             seasons[Integer], description[String], Episode*)]
  =>
  After:
    type Show = (Show_Part1 | Show_Part2)
    type Show_Part1 =
      show [@type[String], title[String], year[Integer],
            Aka{1,10}, Review*,
            box_office[Integer], video_sales[Integer]]
    type Show_Part2 =
      show [@type[String], title[String], year[Integer],
            Aka{1,10}, Review*,
            seasons[Integer], description[String], Episode*]

20 Corresponding relational configurations
  Before:
    TABLE Show (Show_id INT, type String, title String, year INT)
    TABLE Movie (Movie_id INT, box_office INT, video_sales INT, parent_show INT)
    TABLE TV (TV_id INT, seasons INT, description String, parent_show INT)
  =>
  After:
    TABLE Show_Part1 (Show_Part1_id INT, type String, title String, year INT,
                      box_office INT, video_sales INT)
    TABLE Show_Part2 (Show_Part2_id INT, type String, title String, year INT,
                      seasons INT, description String)

21 Wildcard rewritings
  '~': any element name can be used.
  '~!a': any name but "a" can be used.
  Before:
    type Review = review[~[String]*]
  =>
  After:
    type Reviews = review[(NYTReview | OtherReview)*]
    type NYTReview = nyt[String]
    type OtherReview = (~!nyt)[String]
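The effect of the wildcard rewriting is that each review child element is routed to one of two types by its name. A minimal sketch of that name matching follows; the functions `matches` and `route` and the `REVIEW_RULES` table are hypothetical names introduced for the example, and only the two wildcard forms listed on the slide are handled.

```python
def matches(pattern, name):
    """Return True if element `name` is allowed by `pattern`.

    '~'   matches any element name,
    '~!a' matches any name except 'a',
    anything else must match exactly.
    """
    if pattern == "~":
        return True
    if pattern.startswith("~!"):
        return name != pattern[2:]
    return name == pattern


def route(name, rules):
    """Return the first type whose pattern accepts the element name."""
    for type_name, pattern in rules:
        if matches(pattern, name):
            return type_name
    raise ValueError(f"no type accepts element {name!r}")


# The rewritten Reviews type: nyt elements go to NYTReview, all others to OtherReview.
REVIEW_RULES = [("NYTReview", "nyt"), ("OtherReview", "~!nyt")]
```

For example, `route("nyt", REVIEW_RULES)` selects `NYTReview`, while any other element name, such as a made-up `suntimes`, selects `OtherReview`.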

22 XQuery Query Examples
  Q1: FOR $v IN imdb/show
      WHERE $v/year = 1999
      RETURN ($v/title, $v/year, $v/nyt_reviews)
  Q2: FOR $v IN imdb/show
      RETURN $v
  Q3: FOR $v IN imdb/show
      WHERE $v/title = c3
      RETURN $v/description
  Q4: FOR $v IN imdb/show
      RETURN { $v/title, $v/year,
               (FOR $e IN $v/episode
                WHERE $e/guest_director = c4
                RETURN $e) }

23 XQuery Workload Examples
  Publish = {Q1: 0.4, Q2: 0.4, Q3: 0.1, Q4: 0.1}
  Lookup = {Q1: 0.1, Q2: 0.1, Q3: 0.4, Q4: 0.4}
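One plausible way to turn such a workload into a single number is a weighted sum of per-query cost estimates. The sketch below assumes that reading; the slides do not spell out the formula. The per-query costs in `HYPOTHETICAL_COSTS` are invented purely for illustration and are not results from the paper.

```python
# Query weights copied from the two example workloads.
PUBLISH = {"Q1": 0.4, "Q2": 0.4, "Q3": 0.1, "Q4": 0.1}
LOOKUP = {"Q1": 0.1, "Q2": 0.1, "Q3": 0.4, "Q4": 0.4}

# Invented cost estimates for one storage configuration (arbitrary units).
HYPOTHETICAL_COSTS = {"Q1": 10.0, "Q2": 100.0, "Q3": 5.0, "Q4": 20.0}


def workload_cost(workload, query_costs):
    """Weighted sum of per-query costs (assumed cost model)."""
    return sum(weight * query_costs[query] for query, weight in workload.items())


publish_cost = workload_cost(PUBLISH, HYPOTHETICAL_COSTS)
lookup_cost = workload_cost(LOOKUP, HYPOTHETICAL_COSTS)
```

With these made-up numbers the same configuration is more expensive under Publish than under Lookup, because Publish puts most of its weight on the costly full-document query Q2.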

24 Search Algorithm
  Procedure GreedySearch
  Input: xSchema: XML schema, xWkld: query workload, xStats: data statistics
  Output: pSchema: an efficient physical schema
  begin
    minCost = infinity
    pSchema = GetInitialPhysicalSchema(xSchema)
    cost = GetPSchemaCost(pSchema, xWkld, xStats)
    while (cost < minCost) do
      minCost = cost
      pSchemaList = ApplyTransformations(pSchema)
      for each pSchema' ∈ pSchemaList do
        cost' = GetPSchemaCost(pSchema', xWkld, xStats)
        if cost' < cost then
          cost = cost'; pSchema = pSchema'
        endif
      endfor
    endwhile
    return pSchema
  end
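The procedure translates almost line for line into Python. In the sketch below the schema representation, the transformation generator, and the cost function are passed in as parameters, because the slides do not define them. The toy problem at the end (`ELEMENTS`, `toy_neighbors`, `toy_cost`) is entirely made up so the loop can be exercised.

```python
import math


def greedy_search(initial, neighbors, cost_of):
    """Greedy descent over physical schemas.

    initial   -- starting p-schema
    neighbors -- function returning the schemas one transformation away
    cost_of   -- function returning the estimated workload cost of a schema
    """
    schema = initial
    cost = cost_of(schema)
    min_cost = math.inf
    while cost < min_cost:
        min_cost = cost
        # Candidates are generated once per round, from that round's starting schema.
        for candidate in neighbors(schema):
            candidate_cost = cost_of(candidate)
            if candidate_cost < cost:
                cost, schema = candidate_cost, candidate
    return schema, cost


# Toy problem (hypothetical): a schema is the set of elements that are inlined.
ELEMENTS = ("aka", "description", "review")


def toy_neighbors(inlined):
    """Each neighbor inlines one more element."""
    return [inlined | {element} for element in ELEMENTS if element not in inlined]


def toy_cost(inlined):
    """Invented cost: inlining description or aka helps, inlining review hurts."""
    cost = 10
    if "description" in inlined:
        cost -= 3
    if "aka" in inlined:
        cost -= 2
    if "review" in inlined:
        cost += 4
    return cost


best_schema, best_cost = greedy_search(frozenset(), toy_neighbors, toy_cost)
```

The loop stops as soon as a full round of transformations fails to lower the cost, so it returns a local minimum rather than a guaranteed global optimum.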

25 Experimental Settings
  Two variations of the greedy search: greedy-so and greedy-si.
  Greedy-so:
    Initial physical schema: all elements outlined (except base types).
    During search: inlining transformations are applied.
  Greedy-si:
    Initial physical schema: all elements inlined (except elements with multiple occurrences).
    During search: outlining transformations are applied.

26 Efficiency of Greedy Search
  5 lookup queries and 3 publish queries

27 Results
  For lookup queries, greedy-so converges to the final configuration much faster.
  For publish queries, the opposite holds: greedy-si converges faster.

28 Reasons
  The traversals made by lookup queries are localized. The final configuration has only a few inlined elements, so greedy-so reaches it earlier than greedy-si.
  Publish queries traverse a larger number of elements. The final configuration has several inlined elements, so greedy-si reaches it earlier than greedy-so.

29 Sensitivity of configurations to varied workloads
  Create a spectrum of workloads that combine the lookup queries and publish queries in the ratio k : (1 - k), where k ∈ [0, 1] is the fraction of lookup queries in the workload.
  Three workloads, corresponding to k = 0.25, 0.50, and 0.75, result in three configurations.
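One way to read the k : (1 - k) ratio is as a split of total workload weight between the two query groups. The sketch below assumes that reading and also assumes the weight is spread evenly within each group, which the slides do not state. The query names `L1` to `L5` and `P1` to `P3` are placeholders for the 5 lookup and 3 publish queries.

```python
def mixed_workload(lookup_queries, publish_queries, k):
    """Give lookup queries total weight k and publish queries total weight 1 - k.

    Assumes both query lists are non-empty and that weight is shared
    evenly within each group.
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    weights = {}
    for query in lookup_queries:
        weights[query] = k / len(lookup_queries)
    for query in publish_queries:
        weights[query] = (1.0 - k) / len(publish_queries)
    return weights


# Placeholder names for the lookup and publish queries.
LOOKUP_QUERIES = ["L1", "L2", "L3", "L4", "L5"]
PUBLISH_QUERIES = ["P1", "P2", "P3"]

# The three workloads used to derive the three configurations.
WORKLOADS = {k: mixed_workload(LOOKUP_QUERIES, PUBLISH_QUERIES, k)
             for k in (0.25, 0.50, 0.75)}
```

Each resulting dictionary sums to 1 and could be fed to a cost-based search to obtain the configuration for that value of k.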

30 Figure 11: Sensitivity to variations in the workload

31 Inlining is a bad idea for some queries
  (a) The query does limited, localized traversals and/or does not access all the attributes involved.
  (b) The query has highly selective selection predicates.
  (c) The query involves a join of attributes that are not structurally adjacent in the XML Schema (e.g. actor and director).

32 Effectiveness of XML transformations: Union Distribution

33 Results of the union-transformed configuration
  The curves for C[0.25] and C[0.75] overlap with OPT.
  C[0.25] and C[0.75] cross at a small angle.
  C[All-inlined] performs 2 to 5 times worse than optimal.

34 Wildcards
  Find the NYTimes reviews for shows produced in 1999:

35 Questions
  The optimal mapping in this paper is cost-based. What else needs to be considered?

36 References
  P. Bohannon, J. Freire, P. Roy, and J. Siméon. From XML schema to relations: A cost-based approach to XML storage. Technical report, Bell Laboratories, 2001. Full version.
  A. Deutsch, M. Fernandez, and D. Suciu. Storing semi-structured data with STORED. In Proc. of SIGMOD, pp. 431-442, 1999.
  D. Florescu and D. Kossmann. A performance evaluation of alternative mapping schemas for storing XML in a relational database. Technical Report 3680, INRIA, 1999.
  M. Klettke and H. Meyer. XML and object-relational database system - enhancing structural mappings based on statistics. In Proc. of WebDB, pp. 63-68, 2000.
  A. Schmidt, M. Kersten, M. Windhouwer, and F. Waas. Efficient relational storage and retrieval of XML documents. In Proc. of WebDB, pp. 47-52, 2000.
  J. Shanmugasundaram, K. Tufte, G. He, C. Zhang, D. DeWitt, and J. Naughton. Relational databases for querying XML documents: Limitations and opportunities. In Proc. of VLDB, pp. 302-314, 1999.

