1 Lecture 15 Monday, May 20, 2002 Size Estimation, XML Processing.

2 Estimating Sizes
We need sizes in order to estimate cost, where T(E) is the number of tuples and B(E) the number of blocks of E. Example:
– The cost of a partitioned hash-join E1 ⋈ E2 is 3B(E1) + 3B(E2)
– B(E1) = T(E1) / block size
– B(E2) = T(E2) / block size
– So we need to estimate T(E1) and T(E2)
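The cost formula above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, "block size" is assumed to mean tuples per block, and rounding up to whole blocks is an assumption the slide does not state.

```python
import math

def blocks(num_tuples: int, tuples_per_block: int) -> int:
    """B(E): blocks needed to hold T(E) tuples (rounded up; an assumption)."""
    return math.ceil(num_tuples / tuples_per_block)

def partitioned_hash_join_cost(t_e1: int, t_e2: int, tuples_per_block: int) -> int:
    """I/O cost 3B(E1) + 3B(E2) of a partitioned hash-join."""
    return 3 * blocks(t_e1, tuples_per_block) + 3 * blocks(t_e2, tuples_per_block)

# Hypothetical inputs: 10,000 and 20,000 tuples, 100 tuples per block.
print(partitioned_hash_join_cost(10_000, 20_000, 100))
```

The cost depends on the inputs only through their tuple counts, which is why the following slides concentrate on estimating T.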

3 Size Estimation
Crucial for the optimizer's performance. Idea:
– Keep statistics for base tables
– Use heuristics to compute the size of each operator's result
– Works well for the first 1-2 levels of the plan, then it is way off
– Hence: don't expect optimizers to return the best plan, only to avoid the worst plans

4 Estimating Sizes
Estimating the size of a projection is easy: T(π_L(R)) = T(R).
This is because a projection doesn't eliminate duplicates.

5 Estimating Sizes
Estimating the size of a selection, where V(R,A) is the number of distinct values of A in R:
S = σ_{A=c}(R)
– T(S) can be anything from 0 to T(R) - V(R,A) + 1
– Mean value: T(S) = T(R) / V(R,A)
S = σ_{A<c}(R)
– T(S) can be anything from 0 to T(R)
– Heuristic: T(S) = T(R) / 3
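The selection estimates above translate directly into code. A minimal sketch, with hypothetical function names and made-up statistics in the example call:

```python
def estimate_eq_selection(t_r: int, v_r_a: int) -> float:
    """Mean-value estimate for sigma_{A=c}(R): T(R) / V(R,A)."""
    return t_r / v_r_a

def max_eq_selection(t_r: int, v_r_a: int) -> int:
    """Upper bound for sigma_{A=c}(R): every other value occurs at least once."""
    return t_r - v_r_a + 1

def estimate_range_selection(t_r: int) -> float:
    """Heuristic estimate for sigma_{A<c}(R): T(R) / 3."""
    return t_r / 3

# Hypothetical statistics: T(R) = 10000, V(R,A) = 100.
print(estimate_eq_selection(10_000, 100), max_eq_selection(10_000, 100))
```

The gap between the mean-value estimate and the upper bound shows how far off a single estimate can be when the data is skewed.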

6 Estimating Sizes
Estimating the size of a natural join, R ⋈_A S:
– When the sets of A values are disjoint, T(R ⋈_A S) = 0
– When A is a key in S and a foreign key in R, T(R ⋈_A S) = T(R)
– When A has a single value, the same in R and S, T(R ⋈_A S) = T(R) T(S)

7 Estimating Sizes
Assumptions:
– Containment of values: if V(R,A) <= V(S,A), then the set of A values of R is included in the set of A values of S
  (Note: this indeed holds when A is a foreign key in R and a key in S)
– Preservation of values: for any other attribute B, V(R ⋈_A S, B) = V(R, B) (or V(S, B))

8 Estimating Sizes
Assume V(R,A) <= V(S,A). Then each tuple t in R joins some tuple(s) in S.
– How many?
– On average T(S) / V(S,A)
– So t contributes T(S) / V(S,A) tuples to R ⋈_A S
Hence T(R ⋈_A S) = T(R) T(S) / V(S,A)
In general: T(R ⋈_A S) = T(R) T(S) / max(V(R,A), V(S,A))

9 Estimating Sizes
Example: T(R) = 10000, T(S) = 20000, V(R,A) = 100, V(S,A) = 200
How large is R ⋈_A S?
Answer: T(R ⋈_A S) = 10000 × 20000 / 200 = 1M
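The general join formula can be checked against this example with a short sketch. The function name is hypothetical, and T(S) = 20000 is taken as the value consistent with the stated answer of 1M.

```python
def estimate_join_size(t_r: int, t_s: int, v_r_a: int, v_s_a: int) -> float:
    """T(R join_A S) = T(R) * T(S) / max(V(R,A), V(S,A))."""
    return t_r * t_s / max(v_r_a, v_s_a)

# T(R) = 10000, V(R,A) = 100, V(S,A) = 200; T(S) = 20000 is assumed.
print(estimate_join_size(10_000, 20_000, 100, 200))
```

Swapping the roles of R and S gives the same estimate, since the formula only uses the larger of the two distinct-value counts.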

10 Estimating Sizes
Joins on more than one attribute:
T(R ⋈_{A,B} S) = T(R) T(S) / ( max(V(R,A), V(S,A)) × max(V(R,B), V(S,B)) )
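The multi-attribute formula generalizes to any number of join attributes by multiplying one max(...) factor per attribute. A sketch under that reading, with hypothetical names and made-up statistics:

```python
from math import prod

def estimate_multi_attr_join_size(t_r: int, t_s: int, v_r: dict, v_s: dict) -> float:
    """v_r / v_s map each attribute to V(R, attr) / V(S, attr).

    The join attributes are taken to be the attributes present in both maps.
    """
    shared = v_r.keys() & v_s.keys()
    denominator = prod(max(v_r[a], v_s[a]) for a in shared)
    return t_r * t_s / denominator

# Hypothetical statistics for a join on A and B.
size = estimate_multi_attr_join_size(
    10_000, 20_000, {"A": 100, "B": 50}, {"A": 200, "B": 40}
)
print(size)
```

With no shared attributes the denominator is 1, so the estimate degenerates to the size of the cross product.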

11 Histograms
Statistics on the data, maintained by the RDBMS.
They make size estimation much more accurate (hence cost estimates are more accurate too).

12 Histograms
Employee(ssn, name, salary, phone)
Maintain a histogram on salary: T(Employee) = 25000, but now we know the distribution.
Salary: 0..20k | 20k..40k | 40k..60k | 60k..80k | 80k..100k | > 100k
Tuples: …

13 Histograms
Ranks(rankName, salary)
Estimate the size of Employee ⋈_Salary Ranks, given one salary histogram per relation over the same buckets:
Salary:   0..20k | 20k..40k | 40k..60k | 60k..80k | 80k..100k | > 100k
Employee: …
Ranks:    …

14 Histograms
Assume:
– V(Employee, Salary) = 200
– V(Ranks, Salary) = 250
Then T(Employee ⋈_Salary Ranks) = Σ_{i=1..6} T_i T_i' / 250 = (200 × … + … + … × 2) / 250 = …
where T_i and T_i' are the tuple counts of bucket i in the Employee and Ranks histograms.
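The per-bucket sum can be sketched as follows. The bucket counts below are hypothetical placeholders chosen only so that the Employee counts add up to T(Employee) = 25000; they are not recovered from the slide, and the function name is likewise an assumption.

```python
def estimate_join_from_histograms(hist_r, hist_s, max_distinct):
    """Sum of T_i * T_i' over aligned buckets, divided by max(V(R,A), V(S,A))."""
    if len(hist_r) != len(hist_s):
        raise ValueError("histograms must use the same buckets")
    return sum(t * t_prime for t, t_prime in zip(hist_r, hist_s)) / max_distinct

# Hypothetical counts for the six salary buckets (illustration only).
employee = [200, 800, 5000, 12000, 6500, 500]
ranks = [8, 20, 40, 80, 100, 2]
print(estimate_join_from_histograms(employee, ranks, max(200, 250)))
```

Because tuples in different buckets can never join, the histogram-based estimate only multiplies counts within the same bucket, which is what makes it tighter than T(R) T(S) / max(V(R,A), V(S,A)) on the whole relations.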

15 XML Processing
– XML publishing
– XML storage
– XML query rewriting
– XML transport and compression

16 XML Publishing
XML view defined declaratively:
– SQL extensions [Exodus]
– RXL [SilkRoute]
Virtual XML publishing:
– Accept XML queries (e.g. XML-QL), translate to SQL
– Main issue: compose queries
Materialized XML publishing:
– Compute the entire XML view, which is large!
– Main issue: compute a large query efficiently

17 Virtual XML Publishing
Legacy data in E/R (diagram): Eu-Stores, US-Stores, Products, Eu-Sales, US-Sales, with attributes name, country, url, date, tax, priceUSD, euSid, usSid, pid.

18 Virtual XML Publishing
XML view (example instance): France, Nicolas, Blanc de Blanc, 10/10/ … … …. …
In summary: group by country → store → product-sale

19 Output "schema" (tree diagram):
allsales → country*
country → name, store*
store → name, url?, sale*
sale → name, price, date, tax?
Leaves are PCDATA.

20 Virtual XML Publishing
In SilkRoute:
  { let $cl = distinct-values(db/EuStores/country/text())
    for $c in $cl
    return $c/name
      { for $s in db/EuStores[country/text()=$c]
        return $s/name/text()
          { for $l in db/EuSales[euSid=$s/euSid],
                $p in db/Products[pid=$l/pid]
            return $p/name/text() $l/date/text() $p/priceUSD/text() }
  union

21 Virtual XML Publishing
  …. /* union */
  USA
  for $s in db/USStores
  return $s/name/text() $s/url/text()
    for $l in db/USSales[usSid=$s/usSid],
        $p in db/Products[pid=$l/pid]
    return $p/name/text() $l/date/text() $p/priceUSD/text() $l/tax/text()

22 Internal Representation
View tree: allsales → country* → (name, store*); store → (name, url?, sale*); sale → (name, price, date, tax?)
Each node carries an SQL fragment:
– country(c): select c.country from (select distinct country from EuStores) c
– store: from EuStores x where x.country=c.country
– sale: from EuSales y, Products z where x.euSid=y.euSid and y.pid = z.pid
– leaves: select x.name, select z.name, select z.price
The full SQL for a node combines the fragments on the path from the root. For price:
  select z.price
  from (select distinct country from EuStores) c, EuStores x, EuSales y, Products z
  where x.country=c.country and x.euSid=y.euSid and y.pid = z.pid

23 Virtual XML Publishing
– Don't compute the XML data yet
– Users ask XML queries
– The system composes each query with the view and sends the result to the RDBMS
– Main issue: compose queries

24 XML Publishing: Virtual View in SilkRoute
Find the names and URLs of all stores that sold on 1/1/2000 (in an XML-QL / XQuery melange):
  for $s in /allsales/country/store[sale/date/text()="1/1/2000"]
  return $s/name/text() $s/url

25 Query Composition
Result (in theory…):
  ( SELECT S.name, S.url
    FROM USStores S, USSales L, Products P
    WHERE S.usSid=L.usSid AND L.pid=P.pid AND L.date='1/1/2000' )
  UNION
  ( SELECT S2.name, S2.url
    FROM EUStores S1, EUSales L1, Products P1,
         USStores S2, USSales L2, Products P2
    WHERE S1.euSid=L1.euSid AND L1.pid=P1.pid AND L1.date='1/1/2000'
      AND S2.usSid=L2.usSid AND L2.pid=P1.pid
      AND S1.country='USA' AND S1.euSid = S2.usSid )

26 Complexity of XML Publishing
But in practice: 5-7 times more joins!
– Need query minimization
Could this be avoided?
– No: it is NP-hard

27 XML Publishing Is NP-Hard
View tree: customer → order?, complaint? (both PCDATA), with order() :- Q1 and complaint() :- Q2
XML query: WHERE $x $y RETURN ( )
The composed SQL query is: Q1 JOIN Q2
Minimizing it is NP-hard! (can be shown…)

28 Materialized XML Publishing
"Efficiently Publishing Relational Data as XML Documents", Shanmugasundaram et al., VLDB’2001.
Considers several alternatives, both inside and outside the engine.

29 Materialized XML Publishing
Create the structure (i.e. nesting):
– Early
– Late
Add tags:
– Early
– Late
Do this:
– Inside the relational engine
– Outside the relational engine
Note: tags may be added only after structuring has completed.

30 Example
  for $S in db/EuStores
  return $S/name
    for $O in db/Owners
    where $S/oID = $O/oID
    return $O/name
    for $L in db/EuSales, $P in db/Products
    where $S/euSid = $L/euSid AND $L/pid = $P/pid
    return $P/name $P/priceUSD

31 Early Structuring, Early Tagging: the stored procedure approach
Advantage: very simple
Disadvantage: multiple SQL queries submitted
  XMLObject result = “ ”
  SQLCursor C1 = “select S.euSid, S.oID, S.name from EuStores S”
  FOR x IN C1 DO
    result = result + “ ” + x.name + “ ”
    SQLCursor C2 = “select O.name from Owners O where O.oID=%x.oID”
    FOR y IN C2 DO
      result = result + “ ” + y.name + “ ”
    SQLCursor C3 = “select P.name, P.priceUSD from... Where...”
    FOR z IN C3 DO
      result = result + “ ” + z.name +...
  result = result + “ ”
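The stored procedure approach can be sketched outside the engine with Python's `sqlite3` module. Everything specific here is an assumption made for illustration: the element names (`stores`, `store`, `owner`, `product`, `price`), the column types, and the sample rows in `demo_db`.

```python
import sqlite3
from xml.sax.saxutils import escape

def publish_stores(conn: sqlite3.Connection) -> str:
    """Nested-loop publishing: one outer query, two inner queries per store."""
    parts = ["<stores>"]
    stores = conn.execute(
        "SELECT euSid, oID, name FROM EuStores ORDER BY euSid").fetchall()
    for eu_sid, o_id, store_name in stores:
        parts.append(f"<store><name>{escape(store_name)}</name>")
        owners = conn.execute(
            "SELECT name FROM Owners WHERE oID = ? ORDER BY name", (o_id,))
        for (owner_name,) in owners:
            parts.append(f"<owner>{escape(owner_name)}</owner>")
        products = conn.execute(
            "SELECT P.name, P.priceUSD FROM EuSales L JOIN Products P "
            "ON L.pid = P.pid WHERE L.euSid = ? ORDER BY P.name", (eu_sid,))
        for product_name, price in products:
            parts.append(f"<product><name>{escape(product_name)}</name>"
                         f"<price>{price}</price></product>")
        parts.append("</store>")
    parts.append("</stores>")
    return "".join(parts)

def demo_db() -> sqlite3.Connection:
    """Tiny in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE EuStores(euSid INTEGER, oID INTEGER, name TEXT);
        CREATE TABLE Owners(oID INTEGER, name TEXT);
        CREATE TABLE Products(pid INTEGER, name TEXT, priceUSD INTEGER);
        CREATE TABLE EuSales(euSid INTEGER, pid INTEGER);
        INSERT INTO EuStores VALUES (1, 10, 'Nicolas'), (2, 11, 'Cave');
        INSERT INTO Owners VALUES (10, 'Anne');
        INSERT INTO Products VALUES (100, 'Blanc de Blanc', 12);
        INSERT INTO EuSales VALUES (1, 100);
    """)
    return conn

print(publish_stores(demo_db()))
```

With N stores this issues 1 + 2N SQL queries, which is the disadvantage the slide points out.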

32 Early Structuring, Early Tagging: the correlated CLOB approach
Still nested loops...
Creates large CLOBs, a problem for the engine
  select XMLAGG(STORE(S.name,
           XMLAGG(OWNER(select O.oID
                        from Owners O
                        where S.oID = O.oID)),
           XMLAGG(PRODUCT(select P.name, P.priceUSD
                          from EuSales L, Products P
                          where S.euSid = L.euSid AND L.pid = P.pid))))
  from EuStores S

33 Early Structuring, Early Tagging: the de-correlated CLOB approach
  GroupBy euSid and XMLAGG (
    EuStores S1 LEFT OUTER JOIN Owners O ON S1.oId = O.oId)
  JOIN
  GroupBy euSid and XMLAGG (
    EuStores S2 LEFT OUTER JOIN
      ( SELECT L.euSid, P.name, P.priceUSD
        FROM EuSales L, Products P
        WHERE L.pid = P.pid ) L
    ON S2.euSid = L.euSid)
  ON S1.euSid = S2.euSid

34 Early Structuring, Early Tagging: the de-correlated CLOB approach
– Modify the engine to do the group-bys and the tagging
– Better than nested loops (why?)
– Still large CLOBs

35 Late Tagging
Idea: create a flat table first, then nest and tag.
The flat table consists of outer joins and outer unions:
– Unsorted → late structuring
– Sorted → early structuring

36 Review of Outer Joins and Outer Unions
Left outer join, e.g. R(A,B) left outer join S(B,C) = T(A,B,C)

  R:  A   B        S:  B   C        T:  A   B   C
      a1  b1           b1  c1           a1  b1  c1
      a2  b2           b1  c2           a1  b1  c2
      a3  b3           b3  c3           a2  b2  -
                                        a3  b3  c3
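A minimal sketch of the left outer join on the example tables above, using a hash table on S.B. The function name is hypothetical and `None` stands in for the "-" (null) entries.

```python
def left_outer_join(r, s):
    """R(A,B) left outer join S(B,C) on B; unmatched R tuples get C = None."""
    by_b = {}
    for b, c in s:
        by_b.setdefault(b, []).append(c)
    result = []
    for a, b in r:
        matches = by_b.get(b)
        if matches:
            result.extend((a, b, c) for c in matches)
        else:
            result.append((a, b, None))  # keep the R tuple, pad with null
    return result

R = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
S = [("b1", "c1"), ("b1", "c2"), ("b3", "c3")]
for row in left_outer_join(R, S):
    print(row)
```

Every tuple of R appears at least once in the result, which is what lets a store with no owners or no sales still show up in the published XML.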

37 Review of Outer Joins and Outer Unions
Outer union, e.g. R(A,B) outer union S(A,C) = T(Tag, A, B, C)

  R:  A   B        S:  A   C        T:  Tag  A   B   C
      a1  b1           a3  c3           1    a1  b1  -
      a2  b2           a4  c4           1    a2  b2  -
                       a5  c5           2    a3  -   c3
                                        2    a4  -   c4
                                        2    a5  -   c5
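The outer union on the example tables above can be sketched in the same style. The function name is hypothetical and `None` again stands in for "-".

```python
def outer_union(r, s):
    """R(A,B) outer union S(A,C): tag each row with its source, pad with None."""
    rows = [(1, a, b, None) for a, b in r]
    rows += [(2, a, None, c) for a, c in s]
    return rows

R = [("a1", "b1"), ("a2", "b2")]
S = [("a3", "c3"), ("a4", "c4"), ("a5", "c5")]
for row in outer_union(R, S):
    print(row)
```

The tag column records which input a row came from, so a later tagging step knows which kind of element to emit for it.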

38 Late Tagging, Late Structuring
Construct the table:
  (EuStores LEFT OUTER JOIN Owners)
  OUTER UNION
  (EuStores LEFT OUTER JOIN EuSales JOIN Products)
Tagging:
– Use a main-memory hash table to group elements on store ID

39 Late Tagging, Early Structuring
Same table, but now sort by store ID and tag:
  (EuStores LEFT OUTER JOIN Owners)
  OUTER UNION
  (EuStores LEFT OUTER JOIN EuSales JOIN Products)
  ORDER BY euSid, tag
Constant-space tagger
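A constant-space tagger over the sorted table might look like the sketch below. The row layout `(euSid, tag, store_name, owner, product, price)`, the element names, and the sample rows are all assumptions for illustration, and text values are not XML-escaped to keep the sketch short.

```python
def tag_sorted_rows(rows):
    """Stream XML fragments from rows sorted by (euSid, tag).

    Only the current store ID is remembered, so memory use is constant.
    """
    current = None
    for eu_sid, tag, store_name, owner, product, price in rows:
        if eu_sid != current:
            if current is not None:
                yield "</store>"          # close the previous store
            yield f"<store><name>{store_name}</name>"
            current = eu_sid
        if tag == 1 and owner is not None:
            yield f"<owner>{owner}</owner>"
        elif tag == 2 and product is not None:
            yield (f"<product><name>{product}</name>"
                   f"<price>{price}</price></product>")
    if current is not None:
        yield "</store>"

# Hypothetical outer-union rows, already ordered by euSid, tag.
rows = [
    (1, 1, "Nicolas", "Anne", None, None),
    (1, 2, "Nicolas", None, "Blanc de Blanc", 12),
    (2, 1, "Cave", None, None, None),
    (2, 2, "Cave", None, None, None),
]
print("".join(tag_sorted_rows(rows)))
```

Without the ORDER BY, rows for one store could arrive interleaved with others, and the tagger would need the main-memory hash table of the late-structuring variant to group them.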

40 Materialized XML Publishing
SilkRoute, SIGMOD’2001
– The outer union / outer join query is large
– It is hard for some RDBMSs to optimize
– Split it into smaller queries, then merge-sort the tuple streams
– Idea: use the view tree; each partition defines a plan

41 View Tree (diagram): nodes allsales, country, name, store, sale, sold, date, tax, url, with edges labeled * or ?. The tree is partitioned into four queries:
Q1 = ... join
Q2 = ... left outer join
Q3 = ... join
Q4 = ... join

42 In general:
– A "1" edge corresponds to a join
– A "*" edge corresponds to a left outer join
– There are 2^n possible plans
Choose the best plan using heuristics.
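The plan space can be enumerated with a short sketch. It assumes that n counts the edges of the view tree and that a plan is obtained by either keeping each edge inside one SQL query or cutting it, so that every connected component becomes one query; the function name and the four-node tree are hypothetical.

```python
from itertools import product

def enumerate_plans(edges):
    """Yield one plan per subset of kept edges.

    A plan is a sorted list of node groups; each group is evaluated
    as a single SQL query and the results are merged afterwards.
    """
    nodes = sorted({n for edge in edges for n in edge})
    for keep in product([False, True], repeat=len(edges)):
        parent = {n: n for n in nodes}  # simple union-find forest

        def find(n):
            while parent[n] != n:
                n = parent[n]
            return n

        for (p, c), kept in zip(edges, keep):
            if kept:
                parent[find(c)] = find(p)  # merge child group into parent
        groups = {}
        for n in nodes:
            groups.setdefault(find(n), []).append(n)
        yield sorted(sorted(g) for g in groups.values())

# Hypothetical view tree with n = 3 edges.
edges = [("allsales", "country"), ("country", "store"), ("store", "sale")]
plans = list(enumerate_plans(edges))
print(len(plans))
```

The count doubles with every extra edge, which is why the slide falls back on heuristics rather than costing every plan.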