Publishing Relational Data in XML
David McWherter

What's the issue?
● All Important Data
  – Relational Models
● All Important Tools
  – Relational Models
● Inter-Business Data Exchange
  – XML Graph Models
● What's in the Kool-Aid?
  – XML Graph Models
Solution: Make DB2 Output XML

XML Model vs Relational Model
● Hoho
● Customer( id, name, acctset )
  AccountSet( id, acctid )
  Account( id, name )
  ...
● Relational model: foreign-key relationships
● XML model: containment relationships
Easiest Solution: Too Easy; Produces Unnatural XML

Outline
● Procedure:
  – INPUT: Relational data
  – Add Tags (convert to XML text)
  – Add Nested Structure
  – OUTPUT: XML
● How do we add Tags and Structure?
  – Inside/Outside the DBMS?
  – Early/Late in the query?
● Evaluate simple algorithms
  – Find interesting tradeoffs

Algo. Parameter Space
● Early Tags, Early Structure
  – DBMS queries follow XML structure
  – DBMS munges XML fragments
● Early Tags, Late Structure
  – DBMS can choose joins
  – DBMS munges XML fragments
● Late Tags, Late Structure
  – DBMS can choose joins
  – DBMS returns tuples
  – Postprocessing XMLifies output tuples

Early Tags, Early Structure, Outside
● Early Tags: tags are output ASAP
● Early Structure: queries follow the form of the result
● Stored procedures
  – C=select * from customers;
  – Foreach c in C {
    ● A=Select * from accounts where cid=c
    ● P=Select * from porders where cid=c order by date
    ● Foreach p in P {
      – Select * from items in porder_items where pid=p;
      – ...
● Stitch together XML as the queries run
● Used Everywhere, But
  – Forces a nested-loop join
  – Thousands of queries (one per tuple)

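The stored-procedure loop can be sketched as follows. This is only an illustration, assuming Python's built-in sqlite3 in place of DB2 and a made-up four-table schema with sample rows; `build_demo_db` and `publish_nested_loops` are hypothetical names. It counts the queries it issues to make the per-tuple cost visible.

```python
import sqlite3
from xml.sax.saxutils import escape

def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE accounts(id INTEGER PRIMARY KEY, cid INTEGER, acctnum TEXT);
        CREATE TABLE porders(id INTEGER PRIMARY KEY, cid INTEGER, date TEXT);
        CREATE TABLE porder_items(id INTEGER PRIMARY KEY, pid INTEGER, name TEXT);
        INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO accounts VALUES (10, 1, 'A-10'), (11, 1, 'A-11');
        INSERT INTO porders VALUES (100, 1, '2001-02-01'), (101, 1, '2001-01-01');
        INSERT INTO porder_items VALUES (1000, 100, 'nut'), (1001, 101, 'bolt');
    """)
    return db

def publish_nested_loops(db):
    """Emit XML while issuing one query per parent tuple.

    Returns (xml_text, number_of_queries_issued).
    """
    out, queries = [], 0

    def run(sql, *args):
        nonlocal queries
        queries += 1
        return db.execute(sql, args).fetchall()

    for cid, cname in run("SELECT id, name FROM customers ORDER BY id"):
        out.append(f"<customer><name>{escape(cname)}</name>")
        for (acctnum,) in run(
                "SELECT acctnum FROM accounts WHERE cid = ? ORDER BY id", cid):
            out.append(f"<account>{escape(acctnum)}</account>")
        for pid, date in run(
                "SELECT id, date FROM porders WHERE cid = ? ORDER BY date", cid):
            out.append(f'<porder date="{escape(date)}">')
            for (item,) in run(
                    "SELECT name FROM porder_items WHERE pid = ? ORDER BY id", pid):
                out.append(f"<item>{escape(item)}</item>")
            out.append("</porder>")
        out.append("</customer>")
    return "".join(out), queries
```

The application, not the DBMS, fixes the join order here: every inner query is re-run once per outer tuple, which is the forced nested-loop join the slide complains about.
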
Early Tags, Early Structure
● Correlated CLOBs
  – Build a horrible nested SQL query.
  – Aggregate function XML() to preserve XML document order
  – Constructors for XML formatting
● Select cust.name, CUST(cust.id, cust.name,
    (select XML(ACCT(acct.id, acct.acctnum)) from Acct...),
    (select XML(PORD(...))))
● CUST(cid, cname, acctlst, porderlst ) => {cname} {acctlst} {porderlst}

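A rough analogue of the correlated-CLOB query, assuming SQLite rather than DB2: string concatenation (`||`) stands in for the CUST/ACCT/PORD constructors and `group_concat` stands in for the XML() aggregate. The schema, data, and function name are hypothetical. Unlike the XML() aggregate on the slide, `group_concat` without an explicit ordering does not promise element order, and this sketch does no XML escaping.

```python
import sqlite3

# One correlated scalar subquery per child list; each returns a text fragment.
CORRELATED_SQL = """
SELECT '<customer><name>' || c.name || '</name>'
       || COALESCE((SELECT group_concat('<account>' || a.acctnum || '</account>', '')
                    FROM accounts a WHERE a.cid = c.id), '')
       || COALESCE((SELECT group_concat('<porder>' || p.date || '</porder>', '')
                    FROM porders p WHERE p.cid = c.id), '')
       || '</customer>'
FROM customers c
ORDER BY c.id
"""

def publish_correlated():
    """Return one XML text fragment per customer, built inside the database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE accounts(id INTEGER PRIMARY KEY, cid INTEGER, acctnum TEXT);
        CREATE TABLE porders(id INTEGER PRIMARY KEY, cid INTEGER, date TEXT);
        INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO accounts VALUES (10, 1, 'A-10'), (11, 1, 'A-11');
        INSERT INTO porders VALUES (100, 1, '2001-01-01');
    """)
    # COALESCE turns the NULL from an empty child list into an empty string,
    # so a customer with no accounts or orders still yields a fragment.
    return [row[0] for row in db.execute(CORRELATED_SQL)]
```

A single statement replaces the per-tuple queries, but each subquery is still evaluated once per customer and the fragments are copied on every concatenation.
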
Early Tags, Early Structure (cont.)
● Correlated CLOBs
  – Good: Fewer queries
  – Bad:
    ● Forces a nested-loop join
    ● Copying and concatenation of CLOBs
● Decorrelated CLOBs
  – Find all table paths from XML root to leaves
    ● cust->acct; cust->porder->item; cust->porder->paymt
  – Join these tables and make XML fragments
    ● Reuse common join subexpressions
  – Join XML fragments on parent keys

Late Tags, Late Structure
● Stupidest Solution
  – Join all the tables
    ● Cust ⋈ Paymt
  – Return tuples to app/DBMS
  – Filter and tag the result
● Good:
  – Free join order
  – No CLOB munging
● Bad:
  – Too much redundancy
    ● E.g., Cust fields copied everywhere

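To see the redundancy concretely, here is a small sketch assuming SQLite and a made-up schema (table names, columns, and rows are all hypothetical). One customer with two accounts and one order of three items comes back as six wide rows, each repeating the customer fields.

```python
import sqlite3

def redundant_join_rows():
    """Join every table into one wide result and return its rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE cust(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE acct(id INTEGER PRIMARY KEY, cid INTEGER, acctnum TEXT);
        CREATE TABLE porder(id INTEGER PRIMARY KEY, cid INTEGER);
        CREATE TABLE item(id INTEGER PRIMARY KEY, pid INTEGER, name TEXT);
        INSERT INTO cust VALUES (1, 'Ann');
        INSERT INTO acct VALUES (10, 1, 'A-10'), (11, 1, 'A-11');
        INSERT INTO porder VALUES (100, 1);
        INSERT INTO item VALUES (1000, 100, 'nut'), (1001, 100, 'bolt'),
                                (1002, 100, 'screw');
    """)
    # Accounts and items are unrelated siblings under the customer, so the
    # join produces their cross product: 2 accounts x 3 items = 6 rows.
    return db.execute("""
        SELECT c.id, c.name, a.acctnum, p.id, i.name
        FROM cust c
        LEFT JOIN acct a ON a.cid = c.id
        LEFT JOIN porder p ON p.cid = c.id
        LEFT JOIN item i ON i.pid = p.id
    """).fetchall()
```

The tagger then has to filter the duplicates back out, which is the cost paid for leaving the join order entirely to the optimizer.
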
Late Tags, Late Structure
● Path-Outer-Union
  – Start with Decorrelated CLOB Joins
    ● (cust,acct), (cust,porder,item), (cust,porder,paymt)
  – Return sets of tuples, not XML:
    ● (xml-level, col1, col2, col3, ...)
      – (0, custid, cust.b, null, null, null, ...)
      – (1, custid, null, acct.a, acct.b, null, ...)
      – (2, custid, null, null, null, porder.a, paymt.a, ...)
  – Tag the result
● Good:
  – No CLOB munging
● Bad:
  – So many nulls (need null-compression)

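A minimal sketch of the outer-union shape, assuming SQLite and a hypothetical schema. Each path contributes rows padded with NULLs to a common width, and the first column records the level. The column layout is simplified relative to the slide (items and payments get separate levels), so treat it as an illustration rather than the exact tuple format.

```python
import sqlite3

OUTER_UNION_SQL = """
SELECT 0 AS lvl, c.id AS custid, c.name AS cname,
       NULL AS acctnum, NULL AS pid, NULL AS item, NULL AS paymt
FROM customers c
UNION ALL
SELECT 1, a.cid, NULL, a.acctnum, NULL, NULL, NULL
FROM accounts a
UNION ALL
SELECT 2, p.cid, NULL, NULL, p.id, i.name, NULL
FROM porders p JOIN items i ON i.pid = p.id
UNION ALL
SELECT 3, p.cid, NULL, NULL, p.id, NULL, m.amount
FROM porders p JOIN payments m ON m.pid = p.id
ORDER BY 2, 1  -- by customer id, then level (positional column references)
"""

def outer_union_rows():
    """Return the padded tuples for a tiny hypothetical data set."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE accounts(id INTEGER PRIMARY KEY, cid INTEGER, acctnum TEXT);
        CREATE TABLE porders(id INTEGER PRIMARY KEY, cid INTEGER);
        CREATE TABLE items(id INTEGER PRIMARY KEY, pid INTEGER, name TEXT);
        CREATE TABLE payments(id INTEGER PRIMARY KEY, pid INTEGER, amount INTEGER);
        INSERT INTO customers VALUES (1, 'Ann');
        INSERT INTO accounts VALUES (10, 1, 'A-10'), (11, 1, 'A-11');
        INSERT INTO porders VALUES (100, 1);
        INSERT INTO items VALUES (1000, 100, 'nut'), (1001, 100, 'bolt');
        INSERT INTO payments VALUES (5000, 100, 25);
    """)
    return db.execute(OUTER_UNION_SQL).fetchall()

def null_fraction(rows):
    """Fraction of cells that are NULL padding."""
    cells = [v for row in rows for v in row]
    return sum(v is None for v in cells) / len(cells)
```

Even on this toy data half of the cells are padding, which is why the slide asks for null-compression.
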
Late Tags: How to Tag
● Tagging (textifying) can be hard
● Two solutions:
  – Hashing
    ● In-memory hash table
    ● Remember (id/idref) pairs, entity structures
    ● Output at EOF
    ● Good iff the table fits in memory
  – Sorting
    ● Sort the Path-Outer-Union result in the DBMS
    ● Entities can then arrive in XML order
    ● Cute trick
    ● Makes it “Late-Tag, Early-Structure”
    ● Good otherwise

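The sorting variant can be sketched as a single streaming pass. This assumes tuples shaped like (level, custid, cname, acctnum, pid, item, paymt), a hypothetical layout chosen for illustration, and uses Python's `sorted` as a stand-in for the ORDER BY that the DBMS would perform. Because parents arrive before their children, the tagger only needs to remember which customer and which order are currently open.

```python
from xml.sax.saxutils import escape

def sort_key(row):
    """Order rows so that every parent precedes its children."""
    lvl, custid, _cname, _acct, pid, _item, _paymt = row
    # Within a customer: rows without an order id first (customer, accounts),
    # then rows grouped by order id, then by level inside each order.
    return (custid, pid is not None, pid if pid is not None else 0, lvl)

def tag_sorted(rows):
    """Tag outer-union tuples in one pass over the sorted stream."""
    out, cur_cust, cur_pid = [], None, None
    for lvl, custid, cname, acct, pid, item, paymt in sorted(rows, key=sort_key):
        # Leaving the current order: close it before emitting anything else.
        if cur_pid is not None and pid != cur_pid:
            out.append("</porder>")
            cur_pid = None
        if lvl == 0:  # customer row opens a new customer element
            if cur_cust is not None:
                out.append("</customer>")
            out.append(f"<customer><name>{escape(cname)}</name>")
            cur_cust = custid
        elif lvl == 1:  # account row
            out.append(f"<account>{escape(acct)}</account>")
        else:  # item (level 2) or payment (level 3) inside an order
            if cur_pid is None:
                out.append(f'<porder id="{pid}">')
                cur_pid = pid
            if lvl == 2:
                out.append(f"<item>{escape(item)}</item>")
            else:
                out.append(f"<payment>{paymt}</payment>")
    if cur_pid is not None:
        out.append("</porder>")
    if cur_cust is not None:
        out.append("</customer>")
    return "".join(out)
```

A hashing tagger would instead accept the tuples in any order, keep every partially built entity in a table keyed by id, and write nothing until the input is exhausted, which is why it only pays off when that table fits in memory.
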
Performance Points
● Stored Procedures
  – “Devastating” (2x worse than best)
● Decorrelated Queries
  – Good, but need null-compression
● CLOBs
  – CLOB overhead only with deep trees
● Outer-Union Solutions
  – About the same as decorrelated CLOBs
● Computation in DBMS
  – Binding data to applications is slow
  – Pipelining can reduce the pain