LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Cost-based XML to Relational “Shredding” Jerome Simeon Bell Labs – Lucent Technologies joint work with: Juliana Freire (Bell Labs, OGI) Jayant Haritsa (IISc, Bangalore) Maya Ramanath (IISc, Bangalore) Prasan Roy (Bell Labs, ITT Bombay)
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 XML Data Management Application is based on XML – XML documents to represent information – XML Schema to model information – XPath / XQuery to access and manipulate information XML Documents Xquery Engine XML queries XML results Find the title, year and box office proceeds for all 2001 movies for $v in document(“imdbdata”)/imdb/show where $v/year=2001 return($v/title, $v/year, $v/box_office) RDBMS
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Why Storing XML in a RDBMS? For some applications it makes sense to use a relational database backend: – Leverage many years of development of relational technology – concurrency control/transaction support – Scalability – Safety (crash recovery, duplication) – Integrate with existing data stored in an RDBMS – Performance matters!! But storing and querying XML data in an RDBMS is a non-trivial task
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 XML and Relational Databases Mismatch between the relational model and XML. How to store XML data into relational tables? – Mapping (“Shredding”) XML data into flat and regular tables How to evaluate XML queries over relational tables? – Mapping XQuery into SQL (or SQL-XML?)... Litterature filled with various mapping proposals: – Many variations over binary tables (Florescu et al, Grust et al). – Shanmugasundaram et. All try to inline as much nested elements as possible in the same table. There are many alternative mappings!
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Various mappings... TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT) TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT) (I) Inline as many elements as possible TABLE Show (show_id INT, title STRING, year INT, box_office INT, seasons INT) TABLE NYTReview (review_id INT, review STRING, parent_Show INT) TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT) (II)Partition reviews table-one for NYT,one for the rest TABLE Show1 (show1_id INT, title STRING, year INT, box_office INT) TABLE Show2 (show2_id INT, title STRING, year INT, seasons INT) TABLE Review (review_id INT, tilde STRING, review STRING, parent_Show INT) (III)Split Show table into TV and Movies Define group Show { element show { element title type string, element year type integer?, element review { element * type string* }, (element box type string | element seasons type string) }
LegoDB 1 Data Binding Workshop, Avaya Labs, June have various performances No given mapping is best Q1: Simple selection query Q2: Join involving a fragment of the reviews Q3: publishing query
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 The LegoDB Storage "Shredding" Engine An optimization approach: – automatically explores a space of possible mappings – selects the mapping which has the lowest cost for a given application Important features: – Application-driven: takes into account schema, data statistics and query workload – Logical/physical independence: interface is XML-based (XML Schema, XQuery, XML data statistics) – Leverage existing technology: XML standards; XML-specific operations for generating space of mappings; relational optimizer for evaluating configurations
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Mapping an XML-Schema into Relations Define group Show { element show { element title type string, element year type integer, group Reviews*,... } define group Reviews { element review type string } XML Schema TABLE Show_Table( Show_id INT, title STRING, year INT ) TABLE Review_Table ( Review_id INT, review STRING, parent_Show INT ) Relational schema
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 A word about mapping queries XQuery for $x in //show/review where contains($x/review, “Potter”) SQL select review from Show_Table, Review_Table where Parent_show = Show_id // join here! and review contains “Potter”
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Transforming Schemas Key idea: A given document can be validated by different XML Schemas: – Different but equivalent regular expressions can be used to define an element – The presence or absence of a type name does not change the semantics of an XML Schema Applying transformations that manipulate the types (but preserve the element structure of schema) leads to a space of distinct relational configurations Define XML Schema transformations that – Exploit the structure of the schema, and – lead to useful relational configurations
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Adding a group corresponds to splitting a table define group Show { element show { element title type string, element year type integer, element review type string*,... Original schema Define group Show { element show { element title type string, element year type integer, group Reviews*,... } define group Reviews { element review type string } Transformed schema TABLE Show( Show_id INT, title STRING, year INT ) TABLE Review (Review_id INT, review STRING, parent_Show INT ) Transformed Relational schema TABLE Show( Show_id INT, title STRING, year INT, review STRING) Relational schema
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Regular expression rewritings (group A|group B),group C ==> group D|group E horizontal table partition define group D=(group A,group C) define group E=(group B,group C) group A+ ==> put first item in a table, put the group A, group A* rest in a different table element * type string extracts certain element ==> in separate table element a type string | element (^a) type string
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Storage Design Physical Schema Transformation Query/Schema/ Stats Translation Traditional Relational Query Optimizer XML data statistics XML Schema Relational schema, stats and workload Good configuration Physical schema XQuery workload cost estimate Physical Schema Generation Physical schema
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Searching for a good configuration Cost is key: use a relational optimizer as a black box – Support different cost-models – Quality of selected configuration depends on the accuracy of the optimizer! Set of possible configurations that result from applying the rewritings is very large- possibly infinite! How to search for the optimal solution? – LegoDB use a greedy search Importance of statistics for cost evaluation – Large collections vs. small collections (e.g., many new york time review or not?) – Selectivity of predicates – XML structure distribution (distribution of children / parent relation)
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 LegoDB: Runtime Support good configuration and mapping specification XQuery workload DB Loader XML document Query Translation mapping specification XML result Commercial RDBMS tuplesSQL query/results
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 Conclusion Cost-based approach for generating relational storage for XML – takes application characteristics into account » schema + data 'statistics' + queries Performance shows that storage significant performance improvements The same is likely to be true for other indexing techniques: – Full text index on XML (best for text queries?) – Native XML indexes (best for Xpath queries without schema?) – Files!!! (Best when a lot of small documents an no need for concurrency control) There is no one best way to store XML... We should try to hide that from the user!
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 A few other things I've done... Mapping XQuery to SQL/XML – Work with Jonathan Robie – SQL/XML is an extension of SQL to XML » Standard mapping from table to XML document » Standard mapping from relational schema to XML Schema » Extension of SQL query language to builds XML elements – Goal is to evaluation XQueries on top of an ODBC driver supporting XQuery – Approach: identify a fragment of XQuery which has a direct syntactic mapping into SQL / XML – Surprise: the syntactic approach worked really well, because of the way XQuery is designed. (i.e., FLWOR close to SQL statement). Galax : XQuery 1.0 implementation
LegoDB 1 Data Binding Workshop, Avaya Labs, June 2003 My take on the Data Binding problem The main problem will be about the mismatch between type systems: SQL vs. XML Schema vs. Object-Oriented I've never seen a good proposal that brings any two type systems together (then builds a language one top of it) Where I see the best chance of success: – Use XML has the data model (can represent pretty much anything). – XML Schema can represent relational schema (see SQL/XML standard) – Can it represent Object-Oriented?