Main challenges in XML/Relational mapping Juha Sallinen Hannes Tolvanen
Agenda Introduction: XML and databases Objectives of the study Findings Conclusions
Introduction: XML and databases
Basic definitions XML/relational mapping means data transformation between the XML and relational data models. A mapping method is the way the mapping is done.
Native vs. Relational Why store XML documents in a relational database rather than in a native XML database? –Immaturity of current native XML database technology –Emerging technology, no ”de facto” standard –Well-working relational databases are already in use: they are efficient and usable, and may have been in use for years
Mapping dilemma The XML data model supports much more flexible data structures than the relational model. Two fundamental differences: –XML tags –Nested structure of XML elements vs. flat structure of relational tables If an XML document does not originate from a relational data source, the data may not fit a relational schema very well
Dichotomy of mapping methods There are two fundamentally different techniques of storing XML documents in a relational database –LOB presentation –Composed presentation
LOB presentation LOB stands for Large Object. One XML document is put into a single column of a relational table. At least one column for indexing is also needed. Does not take full advantage of a classical relational database (one without XML extensions) –SQL cannot be used to query individual XML elements Not a very interesting choice!
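A minimal sketch of LOB presentation, assuming SQLite as a stand-in for a classical relational database. The table name `xml_docs`, its columns, and the sample document are hypothetical and not taken from the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one indexing column plus one column holding the whole document.
conn.execute("CREATE TABLE xml_docs (doc_id INTEGER PRIMARY KEY, content TEXT)")

doc = "<addressbook><person><name>person1</name></person></addressbook>"
conn.execute("INSERT INTO xml_docs (doc_id, content) VALUES (?, ?)", (1, doc))

# Retrieval by the indexing column returns the document unchanged.
(stored,) = conn.execute(
    "SELECT content FROM xml_docs WHERE doc_id = ?", (1,)
).fetchone()

# Plain SQL sees only an opaque string. The best it can do is a substring match,
# which cannot tell an element name from text that merely looks like one.
hits = conn.execute(
    "SELECT doc_id FROM xml_docs WHERE content LIKE '%<name>%'"
).fetchall()
```

The substring match is the closest plain SQL gets to an element query here, which is why the slide rates this option as uninteresting for anything beyond whole-document retrieval.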
Composed presentation Data structure of an XML document is ”shredded” over one or more tables Example: Different elements to different columns Multiple ways to do this –Table-based and object-relational mapping will be introduced later
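A minimal sketch of shredding, assuming a simple address book document. The element names `name` and `address`, the `person` table layout, and the use of SQLite are illustrative assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; element names are assumed for illustration.
doc = """<addressbook>
  <person><name>person1</name><address>jämeräntaival 10, espoo</address></person>
  <person><name>person2</name><address>smt 10, espoo</address></person>
</addressbook>"""

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, address TEXT)"
)

# Each <person> element becomes one row; each child element becomes one column.
for person in ET.fromstring(doc).findall("person"):
    conn.execute(
        "INSERT INTO person (name, address) VALUES (?, ?)",
        (person.findtext("name"), person.findtext("address")),
    )

rows = conn.execute("SELECT name, address FROM person ORDER BY person_id").fetchall()
```

Once the elements are columns, ordinary SQL predicates can address them directly.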
Objectives of the study
Find and explain the main issues to be considered when converting an XML schema to a relational schema –In other words: the main challenges that have to be taken into account by designers of XML/relational mapping methods and by users who need to map the data explicitly. Find and describe briefly two general mapping methods based on composed presentation
Findings
Issues to consider in mapping
Some of the most essential data characteristics
–Existence of schema definition document
–Stability of the schema
–Degree of structure
Usage model for data
–Queries against the database
–Requirement of preserving ”hidden” information
DBMS implementation
–Not covered by the study, because the scope was limited to the classical relational model
Data characteristics: Existence of XML schema definition A schema definition states how the structure of XML documents conforming to the schema is restricted –XSD (XML Schema Definition) and DTD (Document Type Definition) are currently the dominant standards for defining an XML schema. If we have the schema definition, the conversion to a relational schema will be based on it. If we don’t have the schema definition, we have to guess how the structure of the given XML vocabulary is restricted. –The guesses are based on the data in instances of the vocabulary (XML documents). In other words, we extract the schema from the available data. –This is not unproblematic, as the next example shows
Data characteristics: Existence of XML schema definition 2 - Example Illustration of the problem of extracting the schema from data: eddy example mannerheimintie 10, helsinki From this document we might deduce that the schema should be restricted to
Data characteristics: Existence of XML schema definition 2 – Example continued But if the following document is received from the data source, we have to extend our relational schema, dismiss the data that the relational schema doesn’t support (the summer cottage’s address), or combine the two fields: person2 jämeräntaival 10, espoo hiekkatie 7, oulu We can alter the database schema by adding an extra column to the table mapped from the addressbook element to support the new information –However, this solution applies only if we know that the relation between person and summercottage is 1:1. We might get documents containing persons who have several summer cottage addresses, and we would again have to alter the database schema: we would have to create a property table for the addresses.
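The property table mentioned above can be sketched as follows. The element names `addressbook`, `person` and `summercottage` come from the slide; the `name` and `address` elements, the second cottage address, the table layout, and SQLite are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical instance in which one person has two summer cottages.
doc = """<addressbook>
  <person>
    <name>person2</name>
    <address>jämeräntaival 10, espoo</address>
    <summercottage>hiekkatie 7, oulu</summercottage>
    <summercottage>rantatie 1, inari</summercottage>
  </person>
</addressbook>"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT, address TEXT);
    -- Property table: one row per cottage, so 1:N needs no further schema change.
    CREATE TABLE summercottage (
        person_fk INTEGER REFERENCES person(person_id),
        address   TEXT
    );
""")

for person in ET.fromstring(doc).findall("person"):
    cur = conn.execute(
        "INSERT INTO person (name, address) VALUES (?, ?)",
        (person.findtext("name"), person.findtext("address")),
    )
    for cottage in person.findall("summercottage"):
        conn.execute(
            "INSERT INTO summercottage (person_fk, address) VALUES (?, ?)",
            (cur.lastrowid, cottage.text),
        )

cottages = conn.execute(
    "SELECT s.address FROM summercottage s "
    "JOIN person p ON p.person_id = s.person_fk "
    "WHERE p.name = ? ORDER BY s.rowid",
    ("person2",),
).fetchall()
```

An extra column on `person` would hold only one cottage address; the separate table accepts any number of them at the cost of a join.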
Evolving schema If the schema of an XML vocabulary is defined but changes over time, the corresponding changes must be made to the relational schema. Changes to the relational schema are not always as easy to make as in the previous example (if the composed approach is used). The likelihood of schema changes should be evaluated.
Degree of structure of the XML schema
Categorization used in the study:
1. Structured data
–Data is totally independent of the presentation used to describe it
–Document can be navigated without examining it first
2. Semi-structured data
–Some blocks of the document may contain optional parts
3. Marked-up text
–Documents require the preservation of ”hidden” information
–E.g. HTML documents
These terms have different meanings in the literature. The information on the following slide is based on the definitions on this slide.
Degree of structure of the XML schema Structured documents can be easily mapped to a database using composed presentation. Semi-structured documents can also be decomposed if a schema definition is provided. If mixed content is included, it depends on how the data is used whether LOB presentation suits the mixed content block better than further fragmentation. The requirement of marked-up text to preserve “hidden” information is discussed later.
Storing mixed content in relations Mixed content: document elements embedded in character data. E.g. example here you have a short example Designing a relational schema to store mixed content –If there are blocks in the content that make sense only as a whole, decomposing those blocks makes no sense. –If we have strong arguments for decomposing a block containing mixed content, one possible decomposition method is to create one table for the root element, one property table for character data, and one property table for every element that appears in the content.
Mixed content mapping example DTD Example instance: Here we have a nice example ! Relational schema –A(a_pk) –B(a_fk,b, bOrder) –C(a_fk, c, cOrder) –PCDATA(a_fk, pcdata, pcdataOrder)
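The relational schema above can be exercised with a short sketch. The instance markup is an assumption (the original markup is not recoverable from the slide text), as is the choice to number the three order columns from one shared document-order counter. Escaping of special characters is ignored.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Assumed instance: element A with mixed content containing B and C.
instance = "<A>Here we have <B>a nice</B> <C>example</C>!</A>"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a_pk INTEGER PRIMARY KEY);
    CREATE TABLE B (a_fk INTEGER, b TEXT, bOrder INTEGER);
    CREATE TABLE C (a_fk INTEGER, c TEXT, cOrder INTEGER);
    CREATE TABLE PCDATA (a_fk INTEGER, pcdata TEXT, pcdataOrder INTEGER);
""")

root = ET.fromstring(instance)
a_pk = conn.execute("INSERT INTO A DEFAULT VALUES").lastrowid
position = itertools.count(1)  # shared document-order counter (an assumption)

def store(table, value):
    """Insert one piece of content, numbered in document order."""
    # Table names come from the fixed set B, C, PCDATA in this sketch.
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (a_pk, value, next(position)))

if root.text:
    store("PCDATA", root.text)
for child in root:
    store(child.tag, child.text or "")
    if child.tail:  # character data that follows the child element
        store("PCDATA", child.tail)

# Reassemble the content by merging the three property tables on their order columns.
pieces = conn.execute("""
    SELECT 'B' AS kind, b AS value, bOrder AS pos FROM B WHERE a_fk = :a
    UNION ALL SELECT 'C', c, cOrder FROM C WHERE a_fk = :a
    UNION ALL SELECT 'PCDATA', pcdata, pcdataOrder FROM PCDATA WHERE a_fk = :a
    ORDER BY pos
""", {"a": a_pk}).fetchall()

rebuilt = "<A>" + "".join(
    value if kind == "PCDATA" else f"<{kind}>{value}</{kind}>"
    for kind, value, _ in pieces
) + "</A>"
```

Without the order columns the rows of the three tables could not be interleaved again, which is why each property table carries one.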
Usage models for data: Type of queries executed against the database The spectrum of queries –Queries that retrieve XML documents –Queries that retrieve fragments of XML documents –Queries that make transformations on XML data –And even more complex queries...
Query examples 1 Sample documents person1 jämeräntaival 10 espoo hiekkatie 7, oulu person2 smt 10 espoo hiekkatie 7, oulu Query emitting XML fragment: Select the names of persons who live in Espoo person1 person2
Query examples 2 Query making transformation: “select the number of persons living in Espoo” 2
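Against a composed presentation, both example queries reduce to ordinary SQL. The sketch below assumes a hypothetical `person` table with `name`, `street` and `city` columns and adds a third, non-Espoo row for contrast; none of these names come from the study.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (name TEXT, street TEXT, city TEXT)")
conn.executemany("INSERT INTO person VALUES (?, ?, ?)", [
    ("person1", "jämeräntaival 10", "espoo"),
    ("person2", "smt 10", "espoo"),
    ("person3", "mannerheimintie 10", "helsinki"),  # extra row, added for contrast
])

def name_element(text):
    """Wrap a value in a <name> element (the element name is an assumption)."""
    element = ET.Element("name")
    element.text = text
    return element

# Query emitting an XML fragment: names of persons who live in Espoo.
names = [row[0] for row in conn.execute(
    "SELECT name FROM person WHERE city = ? ORDER BY name", ("espoo",))]
fragment = "".join(ET.tostring(name_element(n), encoding="unicode") for n in names)

# Query making a transformation: the number of persons living in Espoo.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM person WHERE city = ?", ("espoo",)
).fetchone()
```

The relational engine does the selection and aggregation; XML only reappears when the result is serialized.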
Preservation of “hidden” information An XML document contains “hidden” information that is related to the presentation of the data, not the data itself. –Order of elements –Comments –Whitespace It might be required that the original XML documents can be retrieved –Trivial when LOB presentation is used –If composed presentation is used, all “hidden” information needs to be stored in relations
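A small illustration of how “hidden” information is lost once a document is reduced to its data model. It uses Python's `xml.etree.ElementTree`, whose default parser discards comments; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

original = "<person><!-- checked by hand --><name>person1</name></person>"

# LOB-style storage keeps the string itself, so the round trip is exact.
lob_round_trip = str(original)

# Parsing keeps only the data model; the default parser drops the comment,
# so a composed presentation built from this tree could not restore it.
reserialized = ET.tostring(ET.fromstring(original), encoding="unicode")
```

To retrieve the original document from a composed presentation, the comment would have to be stored in a relation of its own, together with its position.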
Table-based mapping Listing 1. Required structure of XML document in table-based mapping (Bourret, 2001).
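A sketch of table-based mapping, assuming the document shape commonly described for this method: a root element, one child element per table, one `row` element per row, and one child element per column. The listing itself is not reproduced in this text, so the element names, the sample data, and the all-TEXT column types below are assumptions.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Assumed shape: <database> / <table name> / <row> / <column name>.
doc = """<database>
  <person>
    <row><name>person1</name><city>espoo</city></row>
    <row><name>person2</name><city>espoo</city></row>
  </person>
</database>"""

conn = sqlite3.connect(":memory:")
for table in ET.fromstring(doc):
    rows = table.findall("row")
    columns = [col.tag for col in rows[0]]  # assumes every row has the same columns
    column_defs = ", ".join(f"{c} TEXT" for c in columns)
    conn.execute(f"CREATE TABLE {table.tag} ({column_defs})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO {table.tag} VALUES ({placeholders})",
        [[row.findtext(c) for c in columns] for row in rows],
    )

people = conn.execute("SELECT name, city FROM person ORDER BY name").fetchall()
```

The mapping is mechanical, but it only works for documents that already have this table-like shape.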
Object-Relational mapping A method for mapping any XML document that has a schema definition. The idea is to convert the schema of the document to an object schema, and then convert the object schema to a relational schema. The object/relational conversion step is predefined, but the XML/object conversion leaves some freedom in defining the object view that is mapped from the XML data.
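The two steps can be sketched as follows. The classes `Person` and `SummerCottage`, the helper functions, and the rule "one table per class, one foreign key per containment" are illustrative assumptions, not the method as defined in the study.

```python
import sqlite3
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

@dataclass
class SummerCottage:
    address: str

@dataclass
class Person:
    name: str
    cottages: list = field(default_factory=list)

def to_objects(xml_text):
    """XML/object step: complex elements become objects, simple ones properties."""
    return [
        Person(p.findtext("name"),
               [SummerCottage(c.text) for c in p.findall("summercottage")])
        for p in ET.fromstring(xml_text).findall("person")
    ]

def to_relations(conn, people):
    """Object/relational step: one table per class, a foreign key per containment."""
    conn.executescript("""
        CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE summercottage (person_fk INTEGER, address TEXT);
    """)
    for person in people:
        person_id = conn.execute(
            "INSERT INTO person (name) VALUES (?)", (person.name,)
        ).lastrowid
        conn.executemany(
            "INSERT INTO summercottage (person_fk, address) VALUES (?, ?)",
            [(person_id, c.address) for c in person.cottages],
        )

doc = """<addressbook>
  <person>
    <name>person2</name>
    <summercottage>hiekkatie 7, oulu</summercottage>
  </person>
</addressbook>"""

people = to_objects(doc)
conn = sqlite3.connect(":memory:")
to_relations(conn, people)
rows = conn.execute(
    "SELECT p.name, s.address FROM person p "
    "JOIN summercottage s ON s.person_fk = p.person_id"
).fetchall()
```

The freedom mentioned on the slide lies in `to_objects`: the same XML could be viewed through a different set of classes, and the relational schema would follow from that choice.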
Conclusions
Choosing among the possible relational representations for XML data involves many issues that must be considered. Some of the issues limit the choice to LOB presentation (no schema, rapidly evolving schema, queries that only retrieve original documents). LOB presentation can also be used for storing blocks of the document to which there are no references from elsewhere. The usual reason why decomposition is preferred whenever possible is the performance gain. The data also becomes more accessible to applications that use the database but don’t publish any views of the data in XML.