You Got XML In My Database? What’s Up With That? Stuart R Ainsworth stuart@codegumbo.com
Purpose Discuss the marriage of XML to relational databases (SQL Server 2005+) Approach from a design perspective Brief history of two approaches
Why Me? Data Architect working in the field of financial information security Manage data flows from 15+ different vendor systems 150 million rows of data per day Prior experience as DBA and a reporting/database developer A chapter leader for AtlantaMDF
Why Me? NOT an XML “expert” Agreed to do this presentation as a learning experience Continues to be a work-in-progress
Assumptions About You Database professional (DBA, database developer, BI specialist) Well-versed in SQL, especially T-SQL Basic understanding of what XML is
Goals Explore XML and relational data History Differences Compatibility XML features in SQL Server 2005+ Classic relational design challenges
History Lessons “Set the WABAC machine, Sherman!” -Mr. Peabody
History-Rel Paradigm Relational design based on work of E.F.Codd A Relational Model of Data For Large Shared Data Banks – 1970 ACM Codd’s 12 Rules for Relational DB’s (1985) Implementation Ingres (1974) Relational Software (Oracle; 1979)
History-Rel Paradigm Context: Hierarchical databases prevalent Tree structure Redundant data in attributes Expense of computer hardware Limited storage capability Limited expansion possibilities In a hierarchical data model, data is organized into a tree-like structure in which each record has exactly one parent. The structure represents repeating information through parent/child relationships. All attributes of a specific record are listed under an entity type. In relational terms, an entity type is the equivalent of a table: each individual record is a row and each attribute is a column. Entity types are related to each other by 1:N mappings, also known as one-to-many relationships. For example, an organization keeps employee records in a table (entity type) called Employees, with attributes/columns such as First Name, Last Name, Job Name, and Wage. The company also keeps data about each employee’s children in a separate table called Children, with attributes such as First Name, Last Name, and DOB. The Employees table is the parent segment and the Children table is the child segment. These two segments form a hierarchy in which an employee may have many children but each child may have only one parent.
History-Rel Paradigm “By the time UNIX began to become popular (1974), a well configured PDP-11 had 768 Kb of core memory, two 200 Mb moving head disks (hard disks), a reel to reel tape drive for backup purposes, a dot-matrix line printer and a bunch of [dumb] terminals. This was a high end machine, and even a minimally configured PDP-11 cost about $40,000. Despite the cost, 600 such installations had been put into service by the end of 1974, mostly at universities.” The History Of Computers During My Lifetime - The 1970's by Jason Patterson http://www.pattosoft.com.au/jason/Articles/HistoryOfComputers/1970s.html
History-Rel Paradigm “In 1973, IBM developed what is considered to be the first true sealed hard disk drive... It used two 30 Mb platters. Over the following decade, sealed hard disks (often called Winchester disks) took their place as the primary data storage medium, initially in mainframes, then in minicomputers, and finally in personal computers starting with the IBM PC/XT in 1983.”
Database Schemas Normalization 1NF: Atomicity of data 2NF: Definition of the primary key 3NF: Dependency on the primary key
That’s great, but…. Optimize storage of information Reduce redundant information Increase performance for query engine Optimize data validity Maintain relationships on dependent keys Ensure consistent change control (Later) Optimize information security Well-designed security models Who has access to what, when, & how
History-XML 1970’s – Goldfarb, Mosher, & Lorie defined GML (later SGML – Standard Generalized Markup Language) Isolate content from presentation HTML – best-known SGML application SGML is very complex HTML standards became polluted by browser-specific extensions Netscape vs IE Firefox vs IE
History-XML 1990’s Bosak, Bray, Clark defined eXtensible Markup Language – XML Well-formed 1 root element Matching end tags No overlapping elements Valid DTD (Document Type Definition) XML Schema
History-XML 2002, Microsoft released .NET Response to Java interoperability .NET relies on XML to pass data
That’s great, but…. Isolate content from presentation Defines standards for interpretation Minimal definitions for implementation Suggest data validity XML documents must be well-formed XML documents should be valid
XML commands “Never send a human to do a machine’s job!” -Agent Smith
XML in SQL Server 2000+ Generation: FOR XML RAW, AUTO, EXPLICIT
FOR XML RAW USE AdventureWorks GO SELECT Cust.CustomerID, OrderHeader.CustomerID as ohCustID, OrderHeader.SalesOrderID, OrderHeader.Status, Cust.CustomerType FROM Sales.Customer Cust INNER JOIN Sales.SalesOrderHeader OrderHeader ON Cust.CustomerID = OrderHeader.CustomerID FOR XML RAW
FOR XML RAW <row CustomerID="676" ohCustID="676" SalesOrderID="43659" Status="5" CustomerType="S" /> <row CustomerID="117" ohCustID="117" SalesOrderID="43660" Status="5" CustomerType="S" /> <row CustomerID="442" ohCustID="442" SalesOrderID="43661" Status="5" CustomerType="S" /> <row CustomerID="227" ohCustID="227" SalesOrderID="43662" Status="5" CustomerType="S" /> <row CustomerID="510" ohCustID="510" SalesOrderID="43663" Status="5" CustomerType="S" /> <row CustomerID="397" ohCustID="397" SalesOrderID="43664" Status="5" CustomerType="S" /> <row CustomerID="146" ohCustID="146" SalesOrderID="43665" Status="5" CustomerType="S" /> <row CustomerID="511" ohCustID="511" SalesOrderID="43666" Status="5" CustomerType="S" /> <row CustomerID="646" ohCustID="646" SalesOrderID="43667" Status="5" CustomerType="S" /> <row CustomerID="514" ohCustID="514" SalesOrderID="43668" Status="5" CustomerType="S" /> <row CustomerID="578" ohCustID="578" SalesOrderID="43669" Status="5" CustomerType="S" /> <row CustomerID="504" ohCustID="504" SalesOrderID="43670" Status="5" CustomerType="S" /> <row CustomerID="200" ohCustID="200" SalesOrderID="43671" Status="5" CustomerType="S" /> <row CustomerID="119" ohCustID="119" SalesOrderID="43672" Status="5" CustomerType="S" /> <row CustomerID="618" ohCustID="618" SalesOrderID="43673" Status="5" CustomerType="S" /> <row CustomerID="83" ohCustID="83" SalesOrderID="43674" Status="5" CustomerType="S" />
FOR XML AUTO USE AdventureWorks GO SELECT Cust.CustomerID, OrderHeader.CustomerID, OrderHeader.SalesOrderID, OrderHeader.Status, Cust.CustomerType FROM Sales.Customer Cust INNER JOIN Sales.SalesOrderHeader OrderHeader ON Cust.CustomerID = OrderHeader.CustomerID FOR XML AUTO
FOR XML AUTO <Cust CustomerID="676" CustomerType="S"> <OrderHeader CustomerID="676" SalesOrderID="43659" Status="5" /> </Cust> <Cust CustomerID="117" CustomerType="S"> <OrderHeader CustomerID="117" SalesOrderID="43660" Status="5" /> </Cust> <Cust CustomerID="442" CustomerType="S"> <OrderHeader CustomerID="442" SalesOrderID="43661" Status="5" /> </Cust> <Cust CustomerID="227" CustomerType="S"> <OrderHeader CustomerID="227" SalesOrderID="43662" Status="5" /> </Cust>
FOR XML EXPLICIT Beyond the scope of this presentation
XML in SQL Server 2000+ Generation: FOR XML (RAW, AUTO, EXPLICIT) Translation: OPENXML, sp_xml_preparedocument, sp_xml_removedocument
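The translation side of the SQL Server 2000 feature set can be sketched as follows. This is a minimal illustration, not taken from the original deck: the sample document, its element and attribute names, and the @customers table variable are all assumptions made for the example.

```sql
-- Hypothetical sample document; names are assumptions for illustration.
DECLARE @doc nvarchar(4000);
SET @doc = N'<Root>
  <Customer CustomerID="676" CustomerType="S" />
  <Customer CustomerID="117" CustomerType="S" />
</Root>';

DECLARE @hdoc int;
DECLARE @customers TABLE (CustomerID int, CustomerType char(1));

-- Parse the text and obtain a handle to the in-memory document tree.
EXEC sp_xml_preparedocument @hdoc OUTPUT, @doc;

-- Shred the attributes into rows (flag 1 = attribute-centric mapping).
INSERT INTO @customers (CustomerID, CustomerType)
SELECT CustomerID, CustomerType
FROM OPENXML(@hdoc, '/Root/Customer', 1)
     WITH (CustomerID int, CustomerType char(1));

-- Release the handle so the parsed tree does not stay in memory.
EXEC sp_xml_removedocument @hdoc;

SELECT CustomerID, CustomerType FROM @customers;
```

The prepare/remove pair brackets every OPENXML call; forgetting the remove step leaves the parsed document allocated until the session ends.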
XML in SQL Server 2005+ Generation: FOR XML PATH TYPE
FOR XML PATH USE AdventureWorks GO SELECT Cust.CustomerID AS "Customer/@CustomerID", OrderHeader.CustomerID AS "Order/@CustomerID", OrderHeader.SalesOrderID AS "Order/@OrderID", OrderHeader.Status AS "Order/Status", Cust.CustomerType AS "Customer/@Type" FROM Sales.Customer Cust INNER JOIN Sales.SalesOrderHeader OrderHeader ON Cust.CustomerID = OrderHeader.CustomerID FOR XML PATH
FOR XML PATH <row> <Customer CustomerID="676" /> <Order CustomerID="676" OrderID="43659"> <Status>5</Status> </Order> <Customer Type="S" /> </row> <row> <Customer CustomerID="117" /> <Order CustomerID="117" OrderID="43660"> <Status>5</Status> </Order> <Customer Type="S" /> </row>
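The TYPE directive listed alongside PATH is not demonstrated in the deck, so here is a hedged sketch of one common use: nesting one FOR XML query inside another. The table variables stand in for the AdventureWorks tables and hold only two of the rows shown earlier; the element names Customers, Customer, and Order are assumptions chosen for the example.

```sql
-- Stand-ins for Sales.Customer and Sales.SalesOrderHeader (assumed shapes).
DECLARE @Customer TABLE (CustomerID int, CustomerType char(1));
DECLARE @OrderHeader TABLE (SalesOrderID int, CustomerID int, Status tinyint);

INSERT INTO @Customer VALUES (676, 'S');
INSERT INTO @Customer VALUES (117, 'S');
INSERT INTO @OrderHeader VALUES (43659, 676, 5);
INSERT INTO @OrderHeader VALUES (43660, 117, 5);

DECLARE @result xml;
SET @result = (
    SELECT c.CustomerID   AS "@CustomerID",
           c.CustomerType AS "@Type",
           -- TYPE returns the inner result as xml, so it is embedded as
           -- child elements instead of being escaped as a string.
           (SELECT o.SalesOrderID AS "@OrderID",
                   o.Status       AS "Status"
            FROM @OrderHeader o
            WHERE o.CustomerID = c.CustomerID
            FOR XML PATH('Order'), TYPE)
    FROM @Customer c
    ORDER BY c.CustomerID
    FOR XML PATH('Customer'), ROOT('Customers'), TYPE
);

SELECT @result AS CustomersXml;
```

Without TYPE on the inner query, the nested result is treated as character data and its angle brackets are entitized in the outer document.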
XML in SQL Server 2005+ Generation: FOR XML (PATH, TYPE) Translation: xml datatype, XQuery
XML Translation xml datatype Well-formed fragments (no root required) 2 GB maximum Cannot be compared or sorted Supports conversion to (n)varchar(max) Required for XQuery
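A few of the xml datatype properties above can be seen in a short sketch. The variable names and the sample fragment are assumptions made for illustration.

```sql
-- A fragment with two top-level elements is accepted: no single root required.
DECLARE @fragment xml;
SET @fragment = N'<Stooge>Moe</Stooge><Stooge>Larry</Stooge>';

-- xml values cannot be compared directly; a statement such as
--   IF @fragment = @fragment PRINT 'same';
-- is rejected by SQL Server.

-- Converting to nvarchar(max) yields a string that can be compared or sorted.
DECLARE @asText nvarchar(max);
SET @asText = CONVERT(nvarchar(max), @fragment);

-- The xml methods (and therefore XQuery) are only available on the xml type.
SELECT @asText AS FragmentText,
       @fragment.value('count(/Stooge)', 'int') AS StoogeCount;
```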
XML Translation XQuery Complete query language outside of SQL Server SQL Server 2005+ implements limited subset xml methods query() value() exist() nodes() modify() (beyond scope of presentation)
.query() DECLARE @myDoc XML SET @myDoc = '<Root> <ProductDescription ProductID="1" ProductName="Road Bike"> <Features> <Warranty>1 year parts and labor</Warranty> <Maintenance>3 year parts and labor extended maintenance is available</Maintenance> </Features> </ProductDescription> </Root>' SELECT @myDoc.query('/Root/ProductDescription/Features')
.query() <Features> <Warranty>1 year parts and labor</Warranty> <Maintenance>3 year parts and labor extended maintenance is available</Maintenance> </Features>
.exist() & .value() DECLARE @myDoc XML SET @myDoc = '<Root> <ProductDescription ProductID="1" ProductName="Road Bike"> <Features> <Warranty>1 year parts and labor</Warranty> <Maintenance>3 year parts and labor extended maintenance is available</Maintenance> </Features> </ProductDescription> </Root>' IF @myDoc.exist('/Root/ProductDescription/Features/Warranty') = 1 BEGIN SELECT @myDoc.value('(/Root/ProductDescription/Features/Warranty)[1]', 'varchar(100)') AS Warranty END
.nodes() DECLARE @x xml SET @x='<Root> <row id="1"><name>Larry</name><oflw>some text</oflw></row> <row id="2"><name>moe</name></row> <row id="3" /> </Root>' SELECT T.c.query('.') AS result FROM @x.nodes('/Root/row') T(c)
.nodes() <row id="1"><name>Larry</name><oflw>some text</oflw></row> <row id="2"><name>moe</name></row> <row id="3" />
.nodes() DECLARE @x xml SET @x='<Root> <row id="1"><name>Larry</name><oflw>some text</oflw></row> <row id="2"><name>moe</name></row> <row id="3" /> </Root>' SELECT T.c.query('.').value('(//name)[1]', 'varchar(10)') AS result FROM @x.nodes('/Root/row') T(c)
.nodes() Larry moe NULL
XML in SQL Server 2005+ Generation: FOR XML (PATH, TYPE) Translation: xml datatype, XQuery T-SQL: APPLY operator
T-SQL: APPLY The APPLY operator allows you to invoke a table-valued function for each row returned by an outer table expression of a query. The table-valued function acts as the right input and the outer table expression acts as the left input. The right input is evaluated for each row from the left input and the rows produced are combined for the final output. The list of columns produced by the APPLY operator is the set of columns in the left input followed by the list of columns returned by the right input.
T-SQL: APPLY Important parts: Right input is typically a table-valued function (the xml nodes() method also works) “Joins” the TVF with a table CROSS APPLY :: INNER JOIN OUTER APPLY :: LEFT OUTER JOIN
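The CROSS versus OUTER distinction can be sketched with the nodes() method as the right input. The @Orders table variable, its columns, and the Item element are hypothetical and exist only for this illustration.

```sql
-- Hypothetical orders table; order 2 has no XML at all.
DECLARE @Orders TABLE (OrderID int, Items xml);
INSERT INTO @Orders (OrderID, Items)
VALUES (1, '<Item>Bike</Item><Item>Helmet</Item>');
INSERT INTO @Orders (OrderID, Items)
VALUES (2, NULL);

-- CROSS APPLY: an order whose right input is empty disappears,
-- the same way an unmatched row disappears from an INNER JOIN.
SELECT o.OrderID, x.c.value('.', 'varchar(10)') AS ItemName
FROM @Orders o
CROSS APPLY o.Items.nodes('/Item') AS x(c);

-- OUTER APPLY: every order is kept; ItemName is NULL when nothing matched,
-- the same way a LEFT OUTER JOIN pads unmatched rows.
SELECT o.OrderID, x.c.value('.', 'varchar(10)') AS ItemName
FROM @Orders o
OUTER APPLY o.Items.nodes('/Item') AS x(c);
```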
Pulling it together DECLARE @T TABLE (LastName varchar(10), Stooges xml) INSERT INTO @T (LastName, Stooges) VALUES ('Howard', '<Stooge>Moe</Stooge> <Stooge>Curly</Stooge> <Stooge>Shemp</Stooge>'), ('Fine', '<Stooge>Larry</Stooge>') SELECT t.LastName, x.c.query('.').value('(/Stooge)[1]','varchar(10)') as FirstName FROM @T t CROSS APPLY Stooges.nodes('/Stooge') x(c)
Pulling it together LastName FirstName Howard Moe Howard Curly Howard Shemp Fine Larry
XML – SQL Scenarios “If all you have is a hammer, everything looks like a nail” -Bernard Baruch
Whole XML documents Storage of XML documents Application transfers complete document Need not be stored as xml datatype Decide if queried or modified as xml Depending on doc size, may not perform well Large documents require more I/O (disk & network) Multiple rows require more I/O (disk & network) Depending on datatype, may not validate
Classic Design Problems Entity-Attribute-Value (EAV) designs Doesn’t solve problem; adds an option Adding attributes after release Complete datasets as parameters Additional disconnect between layers “object-like” handling set-oriented methodology Requires strict data handling
EAV design Database Storage determined after implementation Requires multiple subqueries to fetch data May cause Performance Problems Brittle Validity
EAV Design IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.tables WHERE TABLE_NAME = 'emp_values') DROP TABLE emp_values ; IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.tables WHERE TABLE_NAME = 'emp') DROP TABLE emp ; create table emp (empno integer primary key ); create table emp_values (empno INT references emp, code varchar(20), value varchar(100)); insert into emp (empno) values (1234); insert into emp_values VALUES (1234, 'NAME','ANDREWS'); insert into emp_values VALUES (1234, 'SAL','1000'); insert into emp_values VALUES (1234, 'JOB','CLERK');
EAV Design SELECT * FROM dbo.emp_values ev empno code value 1234 NAME ANDREWS 1234 SAL 1000 1234 JOB CLERK
EAV Design
--structure we want version 1
SELECT e.empno, NAME = ev1.value, SAL = ev2.value, JOB = ev3.value
FROM emp e
JOIN emp_values ev1 ON e.empno = ev1.empno AND ev1.code = 'NAME'
JOIN emp_values ev2 ON e.empno = ev2.empno AND ev2.code = 'SAL'
JOIN emp_values ev3 ON e.empno = ev3.empno AND ev3.code = 'JOB'
--structure we want version 2
SELECT e.empno,
  NAME = (SELECT ev.value FROM emp_values ev WHERE ev.empno = e.empno AND ev.code = 'NAME'),
  SAL = (SELECT ev.value FROM emp_values ev WHERE ev.empno = e.empno AND ev.code = 'SAL'),
  JOB = (SELECT ev.value FROM emp_values ev WHERE ev.empno = e.empno AND ev.code = 'JOB')
FROM emp e
EAV design XML doesn’t solve design issues, but does mitigate the cost May cause performance issues XML indexes Brittle validity Use XML Schema to validate Still requires dynamic SQL to transform data to UI
EAV Design DECLARE @emp TABLE (empno int, eav xml) INSERT INTO @emp (empno, eav) VALUES (1234, '<root> <NAME>Andrews</NAME> <JOB>Judge</JOB> <SAL>1000</SAL> </root>') SELECT e.empno, x.value('local-name(.)','VARCHAR(20)') AS ElementName, x.value('.','VARCHAR(20)') AS ElementValue FROM @emp e CROSS APPLY eav.nodes('/*/*') y(x) empno ElementName ElementValue 1234 NAME Andrews 1234 JOB Judge 1234 SAL 1000
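For comparison with the three-join relational version, the same "structure we want" can be read straight out of the XML column. This is a sketch under the assumption that the attribute names (NAME, SAL, JOB) are known when the query is written; unknown attributes would still need the nodes() approach or dynamic SQL.

```sql
DECLARE @emp TABLE (empno int, eav xml);
INSERT INTO @emp (empno, eav)
VALUES (1234, '<root><NAME>Andrews</NAME><JOB>Judge</JOB><SAL>1000</SAL></root>');

-- One value() call per attribute replaces one self-join per attribute.
-- A missing element simply yields NULL instead of dropping the row.
SELECT e.empno,
       e.eav.value('(/root/NAME)[1]', 'varchar(100)') AS NAME,
       e.eav.value('(/root/SAL)[1]',  'varchar(100)') AS SAL,
       e.eav.value('(/root/JOB)[1]',  'varchar(100)') AS JOB
FROM @emp e;
```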
Dataset Parameters Complete & complex datasets “object-like” processing Single order with multiple line items Complex transfers between tables Avoid row-by-row inserts Minimizes network latency Application sends over XML document Stored proc shreds & inserts
Dataset Parameters <e EmployeeID="1" ContactID="1209"> <c FirstName="Guy" LastName="Gilbert" /> </e> <e EmployeeID="2" ContactID="1030"> <c FirstName="Kevin" LastName="Brown" /> </e> <e EmployeeID="3" ContactID="1002"> <c FirstName="Roberto" LastName="Tamburello" /> </e>
Dataset Parameters CREATE PROC EmployeePerson (@x XML) AS SELECT T.c.value('(@EmployeeID)[1]', 'int') as EmployeeID, T.c.value('(@ContactID)[1]', 'int') as ContactID FROM @x.nodes('/e') T(c) SELECT T.c.value('(@ContactID)[1]', 'int') as ContactID, T.c.value('(c/@FirstName)[1]', 'varchar(25)') as FirstName, T.c.value('(c/@LastName)[1]', 'varchar(25)') as LastName FROM @x.nodes('/e') T(c)
Dataset Parameters EmployeeID ContactID 1 1209 2 1030 3 1002 ContactID FirstName LastName 1209 Guy Gilbert 1030 Kevin Brown 1002 Roberto Tamburello
Dataset Parameters Maintains axioms of relational paradigm Security of stored procs Validity of expected inputs Performance If used to minimize row-by-row processing, then yes Otherwise, no real impact.
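The "shred & insert" step could look something like the following. It is a sketch only: the @Employee and @Contact table variables are hypothetical stand-ins for real target tables, and in practice this body would sit inside a stored procedure that receives @x as a parameter.

```sql
-- Stand-in for the xml parameter the application would send.
DECLARE @x xml;
SET @x = N'<e EmployeeID="1" ContactID="1209"><c FirstName="Guy" LastName="Gilbert" /></e>
<e EmployeeID="2" ContactID="1030"><c FirstName="Kevin" LastName="Brown" /></e>';

-- Hypothetical target tables.
DECLARE @Contact TABLE (ContactID int PRIMARY KEY,
                        FirstName varchar(25), LastName varchar(25));
DECLARE @Employee TABLE (EmployeeID int PRIMARY KEY, ContactID int);

-- One set-based INSERT per target table replaces row-by-row calls.
INSERT INTO @Contact (ContactID, FirstName, LastName)
SELECT T.c.value('(@ContactID)[1]',   'int'),
       T.c.value('(c/@FirstName)[1]', 'varchar(25)'),
       T.c.value('(c/@LastName)[1]',  'varchar(25)')
FROM @x.nodes('/e') AS T(c);

INSERT INTO @Employee (EmployeeID, ContactID)
SELECT T.c.value('(@EmployeeID)[1]', 'int'),
       T.c.value('(@ContactID)[1]',  'int')
FROM @x.nodes('/e') AS T(c);

SELECT e.EmployeeID, c.FirstName, c.LastName
FROM @Employee e
JOIN @Contact c ON c.ContactID = e.ContactID;
```

The whole dataset crosses the network once, and both inserts can be wrapped in a single transaction so the order-like object is stored entirely or not at all.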
Resources; Questions? “It’s a miracle that curiosity survives formal education.” – Albert Einstein
Contact Information Stuart R Ainsworth stuart@codegumbo.com http://www.codegumbo.com http://www.twitter.com/stuarta
Resources SQL Server 2008 Books Online Pro T-SQL 2008 Programmer’s Guide Coles, 2008, Apress Professional SQL Server 2005 XML Klein, 2006, Wrox