Schema Normalization & XML

Schema Normalization & XML
Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems October 12, 2004 Some slide content courtesy of Susan Davidson & Raghu Ramakrishnan

Announcements Project description handed out
Decide on 3-person project groups by 1 week from today (10/19) Reading assignment on XML handed out Homework 2 answers posted on Web Midterm: Thursday 10/28 Right after class: Phil Bernstein, MS Research, talk on Model Management, Levine 101

Why Armstrong’s Axioms?
Why are Armstrong’s axioms (or an equivalent rule set) appropriate for FD’s? They are: Consistent: any relation satisfying FD’s in F will satisfy those in F + Complete: if an FD X  Y cannot be derived by Armstrong’s axioms from F, then there exists some relational instance satisfying F but not X  Y In other words, Armstrong’s axioms derive all the FD’s that should hold What is the goal of using these axioms?

Stuff(sid, name, serno, subj, cid, exp-grade)
Decomposition Consider our original “bad” attribute set We could decompose it into: But this decomposition loses information about the relationship between students and courses. Why? Stuff(sid, name, serno, subj, cid, exp-grade) Student(sid, name) Course(serno, cid) Subject(cid, subj)

Lossless Join Decomposition
R1, … Rk is a lossless join decomposition of R w.r.t. an FD set F if for every instance r of R that satisfies F, ÕR1(r) ⋈ ... ⋈ ÕRk(r) = r Consider: What if we decompose on (sid, name) and (serno, subj, cid, exp-grade)? sid name serno subj cid exp-grade 1 Sam 570103 AI 570 B 23 Nitin 550103 DB 550 A

Testing for Lossless Join
R1, R2 is a lossless join decomposition of R with respect to F iff at least one of the following dependencies is in F+ (R1  R2)  R1 – R2 (R1  R2)  R2 – R1 So for the FD set: sid  name serno  cid, exp-grade cid  subj Is (sid, name) and (serno, subj, cid, exp-grade) a lossless decomposition?

Dependency Preservation
Ensures we can check whether a FD X  Y is violated during DB updates, without using a join: FZ, the projection of FD set F onto attribute set Z, is: {X  Y | X  Y  F +, X  Y Í Z} i.e., it is those FDs only applicable to Z’s attributes A decomposition R1, …, Rk is dependency preserving if F + = (FR1 ... FRk)+ (note we need an extra closure!) We don’t lose the ability to test the “cover” of our FDs in a single table, just because we decompose Pay special attention to closure over the union

Example 1 For Schema R(sid, name, serno, cid, subj, exp-grade) and FD set: sid  name serno  cid cid  subj sid, serno  exp-grade Is R1(sid, name) and R2(serno, subj, cid, exp-grade): A lossless decomposition? Is it dependency-preserving? How about R1(sid, name) and R2(sid, serno, subj, cid, exp-grade)? First example: Not lossless: no join key Not dep preserving: sid, serno  … Second example: Lossless (sid  name) Dep-preserving

Example 2 Given schema R(name, street, city, st, zip, item, price),
FD set name  street, city street, city  st street, city  zip name, item  price and decomposition R1(name, street, city, st, zip) and R2(name, item, price) Is it lossless? Is it dependency preserving? What if we replaced the first FD with name, street  city? Lossless: name  street, city, st, zip Dep-preserving: name  street, city; street, city  st, zip

A More Disturbing Example…
Given schema R(sid, fid, subj) and FD set: fid  subj sid, subj  fid Consider the decomposition R1(sid, fid) and R2(fid, subj) Is it lossless? Is it dependency preserving? If it isn’t, can you think of a decomposition that is? Can you do this non-redundantly? Lossless: fid  subj Not dep-preserving: sid, subj are split Decomposition would be: R1(sid, fid), R2(fid, sid, subj) which is redundant

Redundancy vs. FDs Ideally, we want a design s.t. for each nontrivial dependency X  Y, X is a superkey for some relation schema in R We just saw that this isn’t always possible in a non-redundant way… Thus we have two kinds of normal forms, Boyce-Codd and Third Normal Form

Two Important Normal Forms
Boyce-Codd Normal Form (BCNF). For every relation scheme R and for every X  A that holds over R, either A  X (it is trivial) ,or or X is a superkey for R Third Normal Form (3NF). For every relation scheme R and for every X  A that holds over R, either A  X (it is trivial), or X is a superkey for R, or A is a member of some key for R BCNF: the only functional dependencies are from the key. Partial dependencies aren’t allowed: if AB  CD and B  D then we fail on B  R. Ditto for our previous example, which was of the form A  C, BC  A. 3NF: the same but we allow for partial dependencies. This allows for the above example to be expressed as (A,B,C) because A  R and B & C both  R. We still disallow partial keys, AB  CD, B  D because B isn’t a key for R. We also disallow transitive keys, AB  C, C  D because C isn’t a key for R.

Normal Forms Compared BCNF is preferable, but sometimes in conflict with the goal of dependency preservation It’s strictly stronger than 3NF Let’s see algorithms to obtain: A BCNF lossless join decomposition (nondeterministic) A 3NF lossless join, dependency preserving decomposition

BCNF Decomposition Algorithm (from Korth et al
BCNF Decomposition Algorithm (from Korth et al.; our book gives a recursive version) result := {R} compute F+ while there is a relation schema Ri in result that isn’t in BCNF { let A  B be a nontrivial FD on Ri s.t. A  Ri is not in F+ and A and B are disjoint result:= (result – Ri)  {(Ri - B), (A,B)} } R = (sid, name, serno, cid, subj, exp-grade) Compute F+: sid  name sid, serno  exp-grade serno  cid, subj cid  subj sid, serno  R (many trival deps, augmented deps) Step 1: result := Ra(sid, serno, cid, subj, exp-grade) U R1(sid, name) Step 2: result := Rb(sid, serno, cid, subj) U R2(sid, serno, exp-grade) U R1(sid, name) Step 3: result := Rc(sid, serno) U R3a(serno, cid, subj) U R2(sid, serno, exp-grade) U R1(sid, name) Step 4: result := Rc(sid, serno) U R4(cid, subj) U R3b(serno, cid) U Can get different decomp with different ordering; can also get redundancy across relations (e.g., Rc)

3NF Decomposition Algorithm
Let F be a minimal cover i:=0 for each FD A  B in F { if none of the schemas Rj, 1 j  i, contains AB { increment i Ri := (A, B) } if no schema Rj, 1  j  i contains a candidate key for R { Ri := any candidate key for R return (R1, …, Ri) Build dep.- preserving decomp. R = (sid, name, serno, cid, subj, exp-grade) Minimal cover: sid  name serno  cid cid  subj sid, serno  exp-grade R1 = (sid, name) R2 = (serno, cid) R3 = (cid, subj) R4 = (sid, serno, exp-grade) Ensure lossless decomp.

Summary of Normalization
We can always decompose into 3NF and get: Lossless join Dependency preservation But with BCNF we are only guaranteed lossless joins BCNF is stronger than 3NF: every BCNF schema is also in 3NF The BCNF algorithm is nondeterministic, so there is not a unique decomposition for a given schema R

XML: A Semi-Structured Data Model

Why XML? XML is the confluence of several factors:
The Web needed a more declarative format for data Documents needed a mechanism for extended tags Database people needed a more flexible interchange format “Lingua franca” of data It’s parsable even if we don’t know what it means! Original expectation: The whole web would go to XML instead of HTML Today’s reality: Not so… But XML is used all over “under the covers”

Why DB People Like XML Can get data from all sorts of sources
Allows us to touch data we don’t own! This was actually a huge change in the DB community Interesting relationships with DB techniques Useful to do relational-style operations Leverages ideas from object-oriented, semistructured data Blends schema and data into one format Unlike relational model, where we need schema first … But too little schema can be a drawback, too!

XML Anatomy Processing Instr. Open-tag Element Attribute Close-tag
<?xml version="1.0" encoding="ISO " ?> <dblp> <mastersthesis mdate=" " key="ms/Brown92"> <author>Kurt P. Brown</author> <title>PRPL: A Database Workload Specification Language</title> <year>1992</year> <school>Univ. of Wisconsin-Madison</school> </mastersthesis> <article mdate=" " key="tr/dec/SRC "> <editor>Paul R. McJones</editor> <title>The 1995 SQL Reunion</title> <journal>Digital System Research Center Report</journal> <volume>SRC </volume> <year>1997</year> <ee>db/labs/dec/SRC html</ee> <ee> </article> Open-tag Element Attribute Close-tag

Well-Formed XML A legal XML document – fully parsable by an XML parser
All open-tags have matching close-tags (unlike so many HTML documents!), or a special: <tag/> shortcut for empty tags (equivalent to <tag></tag> Attributes (which are unordered, in contrast to elements) only appear once in an element There’s a single root element XML is case-sensitive

XML as a Data Model XML “information set” includes 7 types of nodes:
Document (root) Element Attribute Processing instruction Text (content) Namespace: Comment XML data model includes this, plus typing info, plus order info and a few other things

XML Data Model Visualized (and simplified!)
root attribute p-i element Root text ?xml dblp mastersthesis article mdate mdate key key 2002… author title year school editor title journal volume year ee ee 2002… 1992 1997 ms/Brown92 The… tr/dec/… PRPL… Digital… db/labs/dec Kurt P…. Univ…. Paul R. SRC…

What Does XML Do? Serves as a document format (super-HTML)
Allows custom tags (e.g., used by MS Word, openoffice) Supplement it with stylesheets (XSL) to define formatting Data exchange format (must agree on terminology) Marshalling and unmarshalling data in SOAP and Web Services

XML as a Super-HTML (MS Word)
<h1 class="Section1"> <a name="_top“ />CIS 550: Database and Information Systems</h1> <h2 class="Section1">Fall 2003</h2> <p class="MsoNormal"> <place>311 Towne</place>, Tuesday/Thursday <time Hour="13" Minute="30">1:30PM – 3:00PM</time> </p>

XML Easily Encodes Relations
Student-course-grade sid serno exp-grade 1 570103 B 23 550103 A <student-course-grade> <tuple><sid>1</sid><serno>570103</serno><exp-grade>B</exp-grade></tuple> <tuple><sid>23</sid><serno>550103</serno><exp-grade>A</exp-grade></tuple> </student-course-grade>

But XML is More Flexible… “Non-First-Normal-Form” (NF2)
<parents> <parent name=“Jean” > <son>John</son> <daughter>Joan</daughter> <daughter>Jill</daughter> </parent> <parent name=“Feng”> <daughter>Felicity</daughter> …

XML and Code Web Services (.NET, recent Java web service toolkits) are using XML to pass parameters and make function calls Why? Easy to be forwards-compatible Easy to read over and validate (?) Generally firewall-compatible Drawbacks? XML is a verbose and inefficient encoding! XML is used to represent: SOAP: the “envelope” that data is marshalled into XML Schema: gives some typing info about structures being passed WSDL: the IDL (interface def language) UDDI: provides an interface for querying about web services

Integrating XML: What If We Have Multiple Sources with the Same Tags?
Namespaces allow us to specify a context for different tags Two parts: Binding of namespace to URI Qualified names <tag xmlns:myns=“ <thistag>is in namespace myns</thistag> <myns:thistag>is the same</myns:thistag> <otherns:thistag>is a different tag</otherns:thistag> </tag>

XML Isn’t Enough on Its Own
It’s too unconstrained for many cases! How will we know when we’re getting garbage? How will we query? How will we understand what we got? We also need: Some idea of the structure Our focus next Presentation, in some cases – XSL(T) We’ll talk about this soon Some way of interpreting the tags…? We’ll talk about this later in the semester

Structural Constraints: Document Type Definitions (DTDs)
The DTD is an EBNF grammar defining XML structure XML document specifies an associated DTD, plus the root element DTD specifies children of the root (and so on) DTD defines special significance for attributes: IDs – special attributes that are analogous to keys for elements IDREFs – references to IDs IDREFS – a nasty hack that represents a list of IDREFs

An Example DTD Example DTD: … Example use of DTD in XML file:
<!ELEMENT dblp((mastersthesis | article)*)> <!ELEMENT mastersthesis(author,title,year,school,committeemember*)> <!ATTLIST mastersthesis(mdate CDATA #REQUIRED key ID #REQUIRED advisor CDATA #IMPLIED> <!ELEMENT author(#PCDATA)> … Example use of DTD in XML file: <?xml version="1.0" encoding="ISO " ?> <!DOCTYPE dblp SYSTEM “my.dtd"> <dblp>…

Representing Graphs in XML
<?xml version="1.0" encoding="ISO " ?> <!DOCTYPE graph SYSTEM “special.dtd"> <graph> <author id=“author1”> <name>John Smith</name> </author> <article> <author ref=“author1” /> <title>Paper1</title> </article> <author ref=“author1” /> <title>Paper2</title> …

Graph Data Model Root graph ?xml !DOCTYPE article article author id
title author title author name author1 Paper1 ref Paper2 ref John Smith author1 author1

Graph Data Model Root graph ?xml !DOCTYPE article article author id
title author title author name author1 Paper1 ref Paper2 ref John Smith

DTDs Aren’t Enough for Some People
DTDs capture grammatical structure, but have some drawbacks: Not themselves in XML – inconvenient to build tools for them Don’t capture database datatypes’ domains IDs aren’t a good implementation of keys Why not? No way of defining OO-like inheritance No way of expressing FDs

XML Schema Aims to address the shortcomings of DTDs Features:
XML syntax Can define keys using XPaths Type subclassing that’s more complex than in a programming language Programming languages don’t consider order of member variables! Subclassing “by extension” and “by restriction” … And, of course, domains and built-in datatypes

Simple Schema Example <xsd:schema xmlns:xsd=" <xsd:element name=“mastersthesis" type=“ThesisType"/> <xsd:complexType name=“ThesisType"> <xsd:attribute name=“mdate" type="xsd:date"/> <xsd:attribute name=“key" type="xsd:string"/> <xsd:attribute name=“advisor" type="xsd:string"/> <xsd:sequence> <xsd:element name=“author" type=“xsd:string"/> <xsd:element name=“title" type=“xsd:string"/> <xsd:element name=“year" type=“xsd:integer"/> <xsd:element name=“school" type=“xsd:string”/> <xsd:element name=“committeemember" type=“CommitteeType” minOccurs=“0"/> </xsd:sequence> </xsd:complexType>

Designing an XML Schema/DTD
Not as formalized as relational data design We can still use ER diagrams to break into entity, relationship sets ER diagrams have extensions for “aggregation” – treating smaller diagrams as entities – and for composite attributes Note that often we already have our data in relations and need to design the XML schema to export them! Generally orient the XML tree around the “central” objects Big decision: element vs. attribute Element if it has its own properties, or if you *might* have more than one of them Attribute if it is a single property – or perhaps not!

XML as a Data Model XML is a non-first-normal-form (NF2) representation Can represent documents, data Standard data exchange format Several competing schema formats – esp., DTD and XML Schema – provide typing information Next time: querying XML

Schema Normalization & XML

Similar presentations

Presentation on theme: "Schema Normalization & XML"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Schema Normalization & XML

Similar presentations

Presentation on theme: "Schema Normalization & XML"— Presentation transcript:

Similar presentations

About project

Feedback