Managing XML and Semistructured Data
Lecture 10: Schema
Prof. Dan Suciu
Spring 2001
In this lecture
  Schema for unordered data
  Conformance and classification using simulation
  Upper bound and lower bound schemas
Resources
  MSL: A Model for W3C XML Schema, by Brown, Fuchs, Robie, Wadler, in WWW10, 2001.
  Subsumption for XML Types, by Kuper and Simeon, ICDT 2001.
  Data on the Web, by Abiteboul, Buneman, Suciu: chapter 7
Schemas for Relational vs. SS Data
  Schemas in relational data:
    Defined before the data
    Strictly enforced
  Schemas in SS data:
    Defined after the data
    May not be enforced (some data doesn't have a schema)
When the Schema is Created
  Created by the user
    Before or after the data
    Does not need to be unique
  Extracted from the data
    Unheard of in relational databases
  Extracted from the query
    Like type inference in programming languages
  Different schema formalisms for each task!
Schemas for Unordered Data
  Data model: OEM
  Data may contain cycles
  The schema itself will be a graph
Schemas: An Example
  Some database:
  [Figure: data graph rooted at &r1, with company nodes &c1, &c2 and person nodes &p1, &p2, &p3. Edge labels: company, person, name, address, url, position, phone, employee, manages, c.e.o., works-for, description, procurement, salesrep, contact, task, eval. Leaf values include "Widget", "Trenton", "Gadget", "Paris", "www.gp.fr", "Smith", "Manager", "Jones", "5552121", "Dupont", "Sales", 1997, 1998, "on target", "below target".]
Graph Schemas
  [Figure: upper bound schema with classes Root, Company, Employee, string, Any. Edge labels: person, company, works-for, managed-by, c.e.o. | employee, name | address | url, name | phone | position, description, *.]
  Upper Bound Schema
The Two Questions to Ask
  Conformance: does that data conform to this schema?
  Classification: if so, then which objects belong to what classes?
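Graph Simulation
  Definition. Given two edge-labeled graphs G1, G2, a simulation is a relation R between their nodes such that: if (x1, x2) is in R and (x1, a, y1) is an edge in G1, then there exists an edge (x2, a, y2) in G2 (same label a) such that (y1, y2) is in R.
  [Figure: edge x1 -a-> y1 in G1 and edge x2 -a-> y2 in G2, with x1 R x2 and y1 R y2.]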
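The definition can be checked mechanically. The following is a minimal sketch, assuming graphs are given as sets of (source, label, target) triples; the function name `is_simulation` and the toy graphs are illustrative and not from the lecture.

```python
def is_simulation(R, g1_edges, g2_edges):
    """Return True if R, a set of (x1, x2) node pairs, is a simulation
    from graph G1 to graph G2. Edges are (source, label, target) triples."""
    for (x1, x2) in R:
        for (src1, label, y1) in g1_edges:
            if src1 != x1:
                continue
            # Some edge x2 --label--> y2 must exist in G2 with (y1, y2) in R.
            if not any(src2 == x2 and lab2 == label and (y1, y2) in R
                       for (src2, lab2, y2) in g2_edges):
                return False
    return True


# Hypothetical toy graphs, for illustration only.
g1 = {("r", "person", "p1"), ("p1", "name", "s1")}
g2 = {("Root", "person", "Person"),
      ("Person", "name", "String"),
      ("Person", "age", "Int")}

good = {("r", "Root"), ("p1", "Person"), ("s1", "String")}
bad = {("r", "Root"), ("p1", "String")}  # p1's name edge is not matched

print(is_simulation(good, g1, g2))
print(is_simulation(bad, g1, g2))
```

Note that the empty relation is trivially a simulation, which is why the maximal simulation is the interesting one.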
Using Simulation
  Data graph D, schema S
  Conformance: find the maximal simulation R from D to S
    Notation: D ≼ S
  Classification: check if (x, c) is in R
    Notation: x ≼ c
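One simple way to obtain the maximal simulation is to start from all (data node, class) pairs and repeatedly discard pairs that violate the simulation condition. The sketch below assumes plain edge labels and triples of the form (source, label, target); it also assumes that conformance means the data root is related to the schema root. The names `maximal_simulation`, `conforms` and the toy graphs are hypothetical, and faster algorithms exist.

```python
def maximal_simulation(d_edges, s_edges):
    """Largest simulation from data graph D to schema graph S (naive refinement)."""
    d_nodes = {n for (x, _, y) in d_edges for n in (x, y)}
    s_nodes = {n for (c, _, c2) in s_edges for n in (c, c2)}
    R = {(x, c) for x in d_nodes for c in s_nodes}
    changed = True
    while changed:
        changed = False
        for (x, c) in list(R):
            # Every data edge x --a--> y needs a schema edge c --a--> c2 with (y, c2) in R.
            ok = all(
                any(src == c and b == a and (y, c2) in R
                    for (src, b, c2) in s_edges)
                for (x2, a, y) in d_edges if x2 == x)
            if not ok:
                R.discard((x, c))
                changed = True
    return R


def conforms(d_edges, s_edges, d_root, s_root):
    """Conformance, assuming it means: the roots are related by the maximal simulation."""
    return (d_root, s_root) in maximal_simulation(d_edges, s_edges)


# Hypothetical toy data and schema.
schema = {("Root", "person", "Person"),
          ("Person", "name", "String"),
          ("Person", "phone", "String")}
data = {("r", "person", "p1"), ("p1", "name", "s1"), ("p1", "phone", "s2")}
data_with_email = data | {("p1", "email", "s3")}

R = maximal_simulation(data, schema)
print(("p1", "Person") in R)                          # classification of p1
print(conforms(data, schema, "r", "Root"))
print(conforms(data_with_email, schema, "r", "Root"))
```

In this sketch leaf nodes have no outgoing edges, so they are related to every class; typed leaf values are not modeled here.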
Example
  [Figure: the example database graph (left) shown next to the upper bound graph schema (right, classes Root, Company, Employee, string, Any).]
Formally
  A graph schema S is a graph such that:
    Nodes are called classes
    Edges are labeled with unary predicates, p(x)
  Examples:
    person = "x = person"
    name | address = "x = name ∨ x = address"
    * = "true"
    int = "x ∈ int"
    ¬name = "x ≠ name"
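Unary predicates on edge labels map naturally onto small functions. The sketch below is only illustrative: the helper names (`label`, `one_of`, `anything`, `is_int`, `all_except`, `targets`) and the toy schema are assumptions, not part of the lecture.

```python
def label(name):
    """Predicate for a single label, e.g. person."""
    return lambda x: x == name


def one_of(*names):
    """Predicate for an alternative, e.g. name | address."""
    return lambda x: x in names


def anything(x):
    """The wildcard predicate *."""
    return True


def is_int(x):
    """The predicate int."""
    return isinstance(x, int)


def all_except(*names):
    """Predicate for any label other than the given ones."""
    return lambda x: x not in names


# A hypothetical schema: edges are (class, predicate, class) triples.
schema = [
    ("Root", label("person"), "Person"),
    ("Person", one_of("name", "address"), "String"),
    ("Person", label("description"), "Any"),
    ("Any", anything, "Any"),
]


def targets(schema, cls, edge_label):
    """Classes reachable from cls along a schema edge whose predicate accepts edge_label."""
    return {dst for (src, pred, dst) in schema if src == cls and pred(edge_label)}


print(targets(schema, "Person", "name"))
print(targets(schema, "Any", "whatever"))
print(targets(schema, "Person", "phone"))
```

With predicates on edges, the simulation condition replaces "same label" by "the schema edge's predicate holds for the data edge's label".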
Examples of Graph Schemas (What Do They Mean?)
  [Figure: schema graphs S1 to S5. S4 is the "universal schema" ST, a single class with a * self-loop.]
  S1 = there are persons; each can have several names and several ages, of types string and int respectively.
  S2 = same as S1, plus persons can have description edges, under which we can have any edge with any label except name and age.
  S3 = describes a hierarchy of parts and subparts, arbitrarily deep. Leaf subparts (or parts) may have names and prices.
  S4 = describes ANY database.
  S5 = persons may have name and age of types string and int, and under description there may be any structure.
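Example: D conforms to S
  [Figure: a data graph D (person nodes with name and age edges, e.g. "Smith", 55) and a schema S (person, with name of type string and age of type int), with the simulation from D to S drawn between them.]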
"Universal schema" ST: a single class with a * self-loop
  Any database conforms to ST!
Schemas in SS Data vs. Relational Data
  Relational data
    Each data instance has exactly one schema
  Semistructured data
    One data instance has several schemas
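The Classification Problem
  [Figure: data D with person nodes carrying name, phone and email edges (values "John", 1234, "Mary"), and a schema with two person classes, one with name and phone, one with name and email, all of type string.]
  The schema is nondeterministic: it creates ambiguous classifications.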
The Classification Problem
  Definition. A schema S is deterministic if for every class c and every label a, there is at most one outgoing edge labeled a from c.
  Fact: if S is deterministic and D is a tree, then each node is uniquely classified. (When D is not a tree, this is not true.)
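The determinism condition is easy to test directly. A minimal sketch, assuming plain labels and (class, label, class) triples; `is_deterministic` and both toy schemas are hypothetical names chosen for illustration.

```python
from collections import Counter


def is_deterministic(s_edges):
    """True if no class has two outgoing edges carrying the same label."""
    counts = Counter((c, a) for (c, a, _) in set(s_edges))
    return all(n <= 1 for n in counts.values())


# Two person classes reachable from Root by the same label: nondeterministic.
nondet = {("Root", "person", "P1"), ("Root", "person", "P2"),
          ("P1", "name", "String"), ("P1", "phone", "String"),
          ("P2", "name", "String"), ("P2", "email", "String")}

# A single person class: deterministic.
det = {("Root", "person", "P"),
       ("P", "name", "String"), ("P", "phone", "String"),
       ("P", "email", "String")}

print(is_deterministic(nondet))
print(is_deterministic(det))
```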
Deterministic Upper Bound Schemas
  Given a schema S, we can always construct a deterministic approximation, Sd.
  [Figure: S has two person classes, one with name and phone, one with name and email; Sd has a single person class with name, phone and email, all of type string.]
  In general, Sd is obtained by a powerset construction, which is expensive.
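The powerset construction can be sketched in the same way as for finite automata: each class of Sd is a set of classes of S. The code below assumes plain labels (no predicates) and (class, label, class) triples; `determinize` and the toy schema are illustrative names only.

```python
def determinize(s_edges, root):
    """Powerset construction: returns (start_class, edges) of a deterministic schema.
    Classes of the result are frozensets of classes of the input schema."""
    start = frozenset([root])
    d_edges = set()
    seen = {start}
    todo = [start]
    while todo:
        state = todo.pop()
        labels = {a for (c, a, _) in s_edges if c in state}
        for a in labels:
            # Merge all targets reachable by label a from any class in the state.
            target = frozenset(c2 for (c, b, c2) in s_edges
                               if c in state and b == a)
            d_edges.add((state, a, target))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return start, d_edges


# Hypothetical encoding of the nondeterministic schema S from the slide.
S = {("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "String"), ("P1", "phone", "String"),
     ("P2", "name", "String"), ("P2", "email", "String")}

start, Sd = determinize(S, "Root")
for (src, a, dst) in sorted(Sd, key=lambda e: (sorted(e[0]), e[1])):
    print(sorted(src), a, sorted(dst))
```

The merged class {P1, P2} allows name, phone and email together, so Sd accepts some data that S rejects; this is the sense in which it is an approximation. In the worst case the number of merged classes is exponential in the size of S.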
Lower Bound Schemas
  Introduced (under a different name and formalism) in: Nestorov, Abiteboul, Motwani, Extracting Schema from Semistructured Data, SIGMOD 98
  Goal: extract some "schema" from the data
  We will see later why these are "lower bound schemas"
Lower Bound Schemas
  Schema = datalog program with special form
  Classes = predicates

    Company(x) :- link(x,"c.e.o.",y), Employee(y),
                  link(x,"name",z), String(z),
                  link(x,"address",u), String(u)

    Employee(x) :- link(x,"works-for",y), Company(y),
                   link(x,"managed-by",z), Employee(z),
                   link(x,"name",u), String(u),
                   link(x,"address",v), String(v)

    Root(x) :- link(x,"company",y), Company(y),
               link(x,"person",z), Employee(z)

  Maximal fixpoint, rather than minimal fixpoint! (next)
Datalog Minimal Fixpoint
  Standard datalog semantics. Transform program to:

    Company(x) ⇐ ∃y.∃z.∃u.(link(x,"c.e.o.",y), Employee(y),
                           link(x,"name",z), String(z),
                           link(x,"address",u), String(u))

    Employee(x) ⇐ ∃y.∃z.∃u.∃v.(link(x,"works-for",y), Company(y),
                               link(x,"managed-by",z), Employee(z),
                               link(x,"name",u), String(u),
                               link(x,"address",v), String(v))

    Root(x) ⇐ ∃y.∃z.(link(x,"company",y), Company(y),
                     link(x,"person",z), Employee(z))

  Compute the minimal model. What is it?
  Answer: the empty set! Every rule is recursive, so nothing is ever derived.
Datalog Maximal Fixpoint
  Take the same rules, but compute the maximal model: the largest sets Company, Employee, Root in which every member satisfies the body of its rule.
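The maximal fixpoint can be computed by starting with every node in every class and removing nodes whose rule body fails, until nothing changes. The sketch below is one naive way to do this, under assumptions: each rule is encoded as a list of (label, required class) pairs, String is treated as a given base set, and the function name `maximal_model` and the tiny data graph are hypothetical.

```python
def maximal_model(links, rules, base):
    """Greatest fixpoint: the largest class extents in which every member
    satisfies its rule body. links are (source, label, target) triples."""
    nodes = {n for (x, _, y) in links for n in (x, y)}
    model = {cls: set(nodes) for cls in rules}          # start with everything
    model.update({cls: set(members) for cls, members in base.items()})
    changed = True
    while changed:
        changed = False
        for cls, body in rules.items():
            for x in list(model[cls]):
                satisfied = all(
                    any(src == x and lab == a and y in model[target]
                        for (src, lab, y) in links)
                    for (a, target) in body)
                if not satisfied:
                    model[cls].discard(x)
                    changed = True
    return model


# The three rules from the slide, encoded as (label, required class) pairs.
rules = {
    "Company": [("c.e.o.", "Employee"), ("name", "String"), ("address", "String")],
    "Employee": [("works-for", "Company"), ("managed-by", "Employee"),
                 ("name", "String"), ("address", "String")],
    "Root": [("company", "Company"), ("person", "Employee")],
}

# Hypothetical tiny data graph, for illustration only.
links = {
    ("r", "company", "c1"), ("r", "person", "p1"),
    ("c1", "c.e.o.", "p1"), ("c1", "name", "s1"), ("c1", "address", "s2"),
    ("p1", "works-for", "c1"), ("p1", "managed-by", "p1"),
    ("p1", "name", "s3"), ("p1", "address", "s4"),
}
base = {"String": {"s1", "s2", "s3", "s4"}}

model = maximal_model(links, rules, base)
print(model["Company"], model["Employee"], model["Root"])
```

On the same data the minimal model leaves Company, Employee and Root empty, since Company and Employee depend on each other and neither can be derived first.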
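Lower-Bound Schemas
  Equivalent representation of the schema as a graph:
  [Figure: classes Root, Company, Employee, string. Edges: Root -company-> Company, Root -person-> Employee, Company -c.e.o.-> Employee, Company -name-> string, Company -address-> string, Employee -works-for-> Company, Employee -managed-by-> Employee, Employee -name-> string, Employee -address-> string.]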
Simulation Strikes Back
  [Figure: the lower bound schema graph (classes Root, Company, Employee, string) shown next to the example database graph.]
  A model of program P is precisely a simulation (and vice versa)
Summary: Lower vs. Upper Bound Schemas
  Upper bound schemas
    Tell us what edges are allowed
  Lower bound schemas
    Tell us what edges are required