1
Management of XML and Semistructured Data
Lecture 9: Schemas Friday, April 27, 2001
2
Outline
Schemas for semistructured data (unordered)
Schemas for XML
3
Schemas for Relational vs. SS Data
Schemas in relational data:
  Defined before the data
  Strictly enforced
Schemas in SS data:
  Defined after the data
  May not be enforced (some data doesn’t have a schema)
4
When the Schema is Created
Created by the user
  Before or after the data
  Does not need to be unique
Extracted from the data
  Unheard of in relational databases
Extracted from the query
  Like type inference in programming languages
Different schema formalisms for each task!
5
Schemas for Unordered Data
OEM
Cycles
Schema itself will be a graph
6
Schemas: An Example
Some database:
[Figure: OEM data graph with root &r1, company objects &c1 and &c2, person objects &p1 to &p3, value objects &s0 to &s10 and a description subtree &a1 to &a7. Edge labels: company, person, name, address, url, position, phone, employee, manages, c.e.o., works-for, description, procurement, salesrep, contact, task, eval. Values include “Widget”, “Trenton”, “Gadget”, “Paris”, “Smith”, “Manager”, “Jones”, “Dupont”, “Sales”, 1997, 1998, “on target”, “below target”.]
7
Graph Schemas
[Figure: upper bound schema. Nodes shown include Root, Employee, string and Any. Edge labels: person, company, works-for, managed-by, c.e.o. | employee, name | address | url, name | phone | position, description, *.]
Upper Bound Schema
8
The Two Questions to Ask
Conformance: does the data conform to this schema?
Classification: if so, which objects belong to which classes?
9
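Graph Simulation
Definition. Given two edge-labeled graphs G1 and G2, a simulation is a relation R between their nodes such that:
  if (x1, x2) is in R and (x1, a, y1) is an edge of G1,
  then there exists an edge (x2, a, y2) in G2 (same label a) such that (y1, y2) is in R.
[Figure: x1 in G1 and x2 in G2 related by R; their a-edges lead to y1 and y2, which are also related by R.]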
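The definition can be turned into a simple refinement procedure: start from all pairs of nodes and delete pairs that violate the condition until nothing changes. The following is a minimal sketch, assuming graphs are given as sets of (source, label, target) triples; the function name and the tiny example graphs are illustrative and not from the lecture.

```python
from itertools import product

def maximal_simulation(g1, g2):
    """Return the largest simulation from g1 to g2.

    Each graph is a set of (source, label, target) triples.
    Nodes without any edge are not represented (assumption of this sketch).
    """
    nodes1 = {n for (s, _, t) in g1 for n in (s, t)}
    nodes2 = {n for (s, _, t) in g2 for n in (s, t)}
    # Start from all pairs and delete pairs that violate the definition.
    rel = set(product(nodes1, nodes2))
    changed = True
    while changed:
        changed = False
        for (x1, x2) in list(rel):
            for (s1, a, y1) in g1:
                if s1 != x1:
                    continue
                # x2 needs an a-edge to some y2 with (y1, y2) still in rel.
                if not any(s2 == x2 and b == a and (y1, y2) in rel
                           for (s2, b, y2) in g2):
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel

# Illustrative graphs (hypothetical, not the lecture's example).
G1 = {("r", "person", "p"), ("p", "name", "n")}
G2 = {("Root", "person", "Person"),
      ("Person", "name", "Str"),
      ("Person", "age", "Int")}
R = maximal_simulation(G1, G2)
```

Because the union of two simulations is again a simulation, the result of this deletion loop is the unique maximal one. The loop is deliberately naive; it favors readability over the running time of specialized algorithms.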
10
Using Simulation
Data graph D, schema S
Conformance: find the maximal simulation R from D to S. Notation: D ≼ S
Classification: check whether (x, c) is in R. Notation: x ≼ c
11
Example
[Figure: the database graph from the earlier example (root &r1, companies &c1 and &c2, persons &p1 to &p3, values &s0 to &s10, description subtree &a1 to &a7) drawn next to the graph schema (Root, Company, Employee, string, Any, with edges such as c.e.o. | employee, name | address | url, name | phone | position, description and *). The two graphs are labeled Database and Graph Schema.]
12
Formally
A graph schema S is a graph such that:
  Nodes are called classes
  Edges are labeled with unary predicates, p(x)
Examples:
  person = “x = person”
  name | address = “x = name ∨ x = address”
  * = “true”
  int = “x ∈ int”
  ¬name = “x ≠ name”
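One way to make the predicate labels concrete is to represent each schema edge as a triple (class, predicate, class), where the predicate is a function applied to the data edge's label. The sketch below is an assumption-laden illustration: the helper names (`eq`, `one_of`, `any_label`, `is_int`, `conforms`), the schema and the data are all hypothetical, and atomic values are modeled as edge labels leading to leaf nodes.

```python
from itertools import product

def eq(label):
    return lambda x: x == label          # person = "x = person"

def one_of(*labels):
    return lambda x: x in labels         # name | address

def any_label(x):
    return True                          # * = "true"

def is_int(x):
    return isinstance(x, int)            # int

def conforms(data, schema, data_root, schema_root):
    """Return (ok, R), where R is the maximal simulation from data to schema."""
    d_nodes = {n for (s, _, t) in data for n in (s, t)}
    s_nodes = {n for (s, _, t) in schema for n in (s, t)}
    rel = set(product(d_nodes, s_nodes))
    changed = True
    while changed:
        changed = False
        for (x, c) in list(rel):
            # Every data edge x -a-> y needs a schema edge c -pred-> c1
            # with pred(a) true and (y, c1) still in rel.
            ok = all(
                any(c0 == c and pred(a) and (y, c1) in rel
                    for (c0, pred, c1) in schema)
                for (x0, a, y) in data if x0 == x
            )
            if not ok:
                rel.discard((x, c))
                changed = True
    return (data_root, schema_root) in rel, rel

# Hypothetical schema and data, for illustration only.
SCHEMA = [
    ("Root", eq("person"), "Person"),
    ("Person", one_of("name", "address"), "Text"),
    ("Person", eq("age"), "Number"),
    ("Number", is_int, "Leaf"),
    ("Text", any_label, "Leaf"),
]
DATA = [
    ("r", "person", "p1"),
    ("p1", "name", "n1"), ("n1", "Smith", "v1"),
    ("p1", "age", "a1"), ("a1", 55, "v2"),
]
ok, rel = conforms(DATA, SCHEMA, "r", "Root")
```

Here `ok` answers the conformance question, and membership of a pair such as `("p1", "Person")` in `rel` answers the classification question for one object.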
13
Examples of Graph Schemas (What Do They Mean ?)
[Figure: five graph schemas S1 to S5, built from the edge labels person, name, age, description, part, subpart, price and *, and the types string and int.]
S1 = there are persons; each can have several names and several ages, of types string and int respectively.
S2 = same as S1, plus persons can have description edges, under which we can have any edge with any label except name and age.
S3 = describes a hierarchy of parts and subparts, arbitrarily deep. Leaf subparts (or parts) may have names and prices.
S4 = describes ANY database. This is the “universal schema”, ST, whose only edge label is *.
S5 = persons may have name and age of types string and int, and under description there may be any structure.
14
[Figure: a data graph D, whose person objects carry name and age edges (values Smith and 55), next to a schema S with person, name and age edges of types string and int.]
D ≼ S
15
[Figure: the schema ST, drawn with a single * edge.]
Any database conforms to ST! “Universal schema”
16
Schemas in SS Data vs. Relational Data
Relational data
  Each data instance has exactly one schema
Semistructured data
  One data instance has several schemas
17
The Classification Problem
[Figure: a data graph D whose person objects have name and phone edges (values John, 1234, Mary), next to a schema with several person edges leading to classes with name and phone edges of type string.]
Schema is nondeterministic: creates ambiguous classifications.
18
The Classification Problem
Definition: a schema S is deterministic if, for every class c and every label a, there is at most one outgoing edge labeled a from c.
Fact: if S is deterministic and D is a tree, then each node is uniquely classified. (When D is not a tree, this is not true.)
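The determinism condition is easy to test mechanically. The sketch below assumes schema edges carry plain labels rather than general predicates, and uses a hypothetical function name and example schemas.

```python
from collections import Counter

def is_deterministic(schema):
    """schema: iterable of (class, label, class) edges with plain labels."""
    # Count distinct outgoing edges per (class, label) pair.
    counts = Counter((c, a) for (c, a, _) in set(schema))
    return all(n <= 1 for n in counts.values())

# Hypothetical schemas: S has two person edges out of Root, Sd has one.
S = [("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"),
     ("P2", "name", "Str"), ("P2", "phone", "Str")]
Sd = [("Root", "person", "P"),
      ("P", "name", "Str"), ("P", "phone", "Str")]
```

With predicate labels the check would instead have to establish that no two predicates on edges leaving the same class can hold for the same label, which this sketch does not attempt.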
19
Deterministic Upper Bound Schemas
Given a schema S, we can always construct a deterministic approximation, Sd.
[Figure: S has two person edges, leading to classes with name and phone edges of type string; Sd has a single person edge, leading to one class with name and phone edges of type string.]
In general, Sd is obtained by the powerset construction, which is expensive.
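The powerset construction here is the same subset construction used to determinize finite automata: each class of Sd is a set of classes of S. A minimal sketch, assuming plain edge labels and a hypothetical input schema; in the worst case the number of reachable subsets is exponential, which is the cost the slide refers to.

```python
def determinize(schema, root):
    """Subset construction: return (edges, start) of a deterministic schema.

    schema: iterable of (class, label, class) edges.
    States of the result are frozensets of original classes.
    """
    start = frozenset([root])
    edges = set()
    seen = {start}
    todo = [start]
    while todo:
        state = todo.pop()
        # Group the targets of all edges leaving any class in this state.
        by_label = {}
        for (c, a, c2) in schema:
            if c in state:
                by_label.setdefault(a, set()).add(c2)
        for a, targets in by_label.items():
            target = frozenset(targets)
            edges.add((state, a, target))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return edges, start

# Hypothetical nondeterministic schema with two person edges.
S = [("Root", "person", "P1"), ("Root", "person", "P2"),
     ("P1", "name", "Str"),
     ("P2", "name", "Str"), ("P2", "phone", "Str")]
Sd_edges, Sd_root = determinize(S, "Root")
```

In this example the two person classes collapse into the single state {P1, P2}, which has one name edge and one phone edge.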
20
Lower Bound Schemas
Introduced (under a different name and formalism) in: Nestorov, Abiteboul, Motwani, Extracting Schema from Semistructured Data, SIGMOD 98
Goal: extract some “schema” from the data
We will see later why these are “lower bound schemas”
21
Lower Bound Schemas
Schema = datalog program with a special form
Classes = predicates
  Company(x) :- link(x,"c.e.o.",y), Employee(y),
                link(x,"name",z), String(z),
                link(x,"address",u), String(u)
  Employee(x) :- link(x,"works-for",y), Company(y),
                 link(x,"managed-by",z), Employee(z),
                 link(x,"name",u), String(u),
                 link(x,"address",v), String(v)
  Root(x) :- link(x,"company",y), Company(y),
             link(x,"person",z), Employee(z)
Maximal fixpoint, rather than minimal fixpoint! (next)
22
Datalog Minimal Fixpoint
Standard datalog semantics. Transform the program to:
  Company(x) ⇒ ∃y.∃z.∃u.(link(x,"c.e.o.",y), Employee(y), link(x,"name",z), String(z), link(x,"address",u), String(u))
  Employee(x) ⇒ ∃y.∃z.∃u.∃v.(link(x,"works-for",y), Company(y), link(x,"managed-by",z), Employee(z), link(x,"name",u), String(u), link(x,"address",v), String(v))
  Root(x) ⇒ ∃y.∃z.(link(x,"company",y), Company(y), link(x,"person",z), Employee(z))
Compute the minimal model. What is it?
23
Datalog Minimal Fixpoint
(Same transformed program as on the previous slide.)
Answer: the empty set! Company and Employee are defined only in terms of each other, so nothing forces any object into a class.
Compute the maximal model instead.
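The difference between the two fixpoints can be seen on a small instance. The sketch below is a simplified, hypothetical version of the program (it keeps only the c.e.o., works-for and name requirements) over made-up data; `fixpoint` iterates the rules either upward from empty classes or downward from classes containing every node.

```python
def satisfied(x, cls, rules, links, ext, strings):
    """True if node x meets every requirement of class cls under ext."""
    def holds(y, target):
        return y in strings if target == "String" else y in ext[target]
    return all(
        any(s == x and a == label and holds(y, target) for (s, a, y) in links)
        for (label, target) in rules[cls]
    )

def fixpoint(rules, links, strings, maximal):
    nodes = {n for (s, _, t) in links for n in (s, t)}
    # Maximal: start with every node in every class and shrink.
    # Minimal: start with empty classes and grow.
    ext = {c: (set(nodes) if maximal else set()) for c in rules}
    changed = True
    while changed:
        changed = False
        for c in rules:
            new = {x for x in nodes
                   if satisfied(x, c, rules, links, ext, strings)}
            if new != ext[c]:
                ext[c] = new
                changed = True
    return ext

# Simplified, hypothetical rules: class -> required (label, target class).
RULES = {
    "Company": [("c.e.o.", "Employee"), ("name", "String")],
    "Employee": [("works-for", "Company"), ("name", "String")],
}
LINKS = {
    ("c1", "c.e.o.", "p1"), ("c1", "name", "s1"),
    ("p1", "works-for", "c1"), ("p1", "name", "s2"),
    ("p2", "works-for", "c1"),   # p2 has no name edge
}
STRINGS = {"s1", "s2"}

largest = fixpoint(RULES, LINKS, STRINGS, maximal=True)
smallest = fixpoint(RULES, LINKS, STRINGS, maximal=False)
```

In this instance the upward iteration never gets started, because each class requires a member of the other, while the downward iteration keeps c1 as a Company and p1 as an Employee and discards p2 for lacking a name.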
24
Lower-Bound Schemas
Equivalent representation of the schema:
[Figure: graph with classes Root, Company, Employee and string; edges labeled person, company, works-for, managed-by, c.e.o., name and address.]
25
Simulation Strikes Back
[Figure: the lower bound schema (Root, Company, Employee, string) drawn next to the example database (&r1, &c1, &c2, &p1 to &p3, &s0 to &s10, &a1 to &a7). The two graphs are labeled Lower Bound and Database.]
A model of program P is precisely a simulation (and vice versa)
26
Summary: Lower vs. Upper Bound Schemas
Upper bound schemas
  Tell us what edges are allowed
Lower bound schemas
  Tell us what edges are required
27
Schema Extraction (From Data)
Problem statement:
  given a data instance D
  find the “most specific” schema S for D
In practice: S is too large, so we need to relax it [Nestorov, Abiteboul, Motwani 1998]
28
Schema Extraction: Sample Data
Example database D =
[Figure: root &r with eight employee edges to &p1 ... &p8 and one company edge to &c; five manages edges and five managedby edges among the employees; eight worksfor edges from the employees to &c.]
29
Lower Bound Schema Extraction
[NAM’98] approach:
Start with the schema given by the data (S = D):
  Each node = a predicate = a class
  . . .
  p4(x) :- link(x, manages, y), p5(y), link(x, worksfor, z), c(z)
  p5(x) :- link(x, managedby, y), p4(y), link(x, worksfor, z), c(z)
  . . .
Compute the maximal fixpoint (PTIME)
Declare two classes equal iff they are equal sets
  E.g. p4 = {&p1,&p4,&p6}, p6 = {&p1,&p4,&p6}, hence p4 = p6
  Equivalently, p = p’ iff p(&p’) and p’(&p)
30
Lower Bound Schema Extraction
Result S =
[Figure: extracted schema with classes Root = {&r}, Bosses = {&p1, &p4, &p6}, Regulars = {&p2, &p3, &p5, &p7, &p8} and Company = {&c}; edges labeled employee, company, manages, managedby and worksfor.]
31
Lower Bound Schema Extraction
Equivalently:
Compute the maximal simulation D ≼ D
  Can be done in time O(m²)
Two nodes x, x’ are equivalent iff x ≼ x’ and x’ ≼ x
Schema consists of the equivalence classes
Remark: could use the bisimulation relation instead (perhaps even better)
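The simulation-based extraction can be tried on the sample data. The sketch below is illustrative only: the slides do not say who manages whom, so the `BOSS_OF` mapping is an assumption chosen to be consistent with the Bosses and Regulars classes shown in the result, and the function names are hypothetical. The simulation loop is a naive refinement, not an optimized algorithm.

```python
def self_simulation(edges):
    """Largest simulation of a graph by itself, as a set of node pairs."""
    nodes = {n for (s, _, t) in edges for n in (s, t)}
    out = {n: [(a, t) for (s, a, t) in edges if s == n] for n in nodes}
    sim = {(x, y) for x in nodes for y in nodes}
    changed = True
    while changed:
        changed = False
        for (x, y) in list(sim):
            # Every edge x -a-> x2 needs a matching y -a-> y2 with (x2, y2) in sim.
            if not all(any(a == b and (x2, y2) in sim for (b, y2) in out[y])
                       for (a, x2) in out[x]):
                sim.discard((x, y))
                changed = True
    return nodes, sim

def extract_classes(edges):
    """Group nodes that simulate each other into classes."""
    nodes, sim = self_simulation(edges)
    return {frozenset(y for y in nodes if (x, y) in sim and (y, x) in sim)
            for x in nodes}

# Assumed management structure (not given explicitly on the slides).
BOSS_OF = {"p1": ["p2", "p3"], "p4": ["p5"], "p6": ["p7", "p8"]}

D = {("r", "company", "c")}
for boss, staff in BOSS_OF.items():
    D |= {("r", "employee", boss), (boss, "worksfor", "c")}
    for e in staff:
        D |= {("r", "employee", e), (boss, "manages", e),
              (e, "managedby", boss), (e, "worksfor", "c")}

classes = extract_classes(D)
```

Mutual simulation is an equivalence relation because the maximal simulation of a graph by itself is reflexive and transitive, so the sets collected by `extract_classes` partition the nodes.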
32
Upper Bound Schema Extraction
The extracted lower bound schema S is also an upper bound schema!
But: nondeterministic
Convert S → Sd
Alternatively, convert directly D → Dd = Sd
These are data guides [McHugh and Widom]
33
Upper Bound Schema Extraction
Result Sd =
[Figure: deterministic schema with classes Root = {&r}, Employees = {&p1, &p2, &p3, &p4, &p5, &p6, &p7, &p8}, Bosses = {&p1, &p4, &p6}, Regulars = {&p2, &p3, &p5, &p7, &p8} and Company = {&c}; edges labeled employee, company, manages, managedby and worksfor.]
34
XML Document Type Definitions
Part of the original XML specification
An XML document may have a DTD
Terminology for XML:
  well-formed: if tags are correctly closed
  valid: if it has a DTD and conforms to it
Validation is useful in data exchange
35
DTDs as Grammars
<!DOCTYPE paper [
  <!ELEMENT paper (section*)>
  <!ELEMENT section ((title,section*) | text)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT text (#PCDATA)>
]>
<paper>
  <section> <text> </text> </section>
  <section> <title> </title>
    <section> … </section>
    <section> … </section>
  </section>
</paper>
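Reading a DTD as a grammar means that each element's content model is a regular expression over the names of its children. The sketch below illustrates this by hand-translating the four declarations into Python regular expressions and checking them against a parsed document. It is not a DTD parser or a real validator (Python's `xml.etree.ElementTree` does not validate against DTDs); the `CONTENT` table, `is_valid` and the sample documents are assumptions made for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models. Each child tag is written as "tag "
# so that a sequence of children becomes one string to match.
CONTENT = {
    "paper": r"(section )*",
    "section": r"(title (section )*|text )",
    "title": r"",   # (#PCDATA): no child elements
    "text": r"",    # (#PCDATA): no child elements
}

def is_valid(elem):
    """Check the element and its descendants against CONTENT."""
    children = "".join(child.tag + " " for child in elem)
    model = CONTENT.get(elem.tag)
    if model is None or re.fullmatch(model, children) is None:
        return False
    return all(is_valid(child) for child in elem)

good = ET.fromstring(
    "<paper><section><text>t</text></section>"
    "<section><title>a</title><section><text>b</text></section></section>"
    "</paper>")
# A section may contain either text or a title, not both.
bad = ET.fromstring(
    "<paper><section><text>t</text><title>a</title></section></paper>")
```

Calling `is_valid(good)` accepts the first document, while `is_valid(bad)` rejects the second because its section mixes the two alternatives of the content model.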
36
DTDs as Schemas
Not so well suited:
  impose unwanted constraints on order: <!ELEMENT person (name,phone)>
  references cannot be constrained
  can be too vague: <!ELEMENT person ((name|phone| )*)>