Typing Semistructured Data. By Keshava Reddy Kottapally and Goutham Chinnapolamada. Source: Serge Abiteboul, Dan Suciu, Peter Buneman, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Series, ISBN X, 1999
Typing Semistructured Data Introduction: Schema for Semistructured data Motivation for typing Semistructured data Schema formalisms: –First-order logic –Datalog –Graph simulations Extracting schemas from data Inferring schemas from queries Path constraints
What is semistructured data..? Semistructured data has some structure, but is difficult to describe with a predefined, rigid schema –Irregularity –Continual evolution –Structure that is implicit or unknown to the user
What is typing..? Typing is about finding the structure of semistructured data The idea of structuring semistructured data is still an area of much research activity Typing involves finding methods to provide schemas for semistructured data Typing for SSD differs from typing for relational or object-oriented data and hence needs separate methods
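Uses of typing SSD
To optimize query evaluation
Example:
Original query: select X.title from biblio._ X where X.*.zip = “12345”
Optimized form: select X.title from biblio.book X where X.address.zip = “12345”
Knowing from the schema that only book objects can reach a zip edge, and only through address, lets the wildcards _ and * be replaced by concrete labels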
[Figure: schema graph for the biblio example, with classes C1 to C5; edge labels biblio, book, paper, title, author, journal, year, address, first name, last name, street, city, zip; atomic values of type string.]
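As a rough illustration of how a schema makes the rewrite in the query example possible, the sketch below enumerates the concrete label paths that match the wildcard pattern `_.*.zip`. The class layout in `SCHEMA` is an assumption made for illustration only, and `paths_to` and `expand` are hypothetical helper names, not part of any query engine.

```python
# Assumed schema graph: class -> list of (edge label, target class or atomic type).
# The exact edges are an illustrative guess, not taken from the original figure.
SCHEMA = {
    "C1": [("book", "C2"), ("paper", "C3")],
    "C2": [("title", "string"), ("author", "C4"), ("address", "C5")],
    "C3": [("title", "string"), ("journal", "string"),
           ("year", "string"), ("author", "C4")],
    "C4": [("first name", "string"), ("last name", "string")],
    "C5": [("street", "string"), ("city", "string"), ("zip", "string")],
}


def paths_to(schema, start, last_label):
    """Return all label paths from `start` that end with `last_label`.

    Assumes the schema graph is acyclic; a cyclic schema would need a
    visited set or a depth bound.
    """
    found = []

    def walk(node, prefix):
        for label, target in schema.get(node, []):
            path = prefix + [label]
            if label == last_label:
                found.append(path)
            walk(target, path)

    walk(start, [])
    return found


def expand(schema, root_class, last_label):
    """For each first-step label below `root_class`, list the paths ending in `last_label`."""
    result = {}
    for label, target in schema[root_class]:
        paths = paths_to(schema, target, last_label)
        if paths:
            result[label] = paths
    return result


expansion = expand(SCHEMA, "C1", "zip")
print(expansion)
```

Under the assumed schema, the only binding for X that can satisfy the condition is reached through book, and the only path to zip goes through address, which corresponds to the optimized form of the query.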
Uses of typing continued... To facilitate the task of integrating several data sources To improve storage –Better clustering may reduce number of page fetches, thus improving query performance To construct indexes To describe the database content to users and facilitate query formulation To proscribe certain updates
Two ways of typing.. Schema extraction –Given one particular data instance, finding the most specific schema for it –With semistructured data we may specify the type after the database is populated –A data instance may have more than one type Schema inference –Finding the most specific schema by analyzing the query –This process is similar to type inference in programming languages
The problem Given a database and a type, –does the database conform to this type…? Classification of objects –Which objects belong to each class..? Typing involves description of the structure of each class and its relationships with other classes
Difference between typing SSD and Object Databases Classes are defined less precisely. As a consequence, objects may belong to several classes Some objects may not belong to any class or may have properties that do not pertain to any class The typing may be approximate. For example, we may accept in a class an object that does not quite conform to the specification of that class.
Schema formalisms First-order logic Datalog Simulation
First-order logic Example: Consider three kinds of objects in the database –Root object(s) have: outgoing edges labeled company to company objects and person to person objects –Person objects have: outgoing edges labeled name and position to string objects; outgoing edges labeled worksfor to company objects; incoming edges labeled manager and employee from company objects –Company objects have: outgoing edges labeled name and address to string objects; outgoing edges labeled manager and employee to person objects; incoming edges labeled worksfor from person objects
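If:
–If an object has a-edges to strings and b-edges from c' objects, then it is a c-object.
–$(\exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X))) \Rightarrow c(X)$
Only-if:
–Any c-object has some a-edges to strings and some b-edges from c' objects:
–$(\exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X))) \Leftarrow c(X)$
If and only if:
–$(\exists Y, Z\,(\mathit{ref}(X,a,Y) \wedge \mathit{string}(Y) \wedge c'(Z) \wedge \mathit{ref}(Z,b,X))) \Leftrightarrow c(X)$
Consequence:
–$c(X) \wedge \mathit{ref}(Z,b,X) \Rightarrow c'(Z)$
–$c(X) \wedge \mathit{ref}(X,a,Y) \Rightarrow \mathit{string}(Y)$
–$c(X) \wedge \mathit{ref}(X,L,Y) \wedge L \neq a \wedge L \neq b \Rightarrow \mathit{false}$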
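The difference between the "if" and "only-if" readings can be checked mechanically on a small graph. The following is a minimal sketch, assuming edges are stored as (source, label, target) triples; the node names and the helpers `has_c_shape` and `check_typing` are hypothetical and only serve the illustration.

```python
def has_c_shape(x, ref, strings, c_prime):
    """True if x has an a-edge to a string and a b-edge coming from a c' object."""
    a_to_string = any(s == x and l == "a" and t in strings for s, l, t in ref)
    b_from_c_prime = any(t == x and l == "b" and s in c_prime for s, l, t in ref)
    return a_to_string and b_from_c_prime


def check_typing(nodes, ref, strings, c_prime, c):
    """Evaluate both directions of the typing formula for a candidate class c."""
    holds_if = all(x in c for x in nodes if has_c_shape(x, ref, strings, c_prime))
    holds_only_if = all(has_c_shape(x, ref, strings, c_prime) for x in c)
    return {"if": holds_if, "only_if": holds_only_if}


# Toy graph: z is a c' object pointing to x via b; x and w both have a-edges to strings.
nodes = {"x", "z", "w", "s", "t"}
ref = {("z", "b", "x"), ("x", "a", "s"), ("w", "a", "t")}
strings = {"s", "t"}
c_prime = {"z"}

exact = check_typing(nodes, ref, strings, c_prime, c={"x"})
too_large = check_typing(nodes, ref, strings, c_prime, c={"x", "w"})
too_small = check_typing(nodes, ref, strings, c_prime, c=set())
print(exact, too_large, too_small)
```

A class that is too large violates only the "only-if" direction, a class that is too small violates only the "if" direction, and the "if and only if" reading accepts exactly one extent for c.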
Problem definition with first-order logic The previous questions on typing can be restated in terms of first-order logic –Does D satisfy T, noted D |= T, that is, is there a model of T that coincides with D over the extensional predicates..? –If D |= T, what is the classification that is induced..? First-order logic leads to very general typings, probably too general for what is needed in semistructured data It could also lead to undecidability or intractability
Datalog: A rule-based language
Datalog allows us to state that if a conjunction of facts holds, then some new fact can be derived
Datalog rules allow us to define classes by specifying what incoming and outgoing edges are required
Example:
–r(X) :- ref(X, person, Y), p(Y), ref(X, company, Z), c(Z)
–p(X) :- c(Y), ref(Y, manager, X), c(Z), ref(Z, employee, X), ref(X, worksfor, U), c(U), ref(X, name, N), string(N), ref(X, position, P), string(P)
–c(X) :- p(Z), ref(Z, worksfor, X), ref(X, manager, M), p(M), ref(X, employee, E), p(E), ref(X, name, N), string(N), ref(X, address, A), string(A)
Fixpoint semantics Least fixpoint semantics –Every rule body requires at least one class fact (p or c), so starting from D with no class facts we derive nothing. Hence the least fixpoint of this program classifies no object Greatest fixpoint semantics –Types the largest set of objects: the goal is to find the greatest fixpoint for a given data graph. The desired model is the greatest fixpoint containing D.
Consider the following data graph D:
&o1 { company: &o2 { name: &o5 “o2”, address: &o6 “Versailles”, manager: &o3, employee: &o3, employee: &o4 },
person: &o3 { name: &o7 “Francois”, position: &o8 “CEO”, worksfor: &o2 },
person: &o4 { name: &o9 “Lucien”, position: &o10 “programmer”, worksfor: &o2 } }
D is represented by the facts ref(&o1, company, &o2), ref(&o2, name, &o5), etc., and string(&o5), string(&o6), etc.
Deriving the greatest fixpoint
The desired model M can be derived by starting from a model containing D and all possible typing facts.
Let J0 = D ∪ { r(&o1), r(&o2), r(&o3), r(&o4), p(&o1), p(&o2), p(&o3), p(&o4), c(&o1), c(&o2), c(&o3), c(&o4) }
Applying the rules repeatedly, starting from J0 and keeping at each step only the typing facts that can still be derived, until a fixpoint is reached yields the desired model
M = J2 = J1 = D ∪ { r(&o1), c(&o2), p(&o3), p(&o4) }
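The derivation can be reproduced with a few lines of code. The sketch below is an illustration, not the authors' implementation: it hard-codes the three rules for r, p and c, drops the "&" from oids, and iterates until the set of typing facts stops changing. One assumption is flagged explicitly: read literally, the p rule requires an incoming manager edge, which &o4 does not have in D as listed, so by default the sketch relaxes that atom in order to match the model M stated above; passing `strict_manager=True` applies the rule as written.

```python
# Edges of the data graph D as (source, label, target) triples.
REF = {
    ("o1", "company", "o2"), ("o1", "person", "o3"), ("o1", "person", "o4"),
    ("o2", "name", "o5"), ("o2", "address", "o6"), ("o2", "manager", "o3"),
    ("o2", "employee", "o3"), ("o2", "employee", "o4"),
    ("o3", "name", "o7"), ("o3", "position", "o8"), ("o3", "worksfor", "o2"),
    ("o4", "name", "o9"), ("o4", "position", "o10"), ("o4", "worksfor", "o2"),
}
STRING = {"o5", "o6", "o7", "o8", "o9", "o10"}
OBJECTS = ["o1", "o2", "o3", "o4"]


def has_out(x, label, accept):
    """True if x has an outgoing `label` edge to a node accepted by `accept`."""
    return any(s == x and l == label and accept(t) for s, l, t in REF)


def has_in(x, label, accept):
    """True if x has an incoming `label` edge from a node accepted by `accept`."""
    return any(t == x and l == label and accept(s) for s, l, t in REF)


def derive(facts, strict_manager=False):
    """Apply the rules once: return the typing facts whose bodies hold in `facts`."""
    p = lambda n: ("p", n) in facts
    c = lambda n: ("c", n) in facts
    string = lambda n: n in STRING
    new = set()
    for x in OBJECTS:
        if has_out(x, "person", p) and has_out(x, "company", c):
            new.add(("r", x))
        # Assumption: the incoming-manager atom is only enforced in strict mode.
        if (has_in(x, "employee", c) and has_out(x, "worksfor", c)
                and has_out(x, "name", string) and has_out(x, "position", string)
                and (not strict_manager or has_in(x, "manager", c))):
            new.add(("p", x))
        if (has_in(x, "worksfor", p) and has_out(x, "manager", p)
                and has_out(x, "employee", p) and has_out(x, "name", string)
                and has_out(x, "address", string)):
            new.add(("c", x))
    return new


def fixpoint(start, strict_manager=False):
    """Iterate `derive` from `start` until the typing facts no longer change."""
    facts = set(start)
    while True:
        nxt = derive(facts, strict_manager)
        if nxt == facts:
            return facts
        facts = nxt


ALL = {(cls, x) for cls in "rpc" for x in OBJECTS}  # typing facts of J0
greatest = fixpoint(ALL)    # start from everything and shrink
least = fixpoint(set())     # start from nothing and grow
print(sorted(greatest), sorted(least))
```

Starting from all typing facts gives the greatest fixpoint, while starting from none never derives anything, which is why the least fixpoint is of no use for typing here.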
Simulation The aim is to produce, for a data graph, a schema graph that lists all permitted edge labels. A schema graph is similar to a data graph with the following changes –Labels can be alternations (like address | name | url ) or the wildcard underscore (_), which matches any label –Atomic values are type names, like string, int, float, etc. –Oids of complex objects are class names, like Person, Company, etc.
[Figure: example data graph with root &r1; objects &p1, &p2, &p3, &c1, &c2; atomic objects &s0 to &s10; and objects &a1 to &a7. Edge labels: person, company, manager, mgr, emp, name, position, addr, phone, url, worksfor, description, procurement, salesrep, contact, task, performance. Sample values: “Smith”, “Mgr”, “Widget”, “Trent”, “Joe”.]
Schema graph [Figure: schema graph with nodes Root, Person, Company, String and Any. Edge labels: company, person, employee, manager, worksfor, name|address|url, name|phone|position, description, and a wildcard edge.]
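Simulation is defined as follows:
Given graphs $G_1 = (V_1, E_1)$, $G_2 = (V_2, E_2)$, a relation $R \subseteq V_1 \times V_2$ is a simulation if it satisfies
$$\forall l \in L\ \forall x_1, y_1 \in V_1\ \forall x_2 \in V_2\ \bigl( x_1 [l] y_1 \wedge x_1 R x_2 \Rightarrow \exists y_2 \in V_2\ ( y_1 R y_2 \wedge x_2 [l] y_2 ) \bigr)$$
The rule says that every edge in $G_1$ must have a “corresponding” edge in $G_2$ under the simulation
[Figure: an edge $x_1 [l] y_1$ in $G_1$ matched by an edge $x_2 [l] y_2$ in $G_2$, with $x_1 R x_2$ and $y_1 R y_2$.]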
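The simulation condition translates almost literally into a check over edge lists. The following is a minimal sketch, assuming each graph is given as a set of (source, label, target) triples and R as a set of pairs; the tiny graphs and the function name `is_simulation` are made up for illustration, and labels are compared for exact equality (no alternations or wildcards).

```python
def is_simulation(edges1, edges2, rel):
    """Check the simulation condition for relation `rel` between G1 and G2.

    For every edge x1 -l-> y1 in G1 and every x2 with (x1, x2) in rel,
    G2 must contain an edge x2 -l-> y2 such that (y1, y2) is in rel.
    """
    for x1, label, y1 in edges1:
        for a, x2 in rel:
            if a != x1:
                continue
            matched = any(
                s == x2 and l == label and (y1, y2) in rel
                for s, l, y2 in edges2
            )
            if not matched:
                return False
    return True


# Hypothetical data graph G1 and schema-like graph G2.
g1 = {("o", "book", "b1"), ("b1", "title", "t1")}
g2 = {("Root", "book", "Book"), ("Book", "title", "String"),
      ("Book", "author", "String")}

good = {("o", "Root"), ("b1", "Book"), ("t1", "String")}
bad = {("o", "Root"), ("b1", "String")}

print(is_simulation(g1, g2, good), is_simulation(g1, g2, bad))
```

The second relation fails because the book edge out of o has no counterpart out of Root that lands on a node related to b1. Note that G2 may have extra edges (author here) without affecting the result: simulation constrains only what G1 contains.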
To define a simulation between a semistructured data instance and a schema graph, we add the following additional requirements: –The roots must be in the simulation: r R r’ –Whenever x R y, if y is an atomic type (like string, int), then x must be an atomic node too and have a value of that type. We say the simulation is typed
Data node                  Schema node
&r1                        Root
&c1, &c2                   Company
&p1, &p2, &p3              Person
&s0, &s1, &s2, &s3, …      string
&a1, &a2, &a3, &a4, …      Any
The relation R defined by the example data graph and the given schema graph is a simulation
Back to the typing problem…. When does a data graph D conform to a schema graph S..? –When there exists a rooted, typed simulation between the data and the schema Which objects belong to each class..? –The principle is that oid ‘o’ should belong to class ‘c’ if o R c. In this way, a rooted simulation R will always classify all objects. –However, the classification need not be unique, which motivates looking for the maximal simulation
[Figure: a data graph D with root &o and book objects &b1 and &b2, and a schema graph S. Edge labels: book, title, author, publisher, year; atomic type string.]
Maximal simulation
$G_1 \leq_R G_2$: R is a simulation from $G_1$ to $G_2$
Fact:
–If $G_1 \leq_{R_1} G_2$ and $G_1 \leq_{R_2} G_2$, then $G_1 \leq_{R_1 \cup R_2} G_2$
–For any data graph D conforming to some schema graph S, there is always a maximal simulation from D to S.
Back to the problem: Which objects belong to each class…?
–An object ‘o’ belongs to some class ‘c’ if oRc, where R is the maximal simulation between the OEM data and schema graph
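Because simulations are closed under union, the maximal one can be found by starting from all candidate pairs and deleting pairs that violate the simulation condition until nothing changes. The sketch below follows that idea in a naive way; it is not an optimized algorithm. The book data and the two schema classes `BookA` and `BookB` are hypothetical, labels are matched by exact equality, and the typed requirement is approximated by never pairing a complex object with an atomic type.

```python
def maximal_simulation(nodes1, edges1, nodes2, edges2, atomic_nodes, atomic_types):
    """Naive refinement: start from all allowed pairs, remove violating pairs."""
    rel = {
        (x1, x2)
        for x1 in nodes1
        for x2 in nodes2
        # Typed requirement: only atomic data nodes may map to an atomic type.
        if not (x2 in atomic_types and x1 not in atomic_nodes)
    }
    changed = True
    while changed:
        changed = False
        for x1, x2 in list(rel):
            for s1, label, y1 in edges1:
                if s1 != x1:
                    continue
                ok = any(
                    s2 == x2 and l2 == label and (y1, y2) in rel
                    for s2, l2, y2 in edges2
                )
                if not ok:
                    rel.discard((x1, x2))
                    changed = True
                    break
    return rel


def classes_of(rel, obj):
    """Schema nodes that `obj` is related to under `rel`."""
    return {c for d, c in rel if d == obj}


# Hypothetical data graph: b1 has title and author, b2 additionally has a year.
data_nodes = {"o", "b1", "b2", "t1", "a1", "t2", "a2", "y2"}
data_edges = {
    ("o", "book", "b1"), ("o", "book", "b2"),
    ("b1", "title", "t1"), ("b1", "author", "a1"),
    ("b2", "title", "t2"), ("b2", "author", "a2"), ("b2", "year", "y2"),
}
# Hypothetical schema graph with two overlapping book classes.
schema_nodes = {"Root", "BookA", "BookB", "String"}
schema_edges = {
    ("Root", "book", "BookA"), ("Root", "book", "BookB"),
    ("BookA", "title", "String"), ("BookA", "author", "String"),
    ("BookA", "publisher", "String"),
    ("BookB", "title", "String"), ("BookB", "author", "String"),
    ("BookB", "year", "String"),
}

R = maximal_simulation(
    data_nodes, data_edges, schema_nodes, schema_edges,
    atomic_nodes={"t1", "a1", "t2", "a2", "y2"}, atomic_types={"String"},
)
print(classes_of(R, "o"), classes_of(R, "b1"), classes_of(R, "b2"))
```

In this example b1 ends up in both book classes, which illustrates why a classification need not be unique, while b2 fits only the class that permits a year edge.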