Web Data and the Resurrection of Database Theory Dan Suciu University of Washington.

“In theory there is no difference between theory and practice. In practice there is.” Jan L.A. van de Snepscheut September 12, 1953 - February 23, 1994

Short History of Database Theory
The legendary beginnings, 1970-1971:
– Relational databases are the brainchild of a theoretician (Codd)
– Heavily debated at the time (against CODASYL)
– It took several years for the concept to be validated in practice
Theory driving the industry

Short History of Database Theory
The golden years (end of 70s, early 80s):
– Relational theory: functional dependencies, query containment
– Transactions
– Access methods
Theory listening to the industry

Short History of Database Theory
Refined decadence (end of 80s, early 90s):
– Descriptive complexity
– Logic databases
– Complex objects
– Constraint databases
Divorce?

“Database Metatheory: Asking the Big Queries”, Christos Papadimitriou, in PODS, 1995
Theory is inevitable: CS is a science of the artificial, and its artifact is being changed by the very act of studying it
Kuhn’s paradigm principle, for natural sciences: Immature science → Normal science → Crisis → Revolution

Is DB Theory in a Crisis Today?
Industry’s focus:
– one particular data model: relational/SQL
– one particular application (client-server)
Theory’s focus is on Logic:
– New data models, query languages (query containment, complex objects, recursion)
– New applications (incomplete information, query rewriting using views)

One Example of Unused Theory
Containment of conjunctive queries is NP-complete [Chandra and Merlin’77]
Dozens of extensions:
– With union and difference [Sagiv and Yannakakis’81]
– With order predicates [Klug’88, van der Meyden’92]
– With complex objects [Levy and Suciu’97]
– With regular expressions [Florescu, Levy and Suciu’98]

Query Containment
The query:
Q1 = SELECT DISTINCT x.name, x.phone FROM Person x, Person y, Person z WHERE x.department = y.department AND x.manager = z.manager
Is minimized to:
Q2 = SELECT DISTINCT x.name, x.phone FROM Person x
The following can be checked: Q1 ⊆ Q2 and Q1 ⊇ Q2, hence Q1 = Q2
Minimization is not used by RDBMSs today
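The containment check on this slide can be sketched with the classical homomorphism test for conjunctive queries: Q1 ⊆ Q2 holds when the atoms of Q2 can be mapped into the atoms of Q1 with the head preserved. The sketch below is a brute-force illustration, not how any RDBMS does it. It assumes that all terms are variables, that Person has the column order (name, phone, department, manager), and that there are no NULLs; the function name `contained_in` and the query encoding are made up for this example.

```python
from itertools import product

def contained_in(q1, q2):
    """Return True if q1 is contained in q2 (both conjunctive queries).

    A query is (head_vars, body), where body is a list of
    (relation_name, argument_variables) atoms. The test looks for a
    homomorphism from q2 into q1 by trying every assignment of q2's
    atoms to q1's atoms, which is exponential in the size of q2.
    """
    head1, body1 = q1
    head2, body2 = q2
    if len(head1) != len(head2):
        return False
    for image in product(body1, repeat=len(body2)):
        # The head of q2 must map onto the head of q1.
        pairs = list(zip(head2, head1))
        consistent = True
        for (rel2, args2), (rel1, args1) in zip(body2, image):
            if rel2 != rel1 or len(args2) != len(args1):
                consistent = False
                break
            pairs.extend(zip(args2, args1))
        if consistent:
            mapping = {}
            for var2, var1 in pairs:
                # Each variable of q2 must map to a single variable of q1.
                if mapping.setdefault(var2, var1) != var1:
                    consistent = False
                    break
        if consistent:
            return True
    return False

# Q1: the three-way self-join; shared variables encode the equality predicates.
Q1 = (("n", "p"), [("Person", ("n", "p", "d", "m")),
                   ("Person", ("n2", "p2", "d", "m2")),
                   ("Person", ("n3", "p3", "d3", "m"))])
# Q2: the minimized query over a single Person atom.
Q2 = (("n", "p"), [("Person", ("n", "p", "d", "m"))])

print(contained_in(Q1, Q2), contained_in(Q2, Q1))
```

Both calls succeed, which corresponds to Q1 = Q2 on the slide. The exhaustive search over atom assignments is in line with the NP-completeness result cited earlier.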

Why Today Things Are Changing
Just one reason: The Web
More precisely:
A new data model
– Semistructured data
– XML syntax
New applications
– Transformation
– Integration

Web Data Management
Who creates the new rules
– W3C working groups
– Sometimes the industry
The new artifacts are not concepts, but standards
The double role of theory
– Long term: conceptualize/rationalize. E.g. keys for XML [Buneman, Davidson, Fan, Hara, Tan’01]
– Short term: answer technical questions

Some Questions for Database Theory XML publishing Typechecking XML transformations XML storage Data distribution

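[Figure: overview of the problem areas. Sources: relational data, object-relational data, legacy data. These are transformed and integrated into XML data, which applications and a warehouse reach over the Web (HTTP). Problem areas marked: XML Publishing, XML Storage, XML Typechecking, XML Distribution.]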

XML Publishing Today
Legacy data
– fragmented into many flat relations
– 3rd normal form
– proprietary
XML data
– nested
– un-normalized
– public (450 schemas at www.biztalk.org)

XML Publishing: an Example
Legacy data in E/R:
[Figure: E/R diagram with entities Eu-Stores, US-Stores and Products, and relationships Eu-Sales and US-Sales. Attribute labels: name, country, name, url, date, tax, name, priceUSD, euSid, usSid, pid.]

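XML Publishing: an Example
XML view:
<allsales>
  <country> <name> France </name>
    <store> <name> Nicolas </name>
      <product> <name> Blanc de Blanc </name>
        <sold> <date> 10/10/2000 </date> </sold>
        <sold> <date> 12/10/2000 </date> </sold>
        …
      </product>
      …
    </store>
    …
  </country>
  …
</allsales>
In summary: group by country, store, product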

Output “schema”:
allsales → country*
country → name, store*
store → name, url?, product*
product → name, sold*
sold → date, tax?
name, url, date, tax → PCDATA

XML Publishing
In SilkRoute [Fernandez, Suciu, Tan ’00]:
{ FROM EuStores $S, EuSales $L, Products $P WHERE $S.euSid = $L.euSid AND $L.pid = $P.pid CONSTRUCT $S.country $S.name $P.name $P.priceUSD }
/* union */
…
/* union */
{ FROM USStores $S, USSales $L, Products $P WHERE $S.usSid = $L.usSid AND $L.pid = $P.pid CONSTRUCT USA $S.name $S.url $P.name $P.priceUSD $L.tax }

Internal Representation
View Tree:
allsales()
  country(c) *
    name(c)
    store(c,x) *
      name(n)
      url(c,x,u) ?
      product(c,x,y) *
        name(n)
        sold(c,x,y,d) *
          date(c,x,y,d)
          Tax(c,x,y,d,t) ?
Non-recursive datalog (SELECT DISTINCT … ):
allsales() :-
country(c) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
country(“USA”) :-
store(c,x) :- EuStores(x,_,c), EuSales(x,y,_), Products(y,_,_)
store(c,x) :- USStores(x,_,_), USSales(x,y,_), Products(y,_,_), c=“USA”
url(c,x,u) :- USStores(x,_,u), USSales(x,y,_), Products(y,_,_)
Large query (x100 lines), large XML answer (x100 MB)
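The grouping that the view tree describes can be illustrated with a small sketch that nests flat join results into the allsales/country/store/product/sold shape. This is not SilkRoute's algorithm; the function `publish`, the row layout (country, store, product, date) and the sample rows are assumptions made for illustration, with element names taken from the output schema.

```python
import xml.etree.ElementTree as ET

def publish(rows):
    """Nest flat (country, store, product, date) rows into an XML tree."""
    root = ET.Element("allsales")
    nodes = {}  # group key -> element already created for that group

    def group(parent, tag, key, label):
        # Create the element for this group once, then reuse it.
        if key not in nodes:
            nodes[key] = ET.SubElement(parent, tag)
            ET.SubElement(nodes[key], "name").text = label
        return nodes[key]

    for country, store, product, date in sorted(rows):
        c = group(root, "country", (country,), country)
        s = group(c, "store", (country, store), store)
        p = group(s, "product", (country, store, product), product)
        sold = ET.SubElement(p, "sold")
        ET.SubElement(sold, "date").text = date
    return root

# Hypothetical rows, as a join of stores, sales and products might return them.
ROWS = [
    ("France", "Nicolas", "Blanc de Blanc", "10/10/2000"),
    ("France", "Nicolas", "Blanc de Blanc", "12/10/2000"),
    ("USA", "StoreA", "ProductB", "1/1/2000"),
]

print(ET.tostring(publish(ROWS), encoding="unicode"))
```

Each level of the tree corresponds to one group key, which mirrors how the arguments of the datalog predicates grow from country(c) to sold(c,x,y,d).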

Users Ask Specific XML Queries
Find names, urls of all stores that sold on 1/1/2000 (in XML-QL / XQuery melange):
WHERE 1/1/2000 $X $Y RETURN $X, $Y
Small query, small answer
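To make the user query concrete, here is a sketch that evaluates it directly over a materialized XML view using Python's standard library. The document, the store names, the URLs and the helper `stores_selling_on` are all invented for illustration; only the element names follow the output schema of the view.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up instance of the XML view.
DOC = """<allsales>
  <country><name>USA</name>
    <store><name>StoreA</name><url>http://storea.example</url>
      <product><name>ProductB</name>
        <sold><date>1/1/2000</date><tax>0.08</tax></sold>
      </product>
    </store>
    <store><name>StoreC</name><url>http://storec.example</url>
      <product><name>ProductB</name>
        <sold><date>2/1/2000</date></sold>
      </product>
    </store>
  </country>
</allsales>"""

def stores_selling_on(root, day):
    """Return (name, url) for every store with a sale dated `day`."""
    result = []
    for store in root.iter("store"):
        dates = [d.text for d in store.findall("product/sold/date")]
        if day in dates:
            result.append((store.findtext("name"), store.findtext("url")))
    return result

print(stores_selling_on(ET.fromstring(DOC), "1/1/2000"))
```

Evaluating the query this way requires the whole view to be materialized first, which is what composing the query with the view definition is meant to avoid.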

Query Composition
[Figure: the XML-QL query pattern (nodes $n1 … $n5 and $Z over allsales, country, store, product, sold, date = 1/1/2000, name = $X, url = $Y) matched against the view tree (allsales(), country(c), name(c), store(c,x), name(n), url(c,x,u), product(c,x,y), name(n), sold(c,x,y,d), date(c,x,y,d), Tax(c,x,y,d,t)).]
“Evaluate” the XML pattern(s) on the view tree, combine all datalog rules

Query Composition
Result (in theory…):
( SELECT S.name, S.url FROM USStores S, USSales L, Products P WHERE S.usSid=L.usSid AND L.pid=P.pid AND L.date='1/1/2000' )
UNION
( SELECT S2.name, S2.url FROM EuStores S1, EuSales L1, Products P1, USStores S2, USSales L2, Products P2 WHERE S1.euSid=L1.euSid AND L1.pid=P1.pid AND L1.date='1/1/2000' AND S2.usSid=L2.usSid AND L2.pid=P1.pid AND S1.country='USA' AND S1.euSid = S2.usSid )

Complexity of XML Publishing
But in practice: 5-7 times more joins!
– Need query minimization
Could this be avoided?
– We thought hard and couldn’t find a better way
– Asked students to re-implement: same problem
– It is NP-hard!

XML Publishing Is NP-Hard
View Tree: customer → order?, complaint? (PCDATA leaves)
order() :- Q1
complaint() :- Q2
XML query: WHERE $x $y RETURN ( )
The composed SQL query is: Q1 JOIN Q2
Minimizing it is NP-hard! (can be shown…)

Recent Advancements in Query Containment
Definition: FO^k = First Order Logic with k variables
Fact: If Q2 ∈ FO^k and k “is small”, then Q1 ⊆ Q2 can be checked efficiently [Kolaitis, Vardi’98], [Vardi’00], [Chekuri, Rajaraman’97]

XML Publishing: Finale
Prediction: techniques based on FO^k and/or query width will be deployed in XML publishing in the future (perhaps under different names)

XML Typechecking
Purpose: ensure that the generated XML conforms to the desired DTD (or XML Schema)
Two kinds:
Dynamic typechecking
– Easy: lots of XML validating parsers available
Static typechecking
– Hard: need complex analysis of the XML generation program

XML Typechecking
XML generation programs:
– Publishing: RDBMS → XML (e.g. SilkRoute)
– Transformation: XML → XML (e.g. XSL, XQuery)
– Integration: XML + XML → XML
This talk: XML → XML

The XML Typechecking Problem
Given an XML → XML transformation f:
Type Checking Problem: Given DTDs τ1, τ2, check ∀D ∈ τ1, f(D) ∈ τ2
Sometimes τ1 = any: check ∀D, f(D) ∈ τ2

Today’s Systems Try to Do Type Inference
Type Inference Problem: Given DTD τ1, find the DTD f(τ1) = {f(D) | D ∈ τ1}
Today’s systems:
– “Compute” f(τ1)
– Check f(τ1) ⊆ τ2 (which is possible)
Sometimes τ1 = any: compute f(any), check f(any) ⊆ τ2

Theory’s Role: Send a Warning
This approach fails in general!
But it may work OK in most “practical” cases...

Why XML Type Inference Fails
XQuery f = RETURN (FROM Employee $x RETURN ), (FROM Employee $x RETURN )
“Inferred” (wrong) DTD f(any):
“Real” output “DTD”
Fails to typecheck f(any) ⊆ τ2 when τ2 =
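The gap between type inference and typechecking can be illustrated with a toy model. The markup of the slide's query did not survive, so everything concrete below is an assumption: the transformation is taken to emit one `a` element per employee followed by one `b` element per employee, the inferred content model is taken to be a*b*, and the target type (even-length sequences) is a hypothetical choice rather than the one on the slide. Element sequences are modelled as strings and content models as regular expressions.

```python
import re
from itertools import product

def f(employees):
    """Toy transformation: one 'a' per employee, then one 'b' per employee."""
    return "a" * len(employees) + "b" * len(employees)

# Content model that inference would derive by treating the two loops
# independently (assumed): any number of a's, then any number of b's.
INFERRED = re.compile(r"a*b*")

# Hypothetical target type: sequences of even length.
TARGET = re.compile(r"((a|b)(a|b))*")

def inference_counterexamples(max_len):
    """Words admitted by the inferred type but rejected by the target type."""
    words = ("".join(w) for n in range(max_len + 1)
             for w in product("ab", repeat=n))
    return [w for w in words
            if INFERRED.fullmatch(w) and not TARGET.fullmatch(w)]

# Every real output a^n b^n conforms to the target type ...
print(all(TARGET.fullmatch(f([None] * n)) for n in range(10)))
# ... but the inferred type is not contained in it, so the check is rejected.
print(inference_counterexamples(3))
```

Under these assumptions the transformation does typecheck against the target, yet the inclusion test on the inferred type fails because of words such as `aab` that the transformation never produces. The real output language a^n b^n is not regular, so no DTD content model can describe it exactly.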

The Typechecking Problem in Theory and Practice
In practice, we care about typechecking
Question for theory: is this possible?
Positive result [Milo, Suciu, Vianu, 2000]:
– Decidable for k-pebble tree transducers
– Hence decidable for join-free XQuery and simple XSLT programs
Negative result [Alon, Milo, Neven, Suciu, Vianu 2001]:
– Undecidable for transformations with value joins

XML Typechecking: Finale
Prediction: systems will continue to use type inference, but will never be as robust as type checking in programming languages
Need to understand well their applicability

XML Storage Problem
Given: a (large) XML data instance
Goal: store/process it in a RDBMS
Problem: find the relational schema!
Current approaches:
– Generic schema [Florescu, Kossmann 99]
– Derive schema from DTD [Shanmugasundaram et al 99]
– Derive schema from XML data [Deutsch, Fernandez, Suciu 99]

The Theory of XML Storage
The simplest case: flat, unique subelements
M =
Oid   E1  E2  E3  E4  …  E5000
&1    1   0   0   1   …  0
&2    0   1   1   0   …  0
&3    0   1   0   1   …  0
&4    0   1   1   1   …  0
&5    1   0   1   0   …  0
&6    1   1   0   0   …  0
…     …   …
&o1000000001000
How do we cover all 1’s most economically?
– R1(E2, E3, E4), R2(E1, E5, E9, E12), …

The Theory of XML Storage
XML storage and matrix rank
M =
Oid        E1  E2  E3  E4  …  E5000
&1         1   0   0   1   …  0
&2         0   1   1   1   …  0
&3         0   1   1   1   …  0
&4         0   1   1   1   …  0
&5         1   1   0   0   …  0
&6         1   1   0   0   …  0
&7         0   0   0   1   …  …
…          …   …
&10000000  1   0   0   1   …  0
Can store XML data in k relations ⇒ rank(M) = k
Conversely: if rank(M) = k ⇒ what about storage?
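As a small numeric illustration of the rank view, the sketch below computes the rank of the first six rows of the matrix above, truncated to columns E1..E4 (the truncation is an assumption made only to keep the example small). It also counts the distinct row patterns, since one relation per distinct set of subelements is one straightforward way to store such data. The helper `rank` is a plain Gaussian elimination over exact fractions and is not taken from the talk.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix over the rationals, by Gauss-Jordan elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for col in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

# Rows &1..&6 of the matrix, columns E1..E4 only.
M = [
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]

print(rank(M), len({tuple(row) for row in M}))
```

For this fragment the rank and the number of distinct row patterns coincide at 3. In general the rank can be smaller than the number of distinct patterns, which is where the converse question on the slide becomes interesting.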

XML Storage: Finale Prediction: we will see several clever XML storage techniques discovered in the near future

The Data Distribution
Many data consumers, many places to cache
Data can be replicated, transformed
– How to transform it? The view selection problem
– Where to place it? The data distribution problem. NP-complete
Prediction: no predictions here (too early…)

Conclusions: Resurrection of Database Theory
Is theory irrelevant?
– [Papadimitriou, 95]: wrong question to ask
– Respect for practice: only a recent development in human culture
– Applicability pressure in CS: annoying trend of last 10 years or so
Database theory: are we in a revolution?
– The past: researchers created artifacts for the industry
– Today: society (Web, W3C) is creating artifacts for researchers to study, improve
Prediction: there will be no difference between theory and practice… at least, in theory!
