Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, Divesh Srivastava (AT&T)


1 Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, Divesh Srivastava (AT&T)

2 Motivation
• Complex database schemas in large enterprise systems
  – 1000+ tables, 10000+ columns, 100000+ PK/FK edges
• Prior work to help users understand complex schemas
  – Customized views (forms) to hide database schema
  – Present informative tables to simplify schema understanding
• Goal: schema graph summary connecting the user's query tables
  – Needs to be succinct
  – Needs to preserve informative join paths

3 Complex Schema Graph Example
• Complex database schema in a large real enterprise system
  – Too complex for illustrative purposes

4 TPC-E Benchmark Schema Graph

5 Useless TPC-E Schema Summary Graph
[Figure: summary graph over security, trade, customer, status_type; graph weight = 4.5572034455]
• Not very informative: all query tables have a status_type field
  – Succinct graph does not mean informative graph!

6 Informative TPC-E Schema Summary Graph
[Figure: summary graph over customer, customer_account, holding_summary, security, trade; graph weight = 1.6917276155]
• Very informative: securities held by, trades made by customer
  – Larger graph, smaller graph weight, union of shortest paths

7 Useless TPC-E Schema Summary Graph
• Union of pairwise shortest paths is not the answer
  – Small graph weight, but verbosity hinders understandability

8 Succinct TPC-E Schema Summary Graph
[Figure: summary graph over commission_rate, customer_taxrate, broker, industry, customer_account, exchange; meta-edge weights 0.7298749340, 0.2947410428, 1.0944249463, 1.9574738210, 1.4236398511, 1.2674994678, 0.7470561327; graph weight = 7.5147101957]
• Informative & succinct: customer_account, exchange are hubs
  – Slightly larger graph weight, but informative and succinct
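The graph weight on slide 8 appears to be the plain sum of the seven meta-edge weights shown in the figure (0.7298749340, 0.2947410428, 1.0944249463, 1.9574738210, 1.4236398511, 1.2674994678, 0.7470561327). A small check, assuming that reading of the run-together numbers is right; the variable names are illustrative only:

```python
from decimal import Decimal

# Meta-edge weights as read from the slide 8 figure (ten decimal places each).
meta_edge_weights = [
    "0.7298749340", "0.2947410428", "1.0944249463", "1.9574738210",
    "1.4236398511", "1.2674994678", "0.7470561327",
]

# Decimal avoids binary floating-point rounding in the last digits.
graph_weight = sum(Decimal(w) for w in meta_edge_weights)
print(graph_weight)  # 7.5147101957
```

The total matches the "Graph weight" printed on the slide, which is consistent with the objective stated later: minimize the sum of meta-edge weights.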

9 Outline
• Motivation
• Problem statement
• Our solution
  – Defining schema edge weights
  – Computing summary graphs
• Experimental results

10 Desiderata
• Schema graph summary must be informative and succinct
• Need a formal definition of "informative"
  – Use Information Theory
• Need a formal definition of "succinct"
  – Use Graph Summarization

11 Problem Statement 1: Informative Edges
• Given schema graph $G = (R, E)$ and database instance $D$
• Problem 1: define schema edge weights, $wt: E \to \mathbb{R}^{+}$
  – More informative join edges have smaller weights (≥ 0)
  – Extend $wt(R_1, R_2)$ = weight of shortest path between $R_1$ and $R_2$

12 Problem Statement 2: Succinct Graph
• Given schema $G = (R, E)$, weight $wt$, user-specified tables $Q$
• Problem 2: compute summary graph $(R_s, E_s)$
  – $Q \subseteq R_s \subseteq R$, $|R_s| \le |Q| + B$, for a given small budget $B$
  – Meta-edges $E_s \subseteq \{(R_1, R_2) \mid \text{a path exists between } R_1 \text{ and } R_2 \text{ in } G\}$
  – $(R_s, E_s)$ must preserve shortest paths between $Q$ tables in $G$
  – Optimize: $(R_s, E_s)$ has the minimum sum of meta-edge weights

13 Outline
• Motivation
• Problem statement
• Our solution
  – Defining schema edge weights
  – Computing summary graphs
• Experimental results

14 Informative Edges: Column Graph
• Build an edge-weighted column graph $G_C = (N_C, E_C)$ where
  – $N_C$ consists of all primary and foreign key columns in all tables
  – Intra-table edges in $E_C$ = $\{(R.P, R.F) \mid R.P \text{ is a PK column of } R\}$
  – Inter-table edges in $E_C$ = $\{(R.P, R_1.F) \mid R_1.F \text{ is a foreign key to } R.P\}$
  – Edge weights based on mutual information between columns
[Figure: example column graph with columns A, B, C, D, E, F in tables R, S, T; edge weights 0.28, 0.5, 0.1, 0.6, 0.05, 0.21]

15 Informative Edges: Table Graph
• Induce an edge-weighted table graph $G_T = (N_T, E_T)$ where
  – $N_T$ consists of all tables
  – $E_T = \{(R, R_1) \mid R_1.F \text{ is a foreign key to } R.P\}$
  – Edge weight = min sum of weights on path between PK columns
[Figure: the column graph from slide 14 (edge weights 0.28, 0.5, 0.1, 0.6, 0.05, 0.21) and the induced table graph on R, S, T with edge weights 0.38, 0.26, 1.1]

16 Edge Weight: Using Mutual Information
• Mutual information $I(X;Y) = \sum_x \sum_y p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$
  – Mutual information captures strength of linkage between X, Y
• $D(X,Y) = 1 - I(X;Y)/H(X,Y)$ is a distance function, where $H(X,Y)$ is the joint entropy
  – D(X,Y) = 0 iff X, Y are identical; D(X,Y) = 1 iff X, Y are independent
• Example:
  X        1    2    3    4
  Y        2    2    1    3
  i(x;y)   1.0  1.0  2.0  2.0
  I(X;Y) = 1.5, H(X,Y) = 2.0, D(X,Y) = 1 – 1.5/2.0 = 0.25
[Figure: diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X,Y)]
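The worked example on slide 16 can be reproduced in a few lines. The sketch below estimates probabilities from value counts in two equal-length columns and computes the distance as 1 − I(X;Y)/H(X,Y), which is the form that yields the slide's numbers (I = 1.5, H = 2.0, D = 0.25). The function name `mi_distance` and the convention of returning distance 0 when the joint entropy is 0 are assumptions for illustration, not taken from the talk:

```python
from collections import Counter
from math import log2

def mi_distance(xs, ys):
    """Return (I(X;Y), H(X,Y), D(X,Y)) estimated from paired column values."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    px, py = Counter(xs), Counter(ys)

    # I(X;Y) = sum over observed pairs of p(x,y) * log2(p(x,y) / (p(x) p(y)))
    mi = sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )
    # Joint entropy H(X,Y) = -sum p(x,y) log2 p(x,y)
    h = -sum((c / n) * log2(c / n) for c in joint.values())

    # Assumed convention: two constant columns (H = 0) are at distance 0.
    d = 0.0 if h == 0 else 1 - mi / h
    return mi, h, d

mi, h, d = mi_distance([1, 2, 3, 4], [2, 2, 1, 3])
print(mi, h, d)  # 1.5 2.0 0.25
```

Identical columns give D = 0 (I equals H), while columns whose value pairs occur independently give I = 0 and therefore D = 1.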

17 Outline
• Motivation
• Problem statement
• Our solution
  – Defining schema edge weights
  – Computing summary graphs
• Experimental results

18 Summary Graph
• Given schema graph $G = (R, E)$, edge weight $wt: E \to \mathbb{R}^{+}$, and user-specified tables $Q$, compute summary graph $(R_s, E_s)$
  – $Q \subseteq R_s \subseteq R$, $|R_s| \le |Q| + B$, for a given small budget $B$
  – Meta-edges $E_s \subseteq \{(R_1, R_2) \mid \text{a path exists between } R_1 \text{ and } R_2 \text{ in } G\}$
  – $(R_s, E_s)$ must preserve shortest paths between $Q$ tables in $G$
  – Optimize: $(R_s, E_s)$ has the minimum sum of meta-edge weights
[Figure: three example graphs on tables R, S, T with additional tables A and/or B; total weights 1.2, 1.1, and 0.7]
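The constraints on slide 18 can be checked mechanically for a candidate summary graph. The sketch below makes several assumptions that the slides do not spell out: the schema graph is treated as undirected, each meta-edge is weighted by the shortest-path weight between its endpoints in G, and "preserves shortest paths" is read as "shortest-path distances between query tables are the same in the summary graph as in G". The toy graph and all names (`all_pairs`, `summary_weight`) are hypothetical:

```python
from itertools import combinations

INF = float("inf")

def all_pairs(nodes, edges):
    """Floyd-Warshall shortest-path weights on an undirected weighted graph."""
    dist = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def summary_weight(nodes, edges, query, budget, summary_nodes, meta_edges, tol=1e-9):
    """Return the total meta-edge weight, or None if a constraint is violated."""
    if not set(query) <= set(summary_nodes) <= set(nodes):
        return None                      # need Q ⊆ R_s ⊆ R
    if len(set(summary_nodes)) > len(set(query)) + budget:
        return None                      # need |R_s| ≤ |Q| + B
    in_g = all_pairs(nodes, edges)
    weighted = {}
    for u, v in meta_edges:
        if u not in summary_nodes or v not in summary_nodes or in_g[u][v] == INF:
            return None                  # meta-edge needs a path in G
        weighted[(u, v)] = in_g[u][v]
    in_s = all_pairs(list(summary_nodes), weighted)
    for a, b in combinations(query, 2):
        if in_g[a][b] != INF and abs(in_s[a][b] - in_g[a][b]) > tol:
            return None                  # a query-table distance changed
    return sum(weighted.values())

# Toy schema: A is a hub joining query tables R, S, T.
nodes = ["R", "S", "T", "A"]
edges = {("R", "A"): 0.25, ("A", "S"): 0.25, ("A", "T"): 0.5}
query = ["R", "S", "T"]

with_hub = summary_weight(nodes, edges, query, 1, {"R", "S", "T", "A"},
                          [("R", "A"), ("A", "S"), ("A", "T")])
no_hub = summary_weight(nodes, edges, query, 0, {"R", "S", "T"},
                        [("R", "S"), ("S", "T"), ("R", "T")])
broken = summary_weight(nodes, edges, query, 0, {"R", "S", "T"},
                        [("R", "S"), ("S", "T")])
print(with_hub, no_hub, broken)  # 1.0 2.0 None
```

In this toy case, spending one budget table on the hub A halves the total weight compared with connecting the query tables directly, which mirrors the point of the slide's figure.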

19 Properties of Summary Graphs
• Theorem: Computing the optimal summary graph is NP-hard
  – Proof uses reduction from Clique in (n – 4)-regular graphs
• Proposition (towards an elegant solution formulation):
  – It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes
  – Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes

20 Efficient Computation of Summary Graphs
• It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes
• Elegant solution: formulate an integer program; use CPLEX
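The pruning step, restricting attention to the subgraph formed by shortest paths between Q nodes, can be sketched as follows. This is not the integer program from the talk, only the preprocessing idea: an edge (u, v) is kept if it lies on some shortest path between a pair of query tables, that is, if dist(a, u) + w + dist(v, b) equals dist(a, b) in one of the two orientations. The graph is assumed undirected, and the function names and toy data are hypothetical:

```python
import heapq
from itertools import combinations

INF = float("inf")

def dijkstra(adj, source):
    """Shortest-path weights from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def shortest_path_subgraph(edges, query, tol=1e-9):
    """Keep only edges lying on a shortest path between two query nodes."""
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {q: dijkstra(adj, q) for q in query}
    kept = {}
    for (u, v), w in edges.items():
        for a, b in combinations(query, 2):
            if b not in dist[a]:
                continue  # a and b are disconnected
            via_uv = dist[a].get(u, INF) + w + dist[b].get(v, INF)
            via_vu = dist[a].get(v, INF) + w + dist[b].get(u, INF)
            if min(via_uv, via_vu) <= dist[a][b] + tol:
                kept[(u, v)] = w
                break
    return kept

edges = {("R", "A"): 0.25, ("A", "S"): 0.25,   # shortest R-S route
         ("R", "X"): 1.0, ("X", "S"): 1.0,     # longer detour
         ("S", "Y"): 0.5}                      # dead end
kept = shortest_path_subgraph(edges, ["R", "S"])
print(kept)  # {('R', 'A'): 0.25, ('A', 'S'): 0.25}
```

Only the pruned graph would then need to be handed to the optimizer, which matters because the exact problem is NP-hard.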

21 Outline
• Motivation
• Problem statement
• Our solution
  – Defining schema edge weights
  – Computing summary graphs
• Experimental results

22 Experimental Setup
• Data: use 2 instances of TPC-E benchmark database schema
  – Simulates an OLTP workload of a brokerage firm
  – Well-specified schema, including PK/FK constraints
• Quality: use measures based on the TPC-E transaction logs
  – Table coverage: relative frequency of summary graph tables in log
  – Join coverage: relative frequency of summary graph joins in log
  – Summary graph density: reflects complexity of summary graph
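The slide gives the coverage measures only in words, so the following is one plausible reading rather than the definition used in the experiments: coverage is the fraction of table (or join) occurrences in the log that also appear in the summary graph. The log format, the function names, and the sample data are all hypothetical; density is omitted because the slide does not say how it is computed:

```python
def table_coverage(log_tables, summary_tables):
    """Fraction of table references in the log that hit a summary graph table."""
    if not log_tables:
        return 0.0
    hits = sum(1 for t in log_tables if t in summary_tables)
    return hits / len(log_tables)

def join_coverage(log_joins, summary_joins):
    """Fraction of logged joins that appear as summary graph edges.

    Joins are compared as unordered pairs, so (a, b) matches (b, a).
    """
    if not log_joins:
        return 0.0
    normalized = {frozenset(j) for j in summary_joins}
    hits = sum(1 for j in log_joins if frozenset(j) in normalized)
    return hits / len(log_joins)

# Made-up log extracts using TPC-E table names.
log_tables = ["trade", "trade", "customer", "broker"]
log_joins = [("trade", "customer"), ("customer", "trade"),
             ("trade", "broker"), ("broker", "trade")]

tc = table_coverage(log_tables, {"trade", "customer"})
jc = join_coverage(log_joins, [("customer", "trade")])
print(tc, jc)  # 0.75 0.5
```

Under this reading, a summary graph that contains the tables and joins a workload actually uses scores close to 1 on both measures.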

23 Comparing Weight Functions
• Compare MI-based and MAF-based [YPS09] edge weights
  – Fixed B, varying |Q|; fixed |Q|, varying B
  – MI-based weight is superior: higher coverage, lower density

24 Choosing Budget Tables
• Effect of our strategy for choosing budget tables
  – Use coordinated summary graphs for fixed |Q|+B
  – Budget nodes reduce complexity, improve quality

25 Summary
• Complex database schemas in large enterprise systems
  – 1000+ tables, 10000+ columns, 100000+ PK/FK edges
• Novel schema graph summary is informative and succinct
  – Define schema graph edge weights using mutual information
  – Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget
  – Experimental study validates weight definition, summary model
• Future work: approximations for schema graph summaries

