Summary Graphs for Relational Database Schemas Xiaoyan Yang (NUS) Cecilia M. Procopiuc, Divesh Srivastava (AT&T)
Motivation Complex database schemas in large enterprise systems – tables, columns, PK/FK edges Prior work to help users understand complex schemas – Customized views (forms) to hide database schema – Present informative tables to simplify schema understanding Goal: schema graph summary connecting user’s query tables – Needs to be succinct – Needs to preserve informative join paths
Complex Schema Graph Example Complex database schema in a large real enterprise system – Too complex for illustrative purposes
TPC-E Benchmark Schema Graph
Useless TPC-E Schema Summary Graph [Figure: summary graph over security, trade, customer, status_type; graph weight not shown] Not very informative: all query tables have a status_type field – Succinct graph does not mean informative graph!
Informative TPC-E Schema Summary Graph [Figure: summary graph over customer, customer_account, holding_summary, security, trade; graph weight not shown] Very informative: securities held by, trades made by customer – Larger graph, smaller graph weight, union of shortest paths
Useless TPC-E Schema Summary Graph Union of pairwise shortest paths is not the answer – Small graph weight, but verbosity hinders understandability
Succinct TPC-E Schema Summary Graph [Figure: summary graph over commission_rate, customer_taxrate, broker, industry, customer_account, exchange; graph weight not shown] Informative & succinct: customer_account, exchange are hubs – Slightly larger graph weight, but informative and succinct
Outline Motivation Problem statement Our solution – Defining schema edge weights – Computing summary graphs Experimental results
Desiderata Schema graph summary must be informative and succinct Need a formal definition of “informative” – Use Information Theory Need a formal definition of “succinct” – Use Graph Summarization
Problem Statement 1: Informative Edges Given schema graph $G = (R, E)$ and database instance $D$ Problem 1: define schema edge weights $wt: E \to \mathbb{R}^+$ – More informative join edges have smaller weights ($\geq 0$) – Extend to table pairs: $wt(R_1, R_2)$ = weight of the shortest path between $R_1$ and $R_2$
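The extension of wt from edges to arbitrary table pairs can be sketched with a standard single-source shortest-path search. This is only an illustration, not the authors' implementation: the function name `shortest_path_weights`, the treatment of the schema graph as undirected, and the edge weights below are all assumptions made for the example.

```python
import heapq

def shortest_path_weights(edges, source):
    """Dijkstra over an undirected weighted graph given as {(u, v): weight}.

    Returns {table: weight of the shortest path from source}.
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical weights on a few TPC-E join edges (illustrative numbers only).
edges = {
    ("customer", "customer_account"): 0.2,
    ("customer_account", "trade"): 0.3,
    ("customer", "status_type"): 0.9,
    ("status_type", "trade"): 0.9,
}
wt = shortest_path_weights(edges, "customer")
print(wt["trade"])  # weight of customer -> customer_account -> trade
```

With these made-up weights, the path through customer_account is preferred over the path through status_type, which mirrors the intuition that more informative join edges should carry smaller weights.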
Problem Statement 2: Succinct Graph Given schema $G = (R, E)$, weight $wt$, user-specified tables $Q$ Problem 2: compute summary graph $(R_s, E_s)$ – $Q \subseteq R_s \subseteq R$, $|R_s| \leq |Q| + B$, for a given small budget $B$ – Meta-edges $E_s \subseteq \{(R_1, R_2) \mid \text{a path exists between } R_1 \text{ and } R_2 \text{ in } G\}$ – $(R_s, E_s)$ must preserve shortest paths between $Q$ tables in $G$ – Optimize: $(R_s, E_s)$ has the minimum sum of meta-edge weights
Informative Edges: Column Graph Build an edge-weighted column graph $G_C = (N_C, E_C)$ where – $N_C$ consists of all primary and foreign key columns in all tables – Intra-table edges in $E_C$ = $\{(R.P, R.F) \mid R.P \text{ is a PK column of } R\}$ – Inter-table edges in $E_C$ = $\{(R.P, R_1.F) \mid R_1.F \text{ is a foreign key to } R.P\}$ – Edge weights based on mutual information between columns [Figure: example column graph over columns A to F of tables R, S, T]
Informative Edges: Table Graph Induce an edge-weighted table graph $G_T = (N_T, E_T)$ where – $N_T$ consists of all tables – $E_T$ = $\{(R, R_1) \mid R_1.F \text{ is a foreign key to } R.P\}$ – Edge weight = min sum of weights on path between PK columns [Figure: column graph over columns A to F of tables R, S, T, and the induced table graph over R, S, T]
Edge Weight: Using Mutual Information Mutual information $I(X;Y) = \sum_x \sum_y p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)}$ – Mutual information captures strength of linkage between X, Y $D(X,Y) = 1 - I(X;Y)/H(X,Y)$ is a distance function, $H(\cdot)$ is entropy – $D(X,Y) = 0$ iff X, Y are identical; $D(X,Y) = 1$ iff X, Y are independent Example: X = (1, 2, 3, 4), Y = (2, 2, 1, 3) gives $I(X;Y) = 1.5$, $H(X,Y) = 2.0$, $D(X,Y) = 0.25$ [Figure: diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X,Y)]
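The slide's numeric example can be checked with a few lines of code. The sketch below assumes the two columns are given as aligned value lists (one pair per row), estimates probabilities by relative frequency, and uses $D(X,Y) = 1 - I(X;Y)/H(X,Y)$, which is the form that reproduces the slide's value of 0.25. The names `entropy` and `mi_distance` are illustrative, and returning distance 0 for two constant columns is a choice made here, not something the slides specify.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mi_distance(xs, ys):
    """Return (I(X;Y), H(X,Y), D(X,Y)) for two aligned columns."""
    h_x = entropy(xs)
    h_y = entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy over value pairs
    mi = h_x + h_y - h_xy              # I(X;Y) = H(X) + H(Y) - H(X,Y)
    if h_xy == 0:
        # Both columns are constant; treated as distance 0 (assumption).
        return 0.0, 0.0, 0.0
    return mi, h_xy, 1 - mi / h_xy

xs = [1, 2, 3, 4]
ys = [2, 2, 1, 3]
mi, h_xy, d = mi_distance(xs, ys)
print(mi, h_xy, d)  # 1.5 2.0 0.25
```

Here H(X) = 2.0 (four equally likely values) and H(Y) = 1.5, so I(X;Y) = 2.0 + 1.5 - 2.0 = 1.5 and D = 1 - 1.5/2.0 = 0.25.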
Summary Graph Given schema graph $G = (R, E)$, edge weight $wt: E \to \mathbb{R}^+$, and user-specified tables $Q$, compute summary graph $(R_s, E_s)$ – $Q \subseteq R_s \subseteq R$, $|R_s| \leq |Q| + B$, for a given small budget $B$ – Meta-edges $E_s \subseteq \{(R_1, R_2) \mid \text{a path exists between } R_1 \text{ and } R_2 \text{ in } G\}$ – $(R_s, E_s)$ must preserve shortest paths between $Q$ tables in $G$ – Optimize: $(R_s, E_s)$ has the minimum sum of meta-edge weights [Figure: example graphs over tables R, S, T, A, B with total weights 1.2, 1.1, and 0.7]
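To make the optimization concrete, here is an exhaustive-search sketch for very small graphs. It rests on several assumptions that the slides do not spell out: the graph is undirected, a meta-edge weighs as much as the shortest path between its endpoints in G, and "preserving shortest paths" means that every pair of Q tables has the same shortest-path distance in the summary graph as in G. The names `all_pairs` and `best_summary` and the toy weights are hypothetical, and this is not the integer-programming approach used in the talk.

```python
from itertools import combinations

INF = float("inf")

def all_pairs(nodes, edges):
    """Floyd-Warshall distances for an undirected graph {(u, v): weight}."""
    d = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def best_summary(nodes, edges, query, budget, eps=1e-9):
    """Try every node set and meta-edge set; feasible for tiny inputs only."""
    full = all_pairs(nodes, edges)
    others = [n for n in nodes if n not in query]
    best = (INF, None, None)  # (total weight, R_s, E_s)
    for k in range(budget + 1):
        for extra in combinations(others, k):
            rs = list(query) + list(extra)
            # Candidate meta-edges: connected pairs, weighted by distance in G.
            cand = [(u, v) for u, v in combinations(rs, 2) if full[u][v] < INF]
            for m in range(len(cand) + 1):
                for chosen in combinations(cand, m):
                    total = sum(full[u][v] for u, v in chosen)
                    if total >= best[0]:
                        continue  # cannot improve on the best found so far
                    meta = {(u, v): full[u][v] for u, v in chosen}
                    summ = all_pairs(rs, meta)
                    # Query-pair distances must match those in G.
                    if all(summ[a][b] == full[a][b]
                           or abs(summ[a][b] - full[a][b]) <= eps
                           for a, b in combinations(query, 2)):
                        best = (total, rs, list(chosen))
    return best

# Toy star graph: hub A joins query tables R, S, T (made-up weights).
nodes = ["R", "S", "T", "A"]
edges = {("A", "R"): 0.25, ("A", "S"): 0.25, ("A", "T"): 0.5}
query = ["R", "S", "T"]
no_budget = best_summary(nodes, edges, query, budget=0)
one_budget = best_summary(nodes, edges, query, budget=1)
print(no_budget[0], one_budget[0])  # 2.0 1.0
```

In this toy case, allowing one budget table lets the hub A replace three pairwise meta-edges (total 2.0) with three cheaper spokes (total 1.0), which illustrates why budget tables can reduce the graph weight.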
Properties of Summary Graphs Theorem: Computing the optimal summary graph is NP-hard Proof uses reduction from Clique in (n – 4)-regular graphs Proposition (towards an elegant solution formulation): – It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes – Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes
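The restriction to "the smaller graph consisting of shortest paths between Q nodes" can be sketched as a filter on tables: a table v lies on some shortest path between query tables a and b exactly when dist(a, v) + dist(v, b) = dist(a, b). The code below assumes an undirected graph; the function names and the weights are illustrative only.

```python
from itertools import combinations

INF = float("inf")

def all_pairs_shortest(nodes, edges):
    """Floyd-Warshall distances for an undirected graph {(u, v): weight}."""
    d = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        d[u][v] = d[v][u] = min(d[u][v], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def nodes_on_shortest_paths(nodes, edges, query, eps=1e-9):
    """Tables that appear on at least one shortest path between two Q tables."""
    dist = all_pairs_shortest(nodes, edges)
    keep = set(query)
    for a, b in combinations(query, 2):
        if dist[a][b] == INF:
            continue  # a and b are not connected in G
        for v in nodes:
            if abs(dist[a][v] + dist[v][b] - dist[a][b]) <= eps:
                keep.add(v)
    return keep

# Hypothetical weights on a few TPC-E join edges (illustrative numbers only).
nodes = ["customer", "customer_account", "trade", "status_type"]
edges = {
    ("customer", "customer_account"): 0.2,
    ("customer_account", "trade"): 0.3,
    ("customer", "status_type"): 0.9,
    ("status_type", "trade"): 0.9,
}
kept = nodes_on_shortest_paths(nodes, edges, ["customer", "trade"])
print(sorted(kept))  # ['customer', 'customer_account', 'trade']
```

Only tables in the returned set need to be considered as budget candidates under the proposition above; status_type is pruned here because it sits on no shortest path between the two query tables.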
Efficient Computation of Summary Graphs It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes Elegant solution: formulate an integer program; use CPLEX
Experimental Setup Data: use 2 instances of TPC-E benchmark database schema – Simulates an OLTP workload of a brokerage firm – Well-specified schema, including PK/FK constraints Quality: use measures based on the TPC-E transaction logs – Table coverage: relative frequency of summary graph tables in log – Join coverage: relative frequency of summary graph joins in log – Summary graph density: reflects complexity of summary graph
Comparing Weight Functions Compare MI-based and MAF-based [YPS09] edge weights – Fixed B, varying |Q|; fixed |Q|, varying B – MI-based weight is superior: higher coverage, lower density
Choosing Budget Tables Effect of our strategy for choosing budget tables – Use coordinated summary graphs for fixed |Q|+B – Budget nodes reduce complexity, improve quality
Summary Complex database schemas in large enterprise systems – tables, columns, PK/FK edges Novel schema graph summary is informative and succinct – Define schema graph edge weights using mutual information – Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget – Experimental study validates weight definition, summary model Future work: approximations for schema graph summaries