Summary Graphs for Relational Database Schemas
Xiaoyan Yang (NUS), Cecilia M. Procopiuc, Divesh Srivastava (AT&T)


Motivation
• Complex database schemas in large enterprise systems
  – Tables, columns, PK/FK edges
• Prior work to help users understand complex schemas
  – Customized views (forms) to hide database schema
  – Present informative tables to simplify schema understanding
• Goal: schema graph summary connecting user's query tables
  – Needs to be succinct
  – Needs to preserve informative join paths

Complex Schema Graph Example
• Complex database schema in a large real enterprise system
  – Too complex for illustrative purposes

TPC-E Benchmark Schema Graph

Useless TPC-E Schema Summary Graph
[Figure: summary graph with nodes security, trade, customer, status_type; graph weight value missing from transcript]
• Not very informative: all query tables have a status_type field
  – Succinct graph does not mean informative graph!

Informative TPC-E Schema Summary Graph
[Figure: summary graph with nodes customer, customer_account, holding_summary, security, trade; graph weight value missing from transcript]
• Very informative: securities held by, trades made by customer
  – Larger graph, smaller graph weight, union of shortest paths

Useless TPC-E Schema Summary Graph
• Union of pairwise shortest paths is not the answer
  – Small graph weight, but verbosity hinders understandability

Succinct TPC-E Schema Summary Graph
[Figure: summary graph with nodes commission_rate, customer_taxrate, broker, industry, customer_account, exchange; graph weight value missing from transcript]
• Informative & succinct: customer_account, exchange are hubs
  – Slightly larger graph weight, but informative and succinct

Outline
• Motivation
• Problem statement
• Our solution
  – Defining schema edge weights
  – Computing summary graphs
• Experimental results

Desiderata
• Schema graph summary must be informative and succinct
• Need a formal definition of "informative"
  – Use Information Theory
• Need a formal definition of "succinct"
  – Use Graph Summarization

Problem Statement 1: Informative Edges
• Given schema graph G = (R, E) and database instance D
• Problem 1: define schema edge weights, wt: E → R⁺
  – More informative join edges have smaller weights (≥ 0)
  – Extend wt(R1, R2) = weight of shortest path between R1 and R2

Problem Statement 2: Succinct Graph
• Given schema G = (R, E), weight wt, user-specified tables Q
• Problem 2: compute summary graph (R_s, E_s)
  – Q ⊆ R_s ⊆ R, |R_s| ≤ |Q| + B, for a given small budget B
  – Meta-edges E_s ⊆ {(R1, R2) | a path exists between R1 and R2 in G}
  – (R_s, E_s) must preserve shortest paths between Q tables in G
  – Optimize: (R_s, E_s) has the minimum sum of meta-edge weights

Informative Edges: Column Graph
• Build an edge-weighted column graph G_C = (N_C, E_C) where
  – N_C consists of all primary and foreign key columns in all tables
  – Intra-table edges in E_C = {(R.P, R.F) | R.P is a PK column of R}
  – Inter-table edges in E_C = {(R.P, R1.F) | R1.F is a foreign key to R.P}
  – Edge weights based on mutual information between columns
[Figure: column graph over columns A–F of tables R, S, T]

Informative Edges: Table Graph
• Induce an edge-weighted table graph G_T = (N_T, E_T) where
  – N_T consists of all tables
  – E_T = {(R, R1) | R1.F is a foreign key to R.P}
  – Edge weight = min sum of weights on path between PK columns
[Figure: column graph over columns A–F of tables R, S, T, and the induced table graph over R, S, T]
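
The induction step above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: each table has a single PK column, the column graph is given as a dictionary of undirected weighted edges, and the names `dijkstra`, `table_graph`, `pk_of`, and `fks` are hypothetical. The table-edge weight is taken as the shortest-path distance in the column graph between the two PK columns.

```python
import heapq

def dijkstra(adj, source):
    """Shortest-path distances from source in a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def table_graph(column_edges, pk_of, fks):
    """Induce table-edge weights from a column graph.

    column_edges: {(col_a, col_b): weight}, undirected (assumed format)
    pk_of:        {table: its PK column} (assumes one PK column per table)
    fks:          list of (referenced_table, referencing_table) pairs
    """
    adj = {}
    for (a, b), w in column_edges.items():
        adj.setdefault(a, {})[b] = w
        adj.setdefault(b, {})[a] = w
    weights = {}
    for referenced, referencing in fks:
        dist = dijkstra(adj, pk_of[referenced])
        weights[(referenced, referencing)] = dist.get(
            pk_of[referencing], float("inf"))
    return weights

# Hypothetical two-table schema: S.D is a foreign key to R.A.
column_edges = {("S.C", "S.D"): 0.3,   # intra-table edge (PK, FK) in S
                ("R.A", "S.D"): 0.1}   # inter-table edge (PK of R, FK of S)
weights = table_graph(column_edges, {"R": "R.A", "S": "S.C"}, [("R", "S")])
```

Here the only path between the PK columns R.A and S.C runs through S.D, so the edge (R, S) gets the sum of the two column-edge weights.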

Edge Weight: Using Mutual Information
• Mutual information I(X;Y) = Σ_x Σ_y p(x,y) log₂( p(x,y) / (p(x) p(y)) )
  – Mutual information captures strength of linkage between X, Y
• D(X,Y) = 1 – I(X;Y)/H(X,Y) is a distance function, where H() is entropy
  – D(X,Y) = 0 iff X, Y are identical; D(X,Y) = 1 iff X, Y are independent
• Example (the per-row i(x;y) values are missing from the transcript):
    X: 1 2 3 4
    Y: 2 2 1 3
  – I(X;Y) = 1.5, H(X,Y) = 2.0, D(X,Y) = 0.25
[Figure: diagram relating H(X), H(Y), H(X|Y), H(Y|X), I(X;Y), and H(X,Y)]
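
The example on this slide can be checked with a short Python sketch. It estimates probabilities from value frequencies in the two columns and uses the standard identity I(X;Y) = H(X) + H(Y) – H(X,Y); the function names `entropy` and `mi_distance` are illustrative, not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (in bits) of a sequence of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mi_distance(xs, ys):
    """Return (I(X;Y), H(X,Y), D(X,Y)) with D = 1 - I(X;Y)/H(X,Y)."""
    h_xy = entropy(list(zip(xs, ys)))          # joint entropy H(X,Y)
    i_xy = entropy(xs) + entropy(ys) - h_xy    # I = H(X) + H(Y) - H(X,Y)
    # If H(X,Y) = 0 both columns are constant; treating them as
    # identical (distance 0) is a convention assumed here.
    d = 1.0 - i_xy / h_xy if h_xy > 0 else 0.0
    return i_xy, h_xy, d

# Columns from the slide example.
i_xy, h_xy, d = mi_distance([1, 2, 3, 4], [2, 2, 1, 3])
print(i_xy, h_xy, d)  # 1.5 2.0 0.25
```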

Summary Graph
• Given schema graph G = (R, E), edge weight wt: E → R⁺, and user-specified tables Q, compute summary graph (R_s, E_s)
  – Q ⊆ R_s ⊆ R, |R_s| ≤ |Q| + B, for a given small budget B
  – Meta-edges E_s ⊆ {(R1, R2) | a path exists between R1 and R2 in G}
  – (R_s, E_s) must preserve shortest paths between Q tables in G
  – Optimize: (R_s, E_s) has the minimum sum of meta-edge weights
[Figure: example graphs over tables R, S, T with candidate budget tables A and B; total weights shown: 1.2, 1.1, 0.7]
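
To make the definition concrete, here is a brute-force Python sketch for very small schemas. It is not the integer-programming approach used in the talk; it simply enumerates budget tables and meta-edge subsets. It also rests on an assumed reading of the constraints: a meta-edge (R1, R2) weighs as much as the shortest path between R1 and R2 in G, and "preserving shortest paths" means that every pair of Q tables has the same distance in the summary graph as in G. The example graph and the names `all_pairs` and `best_summary` are hypothetical.

```python
from itertools import combinations

INF = float("inf")

def all_pairs(nodes, edges):
    """Floyd-Warshall distances for undirected weighted edges {(u, v): w}."""
    d = {u: {v: (0.0 if u == v else INF) for v in nodes} for u in nodes}
    for (u, v), w in edges.items():
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d

def best_summary(nodes, edges, query, budget, eps=1e-9):
    """Exhaustive search; exponential, so only usable on tiny graphs."""
    dist = all_pairs(nodes, edges)
    close = lambda a, b: a == b or abs(a - b) <= eps
    others = [n for n in nodes if n not in query]
    best = (INF, None, None)
    for b in range(budget + 1):
        for extra in combinations(others, b):
            rs = list(query) + list(extra)
            pairs = [p for p in combinations(rs, 2) if dist[p[0]][p[1]] < INF]
            for k in range(len(pairs) + 1):
                for chosen in combinations(pairs, k):
                    meta = {(u, v): dist[u][v] for u, v in chosen}
                    total = sum(meta.values())
                    if total >= best[0] - eps:
                        continue  # cannot improve on the best found so far
                    sd = all_pairs(rs, meta)
                    if all(close(sd[u][v], dist[u][v])
                           for u, v in combinations(query, 2)):
                        best = (total, set(rs), meta)
    return best

# Hypothetical schema graph: query tables R, S, T all join through A.
nodes = ["R", "S", "T", "A"]
edges = {("R", "A"): 0.2, ("A", "S"): 0.3, ("A", "T"): 0.2}
no_budget = best_summary(nodes, edges, ["R", "S", "T"], budget=0)
one_budget = best_summary(nodes, edges, ["R", "S", "T"], budget=1)
```

In this toy graph, a budget of 0 forces three direct meta-edges between the query tables (total weight about 1.4), while a budget of 1 lets A act as a hub and brings the total down to about 0.7.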

Properties of Summary Graphs
• Theorem: Computing the optimal summary graph is NP-hard
  – Proof uses reduction from Clique in (n – 4)-regular graphs
• Proposition (towards an elegant solution formulation):
  – It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes
  – Endpoints of meta-edges in optimal summary graph have to appear together on at least one shortest path between Q nodes

Efficient Computation of Summary Graphs
• It is sufficient to compute an optimal summary graph for the smaller graph consisting of shortest paths between Q nodes
• Elegant solution: formulate an integer program; use CPLEX

Experimental Setup
• Data: use 2 instances of TPC-E benchmark database schema
  – Simulates an OLTP workload of a brokerage firm
  – Well-specified schema, including PK/FK constraints
• Quality: use measures based on the TPC-E transaction logs
  – Table coverage: relative frequency of summary graph tables in log
  – Join coverage: relative frequency of summary graph joins in log
  – Summary graph density: reflects complexity of summary graph
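
The two coverage measures could be computed along the following lines. The slides do not give the exact formulas, so this Python sketch assumes a simple reading: coverage is the fraction of table (or join) occurrences in the log that also appear in the summary graph. The log contents and the name `coverage` are hypothetical.

```python
def coverage(log_items, summary_items):
    """Fraction of logged occurrences that fall inside the summary graph."""
    if not log_items:
        return 0.0
    hits = sum(1 for item in log_items if item in summary_items)
    return hits / len(log_items)

# Hypothetical log: one entry per table access / per join executed.
log_tables = ["trade", "trade", "customer", "broker"]
log_joins = [frozenset({"trade", "customer"}),
             frozenset({"trade", "broker"})]

# Hypothetical summary graph contents.
summary_tables = {"trade", "customer"}
summary_joins = {frozenset({"trade", "customer"})}

table_cov = coverage(log_tables, summary_tables)  # 3 of 4 accesses covered
join_cov = coverage(log_joins, summary_joins)     # 1 of 2 joins covered
```

Joins are stored as frozensets so that (trade, customer) and (customer, trade) count as the same join.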

Comparing Weight Functions
• Compare MI-based and MAF-based [YPS09] edge weights
  – Fixed B, varying |Q|; fixed |Q|, varying B
  – MI-based weight is superior: higher coverage, lower density

Choosing Budget Tables
• Effect of our strategy for choosing budget tables
  – Use coordinated summary graphs for fixed |Q| + B
  – Budget nodes reduce complexity, improve quality

Summary
• Complex database schemas in large enterprise systems
  – Tables, columns, PK/FK edges
• Novel schema graph summary is informative and succinct
  – Define schema graph edge weights using mutual information
  – Compute succinct summary graph that preserves query table shortest paths and minimizes graph weight, for a given budget
  – Experimental study validates weight definition, summary model
• Future work: approximations for schema graph summaries