GORDIAN: Efficient and Scalable Discovery of Composite Keys

Yannis Sismanis, Peter J. Haas, Berthold Reinwald, Paul Brown
IBM Almaden Research Center
1/17/2019 VLDB 2006

Motivation
Key discovery is of fundamental importance for:
  Data Modeling
  Data Integration
  Anomaly Detection
  Query Formulation & Optimization
  Indexing
Key information is often incomplete or unknown to the DBMS because:
  The key represents an unknown “constraint/dependency” inherent to the domain
  Keys arise fortuitously from statistical properties of the data
  The key is exploited by the application without the DBMS explicitly knowing about it
  Cost reasons force the DBA not to explicitly identify or enforce it
Computationally hard:
  Finding the minimum key (least number of attributes): NP-hard
  Finding all the keys: #P-hard

GORDIAN in a Nutshell
Good “typical-case” behavior on real-world datasets
  Worst-case performance arises on artificial and unrepresentative datasets
Scales well with the number of attributes
  For Zipfian data, time complexity is polynomial in the number of attributes
Basic idea
  Cube computation
    Many optimizations, since we do not deal with storing, indexing, or even fully computing the cube
  Interleave the computation with the discovery of non-keys
    A non-key can be discovered by looking at a subset of the entities
    A non-key cannot be invalidated as more entities are examined
    Since any subset of the attributes of a non-key is also a non-key, apply “Apriori”-like pruning techniques
  Compute the complement, yielding the desired keys
Sampling extensions for the discovery of high-quality approximate keys
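The "compute the complement" step can be illustrated with a small brute-force sketch. It relies on the fact that an attribute set is a key exactly when it is not contained in any non-key. The function name keys_from_nonkeys, the attribute names, and the two non-keys used as input are assumptions made for illustration, and the exhaustive enumeration is not claimed to be how GORDIAN itself computes the complement.

```python
from itertools import combinations

def keys_from_nonkeys(attributes, maximal_nonkeys):
    """Return the minimal keys implied by a set of maximal non-keys.

    A candidate is a key iff it is not a subset of any non-key, i.e. iff it
    shares at least one attribute with the complement of every non-key.
    """
    complements = [set(attributes) - set(nk) for nk in maximal_nonkeys]
    keys = []
    # Enumerate candidates from smallest to largest so minimality is easy to check.
    for size in range(1, len(attributes) + 1):
        for cand in combinations(attributes, size):
            cand_set = set(cand)
            if any(set(k) <= cand_set for k in keys):
                continue  # a smaller key is already contained in this candidate
            if all(cand_set & comp for comp in complements):
                keys.append(cand)
    return keys

# Hypothetical input: attribute names and non-keys chosen for illustration.
attrs = ["FirstName", "LastName", "Phone", "EmpNo"]
nonkeys = [("FirstName", "LastName"), ("Phone",)]
minimal_keys = keys_from_nonkeys(attrs, nonkeys)
print(minimal_keys)
```

The enumeration is exponential in the number of attributes, so it is only suitable for reading the idea, not for wide tables.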

Keys & Non-Keys
<EmpNo>, <Last Name,Phone>, <First Name,Phone>,… are keys
<First Name,Last Name>, <First Name>,… are non-keys
A non-key K covers a non-key K’ iff K’ ⊆ K
A set of non-keys {K1,K2,…} is non-redundant iff Ki ⊈ Kj for all i ≠ j
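A minimal sketch of these definitions follows. The four employee rows are an assumption: they are reconstructed from the node labels of the prefix-tree slides, not taken from a table in the deck. The helper names is_nonkey and non_redundant are hypothetical.

```python
# Assumed sample data, reconstructed from the prefix-tree slides.
EMPLOYEES = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
    {"FirstName": "Michael", "LastName": "Spencer", "Phone": 5237, "EmpNo": 90},
    {"FirstName": "Sally", "LastName": "Kwan", "Phone": 3478, "EmpNo": 20},
]

def is_nonkey(rows, attrs):
    """True if at least two entities agree on every attribute in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return True
        seen.add(value)
    return False

def non_redundant(nonkeys):
    """Drop every non-key that is covered by (is a proper subset of) another."""
    sets = {frozenset(nk) for nk in nonkeys}
    return {s for s in sets if not any(s < other for other in sets)}

print(is_nonkey(EMPLOYEES, ["EmpNo"]))                  # EmpNo is a key
print(is_nonkey(EMPLOYEES, ["FirstName", "LastName"]))  # a non-key
print(non_redundant([("FirstName",), ("LastName",), ("Phone",),
                     ("FirstName", "LastName")]))
```

On this data, <FirstName> and <LastName> are both covered by <FirstName,LastName>, so only <FirstName,LastName> and <Phone> remain in the non-redundant set.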

Cube Operator / NoKeys …
If any count is greater than 1, then the corresponding projection is a non-key
Computing the whole cube and only then looking for non-keys is very inefficient
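For contrast, the inefficient baseline can be sketched directly: count every projection of the data and report those with a count above 1. The rows are assumed sample data reconstructed from the prefix-tree slides, and naive_nonkeys is a hypothetical name. This is the approach the slide warns against, not GORDIAN.

```python
from collections import Counter
from itertools import combinations

# Assumed sample data, reconstructed from the prefix-tree slides.
ATTRS = ["FirstName", "LastName", "Phone", "EmpNo"]
EMPLOYEES = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
    {"FirstName": "Michael", "LastName": "Spencer", "Phone": 5237, "EmpNo": 90},
    {"FirstName": "Sally", "LastName": "Kwan", "Phone": 3478, "EmpNo": 20},
]

def naive_nonkeys(rows, attributes):
    """Compute every projection of a non-empty table and keep those with a count > 1."""
    nonkeys = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            counts = Counter(tuple(row[a] for a in attrs) for row in rows)
            if max(counts.values()) > 1:
                nonkeys.append(attrs)
    return nonkeys

all_nonkeys = naive_nonkeys(EMPLOYEES, ATTRS)
print(all_nonkeys)
```

With d attributes this visits all 2^d - 1 non-empty projections and scans every row for each of them, which is why the full cube is avoided.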

Cube Slices/Segments
[Figure: the slice FirstName=‘Michael’ and its segments]
A selection on the cube returns a “slice”, e.g. FirstName=‘Michael’
A slice consists of “segments”
GORDIAN is based on a slice-by-slice computation
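A rough sketch of a slice and its segments, under the assumption that a segment is the count table of the slice projected onto some subset of the attributes not fixed by the selection. The rows are assumed sample data reconstructed from the prefix-tree slides, and slice_segments is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Assumed sample data, reconstructed from the prefix-tree slides.
ATTRS = ["FirstName", "LastName", "Phone", "EmpNo"]
EMPLOYEES = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
    {"FirstName": "Michael", "LastName": "Spencer", "Phone": 5237, "EmpNo": 90},
    {"FirstName": "Sally", "LastName": "Kwan", "Phone": 3478, "EmpNo": 20},
]

def slice_segments(rows, attributes, selection):
    """Select the rows matching `selection`, then count each projection
    of the remaining (unfixed) attributes. Returns {attrs: Counter}."""
    in_slice = [r for r in rows if all(r[a] == v for a, v in selection.items())]
    free = [a for a in attributes if a not in selection]
    segments = {}
    for size in range(1, len(free) + 1):
        for attrs in combinations(free, size):
            segments[attrs] = Counter(tuple(r[a] for a in attrs) for r in in_slice)
    return segments

segments = slice_segments(EMPLOYEES, ATTRS, {"FirstName": "Michael"})
for attrs, counts in segments.items():
    print(attrs, dict(counts))
```

In the slice FirstName=‘Michael’, the segment over LastName contains ‘Thompson’ with count 2, which shows that <FirstName,LastName> is a non-key.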

Slice Subsumption & Non-keys
[Figure: some segments of slice F (First Name = ‘Michael’) and slice L (Last Name = ‘Thompson’)]
A slice L is “subsumed” by another slice F iff all the segments of L contain only tuples that can be extracted from the tuples of F by just projecting out some attributes
Subsumption Lemma: if L is subsumed by F, then every non-key of L is redundant to some non-key of F
  Example: <LastName> is redundant to <FirstName,LastName>
Presenter notes: clarify that we are showing some segments of the slices; define subsumption *very* clearly

GORDIAN Overview

Prefix tree
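A minimal sketch of a prefix tree over the entities, using nested dictionaries: one level per attribute, with a count stored at each leaf value. The rows are an assumption reconstructed from the node labels on the later slides, and build_prefix_tree is a hypothetical name. The deck does not show GORDIAN's actual node layout, so this is only an approximation of the structure.

```python
# Assumed sample data in attribute order (FirstName, LastName, Phone, EmpNo),
# reconstructed from the prefix-tree slides.
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]

def build_prefix_tree(rows):
    """Insert each entity as a root-to-leaf path; leaves map value -> count."""
    root = {}
    for row in rows:
        node = root
        for value in row[:-1]:
            # Entities sharing a prefix share the nodes along that prefix.
            node = node.setdefault(value, {})
        node[row[-1]] = node.get(row[-1], 0) + 1
    return root

tree = build_prefix_tree(ROWS)
print(tree["Michael"]["Thompson"])
```

The two ‘Michael Thompson’ entities share the first two levels of the tree and only branch at the Phone level.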

Slice-by-Slice Computation
A slice corresponds to a path from the root to a node
  E.g. FirstName=“Michael” and LastName=“Thompson” corresponds to the path (1),(2)
Segments of the slice are computed by “merging” sub-trees (next slide)
Slice-by-slice computation = systematic traversal of the prefix tree
  Discovers subsumed slices

Compute segments by “merging”
[Figure: prefix-tree path (1) Michael (First Name [0]), (2) Thompson (Last Name [1]), (3) 3478, 6791 (Phone [2]), leaf nodes (4) 10 and (5) 50, each with count 1 (Emp No [3]); merged node (M1) holds 10 and 50]
Slice FirstName=‘Michael’ and LastName=‘Thompson’
  Path (1),(2) to node (3)
  Compute segment <EmpNo> by merging the children trees of (3)
Presenter notes: add animation to show merging; discuss space efficiency; add 8(d)
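The merge step can be sketched on a nested-dictionary prefix tree. The tree literal below is an assumption reconstructed from the node labels on the slides, and merge is a hypothetical helper that unions sibling sub-trees level by level and sums the counts at the leaves. It is a simplified reading of the slide, not the paper's exact procedure.

```python
# Assumed prefix tree (FirstName -> LastName -> Phone -> EmpNo -> count),
# reconstructed from the node labels on the slides.
TREE = {
    "Michael": {
        "Thompson": {3478: {10: 1}, 6791: {50: 1}},
        "Spencer": {5237: {90: 1}},
    },
    "Sally": {"Kwan": {3478: {20: 1}}},
}

def merge(trees):
    """Merge sub-trees of equal depth: union the values at each level and
    add up the counts at the leaves."""
    grouped = {}
    for tree in trees:
        for value, child in tree.items():
            if isinstance(child, dict):
                grouped.setdefault(value, []).append(child)  # inner level
            else:
                grouped[value] = grouped.get(value, 0) + child  # leaf count
    return {v: merge(c) if isinstance(c, list) else c for v, c in grouped.items()}

# Slice FirstName='Michael', LastName='Thompson' ends at node (3).
node3 = TREE["Michael"]["Thompson"]
m1 = merge(list(node3.values()))  # projects out Phone, leaving segment <EmpNo>
print(m1)

# Merging all Phone-level nodes projects out FirstName and LastName.
phone_level = [phones for lasts in TREE.values() for phones in lasts.values()]
phones_merged = merge(phone_level)
print(phones_merged[3478])
```

In the second merge, phone 3478 ends up with two EmpNo values beneath it, so two entities share that phone number and <Phone> is a non-key.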

GORDIAN in action
[Figure: prefix tree over First Name [0], Last Name [1], Phone [2], Emp No [3]. Node (1): Michael, Sally. Node (2): Thompson, Spencer; node (8): Kwan. Node (3): 3478, 6791; node (6): 5237; node (9): 3478. Leaf nodes (4): 10, (5): 50, (7): 90, (10): 20, each with count 1.]
Merge steps shown, with results labeled (M1) to (M7): nodes (2) and (8); nodes (3) and (6); nodes (4) and (5); nodes (4), (5) and (7); nodes (3), (6) and (9); nodes (M6), (5) and (7)
Non-keys shown: <First Name>, <First Name, Last Name>, <Phone>

Subsumption in Prefix-Trees
[Figure: (a) Correlation, (b) Sparsity]
Subsumption appears because of:
  Correlations: (Ref1) and (Ref2) have been previously traversed and point to subsumed slices
  Sparsity: a root-to-node path points to a sparse area with a small number of tuples
Presenter notes: clarify that (Ref1) and (Ref2) have been previously traversed; remind the audience of the equivalence slice = root-to-node path

Complexity
Zipfian datasets
  Singleton pruning only due to sparsity
  No correlations (conservative approach)
Time complexity: [equation missing in source]
  where s is the number of non-redundant non-keys, d the number of attributes, T the number of entities, and θ the skew of the data

Sampling
A key for the whole dataset is a key in any sample
  No missing keys
  False “positives”
Discovery of approximate keys
  Strength of an approximate key = [equation missing in source]
  Tight lower bound on strength: [equation missing in source]
    where N is the sample size and Dv the distinct count
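A small sketch of why sampling yields false positives but no missed keys. Since the strength formula did not survive in this transcript, the code assumes that strength is the number of distinct values of the attribute set divided by the number of entities; that definition, the sample rows, and the function name strength are all assumptions. The lower bound on strength is not reproduced here.

```python
# Assumed sample data, reconstructed from the prefix-tree slides.
EMPLOYEES = [
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 3478, "EmpNo": 10},
    {"FirstName": "Michael", "LastName": "Thompson", "Phone": 6791, "EmpNo": 50},
    {"FirstName": "Michael", "LastName": "Spencer", "Phone": 5237, "EmpNo": 90},
    {"FirstName": "Sally", "LastName": "Kwan", "Phone": 3478, "EmpNo": 20},
]

def strength(rows, attrs):
    """Assumed definition: distinct values of attrs / number of entities.
    A strength of 1.0 means attrs is a key on `rows`."""
    distinct = {tuple(r[a] for a in attrs) for r in rows}
    return len(distinct) / len(rows)

# A fixed two-row sample keeps the example deterministic.
sample = EMPLOYEES[:2]

print(strength(sample, ["Phone"]))      # <Phone> looks like a key in the sample
print(strength(EMPLOYEES, ["Phone"]))   # but is only approximate on the full data
print(strength(EMPLOYEES, ["EmpNo"]))   # a true key stays a key in any sample
```

<Phone> is a false positive: it is unique in the sample but repeats once in the full data, so under the assumed definition it is an approximate key with strength below 1.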

Experiment Setup
GORDIAN implemented as a UDF for DB2 v8.2
Performance comparison w.r.t.
  Number of entities
  Number of attributes
Pruning evaluation
Accuracy evaluation (approximate keys)

Time Comparison

Attribute Scalability

Pruning Effect

Accuracy Evaluation (Sampling)

Conclusions
GORDIAN has excellent “typical-case” behavior on real-world datasets
Very good scalability w.r.t. the number of attributes
  For Zipfian data, time complexity is polynomial in the number of attributes
Innovation
  Formulate key discovery as a cube computation problem
  Interleave the computation with the discovery of non-keys
  Novel subsumption pruning
  Extra Apriori-like pruning
  Sampling extensions for the discovery of high-quality approximate keys

Questions?
