GORDIAN: Efficient and Scalable Discovery of Composite Keys


GORDIAN: Efficient and Scalable Discovery of Composite Keys
Yannis Sismanis, Peter J. Haas, Berthold Reinwald, Paul Brown
IBM Almaden Research Center
1/17/2019 VLDB 2006

Motivation
Key discovery is of fundamental importance for:
- Data modeling
- Data integration
- Anomaly detection
- Query formulation & optimization
- Indexing
Key information is often incomplete/unknown to the DBMS because:
- It represents an unknown "constraint/dependency" inherent to the domain
- Keys arise fortuitously from statistical properties of the data
- It is exploited by the application without the DBMS explicitly knowing
- Cost reasons force the DBA not to explicitly identify/enforce it
Computationally hard:
- Finding the minimum key (fewest attributes): NP-hard
- Finding all the keys: #P-hard

GORDIAN in a Nutshell
- Good "typical-case" behavior on real-world datasets
  - Worst-case performance requires artificial and unrepresentative datasets
- Scales well with the number of attributes
  - For Zipfian data, time complexity is polynomial in the number of attributes
- Basic idea
  - Cube computation
    - Many optimizations, since we do not deal with storing/indexing or even fully computing the cube
  - Interleave the computation with the discovery of non-keys
    - A non-key can be discovered by looking at a subset of the entities
    - A non-key cannot be invalidated as more entities are examined
    - Since any subset of the attributes of a non-key is also a non-key, apply "Apriori"-like pruning techniques
  - Compute the complement, yielding the desired keys
- Sampling extensions for the discovery of high-quality approximate keys
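The monotonicity property above (every subset of a non-key is itself a non-key) can be illustrated with a small brute-force search. This is only a sketch of the pruning idea, not GORDIAN's slice-by-slice traversal. The function name `find_non_keys_with_pruning` is hypothetical, and the four employee rows are an assumption read off the prefix-tree example used in this deck.

```python
from itertools import combinations

# Assumed example rows (First Name, Last Name, Phone, Emp No).
SCHEMA = ("First Name", "Last Name", "Phone", "Emp No")
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]


def find_non_keys_with_pruning(rows, schema):
    """Scan attribute sets from largest to smallest.

    A candidate that is a subset of an already known non-key is skipped,
    because it must be a non-key too. Returns the non-keys that were
    actually found by inspecting the data and the number of data checks.
    """
    non_keys, checks = [], 0
    for size in range(len(schema), 0, -1):
        for idx in combinations(range(len(schema)), size):
            candidate = frozenset(idx)
            if any(candidate <= known for known in non_keys):
                continue  # pruned: covered by a known non-key
            checks += 1
            projected = [tuple(row[i] for i in idx) for row in rows]
            if len(set(projected)) < len(projected):  # duplicate projection
                non_keys.append(candidate)
    named = [tuple(schema[i] for i in sorted(nk)) for nk in non_keys]
    return named, checks


NON_KEYS, CHECKS = find_non_keys_with_pruning(ROWS, SCHEMA)
print(NON_KEYS, CHECKS)
```

On these four rows the search inspects 13 of the 15 attribute sets: <First Name> and <Last Name> are skipped once <First Name, Last Name> is known to be a non-key.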

Keys & Non-Keys
- <EmpNo>, <Last Name, Phone>, <First Name, Phone>, … are keys
- <First Name, Last Name>, <First Name>, … are non-keys
- A non-key K covers a non-key K' iff K' ⊆ K
- A set of non-keys {K1, K2, …} is non-redundant iff Ki ⊄ Kj for all i ≠ j
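The definitions on this slide can be checked directly on a small table. The sketch below is illustrative only: the helper names `is_key`, `covers` and `non_redundant` are hypothetical, and the four rows are an assumption taken from the prefix-tree example that appears later in the deck.

```python
from itertools import combinations

# Assumed example rows (First Name, Last Name, Phone, Emp No).
ATTRS = ("First Name", "Last Name", "Phone", "Emp No")
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]


def is_key(attrs, rows=ROWS, schema=ATTRS):
    """True if no two rows agree on every attribute in attrs."""
    idx = [schema.index(a) for a in attrs]
    projected = [tuple(row[i] for i in idx) for row in rows]
    return len(set(projected)) == len(projected)


def covers(k, k_prime):
    """Non-key k covers non-key k_prime iff k_prime is a subset of k."""
    return set(k_prime) <= set(k)


def non_redundant(non_keys):
    """Keep only the non-keys that are not covered by a different non-key."""
    sets = {frozenset(k) for k in non_keys}
    return {k for k in sets if not any(k < other for other in sets)}


# Exhaustive enumeration: fine for 4 attributes, exponential in general.
all_sets = [c for n in range(1, len(ATTRS) + 1) for c in combinations(ATTRS, n)]
non_keys = [c for c in all_sets if not is_key(c)]
maximal = non_redundant(non_keys)
print(sorted(sorted(k) for k in maximal))
```

Here <First Name> and <Last Name> are both covered by <First Name, Last Name>, so the non-redundant set reduces to <First Name, Last Name> and <Phone>.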

Cube Operator / Non-Keys
- If any count > 1, then the corresponding projection is a non-key
- Computing the whole cube and only then looking for non-keys is very inefficient
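To make the count test concrete, the following sketch materializes the full cube for a tiny table and flags every projection that contains a count above 1. It deliberately does the inefficient thing the slide warns about, computing all 2^d - 1 projections first. The names `cube_counts` and `non_keys_from_cube` are hypothetical, and the rows are an assumed example taken from the prefix-tree slides of this deck.

```python
from collections import Counter
from itertools import combinations

# Assumed example rows (First Name, Last Name, Phone, Emp No).
SCHEMA = ("First Name", "Last Name", "Phone", "Emp No")
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]


def cube_counts(rows, schema):
    """Map every non-empty attribute subset to the counts of its projected values."""
    cube = {}
    for size in range(1, len(schema) + 1):
        for idx in combinations(range(len(schema)), size):
            names = tuple(schema[i] for i in idx)
            cube[names] = Counter(tuple(row[i] for i in idx) for row in rows)
    return cube


def non_keys_from_cube(cube):
    """A projection is a non-key if any of its cells has a count above 1."""
    return [names for names, counts in cube.items() if max(counts.values()) > 1]


CUBE = cube_counts(ROWS, SCHEMA)
NON_KEYS = non_keys_from_cube(CUBE)
print(len(CUBE), NON_KEYS)
```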

Cube Slices/Segments
[Figure: the slice FirstName='Michael' and its segments]
- A selection on the cube returns a "slice", e.g. FirstName='Michael'
- A slice consists of "segments"
- GORDIAN is based on a slice-by-slice computation

Slice Subsumption & Non-keys
[Figure: segments of slice F (First Name = 'Michael') and slice L (Last Name = 'Thompson')]
- A slice L is "subsumed" by another slice F iff all the segments of L contain only tuples that can be extracted from the tuples of F by just projecting out some attributes
- Subsumption Lemma: if L is subsumed by F, then every non-key of L is redundant to some non-key of F
- <LastName> is redundant to <FirstName, LastName>
Speaker notes: Clarify that we are showing some segments of the slices. Define subsumption *very* clearly.

GORDIAN Overview

Prefix tree

Slice-by-Slice Computation
- A slice corresponds to a path from the root to a node
  - E.g. FirstName="Michael" and LastName="Thompson" corresponds to the path (1),(2)
- Segments of the slice are computed by "merging" sub-trees (next slide)
- Slice-by-slice computation = systematic traversal of the prefix tree
  - Discovers subsumed slices

Compute segments by "merging"
[Figure: prefix-tree path for the slice. Node (1): First Name [0] = Michael. Node (2): Last Name [1] = Thompson. Node (3): Phone [2] = 3478, 6791. Nodes (4), (5): Emp No [3] = 10 (count 1) and 50 (count 1). Merged node (M1): Emp No = 10 (count 1), 50 (count 1).]
- Slice FirstName='Michael' and LastName='Thompson'
- Path (1),(2) to node (3)
- Compute segment <EmpNo> by merging the children trees of (3)
Speaker notes: Add animation to show merging. Discuss space efficiency. Add 8(d).
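The merge step can be sketched with nested dictionaries standing in for the prefix tree. This is a simplified illustration under stated assumptions, not the authors' data structure: `build_prefix_tree`, `merge` and `entity_count` are hypothetical names, the leaf level stores a count per Emp No value, and the rows are read off the tree shown in this deck.

```python
# Assumed example rows, in attribute order First Name, Last Name, Phone, Emp No.
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]


def build_prefix_tree(rows):
    """Nested dicts keyed by attribute value; the last level maps value -> count."""
    root = {}
    for row in rows:
        node = root
        for value in row[:-1]:
            node = node.setdefault(value, {})
        node[row[-1]] = node.get(row[-1], 0) + 1
    return root


def merge(nodes):
    """Merge sibling nodes: union their cells, merge children of shared
    values recursively, and add up counts at the leaf level."""
    grouped = {}
    for node in nodes:
        for value, child in node.items():
            if isinstance(child, dict):
                grouped.setdefault(value, []).append(child)
            else:
                grouped[value] = grouped.get(value, 0) + child
    return {v: merge(c) if isinstance(c, list) else c for v, c in grouped.items()}


def entity_count(node):
    """Number of entities stored below a node (or held in a leaf count)."""
    if isinstance(node, dict):
        return sum(entity_count(child) for child in node.values())
    return node


TREE = build_prefix_tree(ROWS)
NODE_3 = TREE["Michael"]["Thompson"]       # path (1),(2) ends at node (3)
SEGMENT_EMPNO = merge(NODE_3.values())     # merge the children of (3)
LAST_NAMES = merge(TREE.values())          # project out First Name
PHONES = merge(LAST_NAMES.values())        # project out Last Name as well
print(SEGMENT_EMPNO, entity_count(PHONES[3478]))
```

Merging the children of node (3) yields Emp No 10 and 50, each with count 1, so this segment exposes no non-key. After First Name and Last Name are both merged away, the cell Phone = 3478 holds two entities, which is the evidence that <Phone> is a non-key.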

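GORDIAN in action
[Figure: prefix tree over First Name [0], Last Name [1], Phone [2], Emp No [3]; every Emp No cell has count 1]
Prefix tree:
- (1) First Name: Michael -> (2), Sally -> (8)
- (2) Last Name: Thompson -> (3), Spencer -> (6)
- (8) Last Name: Kwan -> (9)
- (3) Phone: 3478 -> (4), 6791 -> (5)
- (6) Phone: 5237 -> (7)
- (9) Phone: 3478 -> (10)
- (4) Emp No: 10; (5) Emp No: 50; (7) Emp No: 90; (10) Emp No: 20
Merged nodes:
- (M1) After merging nodes (4) and (5): Emp No 10, 50
- (M2) After merging nodes (3) and (6): Phone 3478 -> (4), 6791 -> (5), 5237 -> (7)
- (M3) After merging nodes (4), (5) and (7): Emp No 10, 50, 90
- (M4) After merging nodes (2) and (8): Last Name Thompson -> (3), Spencer -> (6), Kwan -> (9)
- (M5) After merging nodes (3), (6) and (9): Phone 3478 -> (M6), 6791 -> (5), 5237 -> (7)
- (M6) Emp No 10, 20
- (M7) After merging nodes (M6), (5) and (7): Emp No 10, 20, 50, 90
Non-keys discovered: <First Name>, <First Name, Last Name>, <Phone>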
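Once the non-redundant non-keys are known, the keys follow as their complement: an attribute set is a key exactly when it is not contained in any non-key. The sketch below computes the minimal keys by brute-force enumeration. It is an illustration of the complement idea only, not the conversion procedure used by GORDIAN, and `keys_from_non_keys` is a hypothetical name.

```python
from itertools import combinations

SCHEMA = ("First Name", "Last Name", "Phone", "Emp No")
# Non-redundant non-keys from the example (<First Name> is covered
# by <First Name, Last Name> and therefore omitted).
NON_KEYS = [("First Name", "Last Name"), ("Phone",)]


def keys_from_non_keys(schema, non_keys):
    """Return the minimal attribute sets that are not covered by any non-key."""
    non_key_sets = [frozenset(k) for k in non_keys]
    keys = []
    for size in range(1, len(schema) + 1):  # small sets first, so keys stay minimal
        for candidate in map(frozenset, combinations(schema, size)):
            if any(candidate <= nk for nk in non_key_sets):
                continue  # covered by a non-key, hence a non-key itself
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, hence not minimal
            keys.append(candidate)
    return keys


KEYS = keys_from_non_keys(SCHEMA, NON_KEYS)
print([sorted(k) for k in KEYS])
```

The result is <Emp No>, <First Name, Phone> and <Last Name, Phone>, which matches the keys listed on the Keys & Non-Keys slide.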

Subsumption in Prefix-Trees
[Figure panels: (a) Correlation, (b) Sparsity]
Subsumption appears because of:
- Correlations: (Ref1), (Ref2) have been previously traversed and point to subsumed slices
- Sparsity: a root-to-node path points to a sparse area with a small number of tuples
Speaker notes: Clarify that (Ref1) and (Ref2) have been previously traversed. Remind the audience of the equivalence slice = root-to-node path.

Complexity
- Zipfian datasets
- Singleton pruning only, due to sparsity
- No correlations (conservative approach)
- Time complexity: [formula not preserved in this transcript], where s is the number of non-redundant non-keys, d the number of attributes, T the number of entities, and θ the skew of the data

Sampling
- A key for the whole dataset is a key in any sample
  - No missing keys
  - False "positives"
- Discovery of approximate keys
  - Strength of an approximate key = [formula not preserved in this transcript]
  - Tight lower bound on strength: [formula not preserved in this transcript], where N is the sample size and Dv the distinct count
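The false-positive effect of sampling can be shown on a tiny table. Because the slide's strength formula did not survive in this transcript, the sketch below assumes a common definition: the number of distinct values of the attribute set divided by the number of entities, so that a true key has strength 1.0. That definition, the function name `strength`, and the example rows are all assumptions made for illustration.

```python
# Assumed example rows (First Name, Last Name, Phone, Emp No).
SCHEMA = ("First Name", "Last Name", "Phone", "Emp No")
ROWS = [
    ("Michael", "Thompson", 3478, 10),
    ("Michael", "Thompson", 6791, 50),
    ("Michael", "Spencer", 5237, 90),
    ("Sally", "Kwan", 3478, 20),
]


def strength(attrs, rows, schema=SCHEMA):
    """Assumed definition: distinct projected values / number of rows."""
    idx = [schema.index(a) for a in attrs]
    distinct = {tuple(row[i] for i in idx) for row in rows}
    return len(distinct) / len(rows)


SAMPLE = ROWS[:3]  # a fixed "sample" that happens to miss the second 3478

FULL_PHONE = strength(("Phone",), ROWS)        # Phone repeats in the full data
SAMPLE_PHONE = strength(("Phone",), SAMPLE)    # looks like a key in the sample
SAMPLE_EMPNO = strength(("Emp No",), SAMPLE)   # a true key stays a key
print(FULL_PHONE, SAMPLE_PHONE, SAMPLE_EMPNO)
```

<Emp No> is a key of the full table and therefore of the sample, while <Phone> appears to be a key in the sample although its strength on the full table is only 0.75: a false positive, and at the same time a reasonably strong approximate key.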

Experiment Setup
- GORDIAN implemented as a UDF for DB2 v8.2
- Performance comparison w.r.t.
  - Number of entities
  - Number of attributes
- Pruning evaluation
- Accuracy evaluation (approximate keys)

Time Comparison

Attribute Scalability

Pruning Effect

Accuracy Evaluation (Sampling)

Conclusions
- GORDIAN has excellent "typical-case" behavior on real-world datasets
- Very good scalability w.r.t. the number of attributes
  - For Zipfian data, time complexity is polynomial in the number of attributes
- Innovation
  - Formulate key discovery as a cube computation problem
  - Interleave the computation with the discovery of non-keys
  - Novel subsumption pruning
  - Extra Apriori-like pruning
- Sampling extensions for the discovery of high-quality approximate keys

Questions?