1 Extending Dependencies with Conditions Loreto Bravo University of Edinburgh Wenfei Fan University of Edinburgh & Bell Laboratories Shuai Ma University.

Slides:



Advertisements
Similar presentations
CS848: Topics in Databases: Foundations of Query Optimization Topics covered  Introduction to description logic: Single column QL  The ALC family of.
Advertisements

Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 16 Relational Database Design Algorithms and Further Dependencies.
1 Constraint Satisfaction Problems A Quick Overview (based on AIMA book slides)
Announcements Read 6.1 – 6.3 for Wednesday Project Step 3, due now Homework 5, due Friday 10/22 Project Step 4, due Monday Research paper –List of sources.
Shuai Ma, Yang Cao, Wenfei Fan, Jinpeng Huai, Tianyu Wo Capturing Topology in Graph Pattern Matching University of Edinburgh.
New Models for Graph Pattern Matching Shuai Ma ( 马 帅 )
Mapping Data in Peer-to-Peer Systems: Semantics and Algorithmic Issues By A. Kementsietsidis, M. Arenas and R.J. Miller Presented by Md. Anisur Rahman:
Manipulating Functional Dependencies Zaki Malik September 30, 2008.
1 Conjunctions of Queries. 2 Conjunctive Queries A conjunctive query is a single Datalog rule with only non-negated atoms in the body. (Note: No negated.
Timed Automata.
Review: Constraint Satisfaction Problems How is a CSP defined? How do we solve CSPs?
The Theory of NP-Completeness
1 NP-Complete Problems. 2 We discuss some hard problems:  how hard? (computational complexity)  what makes them hard?  any solutions? Definitions 
Properties of SLUR Formulae Ondřej Čepek, Petr Kučera, Václav Vlček Charles University in Prague SOFSEM 2012 January 23, 2012.
CPSC 322, Lecture 12Slide 1 CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12 (Textbook Chpt ) January, 29, 2010.
Solving Partial Order Constraints for LPO termination.
Chapter 7: Relational Database Design. ©Silberschatz, Korth and Sudarshan7.2Database System Concepts Chapter 7: Relational Database Design First Normal.
Ryan Kinworthy 2/26/20031 Chapter 7- Local Search part 1 Ryan Kinworthy CSCE Advanced Constraint Processing.
Logic in Computer Science Transparency No Chapter 3 Propositional Logic 3.6. Propositional Resolution.
CMSC424: Database Design Instructor: Amol Deshpande
4/25/08Prof. Hilfinger CS164 Lecture 371 Global Optimization Lecture 37 (From notes by R. Bodik & G. Necula)
1 CMSC424, Spring 2005 CMSC424: Database Design Lecture 9.
Data Flow Analysis Compiler Design Nov. 8, 2005.
1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories.
Programming Language Semantics Denotational Semantics Chapter 5 Part III Based on a lecture by Martin Abadi.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
Prof. Bodik CS 164 Lecture 16, Fall Global Optimization Lecture 16.
Towards Certain Fixes with Editing Rules and Master Data Wenfei Fan Shuai Ma Nan Tang Wenyuan Yu University of Edinburgh Jianzhong Li Harbin Institute.
SAT Solver Math Foundations of Computer Science. 2 Boolean Expressions  A Boolean expression is a Boolean function  Any Boolean function can be written.
Making Pattern Queries Bounded in Big Graphs 11 Yang Cao 1,2 Wenfei Fan 1,2 Jinpeng Huai 2 Ruizhe Huang 1 1 University of Edinburgh 2 Beihang University.
©Silberschatz, Korth and Sudarshan7.1Database System Concepts Chapter 7: Relational Database Design First Normal Form Pitfalls in Relational Database Design.
1.1 Chapter 1: Introduction What is the course all about? Problems, instances and algorithms Running time v.s. computational complexity General description.
Semantic Matching Pavel Shvaiko Stanford University, October 31, 2003 Paper with Fausto Giunchiglia Research group (alphabetically ordered): Fausto Giunchiglia,
1 Dependencies for improving data quality Conditional functional dependencies (CFDs; Chapter 2) –Syntax and semantics –Static analysis: consistency and.
Virtual Network Mapping: A Graph Pattern Matching Approach Yang Cao 1,2, Wenfei Fan 1,2, Shuai Ma University of Edinburgh 2 Beihang University.
1 Propagating Functional Dependencies with Conditions Wenfei Fan University of Edinburgh & Bell Laboratories Shuai Ma University of Edinburgh Yanli HuNational.
1 The Theory of NP-Completeness 2012/11/6 P: the class of problems which can be solved by a deterministic polynomial algorithm. NP : the class of decision.
Relational Database Design by Relational Database Design by Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING.
SAT and SMT solvers Ayrat Khalimov (based on Georg Hofferek‘s slides) AKDV 2014.
February 18, 2015CS21 Lecture 181 CS21 Decidability and Tractability Lecture 18 February 18, 2015.
The Bernays-Schönfinkel Fragment of First-Order Autoepistemic Logic Peter Baumgartner MPI Informatik, Saarbrücken.
RRXS Redundancy reducing XML storage in relations O. MERT ERKUŞ A. ONUR DOĞUÇ
Hande ÇAKIN IES 503 TERM PROJECT CONSTRAINT SATISFACTION PROBLEMS.
Logical Database Design (1 of 3) John Ortiz Lecture 6Logical Database Design (1)2 Introduction  The logical design is a process of refining DB schema.
1 The Theory of NP-Completeness 2 Cook ’ s Theorem (1971) Prof. Cook Toronto U. Receiving Turing Award (1982) Discussing difficult problems: worst case.
Functional Dependencies. FarkasCSCE 5202 Reading and Exercises Database Systems- The Complete Book: Chapter 3.1, 3.2, 3.3., 3.4 Following lecture slides.
Christoph F. Eick: Functional Dependencies, BCNF, and Normalization 1 Functional Dependencies, BCNF and Normalization.
NP-Complete Problems. Running Time v.s. Input Size Concern with problems whose complexity may be described by exponential functions. Tractable problems.
Reasoning about Record Matching Rules Wenfei Fan 1, 2 Xibei Jia 1 Shuai Ma 1 1 University of Edinburgh 2 Bell Labs Jianzhong Li Harbin Institute of Technology.
1 Reasoning with Infinite stable models Piero A. Bonatti presented by Axel Polleres (IJCAI 2001,
Functional Dependencies CIS 4301 Lecture Notes Lecture 8 - 2/7/2006.
Rensselaer Polytechnic Institute CSCI-4380 – Database Systems David Goldschmidt, Ph.D.
MIS 3053 Database Design And Applications The University Of Tulsa Professor: Akhilesh Bajaj Normal Forms Lecture 1 © Akhilesh Bajaj, 2000, 2002, 2003.
Efficient Discovery of XML Data Redundancies Cong Yu and H. V. Jagadish University of Michigan, Ann Arbor - VLDB 2006, Seoul, Korea September 12 th, 2006.
1 Extending and Inferring Functional Dependencies in Schema Transformation Qi He Tok Wang Ling Dept. of Computer Science School of Computing National Univ.
CS411 Database Systems Kazuhiro Minami 04: Relational Schema Design.
Assumption-based Truth Maintenance Systems: Motivation n Problem solvers need to explore multiple contexts at the same time, instead of a single one (the.
Relational Database Design Algorithms and Further Dependencies.
Chapter 8 Relational Database Design. 2 Relational Database Design: Goals n Reduce data redundancy (undesirable replication of data values) n Minimize.
Normalization and FUNctional Dependencies. Redundancy: root of several problems with relational schemas: –redundant storage, insert/delete/update anomalies.
Normal Forms Zachary G. Ives University of Pennsylvania CIS 550 – Database & Information Systems June 18, 2016 Some slide content courtesy of Susan Davidson.
CPT-S Advanced Databases 11 Yinghui Wu EME 49 ADB(ln23)
Chapter 15 Relational Design Algorithms and Further Dependencies
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
Computing Full Disjunctions
3.1 Functional Dependencies
CSPs: Search and Arc Consistency Computer Science cpsc322, Lecture 12
Propagating Functional Dependencies with Conditions
Chen Li Information and Computer Science
Relational Database Theory
Presentation transcript:

1 Extending Dependencies with Conditions Loreto Bravo University of Edinburgh Wenfei Fan University of Edinburgh & Bell Laboratories Shuai Ma University of Edinburgh

2 Outline  Why Conditional Dependencies? Data Cleaning Schema Matching  Conditional Inclusion Dependencies (CINDs) Definition Static Analysis Satisfiability Problem Implication Problem Inference System Static Analysis of CFDs+CINDs  Satisfiability Checking Algorithms (CFDs+CINDs)  Summary and Future Work

3 Motivation  Data Cleaning Real life data is dirty! Specify consistency using integrity constraints Inconsistencies emerge as violations of constraints Constraints considered so far: traditional Functional Dependencies - FD Inclusion Dependencies - IND...  Schema matching: needed for data exchange and data integration Pairings between semantically related source schema attributes and target schema attributes expressed as inclusion dependencies (e.g., Clio)

4 Example: Amazon database  Schema: book(id, isbn, title, price, format) CD(id, title, price, genre) order(id, title, type, price, country, county) idtitletypepricecountrycounty a23H. Porterbook17.99USDL a12J. DenverCD7.94UKReyden idisbntitleprice a23b32H. Porter17.99 a56b65Snow white7.94 idtitlepricegenre a12J. Denver17.99country a56Snow White7.94a-book order book CD

5 Data cleaning with inclusion dependencies idisbntitlepriceformat a23b32H. Porter17.99Hard covert3 a56b65Snow White17.94audiot4 book order idtitletypepricecountrycounty a23H. Porterbook17.99USDLt1 a12J. DenverCD7.94UKReydent2  Example Inclusion dependency: book[id, title, price]  order[id, title, price]  Definition of Inclusion Dependencies (INDs) R1[X]  R2[Y], for any tuple t1 in R1, there must exist a tuple t2 in R2, such that t2[Y]=t1[X]

6 Data cleaning meets conditions idisbntitlepriceformat a23b32H. Porter17.99Hard covert3 a56b65Snow White17.94audiot4 book order idtitletypepricecountrycounty a23H. Porterbook17.99USDLt1 a12J. DenverCD7.94UKReydent2 This inclusion dependency does not make sense!  How to express? Every book in order table must also appear in book table  Traditional inclusion dependencies: order[id, title, price]  book[id, title, price]

7 Data cleaning meets conditions idisbntitlepriceformat a23b32H. Porter17.99Hard covert3 a56b65Snow White17.94audiot4 book order idtitletypepricecountrycounty a23H. Porterbook17.99USDLt1 a12J. DenverCD7.94UKReydent2  Conditional inclusion dependency: order[id, title, price, type =‘ book’ ]  book[id, title, price]

8 Schema matching with inclusion dependencies  Traditional inclusion dependencies: book[id, title, price]  order[id, title, price] CD[id, title, price]  order[id, title, price]  Schema Matching: Pairings between semantically related source schema attributes and target schema attributes, which are de facto inclusion dependencies from source to target (e.g., Clio) idtitletypepricecountrycounty idisbntitlepriceidtitlepricegenre order bookCD

9 Schema matching meets conditions  Traditional inclusion dependencies: order[id, title, price]  book[id, title, price] order[id, title, price]  CD[id, title, price] These inclusion dependencies do not make sense! idtitletypepricecountrycounty idisbntitlepriceidtitlepricegenre order bookCD

10 Schema matching meets conditions idtitletypepricecountrycounty idisbntitlepriceidtitlepricegenre Conditional inclusion dependencies: order[id, title, price; type =‘ book’]  book[id, title, price] order[id, title, price; type = ‘CD’]  CD[id, title, price] The constraints do not hold on the entire order table  order[id, title, price]  book[id, title, price] holds only if type = ‘book’  order[id, title, price]  CD[id, title, price] holds only if type = ‘CD’ order bookCD

11  (R1[X; Xp]  R2[Y; Yp], Tp): R1[X]  R2[Y]: embedded traditional IND from R1 to R2 attributes: X  Xp  Y  Yp Tp: a pattern tableau tuples in Tp consist of constants and unnamed variable _  Example : CD[ id, title, price; genre = ‘a-book’]  book[ id, title, price; format = ‘audio’]  Corresponding CIND : (CD[id, title, price; genre]  book[id, title, price; format], Tp) Conditional Inclusion Dependencies (CINDs) idtitlepricegenreidtitlepriceformat ___a-book___audio Tp

12 R1[X]  R2[Y]  X: [A1, …, An]  Y : [B1, …, Bn] As a CIND: (R1[X; nil]  R2[Y; nil], Tp)  pattern tableau Tp: a single tuple consisting of _ only CINDs subsume traditional INDs INDs as a special case of CINDs A1…AnB1…Bn ______

13 Static Analysis of CINDs  Satisfiability problem INPUT: Give a set Σ of constraints Question: Does there exist a nonempty instance I satisfying Σ?  Whether Σ itself is dirty or not  For INDs the problem is trivially true  For CFDs (to be seen shortly) it is NP-complete  Good news for CINDs Proposition: Any set of CINDs is always satisfiable I╠ ΣI╠ Σ

14 Static Analysis of CINDs  Implication problem INPUT: set Σ of constraints and a single constraint φ Question: for each instance I that satisfies Σ, does I also satisfy φ ?  Remove redundant constraints  PSPACE-complete for traditional inclusion dependencies Theorem. Complexity bounds for CINDs Presence of constants PSPACE-complete in the absence of finite domain attributes Good news – The same as INDs EXPTIME-complete in the general setting Σ╠ φ

15 Finite axiomatizability of CINDs 1-Reflexivity 2-Projection and Permutation 3-Transitivity IND Counterparts Finite Domain Attributes Sound and Complete in the Absence of Finite Attributes Theorem. The above eight rules constitute a sound and complete inference system for implication analysis of CINDs 7-F-reduction 8-F-upgrade 4-Downgrading 5-Augmentation 6-Reduction  φ is implied by Σ iff it can be computed by the inference system INDs have such Inference System Good news: CINDs too!

16 Axioms for CINDs: finite domain reduction  New CINDs can be inferred by axioms  (R1[X; A]  R2[Y; Yp], Tp), dom(A) = { true, false} XAYYp _true_dtp1 _false_dtp2 then (R1[X; Xp]  R2[Y; Yp], tp), XYYp __d Tp

17 Static analyses: CIND vs. IND satisfiabilityimplicationfinite axiom’ty CINDO(1)EXPTIME-completeyes INDO(1)PSPACE-completeyes In the absence of finite-domain attributes: satisfiabilityimplicationfinite axiom’ty CINDO(1)PSPACE-completeyes INDO(1)PSPACE-completeyes General setting with finite-domain attributes: CINDs retain most complexity bounds of their traditional counterpart

18 An extension of traditional FDs Example: cust([country = 44, zip]  [street]) Conditional Functional Dependencies (CFDs) Namecountryzipstreet Bob Tree Ave. Joe Tree Ave. Ben Elem Str. Jim Oak Ave.

19 Static analyses: CFD + CIND vs. FD + IND satisfiabilityimplicationfinite axiom’ty CFD + CINDundecidable No FD + INDO(1)undecidableNo CINDs and CFDs properly subsume FDs and INDs Both the satisfiability analysis and implication analysis are beyond reach in practice This calls for effective heuristic methods

20 Satisfiability Checking Algorithms  Before using a set of CINDs for data cleaning or schema matching we need to make sure that they make sense (that they are clean)  We need to find heuristics to solve the satisfiability problem Input: A set Σ of CFDs and CINDs Output: true / false  We modified and extended techniques used for FDs and INDs For example: Chase, to build a “canonical” witness instance, i.e., I ╠ Σ

21 Chase CFDs+CINDs – Terminate case   = {  1, ψ1}  1=(R2(G → H), (_ || c))- CFD ψ1=(R2[G; nil]  R1[F; nil], (_ || _) ) - CIND EFGH V G1 V H1 R1R2 EFGH V E1 V G1 c R1R2 EFGH V G1 c R1R2 Done! 11 ψ1ψ1

22 Chase CFDs+CINDs – Loop case   = {  1, ψ1, ψ2}  1=(R2(G → H), (_ || c))- CFD ψ1=(R2[G; nil]  R1[F; nil], (_ || _) ) - CIND ψ2=(R1[E; nil]  R2[G; nil], (_ || _) ) EFGH V G1 c R1R2 EFGH V E1 V G1 c R1R2 EFGH V E1 V G1 c V E1 c EFGH V G1 c V E2 V E1 c Infinite application of ψ1 and ψ2 Loop! ψ1ψ1 ψ2ψ2ψ1ψ1 ψ2ψ2

23 More about the checking algorithms  Simplification of the chase: The fresh variables are taken from a finite set We avoid the infinite loop of the chase by limiting the size of the witness instance  If the algorithm returns: True: we know the constraints are satisfiable False: there may be false negative answers – the problem is undecidable and the best we can get is a heuristic  In order to improve accuracy of the algorithm we use: Optimization techniques

24 Example optimization techniques Unsatisfiability Propagation  IF CFDs on R4 is unsatisfiable There is a CIND Ψ4: (R3[X; nil]  R4[Y; Yp], tp)  THEN R3 must be empty! R3 R4 R1R2R5 Ψ4Ψ4 Ψ2, Ψ3 Ψ5Ψ5 Ψ1Ψ1 R6 Node( Relation): related to CFDs Edge: related to CINDS

25 CFDs+CINDs satisfiability checking - experiments  Experimental Settings Accuracy tested for satisfiable sets of CFDs and CINDs The data sets where generated by ensuring the existence of a witness database that satisfies them Scalability tested for random sets of CFDs and CINDs Each experiment was run 6 times and the average is reported # of constraints: up to 20,000 # of relations: up to 100 Ratio of finite attributes: up to 25% An Intel Pentium D 3.00GHz with 1GB memory

26 CFDs+CINDs satisfiability checking - experiments Accuracy testing is based satisfiable sets of CFDs and CINDs Algorithm: 1. Chase : modified version Chase 2. DG+Chase: graph optimization based Chase

27 CFDs+CINDs satisfiability checking - experiments Scalability testing is based on random sets of CFDs and CINDs

28 Summary and future work  New constraints: conditional inclusion dependencies for both data cleaning and schema matching complexity bounds of satisfiability and implication analyses a sound and complete inference system  Complexity bounds for CFDs and CINDs taken together  Heuristic satisfiability checking algorithms for CFDs and CINDs  Open research issues: Deriving schema mapping from the constraints Repairing dirty data based on CFDs + CINDs Discovering CFDs + CINDs Towards a practical method for data cleaning and schema matching