Chen Li Information and Computer Science

Presentation transcript:

Describing and Utilizing Constraints to Answer Queries in Data-Integration Systems Chen Li Information and Computer Science University of California, Irvine

Constraints in Data Integration
Sources of Orange County (OC) housing information, integrated by a mediator:
  Mediator: house(street, zip, price, sqft, year)
  Sources: s1(street, zip, price, sqft), s2(street, price, year)
Constraints: every s1 price is at least $250K; every s2 price is at least $280K.
Query: “Find Orange County houses cheaper than $200K.”
Answer: empty, and the mediator can determine this without checking the source table instances.
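The mediator's reasoning in this example can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SOURCE_MIN_PRICE` and `sources_to_contact` are hypothetical, and each source is assumed to advertise only a lower bound on price.

```python
# Hypothetical lower bounds on price (in dollars) advertised by each source.
SOURCE_MIN_PRICE = {"s1": 250_000, "s2": 280_000}


def sources_to_contact(max_price_exclusive, min_price_by_source=SOURCE_MIN_PRICE):
    """Return the sources that could hold a house with price < max_price_exclusive.

    A source is skipped when its lower bound already contradicts the query
    condition, so no source table instance has to be inspected.
    """
    return [
        name
        for name, min_price in sorted(min_price_by_source.items())
        if min_price < max_price_exclusive
    ]


# "Find Orange County houses cheaper than $200K": no source can contribute.
cheap = sources_to_contact(200_000)
# A query for houses cheaper than $260K only needs to contact s1.
mid = sources_to_contact(260_000)
```

The check is a satisfiability test on two comparisons: `price >= bound` and `price < limit` can hold together only when `bound < limit`.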

Another Example
  Mediator: house(street, zip, price, sqft, year)
  Sources: s1(street, zip, price, sqft), s2(street, price, year)
Constraint: OC houses have a unique street.
Query: “Find sqft and year of OC houses.”
Plan: join the two source relations on their “street” attributes.
The plan is valid only if the constraint holds.
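A sketch of the join plan in Python follows. The function name `join_on_street` and the sample tuples are illustrative assumptions; the point is that the plan pairs tuples purely by street, so its output is trustworthy only if street identifies one house across both sources.

```python
def join_on_street(s1_rows, s2_rows):
    """Join s1(street, zip, price, sqft) with s2(street, price, year) on street.

    Returns (street, sqft, year) triples. The result is only meaningful if
    street identifies a single house across both sources.
    """
    years_by_street = {}
    for street, _price, year in s2_rows:
        years_by_street.setdefault(street, []).append(year)
    return [
        (street, sqft, year)
        for street, _zip, _price, sqft in s1_rows
        for year in years_by_street.get(street, [])
    ]


# Illustrative tuples: both sources list a house on "main".
s1_rows = [("main", "92697", 300_000, 2100)]
s2_rows = [("main", 380_000, 1993)]
plan_result = join_on_street(s1_rows, s2_rows)
```

With these tuples the plan pairs sqft 2100 with year 1993, although the two sources report different prices. If street were not unique, the two tuples could describe different houses and the answer would be wrong.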

Importance of Constraints
They express rich information about the sources.
They can be used to help answer queries.

Our contributions
There has been a lot of work on how to use constraints in query optimization, e.g., [Hsu, Knoblock, 2000], [Godfrey, Gryz, Zuzarte, 2001], …
To make this optimization possible, we primarily study:
  how to describe constraints; and
  how to manipulate constraints between “local” and “global”.

Outline
  Constraints motivation
  Local constraints
  Global constraints
  Conclusions and open problems

Describing various constraints
  Mediator: house(street, zip, price, sqft, year)
  Sources: s1(street, zip, price, sqft), s2(street, price, year)
Examples:
  C1: s1.price >= $250K
  C2: s1.street is unique
  C3: s2.year < 1995, s2.price >= $280K
  C4: street is unique for all OC houses
  C5: all houses in the system cost at least $250K
Where to put them?
  Some constraints can be described “locally” at the sources (C1, C2, C3).
  Others are more suitable to be described “globally” (C4, C5).

Local constraints: described at sources
Designed by individual sources.
Common local constraints:
  Range constraints: “price >= $250K”
  Enumeration constraints: “state in {CA, NV, AZ, OR}”
  Functional dependencies: “(street, zip) → (price, year)”
  Key constraints: “(street) is a primary key of the house predicate”
  Foreign-key constraints among tables at a source
  Inclusions
  …
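As a rough illustration of how a source could validate some of these local constraints against its own table, here is a small Python sketch. The function names, the dictionary-based row format, and the sample rows are all assumptions made for this example.

```python
def satisfies_range(rows, attr, lower_bound):
    """Range constraint: every row has row[attr] >= lower_bound."""
    return all(row[attr] >= lower_bound for row in rows)


def satisfies_enumeration(rows, attr, allowed):
    """Enumeration constraint: every row[attr] is one of the allowed values."""
    return all(row[attr] in allowed for row in rows)


def satisfies_fd(rows, lhs, rhs):
    """Functional dependency lhs -> rhs: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical source table.
rows = [
    {"street": "main", "zip": "92697", "state": "CA", "price": 300_000, "year": 1993},
    {"street": "oak", "zip": "92697", "state": "CA", "price": 260_000, "year": 1990},
]
range_ok = satisfies_range(rows, "price", 250_000)
enum_ok = satisfies_enumeration(rows, "state", {"CA", "NV", "AZ", "OR"})
fd_ok = satisfies_fd(rows, ("street", "zip"), ("price", "year"))
```

A key constraint can be checked with the same `satisfies_fd` helper by putting the key attributes on the left and all remaining attributes on the right.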

Other ways to describe local constraints
Local constraints could be described using traditional source-view definitions.
E.g., in the LAV (local-as-view) approach to data integration, “C1: s1.price >= $250K” can be described as:
  s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year), price >= 250K
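To make the view definition concrete, the following Python sketch evaluates the rule body over a made-up instance of the global `house` relation. The function name `s1_view` and the sample tuples are assumptions; in a real LAV setting the global instance is not materialized, so this only illustrates what the rule says.

```python
def s1_view(house_rows):
    """Evaluate the view body:
    s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year), price >= 250K
    """
    return [
        (street, zip_code, price, sqft)
        for street, zip_code, price, sqft, _year in house_rows
        if price >= 250_000
    ]


# Hypothetical global instance, used only for illustration.
house_rows = [
    ("main", "92697", 300_000, 2100, 1993),
    ("elm", "92612", 200_000, 1500, 1988),
]
s1_tuples = s1_view(house_rows)
```

Because the comparison is part of the rule body, every tuple the view produces satisfies C1 by construction.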

Then why not use view/query languages to describe local constraints?
View/query languages might not be expressive enough, e.g., for functional-dependency constraints.
Query answering could become complicated:
  “Conjunctive-query rewriting” is already NP-hard.
  Arithmetic comparisons can even require recursive queries.
Describing constraints separately has advantages:
  More expressive: can capture common knowledge.
  We care more about simple, common constraints.
  Easier to understand and to reason about.

Limitations of local constraints
  Mediator: house(street, zip, price, sqft, year)
  Sources: s1(street, zip, price, sqft), s2(street, price, year)
C4: street is unique for all OC houses.
C4 cannot be described using two local constraints, “street is a key at s1” and “street is a key at s2”.
That would be wrong, since local constraints cannot restrict two relations together.
In particular, the following is still allowed:
  s1: (main, 92697, $300K, 2100)
  s2: (main, $380K, 1993)
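The counterexample can be replayed in a few lines of Python. The helper names are hypothetical, and the sketch assumes that two tuples with the same street but different prices must describe different houses.

```python
def is_key(rows, street_index=0):
    """Local check: no street value occurs twice within one source."""
    streets = [row[street_index] for row in rows]
    return len(streets) == len(set(streets))


def conflicting_streets(s1_rows, s2_rows):
    """Streets listed at both sources with different prices."""
    s1_price = {street: price for street, _zip, price, _sqft in s1_rows}
    return sorted(
        street
        for street, price, _year in s2_rows
        if street in s1_price and s1_price[street] != price
    )


# The tuples from the slide, with prices in dollars.
s1_rows = [("main", "92697", 300_000, 2100)]
s2_rows = [("main", 380_000, 1993)]

local_keys_hold = is_key(s1_rows) and is_key(s2_rows)
conflicts = conflicting_streets(s1_rows, s2_rows)
```

Both local key checks pass, yet `conflicts` is non-empty: the violation is visible only when the two relations are examined together.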

Global constraints (GC)
Some constraints are more suitable to be described “globally”.
Examples:
  C4: street is unique for all OC houses
  C5: all houses in the system cost at least $250K

Describing global constraints
We might need more expressive languages:
  “street” is unique for houses at s1 and s2.
  We need to formally define such a “global key”.
Consider special cases:
  We have source schemas and a mediator schema.
  Mappings exist between them, e.g., LAV, GAV, GLAV, or a more general mapping language.
  In this case, global constraints could be easier to describe.

Other advantages of GC
They “summarize” source contents:
  Mediators can check queries against these conditions before checking individual sources (and thus could avoid unnecessary source checking).
  Users get an overview of the data (making it easier to ask queries).
They give the “outside world” a view of the source contents:
  Especially useful when the mediation system is used as a component of a larger system.
  E.g., peer-based mediation systems (Raccoon, Piazza, Hyperion, PeerDB, …) or hierarchies of mediators.

Two kinds of global constraints
  General global constraints
  Source-derived global constraints

General global constraints
Conditions that should be satisfied by any database instance of a global predicate, e.g., C4: street is unique for all OC houses.
C4 can be represented as a general global constraint on the house predicate: (street) forms a key of the house predicate.
Introduced during system design to capture the semantics of the application domain.
Future sources are expected to satisfy this constraint.
Thus, we may need to check whether it is satisfied by existing and newly arriving sources.

Source-derived global constraints (Example)
Local constraints:
  C1: s1.price >= $250K
  C3: s2.year < 1995, s2.price >= $280K
⇒ Global constraint C5: house.price >= $250K
C5 is true only while the system has just these two sources.
In general, there could be a house (at neither s1 nor s2) that costs less than $250K.
We do not care about such houses and treat them as if they did not exist.
When future sources come in, we need to check and update this constraint.
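For lower bounds on price, the derivation amounts to taking the weakest local bound. The Python sketch below uses the bounds from C1 and C3 as given in the constraint list (s1.price >= $250K, s2.price >= $280K); the function name and the extra source `s3` are hypothetical and serve only to show the recalculation step.

```python
def derive_global_min_price(local_min_prices):
    """Return the tightest lower bound on house.price implied by all sources.

    Every house in the system comes from some source, so its price is at
    least the smallest of the per-source lower bounds. Returns None when
    there are no sources, since nothing can be derived.
    """
    if not local_min_prices:
        return None
    return min(local_min_prices.values())


# Lower bounds from C1 and C3, in dollars.
bounds = {"s1": 250_000, "s2": 280_000}
c5_bound = derive_global_min_price(bounds)

# A hypothetical new source with a weaker bound joins the system,
# so the source-derived constraint has to be recomputed.
bounds_with_new_source = {**bounds, "s3": 180_000}
c5_bound_after = derive_global_min_price(bounds_with_new_source)
```

The derived bound weakens as soon as a source with a lower bound joins, which is why this kind of constraint has to be rechecked when the set of sources changes.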

Source-derived global constraints
A source-derived global constraint is a condition on global predicates.
It must be satisfied by any derived database D of any source view instances satisfying those local constraints.
Derived database D: the certain tuples that can be determined from the view definitions.
  D depends on the kind of view definition (LAV, GAV, …), e.g.:
  s1(street, zip, price, sqft) :- house(street, zip, price, sqft, year), price >= 250K

Computing Source-derived GC
Input:
  Sources with their local constraints
  Mappings between the source schemas and the mediator schema
Output: source-derived GC on the mediator schema
We have preliminary results for the LAV approach.

Comparisons
                         General GC                                    Source-derived GC
When does it hold?       Holds in general                              Holds for the system with its current sources
When new sources join    Assumed to satisfy the GC; can be validated   May violate the GC; it must be recalculated

Conclusions
Describing constraints in data-integration systems is important.
We classified different types of constraints:
  Local constraints
  Global constraints: general and source-derived
We showed their advantages and limitations.
We studied how to manipulate these constraints (e.g., compute source-derived global constraints from local constraints).

Open problems
Expressive languages to describe cross-source constraints:
  “Tuple-generating dependencies”?
  Other simpler but powerful languages
Manipulating the constraints in general:
  LAV, GAV, and more expressive mappings between sources and mediators
Efficient techniques for testing and recomputing global constraints as sources change

Work conducted in the RACCOON Project on Peer-based Data Integration and Sharing, UC Irvine

Acknowledgements We thank Jia Li for her help on the preparation of these slides.