Tsimmis: The Stanford-IBM Manager of Multiple Information Sources

Presentation transcript:

2005Integration/tsimmis1 Tsimmis: The Stanford-IBM Manager of Multiple Information Sources
- Overview
- Mediator specification
- A reduction to Datalog
- Using object id's for information fusion
- Querying and query processing

2005Integration/tsimmis2 Overview
- A GAV system: global data is defined in terms of the sources
- Simple, schema-less (semi-structured) data model
  - Self-describing data, a precursor of XML
- Supports a notion of object identity, used for proper fusion of data from multiple sources
- The relationship between the sources (wrappers) and the mediator is specified in a declarative language, a variant of Datalog
- Query execution planning and optimization are tailored to the integration environment
- Semi-declarative mechanism for wrapper construction

2005Integration/tsimmis3 Data model: OEM (Object Exchange Model)
- Each piece of data describes itself: no schema, no fixed structure
- An object: <o-id, label, type, value>, where:
  - label: a description of what this data is
  - type: the type of the value; can be atomic or set
  - value: the value of the object
  - o-id: an identifier that uniquely identifies the object
- Example (atomic types):

2005Integration/tsimmis4 Example (with a set type):
  <&e1, employee, set {&f1, &l1, &t1, &rep1}>
- A set type is used to represent an object with sub-objects
  - Here, a record structure
  - In other cases, the sets may be real sets, with the same label repeating many times
- Note: there is no order on the oid's in a set (in contrast to XML)
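The OEM quadruple can be sketched in Python as follows. This is only an illustration, not the TSIMMIS implementation: the class name `OEMObject`, the helper `as_record`, and all atomic values are hypothetical. The labels are assumed to follow the columns of the CS `employee` table.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Union


@dataclass(frozen=True)
class OEMObject:
    oid: str                                # e.g. "&e1"
    label: str                              # describes what the data is
    type: str                               # "string", "integer", ... or "set"
    value: Union[str, int, FrozenSet[str]]  # atomic value, or oids of sub-objects


# Hypothetical values; only the oids come from the slide.
objects = [
    OEMObject("&f1", "first_name", "string", "Joe"),
    OEMObject("&l1", "last_name", "string", "Chung"),
    OEMObject("&t1", "title", "string", "professor"),
    OEMObject("&rep1", "reports_to", "string", "Pat Smith"),
    OEMObject("&e1", "employee", "set", frozenset({"&f1", "&l1", "&t1", "&rep1"})),
]
store: Dict[str, OEMObject] = {o.oid: o for o in objects}


def as_record(oid: str) -> dict:
    """View a set object as a record; assumes labels do not repeat inside it."""
    parent = store[oid]
    if parent.type != "set":
        raise ValueError(f"{oid} is atomic, not a set")
    # The sub-objects are unordered, so the result is a plain label -> value map.
    return {store[child].label: store[child].value for child in parent.value}
```

The record view only works for sets used as record structures; a "real" set with a repeating label would need a list of (label, value) pairs instead.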

2005Integration/tsimmis5 On object id's:
- Usually temporary id's assigned during query processing
  - Used for relating an object to its sub-objects
  - Valid only for the duration of a query
  - Of no interest to the user
- Can be used by the mediator writer for specifying data fusion (later)
Notation (for queries, examples):
- Types are usually omitted, since they can be inferred from the data; hence objects are written as triples <o-id, label, value>
- When o-id's are irrelevant, they are omitted as well: <label, value>

2005Integration/tsimmis6 Interim summary:
- Relational data can be exported in this format
- The format allows records that share some common fields, while each may have extra fields (semi-structured data), like XML
- The lack of any schema seems, in retrospect, a disadvantage

2005Integration/tsimmis7 Mediator specification
- Each source is assumed to be wrapped by a wrapper that exports data in the OEM format
- A mediator specification determines how source data is imported to the mediator and combined with data from other sources
- The language MSL (Mediator Specification Language) is an adaptation of non-recursive Datalog to this data model and to the needs of integration

2005Integration/tsimmis8 An example: two sources export data (via wrappers) on university people (both related to the CS dept):
- CS: a relational source with two tables:
    employee(first_name, last_name, title, reports_to)
    student(first_name, last_name, year)
- Whois: a university facility that contains information about employees and students; usually name and dept are given, but the fields change between records

2005Integration/tsimmis9 Some data from CS:
  <&e1, employee, set {&f1, &l1, &t1, &rep1}>
  <&e2, employee, set {&f2, &l2, &t2}>
  <&s3, student, set {&f2, &l3, &y3}>

2005Integration/tsimmis10 Some data from whois:
  <&p1, person, set {&n1, &d1, &rel1, &em1}>
  <&p2, person, set {&n2, &d2, &rel2, &y2}>

2005Integration/tsimmis11 A comparison of the sources:
- Domain mismatch: different representations for name in the two sources (the resolution of such issues is the responsibility of the mediator)
- Schematic discrepancy: employee and student are relation names in CS, but data in whois
- In one source (whois) there is no fixed schema: different objects may have different fields; still, in some cases we would like to import all data about a person without knowing what data exists
- The sources (e.g. CS) may evolve; we would like the mediator spec to be insensitive to most changes

2005Integration/tsimmis12 Specification of a mediator med (by examples)
(MS0): Show in the mediator the names and relationships of CS people that exist in both sources:
  <cs_person, {<name, N>, <rel, R>}>@med :-
      <person, {<name, N>, <dept, 'CS'>, <relation, R>}>@whois,
      <R, {<first_name, FN>, <last_name, LN>}>@CS,
      decompose_name(N, LN, FN)
External:
  decompose_name(string, string, string)(b,f,f) ⇒ name_to_lnfn
  decompose_name(string, string, string)(f,b,b) ⇒ lnfn_to_name
Explanation:
- Capital letters denote variables
- External: a (conversion) function, implemented in some programming language
- ⇒ means "implemented by"; b = bound (in), f = free (out)
- The o-id and type were omitted!
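The two external declarations say that decompose_name is backed by different functions depending on which arguments are bound. A minimal Python sketch of that dispatch is given below. The "First Last" name format is an assumption (the slides do not say how the two sources represent names), and using `None` to mark a free argument is an illustrative convention.

```python
def name_to_lnfn(name):
    """(b,f,f): split a full name into (last_name, first_name).

    Assumes the hypothetical format "First Last".
    """
    first, _, last = name.rpartition(" ")
    return last, first


def lnfn_to_name(ln, fn):
    """(f,b,b): build the full name from last and first name."""
    return f"{fn} {ln}"


def decompose_name(n=None, ln=None, fn=None):
    """Dispatch on the binding pattern; None marks a free argument."""
    if n is not None and ln is None and fn is None:      # (b,f,f)
        ln, fn = name_to_lnfn(n)
        return n, ln, fn
    if n is None and ln is not None and fn is not None:  # (f,b,b)
        return lnfn_to_name(ln, fn), ln, fn
    raise ValueError("no implementation declared for this binding pattern")
```

For example, `decompose_name(n="Joe Chung")` fills in LN and FN, while `decompose_name(ln="Chung", fn="Joe")` fills in N; any other binding pattern is rejected because no implementation was declared for it.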

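2005Integration/tsimmis13
  <cs_person, {<name, N>, <rel, R>}>@med :-
      <person, {<name, N>, <dept, 'CS'>, <relation, R>}>@whois,
      <R, {<first_name, FN>, <last_name, LN>}>@CS,
      decompose_name(N, LN, FN)
More explanation:
- { , } represent sets
- In the body: <person, {<name, N>, <dept, 'CS'>, <relation, R>}> means that there is an object with
  - label person,
  - a value that is a set containing at least objects with labels name, dept, relation, and possibly more
- In the head: these are the elements that go into the mediated object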

2005Integration/tsimmis14 How are the problems addressed?
- Domain mismatch (different representations for names): use conversion functions
- Schematic discrepancy (employee, student are relation names in CS, data in whois): variables can range over both data and labels (see the variable R in the rule); the same is now possible in XQuery
- In one source (whois) there is no fixed schema: the same fields will be retrieved
- The sources may evolve, and we would like the mediator spec to be insensitive to most changes: same as the previous point

2005Integration/tsimmis15 (MS1): similar, but now we want all fields from both sources
  <cs_person, {<name, N>, <rel, R>, Rest1, Rest2}>@med :-
      <person, {<name, N>, <dept, 'CS'>, <relation, R> | Rest1}>@whois,
      <R, {<first_name, FN>, <last_name, LN> | Rest2}>@CS,
      decompose_name(N, LN, FN)
External:
  decompose_name(string, string, string)(b,f,f) ⇒ name_to_lnfn
  decompose_name(string, string, string)(f,b,b) ⇒ lnfn_to_name
Explanation:
- Rest variables are distinguished in the body by occurring after |
- They are bound to the fields of the object that are not mentioned explicitly (~ set difference)
- The language is called MSL (Mediator Specification Language)

2005Integration/tsimmis16 An object generated by the mediator for MS1 in med:
  <&cp1, cs_person, set {&mn1, &mrel1, &t1, &rep1, &em1}>
Note: this is a virtual object; it is materialized only for user queries

2005Integration/tsimmis17 Q: How is it generated?
- Match each body atom (a pattern) with objects in the specified source
  - If the label is a constant, it can match only this constant
  - Same for the value
  - A variable matches any label/value
  - {…} matches only sets
  - Rest matches any components not matched explicitly
- A successful match binds the matched variables
- This is essentially a (flexible) notion of a valuation from a query body to data
- o-ids for the result are generated by med, since here they are not specified explicitly
Q: Can you generate a few other objects from the given data?
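The matching rules listed on this slide can be sketched as a small backtracking matcher. This is an illustration under simplifying assumptions, not the TSIMMIS algorithm. Objects are nested `(label, value)` pairs with o-ids and types omitted, a set value is a Python list, a pattern is `(label, value)` or `(label, subpatterns, rest_variable)`, and the class `Var` and all data values are hypothetical.

```python
class Var:
    """A pattern variable, e.g. Var("N")."""
    def __init__(self, name):
        self.name = name


def unify(term, value, env):
    """Match a constant or a variable against a value; return the new env or None."""
    if isinstance(term, Var):
        if term.name in env:                      # already bound: values must agree
            return env if env[term.name] == value else None
        return {**env, term.name: value}          # a variable matches anything
    return env if term == value else None         # a constant matches only itself


def match(pattern, obj, env):
    """Yield every binding under which the pattern matches the object."""
    plabel, pvalue, *rest = pattern
    label, value = obj
    env = unify(plabel, label, env)
    if env is None:
        return
    if isinstance(pvalue, list):                  # {...} matches only sets
        if isinstance(value, list):
            yield from match_set(pvalue, value, env, rest[0] if rest else None)
    else:
        env = unify(pvalue, value, env)
        if env is not None:
            yield env


def match_set(subpatterns, members, env, rest_var):
    """Match each sub-pattern against a distinct member; extra members are allowed."""
    if not subpatterns:
        if rest_var is not None:                  # Rest gets the unmatched members
            env = unify(rest_var, members, env)
        if env is not None:
            yield env
        return
    first, *others = subpatterns
    for i, member in enumerate(members):
        remaining = members[:i] + members[i + 1:]
        for env2 in match(first, member, env):
            yield from match_set(others, remaining, env2, rest_var)


# Hypothetical whois object and a pattern with variables N, R and Rest1.
whois_person = ("person", [("name", "Joe Chung"), ("dept", "CS"),
                           ("relation", "employee"), ("e_mail", "chung@cs")])
pattern = ("person", [("name", Var("N")), ("dept", "CS"),
                      ("relation", Var("R"))], Var("Rest1"))
bindings = list(match(pattern, whois_person, {}))
```

Here `bindings` holds a single valuation that binds `N`, `R` and `Rest1`; with several matching objects in a source, each successful match would contribute its own valuation.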

2005Integration/tsimmis18 How are the problems addressed?
- In one source (whois) there is no fixed schema: different objects may have different fields, but we want all fields
- The sources may evolve; we would like the mediator spec to be insensitive to most changes
- In both cases: the variables Rest1, Rest2 are bound to the set of all the sub-objects not explicitly specified

2005Integration/tsimmis19 The rest variables can be removed:
  …@med :-
      …@whois, R1-l notin {name, dept, relation},
      …@CS, R2-l notin {first_name, last_name},
      decompose_name(N, LN, FN)
External: ………
Note: notin can be replaced here by a conjunction of neq

2005Integration/tsimmis20 A reduction to Datalog
I. Model each source as a relational database:
- top(src, oid): the object identified by oid is top-level in src
- object(src, oid, lab, val): the object identified by oid exists in src, has label lab and atomic value val
- object(src, oid, lab, set): the object identified by oid exists in src, has label lab and a set value (set is here a special constant)
- member(src, o1, o2): in src, o1 has a set value, and o2 is in the set
The original OO database is essentially a graph of objects and relationships; the above captures this graph relationally

2005Integration/tsimmis21 Some obvious integrity constraints:
- If member(src, o1, o2) holds, then object(src, o1, lab1, set) and object(src, o2, lab2, v2) also hold for some lab1, lab2, v2
- Any more?
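As a rough sketch of step I, the function below flattens nested objects into the three relations and then checks the integrity constraint stated above. The nested `(label, value)` input format, the generated oids `&o1`, `&o2`, …, and the sample data are assumptions made for illustration.

```python
from itertools import count

SET = "set"  # the special constant; assumes no atomic value equals it


def flatten(src, roots):
    """Return the top, object and member relations for nested (label, value) objects."""
    top, obj, member = set(), set(), set()
    fresh = count(1)

    def visit(node):
        oid = f"&o{next(fresh)}"          # hypothetical oid scheme
        label, value = node
        if isinstance(value, list):       # set value: record it, then link children
            obj.add((src, oid, label, SET))
            for child in value:
                member.add((src, oid, visit(child)))
        else:                             # atomic value
            obj.add((src, oid, label, value))
        return oid

    for root in roots:
        top.add((src, visit(root)))
    return top, obj, member


def members_are_objects(obj, member):
    """member(src, o1, o2) requires o1 to be a set object and o2 to be an object."""
    set_oids = {(s, o) for (s, o, _, v) in obj if v == SET}
    all_oids = {(s, o) for (s, o, _, _) in obj}
    return all((s, o1) in set_oids and (s, o2) in all_oids for (s, o1, o2) in member)


top, obj, member = flatten("whois", [
    ("person", [("name", "Joe Chung"), ("dept", "CS"), ("relation", "employee")]),
])
```

Each nested object becomes one `object` tuple, and each parent-child edge of the object graph becomes one `member` tuple.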

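2005Integration/tsimmis22 II. Translate MSL rules to use these relations:
(MS0)
  <cs_person, {<name, N>, <rel, R>}>@med :-
      (1) <person, {<name, N>, <dept, 'CS'>, <relation, R>}>@whois,
      (2) <R, {<first_name, FN>, <last_name, LN>}>@CS,
      (3) decompose_name(N, LN, FN)
  ……..
(1) ⇒ top(whois, &P1), object(whois, &P1, person, set),
      object(whois, &N1, name, N), object(whois, &D1, dept, 'CS'), object(whois, &Rel1, relation, R),
      member(whois, &P1, &N1), member(whois, &P1, &D1), member(whois, &P1, &Rel1)
(2) similar: top(CS, &P2), ….
(3) what is your suggestion?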

2005Integration/tsimmis23
  <cs_person, {<name, N>, <rel, R>}>@med :-
      (1) <person, {<name, N>, <dept, 'CS'>, <relation, R>}>@whois,
      (2) <R, {<first_name, FN>, <last_name, LN>}>@CS,
      (3) decompose_name(N, LN, FN)
  ……..
head ⇒ top(med, f(&P1, &P2)), object(med, f(&P1, &P2), cs_person, set), ……..
- Here f is a new function symbol (a 'syntactic' function)
- The term f(&P1, &P2) states that the new object id is determined by those of the two objects retrieved from whois and CS
- But it seems we generate a multi-head rule?

2005Integration/tsimmis24 There is no inherent difficulty with multi-head rules:
- We can introduce an intermediate relation binds(..) to collect all the bindings from the (translation of the) body
- Then, for each of the components of the head, we add a new rule with that one atom in the head and binds(..) as the body
- The term f(&P1, &P2) ensures that the generated facts refer to the same object

2005Integration/tsimmis25 Using object id's for information fusion
- So far, oid's were (almost) an implementation feature, enabling references to sub-objects
- But they can be used logically: if several rules use the same oid in the head, then the information produced by the rules is fused together into a collection of sub-objects for a unique object (id-based object fusion)
- For this to work, the oid in the head must be a function of some of the variables (or constants) in the body (safety); then each tuple of bindings for these variables produces a unique oid (semantic oid's)

2005Integration/tsimmis26 Example: two sources about technical reports use the same report numbers; one has a title, the other the postscript
(MS3):
  …@cs :- …@cs1
  …@cs :- …@cs2
- If report #5 occurs in both sources, the first rule attaches the title to the fused object, and the second rule attaches the postscript
- If it occurs only in cs1, then only a title field will be attached
- If it occurs only in cs2, then only a ps field will be attached
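The fusion behaviour described on this slide can be imitated in Python. This is a sketch under assumptions: the fused oid is taken to be a syntactic term `f(RN)` built from the report number, source records are plain dictionaries, and the labels `report_number`, `title`, `postscript` as well as all data values are hypothetical.

```python
from collections import defaultdict


def f(rn):
    """Syntactic (Skolem-style) oid term: equal report numbers give equal oids."""
    return ("f", rn)


def fuse(cs1, cs2):
    """Apply one rule per source; both rules write to the object with oid f(RN)."""
    med = defaultdict(set)
    for report in cs1:   # rule 1: attach the title
        if "report_number" in report and "title" in report:
            med[f(report["report_number"])].add(("title", report["title"]))
    for report in cs2:   # rule 2: attach the postscript
        if "report_number" in report and "postscript" in report:
            med[f(report["report_number"])].add(("postscript", report["postscript"]))
    return dict(med)


cs1 = [{"report_number": 5, "title": "Mediators"},
       {"report_number": 7, "title": "Wrappers"}]
cs2 = [{"report_number": 5, "postscript": "tr5.ps"},
       {"report_number": 9, "postscript": "tr9.ps"}]
med = fuse(cs1, cs2)
```

Report 5 ends up with both fields, report 7 with only a title, and report 9 with only a postscript field. No join between the sources is needed, because the shared oid term does the fusing.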

2005Integration/tsimmis27 We can retrieve all the fields from both sources without having to know their labels
  …@cs :- …@cs1
  …@cs :- …@cs2
- Variable V binds to a set of objects (provided one has a report_number field)
- In this example, if both sources contain a title field, the mediated object will have both
- The mediated object certainly has two report_number fields! Can this be avoided?
  - Hint: use the same idea for the object with the report_number label

2005Integration/tsimmis28 We can choose to retrieve all fields from cs1, and from cs2 only the fields that cs1 does not provide
(MS5):
  …@cs :- …@cs1
  provided(RN, F) :- …@cs1
  …@cs :- not provided(RN, F), …@cs2
- Use of predicates, in addition to objects, is useful
- This is a case of stratified negation, which has well-defined semantics
- When evaluating against cs1, it makes sense to collect the bindings for the first two rules together
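A possible reading of this stratified evaluation in Python is shown below. The sketch assumes that `provided(RN, F)` records that cs1 supplies a field labelled F for report RN, that records are dictionaries keyed by label, and that the mediated object is keyed directly by the report number; all data values are hypothetical.

```python
from collections import defaultdict


def fuse_prefer_cs1(cs1, cs2):
    med = defaultdict(list)   # report number -> list of (label, value) fields
    provided = set()          # the provided(RN, F) facts

    # Stratum 1: one pass over cs1 evaluates both the first rule and 'provided'.
    for report in cs1:
        rn = report.get("report_number")
        if rn is None:
            continue
        for label, value in report.items():
            med[rn].append((label, value))
            provided.add((rn, label))

    # Stratum 2: 'provided' is now complete, so its negation is safe to test.
    for report in cs2:
        rn = report.get("report_number")
        if rn is None:
            continue
        for label, value in report.items():
            if (rn, label) not in provided:
                med[rn].append((label, value))
    return dict(med)


cs1 = [{"report_number": 5, "title": "Mediators"}]
cs2 = [{"report_number": 5, "title": "Mediators (draft)", "postscript": "tr5.ps"},
       {"report_number": 9, "title": "Fusion"}]
med = fuse_prefer_cs1(cs1, cs2)
```

The single pass over cs1 mirrors the remark about collecting the bindings for the first two rules together. The second pass may only start once that pass has finished, which is what stratification guarantees.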

2005Integration/tsimmis29 Assume reports have a field related that references another report. How can we transform them to mediator objects?
(MS6):
  …@cs :- …@cs1, L neq related
  …@cs :- …@cs1
But this solution assumes that:
- We know which sub-objects contain references
- They are at a fixed, known depth
In XML, both assumptions may fail. Assuming we have a construct like //, can we address these issues?

2005Integration/tsimmis30 Querying and query processing
- A variety of languages can be used for querying; we illustrate querying using MSL
- Example: find all info about Joe Chung:
  (Q1) JC :- JC:<cs_person, {<name, 'Joe Chung'>}>@med
- New feature: the object variable JC
  - When Q1 is processed, it binds to any object with name 'Joe Chung'
  - Each such binding inserts the object into the answer

2005Integration/tsimmis31 Processing the query:
- Remove the object variable:
    … :- …@med
  Note: L neq name is not needed; why?
- Match the body condition with the head of the rule defining med (after the rest variables were also removed):
    …@med :-
        …@whois, R1-l notin {name, dept, relation},
        …@CS, R2-l notin {first_name, last_name},
        decompose_name(N, LN, FN)

2005Integration/tsimmis32 A match: the query body is matched against the head of the rule
⇒ N is replaced by 'Joe Chung' (L and V match each of the fields)
⇒ The rule becomes:
    …@med :-
        …@whois, R1-l notin {name, dept, relation},
        …@CS, R2-l notin {first_name, last_name},
        decompose_name('Joe Chung', LN, FN)

2005Integration/tsimmis33 So, we have replaced a view by its definition, with the query bindings accounted for
- Add an object id to the head: f('Joe Chung') (we do not show it). Why is it needed?
Now, decompose the result into source queries:
  (1) …@whois, R1-l notin {name, dept, relation}
  (2) …@CS, R2-l notin {first_name, last_name}
And a glue:
  (3) decompose_name('Joe Chung', LN, FN)
What are the options for query processing?

2005Integration/tsimmis34
I. Obtain R bindings from whois and FN, LN bindings from decompose, then use these in queries on CS
II. Obtain FN, LN bindings from decompose and use these in (two) queries on CS; also use the given bindings to query whois; then join
III. Query CS, then use decompose and the results to query whois
The selection of the query processing strategy requires an optimizer
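Strategy I can be sketched in Python with in-memory stand-ins for the two wrapped sources. Everything named here is hypothetical: the source contents, the helper functions `query_whois` and `query_cs`, and the "First Last" name format assumed by `name_to_lnfn`. Fields are merged into a dictionary, which would lose repeated labels.

```python
# Hypothetical source contents.
WHOIS = [{"name": "Joe Chung", "dept": "CS", "relation": "employee",
          "e_mail": "chung@cs"}]
CS = {"employee": [{"first_name": "Joe", "last_name": "Chung",
                    "title": "professor"}],
      "student": []}


def query_whois(name):
    """Source query (1): CS people in whois with the given name."""
    return [p for p in WHOIS if p.get("name") == name and p.get("dept") == "CS"]


def query_cs(relation, fn, ln):
    """Source query (2): rows of the named CS relation with the given name parts."""
    return [r for r in CS.get(relation, [])
            if r.get("first_name") == fn and r.get("last_name") == ln]


def name_to_lnfn(name):
    """Glue (3), assuming the format "First Last"."""
    first, _, last = name.rpartition(" ")
    return last, first


def plan_one(name):
    """Strategy I: whois gives R, decompose gives LN and FN, then query CS."""
    ln, fn = name_to_lnfn(name)
    answers = []
    for person in query_whois(name):
        relation = person["relation"]          # the binding for R
        for row in query_cs(relation, fn, ln):
            rest1 = {k: v for k, v in person.items()
                     if k not in ("name", "dept", "relation")}
            rest2 = {k: v for k, v in row.items()
                     if k not in ("first_name", "last_name")}
            answers.append({"name": name, "rel": relation, **rest1, **rest2})
    return answers
```

Because R is already bound when CS is queried, only one CS relation has to be consulted per whois match, whereas strategy II must query both relations. An optimizer would weigh this against the cost of contacting whois first.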

2005Integration/tsimmis35 Summary
Tsimmis combines:
- Semi-structured data
- Semantic object id's for object fusion
- A GAV approach
Advantages:
- Offers a solution to the schematic discrepancy problem
- Can deal with source evolution, unknown structure, …
- Semantic oid's are a nice mechanism for information fusion
Some problems:
- Does not provide easy access to deeply nested data
- Nor to data whose depth is variable or unknown