CS246 Query Translation. Mind Your Vocabulary Q: What is the problem? A: How to integrate heterogeneous sources when their schema & capability are different.

Presentation transcript:

CS246 Query Translation

Mind Your Vocabulary Q: What is the problem? A: How to integrate heterogeneous sources when their schemas and capabilities differ.

How to integrate? [Figure: Bestbookbuys.com, the mediator, sits above Amazon.com and bn.com. The mediator query [fn = “Tom”] [ln = “Clancy”] has to become [au = “Clancy, Tom”] for one source and [au = “Tom, Clancy”] for the other.]

Framework The user expresses a query using the mediator schema. The mediator translates the query into source-supported queries. The mediator collects and post-processes the results from the sources. [Figure: the mediator query [fn = “Tom”] [ln = “Clancy”] is sent to Amazon.com and bn.com, for example as [au = “Clancy, Tom”].]

Difference From Previous Studies? Heterogeneous attributes: sources use different “vocabularies”, so semantic translation is necessary, whereas previous studies assumed homogeneous attributes for all sources. Complex Boolean queries: not just conjunctive queries.

Main Challenge How best to translate a query when the mediator and the source use different models or schemas? Examples: author ↔ lastname, firstname; Western calendar ↔ Chinese lunar calendar.

Query Translation Example Mediator schema: publisher = “publisher”, title = “title”, last = “lastname”, first = “firstname”, year = “2002”, month = “may”. Amazon.com schema: publisher = “publisher”, title = “title”, author = “last, first?”, date = “spring, 2002”. Q: For the above schemas, what is the best translation for [last = “Clancy” & year = “1998” & month = “Jan”]? A: [author = “Clancy” & date = “winter, 1998”]?

More Translation Examples More translations for the same schemas: [publisher = “p” & last = “l” & first = “f”] → [publisher = “p” & author = “l, f”]; [title = “t” & last = “l” & first = “f”] → [title = “t” & author = “l, f”]. Do we have to translate every possible query manually? Is it necessary to have separate rules for the above translations? Can the system automatically translate queries? Any idea?

Observations The system cannot figure out [last = “l” & first = “f”] → [author = “l, f”] on its own: it has no semantic knowledge, so the user needs to provide these types of mappings. There seem to exist “basic” mappings, however, and the system may compose a “correct” translation using “basic” translations such as [last = “l” & first = “f”] → [author = “l, f”] and [year = “yy” & month = “Jan”] → [date = “winter, yy”].

Framework A human expert provides a set of “basic” rules that map the mediator context to the source context: [last = “l” & first = “f”] → [author = “l, f”]; [year = “yy” & month = “Jan”] → [date = “winter, yy”].

Framework Given a query, the system automatically translates the query using the basic rules. [Figure: the basic rules feed the translation algorithm, which maps Qm: First = “Tom”, Last = “Clancy” to Qs: Author = “Clancy, Tom”.]

Advantage of the Proposed Framework Minimizes manual intervention Human input only for the initial rule writing Can translate any queries Not just “template” queries

Questions How do we know whether a translation is “good” or “correct”? What basic rules are necessary? Do we need a rule for [last = ‘l’ | first = ‘f’]? How do we translate? Algorithm for “good” translation?

Good Translation? Q: Why do we think these are good translations? [last = “Clancy” & first = “Tom”] → [author = “Clancy, Tom”]; [year = “2002” & month = “Jan”] → [date = “winter, 2002”]. A: Results for the translated queries are “close” to those of the original queries.

Minimum Superset Translation Definition of “closeness” in the paper. Q: original query → S(Q): translated query. We also use Q and S(Q) to represent their results. S(Q): the minimal superset of Q expressed in the source terms. [Figure: result set Q with candidate translations S1(Q) and S2(Q), illustrating the minimum superset translation.]

Minimum Superset Translation Find the minimum superset translation from the original query “Filter out” false positives by applying filtering condition at the mediator
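The superset-then-filter idea can be sketched in a few lines. This is only an illustration under stated assumptions: the catalogue rows, the keyword-style `source_search`, and the helper names are all hypothetical and not taken from the slides.

```python
# Hypothetical source catalogue; rows and field names are made up for illustration.
ROWS = [
    {"author": "Clancy, Tom", "title": "Book A"},
    {"author": "Brown, Clancy", "title": "Book B"},
    {"author": "King, Stephen", "title": "Book C"},
]


def source_search(rows, keyword):
    """Simulated source: it can only test whether the author string contains a keyword."""
    return [r for r in rows if keyword in r["author"]]


def last_name(row):
    """Extract the last name, assuming the source stores authors as 'last, first'."""
    return row["author"].split(",")[0].strip()


def mediator_query(rows, last):
    """Answer [ln = last]: send a superset query, then filter at the mediator."""
    superset = source_search(rows, last)                   # S(Q): may hold false positives
    return [r for r in superset if last_name(r) == last]   # re-apply the original constraint


print([r["author"] for r in source_search(ROWS, "Clancy")])
print([r["author"] for r in mediator_query(ROWS, "Clancy")])
```

The source query returns both rows that mention "Clancy"; the mediator-side filter keeps only the row whose last name matches, which is the false-positive removal step described above.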

Any Alternative for “Closeness”? What about maximum subset translation? This is the definition used in previous studies. It may be a good definition when the result is large or filtering is impossible… [Figure: result set Q with candidate translations S1(Q) and S2(Q), illustrating the maximum subset translation.]

Any Alternative for “Closeness”? Consider both false positives and false negatives: maximize |S(Q) ∩ Q| / |S(Q) ∪ Q|. Other definitions are possible depending on the scenario. [Figure: overlapping sets Q and S(Q), with the false positives and false negatives marked.]
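A closeness score that penalizes both false positives and false negatives can be computed on result sets. The sketch below assumes the intended ratio is intersection over union (the Jaccard ratio); the function name `closeness` is hypothetical.

```python
def closeness(translated, original):
    """Ratio |S(Q) & Q| / |S(Q) | Q| over two result sets (assumed Jaccard form)."""
    union = translated | original
    if not union:
        return 1.0  # two empty results are treated as identical
    return len(translated & original) / len(union)


q = {2, 3, 4}    # results of the original query Q
sq = {1, 2, 3}   # results of a translation S(Q): 1 is a false positive, 4 a false negative
print(closeness(sq, q))  # 2 shared results out of 4 distinct ones -> 0.5
```

A pure superset translation has no false negatives, so under this score it is judged only by how many false positives it adds.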

Questions How do we know whether a translation is “good” or “correct”? Minimal subsuming translation What basic rules are necessary? Do we need a rule for [last = ‘l’ | first = ‘f’]? How do we translate? Algorithm for “good” translation?

Three Main Concepts Query separability, query safety, and cross matching.

Query Separability Q = [ln = “Clancy”] & [fn = “Tom”] & [p = “Wiley”]. We still get the minimum superset translation if we separately translate [ln = “Clancy”] & [fn = “Tom”] and [p = “Wiley”]. Q = C1 ∘ C2 ∘ C3 (∘: & or |) is separable if S(Q) = S(C1) ∘ S(C2) ∘ S(C3).

Disjunction Separability Theorem [CGM96] Disjunctions are always separable: Q = C1 | C2 | C3 ⇒ S(Q) = S(C1) | S(C2) | S(C3) for any C1, C2 and C3, assuming minimum superset translation semantics. Implication: basic rules are necessary only for conjunctions, e.g., [c1 & c2], but not [c1 | c2]. Why? Any complex query can be transformed to DNF. This is a significant simplification for a rule writer.

Basic Rules Only conjunctions of constraints. Separability of conjunctions is determined by a human expert: [ln & fn] needs a rule, but [ln & publisher] does not. User-provided basic rules should be sound and complete. Soundness: all mappings are correct (minimal subsuming translation). Completeness: the rules contain all inseparable simple conjunctions.

Questions How do we know whether a translation is “good” or “correct”? What basic rules are necessary? Do we need a rule for [last = ‘l’ | first = ‘f’]? How do we translate? Algorithm for “good” translation?

Translation Algorithm Simple conjunction query. Step 1: Find all matching rules. Rules: ln = “l” → au = “l”; ln = “l” & fn = “f” → au = “l, f”; p = “p” → p = “p”. Q: ln = “l” & fn = “f” & p = “p”. Matches: au = “l”, au = “l, f”, p = “p”.

Translation Algorithm Simple conjunction query. Step 2: Remove subset matchings, because a superset matching is more “precise”. Rules: ln = “l” → au = “l”; ln = “l” & fn = “f” → au = “l, f”; p = “p” → p = “p”. Q: ln = “l” & fn = “f” & p = “p”. The match au = “l” is dropped because the rule for ln = “l” & fn = “f” covers a superset of its constraints.

Translation Algorithm Simple conjunction query. Step 3: Generate the translated query. Rules: ln = “l” → au = “l”; ln = “l” & fn = “f” → au = “l, f”; p = “p” → p = “p”. Q: ln = “l” & fn = “f” & p = “p”. Result: au = “l, f” & p = “p”.
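The three steps for a simple conjunction can be sketched as follows. The rule table mirrors the three rules on the slides, but its encoding (a set of mediator attributes plus a builder function) and the name `translate_conjunction` are assumptions made for this example.

```python
# Each hypothetical rule: (mediator attributes it consumes, builder of one source constraint).
RULES = [
    (frozenset({"ln"}), lambda q: f'au = "{q["ln"]}"'),
    (frozenset({"ln", "fn"}), lambda q: f'au = "{q["ln"]}, {q["fn"]}"'),
    (frozenset({"p"}), lambda q: f'p = "{q["p"]}"'),
]


def translate_conjunction(query):
    """Translate a simple conjunction given as {attribute: value}."""
    have = set(query)
    # Step 1: find all rules whose attributes all occur in the query.
    matching = [rule for rule in RULES if rule[0] <= have]
    # Step 2: drop a matching whose attributes are a proper subset of another matching's.
    kept = [rule for rule in matching
            if not any(rule[0] < other[0] for other in matching)]
    # Step 3: conjoin the source constraints produced by the remaining rules.
    return " & ".join(build(query) for _, build in kept)


print(translate_conjunction({"ln": "Clancy", "fn": "Tom", "p": "Wiley"}))
# au = "Clancy, Tom" & p = "Wiley"
```

With only `ln` present, the same function falls back to the less precise rule and returns `au = "Clancy"`.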

Translation Algorithm Complex Boolean query? [Figure: query tree Q mixing | and & nodes.]

Solution 1 (Algorithm DNF) Convert to DNF and translate. Disjunctions are always separable, so we can translate each disjunct individually. [Figure: query Q is converted to DNF and translated into (au = “l, f1” & p = “p”) | (au = “l, f2” & p = “p”).]
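A minimal sketch of the DNF conversion step, assuming a query tree encoded as nested tuples `("and", ...)` / `("or", ...)` with string leaves; this encoding and the name `to_dnf` are illustrative choices, not from the slides.

```python
from itertools import product


def to_dnf(node):
    """Return the DNF of a query tree as a list of conjunctions (lists of leaves)."""
    if isinstance(node, str):          # a leaf constraint
        return [[node]]
    op, *children = node
    parts = [to_dnf(child) for child in children]
    if op == "or":
        # The disjuncts of the children are simply concatenated.
        return [conj for part in parts for conj in part]
    # "and": pick one disjunct from every child and merge them (cross product).
    return [sum(combo, []) for combo in product(*parts)]


q = ("and", ("or", "fn1", "fn2"), "ln", "p")
print(to_dnf(q))   # [['fn1', 'ln', 'p'], ['fn2', 'ln', 'p']]

# Three binary disjunctions under one AND already produce 2 * 2 * 2 = 8 disjuncts.
wide = ("and", ("or", "a", "b"), ("or", "c", "d"), ("or", "e", "f"))
print(len(to_dnf(wide)))   # 8
```

Each resulting conjunction could then be translated on its own, since disjunctions are always separable. The second call hints at the cost: the number of disjuncts multiplies with every disjunction under a conjunction.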

What’s Wrong with DNF? DNF conversion is exponential. The DNF parse tree is not compact. Global conversion is often not necessary: the translation of C3 is independent of the others. [Figure: query tree over conjuncts C1, C2 and C3, with leaves x: [fn …], y: [fn …], z: [ln …] and [p …]; C3 is marked independent.]

Conjunction Partitioning Partition the conjuncts into independent groups and translate each group separately, by rewriting local groups. The top-level “AND” of C3 is preserved. Group 1: G1 = {C1, C2}. Group 2: G2 = {C3}. [Figure: the same query tree over C1, C2 and C3, with leaves x: [fn …], y: [fn …], z: [ln …] and [p …]; C3 is marked independent.]

Independent Groups? Q: How do we know G1 and G2 are “independent”? A: Q = G1 & G2 is separable Q: How do we know Q = G1 & G2 is separable?

Safety Condition Query separability is difficult to check directly. Safety condition: a practical way to check query separability. It is a sufficient condition for query separability, but not a necessary one.

Safety Condition for Simple Conjunction M(Q): matching rules for Q. Q = G1 & G2, where G1 and G2 are simple conjunctions: G1 = [C1 & C2], G2 = [C3 & C4]. Q is safe iff M(Q) = M(G1) ∪ M(G2). That is, Q is safe if there is no “cross matching” between G1 and G2. Cross matching: a rule that matches some constraints in G1 and some constraints in G2. Example: G1: [fn = “f1” & fn = “f2”], G2: [ln = “ln”]. Q = G1 & G2 is unsafe because of the cross matching of “fn & ln → au”.
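The safety test for two simple conjunctions can be sketched by comparing rule matches, here reduced to the attribute sets the rules consume. `RULE_ATTRS`, `matching_rules`, and `is_safe` are hypothetical names, and each conjunction is represented only by the set of attributes it constrains.

```python
# Attribute sets of the hypothetical basic rules: ln -> au, ln & fn -> au, p -> p.
RULE_ATTRS = [frozenset({"ln"}), frozenset({"ln", "fn"}), frozenset({"p"})]


def matching_rules(attrs):
    """M(Q): the rules whose attributes all appear in the conjunction."""
    return {rule for rule in RULE_ATTRS if rule <= attrs}


def is_safe(g1, g2):
    """Q = G1 & G2 is safe iff M(Q) == M(G1) | M(G2), i.e. no rule spans both groups."""
    return matching_rules(g1 | g2) == matching_rules(g1) | matching_rules(g2)


print(is_safe({"fn"}, {"ln"}))        # False: the rule ln & fn -> au cross-matches
print(is_safe({"ln", "fn"}, {"p"}))   # True: no rule needs attributes from both groups
```

In the unsafe case the rule over `ln` and `fn` appears in M(Q) but in neither M(G1) nor M(G2), which is exactly the cross matching named in the example above.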

Safety Condition for Complex Disjunction M(Q): matching rules for Q. Q = G1 & G2, where G1 and G2 are complex disjunctions: G1 = [C1 | C2], G2 = [C3 | C4]. 1. Disjunctivize Q: Q = [C1 & C3] | [C1 & C4] | [C2 & C3] | [C2 & C4]. 2. Q is safe iff every disjunct is safe, i.e., iff [C1 & C3], [C1 & C4], [C2 & C3], and [C2 & C4] are all safe.
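The two-step check for a conjunction of disjunctions can be sketched by pairing every disjunct of G1 with every disjunct of G2. The helper definitions are restated so the block runs on its own; all names and the attribute-set encoding are assumptions for illustration.

```python
from itertools import product

# Attribute sets of the hypothetical basic rules: ln -> au, ln & fn -> au, p -> p.
RULE_ATTRS = [frozenset({"ln"}), frozenset({"ln", "fn"}), frozenset({"p"})]


def matching_rules(attrs):
    """M(Q) for a simple conjunction given as a set of attributes."""
    return {rule for rule in RULE_ATTRS if rule <= attrs}


def conj_safe(c1, c2):
    """Safety of [c1 & c2] for two simple conjunctions."""
    return matching_rules(c1 | c2) == matching_rules(c1) | matching_rules(c2)


def disj_safe(g1, g2):
    """G1 & G2 with G1 = [C1 | C2 ...], G2 = [C3 | C4 ...]: safe iff every pair is safe."""
    return all(conj_safe(a, b) for a, b in product(g1, g2))


g1 = [{"fn"}, {"p"}]      # [fn ...] | [p ...]
g2 = [{"ln"}]             # [ln ...]
print(disj_safe(g1, g2))                      # False: the disjunct [fn & ln] cross-matches
print(disj_safe([{"ln", "fn"}, {"ln"}], [{"p"}]))   # True
```

Iterating over `product(g1, g2)` enumerates the same disjuncts that step 1 writes out, without building the rewritten query.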

Important Theorem A query is separable if it is safe (i.e., query separability ⇐ safety). A query is safe if there is no cross matching (i.e., safety ⇐ no cross matching). If there is a cross matching between conjuncts, we cannot translate them separately: put them into the same group.

Algorithm TDQM Recursively traverse the query tree in top-down order. At a disjunction node: separately translate its children. At a conjunction node: put the children with cross matching into the same group and rewrite the query locally in each group.

Algorithm TDQM Recursively traverse the tree top-down. At a disjunction node: separately apply TDQM to each child (disjunction separability theorem). [Figure: query tree with leaves x: [fn…], y: [fn…], z: [ln…], v: [p…] and w: [y…].]

Algorithm TDQM At a conjunction node: group the children by identifying “cross matchings”, so that there is no cross matching between groups (safety condition). [Figure: conjuncts C1, C2 and C3 over the leaves x: [fn…], y: [fn…], z: [ln…], v: [p…] and w: [y…]; the cross matchings {x, z} and {y, z} determine the groups G1 and G2.]

Algorithm TDQM For groups with more than one conjunct: locally rewrite into a disjunctive form (not DNF). [Figure: group G1 is rewritten into a disjunction over the pairs x, z and y, z; group G2 keeps v: [p…]; w: [y…] is unchanged.]

Algorithm TDQM Continue the tree traversal until we reach simple conjunctions, then apply the basic mappings. [Figure: the rewritten tree over the pairs x, z and y, z, with v: [p…] and w: [y…].]

Algorithm TDQM Generates minimum superset translation Resulting translation is “compact” Assuming the original query is “compact” Convert the tree only when it is necessary
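The top-down traversal can be sketched end to end. This is a simplified illustration, not the paper's algorithm as published: the tuple encoding of the tree, the rule table, the conservative `cross_match` test (any multi-attribute rule touching both subtrees), and all function names are assumptions. It also assumes each attribute occurs at most once per simple conjunction.

```python
from itertools import product

# Hypothetical basic rules: (mediator attributes consumed, builder of one source constraint).
RULES = [
    (frozenset({"ln"}), lambda q: f'au = "{q["ln"]}"'),
    (frozenset({"ln", "fn"}), lambda q: f'au = "{q["ln"]}, {q["fn"]}"'),
    (frozenset({"p"}), lambda q: f'p = "{q["p"]}"'),
]

def is_leaf(node):
    return node[0] == "c"            # leaf = ("c", attribute, value)

def attrs(node):
    """Mediator attributes mentioned anywhere under node."""
    if is_leaf(node):
        return {node[1]}
    return set().union(*(attrs(ch) for ch in node[1:]))

def cross_match(a, b):
    """Conservative test: a multi-attribute rule touches both subtrees."""
    return any(len(r) > 1 and r & attrs(a) and r & attrs(b) for r, _ in RULES)

def group(children):
    """Partition conjuncts so that no cross matching remains between groups."""
    groups = []
    for ch in children:
        linked = [g for g in groups if any(cross_match(ch, m) for m in g)]
        merged = [m for g in linked for m in g] + [ch]
        groups = [g for g in groups if g not in linked] + [merged]
    return groups

def translate_simple(leaves):
    """Apply the basic rules to a simple conjunction, keeping only superset matchings."""
    q = {a: v for _, a, v in leaves}
    matching = [r for r in RULES if r[0] <= set(q)]
    kept = [r for r in matching if not any(r[0] < o[0] for o in matching)]
    return " & ".join(build(q) for _, build in kept)

def local_rewrite(members):
    """Distribute AND over the members' top-level ORs (a local rewrite, not full DNF)."""
    options = [m[1:] if m[0] == "or" else (m,) for m in members]
    conjs = []
    for combo in product(*options):
        flat = []
        for part in combo:
            flat.extend(part[1:] if part[0] == "and" else (part,))
        conjs.append(("and", *flat))
    return conjs[0] if len(conjs) == 1 else ("or", *conjs)

def tdqm(node):
    if is_leaf(node):
        return translate_simple([node])
    op, *children = node
    if op == "or":                   # disjunctions are always separable
        return "(" + " | ".join(tdqm(ch) for ch in children) + ")"
    if all(is_leaf(ch) for ch in children):
        return translate_simple(children)
    parts = [tdqm(g[0] if len(g) == 1 else local_rewrite(g)) for g in group(children)]
    return " & ".join(parts)

x, y = ("c", "fn", "Tom"), ("c", "fn", "Thomas")
z, v = ("c", "ln", "Clancy"), ("c", "p", "Wiley")
print(tdqm(("and", ("or", x, y), z, v)))
# (au = "Clancy, Tom" | au = "Clancy, Thomas") & p = "Wiley"
```

In the example, the disjunction over `fn` and the `ln` leaf land in one group because of the `ln & fn` rule and are rewritten locally, while the publisher constraint stays outside the rewrite, so the top-level conjunction is preserved.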

TDQM Summary Key concepts: separability ⇐ safety ⇐ no cross matching. Local rewriting for compact translation.

A Few Remarks Final algorithm is straightforward Simply put, separately translate each term if there is no “cross-matching” Many people can come up with the algorithm But the author developed an amazing theory by carefully studying basic questions Initial problem looks rather “trivial” But a mine-field of interesting research topics…

Questions?