1 Oblivious Querying of Data with Irregular Structure.

Presentation transcript:

1 Oblivious Querying of Data with Irregular Structure

2 Based on Several Works Queries with Incomplete Answers –Yaron Kanza, Werner Nutt, Shuky Sagiv Flexible Queries –Yaron Kanza, Shuky Sagiv SQL4X –Sara Cohen, Yaron Kanza, Shuky Sagiv Computing Full Disjunctions –Yaron Kanza, Shuky Sagiv

3 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

4 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

5 The Semistructured Data Model Data is described as a rooted labeled directed graph Nodes represent objects Edges represent relationships between objects Atomic values are attached to atomic nodes
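The data model on this slide can be sketched in a few lines of Python. This is only an illustration: the class name `Graph`, the method names, and the node ids are assumptions made for the example and do not come from the talk.

```python
from collections import defaultdict

class Graph:
    """Rooted, labeled, directed graph; atomic nodes carry atomic values."""
    def __init__(self, root):
        self.root = root
        self.out = defaultdict(list)   # node -> list of (label, child)
        self.value = {}                # atomic node -> atomic value

    def add_edge(self, src, label, dst):
        self.out[src].append((label, dst))

    def children(self, node, label):
        """All nodes reached from `node` by an edge with the given label."""
        return [c for (l, c) in self.out[node] if l == label]

# A tiny fragment of a movie database (node ids are hypothetical).
db = Graph(root=1)
db.add_edge(1, "Movie", 11)
db.add_edge(11, "Title", 21)
db.add_edge(11, "Year", 22)
db.add_edge(11, "Actor", 23)
db.add_edge(23, "Name", 30)
db.value[21] = "Star Wars"
db.value[22] = 1977
db.value[30] = "Mark Hamill"

titles = [db.value[t]
          for m in db.children(db.root, "Movie")
          for t in db.children(m, "Title")]
print(titles)
```

Nodes are objects, labeled edges are relationships, and only the atomic nodes appear in `value`.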

A Movie Database Example [Figure: the movie database as a rooted labeled graph. Edge labels include Movie, Film, T.V. Series, Actor, Title, Name, and Year; atomic values include Star Wars, 1977, Dune, Twin Peaks, Léon, Magnolia, Mark Hamill, Harrison Ford, Kyle MacLachlan, and Natalie Portman.]

7 XML that Encodes the Semistructured Data [XML listing; the markup did not survive extraction. Surviving text values: Star Wars, 1977, Mark Hamill, Harrison Ford, …]

Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release. What Should be the Form of the Query? [Figure: the movie database graph.]

Incomplete Data [Figure: the movie database graph with two movies highlighted.] The movie has a year attribute. The year of the movie is missing.

Variations in Structure [Figure: the movie database graph with two fragments highlighted.] Movie below actor. Actor below movie.

Ontology Variations [Figure: the movie database graph with two edges highlighted.] A movie label. A film label. Dealing with ontology variations is beyond the scope of this talk.

12 Irregular Data Data is incomplete –Missing values of attributes in objects Data has structural variations –Relationships between objects are represented differently in different parts of the database Data has ontology variations –Different labels are used to describe objects of the same type

13 Irregular data does not conform to a strict schema. Queries over irregular data should not be rigid patterns. The schema cannot guide a user in formulating a query.

14 In Which Cases is it Difficult to Formulate Queries over Semistructured Data? The description of the schema is large (e.g., a DTD of XML), so it is difficult to use the schema when formulating queries. Data is contributed by many users in a variety of designs, so the query should deal with different structures of data. The structure of the database is changed frequently, so queries should be rewritten frequently.

15 Can Regular Expressions Help in Querying Irregular Data? In many cases, regular expressions can be used to query irregular data Yet, regular expressions are –Not efficient – it is difficult to evaluate regular expressions –Not intuitive – it is difficult for a naïve user to formulate regular expressions

16 More on Using Regular Expressions When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database –For n types of objects, there are n! possible hierarchies –For an object with n attributes, there are 2^n subsets of missing attributes
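The two counts on this slide grow very quickly. The short calculation below shows how many hierarchies (n!) and how many subsets of missing attributes (2^n) arise for a few small values of n; the function names are chosen for this example only.

```python
from math import factorial

def possible_hierarchies(n):
    """Number of ways to order n object types into a hierarchy: n!."""
    return factorial(n)

def missing_attribute_subsets(n):
    """Number of subsets of n attributes that may be missing: 2^n."""
    return 2 ** n

for n in (3, 5, 10):
    print(n, possible_hierarchies(n), missing_attribute_subsets(n))
```

For n = 10 this already gives 3,628,800 hierarchies and 1,024 subsets, so a regular expression that enumerates every case is impractical to write by hand.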

17 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

18 Queries with Incomplete Answers We have developed queries that deal with incomplete data in a novel way and return incomplete answers The queries return maximal answers rather than complete answers Different query semantics admit different levels of incompleteness

19 Queries with Incomplete Answers Queries with complete answers Queries with AND Semantics Queries with Weak Semantics Queries with OR Semantics Increasing level of incompleteness

20 Queries and Matchings The queries are labeled rooted directed graphs Query nodes are variables Matchings are assignments of database objects to the query variables according to –the constraints specified in the query, and –the semantics of the query

21 Constraints On Complete Matchings Root Constraint: Satisfied if the query root is mapped to the db root Edge Constraint: Satisfied if a query edge with label l is mapped to a database edge with label l [Figure: the query root r mapped to database root 1, and a query edge (x, y) with label l mapped to a database edge with label l.]
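The two constraints can be checked directly. The following is a minimal sketch, assuming that queries and databases are given as collections of (source, label, target) triples and that a mapping is a dictionary from query variables to database node ids; the function name and the sample data are hypothetical.

```python
def is_complete_matching(query_edges, query_root, db_edges, db_root, mapping):
    """True if every variable is non-null and the root and edge constraints hold."""
    variables = {query_root}
    for u, _, v in query_edges:
        variables.update((u, v))
    # A complete matching maps every query variable to a non-null object.
    if any(mapping.get(x) is None for x in variables):
        return False
    # Root constraint: the query root is mapped to the database root.
    if mapping[query_root] != db_root:
        return False
    # Edge constraint: a query edge labeled l is mapped to a db edge labeled l.
    return all((mapping[u], label, mapping[v]) in db_edges
               for u, label, v in query_edges)

db_edges = {(1, "Movie", 11), (11, "Actor", 21), (11, "Director", 22)}
query_edges = [("r", "Movie", "x"), ("x", "Actor", "y")]

good = is_complete_matching(query_edges, "r", db_edges, 1, {"r": 1, "x": 11, "y": 21})
bad = is_complete_matching(query_edges, "r", db_edges, 1, {"r": 1, "x": 11, "y": 22})
print(good, bad)
```

The second mapping fails because node 22 is reached from node 11 by a Director edge, not an Actor edge.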

A Complete Matching [Figure: a query with variables r, x, y, z, u, v and edge labels Movie, Director, Producer, Uncredited Actor, Name, and Date of birth, matched against the movie database.] All the nodes are mapped to non-null values. The root constraint and all the edge constraints are satisfied.

No Complete Matching Exists! [Figure: the same query and the movie database.] Consider the case where Node 35 (Date of birth, 14 May 1944) is removed from the database.

Not Every Partial Assignment is an Incomplete Matching [Figure: a partial assignment of database nodes to the query variables, in which u is NULL.] This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v.

25 The Reachability Constraint on Partial Matchings A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied [Figure: a query with variables r, v, w, x, y, z and edge labels l1 to l6, shown next to a database.]
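One way to test the reachability constraint is to walk from the query root along satisfied edges only and then confirm that every mapped variable was reached. This is a sketch under the same assumed representation as elsewhere (edge triples and a dictionary mapping, with `None` standing for null); the names are illustrative.

```python
def edge_satisfied(edge, db_edges, mapping):
    """An edge constraint holds if both ends are non-null and the db edge exists."""
    u, label, v = edge
    a, b = mapping.get(u), mapping.get(v)
    return a is not None and b is not None and (a, label, b) in db_edges

def satisfies_reachability(query_edges, query_root, db_edges, mapping):
    """True if every mapped variable is reachable via satisfied edge constraints."""
    reached, stack = {query_root}, [query_root]
    while stack:
        u = stack.pop()
        for edge in query_edges:
            if (edge[0] == u and edge[2] not in reached
                    and edge_satisfied(edge, db_edges, mapping)):
                reached.add(edge[2])
                stack.append(edge[2])
    return all(x in reached for x, o in mapping.items() if o is not None)

db_edges = {(1, "Movie", 11), (11, "Director", 12), (12, "Name", 31)}
query_edges = [("r", "Movie", "x"), ("x", "Director", "u"), ("u", "Name", "v")]

ok = satisfies_reachability(query_edges, "r", db_edges,
                            {"r": 1, "x": 11, "u": 12, "v": 31})
broken = satisfies_reachability(query_edges, "r", db_edges,
                                {"r": 1, "x": 11, "u": None, "v": 31})
print(ok, broken)
```

In the second mapping, v is assigned a database node although the only query path to v passes through the null variable u, so the constraint fails.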

26 “AND” Matchings A partial matching is an AND matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node –If a query node is mapped to a database node, all the incoming edge constraints are satisfied [Figure: a query fragment with root r, variables x, y, z, and edge labels Director, Actor, and Producer.]

An AND Matching [Figure: the query matched against the movie database; variable u is mapped to NULL.]

28 [Figure: the query and the movie database.] Suppose that we remove the edges that are labeled with Uncredited Actor. In an AND matching, Node z must be null!

29 Weak Satisfaction of Edge Constraints Edge Constraint: Is weakly satisfied if it is either satisfied (as defined earlier), or one (or more) of its nodes is mapped to a null value [Figure: the four cases for a query edge (x, y), depending on which end is mapped to null and whether the labels agree.]

30 Weak Matchings A partial matching is a weak matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node –Every edge constraint is weakly satisfied

A Weak Matching [Figure: the query matched against the movie database; variable u is mapped to NULL, and the edges that are weakly satisfied are highlighted.]

32 [Figure: the four cases for mapping a query edge (x, y).] In a weak matching, all four options are permitted. In an AND matching, only the first three options are permitted.

33 [Figure: the query and the movie database.] Consider the case where edges labeled with Producer are removed. In a weak matching, Node z must be null!

34 “OR” Matchings A partial matching is an OR matching if –The root constraint is satisfied –The reachability constraint is satisfied by every query node that is mapped to a database node

An OR Matching [Figure: the query matched against the movie database; variable u is mapped to NULL, and an edge which is not weakly satisfied is highlighted.]

36 Increasing Level of Incompleteness A complete matching is an AND matching An AND matching is a weak matching A weak matching is an OR matching
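The hierarchy can be made concrete by a function that returns the strictest semantics under which a partial mapping is a matching. This is a sketch that follows the three definitions as stated on the earlier slides; the representation (edge triples, `None` for null), the function name, and the abstract labels a to d are assumptions made for the example.

```python
def classify(query_edges, query_root, db_edges, db_root, m):
    """Return 'complete', 'AND', 'weak', 'OR', or None if m is not a matching."""
    def mapped(x):
        return m.get(x) is not None

    def satisfied(edge):
        u, label, v = edge
        return mapped(u) and mapped(v) and (m[u], label, m[v]) in db_edges

    # Root constraint (required by all semantics).
    if m.get(query_root) != db_root:
        return None
    # Reachability constraint: follow satisfied edges from the query root.
    reached, stack = {query_root}, [query_root]
    while stack:
        u = stack.pop()
        for edge in query_edges:
            if edge[0] == u and edge[2] not in reached and satisfied(edge):
                reached.add(edge[2])
                stack.append(edge[2])
    if any(mapped(x) and x not in reached for x in m):
        return None

    if all(satisfied(e) for e in query_edges):
        return "complete"
    # AND: every incoming edge of a mapped node is satisfied.
    if all(satisfied(e) for e in query_edges if mapped(e[2])):
        return "AND"
    # Weak: every edge is satisfied or has a null end.
    if all(satisfied(e) or not mapped(e[0]) or not mapped(e[2])
           for e in query_edges):
        return "weak"
    return "OR"

Q = [("r", "a", "x"), ("r", "b", "u"), ("x", "c", "z"), ("u", "d", "z")]
DB = {(1, "a", 2), (1, "b", 3), (2, "c", 4), (3, "d", 4)}
DB2 = DB - {(3, "d", 4)}

print(classify(Q, "r", DB, 1, {"r": 1, "x": 2, "u": 3, "z": 4}))
print(classify(Q, "r", DB, 1, {"r": 1, "x": 2, "u": None, "z": None}))
print(classify(Q, "r", DB, 1, {"r": 1, "x": 2, "u": None, "z": 4}))
print(classify(Q, "r", DB2, 1, {"r": 1, "x": 2, "u": 3, "z": 4}))
```

Because each test is checked only after the stricter one fails, the returned label is the first level of the hierarchy that admits the mapping; every later level admits it as well.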

37 Maximal Matchings Matchings are represented as tuples of oid’s and null values A tuple t1 subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values; for example, t1 = (1, 5, 2, null) subsumes t2 = (1, null, 2, null) A matching is maximal if no other matching subsumes it A query result consists of maximal matchings only
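Subsumption and the filtering of non-maximal matchings can be sketched as follows, with Python's `None` standing for null. The function names are chosen for the example, and the quadratic filter is the naive approach, not necessarily the one used by the authors.

```python
def subsumes(t1, t2):
    """True if t1 is obtained from t2 by replacing some nulls with non-null values."""
    return (len(t1) == len(t2)
            and t1 != t2
            and all(b is None or a == b for a, b in zip(t1, t2)))

def maximal(matchings):
    """Keep only the matchings that no other matching subsumes."""
    return [t for t in matchings
            if not any(subsumes(s, t) for s in matchings)]

t1 = (1, 5, 2, None)
t2 = (1, None, 2, None)
t3 = (1, None, 3, None)

print(subsumes(t1, t2), subsumes(t2, t1))
print(maximal([t1, t2, t3]))
```

Here t2 is dropped because t1 subsumes it, while t3 survives because it differs from t1 in a non-null position.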

38 On the Complexity of Computing Queries with Incomplete Answers The size of the result can be exponential in the size of the input (database and query) –Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity

39 Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.

40 The Motivation for Using I/O Complexity Measuring the time complexity with respect to the size of the input does not distinguish between the following two cases: –An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input –An algorithm that does an exponential amount of work even when the query result is small; in this case, either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard

41 I/O Complexity of Query Evaluation (lower bounds are for non-emptiness) Cyclic Query DAG Query Tree Query Path Query Query / Semantics NP-Complete PTIME Complete NP-Complete PTIME AND PTIME Weak PTIME OR Recent Results (PODS’03)

42 Filter Constraints Constraints that filter the results (i.e., the maximal matchings) There are –Weak filter constraints (the constraint is satisfied if a variable in the constraint is null) –Strong filter constraints (all variables must be non-null for satisfaction) Existence constraint: !x is true if x is not null

43 I/O Complexity of Query Evaluation with Existence Constraints (lower bounds are for non-emptiness) Cyclic Query DAG Query Tree Query Path Query Query / Semantics NP-Complete PTIME Complete NP-Complete PTIME AND NP-Complete PTIME Weak NP-Complete PTIME OR

44 I/O Complexity of Query Evaluation with Weak Equality/Inequality Constraints (lower bounds are for non-emptiness) Cyclic Query DAG Query Tree Query Path Query Query / Semantics NP-Complete PTIME Strong NP-Complete PTIME AND NP-Complete PTIME Weak NP-Complete PTIME OR

45 Query Containment Query containment for queries with incomplete answers is defined differently from query containment for queries with complete answers Q1 ⊆ Q2 if, for every database D, every matching of Q1 w.r.t. D is subsumed by a matching of Q2 w.r.t. D Query containment (and query equivalence) is useful for the development of optimization techniques

46 Containment in AND Semantics Homomorphism between the query graphs is necessary and sufficient for containment Deciding whether one query is contained in another is NP-Complete [Figure: two query graphs, Q1 and Q2, with edge labels l1 to l4 and a homomorphism between them; Q1 ⊆ Q2.]

47 Containment in OR Semantics The following is a necessary and sufficient condition for query containment in OR semantics: for every spanning tree T1 of the contained query, there is a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1 Deciding containment is –in Π^P_2 –NP-Complete if the containee is a tree –polynomial if the container is a tree

48 Containment in Weak Semantics Similar to containment in OR Semantics, with the following difference Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments –A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root

49 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

50 Flexible Queries To deal with structural variations in the data, we have developed flexible queries

51 Flexible Queries Rigid Queries Semiflexible Queries Flexible Queries Increasing level of flexibility

52 Movie Database Example A query that finds all pairs of actors that acted in the same movie [Figure: a query with root r, variables x, y, z, and edge labels Actor and Movie.] However, if in the database, actors are descendants of movies, the query has to be reformulated Instead, we propose new ways of matching queries to databases

53 Rigid matchings and complete matchings are the same Returning rigid matchings is the usual semantics for queries (e.g., XQuery, Lorel, XML-QL, etc.)

54 Constraints On Rigid Matchings Root Constraint: Satisfied if the query root is mapped to the db root Edge Constraint: Satisfied if a query edge with label l is mapped to a database edge with label l [Figure: the query root r mapped to database root 1, and a query edge (x, y) with label l mapped to a database edge with label l.]

A Rigid Matching [Figure: a query with root r, variables x, y, and edge labels Actor and Movie, matched against the movie database; one assignment is marked “This is not a Rigid Matching”.]

56 A Semiflexible Matching The query root is mapped to the db root A query node with an incoming label l is mapped to a db node with an incoming label l The image of every query path is embedded in some database path SCC is mapped to SCC [Figure: the query root r mapped to db root 1, and a query edge (x, y) with label l whose end nodes are mapped to db nodes 9 and 11.]

57 A Semiflexible Matching The query root is mapped to the db root A query node with an incoming label l is mapped to a db node with an incoming label l The image of every query path is embedded in some database path SCC is mapped to SCC The last two conditions cannot be verified locally, i.e., by considering one query edge at a time
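For the acyclic case, the conditions above can be sketched in code. The sketch assumes that both the query and the database are DAGs, so the SCC condition is trivial, and it reads "embedded in some database path" as "all image nodes lie on one database path, in any order". In a DAG that holds exactly when the images are pairwise comparable under reachability, which is what the code tests; the cyclic case is not covered, and all names and data are illustrative.

```python
def reach_closure(edges):
    """Map each node to the set of nodes reachable from it (itself included)."""
    succ = {}
    for u, _, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    closure = {}
    for start in succ:
        seen, stack = {start}, [start]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        closure[start] = seen
    return closure

def is_semiflexible(query_edges, query_root, db_edges, db_root, m):
    """Semiflexible test for a total mapping m; assumes acyclic query and db."""
    if m[query_root] != db_root:
        return False
    # Incoming-label condition.
    incoming = {(label, v) for _, label, v in db_edges}
    if any((label, m[v]) not in incoming for _, label, v in query_edges):
        return False
    # Path condition: images of nodes on a common query path are comparable.
    q_reach, d_reach = reach_closure(query_edges), reach_closure(db_edges)
    for u, below in q_reach.items():
        for v in below:
            a, b = m[u], m[v]
            if b not in d_reach[a] and a not in d_reach[b]:
                return False
    return True

# One movie sits below an actor (1 -> 21 -> 30); another sits above (1 -> 11 -> 22).
db_edges = {(1, "Actor", 21), (21, "Movie", 30), (1, "Movie", 11), (11, "Actor", 22)}
query_edges = [("r", "Movie", "x"), ("x", "Actor", "y")]

print(is_semiflexible(query_edges, "r", db_edges, 1, {"r": 1, "x": 30, "y": 21}))
print(is_semiflexible(query_edges, "r", db_edges, 1, {"r": 1, "x": 11, "y": 22}))
print(is_semiflexible(query_edges, "r", db_edges, 1, {"r": 1, "x": 11, "y": 21}))
```

The first mapping pairs movie 30 with actor 21 even though the database stores the movie below the actor; the third is rejected because nodes 11 and 21 lie on different database paths.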

The Semiflexible Matchings [Figure: a query with root r, variables x, y, and edge labels Actor and Movie, matched against the movie database.] We get all the actor-movie pairs

59 [Figure: two queries over variables r, x, y with edge labels Actor and Movie, one with the actor above the movie and one with the movie above the actor.] Under semiflexible semantics, these two queries are equivalent The user does not have to know if movies are above or below actors in the database

Another Example of a Semiflexible Matching [Figure: a query with root r, variables x, y, z, and edge labels Actor and Movie, matched against the movie database.] We get pairs of actors that acted in the same movie It is impossible to get this pair by means of a rigid matching, since the query is a dag and the db is a tree

61 A Flexible Matching The query root is mapped to the db root A query node with an incoming label l is mapped to a db node with an incoming label l An edge is mapped to two nodes on one path Notice that a path in the query is not necessarily mapped to a path in the db [Figure: the query root r mapped to db root 1, and a query edge (x, y) with label l whose end nodes are mapped to db nodes 9 and 11.]
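The three conditions of a flexible matching can be checked one query edge at a time. The following is a sketch under assumed conventions (edge triples, a total dictionary mapping, hypothetical node ids); "on one path" is tested as reachability in either direction.

```python
def reach_closure(edges):
    """Map each node to the set of nodes reachable from it (itself included)."""
    succ = {}
    for u, _, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    closure = {}
    for start in succ:
        seen, stack = {start}, [start]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        closure[start] = seen
    return closure

def is_flexible(query_edges, query_root, db_edges, db_root, m):
    """Flexible test for a total mapping m of query variables to db nodes."""
    if m[query_root] != db_root:
        return False
    incoming = {(label, v) for _, label, v in db_edges}
    reach = reach_closure(db_edges)
    for u, label, v in query_edges:
        a, b = m[u], m[v]
        # Incoming-label condition for the target of the query edge.
        if (label, b) not in incoming:
            return False
        # The two images must lie on one db path, in either direction.
        if b not in reach[a] and a not in reach[b]:
            return False
    return True

# The director is stored below the movie, but the query asks for Director first.
db_edges = {(1, "Movie", 11), (11, "Director", 12), (12, "Name", 31),
            (11, "Title", 21), (1, "Movie", 13), (13, "Title", 22)}
query_edges = [("r", "Director", "x"), ("x", "Name", "y"),
               ("x", "Movie", "z"), ("z", "Title", "v")]

m_ok = {"r": 1, "x": 12, "y": 31, "z": 11, "v": 21}
m_bad = {"r": 1, "x": 12, "y": 31, "z": 11, "v": 22}
print(is_flexible(query_edges, "r", db_edges, 1, m_ok))
print(is_flexible(query_edges, "r", db_edges, 1, m_bad))
```

In `m_ok` the query path r, x, z, v is mapped to nodes 1, 12, 11, 21, which do not all lie on one database path, yet each individual edge does; `m_bad` fails because title 22 belongs to a different movie than node 11.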

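62 An Example of a Flexible Query Query root r, with the following variables: x (edge Director from r): a director; y (edge Name from x): the director name; z (edge Movie from x): a movie of the director; v (edge Title from z): the title of the movie; u (edge Actor from z): an actor in the movie; w (edge Name from u): the name of the actor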

[Figure: the flexible query with variables r, x, y, z, u, v, w matched against the movie database.] A query edge is mapped to two db nodes on one path This flexible matching is neither a rigid matching nor a semiflexible matching

[Figure: a query with root r, variables x, y, and edge labels Producer and Movie, matched against the movie database.] Why are semiflexible matchings sometimes preferred to flexible matchings? In this flexible matching, a producer is given with a movie that he directed but did not produce

[Figure: the same query and the movie database.] In semiflexible semantics, the problem is solved since the image of a query path is embedded in a database path

66 Differences Between the Semiflexible and Flexible Semantics On a technical level, in flexible matchings –Query paths are not necessarily embedded in database paths –SCC’s are not necessarily mapped to SCC’s On a conceptual level, in the semiflexible semantics, nodes are “semantically related” if they are on the same path, and hence –Query paths are embedded in database paths In the flexible semantics, this condition is relaxed: –Query edges are embedded in database paths

67 Increasing Level of Flexibility A rigid matching is a semiflexible matching A semiflexible matching is a flexible matching

68 Verifying that Mappings are Semiflexible Matchings Is a given mapping of query nodes to database nodes a semiflexible matching? –Not as simple as for rigid matchings (no local test, i.e., need to consider paths rather than edges) In a dag query, the number of paths may be exponential –Yet, verifying is in polynomial time In a cyclic query, the number of paths may be infinite –Yet, verifying is in exponential time

69 Verifying that a Mapping is a Semiflexible Matching Cyclic Query DAG Query Tree Query Path Query Query / Database No matchings PTIME Path Database No matchings PTIME Tree Database No matchings PTIME DAG Database coNP PTIME Cyclic Database

70 Input-Output Complexity of Query Evaluation for the Semiflexible Semantics Next slide summarizes results about the input-output complexity –Polynomial for a dag query and a tree database (or simpler cases) Rather difficult to prove, even when the query is a tree, since there is no local test for verifying that mappings are semiflexible matchings –Exponential lower bounds for other cases

71 I/O Complexity for SF Semantics (lower bounds are for non-emptiness) Cyclic Query DAG Query Tree Query Path Query Query / Database Result is empty PTIME Path Database Result is empty PTIME Tree Database Result is empty NP-Complete DAG Database NP-Hard (in  P 2 ) NP-Hard (in  P 2 ) NP-Complete Cyclic Database Data Complexity is Polynomial in all Cases

72 Query Evaluation for the Flexible Semantics The database is replaced with a relationship graph which is a graph, such that –The nodes are the nodes of the database –Two nodes are connected by an edge if there is a path between them in the database (the direction of the path is unimportant) The query is evaluated under rigid semantics w.r.t. the relationship graph
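The relationship graph described here can be built from the reachability closure of the database. This is a sketch: edges are (source, label, target) triples with hypothetical node ids, and the relationship graph is returned as a set of unordered node pairs, since the slide states that the direction of the path is unimportant. How edge labels carry over to the relationship graph is not stated on the slide, so labels are left out.

```python
def relationship_graph(db_edges):
    """Unordered pairs of distinct nodes that are connected by a directed db path."""
    succ = {}
    for u, _, v in db_edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    pairs = set()
    for start in succ:
        seen, stack = {start}, [start]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        # Every node reachable from `start` is related to `start`.
        pairs.update(frozenset((start, other)) for other in seen if other != start)
    return pairs

db_edges = {(1, "Movie", 11), (11, "Actor", 21), (1, "Movie", 12)}
rel = relationship_graph(db_edges)
print(sorted(tuple(sorted(p)) for p in rel))
```

Nodes 11 and 12 are both movies under the root, but neither reaches the other, so they are not connected in the relationship graph.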

73 I/O Complexity of Query Evaluation for the Flexible Semantics Results follow from a reduction to query evaluation under the rigid semantics Tree query –Input-Output complexity is polynomial DAG query –Testing for non-emptiness is NP-Complete

74 Query Containment Q1 ⊆ Q2 if, for every database D, the set of matchings of Q1 w.r.t. D is contained in the set of matchings of Q2 w.r.t. D We assume that –Both queries have the same set of variables

75 Complexity of Query Containment Under the semiflexible semantics, Q1 ⊆ Q2 iff the identity mapping from the variables of Q2 to the variables of Q1 is a semiflexible matching of Q2 w.r.t. Q1 Thus, containment is –in coNP when Q1 is a cyclic graph and Q2 is either a dag or a cyclic graph –in polynomial time in all other cases Under the flexible semantics, query containment is always in polynomial time

76 Database Equivalence D1 and D2 are equivalent if, for all queries Q, the set of matchings of Q w.r.t. D1 is equal to the set of matchings of Q w.r.t. D2 Both databases must have the same set of objects and the same root

77 Complexity of Database Equivalence For the semiflexible semantics, deciding equivalence of databases is –in polynomial time if both databases are dags –in coNP if one of the databases has cycles For the flexible semantics, deciding equivalence of databases is polynomial in all cases

78 Database Transformation [Figure: two movie databases rooted at MDB, with Actor and Movie edges over the same objects (Dustin Hoffman, Harrison Ford, Mark Hamill, Hook, Star Wars); the first is a DAG, the second is a tree.] The databases are equivalent under both the flexible and semiflexible semantics. A DAG has become a TREE!

79 Transforming a Database into a Tree Reasons for transforming a database into an equivalent tree database: –Evaluation of queries over a tree database is more efficient –In a graphical user interface, it is easier to represent trees than DAGs or cyclic graphs –Storing the data in a serial form (e.g., XML) requires no references

80 Transformation into a Tree There are algorithms for –Testing if a database can be transformed into an equivalent tree database, and –Performing the transformation For the semiflexible semantics –The algorithms are polynomial For the flexible semantics –The algorithms are exponential

81 Implementing Flexible Queries Flexible queries were implemented in SQL4X In an SQL4X query, relations and XML documents are queried simultaneously A query result can be either a relation or an XML document

82 An SQL4X Query
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of z
[Figure: query graph with root r, variables x, y, z, v and edge labels Director, Name, Movie, Title.]
A query under the Flexible Semantics

83 An SQL4X Query
QUERY AS RELATION
SELECT text(y) as director, text(v) as title
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of x
WHERE text(v) = 'Star Wars'
[Figure: query graph with root r, variables x, y, z, v and edge labels Director, Name, Movie, Title.]
A query under the Flexible Semantics. Constraints can be added

84 An SQL4X Query
QUERY AS RELATION
SELECT text(x) as director, text(v) as title, Budget
FROM x Director of 'MDB.xml', y Name of x, z Movie of x, v Title of x, FilmBudgets
WHERE text(v) = FilmBudgets.Title
[Figure: query graph with root r, variables x, y, z, v and edge labels Director, Name, Movie, Title.]
[Table: FilmBudgets, a relation with data about film budgets, with columns Title and Budget.]
A query under the Flexible Semantics. Relations and XML Documents can be queried simultaneously

85 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

86 Combining the Paradigms In oblivious querying: –The user does not have to know where data is incomplete –The user does not have to know the exact structure of the data The paradigm of flexible queries and the paradigm of queries with incomplete answers should be combined

87 Flexible Queries with Incomplete Answers A flexible query w.r.t. a database is actually a rigid query w.r.t. the relationship graph. Evaluating a query under AND-semantics (or weak semantics, or OR-semantics) w.r.t. the relationship graph yields a flexible query that returns maximal answers rather than complete answers

Movie Database [Figure: the Movie Database graph (movies Star Wars and Hook, actors Harrison Ford, Mark Hamill and Dustin Hoffman, director Steven Spielberg, producer George Lucas) next to a query graph with root r and variables x, y, z, u, v, w.] Consider the case where Node 25 and Node 33 are removed

Movie Database [Figure: the Movie Database graph without Node 25 and Node 33, next to the query graph with root r and variables x, y, z, u, v, w; the figure carries the labels u, NULL, w.] A flexible matching which is also an incomplete (maximal) matching

90 Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

91 Full Disjunction Intuitively, the full disjunction of a given set of relations is the join of these relations that does not discard dangling tuples Dangling tuples are padded with nulls Only maximal tuples are retained in the full disjunction (as in the case of QwIA)

92 Movies
  m-id  title       year  language
  1     Zelig       1983  English
  2     Antz        1998  English
  3     Armageddon  1998  English
  4     Fantasia    1940  English
Actors
  a-id  name           date-of-birth
  1     Woody Allen    1/12/1935
  2     Bruce Willis   19/3/1955
  3     Julia Roberts  28/10/1967
Acted-in
  a-id  m-id  role
  1     1     Zelig
  1     2     Z
  2     3     Harry
Actors-that-Directed
  a-id  m-id
  1     1
The Full Disjunction of the Given Relations
  m-id  title       year  language  a-id  name           date-of-birth  role
  1     Zelig       1983  English   1     Woody Allen    1/12/1935      Zelig
  2     Antz        1998  English   1     Woody Allen    1/12/1935      Z
  3     Armageddon  1998  English   2     Bruce Willis   19/3/1955      Harry
  4     Fantasia    1940  English   null  null           null           null
  null  null        null  null      3     Julia Roberts  28/10/1967     null

93 The Full Disjunction of the Given Relations [The slide repeats the Movies relation and the full disjunction from slide 92.]
The full disjunction does not include subsumed tuples
  m-id  title  year  language  a-id  name  date-of-birth  role
  1     Zelig  1983  English   null  null  null           null
This tuple will not be in the full disjunction

94 The Full Disjunction of the Given Relations [The slide repeats the relations Movies, Actors, Acted-in and Actors-that-Directed and the full disjunction from slide 92.]
The full disjunction does not include tuples that are based on Cartesian Product rather than join
  m-id  title     year  language  a-id  name           date-of-birth  role
  4     Fantasia  1940  English   3     Julia Roberts  28/10/1967     null

95 In the Full Disjunction of a Given Set of Relations: Every tuple of the input is a part of at least one tuple of the output Tuples are joined as in a natural join, padded with null values The result includes only “maximal connected portions”
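These properties can be checked on the movie example with a brute-force sketch. It is not the polynomial algorithm mentioned in this talk: it enumerates every choice of at most one tuple per relation, so it is exponential in the number of relations. It assumes the usual reading of the definition, namely that a set of tuples contributes an output tuple when the tuples are pairwise join consistent and connected through shared attributes, and that subsumed results are then dropped. Nulls are represented by `None`, and the function names are hypothetical.

```python
from itertools import combinations, product

movies = [
    {"m-id": 1, "title": "Zelig", "year": 1983, "language": "English"},
    {"m-id": 2, "title": "Antz", "year": 1998, "language": "English"},
    {"m-id": 3, "title": "Armageddon", "year": 1998, "language": "English"},
    {"m-id": 4, "title": "Fantasia", "year": 1940, "language": "English"},
]
actors = [
    {"a-id": 1, "name": "Woody Allen", "date-of-birth": "1/12/1935"},
    {"a-id": 2, "name": "Bruce Willis", "date-of-birth": "19/3/1955"},
    {"a-id": 3, "name": "Julia Roberts", "date-of-birth": "28/10/1967"},
]
acted_in = [
    {"a-id": 1, "m-id": 1, "role": "Zelig"},
    {"a-id": 1, "m-id": 2, "role": "Z"},
    {"a-id": 2, "m-id": 3, "role": "Harry"},
]
directed = [{"a-id": 1, "m-id": 1}]


def join_consistent(t1, t2):
    """Two tuples are join consistent if they agree on all shared attributes."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())


def connected(tuples):
    """True if the tuples form one component when linked by shared attributes."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j not in seen and tuples[i].keys() & tuples[j].keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


def full_disjunction(relations):
    """Brute-force full disjunction; returns a list of dicts padded with None."""
    attrs = sorted({a for rel in relations for t in rel for a in t})
    candidates = set()
    # Pick at most one tuple from each relation (None means "skip this relation").
    for choice in product(*[[None] + rel for rel in relations]):
        chosen = [t for t in choice if t is not None]
        if not chosen or not connected(chosen):
            continue
        if not all(join_consistent(a, b) for a, b in combinations(chosen, 2)):
            continue
        merged = {}
        for t in chosen:
            merged.update(t)
        candidates.add(tuple(merged.get(a) for a in attrs))

    def subsumed(small, big):
        return small != big and all(s is None or s == b for s, b in zip(small, big))

    maximal = [c for c in candidates if not any(subsumed(c, o) for o in candidates)]
    return [dict(zip(attrs, row)) for row in maximal]


rows = full_disjunction([movies, actors, acted_in, directed])
```

On this data the sketch keeps five tuples. Fantasia and Julia Roberts appear only in null-padded tuples of their own, because their relations share no attribute and so they are never combined with each other.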

96 Motivation for Full Disjunctions Full disjunctions were proposed by Galindo-Legaria as an alternative to outerjoins [SIGMOD’94]. Rajaraman and Ullman suggested using full disjunctions for information integration [PODS’96]

97 Computing Full Disjunctions for γ-acyclic Relation Schemas Rajaraman and Ullman have shown how to evaluate the full disjunction by a sequence of natural outerjoins when the relation schemas are γ-acyclic Hence, the full disjunction can be computed in polynomial time, under input-output complexity, when the relation schemas are γ-acyclic

98 Weak Semantics Generalizes Full Disjunctions Relations can be converted into a semistructured database. The full disjunction can be expressed as the union of several queries that are evaluated under weak semantics. We have developed an algorithm that uses this generalization to compute full disjunctions in polynomial time under I/O complexity, even when the relation schemas are cyclic

99 Generalizing Full Disjunctions In a full disjunction, tuples are joined according to equality constraints as in a natural join (or equi-join) We can generalize full disjunctions to support constraints that are not merely equality among attributes

100 Example
Movies (m-id, title, year, language, location)
Actors (a-id, name, date-of-birth)
Acted-in (a-id, m-id, role)
Actors-that-Directed (a-id, m-id)
Historical-Events (name, date, description)
Historical-Sites (Country, State, City, Site)
The date of the historical event is a date in the year when the movie was released. The filming location is near the historical site

101 Another Way of Generalizing Full Disjunctions: Use OR-Semantics OR-semantics is used rather than weak semantics when tuples are joined This relaxes the requirement that every pair of tuples should be join consistent Instead, a tuple of the full disjunction is only required to be generated by database tuples that form a connected subgraph, but need not be pairwise join consistent

102 Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction
  e-id  ename       city    dept-no  dept-no  dname  building  city       street
  007   James Bond  London  6        6        MI-6   10        null       null
  null  null        null    null     6        MI-6   10        Liverpool  King

103 Example
Employees (e-id, ename, city, dept-no)
Departments (dept-no, dname, building)
Located-in (building, city, street)
Employee: (007, James Bond, London, 6)
Department: (6, MI-6, 10)
Located-in: (10, Liverpool, King)
The Full Disjunction under OR-Semantics
  e-id  ename       city    dept-no  dept-no  dname  building  city       street
  007   James Bond  London  6        6        MI-6   10        Liverpool  King
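The difference between the two examples can be checked with a small sketch. It assumes that "connected" means the tuples can be linked through pairs that share an attribute and are join consistent. That reading is consistent with the slides, but the helper names are hypothetical and the sketch does not compute the full disjunction itself.

```python
from itertools import combinations

employee = {"e-id": "007", "ename": "James Bond", "city": "London", "dept-no": 6}
department = {"dept-no": 6, "dname": "MI-6", "building": 10}
located_in = {"building": 10, "city": "Liverpool", "street": "King"}


def join_consistent(t1, t2):
    """Two tuples are join consistent if they agree on all shared attributes."""
    return all(t1[a] == t2[a] for a in t1.keys() & t2.keys())


def pairwise_consistent(tuples):
    """Requirement when tuples are joined under weak semantics."""
    return all(join_consistent(a, b) for a, b in combinations(tuples, 2))


def connected_by_joins(tuples):
    """Assumed OR-semantics requirement: the tuples form a connected graph
    whose edges are pairs that share an attribute and agree on it."""
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(tuples)):
            if j in seen:
                continue
            shares = tuples[i].keys() & tuples[j].keys()
            if shares and join_consistent(tuples[i], tuples[j]):
                seen.add(j)
                stack.append(j)
    return len(seen) == len(tuples)


all_three = [employee, department, located_in]
weak_ok = pairwise_consistent(all_three)   # False: London vs. Liverpool on city
or_ok = connected_by_joins(all_three)      # True: linked via dept-no and building
```

The three tuples fail the pairwise test because Employees and Located-in disagree on `city`, which is why the ordinary full disjunction splits them into two tuples. They pass the connectivity test through Departments, which is why OR-semantics produces a single tuple.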

104 Information Integration from Heterogeneous Sources [Figure: diagram with the labels Data Source, Query (three times), Relation (three times) and Integrated Relation.]

105 [Figure: diagram with the labels Data Source, Query (three times), Relation (three times) and Integrated Relation.] We use queries that combine flexible semantics and weak semantics: –The queries are insensitive to changes in the data –The queries are easy to formulate

106 [Figure: diagram with the labels Data Source, Query (three times), Relation (three times) and Integrated Relation.] The integration of the relations is done with a full disjunction of the computed relations

107 Conclusion Flexible and semiflexible queries facilitate easy and intuitive querying of semistructured databases –Querying the database even when the user is oblivious to the structure of the database –Queries are insensitive to variations in the structure of the database

108 Conclusion (continued) Queries in AND semantics, OR semantics or weak semantics facilitate easy and intuitive querying of incomplete databases –Querying the database even when the user is oblivious to missing data –Queries return maximal answers rather than complete answers

109 Conclusion (continued) The two paradigms of flexible queries and queries with maximal answers can be combined The combination of the paradigms can facilitate integration of information from heterogeneous sources

110 Conclusion (continued) Full disjunctions can be computed using queries in weak semantics Full disjunctions can be generalized so that relations are joined using constraints that are not merely equality constraints

111 Thank You Questions?