Answering Queries Using Views: A Survey

Paper by Alon Halevy Presentation by Rachel Pottinger

Reminders A view is a stored query
A Datalog query example: q(code):- Airport(code, city), Feature(city, “Beach”) Find all airport codes of cities that have beaches

Answering Queries Using Views – basic definition
Answer a query using a view rather than using the underlying base table Query: q(code):- Airport(code, city), Feature(city, POI) View: feature-code(code,POI):- Airport(code, city), Feature(city,POI) Rewriting using view: q(code):-feature-code(code,POI)
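To make the definition concrete, here is a minimal Python sketch that evaluates the query once over the base relations and once over the view. The tuples are made-up toy data and the function names are hypothetical; only the three Datalog rules come from the slide.

```python
# Toy base relations (invented data, for illustration only).
airport = {("YVR", "Vancouver"), ("HNL", "Honolulu"), ("DEN", "Denver")}
feature = {("Vancouver", "Beach"), ("Honolulu", "Beach"), ("Denver", "Skiing")}


def q_base():
    # q(code) :- Airport(code, city), Feature(city, POI)
    return {code for (code, city) in airport
            for (c2, _poi) in feature if city == c2}


def feature_code():
    # feature-code(code, POI) :- Airport(code, city), Feature(city, POI)
    return {(code, poi) for (code, city) in airport
            for (c2, poi) in feature if city == c2}


def q_rewritten():
    # q(code) :- feature-code(code, POI)
    return {code for (code, _poi) in feature_code()}


print(sorted(q_base()))
print(sorted(q_rewritten()))
```

Both calls return the same set of codes on this data, because the view keeps every column the query needs.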

Two distinct problems:
Query optimization Data integration Physical Data Independence

AQUV in Query Optimization Goals
Speed Query Processing Still need exact answers

AQUV in Query Optimization: Closed World Assumption
Views are complete. Think of the view definition as “if and only if”: feature-code(code, POI):- Airport(code, city), Feature(city, POI) retrieves every airport code paired with every point of interest in that airport's city. How do we know this holds? It is given as part of the problem; you can’t tell from the view definition.

AQUV in Query Optimization: Looking for Equivalent Rewritings
Rewritings must be equivalent: the rewritten query must retrieve exactly the same answers as the original query.
Equivalent example: Query: q(code):- Airport(code, city), Feature(city, POI) View: feature-code(code, POI):- Airport(code, city), Feature(city, POI) Equivalent rewriting: q(code):- feature-code(code, POI)
Non-equivalent example: Same query. View: beach-code(code):- Airport(code, city), Feature(city, “Beach”) Non-equivalent (contained) rewriting: q(code):- beach-code(code)
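The difference between an equivalent and a contained rewriting can be checked on a small instance. The sketch below uses invented tuples and a hypothetical helper `join`; it only illustrates that the beach-only view returns a strict subset of the query's answers on this data.

```python
# Toy base relations (invented data, for illustration only).
airport = {("HNL", "Honolulu"), ("DEN", "Denver")}
feature = {("Honolulu", "Beach"), ("Denver", "Skiing")}


def join(keep=lambda poi: True):
    # Airport(code, city), Feature(city, POI), optionally filtered on POI.
    return {(code, poi) for (code, city) in airport
            for (c2, poi) in feature if city == c2 and keep(poi)}


# q(code) :- Airport(code, city), Feature(city, POI)
query_answers = {code for (code, _poi) in join()}

# beach-code(code) :- Airport(code, city), Feature(city, "Beach")
beach_code = {code for (code, _poi) in join(lambda poi: poi == "Beach")}

# q(code) :- beach-code(code)
contained_answers = set(beach_code)

print(sorted(query_answers), sorted(contained_answers))
```

Every answer of the rewriting is an answer of the query, but Denver's airport is lost because the view only kept beach cities.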

AQUV in Query Optimization: Can still access base relations
Can access views and base relations Ex: Query: q(code, URL):- Airport(code, city), Feature(city,POI), Webinfo(POI, URL) View: feature-code(code,POI):- Airport(code, city), Feature(city, POI) Rewriting: q(code,URL):-feature-code(code,POI), Webinfo(POI, URL)

AQUV in Query Optimization: General Algorithm
Fold into System-R style optimizer It’s just another access path

AQUV in Query Optimization: Discussion
Imagine that you're building a query optimizer. Would you consider it worthwhile to use views when answering queries? Why or why not? Would you try it only for certain kinds of queries or situations? Which ones? Paraphrased from a couple of discussion questions

AQUV in Data Integration: Example: Planning a Beach Vacation

Potential Data Integration Architecture: Local-As-View (LAV)
User Query Mediated Schema Local Schema 1 Local Schema N The user asks queries over the mediated schema. These queries are translated into queries over the local sources. The data is translated back into the mediated schema, and then this data is returned to the user. Local Database 1 Local Database N Expedia Orbitz Local sources are views on mediated schema

Local As View (LAV)
LAV: each local source is a materialized view over the mediated schema.
Mediated Schema: Airport(code, city) Feature(city, attraction)
Local Sources/Views: CAA-Air(code, city) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”)
The source Beaches is related to the mediated schema by a more complicated view than CAA-Air. It describes information that the mediated schema represents with two relations, Airport and Feature. Each relation occurrence in the query is referred to as a subgoal. By repeating the names of variables, we’re equating their values; in this case this means that the cities must be the same. In SQL terms, we’re performing a join on the city values.

Local As View (LAV)
LAV: each local source is a materialized view over the mediated schema.
Mediated Schema: Airport(code, city) Feature(city, attraction)
Local Sources/Views: CAA-Air(code, city) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”)
Adding new sources is easy. Rewriting queries is NP-complete.
Finally, a few more definitions. Distinguished variables are those that appear in the head: the ones the user sees or, in the case of local-as-view, the ones the local source returns to the user. The variables that are not returned are the existential variables. Semantically, an existential variable means that the user knows that some value exists, but that value is not returned. In the Beaches view definition, the user knows that if a code is returned there is *some* city that both has that airport code and has a beach, but which city it is remains unknown. Finally, we have constants, which are written in quotation marks.
Putting this all together, we’re asking for airport codes of cities that have beaches. In SQL terms we’re doing a join on the cities of Airport and Feature, a selection on attractions that are beaches, and a projection on codes.
The advantage of this architecture is that adding new local sources to the data integration system is very easy. For example, if we wished to add a new source about airfares, say Travelocity, we need only create a new view definition describing the source in terms of the mediated schema. On the down side, recall that the mediated schema, which the user is asking queries over, has no data; in order to answer a user query, the query over the mediated schema must be translated into a query over the local sources. This process turns out to be NP-complete in the number of query subgoals, and next I’ll show you why.
NP stands for nondeterministic polynomial time, i.e., likely exponential time in practice: we have to try all combinations. We’ll see how this works on the next slide.

AQUV in Data Integration: Assumptions
Open World Assumption: each source only has some of the tuples. Read the view definition as “if, then” rather than “if and only if”. Example: Fodors(city, POI) :- Feature(city, POI) says that Fodors has some Features, not necessarily all of them. This is an assumption; you can’t tell from the view definition.
Can’t access base relations.
May not be able to find an equivalent rewriting.

AQUV in Data Integration: Maximally Contained Rewritings
Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”)
Sources/Views: CAA-Air(code, city) :- Airport(code, city) Fodors(city, POI) :- Feature(city, POI)
Rewriting: Dest(code):- CAA-Air(code, city), Fodors(city, “Beach”)
Maximally Contained Rewriting: all answers to Rewriting are a subset of those of Query, and Rewriting contains all possible answers given the local sources.
To rewrite the queries we rely on something called answering queries using views. Instead of relying on the base relations in which the data is stored, the queries are rewritten to be over materialized views. Here that is exactly what we need: the user’s query over the mediated schema is translated into queries over the local sources where the data is stored. Answering queries using views was first used in query optimization, the problem of making a query faster to execute in traditional database systems. There the goal is to speed up the execution of a query that we already know can be executed. Here the goal is to ensure that we can find answers to the query at all.
For example, suppose that we had the query above, where the goal is to find destination airports for our beach vacation. We need the airport codes of cities that have beaches. Both Airport and Feature are relations in the mediated schema and hence have no data, so we need to rewrite them in terms of the local sources. Suppose that we had two sources: CAA-Air, which gives information about airports, and Fodors, which gives information about features such as beaches. In essence, the goal is to replace each subgoal in the query, which describes what we want but contains no data, with a view, which has the data.
We begin by looking for airport information. Since there is no data associated with the mediated schema, we look to the sources. CAA-Air gives us information about airports, so we can use CAA-Air to cover the Airport subgoal, that is, to give us the data about airports. Similarly, we can use the Fodors source to cover the Feature subgoal. Putting these two pieces together, we can answer our query over the mediated schema by first going to CAA-Air for information about airports and then going to Fodors for information about beaches.
Our goal is to find the maximally contained rewriting: all answers to the rewriting are answers to the query, and the rewriting returns all of the answers that can be obtained from the available local sources. Recall that CAA-Air or Fodors may not contain all information about airports or beaches; we need to try all possible sources of data and all possible combinations. We’ll see an example of containment when we look at a specific algorithm.

Answering Queries Using Views
Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”)
Sources/Views: CAA-Air(code, city) :- Airport(code, city) Fodors(city, POI) :- Feature(city, POI) Sun-Surf(city) :- Feature(city, “Beach”)
Rewriting: Dest(code):- CAA-Air(code, city), Fodors(city, “Beach”) ∪ Dest(code):- CAA-Air(code, city), Sun-Surf(city)
Maximally Contained Rewriting: all answers to Rewriting are a subset of those of Query, and Rewriting contains all possible answers given the local sources.
Suppose that we had another source, Sun-Surf, that also described information about features. It is then not enough to rely on Fodors to tell us about beaches; we must try Sun-Surf as well, because we have no guarantee that either Fodors or Sun-Surf contains all of the information about beaches. Here exactly one source describes Airport, so we need to try both ways of combining a Feature source with it. In general, we must extend this to the Cartesian product of all possible ways of covering the query subgoals. This is what makes the problem NP-complete. Previous algorithms worked in the subgoal-at-a-time fashion that I’ve described here.
How do we find the Maximally Contained Rewriting?
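As a rough illustration of why both rewritings are needed, the following Python sketch evaluates each one over small, deliberately incomplete source instances and takes their union. The tuples are invented; only the source names and rule shapes follow the slide.

```python
# Invented, incomplete source contents (open world: each source has only some tuples).
caa_air = {("HNL", "Honolulu"), ("MIA", "Miami"), ("DEN", "Denver")}
fodors = {("Honolulu", "Beach"), ("Denver", "Skiing")}   # misses Miami's beach
sun_surf = {"Miami"}                                       # misses Honolulu's beach

# Dest(code) :- CAA-Air(code, city), Fodors(city, "Beach")
r1 = {code for (code, city) in caa_air if (city, "Beach") in fodors}

# Dest(code) :- CAA-Air(code, city), Sun-Surf(city)
r2 = {code for (code, city) in caa_air if city in sun_surf}

# The maximally contained rewriting is the union of the two conjunctive rewritings.
dest = r1 | r2

print(sorted(r1), sorted(r2), sorted(dest))
```

On this data each rewriting alone misses an airport that the other one finds, which is the reason every combination has to be tried.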

AQUV in Data Integration: Discussion
There are two assumptions that are made in maximally-contained rewritings: (a) the sources are incomplete and (b) contained rewritings are okay. Are there data integration scenarios where you don’t think that this is true? If so, what? Can you come up with any scenarios where only one of (a) or (b) are true? If so, what?

Naïve Solution: Bucket Algorithm
Created as part of Information Manifold, Levy et al.
Algorithm: Create a bucket for each query subgoal and place all relevant views into the bucket: Q(X):- g1(x1), …, gn(xn). For each element in the cross product of the buckets, check containment.
The bucket algorithm was invented as part of the Information Manifold. It illustrates the general principle that we only need to look at combinations of n view subgoals. The algorithm works in the following manner. Suppose we have a query q with subgoals g1 to gn. First, a bucket is created for each query subgoal. Next, for each bucket, we check which views contain subgoals that have the same name as the query subgoal, and add each such view to the bucket. Finally, we take each element of the cross product of the buckets and check containment. If the candidate rewriting is contained in the original query, it is added to the maximally contained rewriting.
Don’t worry about what containment is for now. Checking it is also NP-complete.
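The two steps described above (fill the buckets, then enumerate the cross product) can be sketched in a few lines of Python. This is a simplified illustration, not the Information Manifold implementation: relevance is decided by predicate name only, as in the notes above, the containment check is left out, and the tuple encoding of atoms and the function name `build_buckets` are assumptions. The views are the CAA-Air, Fodors and Sun-Surf sources used earlier in the deck.

```python
from itertools import product

# Atoms are (predicate, args); quoted strings stand for constants (assumed encoding).
query_body = [("Airport", ("code", "city")), ("Feature", ("city", '"Beach"'))]

views = {
    "CAA-Air": [("Airport", ("code", "city"))],
    "Fodors": [("Feature", ("city", "POI"))],
    "Sun-Surf": [("Feature", ("city", '"Beach"'))],
}


def build_buckets(query_body, views):
    """One bucket per query subgoal, holding every view that mentions its predicate."""
    buckets = []
    for pred, _args in query_body:
        bucket = [name for name, body in views.items()
                  if any(p == pred for p, _ in body)]
        buckets.append(bucket)
    return buckets


buckets = build_buckets(query_body, views)

# Each element of the cross product is one candidate rewriting that would
# still have to pass a containment check before being kept.
candidates = list(product(*buckets))

print(buckets)
print(candidates)
```

The number of candidates is the product of the bucket sizes, which is where the exponential blow-up in the number of query subgoals comes from.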

Subgoal Interaction
The Bucket Algorithm does not recognize interactions:
Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”)
Sources/Views: Orbitz(code) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”) Frommers(city, POI) :- Feature(city, POI)
Bucket would check: Dest'(code):- Orbitz(code), Frommers(city, “Beach”)
Expanding this gets: Dest'(code):- Airport(code, _), Feature(city, “Beach”)
Not all answers to Dest' are answers to Dest (containment fails).
Working a subgoal at a time is faster than trying to do it all at once, but a containment check (which is NP-complete in the size of the query) is needed, and the algorithm can’t figure out these interactions ahead of time.
First, let us look at v1. We can use v1 to map g1, but since we’ve mapped y to b and b is existential, we can never join on its value, and it cannot be used to form a maximally contained rewriting. The bucket algorithm would not discover this and would instead create a bucket entry for it and then try it in combination with every other possible subgoal combination. Secondly, with v2 we can map g1 by mapping x to c and y to d. However, since d is existential, if we ever want to join on y’s value, we must do so using d. The bucket algorithm would not discover this and would instead attempt to combine v2 with any other mapping of g2, even though we can tell a priori that it can’t be used in a maximally contained rewriting unless we also use it to map g2. We’ll see later how MiniCon discovers these facts before the combination phase and what gains it makes because of this.
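For readers who do want to see what the containment check involves, here is a small brute-force sketch in Python. It uses the standard containment-mapping test for conjunctive queries (q1 is contained in q2 if q2's variables can be mapped onto q1's terms so that the head and every subgoal line up); the slides do not spell this out, so treat the encoding and the function names as assumptions. The anonymous variable `_` in the expansion is named `c1` here.

```python
from itertools import product


def is_var(term):
    # Assumed encoding: constants carry their quotation marks, variables do not.
    return not term.startswith('"')


def contained_in(q1, q2):
    """True if every answer of q1 is an answer of q2 (brute-force containment mapping)."""
    head1, body1 = q1
    head2, body2 = q2
    vars2 = sorted({t for _, args in body2 for t in args if is_var(t)})
    terms1 = sorted({t for _, args in body1 for t in args})
    atoms1 = set(body1)
    for image in product(terms1, repeat=len(vars2)):
        mapping = dict(zip(vars2, image))
        substitute = lambda t: mapping.get(t, t)  # constants map to themselves
        if tuple(map(substitute, head2)) != tuple(head1):
            continue
        if all((p, tuple(map(substitute, args))) in atoms1 for p, args in body2):
            return True
    return False


dest = (("code",),
        [("Airport", ("code", "city")), ("Feature", ("city", '"Beach"'))])

# Expansion of Dest'(code) :- Orbitz(code), Frommers(city, "Beach")
bucket_expansion = (("code",),
                    [("Airport", ("code", "c1")), ("Feature", ("city", '"Beach"'))])

print(contained_in(bucket_expansion, dest))  # the join on city is lost
print(contained_in(dest, bucket_expansion))
```

The search tries every assignment of query variables to terms, so its cost grows exponentially with the number of variables; that is the price the bucket algorithm pays for each candidate.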

The MiniCon Algorithm: Phase One [Pottinger & (Ha)Levy: VLDB]
Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”)
Sources/Views: Orbitz(code) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”)
Rewriting: Dest(code) :- Beaches(code)
Create MiniCon Descriptions (MCDs): view subgoals linked by existential variables must be mapped together.
The MiniCon algorithm dramatically reduces the need for a Cartesian product by exploiting a subtlety in mapping query subgoals to views. Depending on the view, we may be able to tell before we get to the combination phase, that Cartesian product, that some views must cover more than one query subgoal at a time or cannot be used to answer the query at all. For example, suppose that we had the same query and two different sources. Let’s look at the Orbitz source. It gives information about airports, but city is existential. That is, the source returns codes for airports that have cities, but not the cities themselves. But the query joins on the value of city; it requires the same value in Feature that it does in Airport, because we need the city to have a beach. Since the Orbitz source doesn’t return the value of the city, we can’t join on its value, and no matter what source we use to get information about features, we’ll never get a rewriting. That means that we can completely discard this source now, before we get to the stage where we try to combine it with other sources.
What a difference Datalog makes! What happens when we need two views?
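The phase-one test from the notes above can be sketched as follows. This is a heavily simplified illustration and not the MiniCon implementation: it assumes the views use the same variable names as the query (as they do on this slide), so no variable mapping is computed, and it forms at most one MCD per view. The function name `subgoals_covered` and the tuple encoding are hypothetical.

```python
query_head = ("code",)
query_body = [("Airport", ("code", "city")), ("Feature", ("city", '"Beach"'))]

# view name -> (head variables, body); variable names match the query (assumption).
views = {
    "Orbitz": (("code",), [("Airport", ("code", "city"))]),
    "Beaches": (("code",), [("Airport", ("code", "city")),
                            ("Feature", ("city", '"Beach"'))]),
}


def subgoals_covered(view_name):
    """Indexes of the query subgoals the view covers as one MCD, or None if unusable."""
    head, body = views[view_name]
    covered = {i for i, atom in enumerate(query_body) if atom in body}
    if not covered:
        return None
    for i in covered:
        for term in query_body[i][1]:
            if term.startswith('"') or term in head:
                continue  # constant, or returned by the view: no restriction
            # The term is existential in the view: every query subgoal that
            # uses it must be covered by this same view, and the query must
            # not need to return it.
            needed = {j for j, (_p, args) in enumerate(query_body) if term in args}
            if term in query_head or not needed <= covered:
                return None
    return covered


print(subgoals_covered("Orbitz"))
print(subgoals_covered("Beaches"))
```

Orbitz is rejected because it hides city while covering only the Airport subgoal; Beaches is kept because it covers both subgoals that share city.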

MiniCon Algorithm Phase Two: Combine MCDs with non-overlapping subgoals
Query: Dest(code) :- Airport(code, city), Feature(city, “Beach”), Flight(“YVR”, code, airline, number)
Sources/Views: Orbitz(code) :- Airport(code, city) Beaches(code) :- Airport(code, city), Feature(city, “Beach”) Expedia(orig, dest) :- Flight(orig, dest, airline, number)
Rewriting: Dest(code) :- Beaches(code), Expedia(“YVR”, code)
Fewer combinations. No explicit containment check.
In phase two of the algorithm we combine the MCDs, each of which covers the subgoals that MUST be mapped together. For example, if we have the same query as before, only now we want destinations that we can reach with a single flight from Vancouver, we might adjust our query to look for direct flights from YVR, as shown in the additional subgoal. Again we can discard the Orbitz source easily. We also know that if we’re going to use the Beaches source to cover subgoal one, we must use it to cover subgoal two, since they’re linked by an existential view variable. We can then determine that the new Expedia source can cover the Flight subgoal, since the only query variable in it that’s used in any other query subgoal is code, and code is mapped to the distinguished view variable dest. Putting the two together, we get a rewriting that covers all of the query subgoals.
In general, ensuring that each subgoal is covered is not enough to ensure that the rewriting is contained in the query. This can be resolved by using another NP-complete algorithm to build a containment mapping. However, it turns out that by being very careful in our construction of the MCDs, we can avoid the need for a containment check by only combining MCDs that cover disjoint sets of subgoals. Hence, by being careful in creating the MCDs, there are many fewer combinations to try, and by taking advantage of the work needed to create the MCDs anyway, we can avoid an explicit containment check.
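The combination step can be sketched as picking sets of MCDs whose covered subgoals are pairwise disjoint and together include every query subgoal. In this Python sketch the MCDs are written down by hand from the discussion above (Beaches covers subgoals 0 and 1, Expedia covers subgoal 2, Orbitz produced no MCD); the function name `combine` and the list encoding are assumptions, and the variable mappings that real MCDs carry are omitted.

```python
from itertools import combinations

# (view name, query subgoals covered); indexes follow the order in the query body.
mcds = [
    ("Beaches", frozenset({0, 1})),
    ("Expedia", frozenset({2})),
]


def combine(mcds, n_subgoals):
    """Return every set of MCDs that covers all subgoals without overlap."""
    all_goals = frozenset(range(n_subgoals))
    rewritings = []
    for size in range(1, len(mcds) + 1):
        for combo in combinations(mcds, size):
            covered = [g for _name, goals in combo for g in goals]
            disjoint = len(covered) == len(set(covered))
            if disjoint and frozenset(covered) == all_goals:
                rewritings.append(tuple(name for name, _goals in combo))
    return rewritings


print(combine(mcds, 3))
```

Because only disjoint covers are considered, far fewer combinations are enumerated than in the bucket cross product, and no separate containment check is run on each result.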

AQUV Algorithms: Discussion
Does the computational complexity of these problems surprise you? Do they seem harder or easier than expected? How would you scale the complexity of each of the algorithms presented in terms of the completeness of the algorithms?

What happened then?

Schema mappings Where do those mappings come from? What do they look like?

Peer Data Management Systems
Rather than have a centralized authority, make things distributed

Model Management Most metadata applications are redone from scratch every time. It would be nice to have an algebra (like relational algebra) only on the schema level so that these algorithms could be reused

Data Spaces Pay as you go data integration

Industry: Data Integration  Enterprise Information Integration
Challenges: Scale up and performance Horizontal (general) vs. vertical (solving entire problem) Integration with EAI and other middleware But did make it
