1
Supporting Ranking in Queries Score-based Paradigm
Russell Greenspan CS 411 Spring 2004
2
Supporting Ranking in Queries Talk Outline
What
Why
How
“Out-of-the-box” support
“Smart” top-k processing
Theme

Introduction (Slides 2-8)
Slides 2-8 introduce the topic of ranking in queries by answering the questions “what”, “why”, and “how”.

What (Slides 3-5)
Here we seek to introduce our audience to the mechanism of ranking in queries and queries that return top-k results. We present both intuitive and formal definitions of ranked queries. We then ask the audience to consider how this differs from traditional queries (the potential responses provided are fairly obvious). You might ask for a formal definition of a traditional query and how that definition compares to the formal definition of a ranked query.

Why (Slide 6)
Here we provide a few applications for ranking in queries. The idea is to differentiate the “what”, which is a more objective look at the technique, from the “why”, which is a subjective consideration, i.e. how can we apply this technique in the real world?

How (Slides 7-8)
Now that the audience is sold on the need for the technique, we address implementation issues. How can we perform ranked queries in an RDBMS? Slide 7 asks the audience to consider how an RDBMS might return top-k results using native algorithms that do not extend consideration to our overall goal (returning only k tuples). We then ask the audience to think of ways to improve on this technique, and present many of the ideas from the research community in Slide 8.
3
Ranking in Queries What is ranking in queries?
A mechanism to return only the top-k results
Closest matches to user-specified boolean criteria
Scoring results based on user-specified predicates
SELECT Address FROM HousesForSale ORDER BY Best(Size, Price)
Express similarity, relevance, or preference to a given query
4
What is ranking in queries? Definitions
Intuitive: Output an ordered list of k items such that the list includes only those items whose score is greater than that of every item not included
Formal: “Given retrieval size k and scoring function F, a ranked query returns a list K of k objects (i.e. |K| = k) with query scores, sorted in a descending order, such that F(t1, ..., tn) [u] > F(t1, ..., tn) [v] for all u in K and all v not in K.” [Chang, Hwang, 2002]
5
What is ranking in queries? Differences from traditional queries
How does this differ from traditional queries?
Traditional queries:
Do not stop processing until all results are computed
Do not focus on ranking tuples to best match the input query
Traditional boolean queries:
Do not return “close” matches
Can “over” or “under” match, producing too many or too few results
6
Ranking in Queries Why use ranking in queries?
Exact matches not required
Oftentimes something “close enough” satisfies a user’s demands
Fuzzy matches desired
Multimedia/image matching, where the very nature of the query does not involve an exact match
Avoid unnecessary computations
Find the “best” answers quickly as opposed to all answers
7
Ranking in Queries How do we execute ranked queries?
“Out-of-the-box” support
Perform query as any other, then sort and return only the first k rows
Why is this bad?
Lots of unnecessary processing
Waste of resources on intermediate results
If the scoring function is expensive, could result in computation of unneeded scores
Can we do better?
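Before looking at smarter techniques, it helps to pin down what the naive plan computes. The sketch below (illustrative Python, not taken from any of the cited papers; the table and scoring function are made up) scores every row, sorts the entire result, and only then keeps k rows — exactly the wasted work described above.

def naive_topk(rows, score, k):
    # Scores every row and materializes a full sort, even though only the
    # first k rows are ever returned.
    return sorted(rows, key=score, reverse=True)[:k]

# e.g. top 10 houses by an (assumed) combined size/price score:
# naive_topk(houses, score=lambda h: 0.5 * h['size_score'] + 0.5 * h['price_score'], k=10)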
8
How do we execute ranked queries? “Smart” Ranked Query Execution
Query Processing
Try to achieve a significant reduction in query execution time
Use mid-query (i.e. as the query executes) techniques to optimize the query plan for top-k results
Consider the minimal number of tuples necessary to return k results
Scoring Predicate
Consider the expense of the scoring function in determining the optimal query plan

Implementation Techniques (Slides 9-47)

Introduction (Slides 9-10)
Here we break down the objectives described in the ‘How’ slides in terms of the research that has been done. There are basically two fairly distinct focuses, and it is important that the audience understand the distinction. On the one hand, some researchers have focused on the operation of top-k queries by introducing new middleware techniques or new query operators. These operators are interjected into existing query plans (or form the basis for new query plans) and significantly reduce the response time for top-k queries. Researchers make “Smart” use of the fact that our end goal is just k tuples, and describe techniques for examining fewer tuples at each subsection of the query plan. On the other hand, some research has focused on the ranking function itself. Here, researchers seek to reduce the expense of calls to the ranking function, or describe new ranking techniques for “Smart” top-k ranked queries. The ‘Research and Techniques’ slide (Slide 10) provides a roadmap for all the research we will discuss during our talk. This slide might be returned to throughout the presentation (or duplicated with the technique we are up to in bold font), so that the audience keeps a clear understanding of the overall picture.

Techniques (Slides 11-47)

Garlic [Fagin, 1999] (Slides 11-16)
Because it is relatively easy to digest, we use Garlic as the first example of “Smart” top-k query processing. The technique centers on combining two values from different subsystems to compute an overall score for an object. Ask the audience if they can think of any mechanisms to avoid computing the score for every object. We then use a custom example to show the benefit of Fagin’s A0 algorithm. First, we show the inefficient technique of computing the score for every object (Slide 14), and then we show how Fagin’s algorithm would have approached the problem (Slide 15). Take the audience through the algorithm step by step.

CHITRA [Nepal, Ramakrishna, 1999] (Slides 17-21)
We next examine an extension to Fagin’s algorithm, Nepal and Ramakrishna’s “Multi-Step” algorithm. We return to the same example and show how the “Multi-Step” algorithm would approach the problem (again, step the audience through). We show (for illustrative purposes) the algorithm at work with two different combinatorial ranking functions (Slide 20). We also describe the authors’ proof of the correctness of their algorithm (Slide 21). We do not feature many proofs in this talk, so it may be worthwhile to spend some time going through this in detail.

STOP Operator [Carey, Kossmann, 1997] (Slides 22-27)
The STOP operator is the first mid-query technique we will examine. We show a custom example of the operator at work in a query plan (Slide 23), and then describe the two heuristics that play into placement of the operator in a query plan (Slides 24-25). The gist of the technique is described on the first slide, and if the audience is not clear, it will be worthwhile to actually compute some comparative statistics for the two query plans in Slide 23.

Probabilistic [Donjerkovic, Ramakrishnan, 1999] (Slides 28-31)
This technique is presented in a more theoretical manner than the previous techniques. The idea here is to explain the fundamental benefit of using RDBMS statistics to simply “know” the top-k tuples at any point in a query plan. Explain to the audience through Slide 29 how this is an extension to the STOP operator, but where the STOP operator requires a SORT, the statistical select does not. Purposefully keep the discussion theoretical, so that in the next technique (which is very similar) we can be more practical.

Statistical [Chaudhuri, Gravano, 1999] (Slides 32-36)
Here we present the techniques for actually performing these “know-the-top-k-tuples” queries. We describe Chaudhuri and Gravano’s method of examining RDBMS histograms, and how another aggressive/conservative (here NoRestarts/Restarts) heuristic plays into the technique (Slide 34). Describe the algorithm on Slide 35 and show how the top-k tuple query is translated into a boolean range query.

MPro [Chang, Hwang, 2002] (Slides 37-42)
MPro is a slight shift in paradigm, as we now move into the research that has focused on the scoring function. We present the MPro algorithm and its supporting theories in some detail. Step the audience through the simple example described in Slide 38, and show how we avoid score calculation when it is not necessary. Then take a real example and go through the MPro algorithm described on Slide 40. It should be clear to the audience how beneficial this is. Follow up by describing the further applications (Slide 41) and highlighting the performance benefits (Slide 42).

AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]
AutoRank is another technique that is quite different from the others, but it should be made clear to the audience how it relates to the overall topic. If the audience is unfamiliar with the IR techniques of TF, IDF, and Cosine Similarity, it will be very worthwhile to give a basic introduction to those techniques. The audience will then get a feeling of “ah-ha” when the IR techniques are extrapolated to the relational world (Slides 44-45). It might be worthwhile to step through the ITA algorithm (Slide 46) with a real example.
9
“Smart” Ranked Query Execution Two Areas of Research Focus
Top-k processing
Reducing the number of tuples considered at each intermediate step
Assume minimal work is necessary to retrieve items sorted by score (i.e. indexes on simple attributes)
Rank function
Reducing the number of calls to the ranking function
Assume rank calculation is expensive
Implementing unusual ranking functions
10
“Smart” Ranked Query Execution Research and Techniques
Reducing number of tuples considered
Middleware/Multimedia
Garlic [Fagin, 1999]
CHITRA [Nepal, Ramakrishna, 1999]
Relational
STOP Operator [Carey, Kossmann, 1997]
Probabilistic [Donjerkovic, Ramakrishnan, 1999]
Statistical [Chaudhuri, Gravano, 1999]
Reducing number of calls to ranking function
MPro [Chang, Hwang, 2002]
Implementing unusual ranking functions
AutoRank [Agrawal, Chaudhuri, Das, Gionis, 2003]
11
“Smart” Ranked Querying (Middleware) – Garlic [Fagin, 1999]
Integrates data from different database systems or non-database data servers
Relational Query Set vs. “Sorted List”
Example: “Return the reddest covers of Beatles albums”
i.e. (Artist = ‘Beatles’) AND (AlbumColor LIKE ‘red’), where Artists are stored relationally and Album colors in a multimedia database
Assign a grade to each object
Boolean: grade is either 0 or 1
Fuzzy: value 0 <= x <= 1 indicating closeness
12
Garlic [Fagin, 1999] Rank Processing Methods
How to combine two fuzzy values to retrieve the top-k objects?
Inefficient
Consider graded sets of all objects by color and shape
Compute the combined score for every object, then output the top k objects
Efficient
Retrieve objects (sorted by grade) from each subsystem until there are at least k of the same objects in each set
Compute the combined score for each of these k objects
13
Garlic [Fagin, 1999] Example Query
Example: (use combined scoring function = x * y)
Return Top 2: Color = ‘red’ AND Shape = ‘round’

Object  Redness        Object  Roundness
A       .2             A       .6
B       .6             B       .8
C       .1             C       .3
D       .8             D       .2
E       .3             E       .9
F       .5             F       .1
G       .9             G       .7
H                      H       .4
14
Garlic [Fagin, 1999] Inefficient vs. Efficient Processing
Calculate the combined score for every object
Sort by score
Return the top k objects: {G, B}

Object  Score
A       .12
B       .48
C       .03
D       .16
E       .27
F       .05
G       .63
H
15
Garlic [Fagin, 1999] Inefficient vs. Efficient Processing
Efficient (Fagin’s A0 algorithm)
Consider ordered members from each set until there are k of the same objects in each set
A1 = {G(.9), D(.8), B(.6)}
A2 = {E(.9), B(.8), G(.7)}
Calculate the combined score for each of the k objects
G = .9 * .7 = .63
B = .6 * .8 = .48
Return these objects ordered by combined score: {G, B}
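A minimal Python sketch of the sorted-access-plus-random-access idea walked through above. Names, data structures, and the combining function are illustrative; note that in Fagin's algorithm every object seen during sorted access is scored before the top k are returned, whereas the slide only walks through the k overlapping objects.

import heapq

def fagin_a0(sorted_lists, grade_lookup, k, combine):
    """sorted_lists: per-subsystem lists of (object, grade), grade descending.
       grade_lookup[i][obj]: random-access grade of obj in subsystem i."""
    seen_in = [set() for _ in sorted_lists]
    seen_any = set()
    depth = 0
    # Sorted access in parallel until k objects have appeared in every list
    while True:
        for i, lst in enumerate(sorted_lists):
            if depth < len(lst):
                obj, _ = lst[depth]
                seen_in[i].add(obj)
                seen_any.add(obj)
        depth += 1
        if len(set.intersection(*seen_in)) >= k or all(depth >= len(l) for l in sorted_lists):
            break
    # Random access to complete the grades of every object seen, then combine
    scored = [(combine([grade_lookup[i][o] for i in range(len(sorted_lists))]), o)
              for o in seen_any]
    return heapq.nlargest(k, scored)

# Data from the example slides (H's redness was not given, so H is omitted here):
redness = {'A': .2, 'B': .6, 'C': .1, 'D': .8, 'E': .3, 'F': .5, 'G': .9}
roundness = {'A': .6, 'B': .8, 'C': .3, 'D': .2, 'E': .9, 'F': .1, 'G': .7, 'H': .4}
desc = lambda d: sorted(d.items(), key=lambda kv: kv[1], reverse=True)
# Prints the top 2, G (approx. .63) then B (.48), after 3 rounds of sorted access
print(fagin_a0([desc(redness), desc(roundness)], [redness, roundness],
               k=2, combine=lambda g: g[0] * g[1]))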
16
Garlic [Fagin, 1999] Conclusions
Why is this more efficient?
Incur the expense of the scoring function k times, as opposed to n times (where n is the total number of items)
Access each subsystem at least k and at most n times, as opposed to n times (again, where n is the total number of items)
17
“Smart” Ranked Querying (Middleware) – CHITRA [Nepal, Ramakrishna, 1999]
Expands on Fagin’s Garlic work by proposing a new “multi-step” processing algorithm
Experimental results show a 50% improvement
18
CHITRA [Nepal, Ramakrishna, 1999] “Multi-step” Algorithm
Consider the next sorted item x from each subsystem i
Perform random access into every other subsystem to obtain the other grades of x
Add an object to the result set if its grade is at least the threshold grade; quit when we have k objects
The threshold is the combined score of the sorted-access grades retrieved in each iteration
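A minimal Python sketch of the multi-step idea just described, under the assumption that each subsystem supports both sorted access and random-access grade lookups and that the combining function is monotonic. The helper names and the exact stopping test are illustrative, not lifted from the CHITRA paper.

def multi_step_topk(sorted_lists, grade_lookup, k, combine):
    """sorted_lists: per-subsystem lists of (object, grade), grade descending.
       grade_lookup[i][obj]: random-access grade of obj in subsystem i
       (assumed to contain every object that gets probed)."""
    n = len(sorted_lists)
    scores = {}                          # object -> combined grade
    for depth in range(max(len(l) for l in sorted_lists)):
        frontier = []                    # sorted-access grades seen this step
        for i, lst in enumerate(sorted_lists):
            if depth >= len(lst):
                continue
            obj, grade = lst[depth]
            frontier.append(grade)
            if obj not in scores:        # random access into the other subsystems
                scores[obj] = combine([grade_lookup[j].get(obj, 0.0) for j in range(n)])
        threshold = combine(frontier)    # best grade a still-unseen object could have
        topk = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
        if len(topk) == k and all(s >= threshold for _, s in topk):
            return topk
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# With the redness/roundness grades from the Garlic example and combine = min,
# this stops after three iterations and returns the top 2: G (.7) and B (.6).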
19
CHITRA [Nepal, Ramakrishna, 1999] Example Query
Back to our example...
Return Top 2: Color = ‘red’ AND Shape = ‘round’

Object  Redness        Object  Roundness
A       .2             A       .6
B       .6             B       .8
C       .1             C       .3
D       .8             D       .2
E       .3             E       .9
F       .5             F       .1
G       .9             G       .7
H                      H       .4
20
CHITRA [Nepal, Ramakrishna, 1999] Example Results for Two Scoring Functions
Consider two scoring functions as examples: min[x, y] and [x * y]

Scoring function min[x, y]:
Iter.  Items                        Grades                                       Threshold          Result set
1      i1 = {G(.9)}, i2 = {E(.9)}   G = min[.9, .7] = .7; E = min[.9, .3] = .3   min[.9, .9] = .9
2      i1 = {D(.8)}, i2 = {B(.8)}   D = min[.8, .2] = .2; B = min[.8, .6] = .6   min[.8, .8] = .8
3      i1 = {B(.6)}, i2 = {G(.7)}   B = min[.6, .8] = .6; G = min[.7, .9] = .7   min[.6, .7] = .6   {G, B}

Scoring function [x * y]:
Iter.  Items                        Grades                                       Threshold          Result set
1      i1 = {G(.9)}, i2 = {E(.9)}   G = .9 * .7 = .63; E = .9 * .3 = .27         .9 * .9 = .81
2      i1 = {D(.8)}, i2 = {B(.8)}   D = .8 * .2 = .16; B = .8 * .6 = .48         .8 * .8 = .64
3      i1 = {B(.6)}, i2 = {G(.7)}   B = .6 * .8 = .48; G = .7 * .9 = .63         .6 * .7 = .42      {G, B}
21
CHITRA [Nepal, Ramakrishna, 1999] Conclusions
Why is this more efficient?
Requires fewer accesses to each subsystem
How do we know this algorithm is correct? Proof by contradiction:
Assume an object z, not in the result set, that should have been included, i.e. Rank(z) > Rank(y) for some y in the result set
If Rank(z) > Rank(y), either:
y must have at least one subsystem rank smaller than all subsystem ranks of z, or
z must have at least one subsystem rank greater than all subsystem ranks of y
However, since Rank(z) < Threshold and Rank(y) >= Threshold, Rank(z) cannot be greater than Rank(y) — a contradiction
22
“Smart” Ranked Querying (Relational) – STOP Operator [Carey, et al, 1997]
Specifies extension to the SQL-92 standard to allow a limit on the cardinality of the result: STOP AFTER
Return a subset of results from each section of the query plan
Implement with the STOP operator: STOP(N, D, E), where N is the number of desired tuples, D is the Sort Directive [asc, desc, none], and E is the Sort Expression
Heuristically determine when and how to apply it
23
STOP Operator [Carey, et al, 1997] Example query plans
Fig. a shows a traditional JOIN: join all EMP to DEPT, sort, output the top k
Fig. b shows the implementation of STOP operators: based on cardinality estimates, only 20 rows of EMP need be joined with 30 rows of DEPT to produce a top-k of 10
24
STOP Operator [Carey, et al, 1997] Conservative Heuristic
Ensures that every tuple in each intermediate result is guaranteed to generate at least one tuple of the overall query result
Advantages
No restarts from intermediate processing returning fewer than k results
Intermediate STOP operators take their N value from the overall query’s k value
Disadvantages
Only inserts STOP operators where all remaining predicates are non-reductive (cannot be used with multi-way joins)
25
STOP Operator [Carey, et al, 1997] Aggressive Heuristic
Applies the STOP operator wherever it may be beneficial, thus reducing intermediate results to a greater degree
Choose the N value using cardinality estimates
Requires a RESTART operator when intermediate processing returns too few results
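The sketch below (hypothetical Python, not the paper's implementation) illustrates the aggressive idea: truncate an intermediate result to an estimated stop cardinality and, if the rest of the plan then yields fewer than k answers, restart with a larger cutoff. A real RESTART operator would avoid redoing work already done; this sketch simply reruns.

def stop_after(rows, n, key):
    # The STOP operator: keep only the n best rows by key (descending)
    return sorted(rows, key=key, reverse=True)[:n]

def aggressive_topk(rows, downstream, k, selectivity_estimate, key):
    """rows: input tuples; downstream: the rest of the plan (assumed here to be
       a filter/join returning tuples the same key applies to);
       selectivity_estimate: guessed fraction of truncated tuples that survive."""
    n = max(k, int(k / selectivity_estimate))    # estimated stop cardinality
    while True:
        out = downstream(stop_after(rows, n, key))
        if len(out) >= k or n >= len(rows):
            return sorted(out, key=key, reverse=True)[:k]
        n *= 2                                   # RESTART with a larger cutoff

# e.g. aggressive_topk(emp_rows, lambda rs: [r for r in rs if r['dept'] in depts],
#                      k=10, selectivity_estimate=0.5, key=lambda r: r['salary'])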
26
STOP Operator [Carey, et al, 1997] Experimental Results
Which heuristic is better? Depends on cardinality, expense of processing intermediate results, accuracy of prediction, etc.
With a low expense of processing intermediate results, experimental results show aggressive overestimation to be the best:

Traditional                         128.3 sec
Conservative                         63.9 sec
Aggressive, Underestimate (1/10)     63.1 sec
Aggressive, Overestimate (10)        18.5 sec
27
STOP Operator [Carey, et al, 1997] Experimental Results
Performance vs. Traditional (“out-of-the-box”) processing shows benefits in both indexed and non-indexed situations
28
“Smart” Ranked Querying (Relational) – Probabilistic [Donjerkovic, et al, 1999]
Introduces the idea of a ‘selection cutoff’ to produce top-k results without requiring a SORT
Quantifies the risk of fewer than k results being generated, using inherent database statistics
“List the top 10 paid employees” becomes “List the employees whose salary is greater than x”, where x is determined by the distribution of employees’ salaries
29
Probabilistic [Donjerkovic, et al, 1999] Comparison with STOP Operator
In theory, likely to be cheaper to simply ‘select’ the necessary intermediate rows using cutoff (fig b) rather than performing sort and returning top-k (fig a)
30
Probabilistic [Donjerkovic, et al, 1999] Implementation
Leverage the same statistics used by the traditional query optimizer to guess the cutoff:
Histograms
Selectivity factors
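A minimal sketch of how a cutoff might be guessed from an equi-width-style histogram of (low, high, count) buckets — an assumed data layout, not the paper's exact estimation procedure: walk the buckets from the high end and stop at the first bucket boundary by which at least k tuples have been accumulated.

def cutoff_from_histogram(buckets, k):
    """buckets: list of (low, high, tuple_count), sorted ascending by value.
       Returns a cutoff x such that roughly k or more tuples have value > x."""
    running = 0
    for low, _high, count in reversed(buckets):
        running += count
        if running >= k:
            return low                 # conservative: admit the whole bucket
    return buckets[0][0]               # fewer than k tuples exist in total

# e.g. a Salary histogram and k = 10:
# cutoff_from_histogram([(30000, 60000, 500), (60000, 90000, 40), (90000, 120000, 7)], 10)
# -> 60000, i.e. "Salary > 60000" should return at least the top 10 earners.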
31
Probabilistic [Donjerkovic, et al, 1999] Performance
For a simple query using no indexes (return the k highest paid employees, no index on the ‘Salary’ attribute), easily outperforms the traditional approach (scan, sort, return top k)
Also provides benefit to JOIN queries, due to the complexity of estimating join selectivity
32
“Smart” Ranked Querying (Relational) – Statistical [Chaudhuri, Gravano, 1999]
Expansion of the probabilistic model
Maps rank queries into boolean range queries
Works with a variety of scoring functions, including Min, Euclidean, and Sum
33
Statistical [Chaudhuri, Gravano, 1999] Expansion of probabilistic model
Consider multiple levels of ‘selection cutoff’, here referred to as the ‘search score’ (Sq)
NoRestarts – score low enough to guarantee no restarts are even needed
Restarts – score high enough that restarts might result
Intermediate – score between NoRestarts and Restarts
34
Statistical [Chaudhuri, Gravano, 1999] Implementation
Determine Sq from histograms
Choose bounding tuples in each bucket to ensure NoRestarts (fig. a), or tight tuples to minimize selection but potentially require Restarts (fig. b)
35
Statistical [Chaudhuri, Gravano, 1999] Implementation
Determine a relational query to retrieve all tuples that score above Sq
Compute an n-rectangle bounding such tuples: SELECT * FROM R WHERE (a1 <= A1 <= b1) AND ... AND (an <= An <= bn)
Compute the score for all returned tuples
Output the top-k tuples with score > Sq, or rerun the query with a lower search score
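A minimal sketch of that loop, with hypothetical helper functions: pick a search score Sq, derive the n-rectangle (one range per attribute), run an ordinary boolean range query, score the returned tuples, and lower Sq and rerun if fewer than k of them actually score above it. Scores are assumed to lie in [0, 1].

def topk_via_range_query(run_range_query, bounding_box_for, score, k, sq, step=0.1):
    """run_range_query(box): tuples inside box (e.g. via SQL BETWEEN predicates);
       bounding_box_for(sq): per-attribute (low, high) ranges containing every
       tuple whose score could exceed sq; score(t): the scoring function."""
    while True:
        candidates = [(score(t), t) for t in run_range_query(bounding_box_for(sq))]
        winners = sorted([c for c in candidates if c[0] > sq],
                         key=lambda c: c[0], reverse=True)
        if len(winners) >= k or sq <= 0:
            return winners[:k]
        sq -= step                      # rerun the query with a lower search score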
36
Statistical [Chaudhuri, Gravano, 1999] Expansion of Fagin’s model
Expands Fagin’s ideas to relational queries
Substitute a ‘search score’ query to determine the top tuples for each subsystem
Use the NoRestarts strategy to ensure that expensive re-querying is avoided
37
“Smart” Ranked Querying (Rank) – MPro [Chang, Hwang, 2002]
Extends consideration of top-k querying to expensive predicates (monotonic only)
As opposed to other work, which assumes the expense of score calculation to be minimal
Attempt to minimize the number of scores calculated
Consider only Necessary Probes, i.e. only those calculations without which the top-k results cannot be found
38
MPro [Chang, Hwang, 2002] Determining if probe is necessary
With a monotonic scoring function such as Min, an object’s lowest calculated predicate score is its “ceiling score” (no further probe can raise the object’s overall score above it)
If the “ceiling score” falls below a top-k object’s complete score, the object is ruled out and no further calculations on it need be performed
Simple Example: consider a scoring function like Min and top-1 results desired
If we know object A’s combined score with respect to F(x) and F(y) is .8, and we calculate object B’s score with respect to F(x) to be .7, B’s score with respect to F(y) need not be calculated (its Min value cannot be higher than .7)
39
MPro [Chang, Hwang, 2002] Determining all necessary probes
Only objects with ceiling scores in the top-k need be further evaluated
If objects are kept in sorted order by their current ceiling scores:
For any object u in the top-k slots, its next probe is necessary
40
MPro [Chang, Hwang, 2002] Minimal Probes Algorithm (MPro)
Priority queue initialization
Evaluate each object over the first predicate (same as sequentially accessing objects sorted by x)
Necessary probing
Request from the queue the object with the highest ceiling score
Evaluate the object over the next predicate y
Update its ceiling score and reinsert it into the queue
Stop when at least k objects have been completely scored (and output these objects)
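A minimal Python sketch of the loop just described, assuming a Min-style monotonic scoring function and that the first (cheap) predicate has already been evaluated for every object, e.g. via a sorted index. Helper names and the data layout are illustrative, not taken from the paper.

import heapq

def mpro_topk(first_scores, probes, k):
    """first_scores: dict object -> score of the (cheap) first predicate;
       probes: list of expensive predicate functions p(obj) -> score in [0, 1];
       the overall scoring function here is assumed to be Min."""
    # Heap entries: (-ceiling score, object, index of next predicate to probe)
    heap = [(-s, obj, 0) for obj, s in first_scores.items()]
    heapq.heapify(heap)
    results = []
    while heap and len(results) < k:
        neg_ceiling, obj, i = heapq.heappop(heap)
        if i == len(probes):                   # fully probed: score is final
            results.append((obj, -neg_ceiling))
            continue
        probe_score = probes[i](obj)           # this probe is necessary
        new_ceiling = min(-neg_ceiling, probe_score)   # Min keeps the ceiling tight
        heapq.heappush(heap, (-new_ceiling, obj, i + 1))
    return results

# mpro_topk({'A': .9, 'B': .8, 'C': .3},
#           [lambda o: {'A': .5, 'B': .7, 'C': .9}[o]], k=1)  -> [('B', 0.7)]
# C's expensive predicate is never probed, because its ceiling (.3) never
# reaches the top of the queue.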
41
MPro [Chang, Hwang, 2002] Further Applications
Incremental results
Output the top k, then resume processing where it left off for the next k as the user requests
Fuzzy joins
Consider the join predicate in the same manner
Parallel processing
Distribute necessary probes across multiple servers
Distribute data, calculate the top n over each chunk, merge the results
42
MPro [Chang, Hwang, 2002] Experimental Results
On the experimental dataset, over 96% of complete probes were found to be unnecessary
Elapsed time significantly improved, to 408 seconds for k = 10
43
“Smart” Ranked Querying (Rank) – AutoRank [Agrawal, et al, 2003]
Consider ranking of relational attributes in a similar way to Information Retrieval (IR)
IDF Similarity
Extend TF-IDF based on frequency of occurrence of attribute values
QF Similarity
Use the database workload to determine the frequency with which attributes and attribute values are referenced
“Poor man’s relevance feedback”
ITA
Index-based top-k algorithm that exploits the above ranking functions
44
AutoRank [Agrawal, et al, 2003] IDF Similarity
Extend TF (term frequency)
IR – frequency of terms in a document
Relational – frequency of values for an attribute
Extend IDF (inverse document frequency)
IR – total documents / documents containing term
Relational – tuples / tuples where attribute = value
For all tuples matching the queried value, IDF Similarity is the attribute’s IDF (for the queried value), and 0 otherwise
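A minimal sketch of the relational analogy above, with made-up helper names. The slide's ratio (total tuples / tuples where attribute = value) is used directly here; the actual AutoRank formulas may normalize differently (e.g. with logarithms), so treat this purely as an illustration.

def idf(rows, attr, value):
    # total tuples / tuples where attribute = value (0 if the value never occurs)
    matching = sum(1 for r in rows if r[attr] == value)
    return len(rows) / matching if matching else 0.0

def idf_similarity(row, attr, query_value, rows):
    # A tuple gets the attribute's IDF for the queried value only when it matches
    return idf(rows, attr, query_value) if row[attr] == query_value else 0.0

# e.g. in a cars table where few rows have Make = 'Ferrari', the matching rows
# receive a much larger IDF similarity for a Make = 'Ferrari' query than rows
# matching a very common make would for a query on that make.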
45
AutoRank [Agrawal, et al, 2003] QF Similarity
Consider the problem of IDF where the desired result is also the most frequent
Realty database where homes built in the last three years are most desired, but the few entries existing for old homes (with higher IDF) will be considered “top”
Instead, use the frequency of occurrence of attribute values in executed queries to determine ranking (by examining the workload)
Can extend workload analysis to draw comparative conclusions from attribute values queried together
Assume similarity between ‘Honda’ and ‘Toyota’ if users frequently look for cars by either of these manufacturers
46
AutoRank [Agrawal, et al, 2003] Implementation
Store approximate representations of IDF and QF values using a smooth function
Minimal storage required
IDF and QF values can be quickly retrieved at runtime
ITA (Index-based Threshold Algorithm)
Use available, existing indexes (B+ trees)
Define the threshold by computing the best possible tuple in the data not yet examined
Stop processing when the similarity of this tuple is no greater than the similarity of the lowest-ranking tuple in the top-k buffer
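A minimal Python sketch of that stopping rule, assuming each attribute index can stream tuple ids in descending per-attribute similarity and that the overall similarity is the sum of per-attribute similarities. Interfaces are hypothetical; the real ITA works directly against B+-tree indexes.

import heapq

def ita_topk(index_streams, total_similarity, k):
    """index_streams: per-attribute iterables of (tuple_id, partial_similarity),
       in descending partial similarity; total_similarity(tid): full similarity."""
    streams = [iter(s) for s in index_streams]
    last = [None] * len(streams)     # last partial similarity seen per stream
    buffer = []                      # min-heap of (similarity, tuple_id), size <= k
    seen = set()
    exhausted = 0
    while exhausted < len(streams):
        exhausted = 0
        for i, s in enumerate(streams):
            item = next(s, None)
            if item is None:
                exhausted += 1
                last[i] = 0.0        # nothing left in this index
                continue
            tid, part = item
            last[i] = part
            if tid not in seen:
                seen.add(tid)
                heapq.heappush(buffer, (total_similarity(tid), tid))
                if len(buffer) > k:
                    heapq.heappop(buffer)
        # Best similarity any not-yet-examined tuple could still achieve
        threshold = sum(p for p in last if p is not None)
        if len(buffer) == k and buffer[0][0] >= threshold:
            break
    return sorted(buffer, reverse=True)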
47
AutoRank [Agrawal, et al, 2003] Experimental Results
Used a large realtor database and MS SQL Server
Measured result quality via user studies
For each test query, asked users to identify relevant and irrelevant tuples, and compared the results of QF and IDF queries to the users’ responses
ITA judged to be more efficient than SQL Server’s Top-k operator when indexes exist
48
Conclusions Clearly, an exciting and worthwhile field
Research has gone in several directions, but all of it shares roots in Fagin’s and Carey’s work
Combines many areas of computer science:
Artificial Intelligence (Fuzzy Logic)
Information Retrieval
49
The Future Implementation by major RDBMS vendors
Microsoft should be among the first to revamp their Top-k operator, as in-house research [Agrawal, et al, 2003] has provided a smarter, faster technique
Explore more complex ranking functions that cannot be easily mapped to range queries or used with indexes
50
References
M. J. Carey, D. Kossmann. On Saying “Enough Already!” in SQL. SIGMOD Conference, 1997.
D. Donjerkovic, R. Ramakrishnan. Probabilistic Optimization of Top N Queries. VLDB, 1999.
R. Fagin. Combining Fuzzy Information from Multiple Systems. PODS, 1996.
S. Nepal, M. V. Ramakrishna. Query Processing Issues in Image (Multimedia) Databases. ICDE, 1999: 22-29.
S. Chaudhuri, L. Gravano. Evaluating Top-k Selection Queries. VLDB, 1999.
K. C. Chang, S. Hwang. Minimal Probing: Supporting Expensive Predicates for Top-k Queries. SIGMOD Conference, 2002.
S. Agrawal, S. Chaudhuri, G. Das, A. Gionis. Automated Ranking of Database Query Results. CIDR, 2003.