Optimal Aggregation Algorithms for Middleware
Ronald Fagin, Amnon Lotem, Moni Naor (PODS 2001)
All rights reserved by Xuehua Shen

Problem: Rank Aggregation
Each object is scored under m different criteria, with one sorted list per criterion.
A combined score is computed by an aggregation function.
Problem: find the top-k objects with the highest combined scores.

Example
Mileage list      Year list         Price list
carID  score      carID  score      carID  score
c      1.0        a      0.9        d      1.0
a      0.8        b      0.7        e      0.9
e      0.6        c                 b      0.8
b      0.5        d                 c      0.7
d                 e      0.5        a      0.6
Rank aggregation, e.g. weighted sum:
Combined score = 0.2 * mileage score + 0.3 * year score * price score
Top 2 cars
carID  score
d      0.81
c      0.76
Do we need to access all entries of all sorted lists?

Applications
Multimedia database systems
Web search
[Figure, from Zhang2002 talk: the query Color='red' and Shape='round' is sent to a rank aggregation engine, which combines the sorted list for Color='red' and the sorted list for Shape='round' and returns the top k.]

Outline
Assumptions
Fagin Algorithm
Threshold Algorithm
Summary & Comments

Assumption 1: Modes of Access
Sequential access: obtain the score of the next object in one sorted list, proceeding sequentially from the current position.
Random access: obtain the score of a given object in one sorted list using one random access.
Year list
carID  score
a      0.8
c
e      0.7
…
Assumption: both access modes are available.

Assumption 2: Aggregation Function
Each object gets scores in the interval [0,1] from the different subsystems.
An aggregation function t combines them into one combined score, e.g. min, avg.
Monotone: t(x_1, ..., x_m) <= t(x'_1, ..., x'_m) if x_i <= x'_i for every i.
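
The monotonicity assumption says that raising any individual score never lowers the combined score. As a rough sanity check (not a proof), the sketch below tests this on a coarse grid of score vectors. The helper names `avg` and `looks_monotone` and the grid resolution are assumptions made for illustration; they do not come from the slides or the paper.

```python
from itertools import product

def avg(scores):
    """Average of the individual scores."""
    return sum(scores) / len(scores)

def looks_monotone(t, m, steps=5):
    """Brute-force check on a grid over [0,1]^m (hypothetical helper).

    Returns False if some x <= y (componentwise) has t(x) > t(y).
    Passing the check is evidence of monotonicity, not a proof.
    """
    grid = [i / (steps - 1) for i in range(steps)]
    points = list(product(grid, repeat=m))
    for x in points:
        for y in points:
            dominated = all(a <= b for a, b in zip(x, y))
            if dominated and t(x) > t(y) + 1e-12:
                return False
    return True

print(looks_monotone(min, 2))                      # min passes the check
print(looks_monotone(avg, 2))                      # avg passes the check
print(looks_monotone(lambda s: s[0] - s[1], 2))    # a difference does not
```

A function such as `s[0] - s[1]` fails because increasing the second score lowers the result, so the stopping arguments used by FA and TA would not apply to it.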

Intuition of Algorithms
Objects near the top of the individual sorted lists are good candidates for the correct answers.
Do some accesses, then ask: "Can we stop now?"

Fagin Algorithm
Price list        Mileage list      Year list
carID  score      carID  score      carID  score
a      0.9        b      1.0        a      0.8
c      0.8        e      0.8        c
e      0.7        f      0.7        e      0.7
…                 …                 …
'e' appears in all of them, so the top-1 object must be in {a, b, c, e, f}.
Why? The function is monotone, so object 'e' blocks all objects below it: an unseen object scores no higher than 'e' in every list, so its combined score cannot exceed that of 'e'.
Do random access for these 5 objects to get their scores and pick the top 1.
We can't say 'e' must be top-1; the other seen objects can still have a higher combined score.

Drawbacks of Fagin Algorithm
Uses only the information provided by the sorted lists and the monotone property.
Has to remember many objects: large buffer size.

Threshold Algorithm (TA)
Intuition: the combined score calculated by the aggregation function provides extra information, namely an upper bound (or threshold) on the combined score of unseen objects.
When object R is seen under sequential access, immediately do random access to get all other scores of R and compute its combined score.
At the same time, keep track of the upper bound on the combined score of the unseen objects.
Halt when at least k objects have combined scores no less than the upper bound.
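
The TA steps above can be sketched as follows. This is a minimal illustration, assuming every object appears in every list and that each list is an in-memory Python list of (object, score) pairs sorted by score descending. The function name `threshold_top_k`, the dictionary lookup standing in for random access, and all scores in the sample data are assumptions made up for the example; they are not the slide's data.

```python
import heapq

def avg(scores):
    return sum(scores) / len(scores)

def threshold_top_k(lists, t, k):
    """Sketch of TA. Returns (top-k as (score, obj) pairs, rounds of sorted access)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = []                             # min-heap of (score, obj), at most k items
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        for lst in lists:
            obj, _ = lst[depth]             # sequential access
            if any(o == obj for _, o in buffer):
                continue                    # already among the current best k
            score = t([lookup[i][obj] for i in range(m)])
            heapq.heappush(buffer, (score, obj))
            if len(buffer) > k:
                heapq.heappop(buffer)       # keep the buffer at size k
        # Upper bound on any unseen object: combine the last score seen per list.
        threshold = t([lst[depth][1] for lst in lists])
        if len(buffer) == k and buffer[0][0] >= threshold:
            break
    return sorted(buffer, reverse=True), rounds

# Hypothetical scores, invented for illustration only.
price   = [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("f", 0.5)]
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7), ("a", 0.6), ("c", 0.3)]
year    = [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.65), ("f", 0.2)]

top, rounds = threshold_top_k([price, mileage, year], avg, 1)
print(top, rounds)
```

The buffer never holds more than k entries, which is the property the deck contrasts with FA. The price is that an object evicted from the buffer may be looked up again if it reappears in another list.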

TA: Example (k=1, AVG aggregation)
Price list        Mileage list      Year list
carID  score      carID  score      carID  score
a      0.9        b      1.0        a      0.8
c      0.8        e      0.8        c
e      0.7        f      0.7        e      0.7
…                 …                 …
Step 1: sequential access to 'a' (price score 0.9), then random access for the mileage score (0.6) and year score (0.8) of 'a'; the average is 0.77.
Step 2: sequential access to 'b' (mileage score 1.0), then random access for the price score (0.7) and year score (0.7) of 'b'; the average is 0.8.
The upper bound is updated after each step.
Constant-size buffer.

Evaluation of TA
TA never stops later than FA.
TA requires only a small, constant-size (k) buffer.
However, TA may perform more random accesses.

Summary
FA and TA use both sequential access and random access.
TA can be extended to other situations:
Approximate algorithm
No random access

Comments
The algorithms rely on universal identification of objects across the different lists.
The assumptions are not always valid, e.g. not every sorted list exists beforehand.
Choosing sequential accesses wisely can speed up TA on skewed data.

Backup Slides

Middleware
Middleware functions as a translation layer: it handles all incoming requests (such as top-k queries) and replies, interacting with the disparate back-office systems to gather the information it needs.
Application developers don't need to know that several heterogeneous systems sit behind the middleware.

Boolean Query vs. Fuzzy Query
Semantics: get all the results that satisfy the conditions vs. get the best possible answers to the query.
Size of result: constant vs. variable.
Processing the query: for a Boolean query, whether a tuple belongs to the result can be determined from the tuple itself, so each tuple can be handled individually. For a fuzzy query this is not possible, because whether a tuple is in the result depends on how it compares with the other tuples.

Fuzzy Query Processor (from Zhang02)
[Figure: a Boolean query processor evaluates Title='database' and Price < 100 against a traditional database and returns a set; a fuzzy query processor evaluates Color='red' and Shape='round' against a database with fuzzy data, using the sorted lists for color and shape, and returns the top k.]

Cost
Reduce the number of sequential accesses (Cs).
The number of random accesses is bounded by the number of sequential accesses times a factor of m-1.
The overall cost is bounded by Cs times a constant factor.
Really optimal?
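
The bound on this slide can be written out as a short derivation. The symbols below are assumptions introduced for illustration, following the usual middleware cost model: $s$ is the number of sequential accesses, $r$ the number of random accesses, and $c_S$, $c_R$ the unit costs of one sequential and one random access. The step $r \le (m-1)\,s$ reflects TA doing at most $m-1$ random accesses for each object it meets under sequential access.

```latex
\begin{align*}
\text{cost} &= s\,c_S + r\,c_R \\
            &\le s\,c_S + (m-1)\,s\,c_R \\
            &= \bigl(c_S + (m-1)\,c_R\bigr)\, s
\end{align*}
```

For fixed $m$, $c_S$ and $c_R$, the factor in front of $s$ is a constant, which is why reducing the number of sequential accesses controls the overall cost.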

Approximation Algorithm
Approximate top-k answers are acceptable or even desirable.
θ-approximation (θ > 1): for any object y in the answer and any object z in the database, θ·t(y) >= t(z).
Turning TA into an approximation algorithm: halt once the top k objects seen so far satisfy the inequality.
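
One way to read the approximate halting rule is as a relaxed version of the TA stopping test: compare θ times the k-th best combined score seen so far against the current upper bound on unseen objects. The sketch below assumes that reading; the function name `can_halt` and its arguments are hypothetical, and allowing θ = 1 (which gives back the exact TA test) is a convenience of this sketch rather than something stated on the slide.

```python
def can_halt(buffer_scores, threshold, k, theta=1.0):
    """Relaxed TA stopping test (illustrative sketch).

    buffer_scores: combined scores of the best objects seen so far.
    threshold: current upper bound on the combined score of unseen objects.
    theta: approximation factor, theta >= 1.
    """
    if theta < 1.0:
        raise ValueError("theta must be at least 1")
    if len(buffer_scores) < k:
        return False
    kth_best = sorted(buffer_scores, reverse=True)[k - 1]
    return theta * kth_best >= threshold

print(can_halt([0.75], 0.9, k=1))               # exact test
print(can_halt([0.75], 0.9, k=1, theta=1.25))   # relaxed test
```

A larger θ lets the algorithm stop earlier, at the cost of possibly returning objects whose scores are within a factor θ of the true top k.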

No Random Access (NRA)
Similar to TA, except that for objects not yet seen in every list there is:
No exact score
No sorted order
Only the lower bound and upper bound of such objects
Do sequential access until there are k objects whose lower bounds are no less than the upper bounds of all other objects.

NRA cont.
Lower bound: use 0
Upper bound: use the last score seen
Price list        Mileage list      Year list
carID  score      carID  score      carID  score
a      0.9        b      1.0        a      0.8
c      0.8        e      0.8        c
e      0.7        f      0.7        e      0.7
…                 …                 …
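
The two bounds can be computed per object by substituting a value for every score not yet seen: 0 for the lower bound, and the last score seen in that list for the upper bound. The sketch below assumes a monotone aggregation function and scores in [0,1]; the function name `nra_bounds` and the dictionary layout are hypothetical choices for this illustration.

```python
def nra_bounds(known, last_seen, t):
    """Lower and upper bound on an object's combined score (illustrative sketch).

    known: dict mapping list index -> score already seen for this object.
    last_seen: the last score read from each list under sequential access.
    t: monotone aggregation function taking a list of m scores.
    """
    m = len(last_seen)
    lower = t([known.get(i, 0.0) for i in range(m)])           # missing scores -> 0
    upper = t([known.get(i, last_seen[i]) for i in range(m)])  # missing -> last seen
    return lower, upper

# Object 'a' after one round: seen in the price list (0.9) and year list (0.8),
# not yet seen in the mileage list, whose last score read was 1.0.
print(nra_bounds({0: 0.9, 2: 0.8}, [0.9, 1.0, 0.8], min))
```

Because each list is sorted, an unseen score can never exceed the last score read from that list, which is what makes the upper bound valid.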

NRA example
Advantage: R1(1,0), others(1/3,1/3) Top 1 Top 2 vs. Top 1: R1(1,0), R2(1,1/4), others(1/3,1/3) Top 2
Lots of bookkeeping

Optimality of FA
Assumption: t is monotone.
Cost: $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability.
Optimality: every algorithm that correctly finds the top k answers for a strict monotone query $F_t(A_1, A_2, \ldots, A_m)$, where $A_1, A_2, \ldots, A_m$ are independent, and that makes no wild guesses has cost $\Theta(N^{(m-1)/m} k^{1/m})$ with arbitrarily high probability.
FA is optimal among all such algorithms in the high-probability sense.

Optimality of TA
Assumption: t is monotone.
Instance optimality: for any algorithm C that correctly finds the top k answers for a monotone query $F_t(A_1, A_2, \ldots, A_m)$ without wild guesses, on any database D,
$\mathrm{cost}(TA, D) = O(\mathrm{cost}(C, D))$
TA is instance optimal among all such algorithms.

Optimality of NRA
Assumption: t is monotone.
Instance optimality: among all algorithms that correctly find the top k objects for a monotone query t on every database and do not make random accesses.

Algorithm Comparison (from Zhang2002 talk)
Algorithm  Assumption  Access model    Termination (worst case)  Termination (expected)      Buffer space
FA         Monotone    Sorted, Random  n(m-1)/m + k/m            $N^{(m-1)/m} k^{1/m}$       N
TA         Monotone    Sorted, Random  Bounded by FA             Depends on distribution     k
NRA        Monotone    Sorted          N                         Depends on distribution     N

Worst Case
[Figure: worst-case arrangement of objects O_1, ..., O_2n across the sorted lists]
Aggregation function: min
n(m-1)/m + k/m

Naïve algorithm
Algorithm:
For each criterion, do sequential access to retrieve all objects and their scores.
Calculate the combined scores for all objects.
Pick the top k.
Comments:
Accesses the entire database.
Cost is linear in the database size.
Does NOT use the fact that each list is sorted.
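
As a baseline for comparison, the naïve algorithm can be sketched in a few lines. The sketch assumes every object appears in every list; the function name `naive_top_k` and all scores in the sample data are made up for illustration and are not the slide's data.

```python
import heapq

def avg(scores):
    return sum(scores) / len(scores)

def naive_top_k(lists, t, k):
    """Read every entry of every list, score every object, keep the best k."""
    m = len(lists)
    scores = {}
    for i, lst in enumerate(lists):
        for obj, s in lst:                        # full scan; sort order is ignored
            scores.setdefault(obj, [0.0] * m)[i] = s
    combined = {obj: t(v) for obj, v in scores.items()}
    return heapq.nlargest(k, combined.items(), key=lambda kv: kv[1])

# Hypothetical scores, invented for illustration only.
price   = [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("f", 0.5)]
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7), ("a", 0.6), ("c", 0.3)]
year    = [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.65), ("f", 0.2)]

print(naive_top_k([price, mileage, year], avg, 2))
```

Every list entry is touched exactly once, so the number of accesses grows linearly with the database size regardless of k.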

Fagin Algorithm
Algorithm:
Do sequential access in parallel on all sorted lists L_i until there are k "matches". A "match" is an object that has been seen in all sorted lists L_i.
Then, for each object that has been seen, do random access to get all of its scores.
Compute the combined scores and pick the top k.
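
The two phases of FA described above can be sketched as follows, assuming every object appears in every list and that a dictionary lookup stands in for random access. The function name `fagin_top_k` and all scores in the sample data are assumptions invented for the example.

```python
def avg(scores):
    return sum(scores) / len(scores)

def fagin_top_k(lists, t, k):
    """Sketch of FA. Returns (top-k as (obj, score) pairs, depth of sorted access)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]     # stands in for random access
    seen_in = {}                              # obj -> indices of lists where it was seen
    matches = 0
    depth = 0
    n = max(len(lst) for lst in lists)
    # Phase 1: sequential access in parallel until k objects are seen in all lists.
    while matches < k and depth < n:
        for i, lst in enumerate(lists):
            if depth < len(lst):
                obj, _ = lst[depth]
                where = seen_in.setdefault(obj, set())
                where.add(i)
                if len(where) == m:
                    matches += 1
        depth += 1
    # Phase 2: random access for every seen object, then pick the best k.
    combined = {obj: t([lookup[i][obj] for i in range(m)]) for obj in seen_in}
    top = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return top, depth

# Hypothetical scores, invented for illustration only.
price   = [("a", 0.9), ("c", 0.8), ("e", 0.7), ("b", 0.6), ("f", 0.5)]
mileage = [("b", 1.0), ("e", 0.8), ("f", 0.7), ("a", 0.6), ("c", 0.3)]
year    = [("a", 0.8), ("c", 0.75), ("e", 0.7), ("b", 0.65), ("f", 0.2)]

top, depth = fagin_top_k([price, mileage, year], avg, 1)
print(top, depth)
```

Note that `seen_in` keeps every object met during phase 1, which is the large buffer the deck lists as a drawback of FA.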