

CS246 Ranked Queries
Junghoo "John" Cho (UCLA Computer Science)

Junghoo "John" Cho (UCLA Computer Science)2 Traditional Database Query (Dept = “CS”) & (GPA > 3.5) Boolean semantics Clear boundary between “answers” and “non- answers” Goal: Find all “matching” tuples Optionally ordered by a certain field T: All Tuples A: Answer Clear boundary

Junghoo "John" Cho (UCLA Computer Science)3 Ranked Queries Find “cheap” houses “close” to UCLA Cheap( x ) & NearUCLA( x ) Non-Boolean semantics No clear boundary between “answers” and “non- answers” Answers inherently ranked Goal: Find top ranked tuples T: All Tuples A: Answer No clear boundary

Junghoo "John" Cho (UCLA Computer Science)4 Issues? How to rank? Distance 3 miles: proximity? Similarity: looks like “Tom Cruise”? How to combine rankings? Price = 0.8, Distance = 0.2. Overall grade? Weighting? Price is twice more “important” than distance? Query processing? Traditional query processing is based on Boolean semantics

Junghoo "John" Cho (UCLA Computer Science)5 Fagin’s paper Previously all of the 4 issues were a “black art” No disciplined way to address the problems Fagin’s paper studied the last 3 issues in a more “disciplined” way Combination of ranks Weighting Query processing Find general “properties” and derive a formula satisfying the properties

Junghoo "John" Cho (UCLA Computer Science)6 Topics Combining multiple grades Weighting Efficient query processing

Junghoo "John" Cho (UCLA Computer Science)7 Rank Combination Cheap( x ) & NearUCLA( x ) Cheap( x ) = 0.3 NearUCLA( x ) = 0.8 Overall ranking? How would you approach the problem?

Junghoo "John" Cho (UCLA Computer Science)8 General Query (Cheap( x ) & (NearUCLA( x ) | NearBeach( x ))) & RedRoof( x ) How to compute the overall grade? CheapNearUCLANearBeachRedRoof | & &

Junghoo "John" Cho (UCLA Computer Science)9 Main Idea Atomic scoring function  A ( x ): given by application  Cheap ( x ) = 0.3,  NearUCLA ( x ) = 0.2 … Query: recursive application of AND and OR (Cheap & (NearUCLA | NearBeach)) & RedRoof Combination of two grades for “AND” and “OR” 2 -nary function: t : [0, 1] 2  [0,1] Example: min(a, b) for “AND”?  Cheap & NearUCLA ( x ) = min(0.3, 0.2) = 0.2 Properties of AND/OR scoring function?

Junghoo "John" Cho (UCLA Computer Science)10 Properties of Scoring Function Logical equivalence The same overall score for logically equivalent queries  A&(B|C) ( x ) =  (A&B)|(A&C) ( x ) Monotonicity if  A ( x 1 ) <  A ( x 2 ) and  B ( x 1 ) <  B ( x 2 ), then  A&B ( x 1 ) <  A&B ( x 2 ) t(x 1, x 2 ) < t(x’ 1, x’ 2 ) if x i < x’ I for all i

Junghoo "John" Cho (UCLA Computer Science)11 Uniqueness Theorem The min() and max() are the only scoring functions with the two properties Min() for “AND” and Max() for “OR” Quite surprising and interesting result More discussion later Is logical equivalence really true?

Junghoo "John" Cho (UCLA Computer Science)12 Question on Logical Equivalence? Query: Homepage of “John Grisham” PageRank & John & Grisham Logically equivalent, but are they same? Does logical equivalence hold for non-Boolean queries? PRJohnGrisham & & PRJohnGrisham & &

Junghoo "John" Cho (UCLA Computer Science)13 Summary of Scoring Function Question: how to combine rankings Scoring function: combine grades Results from fuzzy logic Logical equivalence Monotonicity Uniqueness theorem Min() for “AND” and Max() for “OR” Logical equivalence may not be valid for graded Boolean expression

Junghoo "John" Cho (UCLA Computer Science)14 Topics Combining multiple grades Weighting Efficient query processing

Junghoo "John" Cho (UCLA Computer Science)15 Weighting of Grades Cheap( x ) & NearUCLA( x ) What if proximity is “more important” than price? Assign weights to each atomic query Cheap( x ) = 0.2, weight = 1 NearUCLA( x ) = 0.8, weight = 10 Proximity is 10 times more important than price Overall grade?

Junghoo "John" Cho (UCLA Computer Science)16 Formalization m -atomic queries  = (  1, …,  m ) : weight of each atomic query X = ( x 1, …, x m ) : grades from each atomic query f ( x 1, …, x m ) : unweighted scoring function f  ( x 1, …, x m ) : new weighted scoring function What should f  ( x 1, …, x m ) be given  ? Properties of f  ( x 1, …, x m )?

Junghoo "John" Cho (UCLA Computer Science)17 Properties P1: When all weights are equal f (1/m, …, 1/m) ( x 1, …, x m ) = f ( x 1, …, x m ) P2: If an argument has zero weight, we can safely drop the argument f (  1, …,  m-1, 0) ( x 1, …, x m ) = f (  1, …,  m-1 ) ( x 1, …, x m-1 ) P3: f  (X) should be locally linear f  +(1-  )  ’ ( x 1, …, x m ) =  f  ( x 1, …, x m ) + (1-  ) f  ’ ( x 1, …, x m )

Junghoo "John" Cho (UCLA Computer Science)18 Local Linearity Example  1 = (1/2, 1/2), f  1 (X) = 0.2  2 = (1/4, 3/4), f  2 (X) = 0.4 If  3 = (3/8, 5/8) = 1/2  1 + 1/2  2 f  3 (X) = 1/2 f  1 (X) + 1/2 f  2 (X) = 0.3 Q: m-atomic queries. How many independent weight assignments? A: m. Only m degrees of freedom Very strong assumption Not too unreasonable, but no rationale

Junghoo "John" Cho (UCLA Computer Science)19 Theorem 1·(  1 -  2 ) f ( x 1 ) + 2·(  2 -  3 ) f ( x 1, x 2 ) + 3·(  3 -  4 ) f ( x 1, x 2, x 3 ) + … m·  m · f ( x 1, …, x m ) is the only function that satisfies such properties

Junghoo "John" Cho (UCLA Computer Science)20 Examples  = (1/3, 1/3, 1/3) 1·(1/3-1/3) f ( x 1 ) + 2·(1/3-1/3) f ( x 1, x 2 ) + 3·(1/3) f ( x 1, x 2, x 3 ) = f ( x 1, x 2, x 3 )  = (1/2, 1/4, 1/4) 1·(1/2-1/4) f ( x 1 ) + 2·(1/4-1/4) f ( x 1, x 2 ) + 3·(1/4) f ( x 1, x 2, x 3 ) = 1/4 f ( x 1 ) + 3/4 f ( x 1, x 2, x 3 )  = (1/2, 1/3, 1/6) 1·(1/2-1/3) f ( x 1 ) + 2·(1/3-1/6) f ( x 1, x 2 ) + 3·(1/6) f ( x 1, x 2, x 3 ) = 1/6 f ( x 1 ) + 2/6 f ( x 1, x 2 ) + 3/6 f ( x 1, x 2, x 3 )

Junghoo "John" Cho (UCLA Computer Science)21 Summary of Weighting Question: different “importance” of grades  = (  1, …,  m ): weight assignment Uniqueness theorem Local linearity and two other reasonable assumption 1·(  1 -  2 ) f ( x 1 ) + 2·(  2 -  3 ) f ( x 1, x 2 ) + … m·  m · f ( x 1, …, x m ) Linearity assumption questionable

Junghoo "John" Cho (UCLA Computer Science)22 Application? Web page ranking PageRank & (Keyword1 & Keyword2 & …) Should we use min()? min(keyword1, keyword2, keyword3,…) Would it be better than the cosine measure? If PageRank is 10 times more important, should we use Fagin’s formula? 9/11 PR + 2/11 min(PR, min(keywords)) Would it be better than other ranking function? Is Fagin’s formula practical?

Junghoo "John" Cho (UCLA Computer Science)23 Topics Combining multiple grades Weighting Efficient query processing

Junghoo "John" Cho (UCLA Computer Science)24 Question How can we process ranked queries efficiently? Top k answers for “Cheap( x ) & NearUCLA( x )” Assume we have good scoring functions How do we process traditional Boolean query? GPA > 3.5 & Dept = “CS” What’s the difference? What is difficult compared to Boolean query?

Junghoo "John" Cho (UCLA Computer Science)25 Naïve Solution Cheap( x ) & NearUCLA( x ) 1. Read prices of all houses 2. Compute distances of all houses 3. Compute combined grades of all houses 4. Return the k -highest grade objects Clearly very expensive when database is large

Junghoo "John" Cho (UCLA Computer Science)26 Main Idea We don’t have to check all objects/tuples Most tuples have low grades and will not be returned Basic algorithm Check top objects from each atomic query and find the best objects Question: How many objects should we see from each “atomic query”?

Junghoo "John" Cho (UCLA Computer Science)27 Architecture a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.78 … b: 0.9 d: 0.9 a: 0.75 … f ( x 1, x 2, x 3 ) b: 0.78 a: 0.75 How many to check? How to minimize it? Sorted access Random access any monotonic function

Junghoo "John" Cho (UCLA Computer Science)28 Three Papers Fuzzy queries Optimal aggregation Minimal probing

Junghoo "John" Cho (UCLA Computer Science)29 Fagin’s Model a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.78 … b: 0.9 d: 0.9 a: 0.75 … f ( x 1, x 2, x 3 ) Sorted access

Junghoo "John" Cho (UCLA Computer Science)30 Fagin’s Model Sorted access on all streams Cost model: # objects accessed by sorted/random accesses c s s + c r r Ignore the cost for “sorting” Reasonable when objects have been sorted already Sorted index Inappropriate when objects have not been sorted We have to compute grades for all objects Sorting can be costly

Junghoo "John" Cho (UCLA Computer Science)31 Main Question How many objects to access? When can we stop? A: When we know that we have seen at least k objects whose scores are higher than any unseen objects

Junghoo "John" Cho (UCLA Computer Science)32 Fagin’s First Algorithm Read objects from each stream in parallel Stop when k objects have been seen in common from all streams Top answers should be in the union of the objects that we have seen Why? f ( x 1, x 2, x 3 ) a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.78 … b: 0.9 d: 0.9 a: 0.75 … k objects 

Junghoo "John" Cho (UCLA Computer Science)33 Stopping Condition Reason The grades of the k objects in the intersection is higher than any unseen objects Proof x : object in the intersection, y : unseen object y 1  x 1. Similarly y i  x i for all i f ( y 1, …, y m )  f ( x 1, …, x m ) due to monotonicity

Junghoo "John" Cho (UCLA Computer Science)34 Fagin’s First Algorithm 1. Get objects from each stream in parallel until we have seen k objects in common from all streams 2. For all objects that we have seen so far If its complete grade is not known, obtain unknown grades by random access 3. Find the object with the highest grade

Junghoo "John" Cho (UCLA Computer Science)35 Example ( k = 2) a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.5 … min ( x 1, x 2 ) d: 0.6 c: a0.9 d b0.8 c x1x1 x2x2 min a: 0.85 d: 0.6

Junghoo "John" Cho (UCLA Computer Science)36 Performance We only look at a subset of objects Ignoring high cost for random access, clearly better than the naïve solution Total number of accesses O ( N (m-1)/m k 1/m ) assuming independent and random object order for each atomic query E.g., O ( N 1/2 k 1/2 ) if m = 2

Junghoo "John" Cho (UCLA Computer Science)37 Summary of Fagin’s Algorithm Sorted access on all streams Stopping condition k common objects from all streams

Junghoo "John" Cho (UCLA Computer Science)38 Problem of Fagin’s Algorithm Performance depends heavily on object orders in the streams k = 1, min(x1, x2) We need to read all objects Sorted access until 3 rd objects and random access for all remainder Can we avoid this pathological scenario? b: 1 a: 1 c: 1 d: 0 e: 0 e: 1 d: 1 b: 1 c: 0 a: 0

Junghoo "John" Cho (UCLA Computer Science)39 New Idea Let us read all grades of an object once we see it from a sorted access Do not need to wait until the streams give k common objects Less dependent on the object order When can we stop? Until we have seen k common objects from sorted accesses?

Junghoo "John" Cho (UCLA Computer Science)40 When Can We Stop? If we are sure that we have seen at least k objects whose grades are higher than those of unseen objects How do we know the grades of unseen objects? Can we predict the maximum grade of unseen objects?

Junghoo "John" Cho (UCLA Computer Science)41 Maximum Grade of Unseen Objects Assuming min(x1, x2), what will be the maximum grade of unseen objects? a: 1 b: 0.9 c: 0.8 d: 0.7 e: 0.6 e: 1 d: 0.8 b: 0.7 c: 0.7 a: 0.2 x1 < 0.8 and x2 < 0.7, so at most min(0.8, 0.7) = 0.7 Generalization?

Junghoo "John" Cho (UCLA Computer Science)42 Generalization x i : the minimum grade from stream i by sorted access f ( x 1, …, x m ) is the maximum grade of unseen objects x i < x i for all unseen objects f ( x 1, …, x m ): monotonic x1 x2

Junghoo "John" Cho (UCLA Computer Science)43 Basic Idea of TA We can stop when top k seen object grades are higher than the maximum grade of unseen objects Maximum grade of unseen objects: f ( x 1, …, x m )

Junghoo "John" Cho (UCLA Computer Science)44 Threshold Algorithm 1. Read one object from each stream by sorted access 2. For each object O that we just read Get all grades for O by random access If f (O) is in top k, store it in a buffer 3. If the lowest grade of top k object is larger than the threshold f ( x 1, …, x m ) stop

Junghoo "John" Cho (UCLA Computer Science)45 f (0.9,0.9) = 0.9f (0.8,0.85) = 0.8f (0.7,0.5) = 0.5 Example ( k = 2) a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.5 … min ( x 1, x 2 ) d: 0.6 c: 0.2 a0.9 d b x1x1 x2x2 min a: 0.85 d: 0.6 c f (1,1) = 1

Junghoo "John" Cho (UCLA Computer Science)46 Comparison of FA and TA? TA sees fewer objects than FA TA always stops earlier than FA When we have seen k objects in common, their grades are higher than the threshold TA may perform more random accesses than FA In TA, ( m -1) random accesses for each object In FA, Random accesses are done at the end, only for missing grades TA requires bounded buffer space ( k ) At the expense of more random seeks

Junghoo "John" Cho (UCLA Computer Science)47 Comparison of FA and TA TA can be better in general, but it may perform more random seeks What if random seek is very expensive or impossible? Algorithm with no random seek possible?

Junghoo "John" Cho (UCLA Computer Science)48 Algorithm NRA An algorithm with no random seek Isn’t random seek essential? How can we know the grade of an object when some of its grades are missing?

Junghoo "John" Cho (UCLA Computer Science)49 Basic Idea We may still compute the lower bound of an object, even if we miss some of its grades E.g., max(0.6, x )  0.6 We may also compute the upper bound of an object, even if we miss some of its grades E.g., max(0.6, x )  0.8 if x  0.8 If the lower bound of O 1 is higher than the upper bound of other objects, we can return O 1

Junghoo "John" Cho (UCLA Computer Science)50 Generalization ( x 1, …, x m ): the minimum grades from sorted access Lower bound of object: 0 for missing grades When x 3, x 4 are missing, f ( x 1, x 2, 0, 0) From monotonicity Upper bound of object: x i for missing grades When x 3, x 4 are missing, f ( x 1, x 2, x 3, x 4 ) x 3  x 3, x 4  x 4, thus f ( x 1, x 2, x 3, x 4 )  f ( x 1, x 2, x 3, x 4 )

Junghoo "John" Cho (UCLA Computer Science)51 NRA Algorithm 1. Read one object from each stream by sorted access. Assume ( x 1, …, x m ) are the lowest grades from the streams 2. For each object O seen so far Update its upper/lower bounds by Upper bound = use x i for missing grades Lower bound = use 0 for missing grades 3. If lower bounds of top k objects are larger than upper bounds of any other object, stop

Junghoo "John" Cho (UCLA Computer Science)52 AVG(0.5,0.7)=0.6 AVG(0.5,0.2)=0.35 AVG(0.3,0.7)=0.5 AVG(0.5,0.6) = 0.55AVG(0.3,0.2) = 0.25 Example ( k = 2) a: 0.9 b: 0.5 c: 0.3 … d: 0.7 a: 0.6 e: 0.2 … AVG ( x 1, x 2 ) a0.9 d b0.5 AVG(0,0.7)=0.35 AVG(0.3,0)=0.15 x1x1 x2x2 Lower Bound AVG(0.9,0)=0.45 AVG(0.5,0)=0.25 a, d c e AVG(0,0.2)=0.1 AVG(0.9,0.7)=0.8 AVG(0.3,0.2)=0.25 AVG(0.9,0.7)=0.8 AVG(0.5,0.6)=0.55 AVG(0.3,0.2)=0.25 Upper Bound 0.75 AVG(0.9,0.7) = 0.8

Junghoo "John" Cho (UCLA Computer Science)53 Properties of NRA No random access We may return an object even if we don’t know its grade We may only know its lower bound We need to constantly update the upper bounds of objects As threshold value decreases

Junghoo "John" Cho (UCLA Computer Science)54 Chang’s View Computing grades can be expensive Sorting is expensive Minimize sorted access

Junghoo "John" Cho (UCLA Computer Science)55 Chang’s Model Sorted access on one stream and random access on the remaining streams At least one sorted access necessary to “discover” objects Cost model: # of random accesses Reasonable when the objects are not sorted for some streams

Junghoo "John" Cho (UCLA Computer Science)56 Chang’s Model a: 0.9 b: 0.8 c: 0.7 … d: 0.9 a: 0.85 b: 0.78 … b: 0.9 d: 0.9 a: 0.75 … f ( x 1, x 2, x 3 ) Sorted access Random access

Junghoo "John" Cho (UCLA Computer Science)57 Chang’s Solution Main Idea? Probe only necessary attributes A probe is necessary iff we cannot find the right answer without it Which probe is necessary?

Junghoo "John" Cho (UCLA Computer Science)58 Necessary Probes Assume attribute probe order is fixed Assume min() is the scoring function Assume the threshold (grade of k th highest object) is 0.7 a : (0.9, 0.3, 0.2) Is the the second probe necessary? b : (0.5, 0.7, 0.3) Is the second probe necessary? Is the necessity dependent on algorithm?

Junghoo "John" Cho (UCLA Computer Science)59 Observation Probe necessity is independent of algorithm Purely dependent on the dataset Assuming probe order is fixed How do we find the necessary probes? When the upper bound of the grade of object O goes below the threshold, no more probe is necessary from the object How do we find the threshold value? We know the upper bound of the threshold Threshold upper bound = k th upper bound of grades (g k ) … Upper bound of Object grades k g1g1 gkgk

Junghoo "John" Cho (UCLA Computer Science)60 Algorithm MPro As long as we probe objects with grades above the threshold upper bound, we are safe Q : a priority queue for the upper bound grades of objects Pick the top object O from Q Probe the next attribute of O Stop if we have the complete grade for the top k objects in Q

Junghoo "John" Cho (UCLA Computer Science)61 Property of MPro MPro is optimal in the exact sense (not in the big O sense) All probes are necessary Assuming we need to compute the complete grades for all returned objects Assuming object probing order fixed No other algorithm can beat MPro Does it work for max()? Performance depends on the scoring function Good only when the upper bound is “tight”

Junghoo "John" Cho (UCLA Computer Science)62 Other Issues for MPro How to select the attribute probing order Mpro is optimal given a particular probing order Attribute probing order affects performance significantly Probe order estimation from sampling How to parallelize Mpro Probe top k objects simutaneously

Junghoo "John" Cho (UCLA Computer Science)63 Summary Efficient processing of ranked queries Sorted access Random access FA: k common objects TA: threshold value NRA: upper and lower bounds MPro: necessary probe principle

Junghoo "John" Cho (UCLA Computer Science)64 Hints on Paper Writing The goal of a paper is to be read and used by other people Should be easy to understand Tricky balance How to make a paper easy to read? Explicitly specify your assumptions Readers do not know what you think! Use examples Run experiments