Boolean + Ranking: Querying a Database by K-Constrained Optimization


1 Boolean + Ranking: Querying a Database by K-Constrained Optimization
Zhen Zhang Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang

2 Many queries naturally combine Boolean and ranking
Traditional databases: Boolean query, e.g., dept = CS and year = 2
Information retrieval: ranking query, e.g., top 5 ranked by gpa
Database applications on the Web: find top answers, combining both
B: dept = CS and year = 2 (qualifying constraint)
R: gpa (quantifying function)

3 Motivating scenarios
Data retrieval: find houses in a certain price range with a good price/sqrft ratio
  Select h.address from House h
  Where h.price ≤ 200k ∨ h.price ≥ 400k
  Order by h.size/|h.price-300k| Limit 1
The same query, joined with crime rates:
  Select h.address from House h, CrimeRate c
  Where (h.price ≤ 200k ∨ h.price ≥ 400k) and h.zipcode = c.zipcode
  Order by h.size/|h.price-300k| * c.crimerate^-1 Limit 10
Data analysis: find products with the highest sale increase in consecutive years
  Select itemid from Sales s1, Sales s2
  Where s1.itemid = s2.itemid and s2.year - s1.year = 1
  Order by s2.sale - s1.sale Limit 10

4 Boolean + Ranking form a coherent goal function
Boolean B + Ranking R = Goal function G
For a tuple t:
G(t) = B(t) * R(t) = R(t) if B(t) is true; 0 if B(t) is false (i.e., the lowest score)
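The definition of G can be sketched in a few lines of Python. This is only an illustration: the helper name `make_goal` and the dictionary-based tuple representation are assumptions, and the example condition and ranking function are the house query used elsewhere in the deck.

```python
from typing import Callable, Mapping

Row = Mapping[str, float]  # hypothetical tuple representation: attribute -> value

def make_goal(boolean: Callable[[Row], bool],
              ranking: Callable[[Row], float]) -> Callable[[Row], float]:
    """Combine a Boolean condition B and a ranking function R into G = B * R."""
    def goal(t: Row) -> float:
        # Test B first, so R is never evaluated on a disqualified tuple.
        return ranking(t) if boolean(t) else 0.0
    return goal

def in_price_range(h: Row) -> bool:
    return h["price"] <= 200_000 or h["price"] >= 400_000

def size_per_distance(h: Row) -> float:
    return h["size"] / abs(h["price"] - 300_000)

G = make_goal(in_price_range, size_per_distance)
print(G({"price": 600_000, "size": 4500}))  # qualifies, scored by R
print(G({"price": 300_000, "size": 3500}))  # fails B, scored 0
```

Checking B before R also avoids the division by zero that R alone would hit at a price of exactly 300k.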

5 The nature of Boolean + Ranking is a K-constrained optimization query
Optimize goal function G over database D
Goal function G: h.size/|h.price-300k| [h.price ≤ 200k ∨ h.price ≥ 400k]
Database D:
#   Addr                Zip     Price   Size
1.  Oak park, Chicago   60644   600K    4500
2.  Mattis, Champaign   61821   350K    2000
3.                              150K    1000
4.                              250K
5.                              300K    3500
6.                              80K     500
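As a point of reference, the query can always be answered by scoring every tuple and keeping the k best. The sketch below does that with a full scan; it is a naive baseline for illustration, not the method proposed in the talk. The function name `top_k_scan` is hypothetical, and row 4 is left out because its size is missing from the transcript.

```python
import heapq

# Rows of D that have both a price and a size in the transcript.
HOUSES = [
    {"id": 1, "price": 600_000, "size": 4500},
    {"id": 2, "price": 350_000, "size": 2000},
    {"id": 3, "price": 150_000, "size": 1000},
    {"id": 5, "price": 300_000, "size": 3500},
    {"id": 6, "price": 80_000, "size": 500},
]

def goal(h):
    """G(t) = R(t) if B(t) holds, else 0, for the house query on the slide."""
    qualifies = h["price"] <= 200_000 or h["price"] >= 400_000
    return h["size"] / abs(h["price"] - 300_000) if qualifies else 0.0

def top_k_scan(rows, k):
    """Score every row and keep the k rows with the highest G."""
    return heapq.nlargest(k, rows, key=goal)

top2 = [h["id"] for h in top_k_scan(HOUSES, 2)]
print(top2)
```

The scan touches every tuple regardless of k, which is the cost an index-driven search tries to avoid.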

6 What is the query evaluation mechanism?
Boolean query + Ranking query: how to answer the combination?

7 Current techniques lack a global search mechanism
If evaluated as separate operators (Boolean query B and ranking query R, each applied to D): current techniques optimize only condition by condition, lacking a global search mechanism.
If searched by an overall goal function G used as a ranking function: current techniques restrict G to be monotonic.

8 Our thesis: Evaluate query as its nature suggests!
OPT*: optimize G over D
Function optimization of G + discrete state search over D

9 We view compound index as discrete space
#   Addr                Zip     Price   Size
1.  Oak park, Chicago   60644   600K    4500
2.  Mattis, Champaign   61821   350K    2000
3.                              150K    1000
4.                              250K
5.                              300K    3500
6.                              80K     500

10 We view compound index as discrete space
[Figure: the tuples of D plotted as points in a 2-D space with axes size (ticks 1500, 3000, 4000, 4500) and Price (k) (ticks 100, 250, 350, 600). The index on size contributes nodes a1, a2, a3, ..., a6, a7 (ranges such as 0-3000 and 0-1500); the index on price contributes nodes b1, b2, b3, ..., b6, b7 (ranges such as 0-250 and 0-100). Each leaf region corresponds to an index node.]

11 We view compound index as discrete space
[Figure: the same 2-D space (axes size and Price (k)), overlaid with regions Mij = (ai, bj), each pairing a node ai of the size index with a node bj of the price index. Regions shown: M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76.]

12 We view compound index as discrete space
Conceptually, the two indices form one combined space.
[Figure: the 2-D space (axes size and Price (k)) with regions Mij = (ai, bj): M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76.]

13 How to perform the search in the space?
What is the search mechanism? How to conceptually view the index space of D for search.
How to guide the search? How to use function G to focus the search.

14 Challenge 1: What is the search mechanism?

15 We encode as A* because it’s optimal
What A* is: finding the shortest path
Why we choose it: completeness and optimality with proper heuristics
Complete: guaranteed to find the shortest path
Optimal: visits the least number of nodes
[Figure: a weighted graph with a highlighted path from origin to destination.]
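For readers who want the mechanics, here is a minimal textbook A* over a weighted graph. The graph, its edge weights, and the heuristic values are hypothetical (the slide's figure is not recoverable from the transcript); the heuristic is admissible because it never overestimates the remaining distance.

```python
import heapq

def a_star(graph, start, goal, h):
    """Return (cost, path) of a shortest path from start to goal, or None.

    graph maps node -> {neighbor: edge weight}; h(node) must never
    overestimate the true remaining distance (admissible heuristic).
    """
    frontier = [(h(start), 0, start, [start])]  # entries: (g + h, g, node, path)
    best = {start: 0}                           # cheapest known cost to each node
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best.get(node, float("inf")):
            continue  # stale queue entry: a cheaper route was found later
        for nxt, weight in graph.get(node, {}).items():
            new_g = g + weight
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None

# Hypothetical example graph and heuristic.
GRAPH = {
    "origin": {"a": 3, "b": 5},
    "a": {"b": 1, "destination": 9},
    "b": {"destination": 2},
}
ESTIMATE = {"origin": 5, "a": 3, "b": 2, "destination": 0}

result = a_star(GRAPH, "origin", "destination", lambda n: ESTIMATE[n])
print(result)
```

With a heuristic that is always zero, the same function behaves like Dijkstra's algorithm; a tighter admissible heuristic lets it expand fewer nodes.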

16 Encoding our problem into shortest path is challenging
K-constrained optimization: find a tuple with maximal score
Shortest path: find a path with minimal distance
How to encode: a tuple → a path? score of tuple → distance of path?

17 Therefore, we encode K-constrained opt. as:
How to encode a tuple as a path? Add a virtual target t* that is reachable only through tuples.
How to encode a maximal tuple as a minimal path? The quality of a path depends solely on the tuple it passes through.
For a tuple state t: D(t, t*) = -G(t)
For two states r, u: D(r, u) = 0
[Figure: the region states (M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76) lead to tuple states (4, 2, 5, 1), which connect to the virtual target t* by edges weighted -G(1), -G(4), and so on.]
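The effect of this encoding can be checked on a toy state space: every path to t* costs exactly -G of the tuple it passes through, so the cheapest path runs through the highest-scoring tuple. The state names, the small hierarchy, and the scores below are hypothetical, and the exhaustive path enumeration is for illustration only.

```python
def path_costs(children, scores, root="root", target="t*"):
    """Enumerate every root-to-target path and its total distance.

    children maps a region state to its successor states; scores maps a
    tuple state t to G(t). Edges between states cost 0 and the single
    edge from a tuple state t to the virtual target costs -G(t).
    """
    costs = {}

    def walk(node, path):
        if node in scores:                      # tuple state: one edge to t*
            costs[tuple(path + [target])] = -scores[node]
            return
        for child in children.get(node, []):    # state-to-state edges cost 0
            walk(child, path + [child])

    walk(root, [root])
    return costs

# Hypothetical state space: two regions holding three tuples.
CHILDREN = {"root": ["M1", "M2"], "M1": ["t1", "t4"], "M2": ["t2"]}
SCORES = {"t1": 0.015, "t4": 0.004, "t2": 0.0}

costs = path_costs(CHILDREN, SCORES)
shortest = min(costs, key=costs.get)
print(shortest, costs[shortest])
```

Here the shortest path passes through `t1`, the tuple with the largest G.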

18 Challenge 2: How to guide the search?

19 We use function opt. to sketch the landscape of G
Function optimization measures the quality of states.
Function optimization answers:
1. How to define heuristics?
2. How to configure the space?
3. Where to start the search?

20 1. Define admissible heuristics: Measure tightest upper bound
To guarantee completeness: A* requires admissible heuristics, i.e., estimates that are optimistic.
To ensure admissible heuristics: function optimization gives the tightest upper bound, through analytical approaches or a numeric analysis package.
H(region) = OPTMAX(G, region), i.e., the maximal value of G in the region
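For the house query, the bound can be worked out analytically. The sketch below assumes a rectangular region given as (size_lo, size_hi, price_lo, price_hi) with non-negative sizes; the function name `optmax_house` and the region layout are assumptions, not part of the original system.

```python
def optmax_house(region):
    """Upper bound of G = size / |price - 300k| [price <= 200k or price >= 400k]
    over a box region (size_lo, size_hi, price_lo, price_hi)."""
    _, size_hi, price_lo, price_hi = region
    # Qualifying prices inside the box that lie closest to 300k.
    candidates = []
    if price_lo <= 200_000:
        candidates.append(min(price_hi, 200_000))
    if price_hi >= 400_000:
        candidates.append(max(price_lo, 400_000))
    if not candidates:
        return 0.0  # no price in the box satisfies B, so G is 0 everywhere
    gap = min(abs(p - 300_000) for p in candidates)  # at least 100k by construction
    # G grows with size and shrinks with the gap, so pair the largest size
    # with the smallest qualifying gap.
    return size_hi / gap

low_price_box = (0, 1500, 0, 250_000)
print(optmax_house(low_price_box))
print(optmax_house((0, 4500, 250_000, 350_000)))  # box holds no qualifying price
```

Any tuple inside `low_price_box`, such as tuple 3 (150K, 1000) with G = 1000/150000, scores no higher than the bound, which is what admissibility requires.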

21 2. Configure descending space: disconnect uphills
To guarantee optimality: A* requires descending heuristics.
To ensure descending heuristics: remove uphill links.
Knowing the function tells us the relative quality of regions, so the search can always go downhill. In the initial setting all leaf regions are interconnected, which is problematic.
[Figure: the state space (M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76; tuples 4, 2, 5, 1) with the uphill links between leaf regions, e.g., between M66 and M77, crossed out.]
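Removing uphill links amounts to orienting each neighbor link from the region with the higher bound to the one with the lower bound. The sketch below is one way to express that; the heuristic values are hypothetical, and breaking ties by region name is an added assumption so that two regions with equal bounds keep exactly one direction.

```python
def descending_links(neighbors, H):
    """Keep only links u -> v that go downhill in the heuristic H.

    neighbors maps each leaf region to the regions it is connected to.
    Ties in H are broken by region name (an assumption).
    """
    links = {u: [] for u in neighbors}
    for u, vs in neighbors.items():
        for v in vs:
            if (H[u], u) > (H[v], v):
                links[u].append(v)
    return links

# Hypothetical bounds for three mutually connected leaf regions.
NEIGHBORS = {
    "M66": ["M77", "M67"],
    "M77": ["M66", "M67"],
    "M67": ["M66", "M77"],
}
H = {"M66": 0.9, "M77": 0.5, "M67": 0.7}

links = descending_links(NEIGHBORS, H)
print(links)
```

After the pruning, the link between M66 and M77 survives only in the direction M66 → M77.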

22 3. Find the right start point: start from local optima
To guarantee correctness: every tuple state must be reachable from the start states. Taking only downhill links requires starting from high points.
To ensure reachability: the initial states should contain all local optima.
[Figure: the configured state space (M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76; tuples 4, 2, 5, 1).]
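The reachability argument can be tested on a small example: a region is a local optimum when none of its neighbors has a higher bound, and a downhill-only traversal from all local optima should cover every region. The four-region chain and its bounds are hypothetical, and ties are again broken by name as an assumption.

```python
from collections import deque

def local_optima(neighbors, H):
    """Regions that have no neighbor with a higher bound."""
    return [u for u, vs in neighbors.items()
            if all((H[u], u) > (H[v], v) for v in vs)]

def reachable_downhill(neighbors, H, starts):
    """All regions reachable from starts by following only downhill links."""
    seen = set(starts)
    queue = deque(starts)
    while queue:
        u = queue.popleft()
        for v in neighbors[u]:
            if (H[u], u) > (H[v], v) and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

# Hypothetical chain of regions A - B - C - D with two peaks (A and C).
NEIGHBORS = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
H = {"A": 0.9, "B": 0.4, "C": 0.8, "D": 0.2}

starts = local_optima(NEIGHBORS, H)
print(starts)
print(reachable_downhill(NEIGHBORS, H, starts) == set(NEIGHBORS))
print(reachable_downhill(NEIGHBORS, H, ["A"]))
```

Starting from the global maximum A alone misses C and D, because reaching C from B would be an uphill step; including every local optimum restores full coverage.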

23 Putting together: Executing A* on the configured space
Top-down: [Figure: the configured space, searched from M11 through M22, M32, M23, M33 and the leaf regions M55, M75, M56, M66, M77, M67, M76, M57 down to tuples 4, 2, 5, 1.]
The search is implemented as a priority-queue-driven traversal.

24 Putting together: Executing A* on the configured space
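Top-down: [Figure: the configured space, searched from M11 through M22, M32, M23, M33 and the leaf regions M55, M75, M56, M66, M77, M67, M76, M57 down to tuples 4, 2, 5, 1.]
Bottom-up: [Figure: the same states (M11; M22, M32, M23, M33; M55, M75, M56, M66, M77, M67, M76, M57; tuples 2, 5, 1, 4), searched starting from the leaf regions.]
The bottom-up approach is always better than top-down.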
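A priority-queue-driven traversal of this kind can be sketched as a generic best-first search over a region hierarchy, in the top-down direction. This is an illustration under stated assumptions, not the OPT* implementation: the tree, the bounds, the scores, and all function names are hypothetical. Regions are queued by their upper bound and tuples by their exact score, so tuples leave the queue in descending order of G.

```python
import heapq
from itertools import count

def best_first_top_k(root, children, upper_bound, score, k):
    """Return up to k (tuple, G) pairs in descending order of G.

    children(region) yields (child, is_tuple) pairs; upper_bound(region)
    must be at least the score of every tuple below that region.
    """
    tie = count()  # tie-breaker so the heap never compares states directly
    heap = [(-upper_bound(root), next(tie), root, False)]
    results = []
    while heap and len(results) < k:
        neg_key, _, item, is_tuple = heapq.heappop(heap)
        if is_tuple:
            # Nothing left in the queue can beat this tuple's exact score.
            results.append((item, -neg_key))
            continue
        for child, child_is_tuple in children(item):
            key = score(child) if child_is_tuple else upper_bound(child)
            heapq.heappush(heap, (-key, next(tie), child, child_is_tuple))
    return results

# Hypothetical two-level hierarchy over four tuples.
TREE = {"root": ["left", "right"], "left": ["t3", "t6"], "right": ["t1", "t5"]}
BOUNDS = {"root": 0.045, "left": 0.015, "right": 0.045}
SCORES = {"t1": 0.015, "t3": 1000 / 150_000, "t5": 0.0, "t6": 500 / 220_000}

def children(region):
    return [(c, c not in TREE) for c in TREE[region]]

top2 = best_first_top_k("root", children, BOUNDS.__getitem__, SCORES.__getitem__, 2)
print(top2)
```

The search stops as soon as k tuples have been popped, so regions whose bounds are too low are never expanded.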

25 Experiments
Compared against: Boolean then ranking; ranking then Boolean
Metric: nodes accessed = Nl + Nt
Settings: benchmark queries over a real dataset; controlled queries over synthetic datasets

26 Benchmark queries
Dataset: 19,706 real estate listings crawled online
Queries:
Q1: size * bedrms / |price-450k| : [40k <= price <= 50k]
Q2: size * e^bedrms / |price-350k| : [price < 400k ∧ size > 4000]
Q3: size / price : [bedrms = 3 ∨ bedrms = 4]
[Chart: BR_clustered, BR_unclustered, and OPT* compared on Q1, Q2, and Q3.]

27 Controlled queries
Datasets: three randomly generated datasets of 100k points (uniform, Gaussian, log-normal)
Queries:
Linear average queries (e.g., 0.4*a + 0.6*b)
Nearest neighbor queries (e.g., (x-3)^2 + (y-4)^2)
Join queries (e.g., 0.4*R.a + 0.6*S.b : R.c = R.d)
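A setup like this is straightforward to reproduce in outline. The sketch below generates two-attribute points from the three distributions with Python's standard `random` module and evaluates the two single-table query shapes by brute force. The distribution parameters, the seed, the smaller point count, and the function names are assumptions; the slides do not give them. The nearest neighbor expression is treated here as a distance to minimize.

```python
import random

def make_dataset(kind, n, seed=0):
    """Generate n two-attribute points from the named distribution."""
    rng = random.Random(seed)
    draw = {
        "uniform": lambda: rng.uniform(0.0, 10.0),        # assumed range
        "gaussian": lambda: rng.gauss(5.0, 1.5),          # assumed mean and sigma
        "lognormal": lambda: rng.lognormvariate(0.0, 0.5),  # assumed mu and sigma
    }[kind]
    return [(draw(), draw()) for _ in range(n)]

def linear_average(p):
    """Linear average query from the slide: 0.4*a + 0.6*b."""
    return 0.4 * p[0] + 0.6 * p[1]

def nearest_neighbor(p):
    """Squared distance to the query point (3, 4); smaller is better."""
    return (p[0] - 3) ** 2 + (p[1] - 4) ** 2

points = make_dataset("uniform", 1000)  # the talk used 100k points
best_linear = max(points, key=linear_average)
closest = min(points, key=nearest_neighbor)
print(best_linear, closest)
```

Fixing the seed keeps the generated points identical across runs, which makes access counts comparable between methods.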

28 Conclusion
Problem: study K-constrained optimization queries as Boolean + ranking
Abstraction: encode K-constrained optimization as a shortest path problem
Framework: develop OPT* to process K-constrained optimization

29 Thank you! Questions?

30 Backup: anticipated questions
How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing the upper bound for each region is costly.
Random vs. sequential I/O.
Are indices assumed on every attribute? Not every attribute has an index on it; selectively choose the right index (attribute) to use.
Is the state space materialized for every query?
The number of states grows exponentially with the number of attributes. We do perform experiments to study how the system scales with #attr.
"Your algorithm is not optimal because you change the space."

