
1 SPARK: Top-k Keyword Query in Relational Database
Wei Wang, University of New South Wales, Australia

2 Outline
- Demo & Introduction
- Ranking
- Query Evaluation
- Conclusions

3 Demo

4 Demo …

5 SPARK I
Searching, Probing & Ranking Top-k Results
- Thesis project (2004 - 2005) with Nino Svonja
- Taste of Research Summer Scholarship (2005)
- Finally, CISRA prize winner
- http://www.computing.unsw.edu.au/softwareengineering.php

6 SPARK II
- Continued as a research project with PhD student Yi Luo (2005 - 2006)
- SIGMOD 2007 paper
- Still under active development

7 A Motivating Example

8 A Motivating Example …
Top-3 results in our system:
1. Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: "Primetime Glick" (2001) Tom Hanks/Ben Stiller (#2.1) ⋈ ActorPlay: Character = Himself ⋈ Actors: Hanks, Tom
3. Actors: John Hanks ⋈ ActorPlay: Character = Alexander Kerst ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)

9 Improving the Effectiveness
Three factors are considered to contribute to the final score of a search result (a joined tuple tree):
- the (modified) IR ranking score
- the completeness factor
- the size normalization factor

10 Preliminaries
- Data model: relation-based
- Query model: joined tuple trees (JTTs)
- Sophisticated ranking:
  - addresses one flaw in previous approaches
  - unifies AND and OR semantics
  - alternative size normalization

11 Problems with DISCOVER2

            score(c_i)   score(p_j)   score
c1 ⋈ p1     1.0          1.0          2.0
c2 ⋈ p2     1.0          1.0          2.0

signature   SPARK
(1, 1)      0.98
(0, 2)      0.44

12 Virtual Document
Combine tf contributions before tf normalization / attenuation.

c_i ⋈ p_j   score(maxtor)   score(netvista)   score_a
c1 ⋈ p1     1.00            1.00              2.00
c2 ⋈ p2     0.00            1.53              1.53
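The difference can be sketched numerically. Assuming the usual double-log tf attenuation (idf weighting and document-length normalization are omitted here for clarity), a result containing the same keyword once in each of two joined tuples scores 2.00 when each tuple is attenuated separately and summed, but 1.53 when the raw tf values are combined into a virtual document first:

```python
import math

def attenuate(tf):
    # Double-log tf attenuation, 1 + ln(1 + ln(tf)) for tf >= 1.
    return 1 + math.log(1 + math.log(tf)) if tf >= 1 else 0.0

def per_tuple_score(tfs):
    # DISCOVER2-style: attenuate each tuple's tf separately, then sum.
    return sum(attenuate(tf) for tf in tfs)

def virtual_doc_score(tfs):
    # SPARK-style: sum raw tf across the joined tuple tree first,
    # then attenuate once on the combined (virtual-document) tf.
    return attenuate(sum(tfs))

# One occurrence of the keyword in each of two joined tuples:
per_tuple_score([1, 1])    # -> 2.0
virtual_doc_score([1, 1])  # -> ~1.53
```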

13 Virtual Document Collection
Collection: 3 results
idf_netvista = ln(4/3)
idf_maxtor = ln(4/2)
Estimate idf: idf_netvista = …, idf_maxtor = …
Estimate avdl = avdl_C + avdl_P

            score_a
c1 ⋈ p1     0.98
c2 ⋈ p2     0.44

14 Completeness Factor
For "short queries", users prefer results matching more keywords.
Derive the completeness factor from the extended Boolean model: measure the L_p distance from the result's position in keyword space to the ideal position (1, 1).
(Figure: results plotted in the netvista/maxtor keyword space; ideal position (1, 1); d(c1 ⋈ p1) = 0.5, d(c2 ⋈ p2) = 1, maximum L_2 distance = 1.41.)

            score_b
c1 ⋈ p1     (1.41 - 0.5) / 1.41 = 0.65
c2 ⋈ p2     (1.41 - 1) / 1.41 = 0.29
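The computation above can be sketched as follows, assuming each coordinate is the result's normalized score for one keyword in [0, 1] (the exact normalization of the coordinates is not given on the slide):

```python
import math

def completeness_factor(coords, p=2.0):
    # coords: the result's position in keyword space, one value per
    # query keyword (1 = keyword fully matched).
    # score_b = (d_max - d) / d_max, where d is the L_p distance to the
    # ideal position (1, ..., 1) and d_max the distance from the origin
    # to the ideal position.
    m = len(coords)
    d = sum((1.0 - x) ** p for x in coords) ** (1.0 / p)
    d_max = m ** (1.0 / p)
    return (d_max - d) / d_max

# The slide's two results: d = 0.5 and d = 1 under the L_2 distance.
completeness_factor([1.0, 0.5])  # -> ~0.65
completeness_factor([1.0, 0.0])  # -> ~0.29
```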

15 Size Normalization
Results in large CNs tend to have more matches to the keywords.
score_c = (1 + s_1 - s_1 * |CN|) * (1 + s_2 - s_2 * |CN_nf|)
Empirically, s_1 = 0.15 and s_2 = 1 / (|Q| + 1) work well.
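A direct transcription of the formula, under the assumption that |CN_nf| counts the non-free tuple sets of the candidate network (the slide does not spell out the notation):

```python
def size_normalization(cn_size, cn_nf_size, query_size, s1=0.15):
    # score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN_nf|)
    # cn_size: |CN|, number of tuples in the candidate network;
    # cn_nf_size: |CN_nf| (assumed: number of non-free tuple sets);
    # s2 = 1/(|Q|+1), with |Q| the number of query keywords.
    s2 = 1.0 / (query_size + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nf_size)
```

A single-tuple result is left unpenalized (score_c = 1), and the factor shrinks as the CN grows, which is exactly the intended bias against large join trees.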

16 Putting 'em Together
score(JTT) = score_a * score_b * score_c
- score_a: IR score of the virtual document
- score_b: completeness factor
- score_c: size normalization factor

            score_a * score_b
c1 ⋈ p1     0.98 * 0.65 = 0.64
c2 ⋈ p2     0.44 * 0.29 = 0.13
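As a worked check of the combination (the slide's example multiplies only score_a and score_b, so score_c defaults to 1 here):

```python
def final_score(score_a, score_b, score_c=1.0):
    # score(JTT) = score_a * score_b * score_c (slide 16).
    return score_a * score_b * score_c

final_score(0.98, 0.65)  # c1 join p1 -> ~0.64
final_score(0.44, 0.29)  # c2 join p2 -> ~0.13
```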

17 Comparing Top-1 Results
DBLP; Query = "nikos clique"

18 #Rel and R-Rank Results

DBLP; 18 queries; union of top-20 results
           DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel       2           2                       16        16        18
R-Rank     ≈ 0.243     ≈ 0.333                 0.926     0.935     1.000

Mondial; 35 queries; union of top-20 results
           DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4   p = 2.0
#Rel       2           10                      27        29        34
R-Rank     ≈ 0.276     ≈ 0.491                 0.881     0.909     0.986

19 Query Processing
Three steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)

20 Query Processing …
Three steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)

21 Query Processing …
Three steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
Most algorithms differ in the last step; the key is how to optimize for top-k retrieval.
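The three steps can be sketched end-to-end on a toy in-memory model. Everything here is a stand-in, not SPARK's actual API: relations are lists of text tuples, CNs are just single relations and pairwise joins, and the ranking is a simple keyword count:

```python
from itertools import combinations

def keyword_search(relations, keywords, k=3):
    # Step 1: candidate tuples per relation (stand-in for full-text indexes).
    cand = {name: [t for t in rows if any(w in t.lower() for w in keywords)]
            for name, rows in relations.items()}
    cand = {name: rows for name, rows in cand.items() if rows}
    # Step 2: enumerate candidate networks; real systems enumerate join
    # trees over the schema graph, here just single relations and pairs.
    cns = [(n,) for n in sorted(cand)] + list(combinations(sorted(cand), 2))
    # Step 3: execute each CN and rank joined tuple trees by a toy score.
    results = []
    for cn in cns:
        for jtt in _cross_product(cand, cn):
            text = " ".join(jtt).lower()
            score = sum(text.count(w) for w in keywords)
            results.append((score, cn, jtt))
    results.sort(key=lambda r: r[0], reverse=True)
    return results[:k]

def _cross_product(cand, cn):
    # All combinations of one candidate tuple per relation in the CN.
    rows = [()]
    for name in cn:
        rows = [prefix + (t,) for prefix in rows for t in cand[name]]
    return rows
```

A joined result covering both keywords outranks the single-relation results, mirroring the completeness intuition from the ranking slides.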

22 Monotonic Scoring Function
Execute a CN: P^Q ⋈ C^Q (DISCOVER2). Assume idf_netvista > idf_maxtor and k = 1.
(Figure: the candidate lists P1, P2 and C1, C2 of the two relations, sorted by score.)

            score(c_i)   score(p_j)   score
c1 ⋈ p1     1.06         0.97         2.03
c2 ⋈ p2     1.06         1.06         2.12

Since the score is monotonic, score(p_1) < score(p_2) implies score(c1 ⋈ p1) < score(c2 ⋈ p2): the component order agrees with the final order, so execution can stop early.

23 Non-Monotonic Scoring Function
Execute a CN: P^Q ⋈ C^Q (SPARK). Assume idf_netvista > idf_maxtor and k = 1.

            score(c_i)   score(p_j)   score_a
c1 ⋈ p1     1.06         0.97         0.98
c2 ⋈ p2     1.06         1.06         0.44

Component-wise, c1 ⋈ p1 ranks below c2 ⋈ p2, yet its final score_a is higher. Two problems follow:
1. Re-establish the early stopping criterion
2. Check candidates in an optimal order

24 Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK's non-monotonic scoring function.
Details:
sumidf = Σ_w idf_w
watf(t) = (1/sumidf) * Σ_w tf_w(t) * idf_w
A = sumidf * (1 + ln(1 + ln(Σ_t watf(t))))
B = sumidf * Σ_t watf(t)
Then score_a ≤ uscore_a = (1/(1-s)) * min(A, B).
score_b and score_c are constants given the CN, so score ≤ uscore, and uscore is monotonic w.r.t. watf(t).
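A direct transcription of the bound, assuming per-tuple term frequencies and per-keyword idf values as inputs (the slope s is the pivoted-normalization constant; 0.2 is an assumed value, the slide does not fix it):

```python
import math

def uscore_a(tf_by_tuple, idf, s=0.2):
    # tf_by_tuple: one {keyword: tf} dict per tuple in the JTT;
    # idf: {keyword: idf} over the query keywords.
    sumidf = sum(idf.values())                                # sum_w idf_w
    watf = [sum(tf.get(w, 0) * idf[w] for w in idf) / sumidf  # watf(t)
            for tf in tf_by_tuple]
    total = sum(watf)                                         # sum_t watf(t)
    if total <= 0:
        return 0.0
    b = sumidf * total                                        # B
    inner = 1 + math.log(total)
    a = sumidf * (1 + math.log(inner)) if inner > 0 else b    # A (guarded)
    return (1.0 / (1.0 - s)) * min(a, b)
```

The guard on `inner` keeps the double log defined for small watf totals, where B is the tighter (smaller) bound anyway. The key property is monotonicity: increasing any tf can only raise the bound.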

25 Early Stopping Criterion
Execute a CN: P^Q ⋈ C^Q (SPARK). Assume idf_netvista > idf_maxtor and k = 1.

            uscore   score_a
c1 ⋈ p1     1.13     0.98
c2 ⋈ p2     1.76     0.44

Once score(best result found) ≥ uscore(best unchecked candidate), stop!
1. Re-establish the early stopping criterion ✓
2. Check candidates in an optimal order

26 Query Processing …
Execute the CNs [VLDB 03]
CN: P^Q ⋈ C^Q. {P1, P2, …} and {C1, C2, …} have been sorted by their IR relevance scores; score(Pi ⋈ Cj) = score(Pi) + score(Cj).
Operations:
[P1, P1] ⋈ [C1, C1]
C.get_next(): [P1, P1] ⋈ C2
P.get_next(): P2 ⋈ [C1, C2]
P.get_next(): P3 ⋈ [C1, C2]
…
(a parametric SQL query is sent to the DBMS for each probe)

27 Skyline Sweeping Algorithm
Execute the CNs (Skyline Sweep)
Operations: candidates are checked in decreasing order of their upper bounds, drawn from a priority queue seeded with P1 ⋈ C1, then P2 ⋈ C1, P3 ⋈ C1, …
Dominance: uscore(Pi ⋈ Cj) > uscore(P(i+1) ⋈ Cj) and uscore(Pi ⋈ Cj) > uscore(Pi ⋈ C(j+1)), so only the non-dominated frontier needs to be queued.
1. Re-establish the early stopping criterion ✓
2. Check candidates in an optimal order ✓ (sort of)
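A minimal sketch of the sweep, assuming the upper bound is the sum of the two sorted component scores; for brevity the "real" score check is the identity, standing in for SPARK's non-monotonic score_a:

```python
import heapq

def skyline_sweep(p_scores, c_scores, k=2):
    # p_scores, c_scores: component scores sorted in decreasing order.
    # Candidates (i, j) are popped in decreasing order of the monotone
    # upper bound p_scores[i] + c_scores[j]. On popping (i, j), only its
    # directly dominated neighbours are pushed, and (i, j+1) only when
    # i == 0, so each cell enters the queue exactly once.
    heap = [(-(p_scores[0] + c_scores[0]), 0, 0)]
    top = []
    while heap and len(top) < k:
        neg_u, i, j = heapq.heappop(heap)
        top.append(((i, j), -neg_u))
        if i + 1 < len(p_scores):
            heapq.heappush(heap, (-(p_scores[i + 1] + c_scores[j]), i + 1, j))
        if i == 0 and j + 1 < len(c_scores):
            heapq.heappush(heap, (-(p_scores[i] + c_scores[j + 1]), 0, j + 1))
    return top
```

The queue never holds more than the dominance frontier, which is the point of sweeping the skyline rather than materializing the whole grid of join candidates.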

28 Block Pipeline Algorithm
There is an inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions: many candidates with high uscores return much lower real scores, causing unnecessary (expensive) checking and preventing earlier stopping.
Idea: partition the space into blocks and derive a tighter upper bound (bscore) for each partition; be "unwilling" to check a candidate until we are quite sure about its "prospect" (bscore).

29 Block Pipeline Algorithm …
Execute a CN: P^Q ⋈ C^Q (Block Pipeline). Assume idf_n > idf_m and k = 1.
Candidates are partitioned into blocks by keyword signature, e.g. (n:1, m:0) and (n:0, m:1), each with its own uscore and tighter bscore.
(Table partially recoverable from the transcript: uscores 2.74 and 2.50 against bscores/real scores such as 2.63, 2.41, 2.38, 1.05, 0.95; once the best real score found exceeds every remaining bscore: stop!)
1. Re-establish the early stopping criterion ✓
2. Check candidates in an optimal order ✓

30 Efficiency
DBLP: ~0.9M tuples in total; k = 10; PC with 1.8 GHz CPU and 512 MB RAM.

31 Efficiency …
DBLP, DQ13

32 Conclusions
A system that performs effective and efficient keyword search on relational databases:
- meaningful query results with appropriate rankings
- second-level response times for a ~10M-tuple database (IMDB data) on a commodity PC

33 Q&A
Thank you.

