13/04/2015 1 SPARK: Top-k Keyword Query in Relational Database
Wei Wang, University of New South Wales, Australia

13/04/2015 2 Outline
Demo & Introduction; Ranking; Query Evaluation; Conclusions

13/04/2015 3 Demo

13/04/2015 4 Demo …

13/04/2015 5 SPARK I
Searching, Probing & Ranking Top-k Results. Thesis project (2004 – 2005) with Nino Svonja. Taste of Research Summer Scholarship (2005). Finally, CISRA prize winner: http://www.computing.unsw.edu.au/softwareengineering.php

13/04/2015 6 SPARK II
Continued as a research project with PhD student Yi Luo (2005 – 2006). SIGMOD 2007 paper. Still under active development.

13/04/2015 7 A Motivating Example

13/04/2015 8 A Motivating Example …
Top-3 results in our system:
1. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1) ⋈ ActorPlay: Character = Himself ⋈ Actors: Hanks, Tom
3. Actors: John Hanks ⋈ ActorPlay: Character = Alexander Kerst ⋈ Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)

13/04/2015 9 Improving the Effectiveness
Three factors contribute to the final score of a search result (a joined tuple tree): 1) the (modified) IR ranking score; 2) the completeness factor; 3) the size normalization factor.

13/04/2015 10 Preliminaries
Data model: relation-based. Query model: joined tuple trees (JTTs). Sophisticated ranking: addresses one flaw in previous approaches; unifies AND and OR semantics; offers an alternative size normalization.

13/04/2015 11 Problems with DISCOVER2
            score(c_i)   score(p_j)   score
c1 ⋈ p1     1.0          1.0          2.0
c2 ⋈ p2     1.0          1.0          2.0

signature   SPARK
(1, 1)      0.98
(0, 2)      0.44

13/04/2015 12 Virtual Document
Combine tf contributions before tf normalization / attenuation.
c_i ⋈ p_j   score(maxtor)   score(netvista)   score_a*
c1 ⋈ p1     1.00            1.00              2.00
c2 ⋈ p2     1.53            0.00              1.53
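The effect of combining term frequencies before attenuation can be checked with a small sketch. It assumes the attenuation 1 + ln(1 + ln(tf)) that appears in the upper-bound slide of this talk, ignores idf and length normalization, and assumes a term distribution consistent with the signatures (1, 1) and (0, 2): c1 and c2 each contain "maxtor" once, p1 contains "netvista" once, and p2 contains "maxtor" once. All function and variable names are illustrative, not taken from the SPARK system.

```python
import math

def attenuated_tf(tf):
    """Attenuated term frequency: 1 + ln(1 + ln(tf)) for tf > 0, else 0."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

def per_tuple_score(tuples, keywords):
    """Attenuate each tuple's tf separately, then add up the contributions."""
    return sum(attenuated_tf(t.get(w, 0)) for t in tuples for w in keywords)

def virtual_document_score(tuples, keywords):
    """Sum each keyword's tf over the whole joined tree, then attenuate once."""
    return sum(attenuated_tf(sum(t.get(w, 0) for t in tuples)) for w in keywords)

KEYWORDS = ["maxtor", "netvista"]
# Assumed term frequencies per tuple (hypothetical, chosen to match the signatures).
c1_p1 = [{"maxtor": 1}, {"netvista": 1}]   # signature (1, 1)
c2_p2 = [{"maxtor": 1}, {"maxtor": 1}]     # signature (0, 2)

for name, jtt in (("c1-p1", c1_p1), ("c2-p2", c2_p2)):
    print(name,
          round(per_tuple_score(jtt, KEYWORDS), 2),
          round(virtual_document_score(jtt, KEYWORDS), 2))
```

Under these assumptions the per-tuple sum cannot tell the two results apart, while the virtual-document score rewards the result that covers both keywords.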

13/04/2015 13 Virtual Document Collection
Collection: 3 results. idf_netvista = ln(4/3); idf_maxtor = ln(4/2).
Estimate idf: idf_netvista = [formula lost in extraction]; idf_maxtor = [formula lost in extraction].
Estimate avdl = avdl_C + avdl_P.
            score_a
c1 ⋈ p1     0.98
c2 ⋈ p2     0.44

13/04/2015 14 Completeness Factor
For “short queries”, users prefer results matching more keywords. Derive the completeness factor based on the extended Boolean model: measure the L_p distance to the ideal position.
[Figure: axes netvista and maxtor; ideal position at (1, 1); c1 ⋈ p1 at distance d = 0.5, c2 ⋈ p2 at distance d = 1; maximum distance d = 1.41]
            L_2 distance score_b
c1 ⋈ p1     (1.41 - 0.5) / 1.41 = 0.65
c2 ⋈ p2     (1.41 - 1) / 1.41 = 0.29
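The distance-based completeness factor on this slide can be sketched as follows. The slide gives only the distances (0.5 and 1) and the maximum distance 1.41, so the coordinates below are hypothetical points chosen to reproduce those distances; the function name and the normalization by m^(1/p) for m keywords are assumptions read off the slide's arithmetic.

```python
def completeness(point, p=2.0):
    """Normalized closeness of `point` to the ideal position (1, ..., 1).

    `point` holds one coordinate in [0, 1] per query keyword. The L_p
    distance to the ideal position is divided by its largest possible
    value, m ** (1 / p) for m keywords, and subtracted from 1.
    """
    m = len(point)
    distance = sum((1.0 - x) ** p for x in point) ** (1.0 / p)
    max_distance = m ** (1.0 / p)
    return (max_distance - distance) / max_distance

# Hypothetical coordinates (netvista, maxtor) giving d = 0.5 and d = 1.
c1_p1 = (1.0, 0.5)
c2_p2 = (0.0, 1.0)

print(round(completeness(c1_p1), 2))  # slide value: 0.65
print(round(completeness(c2_p2), 2))  # slide value: 0.29
```

Increasing p penalizes a completely missing keyword more heavily, which is how one parameter can move the ranking between OR-like and AND-like behavior.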

13/04/2015 15 Size Normalization
Results in large CNs tend to have more matches to the keywords.
$\mathrm{score}_c = (1 + s_1 - s_1 \cdot |CN|) \cdot (1 + s_2 - s_2 \cdot |CN^{nf}|)$
Empirically, $s_1 = 0.15$ and $s_2 = 1 / (|Q| + 1)$ work well.
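A minimal sketch of the size normalization factor, using the empirical parameters from the slide. It assumes |CN| is the number of tuple sets in the candidate network and |CN^nf| the number of non-free ones (those that must match a keyword); the function and parameter names are illustrative.

```python
def size_normalization(cn_size, cn_nf_size, query_len, s1=0.15):
    """score_c = (1 + s1 - s1 * |CN|) * (1 + s2 - s2 * |CN^nf|).

    s2 defaults to 1 / (|Q| + 1), the value the slide reports as
    working well empirically.
    """
    s2 = 1.0 / (query_len + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nf_size)

# A two-keyword query: a single-tuple CN versus larger CNs.
for cn_size, cn_nf_size in ((1, 1), (2, 2), (3, 2)):
    print(cn_size, cn_nf_size, round(size_normalization(cn_size, cn_nf_size, 2), 3))
```

A CN with a single tuple set scores 1, and each additional tuple set lowers the factor, which offsets the tendency of large CNs to collect more keyword matches.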

13/04/2015 16 Putting ’em Together
score(JTT) = score_a * score_b * score_c
a: IR score of the virtual document; b: completeness factor; c: size normalization factor.
            score_a * score_b
c1 ⋈ p1     0.98 * 0.65 = 0.64
c2 ⋈ p2     0.44 * 0.29 = 0.13
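As a quick check of the arithmetic, the sketch below multiplies the factors and ranks the two example results. The slide compares only score_a * score_b, so score_c is left at 1.0 here as an assumption; the names are illustrative.

```python
def jtt_score(score_a, score_b, score_c=1.0):
    """Final score of a joined tuple tree: the product of the three factors."""
    return score_a * score_b * score_c

# (score_a, score_b) pairs taken from the slides.
candidates = {"c1_p1": (0.98, 0.65), "c2_p2": (0.44, 0.29)}

ranked = sorted(candidates, key=lambda name: jtt_score(*candidates[name]), reverse=True)
for name in ranked:
    print(name, round(jtt_score(*candidates[name]), 2))
```

Because the factors are multiplied, a result that is weak on any one of them is pushed down regardless of the others.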

13/04/2015 17 Comparing Top-1 Results DBLP; Query = “nikos clique”

13/04/2015 18 #Rel and R-Rank Results
DBLP; 18 queries; union of top-20 results
          DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4     p = 2.0
#Rel      2           2                       16        [missing]   18
R-Rank    ≤ 0.243     ≤ 0.333                 0.926     0.935       1.000

Mondial; 35 queries; union of top-20 results
          DISCOVER2   [Liu et al, SIGMOD06]   p = 1.0   p = 1.4     p = 2.0
#Rel      2           10                      27        29          34
R-Rank    ≤ 0.276     ≤ 0.491                 0.881     0.909       0.986

13/04/2015 19 Query Processing
3 steps: 1) Generate candidate tuples in every relation in the schema (using full-text indexes).

13/04/2015 20 Query Processing …
3 steps: 1) Generate candidate tuples in every relation in the schema (using full-text indexes). 2) Enumerate all possible Candidate Networks (CNs).

13/04/2015 21 Query Processing …
3 steps: 1) Generate candidate tuples in every relation in the schema (using full-text indexes). 2) Enumerate all possible Candidate Networks (CNs). 3) Execute the CNs. Most algorithms differ here; the key is how to optimize for top-k retrieval.

13/04/2015 22 Monotonic Scoring Function
Execute a CN. CN: P^Q ⋈ C^Q. [Figure: tuple sets C = {C1, C2} and P = {P1, P2}]
DISCOVER2. Assume idf_netvista > idf_maxtor and k = 1.
            score(c_i)   score(p_j)   score
c1 ⋈ p1     1.06         0.97         2.03
c2 ⋈ p2     1.06         1.06         2.12
[Figure: ordering of c1 ⋈ p1 versus c2 ⋈ p2]

13/04/2015 23 Non-Monotonic Scoring Function
Execute a CN. CN: P^Q ⋈ C^Q. [Figure: tuple sets C = {C1, C2} and P = {P1, P2}]
SPARK. Assume idf_netvista > idf_maxtor and k = 1.
            score(c_i)   score(p_j)   score_a
c1 ⋈ p1     1.06         0.97         0.98
c2 ⋈ p2     1.06         1.06         0.44
[Figure: ordering of c1 ⋈ p1 versus c2 ⋈ p2, marked “?” on the slide]
1) Re-establish the early stopping criterion. 2) Check candidates in an optimal order.

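13/04/2015 24 Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function.
Details:
$\mathit{sumidf} = \sum_w \mathit{idf}_w$
$\mathit{watf}(t) = \frac{1}{\mathit{sumidf}} \sum_w \mathit{tf}_w(t) \cdot \mathit{idf}_w$
$A = \mathit{sumidf} \cdot \left(1 + \ln\left(1 + \ln\left(\sum_t \mathit{watf}(t)\right)\right)\right)$
$B = \mathit{sumidf} \cdot \sum_t \mathit{watf}(t)$
Then $\mathrm{score}_a \le \mathrm{uscore}_a = \frac{1}{1 - s} \cdot \min(A, B)$.
score_b and score_c are constants given the CN, so score ≤ uscore, and uscore is monotonic with respect to watf(t).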

13/04/2015 25 Early Stopping Criterion
Execute a CN. CN: P^Q ⋈ C^Q. [Figure: tuple sets C = {C1, C2} and P = {P1, P2}]
SPARK. Assume idf_netvista > idf_maxtor and k = 1.
            uscore   score_a
c1 ⋈ p1     1.13     0.98
c2 ⋈ p2     1.76     0.44
1) Re-establish the early stopping criterion. 2) Check candidates in an optimal order.
Once the real score of the current top-k result is at least the uscore of every unchecked candidate: stop!

13/04/2015 26 Query Processing …
Execute the CNs. CN: P^Q ⋈ C^Q. [Figure: sorted lists C1, C2, C3 and P1, P2, P3] [VLDB 03]
{P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores. Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj).
Operations (for each join, a parametric SQL query is sent to the DBMS):
[P1, P1] ⋈ [C1, C1]
C.get_next(): [P1, P1] ⋈ C2
P.get_next(): P2 ⋈ [C1, C2]
P.get_next(): P3 ⋈ [C1, C2]
…

13/04/2015 27 Skyline Sweeping Algorithm
Execute the CNs. CN: P^Q ⋈ C^Q. [Figure: sorted lists C1, C2, C3 and P1, P2, P3, with candidates P1 ⋈ C1, P2 ⋈ C1, P3 ⋈ C1, … swept along the skyline]
Dominance: uscore(P_i ⋈ C_j) > uscore(P_{i+1} ⋈ C_j) and uscore(P_i ⋈ C_j) > uscore(P_i ⋈ C_{j+1}).
Candidates are kept in a priority queue ordered by uscore.
1) Re-establish the early stopping criterion. 2) Check candidates in an optimal order (sort of).
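The sweep over two sorted lists can be sketched with a priority queue. This is a simplified illustration under stated assumptions, not the SPARK implementation: `uscore` must not increase along either sorted list and must bound the real score from above; `real_score` is a hypothetical callback standing in for the parametric SQL query, returning None when the two tuples do not join; the toy data and scores are invented.

```python
import heapq

def skyline_sweep(p_list, c_list, uscore, real_score, k):
    """Return (top-k results, number of candidates checked)."""
    if not p_list or not c_list:
        return [], 0
    # Max-heap on uscore (negated for heapq); start from the best corner.
    heap = [(-uscore(p_list[0], c_list[0]), 0, 0)]
    seen = {(0, 0)}
    top_k = []      # min-heap of (real score, p id, c id)
    checked = 0
    while heap:
        best_unchecked = -heap[0][0]
        # Early stop: no unchecked candidate can beat the current k-th result.
        if len(top_k) == k and top_k[0][0] >= best_unchecked:
            break
        _, i, j = heapq.heappop(heap)
        checked += 1
        score = real_score(p_list[i], c_list[j])
        if score is not None:
            entry = (score, p_list[i][0], c_list[j][0])
            if len(top_k) < k:
                heapq.heappush(top_k, entry)
            elif entry > top_k[0]:
                heapq.heapreplace(top_k, entry)
        # Only the two candidates directly dominated by (i, j) enter the queue.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(p_list) and nj < len(c_list) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(p_list[ni], c_list[nj]), ni, nj))
    return sorted(top_k, reverse=True), checked

# Toy data: (id, weight), each list sorted by descending weight.
P = [("p1", 3.0), ("p2", 2.0), ("p3", 1.0)]
C = [("c1", 3.0), ("c2", 1.5), ("c3", 1.0)]
JOINS = {("p2", "c1"), ("p1", "c2"), ("p3", "c3")}

def toy_uscore(p, c):
    return p[1] + c[1]

def toy_real_score(p, c):
    if (p[0], c[0]) not in JOINS:
        return None                    # the tuples do not join
    return toy_uscore(p, c) - 0.5      # real score never exceeds uscore

results, checked = skyline_sweep(P, C, toy_uscore, toy_real_score, k=1)
print(results, checked)
```

On this toy input the sweep checks two of the nine candidate pairs before the stopping test fires, because the best real score found already matches the highest uscore left in the queue.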

13/04/2015 28 Block Pipeline Algorithm
Inherent deficiency in bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example): lots of candidates with high uscores turn out to have much lower (real) scores, which causes unnecessary (expensive) checking and prevents stopping earlier.
Idea: partition the space (into blocks) and derive a tighter upper bound for each partition; be “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore).

13/04/2015 29 Block Pipeline Algorithm …
Execute a CN. CN: P^Q ⋈ C^Q.
Block Pipeline. Assume idf_n > idf_m and k = 1.
[Table: columns Block, uscore, bscore, score_a; surviving values 2.74, 1.05, 2.63, 2.50, 0.95]
[Figure: blocks labelled by signature (n:1, m:0) and (n:0, m:1); queue trace with values 2.74, 2.63, 1.05, 2.63, 1.05, 2.41, 2.38, ending in “stop!”]
1) Re-establish the early stopping criterion. 2) Check candidates in an optimal order.

13/04/2015 30 Efficiency
DBLP: ~0.9M tuples in total; k = 10; PC: 1.8 GHz CPU, 512 MB RAM

13/04/2015 31 Efficiency … DBLP, DQ13

13/04/2015 32 Conclusions
A system that can perform effective and efficient keyword search on relational databases: meaningful query results with appropriate rankings; second-level response time for a ~10M-tuple DB (IMDb data) on a commodity PC.

13/04/2015 33 Q&A Thank you.
