13/04/2015
SPARK: Top-k Keyword Query in Relational Database
Wei Wang
University of New South Wales, Australia
Outline
- Demo & Introduction
- Ranking
- Query Evaluation
- Conclusions
Demo
Demo …
SPARK I
- Searching, Probing & Ranking Top-k Results
- Thesis project (2004 – 2005) with Nino Svonja
- Taste of Research Summer Scholarship (2005)
- Finally, CISRA prize winner
- ering.php
SPARK II
- Continued as a research project with PhD student Yi Luo
- 2005 – 2006
- SIGMOD 2007 paper
- Still under active development
A Motivating Example
A Motivating Example …
Top-3 results in our system:
1. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2. Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
   ActorPlay: Character = Himself
   Actors: Hanks, Tom
3. Actors: John Hanks
   ActorPlay: Character = Alexander Kerst
   Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)
Improving the Effectiveness
- Three factors contribute to the final score of a search result (joined tuple tree):
  - the (modified) IR ranking score
  - the completeness factor
  - the size normalization factor
Preliminaries
- Data model: relation-based
- Query model: joined tuple trees (JTTs)
- Sophisticated ranking:
  - addresses one flaw in previous approaches
  - unifies AND and OR semantics
  - alternative size normalization
Problems with DISCOVER2
[Table; the DISCOVER2 values in the columns score(c_i), score(p_j) and score were not preserved]
  result     signature   SPARK
  c1 ⋈ p1    (1, 1)      0.98
  c2 ⋈ p2    (0, 2)      0.44
Virtual Document
- Combine tf contributions before tf normalization / attenuation.
- [Table: rows c1 ⋈ p1 and c2 ⋈ p2; columns c_i, p_j, score(maxtor), score(netvista), score_a *; values not preserved]
Virtual Document Collection
- Collection: 3 results
  $idf_{netvista} = \ln(4/3)$
  $idf_{maxtor} = \ln(4/2)$
- Estimate idf: $idf_{netvista}$ = [formula not preserved], $idf_{maxtor}$ = [formula not preserved]
- Estimate $avdl = avdl_C + avdl_P$
- $score_a$:
  c1 ⋈ p1: 0.98
  c2 ⋈ p2: 0.44
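
The two idf values on this slide can be reproduced with a short sketch. It assumes the common form idf = ln((N + 1) / df), with N = 3 virtual documents, netvista occurring in all three and maxtor in two; the formula and the document frequencies are inferred from the slide's numbers, not stated on it, and the function name `idf` is hypothetical.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Inverse document frequency, assuming the form ln((N + 1) / df)."""
    if doc_freq <= 0:
        raise ValueError("doc_freq must be positive")
    return math.log((num_docs + 1) / doc_freq)


# Assumed counts: 3 virtual documents; netvista in 3 of them, maxtor in 2.
idf_netvista = idf(3, 3)  # ln(4/3)
idf_maxtor = idf(3, 2)    # ln(4/2)
```

Under these assumptions the rarer keyword (maxtor) gets the larger weight.
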
Completeness Factor
- For “short queries”, users prefer results matching more keywords
- Derive the completeness factor based on the extended Boolean model
- Measure the $L_p$ distance to the ideal position
- [Figure: axes netvista and maxtor; ideal position at (1, 1); (c1 ⋈ p1) at d = 0.5; (c2 ⋈ p2) at d = 1; d = 1.41 from the origin to the ideal position]
- $score_b$ from the $L_2$ distance:
  c1 ⋈ p1: (1.41 - 0.5)/1.41 = 0.65
  c2 ⋈ p2: (1.41 - 1)/1.41 = 0.29
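
A minimal sketch of the distance-based completeness factor follows. It assumes each result is represented by one coordinate in [0, 1] per keyword and that the score is 1 minus the L_p distance to the ideal position, divided by the largest possible distance. The coordinates (0.5, 1) and (0, 1) are assumptions chosen only to give the distances 0.5 and 1 from the slide; the function name is hypothetical.

```python
def completeness_factor(coords, p=2.0):
    """1 - (L_p distance to the ideal point (1, ..., 1)) / (max possible distance)."""
    if not coords:
        raise ValueError("need at least one keyword coordinate")
    if any(x < 0.0 or x > 1.0 for x in coords):
        raise ValueError("coordinates are assumed to lie in [0, 1]")
    m = len(coords)
    distance = sum((1.0 - x) ** p for x in coords) ** (1.0 / p)
    max_distance = m ** (1.0 / p)  # distance from the origin to the ideal point
    return 1.0 - distance / max_distance


score_b_c1p1 = completeness_factor([0.5, 1.0])  # distance 0.5
score_b_c2p2 = completeness_factor([0.0, 1.0])  # distance 1
```

With p = 2 the two values round to the 0.65 and 0.29 shown on the slide. A larger p penalizes a missing keyword more strongly, which is how one formula can cover both OR-like and AND-like behaviour.
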
Size Normalization
- Results in large CNs tend to have more matches to the keywords
- $score_c = (1 + s_1 - s_1 \cdot |CN|) \cdot (1 + s_2 - s_2 \cdot |CN^{nf}|)$
- Empirically, $s_1 = 0.15$ and $s_2 = 1 / (|Q| + 1)$ work well
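
The size normalization formula can be written out directly. In this sketch the parameter names are hypothetical: `cn_size` stands for |CN|, `cn_nf_size` for the second size term in the formula, and `query_len` for |Q|.

```python
def size_normalization(cn_size, cn_nf_size, query_len, s1=0.15):
    """score_c = (1 + s1 - s1*|CN|) * (1 + s2 - s2*|CN_nf|), with s2 = 1/(|Q| + 1)."""
    s2 = 1.0 / (query_len + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nf_size)


# A single-relation CN is not penalized; larger CNs get a smaller factor.
single = size_normalization(cn_size=1, cn_nf_size=1, query_len=2)
larger = size_normalization(cn_size=3, cn_nf_size=2, query_len=2)
```

Both factors equal 1 when the corresponding size is 1 and shrink linearly as the size grows, so the penalty offsets the extra keyword matches that large CNs tend to collect.
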
Putting ’em Together
- $score(JTT) = score_a \cdot score_b \cdot score_c$
  - a: IR score of the virtual document
  - b: completeness factor
  - c: size normalization factor
- $score_a \cdot score_b$:
  c1 ⋈ p1: 0.98 * 0.65 = 0.64
  c2 ⋈ p2: 0.44 * 0.29 = 0.13
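
As a quick check of the arithmetic, the product can be computed for the two example results. The sketch sets score_c to 1.0, an assumption made only because the slide's table lists score_a * score_b; the function and variable names are hypothetical.

```python
def jtt_score(score_a, score_b, score_c=1.0):
    """Final score of a joined tuple tree: the product of the three factors."""
    return score_a * score_b * score_c


# (score_a, score_b) pairs taken from the slides.
candidates = {"c1-p1": (0.98, 0.65), "c2-p2": (0.44, 0.29)}
ranked = sorted(candidates, key=lambda name: jtt_score(*candidates[name]), reverse=True)
```

The result that matches both keywords (c1-p1) ranks first.
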
Comparing Top-1 Results
- DBLP; Query = “nikos clique”
#Rel and R-Rank Results
- DBLP; 18 queries; union of top-20 results
- Mondial; 35 queries; union of top-20 results
- [Two tables, one per dataset; rows #Rel and R-Rank; columns DISCOVER2 [Liu et al, SIGMOD06], p = 1.0, p = 1.4, p = 2.0; values not preserved]
Query Processing
3 steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
Query Processing …
3 steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
Query Processing …
3 steps:
1. Generate candidate tuples in every relation in the schema (using full-text indexes)
2. Enumerate all possible Candidate Networks (CNs)
3. Execute the CNs
   - Most algorithms differ here.
   - The key is how to optimize for top-k retrieval
Monotonic Scoring Function
- Execute a CN; CN: P^Q ⋈ C^Q
- DISCOVER2; assume $idf_{netvista} > idf_{maxtor}$ and k = 1
- [Figure: tuple sets C (C1, C2) and P (P1, P2); table with columns score(c_i), score(p_j), score for c1 ⋈ p1 and c2 ⋈ p2; values and "<" comparisons not preserved]
Non-Monotonic Scoring Function
- Execute a CN; CN: P^Q ⋈ C^Q
- SPARK; assume $idf_{netvista} > idf_{maxtor}$ and k = 1
- [Figure: tuple sets C (C1, C2) and P (P1, P2); table with columns score(c_i), score(p_j), score_a for c1 ⋈ p1 and c2 ⋈ p2; values not preserved; the "<" comparisons are marked "?"]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Upper Bounding Function
- Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function
- Details:
  $sumidf = \sum_{w} idf_w$
  $watf(t) = \frac{1}{sumidf} \sum_{w} \left( tf_w(t) \cdot idf_w \right)$
  $A = sumidf \cdot \left( 1 + \ln\left( 1 + \ln\left( \sum_{t} watf(t) \right) \right) \right)$
  $B = sumidf \cdot \sum_{t} watf(t)$
  then $score_a \le uscore_a = \frac{1}{1-s} \cdot \min(A, B)$
- $score_b$ and $score_c$ are constants given the CN, so $score \le uscore$
- $uscore$ is monotonic w.r.t. $watf(t)$
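
The bound above can be sketched in a few lines. Several things here are assumptions of the sketch rather than details from the slide: term frequencies and idf values are passed as plain dictionaries, s is treated as a parameter with 0 <= s < 1, and when A is undefined (its logarithm would receive a non-positive argument) the sketch falls back to B alone. The function names mirror the slide's symbols.

```python
import math


def watf(tf, idf):
    """Weighted average tf of one tuple: (1/sumidf) * sum_w tf_w * idf_w."""
    sumidf = sum(idf.values())
    return sum(tf.get(w, 0) * idf_w for w, idf_w in idf.items()) / sumidf


def uscore_a(tuple_tfs, idf, s):
    """Upper bound (1/(1-s)) * min(A, B) for a joined tuple tree."""
    if not 0.0 <= s < 1.0:
        raise ValueError("s is assumed to lie in [0, 1)")
    sumidf = sum(idf.values())
    total = sum(watf(tf, idf) for tf in tuple_tfs)
    bound = sumidf * total  # B
    if total > 0:
        inner = 1 + math.log(total)
        if inner > 0:  # A is only defined when both logarithms are
            a = sumidf * (1 + math.log(inner))
            bound = min(a, bound)
    return bound / (1 - s)


# Hypothetical idf weights and per-tuple term frequencies.
idf_weights = {"netvista": 2.0, "maxtor": 1.0}
tree = [{"netvista": 1}, {"maxtor": 1}]
bound = uscore_a(tree, idf_weights, s=0.2)
```

Because the bound depends on each tuple only through watf(t), tuples can be sorted by watf once per relation and the bound of any combination never increases as the sweep moves down the sorted lists.
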
Early Stopping Criterion
- Execute a CN; CN: P^Q ⋈ C^Q
- SPARK; assume $idf_{netvista} > idf_{maxtor}$ and k = 1
- [Figure: tuple sets C (C1, C2) and P (P1, P2); table with columns uscore, score_a for c1 ⋈ p1 and c2 ⋈ p2; values not preserved]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
- score(current top-k result) ≥ uscore(next candidate): stop!
Query Processing …
- Execute the CNs; CN: P^Q ⋈ C^Q
- [Figure: sorted tuple lists C1, C2, C3 and P1, P2, P3]
- Operations [VLDB 03] (a parametric SQL query is sent to the DBMS):
  [P1, P1] ⋈ [C1, C1]
  C.get_next(): [P1, P1] ⋈ C2
  P.get_next(): P2 ⋈ [C1, C2]
  P.get_next(): P3 ⋈ [C1, C2]
  …
- {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores.
- Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj)
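
A minimal sketch of this style of pipelined execution for a two-relation CN with the additive score is given below. It is an illustration under stated assumptions, not the algorithm of the cited work: the callback `joins` is a hypothetical stand-in for the parametric SQL query, and the rule for choosing which list to advance (the one whose next tuple has the larger best-case score) is one simple option.

```python
import heapq


def pipelined_top_k(P, C, joins, k):
    """P and C are lists of (tuple_id, score), sorted by score descending.

    joins(p_id, c_id) -> bool stands in for the parametric SQL query.
    Returns up to k triples (score, p_id, c_id), best first.
    """
    if not P or not C or k <= 0:
        return []
    seen_p = seen_c = 0  # lengths of the prefixes retrieved so far
    top = []             # min-heap holding the best k results found so far
    while seen_p < len(P) or seen_c < len(C):
        # Best score any result using the next unseen tuple could reach.
        bound_p = P[seen_p][1] + C[0][1] if seen_p < len(P) else float("-inf")
        bound_c = P[0][1] + C[seen_c][1] if seen_c < len(C) else float("-inf")
        if len(top) == k and top[0][0] >= max(bound_p, bound_c):
            break  # early stop: no unchecked pair can beat the k-th result
        if bound_p >= bound_c:
            p_id, p_score = P[seen_p]
            seen_p += 1
            pairs = [(p_id, p_score, c_id, c_score) for c_id, c_score in C[:seen_c]]
        else:
            c_id, c_score = C[seen_c]
            seen_c += 1
            pairs = [(p_id, p_score, c_id, c_score) for p_id, p_score in P[:seen_p]]
        for p_id, p_score, c_id, c_score in pairs:
            if joins(p_id, c_id):
                heapq.heappush(top, (p_score + c_score, p_id, c_id))
                if len(top) > k:
                    heapq.heappop(top)
    return sorted(top, reverse=True)
```

Each newly fetched tuple is joined only with the prefix already retrieved from the other list, so every pair is probed at most once. The early stop is valid only because the score is monotonic in the two component scores.
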
Skyline Sweeping Algorithm
- Execute the CNs; CN: P^Q ⋈ C^Q
- [Figure: sorted tuple lists C1, C2, C3 and P1, P2, P3; candidates P1 ⋈ C1, P2 ⋈ C1, P3 ⋈ C1, …; skyline sweep order not preserved]
- Dominance: uscore( ) > uscore( ) and uscore( ) > uscore( )
- Priority queue: [contents not preserved]
- Operations: [contents not preserved]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order (sort of)
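
The idea of sweeping candidates in uscore order with a priority queue can be sketched as follows for a two-relation CN. This is a simplified illustration, not the authors' implementation: candidates are identified by index pairs (i, j) into the two sorted lists, and the callbacks `uscore` and `score` are hypothetical. The sketch assumes uscore(i, j) never increases when i or j grows, and that score(i, j) returns None when the two tuples do not join.

```python
import heapq


def skyline_sweep(n_p, n_c, uscore, score, k):
    """Return up to k triples (score, i, j), best first."""
    if n_p == 0 or n_c == 0 or k <= 0:
        return []
    queue = [(-uscore(0, 0), 0, 0)]  # max-heap on uscore via negation
    pushed = {(0, 0)}
    top = []                         # min-heap with the best k real scores
    while queue:
        neg_u, i, j = heapq.heappop(queue)
        if len(top) == k and top[0][0] >= -neg_u:
            break  # every remaining candidate is bounded by this uscore
        real = score(i, j)  # the expensive check against the database
        if real is not None:
            heapq.heappush(top, (real, i, j))
            if len(top) > k:
                heapq.heappop(top)
        # Only the two directly dominated neighbours enter the queue.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_p and nj < n_c and (ni, nj) not in pushed:
                pushed.add((ni, nj))
                heapq.heappush(queue, (-uscore(ni, nj), ni, nj))
    return sorted(top, reverse=True)
```

Every candidate not yet in the queue is dominated by one that is, so the head of the queue always carries the largest uscore among unchecked candidates. That restores a stopping test even though the real score is non-monotonic.
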
Block Pipeline Algorithm
- Inherent deficiency of bounding a non-monotonic function with (a few) monotonic upper bounding functions (draw an example)
  - Lots of candidates with high uscores turn out to have a much lower (real) score
  - unnecessary (expensive) checking
  - cannot stop earlier
- Idea
  - Partition the space (into blocks) and derive a tighter upper bound for each partition
  - “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore)
Block Pipeline Algorithm …
- Execute a CN; CN: P^Q ⋈ C^Q
- Block Pipeline; assume $idf_n > idf_m$ and k = 1
- [Table: columns Block, uscore, bscore, score_a; blocks labelled (n:1, m:0) and (n:0, m:1); values not preserved; processing ends with "stop!"]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Efficiency
- DBLP: ~0.9M tuples in total
- k = 10
- PC: 1.8 GHz CPU, 512 MB RAM
Efficiency …
- DBLP, DQ13
Conclusions
- A system that can perform effective and efficient keyword search on relational databases
- Meaningful query results with appropriate rankings
- Response times on the order of seconds for a ~10M-tuple DB (IMDb data) on a commodity PC
Q&A
Thank you.