SPARK: Top-k Keyword Query in Relational Database. Wei Wang, University of New South Wales, Australia (13/04/2015).

Presentation transcript:

SPARK: Top-k Keyword Query in Relational Database. Wei Wang, University of New South Wales, Australia.

Outline: Demo & Introduction; Ranking; Query Evaluation; Conclusions.

Demo

Demo …

SPARK I: Searching, Probing & Ranking Top-k Results. Thesis project (2004 – 2005) with Nino Svonja. Taste of Research Summer Scholarship (2005). Finally, CISRA prize winner. ering.php

SPARK II: Continued as a research project with PhD student Yi Luo (2005 – 2006). SIGMOD 2007 paper. Still under active development.

A Motivating Example

A Motivating Example … Top-3 results in our system:
1) Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1)
2) Movies: “Primetime Glick” (2001) Tom Hanks/Ben Stiller (#2.1) → ActorPlay: Character = Himself → Actors: Hanks, Tom
3) Actors: John Hanks → ActorPlay: Character = Alexander Kerst → Movies: Rosamunde Pilcher - Wind über dem Fluss (2001)

Improving the Effectiveness: Three factors are considered to contribute to the final score of a search result (joined tuple tree): the (modified) IR ranking score, the completeness factor, and the size normalization factor.

Preliminaries. Data model: relation-based. Query model: joined tuple trees (JTTs). Sophisticated ranking: addresses one flaw in previous approaches, unifies AND and OR semantics, and uses an alternative size normalization.

Problems with DISCOVER2
Table columns: score(c_i), score(p_j), score, signature, SPARK [DISCOVER2 score values not recoverable].
c1 ⋈ p1: signature (1, 1), SPARK 0.98
c2 ⋈ p2: signature (0, 2), SPARK 0.44

Virtual Document: Combine tf contributions before tf normalization / attenuation.
Table columns: c_i ⋈ p_j, score(maxtor), score(netvista), score_a*; rows c1 ⋈ p1 and c2 ⋈ p2 [values not recoverable].
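The difference between scoring each tuple separately and scoring the joined tuples as one virtual document can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attenuation 1 + ln(1 + ln(tf)) is borrowed from the shape of the term A on the upper-bound slide, length normalization is left out, the idf values ln(4/3) and ln(4/2) come from the Virtual Document Collection slide, and the function names and the choice of which tuple holds which keyword are assumptions. Under these assumptions the virtual-document scores round to the 0.98 and 0.44 shown for SPARK.

```python
import math

def attenuate(tf):
    """Dampen a raw term frequency; a term that does not occur contributes 0."""
    return 1.0 + math.log(1.0 + math.log(tf)) if tf > 0 else 0.0

def per_tuple_score(tuples, idf):
    """Attenuate tf inside each tuple first, then add the tuple scores."""
    return sum(attenuate(tf) * idf[w] for t in tuples for w, tf in t.items())

def virtual_document_score(tuples, idf):
    """Add up tf across all joined tuples first, then attenuate once per keyword."""
    combined = {}
    for t in tuples:
        for w, tf in t.items():
            combined[w] = combined.get(w, 0) + tf
    return sum(attenuate(tf) * idf[w] for w, tf in combined.items())

# idf values taken from the slides; the tuple contents are hypothetical.
idf = {"netvista": math.log(4 / 3), "maxtor": math.log(4 / 2)}
both_keywords = [{"maxtor": 1}, {"netvista": 1}]   # signature (1, 1)
one_keyword = [{"netvista": 1}, {"netvista": 1}]   # signature (0, 2)

print(round(virtual_document_score(both_keywords, idf), 2))
print(round(virtual_document_score(one_keyword, idf), 2))
print(round(per_tuple_score(one_keyword, idf), 2))
```

When the same keyword is spread over two tuples, the per-tuple variant counts it at full weight twice, while the virtual document attenuates the second occurrence.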

Virtual Document Collection
Collection: 3 results. $\mathit{idf}_{netvista} = \ln(4/3)$, $\mathit{idf}_{maxtor} = \ln(4/2)$.
Estimate idf: [estimation formulas for idf_netvista and idf_maxtor not recoverable].
Estimate avdl: avdl = avdl_C + avdl_P.
score_a: c1 ⋈ p1 = 0.98; c2 ⋈ p2 = 0.44.

Completeness Factor
For “short queries”, users prefer results matching more keywords. Derive the completeness factor based on the extended Boolean model: measure the L_p distance to the ideal position.
[Figure: axes netvista and maxtor; ideal position (1, 1); points (c1 ⋈ p1) at d = 0.5 and (c2 ⋈ p2) at d = 1; the farthest possible distance is d = 1.41.]
score_b using L_2 distance:
c1 ⋈ p1: (1.41 - 0.5) / 1.41 = 0.65
c2 ⋈ p2: (1.41 - 1) / 1.41 = 0.29
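The arithmetic on this slide can be reproduced with a short sketch. It assumes that each result is represented by one coordinate per query keyword in [0, 1], that the ideal position is all ones, and that score_b is the remaining fraction of the largest possible L_p distance; the function name and the example coordinates are hypothetical, chosen only so that the distances are 0.5 and 1 as on the slide.

```python
def completeness(coords, p=2.0):
    """Fraction of the maximum L_p distance that separates a result from the worst case.

    coords holds one value in [0, 1] per query keyword; (1, ..., 1) is the ideal position.
    """
    dist = sum(abs(1.0 - x) ** p for x in coords) ** (1.0 / p)
    max_dist = len(coords) ** (1.0 / p)  # distance from the origin to the ideal position
    return (max_dist - dist) / max_dist

# Hypothetical coordinates giving d = 0.5 and d = 1 for a two-keyword query.
print(round(completeness([1.0, 0.5]), 2))
print(round(completeness([1.0, 0.0]), 2))
```

A larger p penalizes a completely missing keyword more heavily relative to several partial matches, which is how one parameter can move the ranking between OR-like and AND-like behaviour.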

Size Normalization
Results in large CNs tend to have more matches to the keywords.
$\mathit{score}_c = (1 + s_1 - s_1 \cdot |CN|) \cdot (1 + s_2 - s_2 \cdot |CN^{nf}|)$
Empirically, $s_1 = 0.15$ and $s_2 = 1 / (|Q| + 1)$ work well.
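A minimal sketch of the size normalization formula, using the empirical parameter choices from the slide. It assumes that |CN| is the number of tuple sets in the candidate network and that the second size term (|CN nf| on the slide) counts its non-free tuple sets; the function and parameter names are hypothetical.

```python
def size_normalization(cn_size, cn_nf_size, query_len, s1=0.15):
    """score_c for a candidate network; s2 is derived from the query length |Q|."""
    s2 = 1.0 / (query_len + 1)
    return (1 + s1 - s1 * cn_size) * (1 + s2 - s2 * cn_nf_size)

# Two-keyword query: a single-relation CN is not penalized, larger CNs are.
for cn_size, cn_nf_size in [(1, 1), (2, 2), (3, 2)]:
    print(cn_size, cn_nf_size, size_normalization(cn_size, cn_nf_size, query_len=2))
```

Both factors equal 1 when the corresponding size is 1 and shrink linearly as the candidate network grows.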

Putting ’em Together
$\mathit{score}(JTT) = \mathit{score}_a \cdot \mathit{score}_b \cdot \mathit{score}_c$
a: IR score of the virtual document; b: completeness factor; c: size normalization factor.
score_a * score_b:
c1 ⋈ p1: 0.98 * 0.65 = 0.64
c2 ⋈ p2: 0.44 * 0.29 = 0.13

Comparing Top-1 Results: DBLP; Query = “nikos clique”

#Rel and R-Rank Results
DBLP: 18 queries; union of top-20 results. Mondial: 35 queries; union of top-20 results.
[Two tables, one per data set. Columns: DISCOVER2 [Liu et al, SIGMOD06], p = 1.0, p = 1.4, p = 2.0. Rows: #Rel, R-Rank. Values not recoverable.]

Query Processing: 3 Steps
1) Generate candidate tuples in every relation in the schema (using full-text indexes)

Query Processing … 3 Steps
1) Generate candidate tuples in every relation in the schema (using full-text indexes)
2) Enumerate all possible Candidate Networks (CN)

Query Processing … 3 Steps
1) Generate candidate tuples in every relation in the schema (using full-text indexes)
2) Enumerate all possible Candidate Networks (CN)
3) Execute the CNs. Most algorithms differ here. The key is how to optimize for top-k retrieval.

Monotonic Scoring Function (DISCOVER2)
Execute a CN. CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.
[Figure: tuple lists C1, C2 and P1, P2; table with columns score(c_i), score(p_j), score for c1 ⋈ p1 and c2 ⋈ p2; the values and the resulting order of the two results are not recoverable.]

Non-Monotonic Scoring Function (SPARK)
Execute a CN. CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.
[Figure: tuple lists C1, C2 and P1, P2; table with columns score(c_i), score(p_j), score_a for c1 ⋈ p1 and c2 ⋈ p2; the values are not recoverable and the order of the two results is marked with question marks.]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order

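Upper Bounding Function
Idea: use a monotonic and tight upper bounding function for SPARK’s non-monotonic scoring function.
Details:
$\mathit{sumidf} = \sum_w \mathit{idf}_w$
$\mathit{watf}(t) = \frac{1}{\mathit{sumidf}} \sum_w \left( \mathit{tf}_w(t) \cdot \mathit{idf}_w \right)$
$A = \mathit{sumidf} \cdot \left( 1 + \ln\left( 1 + \ln\left( \sum_t \mathit{watf}(t) \right) \right) \right)$
$B = \mathit{sumidf} \cdot \sum_t \mathit{watf}(t)$
Then $\mathit{score}_a \le \mathit{uscore}_a = \frac{1}{1 - s} \cdot \min(A, B)$.
$\mathit{score}_b$ and $\mathit{score}_c$ are constants given the CN, so $\mathit{score} \le \mathit{uscore}$, and $\mathit{uscore}$ is monotonic with respect to $\mathit{watf}(t)$.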
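The bound can be computed directly from per-tuple term frequencies, as in the sketch below. Several points are assumptions rather than something the slide states: the default s = 0.2, the function name, and the handling of the case where the summed watf is below 1. In that case the nested logarithm in A becomes negative or undefined, so the sketch falls back to B, which remains a valid bound because the attenuated tf never exceeds the raw tf.

```python
import math

def uscore_a(tuples, idf, s=0.2):
    """Upper bound on score_a for a set of joined tuples.

    tuples: one {keyword: tf} dict per tuple; idf: {keyword: idf} for all query keywords.
    """
    sumidf = sum(idf.values())
    # watf(t) is the idf-weighted average term frequency of tuple t.
    total_watf = sum(
        sum(tf * idf[w] for w, tf in t.items()) / sumidf for t in tuples
    )
    b = sumidf * total_watf
    if total_watf >= 1.0:
        a = sumidf * (1.0 + math.log(1.0 + math.log(total_watf)))
        bound = min(a, b)
    else:
        bound = b  # assumption: A is skipped where the nested logarithm breaks down
    return bound / (1.0 - s)

idf = {"netvista": math.log(4 / 3), "maxtor": math.log(4 / 2)}
print(uscore_a([{"maxtor": 1}, {"netvista": 1}], idf))
print(uscore_a([{"maxtor": 2}, {"netvista": 1}], idf))
```

Because the bound depends on the tuples only through their watf values, sorting each tuple set by watf is enough to visit candidate combinations in bound order.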

Early Stopping Criterion (SPARK)
Execute a CN. CN: P^Q ⋈ C^Q. Assume idf_netvista > idf_maxtor and k = 1.
[Figure: tuple lists C1, C2 and P1, P2; table with columns uscore and score_a for c1 ⋈ p1 and c2 ⋈ p2; values not recoverable.]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
Stop when score( ) ≥ uscore( ) [arguments not recoverable].
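The stopping rule can be illustrated with a generic top-k loop: candidates are visited in decreasing order of their upper bound, the expensive real score is computed only when needed, and the loop stops once the k-th best real score is at least the bound of the next candidate. This is a sketch under the assumption that uscore never underestimates score; the function name and the toy numbers are hypothetical.

```python
import heapq

def top_k_with_upper_bound(candidates, k, uscore, score):
    """Return (top-k as [(score, candidate)], number of candidates actually scored)."""
    ordered = sorted(candidates, key=uscore, reverse=True)
    top = []      # min-heap holding the best k real scores seen so far
    checked = 0
    for cand in ordered:
        # No remaining candidate can beat the current k-th best result.
        if len(top) == k and top[0][0] >= uscore(cand):
            break
        checked += 1
        real = score(cand)  # stands in for the expensive check against the database
        if len(top) < k:
            heapq.heappush(top, (real, cand))
        elif real > top[0][0]:
            heapq.heapreplace(top, (real, cand))
    return sorted(top, reverse=True), checked

# Hypothetical candidates: name -> (upper bound, real score).
toy = {"a": (0.9, 0.5), "b": (0.8, 0.7), "c": (0.6, 0.6), "d": (0.3, 0.2)}
result, checked = top_k_with_upper_bound(
    toy, k=1, uscore=lambda c: toy[c][0], score=lambda c: toy[c][1]
)
print(result, checked)
```

In the toy data, candidates c and d are never scored, because the real score of b already reaches the bound of c.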

Query Processing … Execute the CNs [VLDB 03]
CN: P^Q ⋈ C^Q. {P1, P2, …} and {C1, C2, …} have been sorted based on their IR relevance scores. Score(Pi ⋈ Cj) = Score(Pi) + Score(Cj).
Operations (for each one, a parametric SQL query is sent to the DBMS):
[P1, P1] ⋈ [C1, C1]
C.get_next(): [P1, P1] ⋈ C2
P.get_next(): P2 ⋈ [C1, C2]
P.get_next(): P3 ⋈ [C1, C2]
…

Skyline Sweeping Algorithm
Execute the CNs. CN: P^Q ⋈ C^Q.
[Figure: sorted lists P1, P2, P3 and C1, C2, C3; skyline sweep over the candidates P1 ⋈ C1, P2 ⋈ C1, P3 ⋈ C1, …; priority queue contents not recoverable.]
Dominance: uscore( ) > uscore( ) and uscore( ) > uscore( ) [arguments not recoverable].
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order → sort of
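One way to picture the sweep is as a best-first walk over the grid of (P_i, C_j) pairs. Starting from the top pair, the best entry is popped from a priority queue and only the two pairs it directly dominates are pushed. The sketch below assumes both lists are sorted by decreasing watf and that uscore is monotonic in each argument. The seen set for avoiding duplicate pushes, the generator interface, and the toy bound (a plain sum) are choices made here for illustration, not details from the slide.

```python
import heapq

def skyline_sweep(p_watf, c_watf, uscore):
    """Yield (bound, i, j) for all index pairs in non-increasing order of uscore.

    p_watf and c_watf must be sorted in decreasing order.
    """
    if not p_watf or not c_watf:
        return
    heap = [(-uscore(p_watf[0], c_watf[0]), 0, 0)]  # max-heap via negated bounds
    seen = {(0, 0)}
    while heap:
        neg_bound, i, j = heapq.heappop(heap)
        yield -neg_bound, i, j
        # Only the two direct neighbours need to enter the queue; every other
        # pair is dominated by a pair that is already queued or still to be expanded.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(p_watf) and nj < len(c_watf) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-uscore(p_watf[ni], c_watf[nj]), ni, nj))

# Hypothetical watf values and a toy monotonic bound.
visited = list(skyline_sweep([3.0, 2.0, 1.0], [2.5, 0.5], lambda p, c: p + c))
for bound, i, j in visited:
    print(f"P{i + 1} ⋈ C{j + 1}: uscore = {bound}")
```

The queue stays small because a pair is inserted only after a pair that dominates it has been visited, so the full cross product is never materialized up front.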

Block Pipeline Algorithm
Inherent deficiency: a non-monotonic function is bounded with (a few) monotonic upper bounding functions (→ draw an example). Lots of candidates with high uscores return a much lower (real) score, which causes unnecessary (expensive) checking and means we cannot stop earlier.
Idea: partition the space (into blocks) and derive a tighter upper bound for each partition. Be “unwilling” to check a candidate until we are quite sure about its “prospect” (bscore).
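The “unwilling to check” idea can be sketched as a two-stage lazy refinement: an entry first enters the queue with its loose uscore, is re-queued with the tighter bscore when it reaches the front, and is only checked for real when it reaches the front a second time. This is a simplified illustration, not the algorithm from the paper. Each block is reduced to a single candidate here, the function name and toy numbers are hypothetical, and it is assumed that score ≤ bscore ≤ uscore holds for every block.

```python
import heapq

def lazy_two_stage_top_k(blocks, k, uscore, bscore, score):
    """Return (top-k as [(score, block)], number of expensive checks performed)."""
    # Heap entries are (-bound, stage, block); stage 0 = loose bound, 1 = tight bound.
    heap = [(-uscore(b), 0, b) for b in blocks]
    heapq.heapify(heap)
    top = []      # min-heap of the best k real scores
    checked = 0
    while heap:
        neg_bound, stage, block = heapq.heappop(heap)
        if len(top) == k and top[0][0] >= -neg_bound:
            break  # nothing left in the queue can enter the top-k
        if stage == 0:
            # Cheap step: tighten the bound and put the block back in the queue.
            heapq.heappush(heap, (-bscore(block), 1, block))
            continue
        checked += 1
        real = score(block)  # expensive step, e.g. running SQL for this block
        if len(top) < k:
            heapq.heappush(top, (real, block))
        elif real > top[0][0]:
            heapq.heapreplace(top, (real, block))
    return sorted(top, reverse=True), checked

# Hypothetical blocks: name -> (uscore, bscore, real score).
toy_blocks = {"x": (0.9, 0.4, 0.35), "y": (0.8, 0.75, 0.7), "z": (0.5, 0.45, 0.4)}
best, n_checks = lazy_two_stage_top_k(
    toy_blocks, k=1,
    uscore=lambda b: toy_blocks[b][0],
    bscore=lambda b: toy_blocks[b][1],
    score=lambda b: toy_blocks[b][2],
)
print(best, n_checks)
```

In the toy data, block x has the highest uscore but a low bscore, so it is never checked; stopping on uscore alone would have checked both x and y.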

Block Pipeline Algorithm …
Execute a CN. CN: P^Q ⋈ C^Q. Assume idf_n > idf_m and k = 1.
[Table with columns Block, uscore, bscore, score_a; values not recoverable. Blocks are labelled by keyword signature, e.g. (n:1, m:0) and (n:0, m:1).]
1) Re-establish the early stopping criterion
2) Check candidates in an optimal order
→ stop!

Efficiency: DBLP, ~0.9M tuples in total; k = 10; PC: 1.8 GHz, 512 MB.

Efficiency … DBLP, DQ13

Conclusions: A system that can perform effective & efficient keyword search on relational databases. Meaningful query results with appropriate rankings. Second-level response time for a ~10M tuple DB (imdb data) on a commodity PC.

Q&A. Thank you.