Discovering Neighborhood Pattern Queries by Sample Answers in Knowledge Base
  Jialong Han¹, Kai Zheng², Aixin Sun¹, Shuo Shang³, and Ji-Rong Wen⁴
  ¹ Nanyang Technological University
  ² The University of Queensland
  ³ China University of Petroleum (Beijing)
  ⁴ Renmin University of China
  31/12/2018 ICDE 2016, Helsinki, Finland
Knowledge Bases and Structural Queries
  Knowledge bases: DBpedia, Freebase, YAGO, NELL, etc.
    Viewed as graphs
    Queried by structural queries
      SPARQL for RDF
      MQL for Freebase
      Cypher for Neo4j
  Which chess player was born and died in the same place?
    SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place }
  Complete Answers: M. Botvinnik, P. Morphy, …
Structural Query Discovery: Motivation
  It is not always easy for a user to write structural queries.
    She needs to follow the syntax;
    She needs to be familiar with the types/relations used in the KB.
  Can we automatically find structural queries based on representative partial answers?
  Which chess player was born and died in the same place?
    SELECT ?uri WHERE { ?uri :type :ChessPlayer . ?uri :birthPlace ?place . ?uri :deathPlace ?place }
  Complete Answers: M. Botvinnik, P. Morphy, …
  Representative Partial Answers: M. Botvinnik (query: ?)
Motivating Example
  We concentrate on Neighborhood Pattern Queries (NPQ).
    One “pivot”.
    Does not involve numeric ops, regular expressions, etc.
  Given example entities 𝐼 from the user, all NPQs can be classified into three kinds.
    Irrelevant: results do not cover 𝐼;
    Not relevant enough: results cover 𝐼 but do not rank them high;
    Relevant: 𝐼 is ranked high in the results.
  Figure (labels only): Popularity Order; entities shown: B. Obama, E. Lasker, V. Putin, M. Botvinnik, P. Morphy, G. Kasparov; Query (a), Rank: +∞; Query (b), Rank: 4; Query (c), Rank: 1.
Problem Statement and Solution Overview
  Reverse Top-k Neighborhood Pattern Queries (RkNPQ)
    Given a knowledge base 𝐷 and a popularity order ≺ on 𝑉(𝐷), for input nodes 𝐼 ⊆ 𝑉(𝐷), find all neighborhood pattern queries 𝑞 s.t.
      (1) 𝐷(𝑞) ⊇ 𝐼, and
      (2) when ranking 𝐷(𝑞) according to ≺, nodes in 𝐼 all appear in the top-k results.
  Solution: filter and refine.
    Filter: generate all NPQs satisfying condition 1;
    Refine: eliminate all generated NPQs violating condition 2.
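The two conditions in the problem statement can be checked directly once a query's answers are known. The following is a minimal sketch, assuming 𝐷(𝑞) has already been materialized as a Python set and the popularity order ≺ is given as a dictionary of rank positions (smaller means more popular). The function name and the toy data are hypothetical and not taken from the paper.

```python
def satisfies_rknpq(answers, rank, examples, k):
    """Check both RkNPQ conditions for one candidate query.

    answers  -- set of nodes returned by the query, i.e. D(q) (assumed precomputed)
    rank     -- dict mapping node -> position in the popularity order (0 = most popular)
    examples -- set of input nodes I
    k        -- the top-k cutoff
    """
    answers, examples = set(answers), set(examples)
    # Condition 1: the answers must cover every example.
    if not examples <= answers:
        return False
    # Condition 2: every example must be among the k most popular answers.
    top_k = sorted(answers, key=lambda v: rank[v])[:k]
    return examples <= set(top_k)


# Toy data (made up for illustration): six nodes ordered by popularity.
rank = {"n0": 0, "n1": 1, "n2": 2, "n3": 3, "n4": 4, "n5": 5}
examples = {"n3"}

print(satisfies_rknpq({"n0", "n1"}, rank, examples, k=1))              # n3 is not covered
print(satisfies_rknpq({"n0", "n1", "n2", "n3"}, rank, examples, k=1))  # covered, but ranked 4th
print(satisfies_rknpq({"n3", "n4"}, rank, examples, k=1))              # n3 is ranked 1st
```

The three calls mirror the three kinds of queries from the motivating example (irrelevant, not relevant enough, relevant) and print False, False, True.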
The Filtering Stage
  Perform level-wise search on the query space.
    Start with the simplest shapes of NPQs (single node or edge).
    Generate complicated ones through Extend and Join on simple ones.
      Completeness guaranteed by [Han, CIKM’13].
    Terminate a branch if condition 1 is violated.
  𝐼 = { M. Botvinnik }
Trivial Refine
  Execute all NPQs generated by the filter stage, and test for condition 2.
    Use SPARQL or graph query engines like Neo4j, gStore, and Jena TDB.
  Drawbacks: unnecessary or redundant computations are not removed.
    We propose three optimizations on this stage.
  𝐼 = { M. Botvinnik }
Refine Optimization 1: Shared Evaluation
  Observation 1: 𝐷(𝑞) of different 𝑞 overlap with each other.
    If 𝑞₁ is a sub-query of 𝑞₂, then 𝐷(𝑞₁) ⊇ 𝐷(𝑞₂).
  Maintain 𝐷(𝑞) by (intersecting and) verifying results of sub-queries.
  𝐼 = { M. Botvinnik }
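Shared evaluation can be illustrated with a small sketch: since the answers of a query are contained in the answers of each of its sub-queries, only nodes in the intersection of the cached sub-query results need to be verified. The code below is a simplified illustration under stated assumptions, not the paper's implementation: the knowledge base is a made-up dictionary, and `matches` is a hypothetical predicate standing in for a real pattern-matching check.

```python
def evaluate_shared(sub_results, matches):
    """Compute D(q) from the cached answers of q's sub-queries.

    sub_results -- non-empty list of sets, each the cached D(q') of a sub-query q' of q
    matches     -- hypothetical predicate: matches(v) is True iff node v answers q
    """
    # D(q) is contained in every D(q'), so only the intersection needs checking.
    candidates = set.intersection(*map(set, sub_results))
    return {v for v in candidates if matches(v)}


# Toy knowledge base (made up): node -> attribute dictionary.
kb = {
    "n1": {"type": "ChessPlayer", "birthPlace": "p1", "deathPlace": "p1"},
    "n2": {"type": "ChessPlayer", "birthPlace": "p1", "deathPlace": "p2"},
    "n3": {"type": "Politician", "birthPlace": "p3", "deathPlace": "p3"},
    "n4": {"type": "Politician", "birthPlace": "p2"},
}

# Cached answers of two sub-queries, computed once and reused.
chess_players = {v for v, f in kb.items() if f.get("type") == "ChessPlayer"}
have_death_place = {v for v, f in kb.items() if "deathPlace" in f}

checked = []  # records which nodes were actually verified


def same_place_chess_player(v):
    """Toy check for: chess player born and died in the same place."""
    checked.append(v)
    f = kb[v]
    return (f.get("type") == "ChessPlayer"
            and "birthPlace" in f
            and f.get("birthPlace") == f.get("deathPlace"))


result = evaluate_shared([chess_players, have_death_place], same_place_chess_player)
print(result)           # answers of the larger query
print(sorted(checked))  # only nodes in the intersection were verified
```

In this toy run only n1 and n2 are verified; n3 and n4 are never checked because they are missing from at least one sub-query result.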
Refine Optimization 2: Indicator Answers
  Observation 2: To verify 𝑞, 𝐷(𝑞) need not be completely evaluated.
  Define indicator answers 𝐼𝐴(𝑞) = { 𝑣 | 𝑣 ∈ 𝐷(𝑞) ∧ 𝑣 ≺ inf(𝐼) ∧ 𝑣 ∉ 𝐼 }.
    Only nodes in 𝐼𝐴(𝑞) affect the Top-k condition.
    𝑞 meets the Top-k condition iff |𝐼𝐴(𝑞)| ≤ 𝑘 − |𝐼|.
  Indicator answers are compatible with shared evaluation!
    If 𝑞₁ is a sub-query of 𝑞₂, then 𝐼𝐴(𝑞₁) ⊇ 𝐼𝐴(𝑞₂).
  Figure (labels only): Popularity Order; entities shown: B. Obama, V. Putin, M. Botvinnik, G. Kasparov, P. Morphy, E. Lasker; Query (b), Rank: 4; Query (c), Rank: 1.
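The indicator-answer test can be sketched in a few lines. This sketch assumes that inf(𝐼) denotes the least popular node of 𝐼 (our reading of the slide), that ≺ is a strict total order encoded as rank positions (smaller means more popular), and that condition 1 already holds, so 𝐼 ⊆ 𝐷(𝑞). Under these assumptions the least popular example sits at position |𝐼| + |𝐼𝐴(𝑞)| in the ranked answers, which gives the stated test. Function names and data are hypothetical.

```python
def indicator_answers(answers, rank, examples):
    """Non-example answers that are more popular than the least popular example."""
    worst = max(rank[v] for v in examples)  # rank position of inf(I), as read above
    return {v for v in answers if v not in examples and rank[v] < worst}


def meets_top_k(answers, rank, examples, k):
    """Top-k test via indicator answers; assumes examples is a subset of answers."""
    return len(indicator_answers(answers, rank, examples)) <= k - len(examples)


# Toy data (made up): six nodes ordered by popularity.
rank = {"n0": 0, "n1": 1, "n2": 2, "n3": 3, "n4": 4, "n5": 5}
examples = {"n3"}
answers = {"n0", "n1", "n2", "n3", "n5"}

print(sorted(indicator_answers(answers, rank, examples)))  # n5 is ignored: it ranks below n3
print(meets_top_k(answers, rank, examples, k=1))
print(meets_top_k(answers, rank, examples, k=4))
```

Answers that rank below every example, such as n5 here, never need to be found, which is why 𝐷(𝑞) does not have to be evaluated completely.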
Refine Optimization 3: Partial Evaluation
  Observation 3: Even 𝐼𝐴(𝑞) need not be completely obtained to reject 𝑞.
    Only a lower bound of |𝐼𝐴(𝑞)| is needed.
  Instead of one list 𝐼𝐴(𝑞), we keep two: nodes confirmed/uncertain to be in 𝐼𝐴(𝑞).
    Reject 𝑞 immediately if the confirmed list is long enough (> 𝑘 − |𝐼|).
    The number of “match” checks can be reduced.
  Figure (labels only): Popularity Order; entities shown: B. Obama, V. Putin, M. Botvinnik, G. Kasparov, P. Morphy, E. Lasker; Query (b), Rank: 4; Query (c), Rank: 1.
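The early-rejection idea can be sketched as follows. This is a simplified illustration, not the authors' algorithm: it assumes the uncertain candidates come from a sub-query's indicator answers, uses a hypothetical `matches` predicate for the per-node check, and does not show how the two lists are propagated between queries.

```python
from collections import deque


def reject_by_partial_evaluation(candidates, matches, k, num_examples):
    """Verify candidates one by one and stop as soon as q can be rejected.

    candidates   -- nodes that may belong to IA(q), e.g. IA of a sub-query
    matches      -- hypothetical predicate: matches(v) is True iff v answers q
    k            -- the top-k cutoff
    num_examples -- |I|
    Returns (rejected, checks), where checks counts the calls to matches.
    """
    budget = k - num_examples      # q survives iff |IA(q)| <= budget
    confirmed = []                 # nodes known to be in IA(q)
    uncertain = deque(candidates)  # nodes not yet verified
    checks = 0
    while uncertain:
        v = uncertain.popleft()
        checks += 1
        if matches(v):
            confirmed.append(v)
            if len(confirmed) > budget:
                # The lower bound on |IA(q)| already exceeds the budget.
                return True, checks
    return False, checks


# Toy run (made-up data): n1 does not match the query, n0 and n2 do.
def matches(v):
    return v != "n1"


print(reject_by_partial_evaluation(["n0", "n1", "n2"], matches, k=1, num_examples=1))
print(reject_by_partial_evaluation(["n0", "n1", "n2"], matches, k=4, num_examples=1))
```

With 𝑘 = 1 the query is rejected after a single check, while with 𝑘 = 4 all three candidates are verified and the query is kept.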
Experimental Settings
  Datasets
    Knowledge base: DBpedia 3.9.
    Popularity ranking: PageRank score.
    Queries: 52 of the 250 questions in the QALD-4-Task-1 dataset.
      Allocated into 5 groups w.r.t. the shape (size, radius) of their ground truth query.
  Compared variants:
    RkNPQ-gStore: Trivial refine using gStore [Zou, PVLDB’11]
    RkNPQ-S: Shared evaluation
    RkNPQ-SI: Shared evaluation of Indicator answers
    RkNPQ-SPI: Shared and Partial evaluation of Indicator answers
  Methodology and Metrics:
    Use the top-1/2 results as input examples to our algorithms;
    Investigate the effectiveness (# returned queries) and efficiency (running time).
Effectiveness
  Classify questions into Easy/Moderate/Hard w.r.t. # returned queries.
  Simpler question groups have more Easy/Moderate questions.
  More example answers cause many questions to turn Easy/Moderate.
    The inherent ambiguity of the input is reduced.
    Two examples are generally enough for a browsable output.
Efficiency
  Compare adjacent pairs of the four variants.
  When |𝐼| = 𝑘 = 1, each of the three optimizations speeds up RkNPQ by one to two orders of magnitude.
  When |𝐼| = 𝑘 = 2, 𝐼𝐴(𝑞) → 𝐷(𝑞), making indicator answers less beneficial.
Analysis of Parameter k
  What happens if we fix |𝐼| = 1 and increase k?
  # Returned queries?
    More queries are returned.
    As k increases, the number tends to converge.
    Larger k prevents unsuccessful search, but makes the returned queries harder to browse.
  Running time (RkNPQ-SPI)?
    RkNPQ-SPI is almost always faster than RkNPQ-SI.
    The running time increases slowly.
Related Work
  Reverse Engineering Structured Queries
    SQL queries: [Tran, SIGMOD’09] and [Zhang, SIGMOD’13]
    Interactive setting: [Bonifati, EDBT’14], [Staworko, ICDT’12], and [Bonifati, EDBT’15]
  Reverse Query Problems for Vector Data
    Reverse top-k queries: [Vlachou, ICDE’10]
    Reverse KNN queries: [Korn, SIGMOD’00]
    Reverse skyline queries: [Dellis, PVLDB’07]
  Query by Example Entities & Tuples
    [Jayaram, TKDE’15], [Lim, EDBT’13], and [Mottin, PVLDB’14]
  Natural Language QA over Knowledge Bases
    [Unger, WWW’12], [Yahya, EMNLP’12], [Berant, EMNLP’13], and [Zou, SIGMOD’14]
Conclusions
  We propose Reverse top-k Neighborhood Pattern Queries to help users issue knowledge base queries using representative partial answers.
  The search space is explored under a filter-refine framework.
  Three optimizations on the refine stage are investigated:
    shared evaluation, indicator answers, and partial evaluation.
  When given enough examples, the RkNPQ-SPI algorithm can generate a small number of possible queries for the user within reasonable time.
Thank you! Questions?