Suggestion of Promising Result Types for XML Keyword Search Joint work with Jianxin Li, Chengfei Liu and Rui Zhou ( Swinburne University of Technology,




Presentation transcript:

Suggestion of Promising Result Types for XML Keyword Search
Wei Wang, University of New South Wales, Australia
Joint work with Jianxin Li, Chengfei Liu and Rui Zhou (Swinburne University of Technology, Australia)

Outline
- Motivation
- Scoring Result Types
- Query Processing Algorithms
- Experimental Study
- Conclusions

Motivation
- Keyword queries are easy to use for casual users: no need to know a query language or the schema of the data.
- Keyword queries are inherently imprecise. How to find relevant results?
- Browse all relevant results: impossible or unusable!
- Restrict the results:
  1. XSEarch, SLCA/ELCA, and their variants
  2. Return result instances from the most likely query result type (XReal and our work)
"If it has a syntax, it isn't user friendly" -- /usr/game/fortune

Query Result Types
Query = {1980, art}
Result types: /root/students/student, /root/books/book
Definition: a label path such that at least one of its corresponding instances contains all the search keywords.
Intuition: users want to fetch instances of a certain entity type with keyword predicates.
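The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy document, the whitespace tokenization, and the helper names `result_types` and `most_specific` are all assumptions. Read literally, the definition also admits ancestor paths such as /root, so the sketch keeps only the deepest paths to match the two result types listed on the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document modelled on the slide's example paths.
DOC = """
<root>
  <students>
    <student><born>1980</born><interest>art</interest></student>
    <student><born>1982</born><interest>music</interest></student>
  </students>
  <books>
    <book><year>1980</year><title>art history</title></book>
    <book><year>1999</year><title>databases</title></book>
  </books>
</root>
"""

def subtree_tokens(elem):
    """Lower-cased whitespace tokens in the text of elem and its descendants."""
    return set(" ".join(elem.itertext()).lower().split())

def result_types(root, keywords):
    """Label paths with at least one instance whose subtree holds every keyword."""
    wanted = {k.lower() for k in keywords}
    found = set()

    def visit(elem, path):
        if wanted <= subtree_tokens(elem):
            found.add(path)
        for child in elem:
            visit(child, path + "/" + child.tag)

    visit(root, "/" + root.tag)
    return found

def most_specific(paths):
    """Assumption: drop any path that is a proper prefix of another candidate."""
    return {p for p in paths if not any(q.startswith(p + "/") for q in paths)}

candidates = result_types(ET.fromstring(DOC), ["1980", "art"])
print(sorted(most_specific(candidates)))
```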

Ranking
Query = {1980, art}
Result types: /root/students/student, /root/books/book
Score each result type, and select the most promising return type (i.e., the one with the highest score).
Subtleties: 1 return type → n query templates → n*m result instances

Scoring Individual Results /1
score(result type) = aggregate(score(instance1), score(instance2), …)
Need to score each individual result instance R.
1. Not all matches are equal in terms of content.
   - Inverse element frequency: ief(x) = N / (number of nodes containing the token x)
   - E.g., Weight(n_i contains a) = log(ief(a))
[Figure: example subtree with keyword matches a, b, c]
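The content weight above can be computed directly from per-node token sets. The sketch below is illustrative only: the function names and the toy data are assumptions, and the natural logarithm is used because the slide does not state a base.

```python
import math

def ief(token, node_tokens):
    """Inverse element frequency: N / number of nodes containing the token."""
    containing = sum(1 for tokens in node_tokens if token in tokens)
    if containing == 0:
        raise ValueError(f"token {token!r} does not occur in any node")
    return len(node_tokens) / containing

def weight(token, node_tokens):
    """Weight of a match on `token`, i.e. log(ief(token))."""
    return math.log(ief(token, node_tokens))

# Hypothetical collection of four nodes, each given as its set of tokens.
nodes = [{"a", "b"}, {"a"}, {"c"}, {"a", "c"}]
print(weight("a", nodes), weight("b", nodes))
```

A token that occurs in few nodes (here b) receives a larger weight than a common one (here a), which is the usual IDF intuition carried over to XML elements.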

Scoring Individual Results /2
score(result type) = aggregate(score(instance1), score(instance2), …)
Need to score each individual result instance R.
2. Not all matches are equal in terms of structure.
   - Distance between the match and the root of the subtree
   - Also considers the avg-depth of the XML tree to attenuate the impact of long paths
[Figure: example subtree with matches a, b, c; dist = 3]

Scoring Individual Results /3
score(result type) = aggregate(score(instance1), score(instance2), …)
Need to score each individual result instance R.
3. Favor tightly-coupled results.
   - When calculating dist(), discount the shared path segments
[Figure: loosely coupled vs. tightly coupled results]

Scoring Individual Results
score(result type) = aggregate(score(instance1), score(instance2), …)
Need to score each individual result instance R.
The final formula

Scoring Return Types
score(result type) = aggregate(score(instance1), score(instance2), …)
- aggregate: sum up the top-k instance scores
- k = average number of instances over all query result types
- Pad with 0.0 if necessary
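The aggregation rule above can be sketched as follows. The function name, the input layout (a dict from result type to its instance scores), and the rounding of the average to an integer k are assumptions; the slide does not say how a fractional average is handled.

```python
def score_result_types(instance_scores):
    """Sum the top-k instance scores per result type.

    instance_scores: dict mapping a result type to a list of instance scores.
    k is the average number of instances per result type (rounded: an assumption).
    """
    total = sum(len(scores) for scores in instance_scores.values())
    k = max(1, round(total / len(instance_scores)))
    result = {}
    for rtype, scores in instance_scores.items():
        top = sorted(scores, reverse=True)[:k]
        top += [0.0] * (k - len(top))  # pad with 0.0 when fewer than k instances
        result[rtype] = sum(top)
    return result

# Hypothetical instance scores: 3 + 1 instances, so k = 2.
scores = {
    "/root/students/student": [0.9, 0.5, 0.4],
    "/root/books/book": [1.2],
}
print(score_result_types(scores))
```

Padding with 0.0 does not change a sum; it is kept here only to mirror the slide. Using a common k keeps a type with many weak instances from outscoring a type with a few strong ones purely by count.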

Query Processing Algorithms
INL (Inverted Node List-based Algorithm)
- Merge all relevant nodes and group the merged results by result type
  - Uses an inverted index + Dewey encoding
- Calculate the score for each result type using the ranking function
  - Only needs to keep the top-k best scores for each result type
- Slow because there is no pruning or skipping
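The merge-and-group step of INL can be illustrated with Dewey codes stored as tuples. This is a simplified sketch under stated assumptions, not the algorithm from the talk: the toy index contents, the `LABEL_PATH` lookup table, and the function names are hypothetical, and scoring is left out entirely.

```python
from collections import defaultdict

# Hypothetical toy data: Dewey code -> label path of that node.
LABEL_PATH = {
    (1,): "/root",
    (1, 1): "/root/students",
    (1, 1, 1): "/root/students/student",
    (1, 1, 1, 1): "/root/students/student/born",
    (1, 1, 1, 2): "/root/students/student/interest",
    (1, 2): "/root/books",
    (1, 2, 1): "/root/books/book",
    (1, 2, 1, 1): "/root/books/book/year",
    (1, 2, 1, 2): "/root/books/book/title",
    (1, 2, 2): "/root/books/book",
    (1, 2, 2, 1): "/root/books/book/title",
}

# Hypothetical inverted index: keyword -> Dewey codes of nodes containing it.
INVERTED = {
    "1980": [(1, 1, 1, 1), (1, 2, 1, 1)],
    "art": [(1, 1, 1, 2), (1, 2, 1, 2), (1, 2, 2, 1)],
}

def ancestors_or_self(dewey):
    """Every prefix of a Dewey code identifies an ancestor (or the node itself)."""
    return {dewey[:i] for i in range(1, len(dewey) + 1)}

def group_by_result_type(keywords):
    """Nodes whose subtree contains all keywords, grouped by label path."""
    covering = None
    for kw in keywords:
        nodes = set()
        for posting in INVERTED.get(kw, []):
            nodes |= ancestors_or_self(posting)
        covering = nodes if covering is None else covering & nodes
    groups = defaultdict(list)
    for dewey in sorted(covering or ()):
        groups[LABEL_PATH[dewey]].append(dewey)
    return dict(groups)

print(group_by_result_type(["1980", "art"]))
```

Because a Dewey code encodes the whole root-to-node path, ancestor tests reduce to prefix comparisons, which is why the encoding pairs naturally with an inverted index.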

SDI Algorithm
How to be more efficient?
- Approximately compute the score of each return type
- Prune some of the less likely return types
SDI (Statistic Distribution Information-based Algorithm)
- Based on several additional indexes: keyword-path index, enhanced F&B index (with distributional info)
- Generate query templates by merging distinct paths
- Estimate the score of each query template
- Aggregate the scores for each result type

Keyword-Path Index & Query Templates
Keyword-path index: maps each keyword to the set of label paths that characterize all its occurrences.
- 1980 → { root/students/student/born, root/books/book/year, root/books/book/title }
- art → { root/students/student/interest, root/books/book/title }
Merge the label paths to obtain query templates:
- root/students/student[born ~ 1980][interest ~ art]
- root/books/book/title
- root/books/book[year ~ 1980][title ~ art]
Iteratively ascend to the parent label path if a query template has no estimated result.
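One way to read the merge step is: pick one label path per keyword, take the longest common prefix as the template root, and turn the remainders into predicates. The sketch below follows that reading; the function names and the "." notation for a keyword matched at the template root itself are assumptions. It enumerates every combination, including templates rooted at the document root that the slide does not list, and it does not implement the "ascend to the parent" fallback.

```python
from itertools import product

# Keyword-path index taken from the slide's example.
KEYWORD_PATHS = {
    "1980": {
        "root/students/student/born",
        "root/books/book/year",
        "root/books/book/title",
    },
    "art": {
        "root/students/student/interest",
        "root/books/book/title",
    },
}

def common_prefix(paths):
    """Longest shared sequence of leading labels among the given label paths."""
    prefix = []
    for labels in zip(*(p.split("/") for p in paths)):
        if len(set(labels)) != 1:
            break
        prefix.append(labels[0])
    return prefix

def query_templates(keywords):
    """One template per combination of label paths, one path per keyword."""
    templates = set()
    for combo in product(*(sorted(KEYWORD_PATHS[k]) for k in keywords)):
        prefix = common_prefix(combo)
        predicates = ""
        for kw, path in zip(keywords, combo):
            relative = "/".join(path.split("/")[len(prefix):]) or "."
            predicates += f"[{relative} ~ {kw}]"
        templates.add("/".join(prefix) + predicates)
    return templates

for template in sorted(query_templates(["1980", "art"])):
    print(template)
```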

Enhanced F&B Index
- A more refined XSketch synopsis
- Estimates the size of certain simple queries, e.g.,
  - size(root/books/book[title])
  - size(root/books/book[name])
- Hardly handles correlation, e.g.,
  - size(root/books/book[title][name])

Structural Distribution
size(root/books/book[title][name]) = 0

Value Distribution

Estimation
/root/books/book → 6
/root/books/book/year → 1
/root/books/book/title → 1
h_{book[year][title]} = 4/6
f_{1980|year} = 3/5
f_{art|title} = 4/5
Final estimation =
[Figure: tree root/books/book with children year (1980) and title (art)]
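The transcript does not preserve the final value, so the following is only a sketch of how such an estimate could be combined, assuming the structural fraction h and the two value fractions f are treated as independent and multiplied with the size of /root/books/book. The variable names are hypothetical and the combination rule is an assumption, not the formula from the talk.

```python
from fractions import Fraction

# Quantities read off the slide.
size_book = 6                    # estimated number of /root/books/book nodes
h_year_title = Fraction(4, 6)    # share of book nodes with both year and title
f_1980_year = Fraction(3, 5)     # share of year values matching "1980"
f_art_title = Fraction(4, 5)     # share of title values matching "art"

# Assumption: independence, so the fractions simply multiply.
estimate = size_book * h_year_title * f_1980_year * f_art_title
print(estimate, float(estimate))
```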

Recap SDI
- Retrieve the relevant label paths from the keyword-path index
- Generate query templates by merging distinct paths
- Estimate the score of each query template
- Aggregate the scores for each result type

Experiment Setup /1
Three real datasets used:
- NASA: astronomical data
- UWM: course data derived from university websites
- DBLP: computer science journals and proceedings

Experiment Setup /2
18 keyword queries:

Quality of Suggestions
- NASA: XReal only focuses on one node at the higher level, while INL and SDI can reach more detailed nodes.
- UWM: For Q12 and Q32, INL and SDI can predict more meaningful results than XReal does. For Q42, XReal can do "better".
- DBLP: All methods produce the same results because the structure of DBLP is so flat.

Efficiency
- SDI's speedup against XReal: 3x ~ 10x
- Speedup is even more significant on the other two datasets

Conclusions
- Alleviates the inherent imprecision of keyword queries by scoring their result types
  - Can return only instances from the most promising one
  - Or take the score into consideration in the final ranking function
- Efficient estimation-based method to find the most promising return types
- Experimental results demonstrate both the effectiveness and the efficiency of the proposed approach

Q & A
Our Keyword Search Project Homepage:

Related Work
[Liu & Chen, SIGMOD07]
- Classifies XML nodes into one of three node types
- However, it only identifies a specific return node type for each result.
XReal [Bao et al, ICDE09]
- Summarizes the statistical information between element nodes and all tokens in the leaf nodes
- An IR-style method is used to infer the result type based on the statistics.
- However, it does not model the correlation among the XML elements and values.