Toward Scalable Keyword Search over Relational Data Akanksha Baid, Ian Rae, Jiexing Li, AnHai Doan, and Jeffrey Naughton University of Wisconsin VLDB 2010.

Presentation transcript:

Toward Scalable Keyword Search over Relational Data
Akanksha Baid, Ian Rae, Jiexing Li, AnHai Doan, and Jeffrey Naughton
University of Wisconsin, VLDB 2010
Original slides by Jaehui Park, modified by Bao Huy Ung

Copyright 2011 by CEBT

Introduction
The success of search engines has drawn significant attention to keyword search (KWS) over relational databases
Robustness, accuracy, reliability, and privacy
Performance-related issues
– Unpredictable running times
– KWS solutions often require solving sub-problems that are NP-complete, but users want answers within an absolute time limit
Basic idea: combining KWS systems with forms
– Producing the answers that can be generated quickly, as in today's KWS systems
– Showing users query forms that characterize the unexplored portion of the answer space

Problems in current KWS solutions
Running time increases as the number of joins increases
Core problem
– Searching a graph to find all sub-graphs that satisfy certain properties (Steiner tree problem: NP-hard)
Existing solutions
– Bounding the maximum number of joins
– Considering only the top-k answers

KWS-F: Keyword Search using Forms
How can we enable naïve users to pose complex SQL queries?
– Generate a set of SQL queries that are most likely to be asked by naïve users
– Generate a set of query forms that encode those SQL queries
Two indexes
– DataIndex: keyword -> schema term, used to generate a set of form queries
– FormIndex: returns form ids
Limitations
– A tedious process: the user must examine the forms that KWS-F returns, select promising ones, fill them out, submit them, examine the results, and potentially repeat
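The two-index lookup can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the index contents, the table names, the function names `form_queries` and `search_forms`, and the choice to model a form query as a set of schema terms that must all appear in a form. Form ranking is left out.

```python
from itertools import product

# Hypothetical DataIndex: data keyword -> schema terms (tables) whose tuples contain it.
DATA_INDEX = {
    "dewitt": {"person"},
    "widom": {"person"},
    "vldb": {"conference", "publication"},
}

# Hypothetical FormIndex: schema term -> ids of the forms that mention it.
FORM_INDEX = {
    "person": {1, 2, 3},
    "publication": {2, 3, 4},
    "conference": {3, 5},
}


def form_queries(keywords):
    """Rewrite a keyword query into form queries, one schema term per keyword."""
    choices = []
    for kw in keywords:
        terms = DATA_INDEX.get(kw)
        # A keyword with no DataIndex entry is kept as-is (it may already be a schema term).
        choices.append(sorted(terms) if terms else [kw])
    # Each combination is one conjunctive form query; the set drops repeated queries.
    return {frozenset(combo) for combo in product(*choices)}


def search_forms(keywords):
    """Return ids of forms that contain every schema term of at least one form query."""
    if not keywords:
        return []
    hits = set()
    for query in form_queries(keywords):
        postings = [FORM_INDEX.get(term, set()) for term in query]
        hits |= set.intersection(*postings)
    return sorted(hits)
```

With these toy indexes, `search_forms(["dewitt", "vldb"])` expands to the form queries {person, conference} and {person, publication} and returns the union of their matching form ids.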

Combining KWS and KWS-F
Goal: predictable performance and good coverage
Time limit
– When the time limit has been reached, a result must be returned
Missing answers
– The system gives the user an idea of what the unexplored portion of the answer space "looks like"

Combining KWS and KWS-F
When ranking cannot help
– Cases where a good ranking function does not exist
– Cases where many result tuples with the same score are returned
Forms offer a "guidance effect"
– Offering a good transition from an unstructured keyword query to the results of a structured query: a partially structured query

The Hybrid Approach
Phase 0: generate a large set of forms offline
Phase 1: given the user query Q and the time limit T, the system KWS' generates only those CNs (candidate networks) and executes only those SQL queries that can be completed within time T
Phase 2: send Q, together with a status report on its execution, to KWS-F and obtain a ranked list of forms
Phase 3: examine the status report and remove the forms that have already been covered by KWS'

KWS-F: Form generation (Phase 0)
Forms are generated based only on the primary key-foreign key relationships in the underlying schema graph
Form generation produces many duplicates
– ex) duplicates in forms (219 unique join sequences)
– Duplicate form elimination: exploiting the property that joins are both associative and commutative
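One way to exploit associativity and commutativity is to reduce each join sequence to an order-independent key and keep one sequence per key. The Python sketch below is an illustration under stated assumptions, not the algorithm from the paper: a join sequence is modeled as a list of (table, table) pairs, the names `canonical_join` and `unique_join_sequences` are hypothetical, and sequences that join the same table more than once under different aliases would need a multiset key instead.

```python
def canonical_join(join_sequence):
    """Build an order-independent key for a sequence of joins.

    Sorting each pair handles commutativity (A join B == B join A);
    collecting the pairs in a frozenset handles associativity and join order.
    """
    return frozenset(tuple(sorted(edge)) for edge in join_sequence)


def unique_join_sequences(sequences):
    """Keep the first join sequence seen for each canonical key."""
    seen = set()
    unique = []
    for seq in sequences:
        key = canonical_join(seq)
        if key not in seen:
            seen.add(key)
            unique.append(seq)
    return unique


# Hypothetical schema: person -- writes -- publication.
candidates = [
    [("person", "writes"), ("writes", "publication")],
    [("publication", "writes"), ("writes", "person")],  # same joins, different order
    [("person", "writes")],
]
deduplicated = unique_join_sequences(candidates)
```

Here the second candidate is dropped because it joins the same tables as the first, only written in a different order.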

KWS: CN generation and SQL Execution (Phase 1)
Modified KWS system
Options
– Waiting for CN generation to terminate before starting SQL query execution
– Dividing the time budget T into two parts (risk: failing to estimate this division accurately)
– Interleaving CN generation and SQL query execution
– (Algorithm 2) Producer-consumer fashion using two separate threads: while CN generation continues, SQL queries are executed in the order in which they occur in the priority queue, which is ordered by query execution cost (disk page fetches)
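The producer-consumer option can be sketched with two Python threads sharing a priority queue. This is a simplified illustration, not Algorithm 2 itself: the function name `run_with_budget`, the `(estimated_cost, sql)` candidate format, and the `execute` callback are all assumptions, and a query that is already running when the deadline passes is allowed to finish rather than being cancelled.

```python
import queue
import threading
import time


def run_with_budget(candidates, execute, time_limit):
    """Interleave query generation and execution under a time limit (seconds).

    candidates: iterable of (estimated_cost, sql) pairs, standing in for CN generation.
    execute: callable that runs one SQL string.
    Returns (executed, unexecuted) lists of SQL strings.
    """
    pending = queue.PriorityQueue()  # cheapest estimated cost is popped first
    deadline = time.monotonic() + time_limit
    done_producing = threading.Event()
    executed = []

    def producer():
        for cost, sql in candidates:
            if time.monotonic() >= deadline:
                break
            pending.put((cost, sql))
        done_producing.set()

    def consumer():
        while time.monotonic() < deadline:
            try:
                _, sql = pending.get(timeout=0.01)
            except queue.Empty:
                # Stop only when the producer is finished and nothing is left to run.
                if done_producing.is_set() and pending.empty():
                    break
                continue
            execute(sql)
            executed.append(sql)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Whatever is still queued was generated but never executed.
    unexecuted = []
    while not pending.empty():
        unexecuted.append(pending.get_nowait()[1])
    return executed, unexecuted
```

The `unexecuted` list is the kind of information a status report could carry to the form-search phase; candidates that were never generated before the deadline would have to be tracked separately.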

KWS-F: Search (Phase 2)
Once KWS' has terminated, the results are displayed to users
The query and the execution status of KWS' are passed to KWS-F
Form search step
– A full-text index (FormIndex) over the forms was generated in Phase 0
– Returns forms by evaluating multiple conjunctive schema-term queries

KWS-F: Minimizing overlap (Phase 3)
Why overlap minimization is important: redundant forms and limited screen size
Discard a form f only if both conditions hold:
– All CNs that map to the form f have been generated in Phase 1. The list of CN templates populated in Phase 1 is used to verify this condition.
– All SQL queries corresponding to the form f have been executed in Phase 1. The list of unexecuted queries is used to verify this condition.
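The two discard conditions translate into a short filter. The Python sketch below assumes a simplified data model that is not taken from the paper: each form id maps to the set of CN ids it covers (`cns_for_form`), each CN corresponds to one SQL query, and Phase 1 reports the generated CNs and the generated-but-unexecuted ones as sets. The name `minimize_overlap` is hypothetical.

```python
def minimize_overlap(ranked_forms, cns_for_form, generated, unexecuted):
    """Drop forms whose answers were already fully produced in Phase 1.

    ranked_forms: form ids in ranked order.
    cns_for_form: form id -> set of CN ids that map to that form.
    generated: CN ids generated in Phase 1.
    unexecuted: CN ids whose SQL queries were generated but not executed.
    """
    kept = []
    for form_id in ranked_forms:
        cns = cns_for_form.get(form_id, set())
        all_generated = cns <= generated          # condition 1
        all_executed = cns.isdisjoint(unexecuted)  # condition 2
        if cns and all_generated and all_executed:
            continue  # fully covered by the keyword-search results
        # Forms with unknown coverage are kept, which errs on the side of showing them.
        kept.append(form_id)
    return kept


remaining = minimize_overlap(
    ranked_forms=[10, 11, 12],
    cns_for_form={10: {"a", "b"}, 11: {"b", "c"}, 12: {"d"}},
    generated={"a", "b", "c"},
    unexecuted={"c"},
)
```

In this toy run, form 10 is discarded because both of its CNs were generated and executed, form 11 is kept because CN "c" was never executed, and form 12 is kept because CN "d" was never generated.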

Experimental Evaluation
Data sets
– DBLP: 680MB (in XML), 1340MB (in RDB)
– DBLife: 40MB ( tuples in 14 tables)
The new algorithm generates 30%-40% fewer forms than the baseline algorithm

Experimental Evaluation
Time-out-based strategy
– Time limit = 15s
– The number of SQL queries executed by each approach
– The producer-consumer based approach performs better

Experimental Evaluation
The distribution of the SQL queries generated in response to the keyword query "dewitt widom"
– Ordering the queries by estimated execution cost does yield some benefit
The overlap minimization algorithm
– The number of forms and the number of SQL queries eliminated for each query

Experimental Evaluation
The hybrid approach can be applied to recent systems: BANKS, BLINKS, EASE

Questions
– What is a reasonable time constraint for users?
– What types of forms are used in KWS-F systems?
– The approach is designed with novice users in mind; what kinds of real-world applications would it suit?