Keyword Search in Relational Databases
Jaehui Park
Intelligent Database Systems Lab., Seoul National University
2009. 02. 12.

Copyright © 2009 by CEBT

Outline
- Introduction
- Bibliography
- Fundamental Characteristics
- Research Dimensions
- Summary
- Future Directions

Introduction
- Querying structured data
  - Relational databases
    - A repository for a significant amount of data (e.g. enterprise data)
    - RDBMS managing an abstract view of underlying data
  - Structured Query Language (SQL)
    - Precise and complete
    - Difficult for casual users
- Querying unstructured data
  - (Web) documents
    - Collection of unstructured (natural language) documents available online
    - Search engine
      - The most popular application for information discovery
  - Keyword search
    - Simple and user-friendly
    - Approximating the precise results
      - In statistical and semantic ways
- Deep Web
  - Information over the Web comes out of relational databases

[Figure: data (structured vs. unstructured) against querying (precise vs. easy); the goal is an easy way of querying structured data.]

Introduction
- Enabling casual users to query relational databases with keywords
  - "Casual users"
    - Without any knowledge of the schema
    - Without any knowledge of the query language (SQL)
- The search system should hold this knowledge on behalf of users
- Challenges
  - Inherent discrepancy of data between IR and DB
    - Information is often split across tables (or tuples) in relational databases
    - Ex) A single retrieval unit of information

[Figure: keywords and SQL as two routes from relational databases to results.]

Bibliography
- Proximity
  - [Goldman et al., VLDB, 1998] Proximity Search in Databases
- DataSpot
  - [Palmon et al., VLDB, 1998] DTL's DataSpot - Database Exploration Using Plain Language
  - [Palmon et al., SIGMOD, 1998] DTL's DataSpot - Database Exploration as Easy as Browsing the Web
- DBXplorer
  - [Agrawal et al., ICDE, 2002] DBXplorer: A System for Keyword-Based Search over Relational Databases
- BANKS
  - [Hulgeri et al., DEBU, 2001] Keyword Search in Databases
  - [Hulgeri et al., ICDE, 2002] Keyword Searching and Browsing in Databases using BANKS
  - [Kacholia et al., VLDB, 2005] Bidirectional Expansion For Keyword Search
- DISCOVER
  - [Hristidis et al., VLDB, 2002] DISCOVER: Keyword Search in Relational Databases
  - [Hristidis et al., VLDB, 2003] Efficient IR-Style Keyword Search over Relational Databases
  - [Liu et al., SIGMOD, 2006] Effective Keyword Search in Relational Databases
- ObjectRank
  - [Balmin and Hristidis et al., VLDB, 2004] ObjectRank: Authority-Based Keyword Search in Databases
  - [Balmin and Hristidis et al., TODS, 2008] Authority-Based Search on Databases

Proximity
- Proximity
  - Measure of how related objects are
  - Objects are related by a distance function
    - Shortest path computation
  - K-neighborhood distance look-up table

[Figure labels: document, relational database.]
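As a rough illustration of the distance-function idea on this slide, the sketch below treats tuples as nodes of an undirected graph, measures proximity as the shortest-path length found by breadth-first search, and precomputes a K-neighborhood look-up table. The graph, the node names, and the function names are illustrative assumptions, not taken from the Proximity paper.

```python
from collections import deque

def k_neighborhood(graph, source, k):
    """Return {node: distance} for every node within k hops of source (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if dist[node] == k:          # do not expand beyond the k-hop boundary
            continue
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def build_lookup_table(graph, k):
    """Precompute the K-neighborhood distance table for all nodes."""
    return {node: k_neighborhood(graph, node, k) for node in graph}

# Hypothetical data graph: authors and papers linked through "writes" tuples.
graph = {
    "author:JHPark": ["writes:1"],
    "author:SGLee": ["writes:2"],
    "writes:1": ["author:JHPark", "paper:P08"],
    "writes:2": ["author:SGLee", "paper:P08"],
    "paper:P08": ["writes:1", "writes:2"],
}
table = build_lookup_table(graph, k=2)
```

With k = 2, the table answers "how close are these two objects?" by a dictionary look-up; pairs farther apart than k hops are simply absent and would need an on-demand shortest-path computation.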

DataSpot
- Hyperbase
  - Modeling data graph
  - Sub-hyperbase as an answer
- Best-first searching

[Figure: relational tables (Customers, Orders, with a Customer ID column) are converted into a hyperbase. Node labels in the figure: Record, Field, Field Name, Field Value, String, Key, Text "Customer", Text "ID", Stem "customer", Thesaurus, Stem "client". Keyword queries go to the hyperbase; SQL queries go to the relational database.]

DBXplorer
- Symbol table
  - Index for schema entities
  - Locating objects efficiently
    - Granularity
    - Compaction
- Schema graph
  - Join tree enumeration
    - Joining several tables on the fly

[Figure: a keyword query is looked up in the symbol table (term, location), which points into the relational databases.]
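The symbol-table idea can be sketched as a small inverted index from terms to schema locations. The example below is a minimal sketch under assumed data: the table contents, the `build_symbol_table` name, and the two granularity labels are hypothetical, and no compaction is attempted.

```python
from collections import defaultdict

def build_symbol_table(tables, granularity="column"):
    """Map each lowercase token to the schema locations where it occurs.

    tables: {table_name: list of row dicts}
    granularity: "column" stores (table, column);
                 "cell" stores (table, column, row index).
    """
    index = defaultdict(set)
    for table, rows in tables.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                for token in str(value).lower().split():
                    if granularity == "column":
                        index[token].add((table, column))
                    else:
                        index[token].add((table, column, row_id))
    return index

# Hypothetical sample data.
tables = {
    "Author": [{"AuthorID": "JHPark", "AuthorName": "Jaehui Park"},
               {"AuthorID": "SGLee", "AuthorName": "Sang-goo Lee"}],
    "Paper": [{"PaperID": "P08", "PaperName": "Web Content Summarization"}],
}
column_index = build_symbol_table(tables, "column")
cell_index = build_symbol_table(tables, "cell")
```

The column-level index is smaller but still requires a scan or a database index on the column to find matching rows; the cell-level index is larger but points directly at rows. That is the granularity trade-off named on the slide.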

BANKS
- Directed (data) graph
  - Backward edge
  - Graph traversing algorithm
    - NP-hard problem
    - Heuristics
      - Backward expanding search
      - Bi-directional expanding search
- Rich interface
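A much simplified sketch of the backward expanding idea: starting from the nodes that match each keyword, follow edges in reverse with Dijkstra's algorithm, and treat any node reached from every keyword as a candidate answer root. This is an assumption-laden illustration, not the BANKS implementation: it runs the searches one after another rather than interleaved, ignores node weights and the extra backward edges BANKS adds, and all names and data are hypothetical.

```python
import heapq

def backward_distances(reverse_edges, sources):
    """Multi-source Dijkstra over reversed edges.

    Returns, for each reachable node, the cost of its cheapest forward path
    to any node in sources.
    """
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue                                   # stale heap entry
        for prev, weight in reverse_edges.get(node, ()):
            nd = d + weight
            if nd < dist.get(prev, float("inf")):
                dist[prev] = nd
                heapq.heappush(heap, (nd, prev))
    return dist

def best_roots(edges, keyword_nodes):
    """Rank candidate roots by total path cost to one node per keyword."""
    reverse_edges = {}
    for src, dst, weight in edges:
        reverse_edges.setdefault(dst, []).append((src, weight))
    per_keyword = [backward_distances(reverse_edges, nodes)
                   for nodes in keyword_nodes.values()]
    common = set.intersection(*(set(d) for d in per_keyword))
    return sorted((sum(d[n] for d in per_keyword), n) for n in common)

# Hypothetical directed data graph: edges point from referencing tuples
# to the tuples they reference.
edges = [
    ("writes:1", "author:JHPark", 1.0),
    ("writes:1", "paper:P08", 1.0),
    ("writes:2", "author:SGLee", 1.0),
    ("writes:2", "paper:P08", 1.0),
]
keyword_nodes = {"jaehui": {"author:JHPark"}, "summarization": {"paper:P08"}}
roots = best_roots(edges, keyword_nodes)
```

Here the only node with a path to both keyword nodes is the `writes:1` tuple, so it becomes the root of the single answer tree.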

DISCOVER
- High-level representation of the architecture for keyword search in relational databases
- Top-k join query processing
  - Pipeline algorithm
    - Threshold [Fagin et al. 2001]
- IR-style ranking function
  - TF-IDF based tuple ranking
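To make "TF-IDF based tuple ranking" concrete, the sketch below scores the text attribute of each tuple with a plain tf × idf sum. This is the textbook weighting, assumed here for illustration; it is not the normalized scoring function used in the DISCOVER papers, and the sample data and names are hypothetical.

```python
import math
from collections import Counter

def tfidf_scores(documents, query_terms):
    """Score each text value with the sum over query terms of tf * idf."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))       # count each term once per tuple
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[t] * math.log(n_docs / doc_freq[t])
            for t in query_terms
            if doc_freq[t] > 0             # skip terms that occur nowhere
        )
    return scores

# Hypothetical text attribute values keyed by tuple id.
complaints = {
    "c1": "disk crash after disk upgrade",
    "c2": "screen flicker",
    "c3": "disk noise",
}
scores = tfidf_scores(complaints, ["disk", "crash"])
ranked = sorted(scores, key=scores.get, reverse=True)
```

Tuple `c1` ranks first because it contains both query terms and repeats "disk"; `c2` contains neither and scores zero.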

ObjectRank
- Authority
  - Measure of how important objects are
    - Authority flow graph
  - Modified PageRank algorithm
    - (Global) ObjectRank algorithm
    - Inverse ObjectRank algorithm
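The "modified PageRank" idea can be sketched as a power iteration in which authority flows along weighted edges and random jumps land only on the objects that contain the keyword (the base set). The transfer rates, damping factor, graph, and function name below are assumptions chosen for illustration, not values from the ObjectRank papers.

```python
def object_rank(edges, base_set, damping=0.85, iterations=50):
    """Keyword-specific authority scores by power iteration.

    edges: (source, target, authority_transfer_rate) triples
    base_set: nodes containing the keyword; random jumps land only here
    """
    nodes = {n for src, dst, _ in edges for n in (src, dst)} | set(base_set)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    jump = {n: (1.0 / len(base_set) if n in base_set else 0.0) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) * jump[n] for n in nodes}
        for src, dst, rate in edges:
            new_rank[dst] += damping * rate * rank[src]
        rank = new_rank
    return rank

# Hypothetical citation graph: p1 and p2 contain the keyword and both cite p3,
# which in turn cites p4. Each citation transfers authority at rate 0.7.
edges = [("p1", "p3", 0.7), ("p2", "p3", 0.7), ("p3", "p4", 0.7)]
rank = object_rank(edges, base_set={"p1", "p2"})
```

Paper `p3` does not contain the keyword, yet it ends up with more authority than either `p1` or `p2` because both of them cite it.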

Fundamental Characteristics
- Identifying schema elements
  - To avoid linearly scanning all the tables
  - Indexing structure
    - Inverted index
- Processing queries
  - Keyword query processing
    - Making the best of the lack of syntax in query keywords
  - Formalizing internal queries
    - e.g. SQL
- Modeling answers
  - Logical unit of retrieval is not a document
    - e.g. Directed Acyclic Graph (DAG)
- Ranking answers
  - Assign each answer a single score that can reflect the semantics of the underlying schema
  - Order the returned answers

[Figure: keywords k1 to k4 enter a search system (Indexing, Processing, Model, Ranking) that sits on top of the RDBMS and the RDB.]

Research Dimensions
- Characteristics
  - Model
  - Processing
  - Indexing
  - Ranking
- Dimensions
  - Data representation
  - Query representation
  - Efficient processing
  - Top-k query processing
  - Indexing structure
  - Ranking
  - Presentation

Data representation (1/4)
- Graph model
  - Data graph
  - Schema graph

[Figure, schema graph: Author(AuthorID, AuthorName, …), Paper(PaperID, PaperName, …), Writes(AuthorID, PaperID, …), Cites(Citing, Cited, …).]

[Figure, data graph, sample tuples:
  Paper (PaperID, PaperName): (J.H.Park08, Web Content Summarization Using …)
  Writes (AuthorID, PaperID): (JHPark, J.H.Park08), (SGLee, S.G.Lee08)
  Author (AuthorID, AuthorName): (JHPark, Jaehui Park), (SGLee, Sang-goo Lee)]

Data representation (2/4)
- Data graph
  - Efficient graph traversing
    - Reduces search time
  - Finding an optimal answer
    - NP-hard: Steiner tree problem
      - Heuristics
  - Size problem
    - Too large to fit into main memory
  - Maintenance problem
    - Not appropriate for update-intensive databases

[Figure: keywords are matched by traversing the data graph built from the RDB.]

Data representation (3/4)
- Schema graph
  - Smaller size
    - Scales well for huge databases
  - Utilizes underlying RDBMS facilities
    - e.g. database indexes on columns
  - Exploiting the schema of the underlying database
    - Generating optimal internal queries: SQL
    - Evaluation of queries

Example
  Query keywords: Jaehui, Relational Database
  Candidate join queries:
    Tmp1: select * from Paper, Writes where Paper.PaperName = 'Relational Database' AND …
    Tmp2: select * from Tmp1, Author where … Author.AuthorName = 'Jaehui' AND …

[Figure: query keywords are matched by traversing the schema graph of the RDB.]
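The candidate join queries on this slide elide their join conditions. The sketch below shows one way such a candidate could be evaluated end to end with Python's built-in `sqlite3` module. The sample rows, the `LIKE` matching (the slide uses equality), and the `run_candidate` helper are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Author (AuthorID TEXT PRIMARY KEY, AuthorName TEXT);
    CREATE TABLE Paper  (PaperID TEXT PRIMARY KEY, PaperName TEXT);
    CREATE TABLE Writes (AuthorID TEXT REFERENCES Author,
                         PaperID  TEXT REFERENCES Paper);
    INSERT INTO Author VALUES ('JHPark', 'Jaehui Park'), ('SGLee', 'Sang-goo Lee');
    INSERT INTO Paper  VALUES ('P1', 'Keyword Search in Relational Database'),
                              ('P2', 'Web Content Summarization');
    INSERT INTO Writes VALUES ('JHPark', 'P1'), ('SGLee', 'P1'), ('JHPark', 'P2');
""")

# One candidate join path read off the schema graph: Author - Writes - Paper.
CANDIDATE_JOIN = """
    SELECT a.AuthorName, p.PaperName
    FROM Author AS a
    JOIN Writes AS w ON w.AuthorID = a.AuthorID
    JOIN Paper  AS p ON p.PaperID = w.PaperID
    WHERE a.AuthorName LIKE ? AND p.PaperName LIKE ?
"""

def run_candidate(author_keyword, paper_keyword):
    """Evaluate the candidate join with each keyword bound to one column."""
    params = (f"%{author_keyword}%", f"%{paper_keyword}%")
    return conn.execute(CANDIDATE_JOIN, params).fetchall()

rows = run_candidate("Jaehui", "Relational Database")
```

Because the query runs inside the RDBMS, it can use whatever column indexes and join algorithms the engine provides, which is the advantage the slide attributes to the schema-graph approach. A full system would enumerate several such join paths and keyword-to-column assignments rather than the single one hard-coded here.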

Data representation (4/4)
- Graph model
  - A logical unit of information
    - Subgraph
      - A set of multiple nodes joined together
      - May include some tuples that do not contain any query keywords
  - Weighting scheme
    - Edges
      - Distance (or proximity)
      - Join operations
    - Nodes
      - Importance (or authority)

[Figure: a data graph over tuples T1 to T6 with keywords K1 to K3, followed by three candidate answer subgraphs: {T1, T2, T3, T4}, {T1, T2, T3, T5, T6}, and {T1, T3, T6}.]

Ranking
- Relevance
  - Answer size
    - Minimal subgraph including all the query keywords
    - Distance as the semantic closeness between objects
      - The distance between an entity and its attributes
      - The distance between tuples in the same table
      - The distance between tuples related through primary and foreign keys
  - Term frequency
    - Standard IR weighting method
      - TF-IDF
      - Text databases (e.g. user complaints, product descriptions, book reviews, etc.)
- Importance
  - Authority
    - Authority transfer graph
      - Nodes with incoming links from high-authority nodes are assumed to have higher importance
    - Specificity problem
      - Specific results should be ranked higher than general ones
      - e.g. the InverseObjectRank algorithm

[Figure: the Author / Writes / Paper / Cites schema graph with sample tuples: authors Jane and Tom, papers "Tree Traverse algorithm …" and "Query Evaluation …".]

Efficient processing
- Indexing structure
  - Reducing scan time
    - Granularity levels of schema elements
      - Column level vs. record (or cell) level
  - Reducing computation time
    - Precomputation
      - Edge weights, node weights, relevance scores, etc.
- Query execution technique
  - Top-k query processing
    - Avoiding creating all query results
      - Decide which candidate answers will produce top-k results
      - e.g. Sparse algorithm, Pipeline algorithm

[Figure: two score-sorted lists keyed by ROWID, one with rows a1, a2, a3, … and one with rows b1, b2, b3, …]
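The two score-sorted ROWID lists in the figure suggest the classic threshold-style approach to top-k processing: read the lists in parallel, compute the full score of each newly seen row, and stop as soon as no unseen row could beat the current top k. The sketch below follows that general technique under stated assumptions: scores are non-negative, the aggregate is a plain sum, a row missing from a list contributes 0, and the function and data are hypothetical rather than the Sparse or Pipeline algorithms themselves.

```python
import heapq

def threshold_top_k(lists, k):
    """Top-k rows by summed score over lists sorted by descending score."""
    lookup = [dict(lst) for lst in lists]            # random access by row id
    seen = {}
    for depth in range(max(len(lst) for lst in lists)):
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                row_id, score = lst[depth]
                last_scores.append(score)
                if row_id not in seen:
                    # Full score via random access into every list.
                    seen[row_id] = sum(t.get(row_id, 0.0) for t in lookup)
            else:
                last_scores.append(0.0)              # exhausted list adds nothing
        threshold = sum(last_scores)                 # best score an unseen row can have
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                               # early termination
    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])

# Hypothetical per-keyword score lists, each sorted by descending score.
list_a = [("r1", 0.9), ("r2", 0.8), ("r3", 0.1)]
list_b = [("r2", 0.7), ("r3", 0.6), ("r1", 0.2)]
result = threshold_top_k([list_a, list_b], k=1)
```

In this example the loop stops after reading two entries from each list: row `r2` already has a total of about 1.5, while any row not yet seen can score at most 0.8 + 0.6 = 1.4.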

Query representation
- Logical operators
  - Conjunction, disjunction
- Type and condition
  - Type
    - Find type, Near type
  - Conditional keywords
    - e.g. Year >

Presentation
- Visualizing search results
  - e.g. tree view
    - Structural level vs. tuple level
- Limiting the maximum size of an answer
- Limiting the maximum number of answers
- …

Summary
- Comparison in a common framework

System     | Data model               | Ranking                  | Efficiency                      | Query representation     | Presentation
-----------+--------------------------+--------------------------+---------------------------------+--------------------------+--------------------
Proximity  | Data-graph               | Distance                 | K-neighborhood distance look-up | Type, Conjunction        | -
DataSpot   | Data-graph               | Number of edges          | -                               | Conjunction              | Table
DBXplorer  | Schema-graph             | Number of joins          | Symbol table                    | Conjunction              | Enumerated rows
BANKS      | Data-graph (directed)    | Edge weight, Node weight | Disk resident index on keyword  | Conjunction              | Dynamic Joined Tree
DISCOVER   | Schema-graph             | Number of joins          | Master Index                    | Conjunction, Disjunction | -
ObjectRank | Schema-graph, Data-graph | Authority                | Master Index                    | Conjunction, Disjunction | -

Future Directions
- Probabilistic model
  - Naïve approaches
    - Rank measures based on the answer size
      - Cannot directly estimate the (probability of) relevance between the query and the retrieved tuples
      - Heuristics perform well
  - Probabilistic model
    - e.g. Bayesian belief network
      - Term-based approach to approximate the optimal answer
      - Modification for dealing with relational databases
      - Dependencies between schema elements
- Efficient query processing
  - Top-k query processing has shown a great impact on performance
    - Ranking function involves aggregation or grouping operators
    - Symbol table design
- Conclusion
  - Various approaches are described according to our understanding
  - We envision the above research directions to be important to pursue

Thank you