Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.



Overview
What is unstructured retrieval? Retrieving data from documents such as journals and articles.
What is structured retrieval? Retrieving data from databases, XML files, etc., that is, sources in which structural relationships between data items exist.

Traditional IR approach
Use keyword frequency and document frequency statistics for the query words to determine the relevance of a document:
– Keyword frequency: number of times a keyword appears in a document
– Document frequency: number of documents in which a keyword appears
A combination of the two is used as the weighting factor.
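A minimal sketch of this kind of weighting, assuming the common tf × log(N/df) combination (the slide does not say which combination is used). The corpus and the function name `tf_idf_scores` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query):
    """Score each document for a keyword query with a simple tf-idf sum.

    docs  : list of token lists (hypothetical toy corpus)
    query : list of query keywords
    The log(N / df) weighting is an assumed, commonly used choice.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each keyword.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    scores = []
    for tokens in docs:
        tf = Counter(tokens)  # keyword frequency within this document
        score = 0.0
        for kw in query:
            if df[kw]:
                score += tf[kw] * math.log(n_docs / df[kw])
        scores.append(score)
    return scores


docs = [
    ["multimedia", "database", "database"],
    ["database", "index"],
    ["network", "routing"],
]
scores = tf_idf_scores(docs, ["multimedia", "database"])
```

The score looks only at counts inside each document, which is why it cannot see how tuples are connected in a relational database.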

Traditional IR techniques are inadequate for relational databases
Traditional IR techniques do not capture the relationships between data sources in a normalized database. The relationships between keywords in a database need to be taken into account. Examples:
– A keyword is in a tuple referenced by many other tuples
– The number of joins that need to be performed to get all keywords in a query

Example DB1

Inproceedings
id   inprocID        title                             procID  year  mon  annote
t1   Adiba1986       Historical Multimedia Databases                 Aug  temporal
t2   Abarbanel1987   Connection Perspective Reform                   May  Intellicorp

Conferences
id   procID  Conference
t3   23      The conference on Very Large Databases (VLDB)
t4   18      ACM Sigmod Conf on management of data

Example DB2

Example
Query = (Multimedia, Database, VLDB)
DB1 will give good results, but the traditional IR model will rank DB2 as the better one because term frequencies are higher in DB2. Hence we need to effectively summarize the relationships between keywords in databases.

Contributions
1) Address the problem of selecting structured data sources for keyword-based queries
2) Propose a method for summarizing relationships between keywords in a database
3) Define metrics to rank source databases for a keyword query based on keyword relationships
4) Evaluate the proposed summarization using real datasets

Measuring Strength of Relationships Between Keywords
The strength of the relationship between two keywords is measured as a combination of two factors:
1) Proximity factor: inverse of distance
2) Frequency factor, given a distance d: number of combinations of exactly d+1 distinct tuples that can be joined in a sequence so that the two keywords appear in the end tuples

Modeling of an RDBMS
Let m = number of distinct keywords in database DB.
Let n = total number of tuples in DB.
Then D is an m × n matrix with one row per keyword and one column per tuple:

          t1   t2   ...  tn
D =  k1
     k2
     :
     km

D represents the presence or absence of a keyword in a tuple (similar to the term-document incidence matrix in VSM).
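A minimal sketch of building such an incidence matrix, assuming each tuple has already been tokenized into a set of keywords. The toy tuples and the helper name `build_incidence` are hypothetical.

```python
def build_incidence(tuples):
    """Return (keywords, tuple_ids, D), where D[i][j] is 1 if keyword i occurs in tuple j."""
    tuple_ids = sorted(tuples)
    keywords = sorted({kw for kws in tuples.values() for kw in kws})
    D = [[1 if kw in tuples[t] else 0 for t in tuple_ids] for kw in keywords]
    return keywords, tuple_ids, D


# Hypothetical toy tuples, each reduced to its set of keywords.
tuples = {
    "t1": {"multimedia", "database"},
    "t2": {"connection", "reform"},
    "t3": {"vldb", "database"},
}
keywords, tuple_ids, D = build_incidence(tuples)
```

Each row of `D` is one keyword and each column is one tuple, matching the m × n layout described above.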

Modeling of an RDBMS Cont’d
Matrix T represents relationships between tuples (for example, foreign key references):

          t1   t2   ...  tn
T =  t1   0    1
     t2   1    0
     :
     tn
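A sketch of how T can be built and then used to find the tuple pairs at each distance, assuming that distance means the length of the shortest join path between two tuples. The function names and the four-tuple chain are hypothetical.

```python
from collections import deque


def build_adjacency(n, links):
    """Symmetric 0/1 matrix T: T[i][j] = 1 when tuples i and j are directly joinable."""
    T = [[0] * n for _ in range(n)]
    for i, j in links:
        T[i][j] = T[j][i] = 1
    return T


def pairs_by_distance(T, max_d):
    """Map d -> set of (i, j), i < j, whose shortest join path has exactly d edges."""
    n = len(T)
    result = {d: set() for d in range(1, max_d + 1)}
    for src in range(n):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if dist[u] == max_d:
                continue  # do not expand beyond the maximum distance
            for v in range(n):
                if T[u][v] and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for v, d in dist.items():
            if d >= 1 and src < v:
                result[d].add((src, v))
    return result


# Hypothetical chain of four tuples: 0 - 1 - 2 - 3
T = build_adjacency(4, [(0, 1), (1, 2), (2, 3)])
by_d = pairs_by_distance(T, 2)
```

Breadth-first search records each pair only at its shortest distance, so a pair reachable in one join is not counted again at distance 2.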

Mathematical representation of keyword relationships

Mathematical representation of keyword relationships Cont’d
A Keyword Relationship Matrix (KRM) R represents the relationship between any pair of keywords with respect to δ and K.

Mathematical representation of keyword relationships Cont’d

Example
For two given keywords k1 and k2, and K = 40:
Database A has 5 joining sequences connecting them at distance 1, so score = 5 * (1/2) = 2.5
Database B has 40 joining sequences connecting them at distance 4, so score = 40 * (1/5) = 8
Here B wins.

Example (cont’d)
If we bring K down to 10, then A wins. Thus one may prefer A to B because its results are of better quality. K defines the number of top results users expect from the database.
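The effect of K in this example can be sketched as follows. The surviving slide text does not spell out how K caps the count, so the sketch assumes that only the K closest joining sequences are counted, each weighted by 1/(d+1). That assumption reproduces both outcomes of the example. The function name `relationship_score` is hypothetical.

```python
def relationship_score(freq_by_distance, K):
    """Score a keyword pair from its joining-sequence counts.

    freq_by_distance : dict mapping distance d -> number of joining sequences
    K                : number of top results expected (assumed to cap the count)
    Each counted sequence at distance d contributes 1 / (d + 1).
    """
    score = 0.0
    remaining = K
    for d in sorted(freq_by_distance):  # closest connections first
        if remaining <= 0:
            break
        counted = min(freq_by_distance[d], remaining)
        score += counted * (1.0 / (d + 1))
        remaining -= counted
    return score


score_a_40 = relationship_score({1: 5}, K=40)
score_b_40 = relationship_score({4: 40}, K=40)
score_a_10 = relationship_score({1: 5}, K=10)
score_b_10 = relationship_score({4: 40}, K=10)
```

With K = 40 the scores are 2.5 for A and 8 for B. With K = 10 only ten of B's distant sequences count, so B drops to 2 and A's 2.5 wins.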

Computation of KRM How to compute Few definitions –

Three proven propositions aiding the computation of the KRM

Three proven propositions aiding the Computation of KRM Cont’d

Comparison of frequencies of keyword pairs in DB1 and DB2
Our query was Q = (Multimedia, Database, VLDB).

Frequencies of keyword pairs in DB1
Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   1    1    -    -    -
multimedia:VLDB       0    1    -    -    -
database:VLDB         1    1    -    -    -

Frequencies of keyword pairs in DB2
Keyword pair          d=0  d=1  d=2  d=3  d=4
database:multimedia   0    0    0    0    2
multimedia:VLDB       0    0    0    0    0
database:VLDB         0    0    1    0    0

These tables show that the query words are more closely related in DB1.

Comparison of relationship scores of DB1 and DB2

Keyword pair          DB1   DB2
database:multimedia   1.5
multimedia:VLDB       0.5   0
database:VLDB

Sample computation for DB1 (K=10):
Rel[database, multimedia] = 1 * 1 + (1/2) * 1 = 1.5
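The remaining scores can be worked out the same way from the frequency tables. This is a derivation under stated assumptions, not values read from the slide: each joining sequence at distance d is weighted by 1/(d+1), a "-" entry counts as 0, distances up to 4 are included, and K = 10 is never reached.

```latex
\begin{align*}
\mathrm{rel}_{DB_1}(\text{database},\text{multimedia}) &= 1\cdot 1 + \tfrac{1}{2}\cdot 1 = 1.5\\
\mathrm{rel}_{DB_1}(\text{multimedia},\text{VLDB})     &= 1\cdot 0 + \tfrac{1}{2}\cdot 1 = 0.5\\
\mathrm{rel}_{DB_1}(\text{database},\text{VLDB})       &= 1\cdot 1 + \tfrac{1}{2}\cdot 1 = 1.5\\
\mathrm{rel}_{DB_2}(\text{database},\text{multimedia}) &= \tfrac{1}{5}\cdot 2 = 0.4\\
\mathrm{rel}_{DB_2}(\text{multimedia},\text{VLDB})     &= 0\\
\mathrm{rel}_{DB_2}(\text{database},\text{VLDB})       &= \tfrac{1}{3}\cdot 1 \approx 0.33
\end{align*}
```

The two multimedia:VLDB values agree with the entries that survive in the table, and every pair scores higher in DB1 than in DB2.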

Implementation with SQL
Relation R_D(kId, tId) represents the non-zero entries of the keyword incidence matrix D; kId is the keyword ID and tId is the tuple ID.
R_K(kId, keyword) stores the keyword IDs and keywords (similar to a word dictionary in IR).
Matrices T_1, T_2, T_3, ... (tuple relationship matrices) are represented with relations R_T1, R_T2, R_T3, ...
R_T1: produced by joining pairs of tables
R_T2: produced by self-joining R_T1

Implementation with SQL Cont’d
R_T3 is produced using the following SQL statements:

    -- R_T2 pair ends where an R_T1 pair starts
    INSERT INTO R_T3 (tId1, tId2)
    SELECT s1.tId1, s2.tId2
    FROM R_T2 s1, R_T1 s2
    WHERE s1.tId2 = s2.tId1;

    -- both pairs end at the same tuple
    INSERT INTO R_T3 (tId1, tId2)
    SELECT s1.tId1, s2.tId1
    FROM R_T2 s1, R_T1 s2
    WHERE s1.tId2 = s2.tId2 AND s1.tId1 < s2.tId1;

    -- R_T1 pair ends where an R_T2 pair starts
    INSERT INTO R_T3 (tId1, tId2)
    SELECT s2.tId1, s1.tId2
    FROM R_T2 s1, R_T1 s2
    WHERE s1.tId1 = s2.tId2;

Implementation with SQL Cont’d

    -- both pairs start at the same tuple
    INSERT INTO R_T3 (tId1, tId2)
    SELECT s1.tId2, s2.tId2
    FROM R_T2 s1, R_T1 s2
    WHERE s1.tId1 = s2.tId1 AND s1.tId2 < s2.tId2;

    -- remove pairs already connected at distance 1 or 2
    DELETE a FROM R_T3 a, R_T2 b, R_T1 c
    WHERE (a.tId1 = b.tId1 AND a.tId2 = b.tId2)
       OR (a.tId1 = c.tId1 AND a.tId2 = c.tId2);

In general, R_Td is generated by joining R_Td-1 with R_T1 and excluding the tuples already in R_Td-1, R_Td-2, …, R_T1.

Creation of W_0, W_1, W_2, … (matrices representing frequencies)
W_0 is represented with a relation R_W0(kId1, kId2, freq). A tuple (kId1, kId2, freq) records a pair of keywords (kId1, kId2), with kId1 < kId2, and its frequency (freq) at distance 0, where freq is greater than 0. R_W0 is the result of self-joining R_D(kId, tId).
SQL for creating R_W0:

    INSERT INTO R_W0 (kId1, kId2, freq)
    SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
    FROM R_D s1, R_D s2
    WHERE s1.tId = s2.tId AND s1.kId < s2.kId
    GROUP BY kId1, kId2;

Creation of W_0, W_1, W_2, … (matrices representing frequencies) Cont’d
SQL for creating R_Wd, d > 0:

    INSERT INTO R_Wd (kId1, kId2, freq)
    SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
    FROM R_D s1, R_D s2, R_Td r
    WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2)
        OR (s1.tId = r.tId2 AND s2.tId = r.tId1))
      AND s1.kId < s2.kId
    GROUP BY kId1, kId2;
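The two frequency queries can be tried on a tiny in-memory database. This is a sketch using Python's standard `sqlite3` module; the toy rows (keyword IDs 1 = database, 2 = multimedia, 3 = VLDB, and tuples 1 and 3 joined directly) are hypothetical and loosely follow DB1 from the earlier example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE R_D  (kId INTEGER, tId INTEGER);
CREATE TABLE R_T1 (tId1 INTEGER, tId2 INTEGER);
CREATE TABLE R_W0 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
CREATE TABLE R_W1 (kId1 INTEGER, kId2 INTEGER, freq INTEGER);
""")
# Hypothetical toy data: tuple 1 holds {database, multimedia},
# tuple 3 holds {database, VLDB}.
cur.executemany("INSERT INTO R_D VALUES (?, ?)",
                [(1, 1), (2, 1), (1, 3), (3, 3)])
cur.execute("INSERT INTO R_T1 VALUES (1, 3)")  # tuples 1 and 3 join directly

# Distance 0: keyword pairs that co-occur in the same tuple.
cur.execute("""
INSERT INTO R_W0 (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
FROM R_D s1, R_D s2
WHERE s1.tId = s2.tId AND s1.kId < s2.kId
GROUP BY kId1, kId2
""")

# Distance 1: keyword pairs whose tuples are linked in R_T1.
cur.execute("""
INSERT INTO R_W1 (kId1, kId2, freq)
SELECT s1.kId AS kId1, s2.kId AS kId2, count(*)
FROM R_D s1, R_D s2, R_T1 r
WHERE ((s1.tId = r.tId1 AND s2.tId = r.tId2)
    OR (s1.tId = r.tId2 AND s2.tId = r.tId1))
  AND s1.kId < s2.kId
GROUP BY kId1, kId2
""")

w0 = sorted(cur.execute("SELECT kId1, kId2, freq FROM R_W0").fetchall())
w1 = sorted(cur.execute("SELECT kId1, kId2, freq FROM R_W1").fetchall())
conn.close()
```

On this toy data the pair (multimedia, VLDB) has no row at distance 0 but appears with frequency 1 at distance 1, the same pattern as in the DB1 frequency table.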

Final resulting KRM
The final KRM, R, is stored in a relation R_R(kId1, kId2), consisting of pairs of keywords and their relationship score. It is computed using the formula –
Update issues: the tables storing these matrices can be updated dynamically.

Estimating multi-keyword relationships
Multiple keywords are connected with Steiner trees. Finding a minimum Steiner tree is an NP-complete problem, so most current keyword search algorithms rely on heuristics to find the top-K results. Hence the relationship between multiple keywords is estimated using the derived pairwise keyword relationships described above.

Estimating multi-keyword relationships Cont’d

Database ranking and indexing
With the KR summary, we can effectively rank a set of databases D = {DB1, DB2, …, DBN} for a given keyword query. We can use either a global index or a local index.
Global index:
1. Analogous to an inverted index in IR. Use keyword pairs as key, and as a postings entry
2. To evaluate a query, fetch the corresponding inverted lists and compute the score for each database.
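A minimal sketch of such a global index. The content of a postings entry is not preserved in the slide text, so the sketch assumes each entry is a (database, score) pair, and it assumes per-pair scores are summed into the database score. All names and the toy scores are hypothetical.

```python
from collections import defaultdict


def build_global_index(summaries):
    """Invert per-database summaries into keyword pair -> [(database, score), ...].

    summaries : dict mapping database name -> {(kw_a, kw_b): score}
    The (database, score) postings layout is an assumption.
    """
    index = defaultdict(list)
    for db, pairs in summaries.items():
        for (a, b), score in pairs.items():
            if score > 0:
                index[tuple(sorted((a, b)))].append((db, score))
    return index


def rank_databases(index, query):
    """Fetch the postings list of every query keyword pair and sum scores per database."""
    kws = sorted(set(query))
    totals = defaultdict(float)
    for i in range(len(kws)):
        for j in range(i + 1, len(kws)):
            for db, score in index.get((kws[i], kws[j]), []):
                totals[db] += score
    # Highest score first; ties broken by database name.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical summaries for two databases.
summaries = {
    "DB1": {("database", "multimedia"): 1.5,
            ("multimedia", "vldb"): 0.5,
            ("database", "vldb"): 1.5},
    "DB2": {("database", "multimedia"): 0.4,
            ("database", "vldb"): 0.33},
}
index = build_global_index(summaries)
ranking = rank_databases(index, ["multimedia", "database", "vldb"])
```

Only the postings lists for the query's keyword pairs are touched, so the cost of a query depends on the number of pairs in the query rather than on the number of databases.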

Database ranking and indexing Cont’d
Decentralized index:
1. Each machine can store a subset of the index (that is, keyword pairs and inverted lists)
2. When a query is received at a node, search messages are sent across nodes and the corresponding postings lists are retrieved.

Experiments done to evaluate efficiency of this system
The K-R score is compared with the score from a brute-force method (real_rank) over 82 databases spread across 16 nodes. The effectiveness of this technique has been successfully established over distributed databases.
Definitions used for comparison:

Experiments done to evaluate efficiency of this system

Experiments done to evaluate efficiency of this system Cont’d
Effects of δ (length of joining sequence):
1) Selection performance of keyword queries generally gets better as δ grows larger.
2) Precision and recall values for different δ values tend to cluster into groups.
3) There are big gaps in both precision and recall values when and when is greater

Experiments done to evaluate efficiency of this system Cont’d
Recall and precision of 2-keyword queries using KR summaries and KF summaries

Experiments done to evaluate efficiency of this system Cont’d
Effects of the number of query keywords:
1) Performance of 2-keyword queries is generally better than that of 3-keyword and 4-keyword queries. 5-keyword queries give better recall than 3- and 4-keyword queries as they are more selective.
2) Generally, the difference in recall between queries with different numbers of keywords is smaller than the difference in precision. This shows that the system is effective in assigning high ranks to useful databases, although less relevant or irrelevant databases may also be selected.

Experiments done to evaluate efficiency of this system Cont’d
Comparison of four kinds of estimations (MIN, MAX, SUM, PROD):
SUM and PROD behave similarly and outperform the other two methods. Hence it is more effective to take into account the relationship information of every keyword pair in the query when estimating overall scores.
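The four estimations can be sketched as different ways of aggregating the pairwise scores of a query. The exact formulas are not preserved in the slide text, so this assumes each method applies its operator over all keyword pairs in the query. The function name `estimate` and the toy scores are hypothetical.

```python
from itertools import combinations
from math import prod


def estimate(pair_scores, query, method):
    """Aggregate pairwise relationship scores over all keyword pairs in the query.

    pair_scores : dict {(kw_a, kw_b): score}, keys in alphabetical order
    method      : one of "MIN", "MAX", "SUM", "PROD"
    Missing pairs are treated as score 0 (an assumption).
    """
    kws = sorted(set(query))
    scores = [pair_scores.get(pair, 0.0) for pair in combinations(kws, 2)]
    if not scores:
        return 0.0
    if method == "MIN":
        return min(scores)
    if method == "MAX":
        return max(scores)
    if method == "SUM":
        return sum(scores)
    if method == "PROD":
        return prod(scores)
    raise ValueError(f"unknown method: {method}")


# Hypothetical pairwise scores for one database.
pair_scores = {
    ("database", "multimedia"): 1.5,
    ("database", "vldb"): 1.5,
    ("multimedia", "vldb"): 0.5,
}
query = ["multimedia", "database", "vldb"]
results = {m: estimate(pair_scores, query, m) for m in ("MIN", "MAX", "SUM", "PROD")}
```

MIN and MAX each depend on a single pair, while SUM and PROD change whenever any pair's score changes, which fits the observation that using every pair's information works better.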

Experiments done to evaluate efficiency of this system Cont’d
Recall and precision of K-R summaries using different estimations ( )