Keyword Search on External Memory Data Graphs Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan PVLDB 2008 Reported by: Yiqi Lu.

Presentation transcript:

Keyword Search on External Memory Data Graphs Bhavana Bharat Dalvi, Meghana Kshirsagar, S. Sudarshan PVLDB 2008 Reported by: Yiqi Lu

Background: Graph Model
- Directed graph model for data

Background: Answer Tree Model  Answer Tree  Keyword Query

Background: score function
- The score of an answer tree is a function of its node score and edge score.
- Several score models have been proposed.

Background: keyword search
- Input: keywords and a data graph
- Output: the top-k answer trees
- Algorithm:
  - First, look up an inverted keyword index to get the node-ids of the nodes that contain each keyword.
  - Second, run a graph search algorithm to find trees connecting the keyword nodes found above. The algorithm finds rooted answer trees, which should be generated in ranked order.
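The first step (the inverted keyword index lookup) can be sketched in Python as follows. This is only an illustration: the function names, the whitespace tokenizer and the `node_text` input are assumptions, not the paper's implementation.

```python
from collections import defaultdict


def build_inverted_index(node_text):
    """Map each lowercase term to the set of node-ids whose text contains it.

    node_text is a hypothetical dict: node-id -> text attached to that node.
    """
    index = defaultdict(set)
    for node_id, text in node_text.items():
        for term in text.lower().split():
            index[term].add(node_id)
    return index


def keyword_node_sets(index, keywords):
    """Return one node set S_i per keyword k_i (empty if the keyword is unseen)."""
    return [set(index.get(k.lower(), set())) for k in keywords]
```

The resulting list of node sets is the input that a graph search algorithm would then connect into answer trees.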

Example: backward expanding search
- For each keyword term k_i:
  - First find the set of nodes S_i that contain keyword k_i.
  - Run Dijkstra's shortest-path algorithm, which provides an interface to incrementally retrieve the next nearest node.
- Traverse the graph to find a common vertex from which a forward path exists to at least one node in each set S_i.
- The answer tree's root is then the common vertex, and the keyword nodes are its leaves.
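A minimal Python sketch of this idea, under simplifying assumptions: each Dijkstra run goes to completion instead of being interleaved incrementally, and a candidate root is ranked by the sum of its distances to the keyword sets (lower is better). That ranking is only a stand-in, since the slides note that several score models exist. All function names are hypothetical.

```python
import heapq
from collections import defaultdict


def reverse_graph(edges):
    """edges: dict u -> list of (v, weight) for directed edges u -> v."""
    rev = defaultdict(list)
    for u, nbrs in edges.items():
        for v, w in nbrs:
            rev[v].append((u, w))
    return rev


def backward_distances(rev, sources):
    """Multi-source Dijkstra over reversed edges, starting from one keyword set S_i."""
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in rev.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def best_roots(edges, keyword_nodes, k=1):
    """Return up to k (total_distance, root) pairs for vertices that reach every S_i."""
    rev = reverse_graph(edges)
    per_keyword = [backward_distances(rev, s) for s in keyword_nodes]
    common = set(per_keyword[0])
    for d in per_keyword[1:]:
        common &= set(d)
    scored = sorted((sum(d[r] for d in per_keyword), r) for r in common)
    return scored[:k]
```

Searching backward from the keyword nodes means that a vertex reached by every run has a forward path to at least one node in each S_i, which is exactly the condition for being an answer-tree root.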

Background: external memory search
- Run the search algorithm on an external memory graph representation that clusters nodes into disk pages.
- Naïve migration leads to poor performance: keyword search algorithms designed for in-memory search access a lot of nodes, and these node accesses cause a lot of expensive random I/O when the data is disk resident.

Background: 2-level graph
- Clustering parameters are chosen such that the supernode graph fits into the available amount of memory.
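As a rough illustration of the 2-level representation, the sketch below derives a supernode graph from an innernode graph and a clustering. The names `edges` and `cluster_of` are hypothetical, and taking the minimum innernode edge weight as the supernode edge weight is an assumption made here for illustration; the slides do not state the rule.

```python
def build_supernode_graph(edges, cluster_of):
    """Collapse an innernode graph into a supernode graph.

    edges:      dict u -> list of (v, weight) for directed innernode edges u -> v
    cluster_of: dict innernode -> supernode id (output of some clustering step)
    Returns a dict (supernode_u, supernode_v) -> weight.
    """
    super_edges = {}
    for u, nbrs in edges.items():
        for v, w in nbrs:
            su, sv = cluster_of[u], cluster_of[v]
            if su == sv:
                continue  # edge stays inside one supernode
            key = (su, sv)
            # Assumed rule: keep the smallest weight among the collapsed edges.
            super_edges[key] = min(w, super_edges.get(key, float("inf")))
    return super_edges
```

Only the much smaller supernode graph needs to stay in memory; the innernodes of a supernode are read from disk when that supernode is expanded.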

Background: 2-phase search algorithm

This algorithm does not take temporal locality into account.

Multi-granular graph structure
- This paper proposes a multi-granular graph structure to exploit information present in lower-level nodes that are cache-resident at the time a query is executed.

MG graph
- A hybrid graph.
- A supernode is present either in expanded form (all its innernodes, along with their adjacency lists, are present in the cache) or in unexpanded form (its innernodes are not in the cache).

several types of edges

Supernode answer Pure answer

ITERATIVE EXPANSION SEARCH
- Explore phase: run an in-memory search algorithm on the current state of the multi-granular graph (the multi-granular graph is entirely in memory).
- Expand phase: expand the supernodes found in the top-n results of the explore phase and add them to the input graph to produce an expanded multi-granular graph.

ITERATIVE EXPANSION SEARCH

- Stopping criterion: the algorithm stops at the iteration where all top-k results are pure.
- Node-budget heuristic: stop search when
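The explore/expand loop with the "all top-k results are pure" stopping criterion can be sketched as below. The three callables (`explore`, `supernodes_in`, `expand`) and the `max_iterations` guard are hypothetical placeholders supplied by the caller; the node-budget heuristic is not modeled.

```python
def iterative_expansion_search(graph, explore, supernodes_in, expand,
                               k, n, max_iterations=100):
    """Alternate explore and expand phases until the top-k answers are pure.

    explore(graph)           -> ranked list of answers (hypothetical in-memory search)
    supernodes_in(graph, a)  -> set of unexpanded supernodes that answer a contains
    expand(graph, supernodes)-> new graph with those supernodes expanded
    Assumes n >= k, so an impure top-k always yields something to expand.
    """
    for _ in range(max_iterations):
        answers = explore(graph)                      # explore phase
        top_k = answers[:k]
        if all(not supernodes_in(graph, a) for a in top_k):
            return top_k                              # every top-k answer is pure
        to_expand = set()
        for a in answers[:n]:                         # expand phase
            to_expand |= supernodes_in(graph, a)
        graph = expand(graph, to_expand)
    raise RuntimeError("no pure top-k answers within max_iterations")
```

Note that each iteration calls `explore` from scratch on the expanded graph, which is the restart cost of the iterative scheme.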

ITERATIVE EXPANSION SEARCH
- An assumption: the part of the graph relevant to the query fits in cache.
- This may fail in some cases, for example when the query has many keywords or the algorithm explores a large number of nodes.
- Some supernodes then have to be evicted from the cache based on a cache replacement policy, so some parts of the multi-granular graph may shrink after an iteration.
- Such shrinkage can unfortunately cause a problem of cycles in evaluation.

ITERATIVE EXPANSION SEARCH
- Do not shrink the logical multi-granular graph; instead provide a "virtual memory view" of an ever-expanding multi-granular graph.
- Maintain a list, Top-n-SupernodeList, of all supernodes found in the top-n results of all previous iterations.
- Any node present in Top-n-SupernodeList but not in cache is transparently read into cache whenever it is accessed.
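One way to picture the "virtual memory view" is a bounded cache that may evict supernode contents physically while the set of logically expanded supernodes only grows. The sketch below is an assumption-laden illustration: the class name, the `load_from_disk` callable and the LRU replacement policy are all hypothetical (the slides only say "a cache replacement policy").

```python
from collections import OrderedDict


class SupernodeCache:
    """Bounded cache of expanded supernodes with transparent reload."""

    def __init__(self, capacity, load_from_disk):
        self.capacity = capacity
        self.load_from_disk = load_from_disk  # hypothetical: supernode id -> innernode data
        self.resident = OrderedDict()         # supernodes physically in cache, LRU order
        self.logically_expanded = set()       # plays the role of Top-n-SupernodeList
        self.misses = 0

    def expand(self, sid):
        """Mark a supernode as expanded in the logical graph and fetch it."""
        self.logically_expanded.add(sid)
        return self.get(sid)

    def get(self, sid):
        """Return innernode data, re-reading it from disk if it was evicted."""
        if sid not in self.logically_expanded:
            raise KeyError(f"supernode {sid!r} is still unexpanded")
        if sid in self.resident:
            self.resident.move_to_end(sid)    # mark as most recently used
            return self.resident[sid]
        self.misses += 1
        data = self.load_from_disk(sid)
        self.resident[sid] = data
        if len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used
        return data
```

Because eviction never removes an entry from `logically_expanded`, the graph seen by the search never shrinks, which avoids the cycles in evaluation mentioned above.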

INCREMENTAL EXPANSION SEARCH
- The Iterative Expansion algorithm restarts the search whenever supernodes are expanded.
- This can lead to significantly increased CPU time.
- The Incremental Expansion algorithm instead updates the state of the search algorithm.

Take Backward Expanding Search (BES) as an example

Heuristics to improve performance
- stop-expansion-on-full-cache
- Intra-supernode-weight heuristic: we define the intra-supernode weight of a supernode as the average weight of all innernode → innernode edges within that supernode.
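The intra-supernode weight as defined on this slide is a plain average, which can be computed as in the sketch below. The `edges` and `cluster_of` structures are hypothetical, and returning 0.0 for a supernode with no internal edges is an assumption.

```python
def intra_supernode_weight(edges, cluster_of, supernode):
    """Average weight of innernode -> innernode edges inside one supernode.

    edges:      dict u -> list of (v, weight) for directed innernode edges u -> v
    cluster_of: dict innernode -> supernode id
    """
    weights = [
        w
        for u, nbrs in edges.items()
        if cluster_of[u] == supernode
        for v, w in nbrs
        if cluster_of[v] == supernode
    ]
    # Assumed convention: a supernode with no internal edges gets weight 0.0.
    return sum(weights) / len(weights) if weights else 0.0
```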

Experiment
- Search algorithms compared:
  - Iterative Expansion search
  - Incremental Expansion (Backward) search with different heuristics
  - The in-memory Backward Expanding search
  - The Sparse algorithm from "Efficient IR-Style Keyword Search in Relational Databases"
- A naive approach to external memory search would be to run in-memory algorithms in virtual memory. We have implemented this approach on the supernode graph infrastructure, treating each supernode as a page.

Data sets
- DBLP 2003
- IMDB 2003
- Clustered using the EBFS technique
- Default supernode size is 100 innernodes, corresponding to an average of 7KB on DBLP and 6.8KB on IMDB.
- Supernode contents were stored sequentially in a single file, with an index for random access within the file to retrieve a specified supernode.

Data sets

Clustering result

Cache Management
- Test machine: 3GB RAM and a 2.4GHz Intel Core 2 processor, running Fedora Core 6.
- All results are taken on a cold cache.
- The Linux kernel is forced to drop the page cache, inode cache and dentry cache by executing sync (flush dirty pages back to disk) and then executing echo 3 > /proc/sys/vm/drop_caches.

Queries

Experimental Results
- Incremental search was first implemented without any of the heuristics. It did not perform well, taking unreasonably long times for many queries, so results for this case are not presented.
- Two versions of Incremental expansion: one with and one without the intra-supernode-weight heuristic.

The intra-supernode-weight heuristic reduces the number of cache misses drastically without significantly reducing answer quality.

Comparison With Alternatives