

1 Keyword Search on External Memory Data Graphs
Bhavana Dalvi*, Meghana Kshirsagar#, S. Sudarshan
Indian Institute of Technology, Bombay
*: Current affiliation: Google Inc.
#: Current affiliation: Yahoo Labs.

2 Keyword Search on Graph Data
- Motivation: querying of data from (possibly) multiple data sources
  - E.g. organizational, government, scientific, medical
  - Often no schema, or only a partially defined schema
- Graph data model
  - Lowest common denominator model across relational, HTML, XML, RDF, …
  - Much recent work on extracting and integrating data into a graph model
- Keyword search is a natural way to query such data graphs, especially in the absence of a schema
  - This is the focus of this paper

3 Keyword Search on Graph-Structured Data
- E.g. query: “soumen byron”
- Key differences from IR/Web search:
  - Normalization (implicit/explicit) splits related data across multiple nodes
  - To answer a keyword query we need to find a (closely) connected set of entities that together match all given keywords
[Figure: example data graph with nodes “Focused Crawling …”, “Soumen C.”, “Byron Dom”, “Sudarshan”, “BANKS: Keyword search…” and labels “writes”, “author”, “paper”.]

4 Query/Answer Models on Graph Data
- Query: set of keywords
- Answer: rooted directed tree connecting keyword nodes (e.g. BANKS)
- Answer relevance based on
  - node prestige
  - 1/(tree edge weight)
- Several closely related ranking models
[Figure: answer tree for the query “soumen byron”, with nodes “Focused Crawling”, “Soumen C.”, “Byron Dom” and labels “writes”, “author”, “paper”.]

5 Keyword Search on Graphs
- Goal: efficiently find top-k answers to a keyword query
- Several algorithms proposed earlier
  - Backward expanding search
  - Bidirectional search
  - DPBF, BLINKS, Spark, …
- All of the above algorithms assume the graph fits in memory

6 External Memory Graph Search
- Problem: what if graph size > memory?
  - Motivation: Web crawl graphs, social networks, Wikipedia, data generated by IE from the Web
- Algorithm alternatives:
  - Alternative 1: virtual memory
    - Drawback: thrashing (experimental results later)
  - Alternative 2: SQL
    - Drawback: for relational data only
    - Drawback: not good for top-k answer generation
- Our proposal: use an in-memory graph summary to
  - focus search on relevant parts of the graph
  - avoid IO for the rest of the graph

7 Related Work
- Keyword querying on graphs using precomputed info
  - Idea: avoid search at query time, use only inverted list merge
  - Drawbacks include high space overhead (ObjectRank, EKSO)
- External memory graph traversal
  - Several algorithms (Nodine, Buchsbaum, etc.) that give worst-case guarantees, but require excessive replication
- Shortest path computation in external memory graphs
  - Several algorithms (Shekhar, Chang, etc.)
  - But all depend on properties specific to road networks (large diameter, near planarity, etc.)
- Hierarchical clustering
  - For visualization (Leiserson, Buchsbaum, etc.)
  - For web graph computations (Raghavan and Garcia-M.): 2-level graph clustering

8 Supernode Graph
[Figure: supernode graph; legend: inner node, supernode.]
Edge weights: wt(S1 → S2) = min{wt(i → j) : i ∈ S1, j ∈ S2}
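The edge-weight rule above can be sketched in a few lines of Python. This is only an illustration: the function name `build_supernode_edges`, the `(i, j, weight)` edge tuples and the `cluster_of` mapping are assumptions made for the example, not names from the talk. Skipping edges whose endpoints fall in the same supernode is also an assumption; the slides defer intra-supernode edge weights to the paper.

```python
def build_supernode_edges(edges, cluster_of):
    """Collapse inner-node edges into supernode edges.

    edges:      iterable of (i, j, weight) for each directed edge i -> j
    cluster_of: dict mapping each inner node to its supernode id
    Returns a dict {(S1, S2): weight}, where weight is the minimum
    weight over all edges i -> j with i in S1 and j in S2.
    """
    super_edges = {}
    for i, j, w in edges:
        s1, s2 = cluster_of[i], cluster_of[j]
        if s1 == s2:
            # Assumption: edges inside one supernode are ignored here.
            continue
        key = (s1, s2)
        if key not in super_edges or w < super_edges[key]:
            super_edges[key] = w
    return super_edges


if __name__ == "__main__":
    edges = [("a", "c", 3), ("b", "c", 1), ("c", "a", 5)]
    cluster_of = {"a": "S1", "b": "S1", "c": "S2"}
    print(build_supernode_edges(edges, cluster_of))
```

Taking the minimum makes every supernode edge a lower bound on the cost of any inner-node edge it summarizes, which is what lets a search on the summary avoid underestimating how good an answer might be.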

9 Strawman: 2-Phase Search
- First-attempt algorithm:
  - Phase 1: search on the supernode graph to get top-k results (containing supernodes)
    - Using any search algorithm
    - Expand all supernodes from the supernode results
  - Phase 2: search on this expanded component of the graph to get the final top-k results
- Doesn’t quite work:
  - Top-k on the expanded component may not be top-k on the full graph
  - Experiments show poor recall

10 Multi-Granular Graph Representation
- Original supernode graph is in memory
- Some supernodes are expanded, i.e. their contents are fetched into cache
- Multi-granular graph: a logical graph view containing
  - inner nodes from expanded supernodes
  - unexpanded supernodes
  - edges between these nodes
- Search runs on the resultant multi-granular graph
- Multi-granular graph evolves as execution proceeds and supernodes get expanded

11 Multi-Granular Graph
Edge weights, supernode → inner node:
- wt(S → j) = min{wt(i → j) : i ∈ S}
- wt(j → S): symmetric to the above
[Figure: multi-granular graph over supernodes S1, S2, S3, S4. Key: supernode (unexpanded), inner node, expanded supernode, I-I edge, S-I edge, S-S edge.]
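A minimal sketch of how the logical multi-granular view could be derived, assuming the same hypothetical `(i, j, weight)` edge list and `cluster_of` mapping as a plain in-memory structure, plus an `expanded` set of supernode ids. Inner-node ids and supernode ids are assumed not to collide. A real system would read inner edges only for supernodes already fetched into cache; this sketch reads the full edge list purely to show the weight rule.

```python
def multi_granular_edges(edges, cluster_of, expanded):
    """Build the edge set of the multi-granular graph.

    An endpoint is shown as itself if its supernode is expanded,
    otherwise it is replaced by its (unexpanded) supernode.
    Parallel edges collapse to the minimum weight, which yields
    I-I, S-I and S-S edges from one rule.
    """
    def view(node):
        s = cluster_of[node]
        return node if s in expanded else s

    mg = {}
    for i, j, w in edges:
        u, v = view(i), view(j)
        if u == v:
            # Both endpoints hidden inside the same unexpanded supernode.
            continue
        mg[(u, v)] = min(w, mg.get((u, v), float("inf")))
    return mg


if __name__ == "__main__":
    edges = [("a", "c", 3), ("b", "c", 1), ("c", "d", 2)]
    cluster_of = {"a": "S1", "b": "S1", "c": "S2", "d": "S2"}
    print(multi_granular_edges(edges, cluster_of, expanded={"S2"}))
```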

12 Iterative Expansion Search
[Flowchart:]
1. Explore: generate top-k answers on the current MG graph, using any in-memory search method
2. Top-k answers pure?
   - Yes: output
   - No: expand supernodes in top answers, then return to step 1
[Additional flowchart label: “Edges in top-k answers”.]
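The explore/expand loop in the flowchart might look roughly like the sketch below. The callbacks `search_top_k`, `expand` and `is_supernode` are hypothetical placeholders for whatever in-memory search method and storage layer are in use; an answer is assumed to be an iterable of node ids, and "pure" is taken to mean that an answer contains no supernodes.

```python
def iterative_expansion_search(search_top_k, expand, is_supernode, k):
    """Repeat explore/expand until all top-k answers are pure.

    search_top_k(k): runs any in-memory search on the current
                     multi-granular graph, returns a list of answers
    expand(s):       fetches supernode s and updates the graph view
    is_supernode(n): True if n is an unexpanded supernode
    """
    while True:
        answers = search_top_k(k)  # explore phase, restarts from scratch
        impure = {n for ans in answers for n in ans if is_supernode(n)}
        if not impure:
            return answers  # all top-k answers are pure
        for s in impure:  # expand phase
            expand(s)


if __name__ == "__main__":
    expanded = set()

    def toy_search(k):
        # Toy stand-in: the answer mentions S1 until S1 is expanded.
        return [["a", "x"]] if "S1" in expanded else [["S1", "x"]]

    print(iterative_expansion_search(
        toy_search, expanded.add, lambda n: n.startswith("S"), k=1))
```

Each pass calls the search afresh, which is the repeated-restart cost that the incremental variant later in the talk is meant to avoid.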

13 Iterative Expansion (Cont.)
- Any in-memory search algorithm can be used
- Iteration will terminate
- What if too many nodes are expanded?
  - Eviction of expanded nodes from the MG graph
    - Can lead to non-convergence
  - Evict expanded nodes from cache, but retain them in the logical MG graph and re-fetch as required
    - Can cause thrashing (thrashing control possible)
- Performance evaluation (details later)
  - Significantly reduces IO compared to search using virtual memory
  - BUT: high CPU cost due to multiple iterations, with each iteration starting the search from scratch

14 Incremental Search
- Motivation
  - Repeated restarts of search in iterative search
- Basic idea
  - Search on multi-granular graph
  - Expand supernode(s) in top answer
  - Unlike iterative search:
    - Update the state of the search algorithm when a supernode is expanded, and
    - Continue the search instead of restarting
- State update depends on the search algorithm
  - We present state update for backward expanding search (BANKS, ICDE02/VLDB05)

15 Backward Expanding Search
[Figure: SPI tree for the query “soumen byron”, with nodes “Soumen C.”, “Byron Dom”, “Focused Crawling” and labels “authors”, “paper”, “writes”.]

16 Backward Expanding Search
- Based on Dijkstra’s single-source shortest path algorithm
  - One instance of Dijkstra’s algorithm per keyword
  - Explored nodes: nodes for which the shortest path has already been found
  - Fringe nodes: unexplored nodes adjacent to explored nodes
- Shortest-Path Iterator Tree (SPI-Tree):
  - Tree containing explored and fringe nodes
  - Edge u → v if the (current) shortest path from u to the keyword passes through v
- More details in paper
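A heavily simplified sketch of the idea, under stated assumptions: each per-keyword Dijkstra instance is run to completion rather than interleaved, answers are scored only by the sum of path costs to each keyword (node prestige is ignored), and node ids are strings. The names `backward_dijkstra`, `best_roots` and `in_edges` are illustrative and not taken from BANKS.

```python
import heapq


def backward_dijkstra(in_edges, sources):
    """Dijkstra over reversed edges, starting from one keyword's nodes.

    in_edges[v] is a list of (u, w) pairs, one per edge u -> v.
    Returns (dist, next_hop): dist[u] is the cost of the cheapest path
    from u to any source, and next_hop[u] = v records the SPI-tree edge
    u -> v on that path.
    """
    dist = {s: 0.0 for s in sources}
    next_hop = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    explored = set()
    while heap:
        d, v = heapq.heappop(heap)
        if v in explored:
            continue  # stale heap entry
        explored.add(v)
        for u, w in in_edges.get(v, []):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd  # u is (or stays) a fringe node
                next_hop[u] = v
                heapq.heappush(heap, (nd, u))
    return dist, next_hop


def best_roots(in_edges, keyword_nodes, k):
    """Rank candidate answer roots that reach every keyword."""
    runs = [backward_dijkstra(in_edges, nodes)[0] for nodes in keyword_nodes]
    common = set(runs[0]).intersection(*runs[1:])
    scored = sorted((sum(r[n] for r in runs), n) for n in common)
    return scored[:k]


if __name__ == "__main__":
    # Toy graph: paper -> writes tuple -> author, all edge weights 1.
    in_edges = {
        "soumen": [("w1", 1.0)], "byron": [("w2", 1.0)],
        "w1": [("paper", 1.0)], "w2": [("paper", 1.0)],
    }
    print(best_roots(in_edges, [{"soumen"}, {"byron"}], k=1))
```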

17 Incremental Backward Search
Backward search run on multi-granular graph:
repeat
    find next best answer on current multi-granular graph
    if answer has supernodes
        expand supernode(s)
        update the state of backward search, i.e. all SPI trees, to reflect the state change of the multi-granular graph due to expansion
until top-k answers on current multi-granular graph are “pure” answers

18 State Update on Supernode Expansion
[Figure: SPI tree containing supernode S1. Labels: result containing supernodes; supernode S1 to be expanded; nodes affected by deletion.]

19 Nodes Get Attached
1. Affected nodes get detached
2. Inner nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1
3. Affected nodes get attached (as fringe nodes) to adjacent explored nodes based on shortest path to K1

20 Effect of Supernode Expansion
- Differences from Dijkstra's shortest-path algorithm:
  - For explored nodes:
    - Path costs of explored nodes may increase
    - Explored nodes may become fringe nodes
  - For fringe nodes:
    - Incremental expansion: path costs may increase or decrease
- Invariant
  - SPI trees reflect shortest paths for explored nodes in the current multi-granular graph
- Theorem: incremental backward expanding search generates correct top-k answers

21 Heuristics
- Thrashing control: stop supernode expansion when the cache is full
  - Use only parts of the graph already expanded for further search
- Intra-supernode edge weight
  - Details in paper
- Heuristics can affect recall
  - Recall at or close to 100% for relevant answers, with heuristics, in our experiments (see paper for details)

22 Experimental Setup
- Clustering algorithm to create supernodes
  - Orthogonal to our work
  - Experiments use edge-prioritized BFS (details in paper)
  - Ongoing work: develop better clustering techniques
- All experiments done on cold cache
  - echo 3 > /proc/sys/vm/drop_caches

Dataset  Original Graph Size  Supernode Graph Size  Edges  Superedges
DBLP     99MB                 17MB                  8.5M   1.4M
IMDB     94MB                 33MB                  8M     2.8M

Default cache size (Incr/Iter)  1024 (7MB)
Default cache size (VM, DBLP)   3510 (24MB)
Default cache size (VM, IMDB)   5851 (40MB)

23 Algorithms Compared
- Iterative
- Incremental
- Virtual Memory (VM) Search
  - Use same clustering as for supernode graph
  - Fetch cluster into cache whenever a node is accessed, evicting the LRU cluster if required
  - Search code is unaware of clustering/caching and gets a “virtual memory” view
- Sparse
  - SQL-based approach from Hristidis et al. [VLDB03]
  - Not applicable to graphs without schema
  - Used for comparison, on graphs derived from relational schema
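The VM Search baseline's caching behaviour could be approximated as in the sketch below. The class name `ClusterCache`, the `load_cluster` callback and the `misses` counter are assumptions for illustration; the talk does not describe the actual implementation.

```python
from collections import OrderedDict


class ClusterCache:
    """Fixed-capacity cache of clusters with least-recently-used eviction."""

    def __init__(self, capacity, load_cluster):
        self.capacity = capacity          # maximum number of cached clusters
        self.load_cluster = load_cluster  # hypothetical disk read: id -> data
        self.clusters = OrderedDict()     # ordered from LRU to MRU
        self.misses = 0

    def get(self, cluster_id):
        if cluster_id in self.clusters:
            self.clusters.move_to_end(cluster_id)  # mark as most recent
            return self.clusters[cluster_id]
        self.misses += 1
        if len(self.clusters) >= self.capacity:
            self.clusters.popitem(last=False)      # evict the LRU cluster
        self.clusters[cluster_id] = self.load_cluster(cluster_id)
        return self.clusters[cluster_id]


if __name__ == "__main__":
    cache = ClusterCache(2, lambda cid: f"data-{cid}")
    for cid in ["A", "B", "A", "C", "B"]:
        cache.get(cid)
    print(cache.misses, list(cache.clusters))
```

The search code would call `get(cluster_of[node])` on every node access, so it sees one large graph while the cache decides what stays in memory.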

24 Query Execution Time (top 10 results)
[Figure: bar chart of query execution time (seconds). Bars: Iterative, Incremental and VM, respectively.]

25 Query Execution Time (Last Relevant Result)
[Figure: bar chart of query execution time (seconds). Bars: Iterative, Incremental, VM and Sparse, respectively.]

26 Cache Misses for Different Cache Sizes
[Figure: cache misses at different cache sizes. Series labels: All Incr., All VM.]
Note: Graphs in the paper used wrong cache sizes for VM queries on IMDB (Q8, Q9, Q10 and Q12). The graph above shows corrected results, but there are no significant differences.

27 Conclusions
- Graph summarization coupled with a multi-granular graph representation shows promise for external memory graph search
- Ongoing/future work
  - Applications in distributed memory graph search
  - Improved clustering techniques
  - Extending Incremental to bidirectional search and other graph search algorithms
  - Testing on really large graphs

28 The End Queries?

29 Minor Correction to Paper

Cache size (Incr/Iter)  1024 (7MB)   1536 (10.5MB)  2048 (14MB)
Cache size (VM, DBLP)   3510 (24MB)  4023 (27.5MB)  4535 (31MB)
Cache size (VM, IMDB)   5851 (40MB)  6363 (43.5MB)  6875 (47MB)

For IMDB queries Q8-Q10 and Q12, in the case of VMSearch, cache sizes from DBLP were inadvertently used earlier instead of the cache sizes shown above. Queries were rerun with the correct cache sizes, but there were no changes in the relative performance of Incremental versus VMSearch, on cache misses as well as time taken.