Bidirectional Expansion for Keyword Search on Graph Databases Varun Kacholia Shashank Pandit Soumen Chakrabarti S. Sudarshan.

Bidirectional Expansion for Keyword Search on Graph Databases Varun Kacholia Shashank Pandit Soumen Chakrabarti S. Sudarshan Rushi Desai Hrishikesh Karambelkar

Keyword Search on Graph Representation of Data
- Keyword search on relational, XML, HTML, etc. data
  - BANKS, Discover, DBXplorer, XRank, etc.
- Need to find a (closely) connected set of nodes that together match all given keywords
- Focus of our work: search algorithms to find connections between nodes

Outline
- Data, Query and Response Models
- Backward Search Algorithm
- Bidirectional Search Algorithm
- Experiments
- Related Work
- Conclusions

Graph Data Model
- Data modeled as a directed weighted graph: BANKS [ICDE’02]
- Can model relational, XML, HTML, etc. data
- E.g., DBLP database
  - Node = tuple
  - Edge = foreign key reference
[Figure: example DBLP graph with node labels "Multi-Query Optimization", "Sudarshan", "Prasan Roy", "Soumen" and "BANKS: Keyword search…"; node types: writes, author, paper]
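
The "node = tuple, edge = foreign key reference" mapping can be sketched in a few lines of Python. This is only an illustration: the helper names `build_graph` and `keyword_nodes`, the unit edge weight, and the whitespace-token keyword match are assumptions made here, not details taken from the slides.

```python
from collections import defaultdict


def build_graph(tuples, foreign_keys, weight=1.0):
    """Build a directed weighted graph from relational data.

    tuples: dict mapping a node id to the text of that tuple.
    foreign_keys: iterable of (referencing_id, referenced_id) pairs.
    weight: edge weight; a constant 1.0 is an assumption of this sketch.
    """
    out_edges = defaultdict(list)
    for src, dst in foreign_keys:
        if src not in tuples or dst not in tuples:
            raise KeyError(f"unknown tuple in foreign key ({src}, {dst})")
        out_edges[src].append((dst, weight))
    return out_edges


def keyword_nodes(tuples, keyword):
    """Return the ids of tuples whose text contains the keyword as a token."""
    kw = keyword.lower()
    return {node for node, text in tuples.items() if kw in text.lower().split()}


# Toy DBLP-like data, using labels from the slide's example.
tuples = {
    "a1": "S. Sudarshan",
    "a2": "Prasan Roy",
    "p1": "Multi-Query Optimization",
    "w1": "",  # writes tuple linking a1 and p1
    "w2": "",  # writes tuple linking a2 and p1
}
foreign_keys = [("w1", "a1"), ("w1", "p1"), ("w2", "a2"), ("w2", "p1")]
graph = build_graph(tuples, foreign_keys)
```

Each `writes` tuple references one author and one paper, so it becomes a node with two outgoing edges.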

Graph Data Model (2)
- E.g., XML data
[Figure: XML example with element labels proceedings, paper (1 and 2), title and cite, and text values "Databases" and "Keyword Search"]

Response Model
- Response: minimal, rooted tree connecting keyword nodes
  - Undirected: Discover, DBXplorer
  - Directed: BANKS
- E.g., Sudarshan Roy
[Figure: answer tree with node labels "Multi-Query Optimization", "Sudarshan" and "Prasan Roy"; node types: writes, author, paper]

Response Ranking
- Edge score = $E_A$
  - Smaller tree => higher score
  - E.g., BANKS: $E_A = 1 / \sum(\text{edge weights})$
- Node score = $N_A$
  - Measure of authority of nodes in tree
  - E.g., BANKS: $N_A = \sum(\text{leaf and root node authorities})$
- Overall score = $f(E_A, N_A)$
  - E.g., BANKS: $f(E_A, N_A)$ is built from the product $E_A \cdot N_A$
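
A small worked sketch of the ranking on this slide follows. It assumes the plain product of the edge and node scores as the combining function, which is a simplification chosen here for illustration; the function names and the sample weights are hypothetical.

```python
def edge_score(edge_weights):
    """Edge score: inverse of the total edge weight, so smaller trees score higher."""
    return 1.0 / sum(edge_weights)


def node_score(authorities):
    """Node score: sum of the authorities of the leaf and root nodes."""
    return sum(authorities)


def overall_score(edge_weights, authorities):
    """Combine the two scores; a plain product is assumed in this sketch."""
    return edge_score(edge_weights) * node_score(authorities)


# A tree with four unit-weight edges; authorities of its two leaves and root.
score = overall_score([1, 1, 1, 1], [0.5, 0.25, 0.25])
```

With these made-up numbers the edge score is 1/4, the node score is 1.0, and the combined score is 0.25.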

Finding Answer Trees
- Backward Expanding Search: BANKS [ICDE02]
- Intuition: travel backwards from keyword nodes till you hit a common node
[Figure: query "sudarshan roy" on a graph with node labels "Sudarshan", "Prasan Roy" and "MultiQuery Optimization"; node types: authors, writes, paper]

Backward Search: Algorithm
- Run concurrent single source shortest path iterators from each node matching a keyword
  - Traverse the graph edges in reverse direction
  - Output next nearest node on each get-next() call
- Do best-first search across iterators
- Output node if in the intersection of sets of nodes reached from each keyword
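
The steps above can be sketched as follows. This is a simplified illustration rather than the BANKS implementation: it runs one multi-source shortest-path iterator per keyword (not one per matching node), merges all iterators into a single priority queue to get the best-first order, and reports only answer roots with their distances, not full trees. The function name, the edge-list input format, and the toy graph (including its reverse `p -> w` edges) are assumptions.

```python
import heapq
from collections import defaultdict


def backward_search(edges, keyword_nodes, k=10):
    """Return up to k (root, distances) pairs in the order they are found.

    edges: iterable of (u, v, w) for a directed edge u -> v with weight w.
    keyword_nodes: list with one set of matching nodes per keyword.
    """
    incoming = defaultdict(list)  # v -> [(u, w)], used to walk edges in reverse
    for u, v, w in edges:
        incoming[v].append((u, w))

    n = len(keyword_nodes)
    dist = [dict() for _ in range(n)]  # settled distance to each keyword
    heap = [(0.0, i, s) for i, nodes in enumerate(keyword_nodes) for s in nodes]
    heapq.heapify(heap)

    answers = []
    while heap and len(answers) < k:
        d, i, u = heapq.heappop(heap)  # globally nearest node over all iterators
        if u in dist[i]:
            continue  # already settled for keyword i
        dist[i][u] = d
        # u is an answer root once every keyword's iterator has reached it.
        if all(u in dist[j] for j in range(n)):
            answers.append((u, [dist[j][u] for j in range(n)]))
        for p, w in incoming[u]:
            if p not in dist[i]:
                heapq.heappush(heap, (d + w, i, p))
    return answers


# Toy graph: two writes nodes link authors a1 and a2 to paper p.
edges = [
    ("w1", "p", 1), ("w1", "a1", 1),
    ("w2", "p", 1), ("w2", "a2", 1),
    ("p", "w1", 1), ("p", "w2", 1),
]
first = backward_search(edges, [{"a1"}, {"a2"}], k=1)
```

On the toy graph the first root found is the paper node `p`, at distance 2 from each author.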

Backward Search: Limitations
- Wasteful exploration of graph:
  - Frequently occurring keywords
  - “Hub” nodes in the graph (high in-degree)
[Figure: query “Shashank Sudarshan Database” on a graph with keyword nodes "Database", "Shashank" and "Sudarshan"; schema legend: author, paper, writes]

Bidirectional Search: Motivation

Bidir Search: Intuition
- First cut solution:
  - Don’t go backward if keyword matches many nodes
  - Don’t go backward if node points to a hub
  - Instead explore forward from other keywords

Bidir Search: Example
[Figure: query “Shashank Sudarshan Database” on a graph with keyword nodes "Database", "Shashank" and "Sudarshan"; schema legend: author, paper, writes]

Bidir Search: Issues
- What should threshold for not expanding be?
  - Our solution: prioritize expansion of nodes based on spreading activation to penalize frequent keywords and bushy trees
- How to manage exploration in both directions?

Bidir Search: Spreading Activation
- Node with highest activation explored first
- Every node given an initial activation
  - Gives low activation to frequently occurring keywords
[Figure: keyword “John” with activation 1/5]

Bidir Search: Spreading Activation
- Node with highest activation explored first
- Activation spread to neighbors (μ = 0.3)
  - Gives low activation to neighbors of hubs
[Figure: worked example; surviving labels: 0.3 × 1/5 and 0.7 × 1/5 × 1/4 (other labels truncated)]
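
The arithmetic behind spreading activation can be illustrated with a short sketch. Several things here are assumptions rather than facts from the slides: the initial activation is taken as 1 divided by the number of nodes matching the keyword, the spread share is divided equally among neighbors, and the fraction passed on is an explicit parameter (`spread_fraction`). The surviving figure labels (0.3 × 1/5 and 0.7 × 1/5 × 1/4) are consistent with a node keeping 0.3 of its activation and dividing 0.7 among four neighbors, but that reading is a guess.

```python
def initial_activation(num_matches):
    """Initial activation of a keyword node: lower for frequent keywords."""
    return 1.0 / num_matches


def spread(activation, neighbors, spread_fraction):
    """Split a node's activation into a retained part and equal neighbor shares.

    Returns (retained, {neighbor: share}). A node with many neighbors (a hub)
    gives each neighbor only a small share.
    """
    if not neighbors:
        return activation, {}
    share = activation * spread_fraction / len(neighbors)
    retained = activation * (1.0 - spread_fraction)
    return retained, {n: share for n in neighbors}


# A keyword matching 5 nodes; one of them has 4 neighbors.
a = initial_activation(5)
retained, shares = spread(a, ["n1", "n2", "n3", "n4"], spread_fraction=0.7)
```

Here each neighbor receives 0.7 × 1/5 × 1/4 and the node itself retains 0.3 × 1/5.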

Bidir Search: Iterators
- How to manage exploration in both directions?
- Single backward iterator + single forward iterator with suitable data structures
  - E.g., to keep track of parents of nodes
  - Details in full paper
[Figure: example graph with keyword nodes “A” and “B”; each node is labeled [dist from “A”, dist from “B”], with labels such as [0,∞], [∞,0], [1,∞], [∞,1], [2,∞] and [∞,∞]]

Bidir Search: Algorithm
- Activate matching nodes; insert into backward iterator
- while (iterators are not empty)
  - Choose iterator for expansion in best-first manner
  - Explore node with highest activation
  - Spread activation to neighbors
  - Update path weights (and other data structures)
  - Propagate values to ancestors if necessary
  - Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
- Stop when top-k results are produced

Bidir Search: top-k results
- Results need not be generated “in-order”
- Naïve solution
  - Store results in an intermediate heap
  - Output top k results after mk total results have been generated (m ~ 10)
- Can do better
  - Compute upper bound on score of next result; output answers with a higher score
  - Similar to NRA algorithm (Fagin et al., PODS’01)
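
The naïve buffering strategy can be sketched as below. The function name `naive_top_k`, the `(score, answer)` stream format, and the sample scores are hypothetical; only the "collect m·k results in a heap, then output the best k" idea comes from the slide.

```python
import heapq


def naive_top_k(result_stream, k, m=10):
    """Buffer the first m*k results in a heap, then return the k best.

    result_stream: iterable of (score, answer) pairs in generation order,
    which need not be score order.
    """
    heap = []  # max-heap by score, emulated with negated scores
    for count, (score, answer) in enumerate(result_stream, start=1):
        # count breaks ties so answers themselves are never compared
        heapq.heappush(heap, (-score, count, answer))
        if count >= m * k:
            break
    return [(-neg, answer) for neg, _, answer in heapq.nsmallest(k, heap)]


# Results arrive out of order; with k=2 and m=2 only the first 4 are examined.
stream = [(0.2, "t1"), (0.9, "t2"), (0.5, "t3"), (0.7, "t4"), (0.99, "t5")]
top = naive_top_k(stream, k=2, m=2)
```

The example also shows the weakness of the approach: the best answer `t5` arrives after the m·k cutoff and is missed, which is why a larger m or a score bound is needed.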

Experiments
- Datasets
  - DBLP, IMDB ~ 2 million nodes, 9 million edges
  - US Patent DB ~ 4 million nodes, 15 million edges
- Workload
  - Keywords randomly picked from results of SQL join statements
- Search algorithms
  - MI-Bkwd: original backward search, with an iterator for every node matching a keyword
  - SI-Bkwd: backward search with single backward iterator
  - Bidirec: bidirectional search
- Time taken/nodes explored
  - Measured when 10th answer is generated (or last answer if #answers < 10)
- Origin size
  - #nodes matched by keywords in the query

Experiments (2)
- MI-Bkwd versus SI-Bkwd
- SI-Bkwd gain increases with origin size, # keywords

Experiments (3)
- SI-Bkwd versus Bidirec
- Bidirec gain increases with origin size, # keywords

Experiments (4)
- Precision/Recall experiments
  - Relevant answers are well-defined; can be generated through SQL statements
- Both MI-Backward and Bidirectional show similar performance
  - Recall ~ 100%
  - Precision ~ 100% at near full recall
  - Few irrelevant answers produced before generating all relevant answers
- Bidirectional runs faster, yet minimal loss of relevant results!

Experiments (5)
- Comparison with Sparse: Hristidis et al. [VLDB’03]
  - Generate join expressions leading to query results
  - Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
  - For top-k results: automatically determine required number of join expressions
- Sparse-LB
  - Manually generate required join expressions
  - Sparse needs to do at least this much (and usually a lot more!)
- Bidirectional versus Sparse-LB
  - Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)

Experiments (6)
- SI-Bkwd versus Bidirec: by origin size
- Bidirec gains more with unbalanced origin sizes
- A = (T,S,S,S), B = (M,M,M,M), C = (M,L,L,L), D = (M,M,L,L), E = (T,L,L,L), F = (T,S,M,L), G = (T,M,L,L), H = (T,T,T,L)

Discussion
- Bidirectional search as dynamic per-tuple join ordering
- Related work in this area: Eddies
- Bidirectional search
  - Schema-less
  - Prioritization based on activation instead of selectivity
  - Generate answers in relevance order

Related Work
- Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
  - Use SQL generation, without in-memory data structures
  - Issues: generate join plans, re-use common sub-expressions, etc.
- Keyword querying on XML
  - XRank (Cornell), Schema-Free XQuery (Michigan), …
  - Tree model is too limited
- ObjectRank

Conclusions
- Graph model
  - Convenient common denominator representation
  - Schema-free querying leads to graph search
- Purely backward strategy inadequate
- Bidirectional search with spreading activation performs much better
  - Dynamically choose join order on per-tuple basis

Thank You! Questions??

Future of Keyword Search in DBs
- Next generation of intelligent search will require context information
  - E.g. search , files, calendar,..
  - Information integration will be important
  - Graph structured data will be a key component
- Is there a killer app?
  - Deep web?
- Display of answers
  - Users don’t want to see schema details
  - Can we leverage off existing (Web) apps?

BANKS Future Work
- Applications of BANKS
  - Soumen Chakrabarti, Sunita Sarawagi and students
    - Exploit BANKS to integrate different sources of data
    - Extract information, infer soft links
  - BANKS for personal information management
    - SPIN: Search Personal Information Networks
- Ongoing/future work on BANKS:
  - More sysadmin/user control on ranking
    - One size does not fit all
    - BANKS provides infrastructure
  - Characterize bidirectional search better
    - And find other applications
  - Security

Bidir Search: top-k results (2)
- Compute upper bound on score of next result; output answers with a higher score
- Computing the bound
  - $m_i$ = minimum path length explored backward from keyword $i$
  - Unseen answer node: $1 / (m_1 + m_2 + \dots + m_n)$
  - Visited answer node: suppose reached from first $x$ keywords with distance $d_i$: $1 / [(d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)]$
  - Combine this with max node prestige
- We simply use $1 / (m_1 + m_2 + \dots + m_n)$!
  - Experiments show no significant loss in using this heuristic
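
A minimal sketch of the simplified bound and its use for releasing buffered answers follows. The names `upper_bound` and `release_ready`, the negated-score heap layout, and the sample numbers are assumptions made for illustration. The sketch covers only the edge-score part of the bound and omits node prestige.

```python
import heapq


def upper_bound(min_backward_dist):
    """Bound on the edge score of any answer not yet generated.

    min_backward_dist: for each keyword i, m_i, the minimum path length
    explored backward from that keyword so far.
    """
    total = sum(min_backward_dist)
    return float("inf") if total == 0 else 1.0 / total


def release_ready(buffer, bound):
    """Pop and return buffered answers whose score is strictly above the bound.

    buffer: heapq-ordered list of (-score, answer), so the best answer is first.
    """
    ready = []
    while buffer and -buffer[0][0] > bound:
        neg_score, answer = heapq.heappop(buffer)
        ready.append((-neg_score, answer))
    return ready


# Three buffered answers; the frontier distances are m_1 = 2 and m_2 = 3.
buffer = [(-0.5, "a"), (-0.2, "b"), (-0.1, "c")]
heapq.heapify(buffer)
bound = upper_bound([2, 3])
ready = release_ready(buffer, bound)
```

With these numbers the bound is 1/5, so only the answer scoring 0.5 can be output safely; the other two stay buffered until the frontier distances grow and the bound drops.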