Published by Shona Hoover. Modified over 9 years ago.
1
Bidirectional Expansion for Keyword Search on Graph Databases
http://www.cse.iitb.ac.in/banks/
Varun Kacholia, Shashank Pandit, Soumen Chakrabarti, S. Sudarshan, Rushi Desai, Hrishikesh Karambelkar
2
Keyword Search on Graph Representation of Data
Keyword search on relational, XML, HTML, etc. data
  BANKS, Discover, DBXplorer, XRank, etc.
Need to find a (closely) connected set of nodes that together match all given keywords
Focus of our work
  Search algorithms to find connections between nodes
3
Outline
  Data, Query and Response Models
  Backward Search Algorithm
  Bidirectional Search Algorithm
  Experiments
  Related Work
  Conclusions
4
Graph Data Model
Data modeled as a directed weighted graph: BANKS [ICDE’02]
  Can model relational, XML, HTML, etc. data
E.g., DBLP database
  Node = tuple
  Edge = foreign key reference
[Figure: example DBLP graph with paper nodes "Multi-Query Optimization" and "BANKS: Keyword search…", author nodes "Sudarshan", "Prasan Roy" and "Soumen", and "writes" tuples; legend: writes, author, paper]
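As a minimal sketch of the "node = tuple, edge = foreign key reference" mapping, the snippet below builds an adjacency map from a handful of DBLP-like tuples. The function name `build_graph`, the node identifiers, and the uniform edge weight of 1.0 are assumptions made for illustration; the slides do not specify how weights are assigned.

```python
from collections import defaultdict


def build_graph(tuples, foreign_keys, weight=1.0):
    """Return {node: {neighbor: weight}} with one directed edge per foreign key.

    tuples:       {node_id: text} (hypothetical tuple identifiers)
    foreign_keys: iterable of (referencing_tuple, referenced_tuple)
    weight:       placeholder weight; the real weighting scheme is not given here
    """
    out_edges = defaultdict(dict)
    for src, dst in foreign_keys:
        out_edges[src][dst] = weight
    for node in tuples:  # keep tuples that reference nothing
        out_edges.setdefault(node, {})
    return dict(out_edges)


tuples = {
    "paper:mqo": "Multi-Query Optimization",
    "author:sudarshan": "Sudarshan",
    "author:roy": "Prasan Roy",
    "writes:1": "",
    "writes:2": "",
}
foreign_keys = [
    ("writes:1", "paper:mqo"), ("writes:1", "author:sudarshan"),
    ("writes:2", "paper:mqo"), ("writes:2", "author:roy"),
]
graph = build_graph(tuples, foreign_keys)
```

Each "writes" tuple points at the paper and the author it references, so the relational join path becomes a two-hop path in the graph.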
5
Graph Data Model (2)
E.g., XML data
[Figure: XML example with node labels proceedings, paper (@id = 1), paper (@id = 2), title and cite; text values shown: Databases, Keyword Search, Databases]
6
Response Model
Response: minimal, rooted tree connecting keyword nodes
  Undirected: Discover, DBXplorer
  Directed: BANKS
E.g., "Sudarshan Roy"
[Figure: answer tree over the nodes "Multi-Query Optimization" (paper), "writes", and "Sudarshan" and "Prasan Roy" (author)]
7
Response Ranking
Edge score = $E_A$
  Smaller tree => higher score
  E.g., BANKS: $E_A = 1 / \sum(\text{edge weights})$
Node score = $N_A$
  Measure of authority of nodes in tree
  E.g., BANKS: $N_A = \sum(\text{leaf and root node authorities})$
Overall score = $f(E_A, N_A)$
  E.g., BANKS: $f(E_A, N_A) = E_A \cdot N_A$
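A small worked example of this ranking, taking the edge score as the reciprocal of the total edge weight and the node score as the sum of the leaf and root authorities (the summations are an assumed reading of the slide, and the product is used as the combining function). The function names and the numeric weights and authorities are hypothetical.

```python
def edge_score(edge_weights):
    """E_A: reciprocal of the total edge weight, so smaller trees score higher."""
    return 1.0 / sum(edge_weights)


def node_score(leaf_and_root_authorities):
    """N_A: combined authority of the leaf and root nodes (assumed to be a sum)."""
    return sum(leaf_and_root_authorities)


def overall_score(edge_weights, leaf_and_root_authorities):
    """f(E_A, N_A) = E_A * N_A, the product form shown on the slide."""
    return edge_score(edge_weights) * node_score(leaf_and_root_authorities)


# Two trees with the same node authorities but different sizes.
small_tree = overall_score([1.0, 1.0], [0.5, 0.5, 0.5])            # 0.5 * 1.5
large_tree = overall_score([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5])  # 0.25 * 1.5
```

With equal authorities, the two-edge tree outranks the four-edge tree, which is the "smaller tree => higher score" behaviour.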
8
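Finding Answer Trees
Backward Expanding Search: BANKS [ICDE’02]
Intuition: travel backwards from keyword nodes till you hit a common node
Query: sudarshan roy
[Figure: backward expansion from the author nodes "Sudarshan" and "Prasan Roy" through "writes" tuples to the paper "Multi-Query Optimization"; legend: authors, writes, paper]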
9
Backward Search: Algorithm
Run concurrent single-source shortest-path iterators from each node matching a keyword
  Traverse the graph edges in reverse direction
  Output next nearest node on each get-next() call
Do best-first search across iterators
  Output node if in the intersection of sets of nodes reached from each keyword
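The backward search steps can be sketched roughly as follows: one Dijkstra-style iterator per keyword node walks edges in reverse, a global heap picks whichever iterator has the nearest pending node, and a node is reported once it has been reached from every keyword. This is a simplified illustration, not the BANKS implementation; the function names and the toy graph (including its edge directions and weights) are hypothetical, and the sketch yields only candidate root nodes rather than full answer trees.

```python
import heapq
from collections import defaultdict


def reverse_graph(graph):
    """Flip every edge so the iterators can walk the graph backward."""
    rev = defaultdict(dict)
    for src, nbrs in graph.items():
        rev.setdefault(src, {})
        for dst, w in nbrs.items():
            rev[dst][src] = w
    return rev


def dijkstra_iter(rev, source):
    """Yield (distance, node) in nondecreasing distance from source."""
    dist, done, heap = {source: 0.0}, set(), [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        yield d, u
        for v, w in rev[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))


def backward_search(graph, keyword_nodes):
    """Yield nodes reached backward from at least one node of every keyword."""
    rev = reverse_graph(graph)
    frontier = []  # entries: (distance, iterator index, keyword, node, iterator)
    for kw, nodes in keyword_nodes.items():
        for n in nodes:
            it = dijkstra_iter(rev, n)
            d, u = next(it)  # the first item is always the source itself
            frontier.append((d, len(frontier), kw, u, it))
    heapq.heapify(frontier)
    reached, emitted = defaultdict(set), set()
    while frontier:
        d, idx, kw, u, it = heapq.heappop(frontier)  # best-first across iterators
        reached[u].add(kw)
        if len(reached[u]) == len(keyword_nodes) and u not in emitted:
            emitted.add(u)
            yield u
        nxt = next(it, None)  # the get-next() call on the chosen iterator
        if nxt is not None:
            heapq.heappush(frontier, (nxt[0], idx, kw, nxt[1], it))


# Hypothetical toy graph: the paper points to its "writes" tuples, which point to authors.
toy = {"paper": {"w1": 1.0, "w2": 1.0}, "w1": {"sudarshan": 1.0},
       "w2": {"roy": 1.0}, "sudarshan": {}, "roy": {}}
roots = list(backward_search(toy, {"sudarshan": ["sudarshan"], "roy": ["roy"]}))
```

Because there is one iterator per matching node, a keyword that matches many nodes creates many iterators, which is where the cost of this approach grows.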
10
Backward Search: Limitations
Wasteful exploration of graph:
  Frequently occurring keywords
  “Hub” nodes in the graph (high in-degree)
[Figure: example for the query “Shashank Sudarshan Database”; schema legend: author, writes, paper]
11
Bidirectional Search: Motivation
12
Bidir Search: Intuition
First-cut solution:
  Don’t go backward if keyword matches many nodes
  Don’t go backward if node points to a hub
  Instead explore forward from other keywords
13
Bidir Search: Example
[Figure: example for the query “Shashank Sudarshan Database”; schema legend: author, writes, paper]
14
Bidir Search: Issues
What should the threshold for not expanding be?
  Our solution: prioritize expansion of nodes based on spreading activation, to penalize frequent keywords and bushy trees
How to manage exploration in both directions?
15
Bidir Search: Spreading Activation
Node with highest activation explored first
Every node given an initial activation
  Gives low activation to frequently occurring keywords
[Figure: a node matching “John” with initial activation 1/5]
16
Bidir Search: Spreading Activation
Node with highest activation explored first
Activation spread to neighbors (μ = 0.3)
  Gives low activation to neighbors of hubs
[Figure: a node with activation 1/5 spreading to its neighbors; value labels shown: 0.3 x 1/5 and 0.7 x 1/5 x 1/4; remaining node labels are 0 and 1]
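A rough sketch of one spreading step, following the numbers in the figure labels: a keyword matching five nodes gives each an initial activation of 1/5; the expanded node then keeps 0.3 x 1/5, and each of its four neighbors receives 0.7 x 1/5 x 1/4. Reading μ as the retained fraction, splitting the rest equally among neighbors, and adding received activation to what a neighbor already has are all assumptions of this sketch, as are the function and node names.

```python
def initial_activation(matching_nodes):
    """Split one unit of activation equally over the nodes matching a keyword."""
    share = 1.0 / len(matching_nodes)
    return {node: share for node in matching_nodes}


def spread(activation, node, neighbors, mu=0.3):
    """Return a new activation map after `node` spreads to `neighbors`.

    Assumed reading of the figure: the node keeps mu * a, and the remaining
    (1 - mu) * a is divided equally among its neighbors.
    """
    updated = dict(activation)
    if not neighbors:
        return updated
    a = activation.get(node, 0.0)
    updated[node] = mu * a
    per_neighbor = (1.0 - mu) * a / len(neighbors)
    for nb in neighbors:
        updated[nb] = updated.get(nb, 0.0) + per_neighbor
    return updated


act = initial_activation(["john1", "john2", "john3", "john4", "john5"])
after = spread(act, "john1", ["n1", "n2", "n3", "n4"])
```

Dividing by the number of neighbors is what gives neighbors of hubs low activation, and dividing by the number of matches is what penalizes frequent keywords.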
17
Bidir Search: Iterators
How to manage exploration in both directions?
Single backward iterator + single forward iterator with suitable data structures
  E.g., to keep track of parents of nodes
  Details in full paper
[Figure: example graph with nodes numbered 1 to 7 and keyword nodes “A” and “B”; each node is annotated with [dist from “A”, dist from “B”], with labels such as [0,∞], [∞,0], [1,∞], [∞,1], [2,∞] and [∞,∞]; overlaid labels “[2,3 ∞]” and “[∞,∞ 2]” appear to show distances being updated during the search]
18
Bidir Search: Algorithm
Activate matching nodes; insert into backward iterator
while (iterators are not empty)
  Choose iterator for expansion in best-first manner
  Explore node with highest activation
  Spread activation to neighbors
  Update path weights (and other data structures)
  Propagate values to ancestors if necessary
  Insert nodes explored in the backward direction into the forward iterator /* for future forward exploration */
Stop when top-k results are produced
19
Bidir Search: top-k results
Results need not be generated “in-order”
Naïve solution
  Store results in an intermediate heap
  Output top k results after mk total results have been generated (m ~ 10)
Can do better
  Compute upper bound on score of next result; output answers with a higher score
  Similar to NRA algorithm (Fagin et al., PODS’01)
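The naïve buffering strategy can be sketched as below: results go into a heap as they arrive in arbitrary order, and the best k are released once m·k results have been seen (or the stream ends). The function name and the sample stream are hypothetical.

```python
import heapq


def naive_top_k(results, k, m=10):
    """Buffer up to m*k (score, answer) pairs, then return the k best by score."""
    heap = []
    for count, (score, answer) in enumerate(results, start=1):
        # Negate the score because heapq is a min-heap; count breaks ties.
        heapq.heappush(heap, (-score, count, answer))
        if count >= m * k:
            break
    top = []
    while heap and len(top) < k:
        neg_score, _, answer = heapq.heappop(heap)
        top.append((-neg_score, answer))
    return top


stream = [(0.2, "t1"), (0.9, "t2"), (0.5, "t3"), (0.7, "t4"), (0.1, "t5")]
best = naive_top_k(iter(stream), k=2, m=2)  # stops after 4 results
```

The output is only approximately the true top k: a better answer arriving after the first m·k results is never seen, which is why the bound-based approach is the stronger option.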
20
Experiments
Datasets
  DBLP, IMDB: ~2 million nodes, 9 million edges
  US Patent DB: ~4 million nodes, 15 million edges
Workload
  Keywords randomly picked from results of SQL join statements
Search algorithms
  MI-Bkwd: original backward search; iterator for every node matching a keyword
  SI-Bkwd: backward search with single backward iterator
  Bidirec: bidirectional search
Time taken / nodes explored
  Measured when 10th answer is generated (or last answer if #answers < 10)
Origin size
  #nodes matched by keywords in the query
21
Experiments (2) MI-Bkwd versus SI-Bkwd SI-Bkwd gain increases with origin size, # keywords
22
Experiments (3) SI-Bkwd versus Bidirec Bidirec gain increases with origin size, # keywords
23
Experiments (4)
Precision/recall experiments
  Relevant answers are well-defined; can be generated through SQL statements
Both MI-Backward and Bidirectional show similar performance
  Recall ~ 100%
  Precision ~ 100% at near full recall
  Few irrelevant answers produced before generating all relevant answers
Bidirectional runs faster, yet minimal loss of relevant results!
24
Experiments (5)
Comparison with Sparse: Hristidis et al. [VLDB’03]
  Generate join expressions leading to query results
  Use DB-provided scores for ranking tuples and aggregate them to rank answer trees
  For top-k results: automatically determine required number of join expressions
Sparse-LB
  Manually generate required join expressions
  Sparse needs to do at least this much (and usually a lot more!)
Bidirectional versus Sparse-LB
  Bidirectional outperforms by a factor of ~ 3 (esp. when #joins is large)
25
Experiments (6)
SI-Bkwd versus Bidirec: by origin size
Bidirec gains more with unbalanced origin sizes
Origin-size combinations:
  A = (T,S,S,S)
  B = (M,M,M,M)
  C = (M,L,L,L)
  D = (M,M,L,L)
  E = (T,L,L,L)
  F = (T,S,M,L)
  G = (T,M,L,L)
  H = (T,T,T,L)
26
Discussion
Bidirectional search as dynamic per-tuple join ordering
Related work in this area: Eddies
Bidirectional search
  Schema-less
  Prioritization based on activation instead of selectivity
  Generate answers in relevance order
27
Related Work
Keyword querying on relational data: Discover (UCSD), DBXplorer (Microsoft)
  Use SQL generation, without in-memory data structures
  Issues: generate join plans, re-use common sub-expressions, etc.
Keyword querying on XML
  XRank (Cornell), Schema-Free XQuery (Michigan), …
  Tree model is too limited
ObjectRank
28
Conclusions
Graph model
  Convenient common denominator representation
Schema-free querying leads to graph search
  Purely backward strategy inadequate
  Bidirectional search with spreading activation performs much better
  Dynamically choose join order on per-tuple basis
29
Thank You! Questions??
30
Future of Keyword Search in DBs
Next generation of intelligent search will require context information
  E.g., search email, files, calendar, ...
  Information integration will be important
  Graph structured data will be a key component
Is there a killer app?
  Deep web?
Display of answers
  Users don’t want to see schema details
  Can we leverage off existing (Web) apps?
31
BANKS Future Work
Applications of BANKS
  Soumen Chakrabarti, Sunita Sarawagi and students
    Exploit BANKS to integrate different sources of data
    Extract information, infer soft links
  BANKS for personal information management
    SPIN: Search Personal Information Networks
Ongoing/future work on BANKS:
  More sysadmin/user control on ranking
    One size does not fit all
    BANKS provides infrastructure
  Characterize bidirectional search better
    And find other applications
  Security
32
Bidir Search: top-k results (2)
Compute upper bound on score of next result; output answers with a higher score
Computing the bound
  $m_i$ = minimum path length explored backward from keyword $i$
  Unseen answer node: $1/(m_1 + m_2 + \dots + m_n)$
  Visited answer node: suppose reached from first $x$ keywords with distances $d_i$
    $1/[(d_1 + d_2 + \dots + d_x) + (m_{x+1} + m_{x+2} + \dots + m_n)]$
  Combine this with max node prestige
We simply use $1/(m_1 + m_2 + \dots + m_n)$!
  Experiments show no significant loss in using this heuristic
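The two bounds and the release rule can be sketched as follows. The function names are hypothetical, the combination with node prestige is left out, and returning infinity while all minimum path lengths are still zero is an assumption added to avoid a division by zero.

```python
def unseen_bound(m):
    """Upper bound for an unseen answer node: 1 / (m_1 + ... + m_n)."""
    total = sum(m)
    return float("inf") if total == 0 else 1.0 / total


def visited_bound(d, m):
    """Upper bound for a node already reached from the first x = len(d) keywords.

    d: distances d_1..d_x from the keywords that reached the node
    m: minimum backward path lengths m_1..m_n for all n keywords
    """
    total = sum(d) + sum(m[len(d):])
    return float("inf") if total == 0 else 1.0 / total


def ready_to_output(buffered, m):
    """Release buffered (score, answer) pairs that beat the unseen-node bound."""
    bound = unseen_bound(m)
    return [(score, answer) for score, answer in buffered if score > bound]


m = [2.0, 2.0]
released = ready_to_output([(0.5, "tree-a"), (0.2, "tree-b")], m)
```

Here the bound is 1/(2 + 2) = 0.25, so "tree-a" can be output immediately while "tree-b" stays buffered until the frontier moves further out and the bound drops.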