Keyword Searching and Browsing in Databases using BANKS. Charuta Nakhe. Joint work with: Arvind Hulgeri, Gaurav Bhalotia, Soumen Chakrabarti, S. Sudarshan. I.I.T. Bombay. 2/22/2019
Motivation: Keyword search of documents on the Web has been enormously successful: simple and intuitive, no need to learn any query language. Database querying using keywords is desirable. SQL is not appropriate for casual users. Form interfaces are cumbersome: they require a separate form for each type of query, which is confusing for casual users of Web information systems, and they are not suitable for ad hoc queries.
Motivation: Many Web documents are dynamically generated from databases, e.g. catalog data. Keyword querying of generated Web documents may miss answers that need to combine information on different pages, and suffers from duplication overheads.
Examples of Keyword Queries: On a railway reservation database: “mumbai bangalore”. On a university database: “database course”. On an e-store database: “camcorder panasonic”. On a book store database: “sudarshan databases”.
Differences from IR/Web Search: Related data is split across multiple tuples due to normalization, e.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id). Different keywords may match tuples from different relations. What joins are to be computed can only be decided on the fly.
Connectivity: Tuples may be connected by foreign key and object references, inclusion dependencies and join conditions, implicit links (shared words), etc. We would like to find sets of (closely) connected tuples that match all given keywords.
Basic Model: The database is modeled as a graph. Nodes = tuples. Edges = references between tuples (foreign keys, inclusion dependencies, ...). Edges are directed. [Figure: paper nodes “BANKS: Keyword search…” and “MultiQuery Optimization”, writes nodes, and author nodes “Charuta”, “S. Sudarshan”, “Prasan Roy”.]
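The graph model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not BANKS code: the tuple identifiers, the `build_graph` helper, and the adjacency-list layout are assumptions made for the example; only the relation names and values come from the slide's figure.

```python
from collections import defaultdict

# Hypothetical tuples, keyed by (relation, id). Each tuple is one node.
nodes = {
    ("paper", 1): "MultiQuery Optimization",
    ("author", 1): "S. Sudarshan",
    ("author", 2): "Prasan Roy",
    ("writes", 1): "author 1 wrote paper 1",
    ("writes", 2): "author 2 wrote paper 1",
}

# Directed edges follow foreign key references: each writes tuple
# references one author tuple and one paper tuple.
foreign_keys = [
    (("writes", 1), ("author", 1)),
    (("writes", 1), ("paper", 1)),
    (("writes", 2), ("author", 2)),
    (("writes", 2), ("paper", 1)),
]


def build_graph(edges):
    """Return (out_edges, in_edges) adjacency lists for directed edges."""
    out_edges = defaultdict(list)
    in_edges = defaultdict(list)
    for source, target in edges:
        out_edges[source].append(target)
        in_edges[target].append(source)
    return out_edges, in_edges


out_edges, in_edges = build_graph(foreign_keys)
```

Keeping the incoming adjacency list alongside the outgoing one is a convenience for the indegree-based weights and the reverse traversal described on later slides.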
Answer Example: Query: sudarshan roy. [Figure: answer tree in which the paper “MultiQuery Optimization” is connected through two writes tuples to the authors “S. Sudarshan” and “Prasan Roy”.]
The BANKS Answer Model: Query: set of keywords {k1, k2, .., kn}. Each keyword ki matches a set of nodes Si. Answer: rooted, directed tree connecting nodes, with one node from each Si. The root node has special significance and may be restricted to some relations, e.g. relations representing entities, not relationships. The tree may include intermediate nodes not in any Si and is hence a Steiner tree. Multiple answers are possible. Ranking is based on proximity + prestige.
Edge Directionality: Some popular tuples are connected to many other tuples, e.g. students -> departments -> university. Popular tuples would create misleading shortcuts from every tuple to every other, e.g. every student would be closely linked with every other student via the department/university. Solution: define different forward and backward edge weights. Forward edges are in the direction of the foreign key reference.
Edge Weight: Weight of a forward edge is based on the schema, e.g. citation link weights > “writes” link weights. Weight of a backward edge = indegree of edges pointing to the node. [Figure: example graph with edge weights 3, 1, 1, 1.]
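One possible reading of the backward edge rule, sketched in Python. The `add_backward_edges` helper, the triple layout, and the forward weight of 1 are assumptions for illustration; the slide only states that a backward edge is weighted by the indegree of the node the forward edges point to.

```python
from collections import Counter


def add_backward_edges(forward_edges):
    """forward_edges: list of (source, target, weight) triples.

    Returns the forward edges plus one backward edge per forward edge,
    weighted by the indegree of the node the forward edge points to.
    """
    indegree = Counter(target for _, target, _ in forward_edges)
    backward = [(target, source, indegree[target])
                for source, target, _ in forward_edges]
    return list(forward_edges) + backward


# Three student tuples reference the same department tuple. The forward
# weight of 1 stands in for a schema-defined weight (assumed value).
forward = [("s1", "dept", 1), ("s2", "dept", 1), ("s3", "dept", 1)]
edges = add_backward_edges(forward)
```

With three tuples pointing at `dept`, every backward edge leaving `dept` costs 3, so a path from one student to another through the department is more expensive than the forward edges alone would suggest.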
Edge Weight Scaling: Problem: some backward edges have unduly large weights. Scale edge weights by using log(1 + raw-edge-weight). total-edge-weight = Σ edge-weights, summed over the edges of the answer tree. Edge score E = 1 / total-edge-weight.
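The scaling and the edge score can be written out as a short sketch. The function names are hypothetical, the natural logarithm is an assumption (the slide does not give a base), and rejecting a tree whose scaled weights sum to zero is a choice made here, not something the slide specifies.

```python
import math


def scaled_weight(raw_weight):
    """Log scaling of a raw edge weight; natural log is an assumed base."""
    return math.log(1 + raw_weight)


def edge_score(raw_weights):
    """Edge score E = 1 / (sum of scaled edge weights of an answer tree)."""
    total = sum(scaled_weight(w) for w in raw_weights)
    if total <= 0:
        raise ValueError("edge score is undefined for a zero total weight")
    return 1.0 / total
```

A backward edge with raw weight 1000 contributes about 6.9 after scaling, so a single popular node no longer dominates the total.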
Node Weight: Nodes have prestige weights too. Observation: nodes with intuitively greater prestige tend to have greater indegree. Set node weight = indegree. Problem: nodes with many in-edges result in skewed answers. Subdue extreme node weights by using log(1 + indegree). Node score N = root-node-weight + Σ leaf-node-weights.
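A minimal sketch of the node score under the same assumptions as the edge weights: natural log as the base, and hypothetical function names. It takes the indegree of the root and of each leaf of an answer tree and adds up their subdued weights.

```python
import math


def node_weight(indegree):
    """Prestige weight of a node, subdued with log(1 + indegree)."""
    return math.log(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """Node score N = root node weight + sum of leaf node weights."""
    return node_weight(root_indegree) + sum(
        node_weight(d) for d in leaf_indegrees)
```

Intermediate nodes of the tree do not contribute in this reading, since the slide names only the root and the leaves.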
Combining Scores: Problem: how to combine two independent metrics, node weight and edge weight. Normalize each to 0-1. Combine using a weighting factor λ. Additive: (1 - λ) E + λ N. Multiplicative: E × N^λ. Performance study to compare the alternatives and to find reasonable values for λ.
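The two combination rules can be compared side by side in code. This sketch assumes that E and N are already normalized to the range 0-1, that the weighting factor (called `lam` here) applies to the node score, and that the multiplicative form raises N to that factor; the function names and the default of 0.2 are illustrative choices.

```python
def combine_additive(e, n, lam=0.2):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n


def combine_multiplicative(e, n, lam=0.2):
    """Multiplicative combination: E * N ** lam."""
    return e * (n ** lam)
```

In both forms a small `lam` lets the edge score dominate while prestige still separates answers with similar proximity.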
Finding Answer Trees: Backward Expanding Search Algorithm. Intuition: find vertices from which a forward path exists to at least one node from each Si. Run a concurrent single-source shortest path algorithm from each node matching a keyword: create an iterator for each node matching a keyword, and traverse the graph edges in the reverse direction. Output a node whenever it is in the intersection of the sets of nodes reached from each keyword.
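The following is a simplified sketch of the backward expanding idea, not the BANKS implementation. It assumes edges are given as (source, target, weight) triples with non-negative weights and that backward edges are already part of the list. A single shared heap stands in for the per-node iterators, and each root is reported once with the sum of its shortest distances to every keyword set; the actual system builds and ranks full answer trees.

```python
import heapq
import itertools
from collections import defaultdict


def backward_expanding_search(edges, keyword_sets):
    """Yield (root, total_distance) for nodes that reach every keyword set.

    edges: (source, target, weight) triples, non-negative weights.
    keyword_sets: one set of matching nodes per keyword.
    """
    # For an edge u -> v, backward expansion moves from v to u.
    incoming = defaultdict(list)
    for u, v, w in edges:
        incoming[v].append((u, w))

    tie = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = []
    for i, matches in enumerate(keyword_sets):
        for origin in matches:
            heapq.heappush(heap, (0.0, next(tie), i, origin, origin))

    settled = set()              # (origin, node) pairs already expanded
    nearest = defaultdict(dict)  # node -> {keyword index: distance}
    emitted = set()

    while heap:
        dist, _, i, origin, node = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        # Distances pop in non-decreasing order, so the first arrival
        # from keyword i is the shortest path to that keyword's set.
        nearest[node].setdefault(i, dist)
        if len(nearest[node]) == len(keyword_sets) and node not in emitted:
            emitted.add(node)
            yield node, sum(nearest[node].values())
        for prev, w in incoming[node]:
            if (origin, prev) not in settled:
                heapq.heappush(heap, (dist + w, next(tie), i, origin, prev))
```

Because the function is a generator, a caller can stop after the first few roots instead of exploring the whole graph.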
Backward Expanding Search: Query: sudarshan roy. [Figure: backward expansion from the author nodes “S. Sudarshan” and “Prasan Roy” through writes tuples to the paper “MultiQuery Optimization”.]
Result Ordering: Answer trees may not be generated in relevance order. Solution: best-first search across all iterators, based on path length. Output answers to a buffer. Output the highest ranked answer from the buffer to the user when the buffer is full.
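The buffering step can be sketched as a small generator. The name `reorder_with_buffer`, the (score, tree) pair layout, and the choice to drain the buffer in rank order once the search ends are assumptions; the slide only says that the best buffered answer is released when the buffer is full.

```python
import heapq


def reorder_with_buffer(answers, buffer_size):
    """Yield (score, tree) pairs in approximately descending score order.

    answers: iterable of (score, tree) in generation order.
    buffer_size: number of answers held back before one is released.
    """
    heap = []  # max-heap emulated by negating scores
    for counter, (score, tree) in enumerate(answers):
        heapq.heappush(heap, (-score, counter, tree))
        if len(heap) >= buffer_size:  # buffer is full: release the best
            neg_score, _, best = heapq.heappop(heap)
            yield -neg_score, best
    while heap:  # search finished: drain what is left, best first
        neg_score, _, best = heapq.heappop(heap)
        yield -neg_score, best
```

The output order is a heuristic: a larger buffer gives an ordering closer to the true ranking at the cost of a longer wait for the first answer.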
The BANKS System: BANKS provides keyword search coupled with extensive browsing facilities: schema browsing + data browsing, and graphical display of data. Implemented using Java + servlets. Keyword search response times are typically 1 to 3 seconds on the DBLP database with 100,000 tuples/300,000 edges (P3 600 MHz, 512 MB RAM). Try it out at www.cse.iitb.ac.in/banks/
Example of Browsing in BANKS
Anecdotes: “Mohan”: returns C. Mohan at top based on prestige (number of papers written). “Transaction”: returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations). “Sunita Seltzer”: no common papers, but both have papers with Stonebraker; the system finds this connection.
Effect of Parameters: Log scaling of edge weights worked well. (1 - λ) E + λ N versus E × N^λ made little difference. Best with λ = 0.2 (subdue node weights but not entirely). [Figure: plot labeled EdgeLog.]
Related Work: DataSpot (DTL)/Mercado Intuifind [VLDB 98]: based on patent by Palmon (filed 1995, granted 1998); based on hypergraph model, similar answer model to ours; differences: our model of backward link weights and prestige. Proximity Search [VLDB98]: different model of proximity based on adding up support; no edge weights, prestige; different evaluation algorithm. Information units (linked Web pages) [WWW10]: no directionality, only studied in Web context. Microsoft DBExplorer (this conference): no ranking, based on SQL generation; addresses efficient construction of text indexes. Microsoft English Query.
Conclusions and Future Work: The next big wave: keyword searching and browsing of databases? Future work: Keyword queries on XML. Disambiguating queries by selecting nodes (G.W.Bush: “Bush Jr” or “Bush Sr”) or tree structure (“coauthors” or “cites”). Boolean queries, stemming, thesaurus. Metadata: column/relation names.
Thank You
BANKS Query Result Example: Result of “Soumen Sunita”
Browsing Features: Hyperlinks are automatically added to all displayed results. Template facilities to do a variety of tasks: browsing data by grouping and creating crosstabs (e.g., theses grouped by department and year); hierarchical views of data (nested XML style, even on relational data); graphical displays (bar charts, pie charts, etc.). Templates are generic and can be applied on any data matching the assumed schema, and can be applied after applying selections. New templates can be created by the user, interactively.
Combining Keyword Search and Browsing: Catalog searching applications. Keywords may restrict answers to a small set, then the user needs to browse the answers. If there are multiple answers, hierarchical browsing is required on the answers.
The BANKS System: Available on the web, with (part of) DBLP data: http://www.cse.iitb.ac.in/banks. Connects to any database using JDBC. JDBC metadata features are used to provide schema browsing. No programming needed for customization. Minimal preprocessing of the database to create indices and give weights to links. Extensive set of browsing features. [Figure: architecture with User, HTTP, Web Server + Servlets, BANKS, JDBC, Database.]