XRANK: Ranked Keyword Search over XML Documents. Paper by: Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram. Presentation by: Meghana Kshirsagar, Nitin Gupta, Indian Institute of Technology, Bombay

Outline ● Motivation ● Problem Definition, Query Semantics ● Ranking Function ● A New Datastructure – Dewey Inverted List (DIL) ● Algorithms ● Performance Evaluation

Motivation

Motivation - I ● Why do we need search over XML data? ● Why not use search techniques used on WWW (keyword search on HTML)?

Motivation - II Keyword Search: XML vs. HTML. HTML, structural: ● Links: document-to-document ● Tags: format specifiers. HTML, ranking: ● Result: document ● Page-level ranking ● Proximity: width (distance between words). XML, structural: ● Links: IDREFs and XLinks ● Tags: content specifiers. XML, ranking: ● Result: XML element (a tree) ● Element-level ranking ● Proximity: width and height

Problem Definition, Query Semantics, and Ranking

Problem Definition ● Input: set of keywords ● Output: ranked XML elements ● What is a result? ● How to rank results?

Bird's eye view of the system [Diagram components: Query Keywords, Query Evaluator, Results, Data Structures (DIL), Preprocessing (ElemRank computation), XML doc repository]

What is a result? ● A minimal Steiner tree of XML elements ● The result set consists of the XML elements that contain every query keyword at least once, after excluding the keyword occurrences that fall inside contained results (if any).

[Figure: result 1 and result 2]

Result: Graphical representation [Figure legend: containment edge, ancestor, descendant]

Ranking: Which results to return first? Properties: The Ranking function should ● reflect Result Specificity ● consider Keyword-Proximity ● be Hyperlink Aware Ranking function: f (height, width, link-structure)

[Figure: a less specific result vs. a more specific result]

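Ranking Function. For a single XML element (node): $r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$, where $v_1, \ldots, v_t$ is the containment path from the result element $v_1$ down to the element $v_t$ that directly contains keyword $k_i$.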

Ranking Function Combining ranks in case of multiple occurrences: Overall Rank:

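Semantics of the ranking function: $r(v_1, k_i) = \mathrm{ElemRank}(v_t) \cdot \mathrm{decay}^{t-1}$. The ranking function reflects: ● Specificity (height), through the decay factor ● Proximity ● Link structure, through ElemRank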

ElemRank Computation: adopt PageRank? ● PageRank shortcomings: it fails to capture ✗ bidirectional transfer of “ElemRanks” ✗ discrimination between edge types (containment and hyperlink) ✗ aggregation of “ElemRanks” for reverse containment relationships

ElemRank Computation - I ● $N_e$ = total # of XML elements ● $N_h(u)$ = # hyperlinks from u ● $N_c(u)$ = # children of u ● $E = HE \cup CE \cup CE'$ ● CE' = reverse containment edges ● Consider both forward and reverse ElemRank propagation.

ElemRank Computation - II ● Separate containment and hyperlink edges ● CE = containment edges ● HE = hyperlink edges ● ElemRank(sub-element) ∝ 1 / (# sibling sub-elements)

ElemRank Computation - III ● Sum over the reverse-containment edges, instead of distributing the weight ● $N_d$ = total # of XML documents ● $N_{de}(v)$ = # elements in the XML document containing v ● ElemRank(parent) ∝ Sum(ElemRank(sub-elements))
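The ElemRank formulas themselves did not survive in this transcript, so the following is only a sketch that combines the ingredients listed on the three slides: hyperlink rank split over the outgoing hyperlinks, containment rank split over the children, reverse-containment rank summed into the parent, and a base term scaled by the number of documents and the document size. The function name `elem_rank`, its input dictionaries, and the reading of d1, d2, d3 (the values in the evaluation setup) as the weights of the three edge types are assumptions made for illustration.

```python
def elem_rank(parent, hyperlinks, doc_of, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Iteratively compute an ElemRank-style score (illustrative sketch).

    parent:     dict element -> parent element (None for a document root)
    hyperlinks: dict element -> list of elements it links to
    doc_of:     dict element -> id of the document containing the element
    d1, d2, d3: assumed weights for hyperlink, containment and
                reverse-containment edges
    """
    elements = list(parent)
    children = {e: [] for e in elements}
    for e, p in parent.items():
        if p is not None:
            children[p].append(e)

    n_docs = len(set(doc_of.values()))            # N_d
    doc_size = {}                                 # N_de per document
    for e in elements:
        doc_size[doc_of[e]] = doc_size.get(doc_of[e], 0) + 1

    base = {e: (1 - d1 - d2 - d3) / (n_docs * doc_size[doc_of[e]]) for e in elements}
    rank = {e: 1.0 / len(elements) for e in elements}

    for _ in range(iterations):
        new = dict(base)
        for u in elements:
            for v in hyperlinks.get(u, []):       # hyperlink edges: split over N_h(u)
                new[v] += d1 * rank[u] / len(hyperlinks[u])
            for v in children[u]:                 # containment edges: split over N_c(u)
                new[v] += d2 * rank[u] / len(children[u])
            if parent[u] is not None:             # reverse containment: summed, not split
                new[parent[u]] += d3 * rank[u]
        rank = new
    return rank
```

For a single document with a root and two leaf children and no hyperlinks, the parent ends up ranked above each child, which matches the slide's point that a parent aggregates the ranks of its sub-elements.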

Datastructures and Algorithms

Naïve Algorithm Approach: ● XML element ~ doc ● Use “keyword search on WWW” Limitations: ● Space overhead (in inverted indices) ● Failure to model hierarchical relationships (ancestor~descendant) ● Inaccurate ranking. Need a new data structure that can model hierarchical relationships! Answer: Dewey Inverted Lists

Labeling nodes using Dewey Ids

Dewey Inverted Lists ● One entry per keyword ● Entry for keyword 'k' has the Dewey-IDs of the elements directly containing 'k' ● A simple equi merge-join of the Dewey-ID lists won't work! Need to compute prefixes.
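A minimal sketch of Dewey labeling and of building the per-keyword lists, assuming a Dewey-ID is a tuple that starts with a document id and appends the child position at each level. The function name `build_dil` and the whitespace tokenization are illustrative choices; a full DIL entry would also carry the ElemRank and a position list, which are left out here.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(xml_text, doc_id=0):
    """Map each keyword to the sorted Dewey-IDs of elements directly containing it."""
    root = ET.fromstring(xml_text)
    index = defaultdict(list)

    def visit(elem, dewey):
        # Text that belongs directly to this element: its leading text plus
        # the text that follows each of its children.
        words = (elem.text or "").split()
        for child in elem:
            words += (child.tail or "").split()
        for word in sorted({w.lower() for w in words}):
            index[word].append(dewey)
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (doc_id,))
    return {word: sorted(ids) for word, ids in index.items()}


doc = ("<workshop><title>XML and IR</title>"
       "<paper><title>XQL and XML</title><author>Ricardo</author></paper>"
       "</workshop>")
dil = build_dil(doc)
```

Because tuples compare lexicographically, an ancestor's Dewey-ID is a prefix of, and sorts before, the IDs of all its descendants, which is the property the prefix computation relies on.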

System Architecture

DIL : Query Processing ● Simple equality merge-join will not work ● Need to find LCP (longest common prefix) over all elements with query keyword-match. ● Single pass over the inverted lists suffices! ● Compute LCP while merging the ILs of individual keywords. ● ILs are sorted on Dewey-IDs

Datastructures ● Array of all inverted lists : invertedList[] ● invertedList[i] for keyword 'i' ● each invertedList[i] is sorted on Dewey-ID ● Heap to maintain top-m results : resultHeap ● Stack to store current Dewey-ID, ranks, position List, longest common prefixes : deweyStack

Algorithm on DILs - Abstract
While not all inverted lists have been processed:
● Read the next entry from the DIL having the smallest Dewey-ID
– call this 'currentEntry'
● Find the longest common prefix (lcp) between the stack components and the entry read from the DIL
– lcp (deweyStack, currentEntry)
● Pop non-matching entries from deweyStack; add a result to the heap if appropriate
– check if the current top-of-stack contains all keywords
– if yes, compute OverallRank and put this result onto the heap
– else, non-matching entries are popped one component at a time, updating (rank, posList) on each pop
● Push the non-matching part of 'currentEntry' onto deweyStack
– non-matching components of 'currentEntry.deweyID' are pushed onto the stack
● Update the components of the top entry of deweyStack
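The loop above can be sketched as follows. This is a simplified illustration, not the presenters' implementation: position lists and the proximity factor are omitted, multiple occurrences of a keyword are combined with `max`, and the overall rank is taken as the plain sum of the per-keyword ranks. All of these, and the name `dil_query`, are assumptions.

```python
import heapq


def dil_query(inverted_lists, decay=0.5, top_m=10):
    """Single-pass merge over Dewey-sorted inverted lists (simplified sketch).

    inverted_lists[k] is a list of (dewey_id, elem_rank) pairs for keyword k,
    sorted on dewey_id, where dewey_id is a tuple of ints.
    Returns up to top_m (rank, dewey_id) pairs, best first.
    """
    n = len(inverted_lists)
    tagged = [[(dewey, k, rank) for dewey, rank in entries]
              for k, entries in enumerate(inverted_lists)]
    stack = []      # frames: [dewey component, per-keyword ranks]
    results = []

    def pop_frame():
        component, ranks = stack.pop()
        dewey = tuple(frame[0] for frame in stack) + (component,)
        if all(r > 0 for r in ranks):
            # Contains all keywords: report it, and do not pass its keyword
            # occurrences up to the ancestors.
            results.append((sum(ranks), dewey))
        elif stack:
            parent_ranks = stack[-1][1]
            for k in range(n):
                parent_ranks[k] = max(parent_ranks[k], ranks[k] * decay)

    for dewey, k, rank in heapq.merge(*tagged):   # smallest Dewey-ID first
        lcp = 0
        while lcp < min(len(stack), len(dewey)) and stack[lcp][0] == dewey[lcp]:
            lcp += 1
        while len(stack) > lcp:                   # pop non-matching components
            pop_frame()
        for component in dewey[lcp:]:             # push the non-matching part
            stack.append([component, [0.0] * n])
        top_ranks = stack[-1][1]
        top_ranks[k] = max(top_ranks[k], rank)    # update the top entry

    while stack:
        pop_frame()
    return heapq.nlargest(top_m, results)


xql = [((0, 1, 0), 0.4), ((0, 2, 1), 0.2)]
ricardo = [((0, 1, 1), 0.3), ((0, 3), 0.1)]
results = dil_query([xql, ricardo])
```

With these hypothetical lists, element (0, 1) is reported because its two children cover both keywords, and the root (0,) is reported separately because it still contains both keywords outside that contained result.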

Example Query: “XQL Ricardo”

Algorithm Trace – Step 1 Rank[i] = Rank due to keyword 'i' PosList[i] = List of occurrences of keyword 'i' DIL: invertedList[] DeweyStack push all components and find rank, posL Smallest ID:

Algorithm Trace – Step 2 DIL: invertedList[] DeweyStack find lcp and pop nonmatching components Smallest ID:

Algorithm Trace – Step 3 DIL: invertedList[] DeweyStack updated rank, posL Smallest ID:

Algorithm Trace – Step 4 DIL: invertedList[] DeweyStack push non-matching components Smallest ID:

Algorithm Trace – Step 5 DIL: invertedList[] DeweyStack find lcp, update, finally pop all components Smallest ID:

Problems with DIL ● Scans the entire inverted-list for all keywords before a result is output ● Very inefficient for top-k computation

Other Techniques - RDIL ● Ranked Dewey Inverted List: ● For efficient top-k result computation ● IL is ordered by ElemRank ● Each IL has a B+ tree index on the Dewey-IDs ● Algorithm with RDIL uses a threshold

Algorithm using RDIL (Abstract)
● Choose the next entry from one of the invertedList[] in a round-robin fashion
– say the chosen IL = invertedList[i]
– d = top-ranked Dewey-ID from invertedList[i]
● Find the longest common prefix that contains all query keywords
– Probe the B+ tree index of all other keyword ILs for the longest common prefix
– Claim: with d2 = smallest Dewey-ID in invertedList[j] of query keyword 'j' that is not smaller than d, and d3 = immediate predecessor of d2, lcp = max_prefix (lcp (d, d2), lcp (d, d3))
● Check if 'lcp' is a complete result
● Recompute 'threshold' = sum (ElemRank of the last processed element in each query keyword IL)
● If (rank of the top-k results on the heap) >= threshold, return
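The probe in the claim can be illustrated with a sorted Python list standing in for the B+ tree. The sketch reads d2 as the smallest Dewey-ID in the other keyword's list that is not smaller than d, and d3 as its immediate predecessor; that reading, and the helper names, are assumptions.

```python
from bisect import bisect_left


def common_prefix(a, b):
    """Longest common prefix of two Dewey-IDs given as tuples."""
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return a[:length]


def probe_longest_prefix(d, sorted_ids):
    """Longest prefix that d shares with any Dewey-ID in sorted_ids.

    Only the successor (d2) and predecessor (d3) of d in sorted order need to
    be inspected; bisect on a sorted list plays the role of the B+ tree probe.
    """
    i = bisect_left(sorted_ids, d)
    candidates = []
    if i < len(sorted_ids):
        candidates.append(sorted_ids[i])        # d2
    if i > 0:
        candidates.append(sorted_ids[i - 1])    # d3
    return max((common_prefix(d, c) for c in candidates), key=len, default=())


other_list = [(0, 1, 1), (0, 2, 0), (0, 3)]
prefix = probe_longest_prefix((0, 2, 1, 4), other_list)
```

Here d2 is (0, 3) and d3 is (0, 2, 0), so the deepest element containing both d and an entry of the other list is (0, 2).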

Performance of RDIL ● Works well for queries with highly correlated keywords ● BUT ! becomes equivalent (actually worse) to DIL for totally uncorrelated keywords ● Need an intermediate technique

HDIL ● Uses both DIL and RDIL ● Adaptive strategy: – Start with RDIL – Switch to DIL if performance is bad ● Performance? – Estimated remaining time for RDIL = (m – r ) * t / r ● t = time spent so far ● r = no. of results above threshold so far ● m = desired no. of results – Estimated remaining time for DIL ? ● No. of query-keywords is known ● Size of each IL is known
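The RDIL estimate on this slide is simple arithmetic and can be sketched as below. The slide leaves the DIL estimate open, so it is passed in as a caller-supplied number; the function names and the handling of r = 0 are assumptions.

```python
def estimate_rdil_remaining(m, r, t):
    """Estimated remaining RDIL time: (m - r) * t / r.

    m: desired number of results
    r: number of results above the threshold so far
    t: time spent so far
    """
    if r == 0:
        return float("inf")   # no result yet, so no finite estimate (assumption)
    return (m - r) * t / r


def should_switch_to_dil(m, r, t, dil_estimate):
    """Switch when RDIL is expected to take longer than the supplied DIL estimate."""
    return estimate_rdil_remaining(m, r, t) > dil_estimate
```

For example, with m = 10 desired results, r = 2 found after t = 4.0 time units, the estimate is (10 - 2) * 4.0 / 2 = 16.0 time units.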

HDIL ● Datastructures? – Store full IL sorted on Dewey-ID – Store small fraction of IL sorted on ElemRank – Share the leaf level between IL and B+ tree (in RDIL) – Overhead : top levels of B+ tree

Updating the lists ● Updating is easy ● Insertion: very bad! – techniques from Tatarinov et al. – we've seen a better technique in this course :) – OrdPath

Evaluation ● Criteria: ● no. of query keywords ● correlation between query keywords ● desired no. of query results ● selectivity of keywords ● Setup: ● Datasets used: DBLP, XMark ● d1 = 0.35, d2 = 0.25, d3 = 0.25 ● 2.8GHz Pentium IV + 1GB RAM + 80GB HDD

Performance - 1

Performance - 2

Critique ● New datastructure (DIL) defined to represent hierarchical relationships accurately and efficiently. ● Hyperlinks and IDREFs are considered only while computing ElemRank. Not used while returning results. ● Only containment edges (ancestor-descendant) are considered while computing result trees. ● Works only on trees, can't handle graphs.

The SphereSearch Engine for Unified Ranked Retrieval of Heterogeneous XML and Web Documents. Jens Graupmann, Ralf Schenkel, Gerhard Weikum, Max-Planck-Institut für Informatik. Presentation by: Nitin Gupta, Meghana Kshirsagar, Indian Institute of Technology, Bombay

Why another search engine? ● To cope with diversity in the structures and annotations of the data ● Ranked retrieval paradigm for producing relevance-ordered result lists rather than mere boolean retrieval ● Shortcomings of current search engines: they lack – concept awareness – context awareness (or link awareness) – abstraction awareness – a suitable query language

Concept awareness ● Example: the query researcher max planck yields many results about researchers who work at institutes of the “Max Planck” Society ● A better formulation would be researcher person=“max planck” ● Objective attained by – Transformation to XML – Data annotation

Concept awareness :: Transformation [Figure: example page with the headings “Experiments” and “Settings” and their text, before and after transformation]

Abstraction Awareness ● Example: synonyms, ontologies ● Is a connection to various encyclopedias/wikis possible? ● Objective attained by using – Ontology service: provides quantified ontological information to the system – Preprocessed information based on focused web crawls to estimate statistical correlations between the characteristic words of related concepts

Context Awareness ● A query may not be answered by web search engines, as no single web page may be a match ● Unlike the usual navigation axes in XML, context should go beyond trees ● Consider the graph structure spanned by XLink/XPointer references and href hyperlinks ● Objective attained by – introduction of the concept of a SPHERE

Context Awareness :: Sphere ● What is a sphere? – Relevance of an element for a group of query conditions is not just determined by its own content, but also by the content of other neighboring elements, including linked documents, in an environment - called Sphere - of the element.

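Query Language ● Query $S = (Q, J)$ consists of – a set $Q = \{G_1, \ldots, G_q\}$ of query groups – a set $J = \{J_1, \ldots, J_m\}$ of join conditions ● Each query group $G_i$ consists of – a set of keyword conditions $t_1, \ldots, t_k$ – a set of concept-value conditions $c_1 = v_1, \ldots, c_l = v_l$ ● Each join condition has the form $Q_i.v = Q_j.w$ (exact) or $Q_i.v \sim Q_j.w$ (similarity), where $Q_i$ and $Q_j$ name query groups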

Query Language ● Example 1: German professors who teach database courses and have projects on XML – P(professor, location=~Germany) – C(course, ~databases) – R(~project, ~XML) ● Example 2: Gothic and Romanic churches at the same location – A(gothic, church) – B(romanic, church) – A.location = B.location

Data Model ● Collection $X = (D, L)$ of XML documents D together with a set L of (href, XPointer, or XLink) links between their elements ● Consider all attributes as elements: then the element-level graph $G_E(X) = (V_E(X), E_E(X))$ has the union of all the elements of the documents as nodes and undirected edges between them ● Each edge has a nonnegative weight – 1 for parent-child; λ for links ● A distance function $\delta_X(x, y)$ computes the weight of a shortest path in $G_E(X)$ between x and y
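The distance function can be sketched with Dijkstra's algorithm over the undirected element graph, with weight 1 on parent-child edges and λ on link edges. The function name `shortest_distance`, the edge-list inputs, and the default value chosen for λ are illustrative assumptions.

```python
import heapq


def shortest_distance(tree_edges, links, source, target, link_weight=2.0):
    """Weight of a shortest path between two elements (Dijkstra sketch).

    tree_edges:  (parent, child) pairs, each with weight 1
    links:       (from, to) pairs, each with weight link_weight (the slide's lambda)
    Edges are treated as undirected, as in the data model.
    """
    graph = {}

    def add_edge(x, y, weight):
        graph.setdefault(x, []).append((y, weight))
        graph.setdefault(y, []).append((x, weight))

    for x, y in tree_edges:
        add_edge(x, y, 1.0)
    for x, y in links:
        add_edge(x, y, link_weight)

    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue                              # stale queue entry
        for neighbor, weight in graph.get(node, []):
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf")                           # not connected


tree = [("a", "b"), ("b", "c"), ("x", "y")]
links = [("c", "x")]
```

In this hypothetical collection, elements a and y lie in different documents and are connected only through the link from c to x, so their distance is 1 + 1 + λ + 1.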

Spheres and Query Groups ● Node score $ns(n, t)$ is computed using the Okapi BM25 model ● Similarity condition ~K: compute exp(K) for the keyword. The node score is defined as $\max_{x \in exp(K)} sim(K, x) \cdot ns(n, x)$, where sim(K, x) is the ontological similarity ● Concept value c = v: $ns(n, c{=}v) = 0$ if $name(n) \neq c$, and $ns(n, v)$ otherwise ● Similarity concept value ~c = v: $sim(name(n), c) \cdot ns(n, v)$ ● This is insufficient – in the presence of linked documents – when content is spread over several elements

Spheres and Query Groups ● Sphere $S_d(n)$: set of nodes at distance d from node n ● $s_d(n, t) = \sum_{v \in S_d(n)} ns(v, t)$ ● $s(n, t) = \sum_i s_i(n, t) \cdot \alpha^i$ ● Worked example from the slide figure: s(1,t) = 1 + 4·… + …·… + …·0.125 = …; s(2,t) = 3 + 0·… + …·… + …·0.125 = … ● $s(n, G) = \sum_j s(n, t_j) + \sum_j s(n, c_j{=}v_j)$
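The sphere score for one condition t can be sketched with a breadth-first traversal that sums the node scores at each distance and damps them by a power of α. For simplicity the sketch measures distance in hops (every edge has weight 1), which is an assumption; the function name, the adjacency-dict input and the cutoff parameter `max_distance` are also illustrative.

```python
from collections import deque


def sphere_score(graph, node_score, n, alpha=0.5, max_distance=3):
    """Sum of ns(v, t) * alpha**distance over all nodes within max_distance of n.

    graph:      dict node -> list of neighbouring nodes (undirected)
    node_score: dict node -> ns(node, t) for one fixed condition t
    """
    distance = {n: 0}
    queue = deque([n])
    total = 0.0
    while queue:
        v = queue.popleft()
        total += node_score.get(v, 0.0) * alpha ** distance[v]
        if distance[v] < max_distance:
            for w in graph.get(v, []):
                if w not in distance:
                    distance[w] = distance[v] + 1
                    queue.append(w)
    return total


graph = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2]}
node_score = {1: 1.0, 2: 2.0, 3: 2.0, 4: 4.0}
```

In this made-up graph, node 1 scores 1 on its own, its two neighbours contribute 4 · 0.5, and the node at distance 2 contributes 4 · 0.25, giving 4.0 for a sphere of radius 2.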

Spheres and Query Groups :: Ranking ● Create a connection graph $G(N) = (V(N), E(N))$ ● Weight of an edge between x and y: 0 if x and y are not connected, $1 / (\delta_X(x, y) + 1)$ otherwise ● Compactness $C(N)$ of a potential answer N is the total edge weight of a maximum spanning tree for $G(N)$ ● The score is given by $s(N, S) = \beta \, C(N) + (1 - \beta) \sum_i s(n_i, G_i)$
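Compactness and the combined score can be sketched with Kruskal's algorithm run on edges sorted by descending weight, which yields a maximum spanning tree. The pairwise distances are passed in as a dictionary keyed by node pairs; that representation, the function names and the default β are assumptions.

```python
from itertools import combinations


def compactness(nodes, distance):
    """Total edge weight of a maximum spanning tree of the connection graph.

    distance maps frozenset({x, y}) to the distance between x and y; a missing
    pair means the nodes are not connected (edge weight 0, so it is skipped).
    """
    edges = []
    for x, y in combinations(nodes, 2):
        d = distance.get(frozenset((x, y)))
        if d is not None:
            edges.append((1.0 / (d + 1), x, y))
    edges.sort(key=lambda edge: edge[0], reverse=True)   # heaviest edges first

    leader = {node: node for node in nodes}              # union-find forest

    def find(node):
        while leader[node] != node:
            leader[node] = leader[leader[node]]
            node = leader[node]
        return node

    total = 0.0
    for weight, x, y in edges:
        root_x, root_y = find(x), find(y)
        if root_x != root_y:                             # edge joins two components
            leader[root_x] = root_y
            total += weight
    return total


def answer_score(nodes, distance, sphere_scores, beta=0.5):
    """beta * C(N) + (1 - beta) * sum of the per-group sphere scores."""
    return beta * compactness(nodes, distance) + (1 - beta) * sum(sphere_scores)


dist = {frozenset(("a", "b")): 1, frozenset(("b", "c")): 3, frozenset(("a", "c")): 1}
```

With these made-up distances the edge weights are 0.5, 0.25 and 0.5, so the maximum spanning tree keeps the two edges of weight 0.5 and the compactness is 1.0.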

Spheres and Query Groups :: Joins ● New virtual links form an extended collection $X' = (D, L')$ – Connect the elements that match the join – Similarity join: for $Q_i.v \sim Q_j.w$, consider the sets N(v) (resp. N(w)) of elements that have name v (resp. w) or contain v (resp. w) in their content. For each pair $x \in N(v)$, $y \in N(w)$, add a link {x, y} with weight 1/csim(x, y)

System Architecture. Content is stored in inverted lists with corresponding tf*idf-style term statistics. The indexer stores with each element the corresponding Dewey encoding of its position within the document. Focused web crawls are used to estimate statistical correlations between the characteristic words of related concepts. The current version uses the Dice coefficient.

Query Processor 1. First compute a result list for each query group 2. Add virtual links for join conditions 3. Compute the compactness of a subset of all potential answers of the query in order to return the top-k results. Computing the result list for a query group: 1. Compute a list of results for each of the query keywords and concept-value conditions 2. Candidate nodes: nodes that are at distance at most D from any node that occurs in at least one of the lists. The sphere score is computed only for these nodes, since only these can have a non-zero score 3. For each candidate node N, look up the node scores of the nodes in the sphere of N and add these scores with a proper damping factor

Query Processor ● Virtual links: the processor considers only a limited set of possible end points for efficient computation ● Nodes in the spheres up to distance D around nodes with a nonzero sphere score for any query group – Why? Any other node will have distance at least D+1 to any result node and thus contributes at most $1/((D+1)+1)$ to the compactness, which is negligible – This set of candidate nodes can be computed on the fly ● The set is further reduced by testing join attributes; for example, A.x = B.y results in two sets of potential end points.

Query Processor ● Generating answers – Naïve method: generate all possible potential answers from the answers to query groups, compute connection graphs and compactness, and finally their score – For top-k answers, use Fagin's Threshold Algorithm with sorted lists only ● Input: Sorted list of node scores and pairwise node scores (edges) ● Output: k potential answers with the best scores

Experiments ● Sun V40z, 16GB RAM, Windows 2003 Server, Tomcat environment, Oracle 10g database ● Existing benchmarks: XMach, XMark, INEX, TREC ● Limitations noted for them: designed for XQuery-style exact match; semantically poor tags; does not consider XML at all ● Wikipedia collection (from the Wikipedia project): HTML collection transformed into XML and annotated ● Wikipedia++ collection: extension of Wikipedia with IMDB data, with generated XML files for each movie and actor ● DBLP++ collection: based on the DBLP project, which indexes more than 480,000 publications ● INEX: a set of 12,107 XML documents and a set of queries with and without structural constraints

Experiments Conversion from HTML to XML Dataset Statistics

Experiments ● SSE-basic: basic version limited to keyword conditions using sphere-based scoring ● SSE-CV: basic version plus concept-value conditions ● SSE-QC: CV version plus query groups (full context awareness) ● SSE-Join: full version with all features ● SSE-KW: very restricted version with simple keyword search ● GoogleWiki: Google search restricted to wikipedia.org ● Google~Wiki: Google on wikipedia.org with Google's ~ operator for query expansion ● GoogleWeb: Google search on the entire web ● Google~Web: Google search on the entire web with query expansion

Experiments Aggregated results for Wikipedia

Experiments Aggregated results for Wikipedia++ and DBLP++

Experiments Graph showing the average runtimes for different versions

Thank you