XRANK: Ranked Keyword Search over XML Documents. Ece AKSU, Gökay Burak AKKUŞ.

Presentation transcript:

This Paper...
Describes the architecture, implementation and evaluation of the XRANK system. The contributions of the paper are:
(a) the problem definition and system architecture
(b) an algorithm for computing the ranking of XML elements
(c) new inverted list index structures and associated query processing algorithms
(d) an experimental evaluation of XRANK

Overview
Problem: efficiently producing ranked results for keyword search queries over hierarchical XML documents.
New challenges:
1. Results can be deeply nested XML elements.
2. Ranking is at the granularity of an XML element (not the document).
3. Keyword proximity is more complex.

Overview - 2
This paper presents the XRANK system to handle these features of XML keyword search.
XRANK offers both space and performance benefits.
XRANK generalizes a hyperlink-based HTML search engine such as Google.
XRANK can be used to query both HTML and XML documents.

Keyword Search Querying - 1
Advantage: keyword search is simple. Users do not have to learn a complex query language and can issue queries without any prior knowledge of the structure of the underlying data.
Consequence: the interface is flexible, but queries may not always be precise and can return a large number of results.

Keyword Search Querying - 2
An important requirement for keyword search is to rank the query results so that the most relevant results appear first.
Certain limitations of the HTML data model make such systems ineffective in many domains:
HTML is a presentation language.
HTML cannot capture much semantics.

Keyword Search Querying - 3
The XML data model addresses this limitation by allowing for extensible element tags. (Example: Figure 1)

Querying XML Documents
One approach is a sophisticated query language such as XQuery. It is effective in some cases, but users have to learn a complex query language and understand the schema of the underlying XML.
An alternative approach is XRANK: retain the simple keyword search query interface, and exploit XML's tagged and nested structure during query processing.

New Challenges
Keyword searching over XML introduces many new challenges.
1. The result of a keyword search query can be a deeply nested XML element, so the search should return the 'deepest' node.
2. Ranking is not solely based on hyperlinks. The semantics of containment links (relating parent and child elements) is very different from that of hyperlinks (such as IDREFs and XLinks).

New Challenges - 2
3. The notion of proximity among keywords is more complex. In HTML, proximity among keywords translates directly to the distance between keywords in a document. For XML there is a 2-dimensional proximity metric: keyword distance and ancestor distance.

XML Data Model
XML is a hierarchical format for data representation and exchange.
An XML document consists of a root element, nested sub-elements, attributes and values. XML also supports intra-document and inter-document references.

XML Data Model - 2
Intra-document references are represented using IDREFs.
Inter-document references are represented using XLink.
Both IDREFs and XLinks are referred to as hyperlinks.

Definitions
A collection of hyperlinked XML documents can be defined as a directed graph $G = (N, CE, HE)$, where:
$N$: the set of nodes, $N = NE \cup NV$
$NE$: the set of elements
$NV$: the set of values
$CE$: the set of containment edges relating nodes
$HE$: the set of hyperlink edges relating nodes

Definitions - 2
The edge $(u, v) \in CE$ iff $v$ is a value/nested sub-element of $u$.
The edge $(u, v) \in HE$ iff $u$ contains a hyperlink reference to $v$.
An element $u$ is a sub-element of an element $v$ if $(v, u) \in CE$.
An element $u$ is the parent of node $v$ if $(u, v) \in CE$.
The predicate $\mathit{contains}^*(v, k)$ is true if the node $v$ directly or indirectly contains the keyword $k$.
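
The definitions above can be sketched in Python as a small in-memory graph. This is only an illustration: the class name `XMLGraph`, its field names, the sample node identifiers and the whitespace-based keyword matching are all assumptions, not part of XRANK.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    """Hypothetical in-memory form of G = (N, CE, HE)."""
    elements: set = field(default_factory=set)      # NE
    values: dict = field(default_factory=dict)      # NV: value node id -> text
    containment: set = field(default_factory=set)   # CE: (parent, child) pairs
    hyperlinks: set = field(default_factory=set)    # HE: (source, target) pairs

    def children(self, u):
        return [v for (p, v) in self.containment if p == u]

    def contains_star(self, v, keyword):
        """contains*(v, k): v directly or indirectly contains the keyword."""
        if v in self.values:
            # Assumption: a keyword matches a whitespace-separated token.
            return keyword.lower() in self.values[v].lower().split()
        return any(self.contains_star(c, keyword) for c in self.children(v))


g = XMLGraph()
g.elements |= {"workshop", "paper", "title"}
g.values["t1"] = "Ranked keyword search"
g.containment |= {("workshop", "paper"), ("paper", "title"), ("title", "t1")}
```

Only containment edges are followed by `contains_star`; hyperlink edges are stored but play no part in containment.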

Keyword Query Results
There are two possible semantics for keyword search queries:
Conjunctive keyword query semantics: elements that contain all of the query keywords are returned.
Disjunctive keyword query semantics: elements that contain at least one of the query keywords are returned.
This paper focuses on conjunctive keyword query semantics.

Keyword Query Results - 2
Let $Q = \{k_1, \ldots, k_n\}$.
$R_0 = \{ v \mid v \in NE \wedge \forall k \in Q\, (\mathit{contains}^*(v, k)) \}$ is the set of elements that directly or indirectly contain all of the query keywords.
$\mathrm{Result}(Q) = \{ v \mid \forall k \in Q\ \exists c \in N\, ((v, c) \in CE \wedge c \notin R_0 \wedge \mathit{contains}^*(c, k)) \}$
The condition $c \notin R_0$ ensures that only the most specific results are returned. The definition also ensures that an element that has multiple independent occurrences of the query keywords is returned.
Containment edges (CE) are considered for the result set; hyperlink edges (HE) are considered for ranking.
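
A brute-force reading of these two definitions can be sketched as follows. It is meant only to make the semantics concrete, not to be efficient; the function names, the dictionary-based tree layout and the sample document are assumptions made for illustration.

```python
def subtree_keywords(node, children, text):
    """All tokens directly or indirectly contained in node."""
    words = set(text.get(node, "").lower().split())
    for c in children.get(node, []):
        words |= subtree_keywords(c, children, text)
    return words


def result(query, elements, children, text):
    """Return (R0, Result(Q)) under conjunctive semantics."""
    q = {k.lower() for k in query}
    kw = {v: subtree_keywords(v, children, text)
          for v in set(elements) | set(text)}
    r0 = {v for v in elements if q <= kw[v]}
    res = set()
    for v in elements:
        covered = set()
        for c in children.get(v, []):
            if c not in r0:              # children in R0 do not count
                covered |= kw[c] & q
        if covered == q:
            res.add(v)
    return r0, res


# Hypothetical document: value nodes v1..v3 hold the text.
elements = ["workshop", "paper1", "title1", "body1", "paper2", "title2"]
children = {"workshop": ["paper1", "paper2"],
            "paper1": ["title1", "body1"], "title1": ["v1"], "body1": ["v2"],
            "paper2": ["title2"], "title2": ["v3"]}
text = {"v1": "xml search", "v2": "ranked keyword search",
        "v3": "xml keyword ranking"}
r0, res = result(["xml", "keyword"], elements, children, text)
```

In this sample, `workshop` and `paper2` contain both keywords but only through sub-elements that are themselves in $R_0$, so they are excluded, while `paper1` and `title2` are returned.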

Keyword Query Results - 3
XML elements provide more context information, but this also poses interesting user-interface challenges.
One solution is to allow the user to navigate up to the ancestors of the query result.
Another solution is to predefine a set of "answer nodes" AN. This may require knowledge of the domain and of the underlying XML schema.
XRANK supports both.

Ranking Keyword Query Results
Desired properties of the ranking function:
1) Result specificity: rank more specific results higher than less specific results. This is one dimension of result proximity.
2) Keyword proximity: another dimension of result proximity.
3) Hyperlink awareness: take the hyperlinked structure of XML documents into account.

Ranking Function: Definition
ElemRank is defined at the granularity of an element and takes the nested structure of XML into account. It is similar to Google's PageRank.
Let $Q = (k_1, k_2, \ldots, k_n)$, $R = \mathrm{Result}(Q)$, and let $v_1 \in R$ be a result element.
First define the ranking of $v_1$ with respect to one query keyword $k_i$, $r(v_1, k_i)$, before defining the overall rank, $\mathit{rank}(v_1, Q)$.

Ranking with respect to one keyword
There exists a sub-element/value node $v_2$ of $v_1$ such that $v_2 \notin R_0$ and $\mathit{contains}^*(v_2, k_i)$.
There is a sequence of containment edges in $CE$ of the form $(v_1, v_2), (v_2, v_3), \ldots, (v_t, v_{t+1})$ such that $v_{t+1}$ is a value node that directly contains the keyword $k_i$.

Ranking with respect to one keyword - 2
$r(v_1, k_i)$ does not depend on the ElemRank of the result node $v_1$, except when $v_1 = v_t$, for two reasons:
1. Less specific results should indeed get lower ranks.
2. It is in fact related to $\mathrm{ElemRank}(v_1)$ due to certain properties of containment edges.
For multiple occurrences of $k_i$ in $v_1$, the ranks of the individual occurrences are combined with an aggregation function $f$; here $f = \max$.

Overall Ranking
The overall ranking is the sum of the ranks with respect to each query keyword, multiplied by a measure of keyword proximity $p(v_1, k_1, k_2, \ldots, k_n)$.
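
The combination described above can be sketched in a few lines. The slides do not give the per-keyword formula, so this sketch assumes that the ElemRank of the element directly containing the keyword is scaled down by a constant `decay` factor for each containment level below the result node; the `decay` value, the function names and the constant proximity value are all illustrative assumptions.

```python
def keyword_rank(occurrences, decay=0.75):
    """Rank of a result with respect to one keyword.

    occurrences: (elem_rank, depth) pairs, one per keyword occurrence, where
    elem_rank is the ElemRank of the element directly containing the keyword
    and depth is the number of containment levels between the result node and
    that element (0 if they are the same element).
    Assumption: per-level geometric decay; occurrences are combined with max.
    """
    return max(er * decay ** depth for er, depth in occurrences)


def overall_rank(per_keyword_occurrences, proximity=1.0, decay=0.75):
    """Sum of per-keyword ranks, multiplied by a keyword proximity measure."""
    total = sum(keyword_rank(occ, decay) for occ in per_keyword_occurrences)
    return proximity * total


# Two keywords; the second one occurs twice under the result node.
score = overall_rank([[(0.4, 1)], [(0.2, 0), (0.8, 2)]],
                     proximity=0.5, decay=0.5)
```

With these sample numbers each keyword contributes 0.2, so the sum of 0.4 is halved by the proximity factor.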

XRANK System Architecture

XRANK System Architecture - 2
ElemRank Computation Module: computes the ElemRanks of XML elements. These are combined with ancestor information to generate an index structure called HDIL.
Query Evaluator Module: evaluates queries using HDIL and returns ranked results.

ElemRank Computational Module
ElemRank is a measure of the objective importance of an XML element and is based on the hyperlinked structure of XML documents.
The PageRank function is the sum of two probabilities, weighted by a parameter d (d = 0.85):
the probability of visiting v at random
the probability of visiting v by navigating hyperlinks
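
For reference, the standard PageRank iteration that ElemRank starts from can be sketched as below. This is the textbook formulation rather than XRANK's own code; it assumes that d weights the navigation term and 1 - d the random-visit term, and that nodes without outgoing links simply lose their rank mass (a simplification).

```python
def pagerank(nodes, links, d=0.85, iterations=50):
    """Power iteration for PageRank on a directed graph.

    nodes: list of node ids; links: list of (source, target) hyperlinks.
    """
    n = len(nodes)
    out_degree = {u: 0 for u in nodes}
    for u, _ in links:
        out_degree[u] += 1
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Probability of visiting v at random.
        new_rank = {v: (1.0 - d) / n for v in nodes}
        # Probability of reaching v by navigating a hyperlink from u.
        for u, v in links:
            new_rank[v] += d * rank[u] / out_degree[u]
        rank = new_rank
    return rank


ranks = pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")])
```

On a simple three-node cycle every node ends up with the same rank, one third.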

ElemRank Computational Module - 2
PageRank is unidirectional. For XML, rank should propagate in both directions along containment edges:
Forward ElemRank propagation: paper -> section
Reverse ElemRank propagation: paper -> workshop

Refinements of PageRank
1. Bi-directional transfer of ElemRanks
2. Discrimination between containment and hyperlink edges
3. Aggregate ElemRanks for reverse containment relationships

Bi-directional Transfer of ElemRanks
A simple solution is to add reverse containment edges. However, this does not distinguish between containment and hyperlink edges.

Discrimination between containment and hyperlink edges
This refinement still weights forward and reverse containment relationships similarly.

Aggregate ElemRanks for reverse containment relationships
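
The formulas for the three refinements did not survive in this transcript. One form consistent with the refinements listed above is sketched here; every symbol in it is an assumption: $d_1$, $d_2$, $d_3$ are the probabilities of following a hyperlink, a forward containment edge and a reverse containment edge, $N_h(u)$ and $N_c(u)$ are the numbers of outgoing hyperlinks and sub-elements of $u$, $N_d$ is the number of documents, and $N_{de}(v)$ is the number of elements in the document containing $v$.

```latex
e(v) = \frac{1 - d_1 - d_2 - d_3}{N_d \cdot N_{de}(v)}
     + d_1 \sum_{(u,v) \in HE} \frac{e(u)}{N_h(u)}
     + d_2 \sum_{(u,v) \in CE} \frac{e(u)}{N_c(u)}
     + d_3 \sum_{(u,v) \in CE^{-1}} e(u)
```

In this form the reverse containment term is not divided by a fan-out count, so a parent aggregates the ElemRanks of its sub-elements, while the forward containment term splits a parent's ElemRank among its children.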

XRANK System Efficiently Evaluating XML Keyword Search Queries

Efficiently Evaluating XML Keyword Search Queries
Naïve Approach
Dewey Inverted List (DIL)
Ranked Dewey Inverted List (RDIL)
Hybrid Dewey Inverted List (HDIL)

Naïve Approach
The main difference between XML and HTML keyword search is the granularity of query results:
XML keyword search returns elements.
HTML keyword search returns documents.
One way to do XML keyword search is to treat each element as a document.

Problems of Naïve Approach
1. Space overhead
2. Spurious query results
3. Inaccurate ranking of results

Space Overhead
An inverted list contains, for each keyword, the list of documents that contain the keyword. For XML documents, it is the list of elements.
This causes a large space overhead, because each inverted list contains (1) the XML elements that directly contain the keyword and, redundantly, (2) all of their ancestors.

Spurious Query Results
The naïve approach ignores ancestor-descendant relationships: all elements are treated as independent documents.
As a result, the results will not correspond to the desired semantics for XML keyword search.

Inaccurate Ranking of Results
Existing approaches do not take result specificity into account when ranking results.

Dewey Inverted List (DIL)
The naïve approach has a drawback: it decouples the representation of ancestors and descendants.
Dewey encoding of element IDs jointly captures ancestor and descendant information.

DIL
An interesting feature: the ID of an ancestor is a prefix of the ID of a descendant.
Ancestor-descendant relationships are implicitly captured in the Dewey ID.

DIL Data Structure
The inverted list for a keyword k contains the Dewey IDs of all the XML elements that directly contain the keyword k.
For multiple documents, the first component of each Dewey ID is the document ID.
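
The prefix property of Dewey IDs is easy to see in code. The sketch below represents a Dewey ID as a tuple of integers; the helper names and the dotted string form such as `"5.0.3"` are assumptions made for illustration.

```python
def parse_dewey(s):
    """Parse a dotted Dewey ID such as '5.0.3' into a tuple of ints."""
    return tuple(int(part) for part in s.split("."))


def is_ancestor(a, d):
    """True if the element with Dewey ID a is a proper ancestor of d."""
    return len(a) < len(d) and d[:len(a)] == a


def common_prefix(a, b):
    """Dewey ID of the deepest element that contains both a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


paper = parse_dewey("5.0.3")
title = parse_dewey("5.0.3.0.0")
body = parse_dewey("5.0.3.0.1")
```

Here the first component, 5, plays the role of the document ID. Python compares tuples lexicographically, so sorting such tuples places every ancestor before its descendants.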

DIL Data Structure - 2
In addition to the Dewey ID, an entry in DIL contains:
the ElemRank of the corresponding XML element
the list of all positions where the keyword k appears in that element
Entries are sorted by Dewey ID.
The size of DIL is smaller than that of the naïve approach.

DIL Query Processing
An algorithm that works in a single pass over the query keyword inverted lists.
The key idea: merge the query keyword inverted lists and simultaneously compute the longest common prefix of the Dewey IDs in different lists.
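
The single-pass idea can be sketched with a stack that mirrors the Dewey ID of the current entry. This is a simplified illustration rather than the paper's algorithm: it ignores ElemRanks and keyword positions, the function and variable names are assumptions, and the bookkeeping (a set of seen keywords plus an "in R0" flag per stack slot) is one possible way to realize the most-specific-result semantics defined earlier.

```python
import heapq


def dil_query(lists):
    """lists: one Dewey-sorted list of Dewey ID tuples per query keyword.
    Returns the Dewey IDs of the result elements in emission order."""
    full = set(range(len(lists)))
    # Tag every entry with its keyword index and merge in Dewey order.
    merged = heapq.merge(*[[(d, i) for d in lst]
                           for i, lst in enumerate(lists)])
    path, seen, in_r0 = [], [], []   # parallel stacks, one slot per component
    results = []

    def pop():
        dewey = tuple(path)
        path.pop()
        s, r0 = seen.pop(), in_r0.pop()
        if s == full:                # element contains every keyword
            results.append(dewey)
            r0 = True
        if seen:                     # hand information to the parent
            if r0:
                in_r0[-1] = True     # keywords below an R0 child do not count
            else:
                seen[-1] |= s

    for dewey, i in merged:
        # Longest common prefix of the stack and the next Dewey ID.
        lcp = 0
        while lcp < min(len(path), len(dewey)) and path[lcp] == dewey[lcp]:
            lcp += 1
        while len(path) > lcp:
            pop()
        for comp in dewey[lcp:]:
            path.append(comp)
            seen.append(set())
            in_r0.append(False)
        seen[-1].add(i)              # this element directly contains keyword i
    while path:
        pop()
    return results


xml_list = [(0, 0, 0), (0, 1, 0)]        # elements directly containing "xml"
keyword_list = [(0, 0, 1), (0, 1, 0)]    # elements directly containing "keyword"
hits = dil_query([xml_list, keyword_list])
```

In the sample, element 0.0 is returned because its two children each contribute one keyword, and element 0.1.0 is returned because it contains both keywords directly; their common ancestors are not returned.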

Ranked Dewey Inverted List (RDIL)
"If inverted lists are long (due to common keywords or large document collections) even the cost of a single scan of the inverted list can be expensive, especially if the users want only the top few results."

RDIL - 2
One solution: order the inverted lists by ElemRank instead of by Dewey ID.
Higher ranked results will then appear first in the inverted list.
Query processing is based on the Threshold Algorithm.

RDIL Data Structure
RDIL is similar to DIL except that:
Inverted lists are ordered by ElemRank.
Each inverted list has a B+-tree index on the Dewey ID field.

RDIL Query Processing
Consider an entry retrieved from the inverted list of keyword k_i. The entry contains the Dewey ID d of a top-ranked element that directly contains the query keyword k_i.
To determine a query result, the longest prefix of d that also contains the other query keywords needs to be determined.
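
The probe step described above can be sketched with a sorted list and binary search standing in for the B+-tree on the Dewey ID field. This covers only the longest-prefix lookup, not the full threshold-based evaluation; the function names are assumptions. The sketch relies on the fact that, in Dewey order, the entry sharing the longest prefix with d is always the immediate predecessor or successor of d.

```python
import bisect


def lcp_len(a, b):
    """Length of the longest common prefix of two Dewey ID tuples."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_containing_prefix(d, other_lists):
    """Longest prefix of d whose element also contains every other keyword.

    other_lists: one Dewey-sorted list of Dewey ID tuples per other keyword.
    Returns () when no common element exists.
    """
    best = len(d)
    for lst in other_lists:
        pos = bisect.bisect_left(lst, d)
        neighbours = lst[max(pos - 1, 0):pos + 1]   # predecessor and successor
        if not neighbours:
            return ()
        best = min(best, max(lcp_len(d, e) for e in neighbours))
    return d[:best]


prefix = longest_containing_prefix((0, 0, 1), [[(0, 0, 0), (0, 1, 0)]])
```

In the sample, the closest entry of the other keyword is 0.0.0, so the candidate result for d = 0.0.1 is their common ancestor 0.0.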

Hybrid Dewey Inverted List (HDIL)
In many cases RDIL is likely to perform well.
However, it may perform worse than DIL for queries whose keywords are not correlated.

HDIL - 2
In the uncorrelated case, the individual query keywords occur relatively frequently in the document collection but rarely occur together in the same document.
Since the number of results is small, RDIL has to scan most (or all) of the inverted lists to produce the output.
Can we combine the benefits of DIL and RDIL without replicating the entire inverted list index?

HDIL Query Processing
An adaptive strategy: periodically monitor performance and calculate:
t: the time spent so far
r: the number of results above the threshold so far
Estimated time remaining for RDIL = (m - r) * t / r, where m is the desired number of query results.
If the estimated time is more than the expected time for DIL, then switch to DIL.
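
The switching rule can be written down directly. In this sketch the function names and the handling of r = 0 (no results yet, so the estimate is treated as unbounded) are assumptions; the expected DIL time is taken as an input because the slides do not say how it is obtained.

```python
def estimated_rdil_remaining(t, r, m):
    """Estimated remaining RDIL time: (m - r) * t / r.

    t: time spent so far, r: results found above the threshold so far,
    m: desired number of query results.
    """
    if r >= m:
        return 0.0
    if r == 0:
        return float("inf")   # assumption: no progress yet means no estimate
    return (m - r) * t / r


def should_switch_to_dil(t, r, m, expected_dil_time):
    """Switch when RDIL is estimated to take longer than DIL would."""
    return estimated_rdil_remaining(t, r, m) > expected_dil_time


remaining = estimated_rdil_remaining(t=2.0, r=4, m=10)
```

For example, after 2 seconds with 4 of 10 desired results found, the remaining time is estimated at 3 seconds, so the evaluator would switch if DIL is expected to take less than that.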

Experimental Evaluation
Experimental Setup
Quality and Ranking Function
Space Requirements
Query Performance, measured while varying:
(1) the number of query keywords
(2) the correlation between the keywords
(3) the desired number of query results
(4) the selectivity of the keywords