Ranked Information Retrieval on XML Data Seminar “Informationsorganisation und -suche mit XML” Dr. Ralf Schenkel SS 2003 Saarland University 8. Juli 2003.

Presentation transcript:

Ranked Information Retrieval on XML Data
Seminar “Informationsorganisation und -suche mit XML”, Dr. Ralf Schenkel, summer semester (SS) 2003, Saarland University, 8 July 2003
Bernadette Blum, Christian Nicolaus, Markus Uhl

Outline
1. Introduction to Information Retrieval
2. Information Retrieval on XML Data
3. Approaches
   1. ELIXIR
      – The ELIXIR language
      – The ELIXIR query processing algorithm
      – Experiments, Conclusion
   2. XRANK
      – Data model
      – Ranking function
      – Data structures and algorithms
      – Experiments
4. Conclusion

1. Introduction to Information Retrieval
Definition:
– Information Retrieval (IR) is the technology for searching collections (corpora, intranets, the Web) of weakly structured documents: text, HTML, XML, ...
– applications: search engines, digital libraries, similarity search on scientific data
Vector space model (text analysis):
– based on word occurrence frequency
– documents and queries are vectors
– result ranking based on a similarity metric in vector space

1. Introduction to Information Retrieval (II)
Link analysis (structure analysis):
– weighting documents
– improve result ranking
PageRank approach (I):
– web as directed graph G
– “random walk” of a web surfer:
    follow hyperlinks with probability (1 − ε)
    “random jump” with probability ε

1. Introduction to Information Retrieval (III)
PageRank approach (II):
[Figure: rank p(q) of a document q, composed of contributions over hyperlinks and “random jumps”. Legend: probability of a “random jump” ε; probability of following a hyperlink (1 − ε). Edge labels in the example: (1 − ε)/3 for a hyperlink, ε/5 for a random jump.]
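The random-surfer model can be sketched as a simple power iteration. This is a minimal illustration and not code from the seminar: the function name `pagerank`, the parameter name `eps` for the random-jump probability, its default value 0.15, the handling of pages without outgoing links, and the three-page toy graph are all assumptions.

```python
def pagerank(links, eps=0.15, iterations=50):
    """links maps every page to the list of pages it links to."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Random jump: every page receives an equal share of eps.
        new_rank = {p: eps / n for p in pages}
        for p in pages:
            # Assumption: a page without outgoing links spreads its rank evenly.
            targets = links[p] or pages
            share = (1.0 - eps) * rank[p] / len(targets)
            for q in targets:
                new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this toy graph, page c is linked from both a and b and therefore ends up with a higher rank than b, which is linked only from a.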

2. Information Retrieval on XML Data
XML: standard for the exchange of structured data and documents
Existing query languages (e.g. XML-QL, Quilt, XQL, … → XQuery):
– no ranked or weighted results based on textual similarity
– but there are extensions (XXL, XIRQL, …)
2 approaches:
– ELIXIR: SQL-like approach
– XRANK: keyword-based approach

ELIXIR
– ELIXIR = “expressive and efficient language for XML information retrieval”
– extension to XML-QL: similarity operator “~”
– “~” is computed by WHIRL
– returns the best r answers

ELIXIR – The ELIXIR language
Syntax:
– XML-QL syntax (SQL-like)
    CONSTRUCT $b WHERE $b in "db.xml", $c in "db.xml", $yb > 1990, $b ~ $c.
– CONSTRUCT clause: output format
– WHERE clause: pattern statements + predicates
– $yb > 1990: boolean operators
– $b ~ $c: ELIXIR’s similarity operator
Similarity calculation even between 2 variables (→ expressiveness)
No nested queries

ELIXIR – The ELIXIR language (II)
WHIRL (I): Word-based Heterogeneous Information Retrieval Logic
– extends DATALOG with “~”
– only relational data
– efficiently supports ranked IR
Syntax (Horn clause):
    output($y, $a, $t) :- book($y, $a, $t), $y > 1950, $t ~ $a.
– output($y, $a, $t): output relation
– book($y, $a, $t): input relation
– clause body: conjunction of relational predicates
– $y > 1950: boolean operator
– $t ~ $a: similarity operator

ELIXIR – The ELIXIR language (III)
WHIRL (II): Similarity computation “~”:
– standard IR term vector techniques
– weighting terms (TF-IDF values)
– cosine measure (V: vocabulary of distinct terms; terms t ∈ V; documents d, d’ ∈ R^|V|)
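For reference, the cosine measure in its standard textbook form is given below. The component notation d_t (the TF-IDF weight of term t in document d) is an assumption made here for illustration.

```latex
\[
\mathrm{sim}(d, d') \;=\;
\frac{\sum_{t \in V} d_t \, d'_t}
     {\sqrt{\sum_{t \in V} d_t^{2}} \; \sqrt{\sum_{t \in V} {d'_t}^{2}}}
\]
```

If both vectors are normalized to unit length beforehand, the denominator is 1 and the measure reduces to the dot product over the shared terms.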

ELIXIR – The ELIXIR query processing algorithm
Example (naïve approach): XML-QL query Q2
    { CONSTRUCT $b $c WHERE $b in "db.xml", $c in "db.xml" }
Similarity computation for every tuple ($b, $c): full cross product!

ELIXIR – The ELIXIR query processing algorithm (II)
Problem: full cross product!

ELIXIR – The ELIXIR query processing algorithm (III)
Solution:
– do not simply map the full XML data into the relational model
– invoke WHIRL as a “subroutine” (→ efficiency)
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (IV)
Start: query Q1
3 stages with intermediate queries Q2, Q3, Q4:
1. Partition Q1 into a set Q2_1 … Q2_N of XML-QL queries
   – avoids generating the full cross product
   – evaluates the ordinary predicates
   – 2 pattern statements with variables that are compared by a similarity predicate => distinct Q2_j queries
2. WHIRL query Q3
   – evaluates the similarity predicates
   – produces an ordered table of the r best answers
3. XML-QL query Q4
   – transforms Q3’s output
   – into the XML structure specified by Q1

ELIXIR – The ELIXIR query processing algorithm (V)
Example (Step I – partition into Q2_n queries):
XML-QL query Q2_1:
    { CONSTRUCT $b WHERE $b in "db.xml" }
XML-QL query Q2_2:
    { CONSTRUCT $c WHERE $c in "db.xml" }
Resulting bindings:
– $b: Traditional Ukrainian cookery; Being and nothingness; Shooting Elvis
– $c: Ukrainian folk music; Being there; Milk cow blues
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (VI)
Example (Step II – WHIRL query Q3):
    q3($b) :- q21($b), q22($c), $b ~ $c.
Input:
– q21: Traditional Ukrainian cookery; Being and nothingness; Shooting Elvis
– q22: Ukrainian folk music; Being there; Milk cow blues
Output of Q3: Traditional Ukrainian cookery; Being and nothingness
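Step II can be illustrated on the six example titles. The sketch below is a stand-in for WHIRL and not its actual algorithm: it scores every pair of the already filtered bindings with TF-IDF cosine similarity and keeps the r best. The function names, the whitespace tokenizer, the IDF variant log(1 + N/df), and the assignment of the titles to the two variables are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Return one unit-length TF-IDF vector (dict: term -> weight) per text."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    df = Counter(term for d in docs for term in d)  # document frequency
    vectors = []
    for d in docs:
        vec = {t: tf * math.log(1 + n / df[t]) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(left, right, r):
    """Return the r best (score, b, c) pairs with non-zero cosine similarity."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    scored = []
    for b, vb in zip(left, lv):
        for c, vc in zip(right, rv):
            # Vectors have unit length, so the dot product is the cosine.
            score = sum(w * vc.get(t, 0.0) for t, w in vb.items())
            if score > 0:
                scored.append((score, b, c))
    return sorted(scored, reverse=True)[:r]


b_values = ["Traditional Ukrainian cookery", "Being and nothingness", "Shooting Elvis"]
c_values = ["Ukrainian folk music", "Being there", "Milk cow blues"]
best = similarity_join(b_values, c_values, r=2)
```

Only two pairs share a term ("being" and "ukrainian"), so the two $b values kept are the same ones the slide lists as the output of Q3.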

ELIXIR – The ELIXIR query processing algorithm (VII)
Example (Step III – XML-QL query Q4):
    { CONSTRUCT $b WHERE $b in "q3.xml" }
Input (output of Q3): Traditional Ukrainian cookery; Being and nothingness
Final XML output: Traditional Ukrainian cookery; Being and nothingness

ELIXIR – Experiments, Conclusion
Experiments: total processing time …
– … depends on the details of each query and on the input data
– … increases marginally with the number of answers r
– … increases linearly with the number of similarity join predicates
– … is dominated by partitioning the initial query (Step 1): expensive parsing and traversal

ELIXIR – Experiments, Conclusion (II)
Conclusion:
– ELIXIR extends XML-QL by supporting IR similarity features for ranking
– similarity joins even between 2 variables (expressiveness)
Algorithm:
– rewrites the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries
– no full cross product, only filtered tuples of variable bindings (efficiency)
But …
– only non-nested queries
– the strict three-stage approach may be suboptimal in some cases (partition)

XRANK: Ranked Keyword Search over XML Documents

Introduction
XRANK – keyword search over XML documents
Results:
– XML elements that contain all searched keywords
Ranking:
– at the granularity of XML elements
– based on hyperlink structure
Advantages:
– the user does not have to learn a query language
– no knowledge about the structure of the XML documents is needed
Generalized keyword search engine (both HTML and XML are possible)

Data Model
G = (V, CE, HE): collection of XML documents
– V: set of XML elements (tags and attributes)
– CE: set of containment edges
– HE: set of hyperlink edges
– (u,v) in CE ⇔ v is a sub-element of u
– (u,v) in HE ⇔ u contains a hyperlink to v
– contains(v,k) ⇔ v (in)directly contains the keyword k

Example: XML Graph
[Figure: example XML graph; legend: XML element, value]

Keyword Query Results (1)
How to define the results of keyword search queries over XML documents? As the union (⋃) of:
– elements that contain all keywords, where no sub-element contains all keywords
– elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords

Ranking Elements
How to rank XML elements? ElemRank:
– extension of PageRank at the granularity of elements
– objective importance of XML elements
– based on the hyperlinked and nested structure of XML elements

ElemRank (1)
– n: number of XML elements
– n_c(u): number of sub-elements of u
– n_h(u): number of outgoing hyperlinks from u
– CE^-1 = {(v,u) | (u,v) ∈ CE}: “reverse containment edges”
– E = HE ∪ CE ∪ CE^-1
[Figure: example element u with n_c(u) = 3 and n_h(u) = 3; legend: containment edge, reverse containment edge, hyperlink edge]

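ElemRank (2)
– α: prob. for following a hyperlink
– 1 − α − β − γ: prob. for a random jump (written ε in the figure)
– β: prob. for using a containment edge
– γ: prob. for using a reverse containment edge
[Figure: example graph with 10 elements; each edge is labelled with the probability of its edge type divided by the number of such edges (/3 or /1) plus the random-jump share ε/10; element pairs without an edge are labelled ε/10. Legend: containment edge, reverse containment edge, hyperlink edge]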

ElemRank (3)
ElemRank e(v), with 0 ≤ α, β, γ ≤ 1:
\[
e(v) = (1-\alpha-\beta-\gamma)\cdot\frac{1}{n}
     + \alpha \sum_{(u,v)\in HE} \frac{e(u)}{n_h(u)}
     + \beta \sum_{(u,v)\in CE} \frac{e(u)}{n_c(u)}
     + \gamma \sum_{(u,v)\in CE^{-1}} \frac{e(u)}{1}
\]
– 1st term: random navigation
– 2nd term (α): via hyperlinks
– 3rd term (β): via forward containment edges
– 4th term (γ): via reverse containment edges
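The ElemRank recurrence can be evaluated by fixed-point iteration, in the same way as PageRank. The sketch below is an illustration under assumptions: the names `elem_rank`, `alpha`, `beta`, `gamma`, the default probability values, the fixed iteration count, and the three-element toy graph are not taken from the slides.

```python
def elem_rank(elements, ce, he, alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """ce: containment edges (parent, child); he: hyperlink edges (source, target)."""
    n = len(elements)
    children = {u: [] for u in elements}
    links = {u: [] for u in elements}
    parent = {}
    for u, v in ce:
        children[u].append(v)
        parent[v] = u
    for u, v in he:
        links[u].append(v)

    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        # Random navigation term: (1 - alpha - beta - gamma) * 1/n
        new = {v: (1.0 - alpha - beta - gamma) / n for v in elements}
        for u in elements:
            for v in links[u]:        # via hyperlinks: e(u) / n_h(u)
                new[v] += alpha * e[u] / len(links[u])
            for v in children[u]:     # via containment edges: e(u) / n_c(u)
                new[v] += beta * e[u] / len(children[u])
            if u in parent:           # via the reverse containment edge: e(u) / 1
                new[parent[u]] += gamma * e[u]
        e = new
    return e


# Hypothetical toy document: root contains a and b; a has a hyperlink to b.
scores = elem_rank(["root", "a", "b"],
                   ce=[("root", "a"), ("root", "b")],
                   he=[("a", "b")])
```

Here a and b receive the same containment share from root, and b additionally receives the hyperlink share from a, so b is ranked above a. Unlike in plain PageRank, the scores of this recurrence need not sum to 1, because elements without hyperlinks or sub-elements do not pass on the corresponding share.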

Ranking Function (1)
Ranking functions should take into account:
– result specificity
– hyperlinks
– keyword proximity
Rank of element v for a single occurrence of keyword k:
– contains(v,k) ⇒ ∃ sequence of containment edges (v_1,v_2), ..., (v_{n-1},v_n) with v_1 = v s.t. v_n directly contains k
– r(v,k) = ElemRank(v_n) * decay^(n-1), with 0 ≤ decay ≤ 1
– ElemRank(v_n): based on hyperlinked structure
– decay^(n-1): result specificity

Ranking Function (2)
– m occurrences of keyword k → computation of r_1, ..., r_m → r*(v,k) = f(r_1, ..., r_m), with accumulation function f (e.g. max or sum)
– query q consists of keywords k_1, ..., k_n → R(v,q) = (∑_i r*(v,k_i)) * p(v,k_1, ..., k_n)
– keyword proximity: p = proximity measure
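The two ranking slides combine into a short computation. The following sketch assumes that the ElemRank of the directly containing element and the path length n are already known for every keyword occurrence, and it treats the proximity measure p as a number passed in by the caller. The function names, the input layout, and the sample values are assumptions.

```python
def keyword_rank(elem_rank_of_hit, path_length, decay=0.75):
    """r(v,k) = ElemRank(v_n) * decay^(n-1) for a path v = v_1, ..., v_n."""
    return elem_rank_of_hit * decay ** (path_length - 1)


def overall_rank(occurrences, proximity=1.0, decay=0.75, combine=max):
    """R(v,q) = (sum over keywords of r*(v,k_i)) * p.

    occurrences: keyword -> list of (ElemRank(v_n), n), one pair per occurrence.
    combine: accumulation function f over the occurrences of one keyword.
    """
    total = 0.0
    for hits in occurrences.values():
        total += combine([keyword_rank(e, n, decay) for e, n in hits])
    return total * proximity


# Hypothetical element v: "xml" occurs twice below v, "rank" once.
hits = {"xml": [(0.4, 1), (0.8, 3)], "rank": [(0.2, 2)]}
score_max = overall_rank(hits, proximity=0.5, decay=0.5, combine=max)
score_sum = overall_rank(hits, proximity=0.5, decay=0.5, combine=sum)
```

With decay 0.5 the deeper occurrence of "xml" (ElemRank 0.8, path length 3) contributes 0.2, less than the direct occurrence with ElemRank 0.4, which shows how the decay factor rewards more specific results.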

R.E.M. – Out Of Time
    Radio Song 4:12
    Losing My Religion 4:26
    ...
R.E.M. – Automatic For

XRANK Architecture
[Figure: XML documents → ElemRank computation → XML elements with ElemRanks → index structures & algorithms. A keyword search query goes to the Query Evaluator, which uses data access to the index structures and returns a ranked result list.]

Naïve Approach
Naïve inverted list: contains all XML elements that contain the keyword
    key1: elem 11, elem …
    key2: elem 21, elem …
    etc.
Drawbacks:
– space overhead
– spurious results
– inaccurate ranking

Dewey IDs
[Figure: XML tree of the example (R.E.M. – Out Of Time with Radio Song 4:12 and Losing My Religion 4:26; R.E.M. – Automatic For The People), each element labelled with its Dewey ID]

DIL – Data Structure
Dewey inverted list:
– contains the Dewey IDs of all XML elements that directly contain the keyword
– sorted by Dewey ID (ascending)
– each entry: Dewey ID, ElemRank, position list
[Figure: inverted lists for the keywords “R.E.M.” and “Religion”]
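A Dewey inverted list can be built in one pre-order pass over a document. The sketch below is a simplified illustration: the tag names in the sample document, the function name `build_dil`, and the use of tuples as Dewey IDs are assumptions, and the ElemRank field of each entry is left out.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_dil(xml_text, doc_id=0):
    """Map keyword -> list of (Dewey ID, position list), sorted by Dewey ID."""
    index = defaultdict(list)
    position = 0

    def visit(elem, dewey):
        nonlocal position
        hits = defaultdict(list)
        # Only text directly inside the element counts; text that follows a
        # child element (its "tail") is ignored in this simplified sketch.
        for word in (elem.text or "").split():
            hits[word.lower()].append(position)
            position += 1
        for word, positions in hits.items():
            index[word].append((dewey, positions))
        for i, child in enumerate(elem):
            visit(child, dewey + (i,))   # child number i extends the Dewey ID

    visit(ET.fromstring(xml_text), (doc_id,))
    return {word: sorted(entries) for word, entries in index.items()}


# Hypothetical markup for the example album.
doc = ("<cds><cd><title>Out Of Time</title>"
       "<track>Radio Song</track><track>Losing My Religion</track></cd></cds>")
dil = build_dil(doc)
```

In this sample, `dil["religion"]` holds a single entry with Dewey ID `(0, 0, 2)`: document 0, first `cd`, third child. Ancestors such as `(0, 0)` are not stored, which is what saves space compared with the naïve inverted list.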

DIL – Query Processing (1)
Key idea: computation of the longest common prefix (lcp) of Dewey IDs
[Figure: processing table with columns DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result (y/n)]

DIL – Query Processing (2)
[Figure: two snapshots of the processing table (columns DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result) before and after computing the lcp with the next Dewey ID]

DIL – Query Processing (3)
[Figure: two further snapshots of the processing table (columns DeweyID, rank[1], rank[2], posList[1], posList[2], pot_result) around the next lcp computation]
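The role of the longest common prefix can be shown with a small brute-force sketch. It is not the single-pass merge over the sorted lists that the slides illustrate, and it covers only the most specific results (elements that contain all keywords while none of their descendants does). The function names and the tuple representation of Dewey IDs are assumptions.

```python
def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return tuple(prefix)


def most_specific_results(lists):
    """lists: one list of Dewey IDs per keyword (directly containing elements)."""
    covered = {}  # Dewey ID -> set of keyword indexes found in its subtree
    for k, ids in enumerate(lists):
        for dewey in ids:
            # Every prefix of a Dewey ID is an ancestor (or the element itself).
            for depth in range(1, len(dewey) + 1):
                covered.setdefault(dewey[:depth], set()).add(k)
    full = {d for d, ks in covered.items() if len(ks) == len(lists)}
    # Drop every element that has a descendant which also contains all keywords.
    return sorted(d for d in full
                  if not any(o != d and lcp(o, d) == d for o in full))


# Keyword 1 occurs in 0.0.1 and 0.1.0, keyword 2 in 0.0.2 and 0.1.0.
results = most_specific_results([[(0, 0, 1), (0, 1, 0)],
                                 [(0, 0, 2), (0, 1, 0)]])
```

The result contains `(0, 0)`, the lcp of `(0, 0, 1)` and `(0, 0, 2)`, and `(0, 1, 0)`, which contains both keywords directly; their common ancestors are dropped as less specific.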

RDIL – Data Structure
Ranked Dewey inverted list:
– inverted list sorted by ElemRank (descending)
– B+-tree on Dewey IDs, sorted by Dewey ID (ascending)
– each Dewey ID in the list has a position in the B+-tree
[Figure: inverted list for the keyword “R.E.M.” with columns Dewey ID and ElemRank, plus the B+-tree on Dewey IDs]

RDIL – Query Processing (1)
[Figure: inverted lists for key 1, key 2 and key 3 (entries 11–13, 21–23, 31–33), each sorted by ElemRank, each with a B+-tree on Dewey IDs. Step shown: lcp with Dewey ID 11 ⇨ result heap]

RDIL – Query Processing (2)
[Figure: the same inverted lists for key 1, key 2 and key 3, sorted by ElemRank, with B+-trees on Dewey IDs. Step shown: lcp with Dewey ID 21 ⇨ result heap, etc.]

RDIL – Query Processing (3)
[Figure: the same inverted lists for key 1, key 2 and key 3, sorted by ElemRank, with B+-trees on Dewey IDs. Annotation: ∑ ranking of the current entries = threshold Ω; max. reachable ranking ≤ Ω]

RDIL – Query Processing (4)
The RDIL algorithm stops if threshold Ω < lowest ElemRank in result heap,
because max. reachable ranking ≤ Ω < lowest ElemRank in result heap
⇨ max. reachable ranking < lowest ElemRank in result heap!
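The stopping rule follows the pattern of a generic threshold-style top-k algorithm, which the sketch below illustrates. It is a simplified stand-in and not XRANK's RDIL: elements are plain identifiers looked up in dictionaries instead of Dewey IDs probed through B+-trees, an element qualifies only if it appears in every list, scores are assumed to be non-negative, and the overall ranking is simply the sum of the per-keyword scores. The function name and the sample data are assumptions.

```python
import heapq


def top_k(lists, k):
    """lists: per keyword, a list of (score, element) sorted by score descending."""
    lookup = [{elem: score for score, elem in lst} for lst in lists]
    heap = []      # min-heap holding the k best (total, element) found so far
    seen = set()
    for depth in range(max(len(lst) for lst in lists)):
        last = []  # score most recently read from each list
        for lst in lists:
            if depth >= len(lst):
                last.append(0.0)   # exhausted list: nothing more to gain from it
                continue
            score, elem = lst[depth]
            last.append(score)
            if elem in seen:
                continue
            seen.add(elem)
            if all(elem in table for table in lookup):
                total = sum(table[elem] for table in lookup)
                heapq.heappush(heap, (total, elem))
                if len(heap) > k:
                    heapq.heappop(heap)
        # No unseen element can score higher than the sum of the last-read scores.
        threshold = sum(last)
        if len(heap) == k and threshold < heap[0][0]:
            break
    return sorted(heap, reverse=True)


lists = [[(0.9, "a"), (0.8, "b"), (0.1, "c")],
         [(0.6, "a"), (0.5, "b"), (0.2, "c")]]
best = top_k(lists, k=1)
```

For k = 1 the loop stops after the second entry of each list: the threshold has dropped to 0.8 + 0.5, which is below the total of element a, so element c is never read.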

XRANK Architecture
[Figure: XML documents → ElemRank computation → XML elements with ElemRanks → DIL / RDIL. A keyword search query goes to the Query Evaluator, which uses data access to DIL / RDIL and returns a ranked result list.]

Experimental Results (1)

Experimental Results (2)

Comparison DIL - RDIL
DIL:
– inverted lists sorted by Dewey ID
– computes longest common prefix on Dewey IDs
– extracts the minimum of all remaining Dewey IDs
– all lists are completely scanned
– outperforms RDIL if keyword correlation is low
RDIL:
– inverted lists sorted by ElemRank
– chooses next list sequentially
– stops if a certain threshold is reached
– outperforms DIL if keyword correlation is high

Conclusion
2 approaches
ELIXIR:
– SQL-like, structure-based search
– extends XML-QL by supporting IR similarity features for ranking
– ranked results based only on textual similarity (even between 2 variables)
XRANK:
– keyword-based search à la Google
– ranked results based on textual similarity
– hierarchical and hyperlinked structure