Ranked Information Retrieval on XML Data. Seminar “Informationsorganisation und -suche mit XML”, Dr. Ralf Schenkel, SS 2003, Saarland University, 8 July 2003.

Bernadette Blum, Christian Nicolaus, Markus Uhl

Outline
1. Introduction to Information Retrieval
2. Information Retrieval on XML Data
3. Approaches
   1. ELIXIR
      - The ELIXIR language
      - The ELIXIR query processing algorithm
      - Experiments, Conclusion
   2. XRANK
      - Data model
      - Ranking function
      - Data structures and algorithms
      - Experiments
4. Conclusion

1. Introduction to Information Retrieval
Definition:
- Information Retrieval (IR) is the technology for searching collections (corpora, intranets, the Web) of weakly structured documents: text, HTML, XML, ...
- Applications: search engines, digital libraries, similarity search on scientific data
Vector space model (text analysis):
- based on word occurrence frequency
- documents and queries are vectors
- result ranking based on a similarity metric in the vector space

1. Introduction to Information Retrieval (II)
Link analysis (structure analysis):
- weighting documents
- improve result ranking
PageRank approach (I):
- the Web as a directed graph G
- “random walk” of a web surfer:
  - follow hyperlinks with probability (1 - ε)
  - “random jump” with probability ε

1. Introduction to Information Retrieval (III)
PageRank approach (II):
[Figure: graph around a document q with PageRank p(q). Legend: hyperlink edges and “random jump” edges; ε is the probability of a “random jump”, (1 - ε) the probability of following a hyperlink. Edge labels in the example: (1 - ε)/3 and ε/5.]
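The random-surfer model can be made concrete with a few lines of power iteration. This is a minimal sketch rather than the method used by any particular search engine: the function name `pagerank`, the value `eps = 0.15`, the toy graph, and the treatment of documents without outgoing links are all assumptions made for illustration.

```python
def pagerank(links, eps=0.15, iterations=50):
    """Power iteration for the random-surfer model.

    links: dict mapping each document to the list of documents it links to.
    eps:   probability of a random jump (assumed value, not from the slides).
    """
    docs = list(links)
    n = len(docs)
    p = {d: 1.0 / n for d in docs}
    for _ in range(iterations):
        # Every document receives the random-jump share eps / n.
        nxt = {d: eps / n for d in docs}
        for d in docs:
            targets = links[d]
            if targets:
                # Following a hyperlink: split (1 - eps) * p(d) over the out-links.
                share = (1 - eps) * p[d] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Assumption: a document without out-links jumps uniformly.
                for t in docs:
                    nxt[t] += (1 - eps) * p[d] / n
        p = nxt
    return p


# Hypothetical four-document web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for doc in sorted(ranks, key=ranks.get, reverse=True):
    print(doc, round(ranks[doc], 3))
```

In this toy graph, document `d` has no incoming links, so it only ever receives the random-jump share `eps / n`.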

2. Information Retrieval on XML Data
- XML: standard for the exchange of structured data and documents
- existing query languages (e.g. XML-QL, Quilt, XQL, … → XQuery)
  - no ranked or weighted results based on textual similarity
  - but extensions exist (XXL, XIRQL, …)
2 approaches:
- ELIXIR: SQL-like approach
- XRANK: keyword-based approach

3.1 ELIXIR
- ELIXIR = “expressive and efficient language for XML information retrieval”
- extension of XML-QL: similarity operator “~”
- “~” is computed by WHIRL
- returns the best r answers

ELIXIR – The ELIXIR language
Syntax:
- XML-QL syntax (SQL-like)
    CONSTRUCT $b WHERE $b in "db.xml", $c in "db.xml", $yb > 1990, $b ~ $c.
- CONSTRUCT clause: output format
- WHERE clause: pattern statements + predicates
- $yb > 1990: boolean operator
- $b ~ $c: ELIXIR’s similarity operator
- similarity calculation even between 2 variables (→ expressiveness)
- no nested queries

ELIXIR – The ELIXIR language (II)
WHIRL (I): Word-based Heterogeneous Information Retrieval Logic
- extends DATALOG with “~”
- only relational data
- efficiently supports ranked IR
Syntax (Horn clause):
    output($y, $a, $t) :- book($y, $a, $t), $y>1950, $t~$a.
- output($y, $a, $t): output relation
- book($y, $a, $t): input relation
- clause body: conjunction of relational predicates
- $y>1950: boolean operator
- $t~$a: similarity operator

ELIXIR – The ELIXIR language (III)
WHIRL (II): similarity computation “~”:
- standard IR term vector techniques
- weighting terms (TF-IDF values)
- cosine measure: $\mathrm{sim}(d, d') = \frac{\sum_{t \in V} d_t \, d'_t}{\lVert d \rVert \, \lVert d' \rVert}$
  ($V$: vocabulary of distinct terms; terms $t \in V$; documents $d, d' \in \mathbb{R}^{|V|}$)
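A small sketch of the term-vector similarity described on this slide. It is not WHIRL's implementation: the helper names `tfidf_vectors` and `cosine`, the whitespace tokenizer, and the weighting `tf * log(1 + N / df)` are assumptions chosen for brevity, and other TF-IDF variants are equally common.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: tf * log(1 + N / df).
        vectors.append({t: tf[t] * math.log(1 + n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["Ukrainian folk music", "Traditional Ukrainian cookery", "Milk cow blues"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: the two titles share the term "ukrainian"
print(cosine(vecs[0], vecs[2]))  # no shared term
```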

ELIXIR – The ELIXIR query processing algorithm
Example (naïve approach), XML-QL query Q2:
    { CONSTRUCT $b $c WHERE $b in "db.xml", $c in "db.xml" }
- similarity computation for every tuple ($b, $c)
- full cross product!

ELIXIR – The ELIXIR query processing algorithm (II)
Problem: full cross product!

ELIXIR – The ELIXIR query processing algorithm (III)
Solution:
- do not simply map the full XML data into the relational model
- invoke WHIRL as a “subroutine” (→ efficiency)
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (IV)
Start: query Q1
3 stages with intermediate queries Q2, Q3, Q4:
1. Partition into a set Q2_1 … Q2_N of XML-QL queries
   - avoids generating the full cross product
   - evaluates the ordinary predicates
   - 2 pattern statements with variables that are compared by a similarity predicate => distinct Q2_j queries
2. WHIRL query Q3
   - evaluates the similarity predicates
   - yields an ordered table of the r best answers
3. XML-QL query Q4
   - transforms Q3’s output
   - into the XML structure specified by Q1

ELIXIR – The ELIXIR query processing algorithm (V)
Example (Step I – partition into Q2_n queries):
    XML-QL query Q2_1: { CONSTRUCT $b WHERE $b in "db.xml" }
    XML-QL query Q2_2: { CONSTRUCT $c WHERE $c in "db.xml" }
Intermediate results (two lists of titles):
- Ukrainian folk music; Being there; Milk cow blues
- Traditional Ukrainian cookery; Being and nothingness; Shooting Elvis
Avoid generating the full cross product!

ELIXIR – The ELIXIR query processing algorithm (VI)
Example (Step II – WHIRL query Q3):
    q3($b) :- q21($b), q22($c), $b ~ $c.
Input (two lists of titles):
- Ukrainian folk music; Being there; Milk cow blues
- Traditional Ukrainian cookery; Being and nothingness; Shooting Elvis
Output: Traditional Ukrainian cookery; Being and nothingness

ELIXIR – The ELIXIR query processing algorithm (VII)
Example (Step III – XML-QL query Q4):
    { CONSTRUCT $b WHERE $b in "q3.xml" }
Input: Traditional Ukrainian cookery; Being and nothingness
Final XML output: Traditional Ukrainian cookery; Being and nothingness
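The three stages can be mimicked on plain Python lists to see how the data flows. This is only a sketch under several assumptions: the publication years, the output tags `<result>` and `<title>`, and all function names are hypothetical, and the brute-force scoring in stage 2 is a stand-in for WHIRL's "~", not WHIRL's own search procedure.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two strings over raw term counts (stand-in for "~")."""
    u, v = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def stage1_partition(bindings, predicate):
    """One Q2_j query: bind a single variable and apply its ordinary predicates."""
    return [b for b in bindings if predicate(b)]


def stage2_similarity_join(left, right, r):
    """Q3: score pairs of the surviving bindings and keep the r best."""
    scored = ((cosine(b, c), b, c) for b in left for c in right)
    return heapq.nlargest(r, scored)


def stage3_construct(rows):
    """Q4: wrap the ranked answers in an output structure (hypothetical tags)."""
    items = "".join(f"<title>{b}</title>" for _, b, _ in rows)
    return f"<result>{items}</result>"


# Hypothetical data: (title, year) pairs; the years are invented for the example.
shelf_b = [("Traditional Ukrainian cookery", 1995),
           ("Being and nothingness", 1993),
           ("Shooting Elvis", 1985)]
shelf_c = [("Ukrainian folk music", 1998), ("Being there", 1999), ("Milk cow blues", 1997)]

q21 = [title for title, _ in stage1_partition(shelf_b, lambda rec: rec[1] > 1990)]
q22 = [title for title, _ in stage1_partition(shelf_c, lambda rec: True)]
q3 = stage2_similarity_join(q21, q22, r=2)
print(stage3_construct(q3))
```

Filtering each variable separately in stage 1 means the similarity join only sees bindings that already satisfy the ordinary predicates, which is the point of the partitioning step.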

ELIXIR – Experiments, Conclusion
Experiments: total processing time …
- … depends on the details of each query and the input data
- … increases only marginally with the number of answers r
- … increases linearly with the number of similarity join predicates
- … is dominated by the partitioning (Step 1) of the initial query (expensive parsing and traversing)

ELIXIR – Experiments, Conclusion (II)
Conclusion:
- ELIXIR extends XML-QL by supporting IR similarity features for ranking
- similarity joins even between 2 variables (expressiveness)
Algorithm:
- rewrites the original ELIXIR query into a series of intermediate XML-QL and WHIRL queries
- no full cross product, only filtered tuples of variable bindings (efficiency)
But …
- only non-nested queries
- the strict three-stage approach may be suboptimal in some cases (partition)

3.2 XRANK: Ranked Keyword Search over XML Documents

Introduction
XRANK: keyword search over XML documents
Results:
- XML elements that contain all searched keywords
Ranking:
- at the granularity of XML elements
- based on hyperlink structure
Advantages:
- the user does not have to learn a query language
- no knowledge about the structure of the XML documents is needed
Generalized keyword search engine (both HTML and XML are possible)

Data Model
G = (V, CE, HE): collection of XML documents
- V: set of XML elements (tags and attributes)
- CE: set of containment edges
- HE: set of hyperlink edges
- (u,v) in CE ⇔ v is a sub-element of u
- (u,v) in HE ⇔ u contains a hyperlink to v
- contains(v,k) ⇔ v directly or indirectly contains the keyword k
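The graph model translates almost directly into a small data structure. The sketch below is illustrative only: the class name `XMLGraph`, the `text` dictionary holding each element's own text, and the element names in the example are assumptions, and `contains` simply follows containment edges recursively.

```python
from dataclasses import dataclass, field


@dataclass
class XMLGraph:
    text: dict                              # element -> its own (direct) text content
    ce: set = field(default_factory=set)    # containment edges (u, v): v is a sub-element of u
    he: set = field(default_factory=set)    # hyperlink edges (u, v): u contains a hyperlink to v

    def children(self, u):
        """Sub-elements of u according to the containment edges."""
        return [v for (parent, v) in self.ce if parent == u]

    def contains(self, v, keyword):
        """True if v contains the keyword directly or via containment edges."""
        if keyword in self.text.get(v, "").split():
            return True
        return any(self.contains(child, keyword) for child in self.children(v))


# Hypothetical element names; hyperlink edges are not followed by contains().
g = XMLGraph(
    text={"title": "Radio Song", "review": "Great album"},
    ce={("album", "song"), ("song", "title")},
    he={("review", "album")},
)
print(g.contains("album", "Radio"))   # keyword sits two containment levels below "album"
print(g.contains("review", "Radio"))  # only linked, not contained
```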

Example: XML Graph
[Figure: example XML graph; legend: XML element, value.]

Keyword Query Results (1)
How to define the results of keyword search queries over XML documents?
The result set is the union (⋃) of:
- elements that contain all keywords while no sub-element contains all keywords
- elements with at least one sub-element containing all keywords and at least one sub-element containing some keywords

Ranking Elements
How to rank XML elements?
ElemRank:
- extension of PageRank at the granularity of elements
- objective importance of XML elements
- based on the hyperlinked and nested structure of XML elements

ElemRank (1)
- $n$: number of XML elements
- $n_c(u)$: number of sub-elements of u
- $n_h(u)$: number of outgoing hyperlinks from u
- $CE^{-1} = \{(v,u) \mid (u,v) \in CE\}$: “reverse containment edges”
- $E = HE \cup CE \cup CE^{-1}$
[Figure: element u with $n_c(u) = 3$ and $n_h(u) = 3$; legend: containment edge, reverse containment edge, hyperlink edge.]

ElemRank (2)
- α: probability of following a hyperlink
- β: probability of using a containment edge
- γ: probability of using a reverse containment edge
- ε = 1 - α - β - γ: probability of a random jump
[Figure: example graph with edge labels α/3 + ε/10, β/3 + ε/10, γ/1 + ε/10 and ε/10; legend: containment edge, reverse containment edge, hyperlink edge.]

ElemRank (3)
$$e(v) = \frac{1-\alpha-\beta-\gamma}{n} + \alpha \sum_{(u,v) \in HE} \frac{e(u)}{n_h(u)} + \beta \sum_{(u,v) \in CE} \frac{e(u)}{n_c(u)} + \gamma \sum_{(u,v) \in CE^{-1}} \frac{e(u)}{1} \qquad (0 \le \alpha, \beta, \gamma \le 1)$$
- first term: random navigation
- α term: navigation via hyperlinks
- β term: navigation via forward containment edges
- γ term: navigation via reverse containment edges
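The ElemRank recurrence can be evaluated by simple fixed-point iteration. The following sketch assumes three weights for hyperlink, forward containment, and reverse containment navigation (called `alpha`, `beta`, `gamma` here, with example values that are not from the slides), a uniform start vector, and a fixed number of iterations instead of a convergence test.

```python
def elem_rank(elements, ce, he, alpha=0.35, beta=0.25, gamma=0.25, iterations=100):
    """Iterate e(v) = (1-a-b-g)/n + a*sum_HE e(u)/n_h(u) + b*sum_CE e(u)/n_c(u) + g*sum_CE^-1 e(u).

    elements: list of element ids
    ce:       containment edges (parent, child)
    he:       hyperlink edges (source, target)
    alpha, beta, gamma: assumed example weights.
    """
    n = len(elements)
    n_c = {u: 0 for u in elements}   # number of sub-elements
    n_h = {u: 0 for u in elements}   # number of outgoing hyperlinks
    for u, _ in ce:
        n_c[u] += 1
    for u, _ in he:
        n_h[u] += 1

    e = {v: 1.0 / n for v in elements}
    for _ in range(iterations):
        nxt = {v: (1 - alpha - beta - gamma) / n for v in elements}  # random jump
        for u, v in he:
            nxt[v] += alpha * e[u] / n_h[u]      # via hyperlinks
        for parent, child in ce:
            nxt[child] += beta * e[parent] / n_c[parent]   # forward containment
            nxt[parent] += gamma * e[child]                # reverse containment
        e = nxt
    return e


# Hypothetical document: a root with two children, one of which links to the other.
elements = ["root", "a", "b"]
ce = [("root", "a"), ("root", "b")]
he = [("a", "b")]
ranks = elem_rank(elements, ce, he)
print({v: round(r, 4) for v, r in ranks.items()})
```

Because each element has at most one parent, every element passes on at most `alpha + beta + gamma` of its value per step, so the iteration settles as long as that sum is below 1.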

Ranking Function (1)
Ranking functions should take into account:
- result specificity
- hyperlinks
- keyword proximity
contains(v,k) ⇒ ∃ sequence $(v_1,v_2), \ldots, (v_{n-1},v_n)$ of containment edges with $v_1 = v$ s.t. $v_n$ directly contains k
$r(v,k) = \mathrm{ElemRank}(v_n) \cdot \mathrm{decay}^{\,n-1} \qquad (0 \le \mathrm{decay} \le 1)$
- $\mathrm{ElemRank}(v_n)$: based on hyperlinked structure
- $\mathrm{decay}^{\,n-1}$: result specificity

Ranking Function (2)
- m occurrences of keyword k ⇒ computation of $r_1, \ldots, r_m$ ⇒ $r^*(v,k) = f(r_1, \ldots, r_m)$, with accumulation function f (e.g. max or sum)
- query q consists of keywords $k_1, \ldots, k_n$ ⇒ $R(v,q) = \left(\sum_{i} r^*(v,k_i)\right) \cdot p(v,k_1, \ldots, k_n)$
- p = proximity measure (keyword proximity)
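A short sketch of how the pieces of the ranking function combine. The function names, the decay value, and the choice to pass the proximity measure in as a precomputed number are assumptions; the example ranks 75 and 88 are the ElemRank values shown for "R.E.M." and "Religion" in the inverted-list example of this deck, while the path lengths are read off the Dewey IDs 0.0.0 and 0.0.2.0 relative to element 0.0.

```python
def keyword_rank(leaf_elem_rank, path_length, decay=0.8):
    """r(v, k) = ElemRank(v_n) * decay^(n-1) for a containment path v = v_1, ..., v_n."""
    return leaf_elem_rank * decay ** (path_length - 1)


def overall_rank(occurrences_per_keyword, proximity=1.0, combine=max, decay=0.8):
    """R(v, q) = (sum_i r*(v, k_i)) * p, where r* = f(r_1, ..., r_m) and f = combine.

    occurrences_per_keyword: for each query keyword, a list of
        (ElemRank of the element directly containing it, path length n) pairs.
    proximity: value of the proximity measure p, supplied by the caller here.
    """
    total = 0.0
    for occurrences in occurrences_per_keyword:
        ranks = [keyword_rank(e, n, decay) for e, n in occurrences]
        total += combine(ranks)
    return total * proximity


# Element 0.0 for the query "R.E.M. Religion":
# "R.E.M." occurs in 0.0.0 (n = 2), "Religion" in 0.0.2.0 (n = 3).
score = overall_rank([[(75, 2)], [(88, 3)]], decay=0.5)
print(score)
```

Choosing `combine=sum` instead of `max` rewards elements with many occurrences of a keyword, whereas `max` only counts the best one.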

Example: XML document (element values)
- R.E.M. – Out Of Time
  - Radio Song, 4:12
  - Losing My Religion, 4:26
  - ...
- R.E.M. – Automatic For ...
- ...

XRANK Architecture
[Figure: architecture diagram. Components: XML documents, ElemRank computation, XML elements with ElemRanks, index structures & algorithms, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]

Naïve Approach
Naïve inverted list: contains all XML elements that contain the keyword
    key1: elem 11, elem 12, ...
    key2: elem 21, elem 22, ...
    etc.
Problems:
- space overhead
- spurious results
- inaccurate ranking

Dewey IDs
    0
      0.0
        0.0.0      R.E.M. – Out Of Time
        0.0.1
          0.0.1.0  Radio Song
          0.0.1.1  4:12
        0.0.2
          0.0.2.0  Losing My Religion
          0.0.2.1  4:26
      0.1
        0.1.0      R.E.M. – Automatic For The People
      ...
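With Dewey IDs, the ancestors of an element are exactly the proper prefixes of its ID, so they never need to be stored separately. The sketch below represents IDs as Python tuples (an assumption made here) and contrasts the number of entries a naive inverted list would need with the number of elements that directly contain the keyword; the helper names are hypothetical.

```python
def ancestors(dewey):
    """Dewey IDs of all proper ancestors, from the root down to the parent."""
    return [dewey[:depth] for depth in range(1, len(dewey))]


def naive_entries(direct_ids):
    """Entries a naive inverted list would need: direct containers plus all ancestors."""
    entries = set()
    for dewey in direct_ids:
        entries.add(dewey)
        entries.update(ancestors(dewey))
    return sorted(entries)


direct = [(0, 0, 0), (0, 1, 0)]   # elements directly containing "R.E.M." in the example
print(ancestors((0, 0, 2, 0)))
print(len(naive_entries(direct)), "naive entries vs", len(direct), "direct entries")
```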

DIL – Data Structure
Dewey inverted list (DIL):
- contains the Dewey IDs of all XML elements that directly contain the keyword
- sorted by Dewey ID (ascending)
Keyword “R.E.M.”:
    Dewey ID   ElemRank   position list
    0.0.0      75         [0]
    0.1.0      80         [0]
    …
Keyword “Religion”:
    Dewey ID   ElemRank   position list
    0.0.2.0    88         [2]
    …
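A minimal sketch of building such a list with Python's standard `xml.etree.ElementTree`. The tag names in the sample document are hypothetical (only the text values follow the example), the dash in the titles is written as a plain hyphen, tokenization is a plain whitespace split, and the ElemRank column is left out because it would come from a separate computation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical tag names; only the text values follow the example.
XML = """<collection>
  <album><title>R.E.M. - Out Of Time</title>
    <song><name>Radio Song</name><length>4:12</length></song>
    <song><name>Losing My Religion</name><length>4:26</length></song>
  </album>
  <album><title>R.E.M. - Automatic For The People</title></album>
</collection>"""


def assign_dewey_ids(root):
    """Map each element to its Dewey ID: the parent's ID plus the child's position."""
    ids = {}

    def visit(elem, dewey):
        ids[elem] = dewey
        for position, child in enumerate(elem):
            visit(child, dewey + (position,))

    visit(root, (0,))
    return ids


def build_dil(root):
    """keyword -> sorted [(Dewey ID, position list)] of elements directly containing it."""
    index = defaultdict(list)
    for elem, dewey in assign_dewey_ids(root).items():
        positions = defaultdict(list)
        for pos, word in enumerate((elem.text or "").split()):
            positions[word].append(pos)
        for word, plist in positions.items():
            index[word].append((dewey, plist))
    return {word: sorted(entries) for word, entries in index.items()}


dil = build_dil(ET.fromstring(XML))
print(dil["R.E.M."])
print(dil["Religion"])
```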

DIL – Query Processing (1)–(3)
Key idea: computation of the longest common prefix (lcp) of Dewey IDs
[Figure: three snapshots of the Dewey stack while the inverted lists of two keywords are merged. Columns: Dewey ID component, rank[1], rank[2], posList[1], posList[2], pot_result (y/n). Snapshot 1 holds Dewey ID 0.0.0 with rank[1] values 75, 70, 65. Snapshot 2 holds 0.0.2.0 with rank[2] values 88, 83, 78, 73 after the lcp with the previous ID has been determined. Snapshot 3 holds 0.1.0 with rank[1] values 80, 75 and is annotated with { 0.0, 0 }.]
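The stack-based merge can be sketched without ranks and position lists to show what the longest common prefix is used for. This is a simplified illustration, not the algorithm exactly as presented: the function names are hypothetical, Dewey IDs are tuples, and the result semantics assumed here is that an element is reported when it contains every keyword, after which its keywords are not passed on to its parent.

```python
import heapq


def lcp(a, b):
    """Longest common prefix of two Dewey IDs given as tuples."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def dil_results(lists):
    """One pass over Dewey-sorted inverted lists, one list per keyword."""
    k = len(lists)
    # Merge all entries in ascending Dewey order, remembering the keyword index.
    merged = heapq.merge(*[[(d, i) for d in lst] for i, lst in enumerate(lists)])
    path, seen, results = (), [], []   # seen[j]: keywords found below path[:j + 1]

    def pop_to(depth):
        nonlocal path
        while len(path) > depth:
            found = seen.pop()
            if len(found) == k:
                results.append(path)      # complete: report, do not propagate
            elif seen:
                seen[-1] |= found         # incomplete: hand keywords to the parent
            path = path[:-1]

    for dewey, keyword in merged:
        pop_to(len(lcp(path, dewey)))     # pop components that are not shared
        while len(path) < len(dewey):     # push the remaining components
            path = dewey[:len(path) + 1]
            seen.append(set())
        seen[-1].add(keyword)
    pop_to(0)
    return results


rem = [(0, 0, 0), (0, 1, 0)]      # Dewey IDs for "R.E.M."
religion = [(0, 0, 2, 0)]         # Dewey IDs for "Religion"
print(dil_results([rem, religion]))
```

Because the lists are sorted by Dewey ID, all entries below one element arrive consecutively, so a single stack of the current path is enough and every list is read exactly once.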

RDIL – Data Structure
Ranked Dewey inverted list (RDIL):
- inverted list sorted by ElemRank (descending)
- B+-tree on Dewey IDs, sorted by Dewey ID (ascending)
- each Dewey ID in the list has a position in the B+-tree
Keyword “R.E.M.”:
    Dewey ID   ElemRank
    0.1.0      80
    0.0.0      75
    …

RDIL – Query Processing (1)–(3)
[Figure: inverted lists for key 1, key 2 and key 3 (entries 11–13, 21–23, 31–33), each sorted by ElemRank, with a B+-tree on Dewey IDs. Step 1: lcp with Dewey ID 11 ⇨ result heap. Step 2: lcp with Dewey ID 21 ⇨ result heap, etc. Step 3: ∑ ranking = threshold Ω; max. reachable ranking ≤ Ω.]

RDIL – Query Processing (4)
The RDIL algorithm stops if threshold Ω < lowest ElemRank in the result heap, because
max. reachable ranking ≤ Ω < lowest ElemRank in the result heap
⇨ max. reachable ranking < lowest ElemRank in the result heap!
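The stopping rule can be illustrated with a threshold-style top-m search over rank-sorted lists. This is a deliberately simplified sketch: it treats results as flat objects instead of computing longest common prefixes of Dewey IDs, assumes an object's score is the sum of its per-keyword ranks, and uses dictionaries where RDIL would probe its B+-tree. All names and numbers are hypothetical.

```python
import heapq


def top_m(lists, m):
    """Threshold-style top-m over lists of (rank, object_id) sorted by rank descending."""
    if not lists or any(not lst for lst in lists):
        return []
    lookup = [{obj: rank for rank, obj in lst} for lst in lists]  # random access per list
    heap, scored = [], set()      # heap keeps the m best (score, object_id) seen so far
    for pos in range(max(len(lst) for lst in lists)):
        for lst in lists:         # visit the lists in turn
            if pos >= len(lst):
                continue
            _, obj = lst[pos]
            if obj in scored:
                continue
            scored.add(obj)
            if all(obj in table for table in lookup):
                score = sum(table[obj] for table in lookup)
                heapq.heappush(heap, (score, obj))
                if len(heap) > m:
                    heapq.heappop(heap)
        # Threshold: the best score an object not seen so far could still reach.
        threshold = sum(lst[min(pos, len(lst) - 1)][0] for lst in lists)
        if len(heap) == m and threshold < heap[0][0]:
            break                 # nothing unseen can enter the result heap
    return sorted(heap, reverse=True)


key1 = [(90, "a"), (80, "b"), (10, "c")]
key2 = [(85, "b"), (70, "a"), (5, "c")]
print(top_m([key1, key2], m=1))
```

In the example the scan stops after the second position: the threshold has dropped to 80 + 70 = 150, below the best score found so far, so the entry for `c` is never read.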

XRANK Architecture
[Figure: architecture diagram as before, with DIL / RDIL as the index structures. Components: XML documents, ElemRank computation, XML elements with ElemRanks, DIL / RDIL, data access, Query Evaluator. Input: keyword search query. Output: ranked result list.]

Experimental Results (1)

Experimental Results (2)

Comparison DIL – RDIL
DIL:
- inverted lists sorted by Dewey ID
- extracts the minimum of all remaining Dewey IDs
- all lists are completely scanned
- outperforms RDIL if keyword correlation is low
RDIL:
- inverted lists sorted by ElemRank
- chooses the next list sequentially
- stops if a certain threshold is reached
- outperforms DIL if keyword correlation is high
Both: compute the longest common prefix on Dewey IDs

Conclusion
2 approaches
ELIXIR:
- SQL-like, structure-based search
- extends XML-QL by supporting IR similarity features for ranking
- ranked results based only on textual similarity (even between 2 variables)
XRANK:
- keyword-based search à la Google
- ranked results based on textual similarity
- hierarchical and hyperlinked structure
