
Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada.


1 Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada

2 Content-Oriented XML Retrieval Documents, not data. Natural-language queries, e.g. “text index compression”. Ranked results.

3 Goal In response to a user query, select and return an appropriately ranked mix of document components: paragraphs, sections, subsections, bibliographic entries, articles, etc.

4 IEEE Journal Article in XML Text Compression for Dynamic Document Databases Alistair Moffat Justin Zobel Neil Sharman Abstract For compression of text... INTRODUCTION Modern document databases contain vast quantities of text... There are good reasons to compress the text stored in a... REDUCING MEMORY REQUIREMENTS... 2.1 Method A The first method......

5 XML Tree [Figure: element tree of the article, with nodes including article, fm, bdy, atl, au, abs, p, b, sec, st, ip1 and ss1.]

6 Ranking Document Components 1) Split into components. 2) Treat each as a “document”. 3) Rank using standard IR methods. [Figure: XML tree with components labelled by rank: #1, #6, #2, #103, #8, #4.] Problem: the top ranks are dominated by related components.
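
Steps 1 and 2 on this slide can be sketched in Python as follows. This is only an illustration, assuming the standard-library `xml.etree.ElementTree` parser and a made-up miniature article; the helper name `components` is hypothetical and is not taken from the talk. Each element becomes one "document" whose text includes the text of all its descendants, which is exactly why related components overlap.

```python
import xml.etree.ElementTree as ET


def components(root):
    """Yield (path, text) for every element, using positional paths
    such as /article[1]/bdy[1]/sec[2]."""
    def walk(elem, path):
        # The text of a component includes the text of its descendants.
        text = " ".join(" ".join(elem.itertext()).split())
        yield path, text
        counts = {}
        for child in elem:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            yield from walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    yield from walk(root, f"/{root.tag}[1]")


# Hypothetical miniature article, for illustration only.
doc = ET.fromstring(
    "<article><fm><atl>Text Compression</atl></fm>"
    "<bdy><sec><st>Intro</st><p>Modern databases</p></sec>"
    "<sec><p>Method A</p></sec></bdy></article>"
)
comps = dict(components(doc))
for path, text in comps.items():
    print(path, "->", text)
```

Every path printed here would then be indexed and ranked as if it were a separate document.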

7 e.g. “text index compression”
Score      File                 Component path
32.000923  /co/2000/ry037.xml   /article[1]/bdy[1]
31.861366  /co/2000/ry037.xml   /article[1]
31.083460  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[2]
30.174324  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[5]
29.420393  /tk/1997/k0302.xml   /article[1]
29.250019  /tk/1997/k0302.xml   /article[1]/bdy[1]/sec[1]
29.118382  /tk/1997/k0302.xml   /article[1]/bdy[1]
29.075621  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[3]
28.417294  /tk/1997/k0302.xml   /article[1]/bdy[1]/sec[6]
28.106693  /tp/2000/i0385.xml   /article[1]
27.761749  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[7]
27.686905  /tk/1997/k0302.xml   /article[1]/bdy[1]/sec[3]
27.584927  /tp/2000/i0385.xml   /article[1]/bdy[1]
27.273247  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[4]
27.186977  /tp/2000/i0385.xml   /article[1]/bdy[1]/sec[1]
27.181036  /tk/1997/k0302.xml   /article[1]/bm[1]
27.072521  /tk/1997/k0302.xml   /article[1]/bdy[1]/sec[3]/ss1[1]
26.992224  /co/2000/ry037.xml   /article[1]/bdy[1]/sec[5]/ss1[1]

8 Is overlap always bad? [Figure: the XML tree of the article, with the labels “General overview” and “Topic-focused components”.]

9 Overview Baseline retrieval method Evaluation methodology Re-ranking algorithm

10 Baseline Retrieval Method Treat each component as a separate document. Rank using Okapi BM25. Tune retrieval parameters to task and collection.

11 Okapi BM25 Given a query $Q$, a component $x$ is scored:

$$\sum_{t \in Q} q_t \, \frac{(k_1 + 1)\, x_t}{K + x_t} \, \log\!\left(\frac{D - D_t + 0.5}{D_t + 0.5}\right)$$

where:
$D$ = number of documents in the collection
$D_t$ = number of documents containing $t$
$q_t$ = frequency that $t$ occurs in the query
$x_t$ = frequency that $t$ occurs in $x$
$K = k_1 \left((1 - b) + b \cdot l_x / l_{avg}\right)$
$l_x$ = length of $x$
$l_{avg}$ = average document length
Parameters: $k_1 = 10.0$, $b = 0.80$
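
The formula on this slide could be computed along the following lines. This is a minimal sketch, not the system used in the talk: the function name `bm25_score` and its argument names are hypothetical, term frequencies are passed in as plain dictionaries, and the defaults are the slide's values $k_1 = 10.0$ and $b = 0.80$.

```python
import math


def bm25_score(query_tf, comp_tf, comp_len, avg_len, num_docs, doc_freq,
               k1=10.0, b=0.80):
    """Score one component x against a query Q with the slide's BM25 formula.

    query_tf: {term: q_t}   comp_tf: {term: x_t}   doc_freq: {term: D_t}
    comp_len = l_x, avg_len = l_avg, num_docs = D.
    """
    K = k1 * ((1 - b) + b * comp_len / avg_len)
    score = 0.0
    for term, q_t in query_tf.items():
        x_t = comp_tf.get(term, 0)
        if x_t == 0:
            continue  # a term absent from x contributes nothing
        d_t = doc_freq.get(term, 0)
        idf = math.log((num_docs - d_t + 0.5) / (d_t + 0.5))
        score += q_t * (k1 + 1) * x_t / (K + x_t) * idf
    return score


# Illustrative numbers only: a component of average length, one matching term.
example = bm25_score({"compression": 1}, {"compression": 1},
                     comp_len=500, avg_len=500,
                     num_docs=100, doc_freq={"compression": 10})
print(example)
```

With a component of exactly average length, $K = k_1$, so a single occurrence of a term gives a term-frequency factor of 1 and the score reduces to the logarithmic factor alone.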

12 Overview Baseline retrieval method Evaluation methodology Re-ranking algorithm

13 XML Evaluation - INEX 2004 Third year of the “INitiative for the Evaluation of XML retrieval” (INEX). Encourages research into XML information retrieval technology (similar to TREC). Over 50 participating groups.

14 INEX 2004 Content-Only Retrieval Task Test collection of articles taken from IEEE journals between 1995 and 2002. 40 “adhoc” queries, e.g.: – text index compression – software quality – new fortran 90 compiler. Each group could submit up to three runs, each consisting of the top 1500 components for each query.

15 Relevance Assessment INEX uses two dimensions for relevance: – exhaustivity: the degree to which a component covers a topic – specificity: the degree to which a component is focused on a topic. A four-point scale is used in both dimensions (e.g. a (3,3) component is highly exhaustive and highly specific).

16 INEX 2004 Evaluation Metrics Mean average precision (MAP). XML cumulated gain (XCG). Various quantization functions: – strict quantization: (3,3) elements only – generalized quantization – specificity-oriented generalization.
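
Of the quantization functions listed, only strict quantization is fully specified on the slide, so that is the only one sketched here. The function name is hypothetical, and the assumption that the four-point scale runs from 0 to 3 is inferred from the (3,3) example rather than stated in the talk.

```python
def strict_quantization(exhaustivity, specificity):
    """Map an (exhaustivity, specificity) assessment to a single relevance
    value: 1 for (3,3) components, 0 for everything else."""
    for value in (exhaustivity, specificity):
        if value not in (0, 1, 2, 3):  # assumed 0..3 four-point scale
            raise ValueError("assessments are assumed to lie on a 0..3 scale")
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0


# Hypothetical assessments, for illustration only.
assessments = [(3, 3), (3, 2), (1, 3), (0, 0)]
print([strict_quantization(e, s) for e, s in assessments])
```

Under this mapping a highly exhaustive but only fairly specific component, such as a whole article, counts as non-relevant.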

17 Overview Baseline retrieval method Evaluation methodology Re-ranking algorithm

18 Controlling Overlap Starting with a component ranking generated by the baseline method, elements are re-ranked to control overlap. Scores of those components containing or contained within higher ranking components are iteratively adjusted.

19 Controlling Overlap [Figure: XML tree with components labelled by rank: #1, #2, #4, #6, #10, #87, #203, #21, #863.]

20 Re-ranking Algorithm 1. Report the highest ranking component. 2. Adjust the scores of the unreported components. 3. Repeat steps 1 and 2 until the top m components have been reported.

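21 Adjusting Scores In the BM25 formula

$$\sum_{t \in Q} q_t \, \frac{(k_1 + 1)\, x_t}{K + x_t} \, \log\!\left(\frac{D - D_t + 0.5}{D_t + 0.5}\right)$$

$x_t$ was the frequency that $t$ occurs in component $x$. It is now replaced by an adjusted frequency:

$$x_t = f_t - \alpha \cdot g_t$$

where:
$f_t$ = frequency that $t$ occurs in $x$
$g_t$ = frequency that $t$ has been reported in subcomponents of $x$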

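22 Adjusting Scores ($\alpha = 0.5$) Before anything is reported: the subcomponent has $f_t = 6$, $g_t = 0$, $x_t = 6$; the component containing it has $f_t = 13$, $g_t = 0$, $x_t = 13$. After the subcomponent is reported at #1: the containing component has $f_t = 13$, $g_t = 6$, $x_t = 13 - 0.5 \cdot 6 = 10$.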
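
The arithmetic on this slide can be checked with a few lines of Python. The function name `adjusted_tf` is hypothetical; the formula and the numbers are the ones given on slides 21 and 22.

```python
def adjusted_tf(f_t, g_t, alpha=0.5):
    """Adjusted term frequency x_t = f_t - alpha * g_t."""
    return f_t - alpha * g_t


# Before anything is reported, g_t = 0 and x_t equals the raw frequency.
print(adjusted_tf(6, 0))    # subcomponent
print(adjusted_tf(13, 0))   # containing component
# After the subcomponent (6 occurrences) has been reported:
print(adjusted_tf(13, 6))   # containing component, now discounted
```

With $\alpha = 0$ the adjustment disappears and the baseline ranking is kept; with $\alpha = 1$ every already-reported occurrence is discounted in full.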

23 Re-ranking Algorithm 1. Maintain tree nodes in a priority queue ordered by score. 2. Report the component at the front of the queue. 3. Propagate adjustments up and down the tree, re-ordering nodes in the queue. 4. Repeat steps 2 and 3 until the top m components have been reported.
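
The four steps above might be sketched as follows, using Python's `heapq` with lazy invalidation in place of re-ordering nodes in the queue. Several things here are assumptions rather than details from the talk: the `Node` class, the pluggable `score` callable, the toy density score in the example, and in particular the downward rule, which treats every term in a descendant of a reported component as already reported ($g_t = f_t$). The upward rule adds only the newly reported occurrences to each ancestor's $g_t$, so nothing is counted twice.

```python
import heapq


class Node:
    def __init__(self, name, tf, length, children=()):
        self.name, self.length = name, length
        self.f = dict(tf)   # f_t: term frequencies, descendants' text included
        self.g = {}         # g_t: occurrences already reported
        self.children, self.parent, self.reported = list(children), None, False
        for child in self.children:
            child.parent = self

    def x(self, alpha):
        """Adjusted frequencies x_t = f_t - alpha * g_t."""
        return {t: f - alpha * self.g.get(t, 0) for t, f in self.f.items()}


def descendants(node):
    for child in node.children:
        yield child
        yield from descendants(child)


def rerank(nodes, score, m, alpha=0.5):
    index = {id(n): i for i, n in enumerate(nodes)}
    version = [0] * len(nodes)
    heap = [(-score(n, n.x(alpha)), i, 0, n) for i, n in enumerate(nodes)]
    heapq.heapify(heap)
    reported = []

    def requeue(n):  # push a fresh entry; older entries become stale
        i = index[id(n)]
        version[i] += 1
        heapq.heappush(heap, (-score(n, n.x(alpha)), i, version[i], n))

    while heap and len(reported) < m:
        neg, i, v, n = heapq.heappop(heap)
        if n.reported or v != version[i]:
            continue  # stale queue entry
        n.reported = True
        reported.append((n.name, -neg))
        new = {t: f - n.g.get(t, 0) for t, f in n.f.items()}
        a = n.parent
        while a is not None:  # propagate up
            if not a.reported:
                for t, d in new.items():
                    a.g[t] = a.g.get(t, 0) + d
                requeue(a)
            a = a.parent
        for d in descendants(n):  # propagate down (assumed rule: g_t = f_t)
            if not d.reported:
                d.g = dict(d.f)
                requeue(d)
    return reported


# Toy example: term counts 6 and 13 follow slide 22; lengths are made up.
sec1 = Node("sec1", {"t": 6}, 100)
sec2 = Node("sec2", {"t": 7}, 250)
bdy = Node("bdy", {"t": 13}, 400, [sec1, sec2])
density = lambda node, x: sum(x.values()) / node.length  # stand-in for BM25
ranked = rerank([bdy, sec1, sec2], density, m=3)
print(ranked)
```

In this toy run the unadjusted order would be sec1, bdy, sec2. Once sec1 has been reported, bdy's adjusted frequency drops from 13 to 10 and sec2 overtakes it. Lazy invalidation keeps the heap simple: a node whose score changes is pushed again, and outdated entries are skipped when popped.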

24 Mean Average Precision vs. XCG

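25 Extended Re-Ranking Algorithm One parameter ($\alpha$) is not sufficient. Reported in a descendant ($\alpha$) vs. reported in an ancestor ($\beta$). With $i$ the number of times the term has been reported in an ancestor:

$$1 = \beta_0 \ge \beta_1 \ge \beta_2 \ge \dots \ge \beta_n = 0$$

$$x_t = \beta_i \cdot (f_t - \alpha \cdot g_t)$$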
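
The extended adjustment can be sketched as a small function. The name `extended_tf`, the example values in `betas`, and the choice to clamp $i$ at $n$ when a term has been reported more than $n$ times are all assumptions made for illustration; the slide only fixes $\beta_0 = 1$, $\beta_n = 0$ and the non-increasing order.

```python
def extended_tf(f_t, g_t, i, alpha, betas):
    """x_t = beta_i * (f_t - alpha * g_t).

    i is the number of times the term has been reported in an ancestor;
    betas must satisfy 1 = beta_0 >= beta_1 >= ... >= beta_n = 0.
    """
    if betas[0] != 1 or betas[-1] != 0:
        raise ValueError("need beta_0 = 1 and beta_n = 0")
    if any(earlier < later for earlier, later in zip(betas, betas[1:])):
        raise ValueError("betas must be non-increasing")
    i = min(i, len(betas) - 1)  # assumption: clamp at n
    return betas[i] * (f_t - alpha * g_t)


betas = [1.0, 0.5, 0.0]  # hypothetical values, not from the talk
print(extended_tf(13, 6, 0, 0.5, betas))  # never reported in an ancestor
print(extended_tf(13, 6, 1, 0.5, betas))  # reported once in an ancestor
print(extended_tf(13, 6, 2, 0.5, betas))  # reported twice: fully discounted
```

With $i = 0$ the function reduces to the single-parameter adjustment $x_t = f_t - \alpha \cdot g_t$.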

26 Concluding Comments Overlap isn’t always bad. Overlap should be controlled, not eliminated. Current and future work: – modifying $k_1$ to reflect term frequency adjustments – more appropriate evaluation methodology – learning $\alpha$, $\beta_i$.

27 Questions? Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada

