XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008
An Adaptive XML Retrieval System
Yosi Mass, Michal Shmueli-Scheuer
IBM Haifa Research Lab

The XML retrieval tasks
Query formulation
- CO: Content only
- CAS: Content and structure (NEXI)
Retrieval tasks
- Thorough: "find all highly exhaustive and specific elements"
  Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query.
- Focussed: "find the most exhaustive and specific element in a path"
  No overlap in returned results.

Approaches for XML retrieval
- Index full documents: score documents, and then components inside the documents.
  Problem: works well for "fetch and browse" but not for the general thorough task.
- Index only leaf elements: score leaves and propagate scores along the XML tree.
  Problem: the weights used to propagate are either set manually by the user or set empirically.
- Index all elements into the same index: score all possible elements.
  Problem: distorted "element-level" statistics due to overlapping.
Can we fix the distorted statistics?

An adaptive XML retrieval system
- Split all collection elements into separate indices such that:
  - Coverage: each element is indexed in at least one index.
  - No overlap: elements in each index do not nest.
- Run the query on each index.
- Merge the results into a single result list.
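
The two requirements on the split (coverage and no overlap) can be checked mechanically. The following is a minimal sketch, not the authors' code. It assumes each element is identified by its XPath written as a tuple of steps, and that all paths belong to one document (across documents, a document identifier would have to be added as a first step). The names `nests`, `has_overlap`, and `covers` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

# Assumption: an element is identified by its XPath steps,
# e.g. ("article[1]", "bdy[1]", "sec[2]").
Path = Tuple[str, ...]


def nests(outer: Path, inner: Path) -> bool:
    """Return True if `inner` is a proper descendant of `outer`."""
    return len(outer) < len(inner) and inner[:len(outer)] == outer


def has_overlap(index: Iterable[Path]) -> bool:
    """Return True if any element of the index nests inside another one."""
    paths = list(index)
    return any(nests(a, b) for a in paths for b in paths)


def covers(indices: List[Set[Path]], elements: Iterable[Path]) -> bool:
    """Return True if every element appears in at least one index."""
    indexed: Set[Path] = set().union(*indices)
    return all(e in indexed for e in elements)
```

An index that holds only elements of the same depth can never overlap, which is why grouping elements by path length satisfies the second requirement.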

Split to indices - example
Two example documents (drawn as trees on the slide):
- Document 1: article[1] > bdy[1] > sec[1], sec[2]; sec[2] > p[1], p[3]
- Document 2: article[1] > bdy[1] > sec[1]; sec[1] > ss1[1], ss1[2]
Index 0:
- Document 1: /article[1]
- Document 2: /article[1]
Index 1:
- Document 1: /article[1]/bdy[1]
- Document 2: /article[1]/bdy[1]
Index 2:
- Document 1: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
- Document 2: /article[1]/bdy[1]/sec[1]
Index 3:
- Document 1: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
- Document 2: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2]

An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
- Find all leaves in doc that are larger than minCompSize.
- If no such leaves are found, return G_0 = {root}.
- Let d be the length of the longest path among all those leaves.
- Create groups {G_0, …, G_{d-1}}, where each G_i contains all elements inferred from the XPath prefixes of length i of all matched leaves.
- Remove repeating elements in each group.
- Split the groups {G_0, …, G_{d-1}} into indices {I_0, …, I_{nInd-1}} (several strategies).
- Return {I_0, …, I_{nInd-1}}.
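
The steps of SplitToIndices can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation. The input is assumed to be a non-empty mapping from each leaf path (a tuple of XPath steps) to its size. Group `i` is taken to hold the prefixes with `i + 1` steps, so that group 0 holds the root. The slides mention several strategies for mapping groups to indices without naming them; the one used here (keep the shallowest `n_ind // 2` groups and the deepest remaining ones, dropping the middle levels) is a hypothetical choice.

```python
from typing import Dict, List, Set, Tuple

Path = Tuple[str, ...]  # e.g. ("article[1]", "bdy[1]", "sec[2]")


def split_to_indices(leaf_sizes: Dict[Path, int],
                     min_comp_size: int,
                     n_ind: int) -> List[Set[Path]]:
    """Group ancestors of large-enough leaves by depth, then pick index groups.

    Assumes `leaf_sizes` is non-empty and all paths share the same root.
    """
    # Keep only the leaves that are larger than min_comp_size.
    leaves = [p for p, size in leaf_sizes.items() if size > min_comp_size]
    if not leaves:
        # No qualifying leaf: a single group that holds only the root.
        any_path = next(iter(leaf_sizes))
        return [{any_path[:1]}]

    # d is the number of steps in the longest qualifying leaf path.
    d = max(len(p) for p in leaves)

    # groups[i] holds every prefix with i + 1 steps; sets drop repeats.
    groups: List[Set[Path]] = [set() for _ in range(d)]
    for p in leaves:
        for i in range(len(p)):
            groups[i].add(p[:i + 1])

    if d <= n_ind:
        # One group per index; fewer than n_ind indices are returned.
        return groups

    # Assumed strategy (hypothetical): keep the top and bottom levels.
    top = n_ind // 2
    bottom = n_ind - top
    return groups[:top] + groups[d - bottom:]
```

Because every group contains elements of a single depth, no element in an index can nest inside another. Dropping the middle levels instead of merging them keeps that property, at the cost of leaving some in-between elements unindexed.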

Examples - cut long paths
Minimal element:
/article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
Split to indices:
index 0: /article[1]
index 1: /article[1]/body[1]
index 2: /article[1]/body[1]/section[7]
index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]

Experiments
IEEE collection 1995-2004
- 17,000 articles, 700MB
- Average document length ~41K
- Average depth 6.9
- 29 topics from INEX 2005
Wikipedia collection
- 660,000 pages, 4.5GB
- Average document length 6.8K
- Average depth 6.72
- 111 topics from INEX 2006

Coverage
For nInd=7 and minCompSize=10:
- 87% coverage of the IEEE collection recall base
- 75% coverage of the Wikipedia collection filtered recall base
  The filtered recall base was generated by removing all link elements from the recall base.
We still miss some small elements and some in-between elements that have depth > 7.

Doc pivot
Some low-level indices hold only part of the collection's content and therefore have missing statistics.
Solution: compensate with the score of the containing document:
Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
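
The doc pivot formula is a linear interpolation between the element score and the score of its containing document. A minimal sketch follows. The function name `pivot_score` is hypothetical, and restricting `doc_pivot` to the range [0, 1] is an assumption; the slides do not state the allowed range or the value used.

```python
def pivot_score(element_score: float, doc_score: float, doc_pivot: float) -> float:
    """Blend an element's score with the score of its containing document.

    Implements Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e).
    """
    # Assumption: doc_pivot is an interpolation weight between 0 and 1.
    if not 0.0 <= doc_pivot <= 1.0:
        raise ValueError("doc_pivot is assumed to lie in [0, 1]")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score
```

With `doc_pivot = 0` the element score is returned unchanged, and with `doc_pivot = 1` every element inherits the score of its document.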

Elements distribution (figure)

Tuning the number of indices (nInd)
Set minCompSize=10 (figure)

Tuning the minimal component size (minCompSize)
Set number of indices nInd=7 (figure)

Summary
Adaptive indexing schema
- Split XML elements into separate indices
- Same parameters for different collections
XML retrieval system
- Achieved by running existing IR engines on each index
Can be used for CAS
Relatively low MAep results
- Does XML structure reflect any semantic structure?

Thank you!
