A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort
  Toshiyuki Shimizu (Kyoto University), Masatoshi Yoshikawa (Kyoto University)
  ICADL 2007, 12th December
XML-IR Systems
  Growing demand for XML Information Retrieval (XML-IR) systems
    Encoding documents in XML lets us identify meaningful document fragments
      ex) Sections, subsections, and paragraphs in scholarly articles
    Users can browse only the document fragments relevant to a certain topic
  The simplest form of query for XML-IR is just a set of keywords
    A simple, intuitively understandable, yet useful form of query, especially for unskilled end-users
  Active research area, as in INEX*
  * INitiative for the Evaluation of XML Retrieval (http://inex.is.informatik.uni-duisburg.de/)
Results of XML-IR Systems
  <?xml version="1.0"?>
  <article>
    <sec>
      <p>XML labeling</p>
      <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
      <p>We can get tag name of each XML element.</p>
    </sec>
    <sec>
      <p>Tree index</p>
      <p>XML index is constructed using the labels</p>
    </sec>
  </article>
  A result is a document fragment (element) with a relevance degree (score)
  ex) Query term was “XML”
  [Figure: element tree with scores]
    e0 article 0.56
      e1 sec 0.64
        e2 p 0.4
        e3 p 0.9
        e4 p 0.33
      e5 sec 0.35
        e6 p (no score shown)
        e7 p 0.8
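The labels e0 to e7 in the figure follow document order. A minimal Python sketch of that labeling, assuming the second section of the example is wrapped in its own <sec> element (as the tree figure suggests); the function name label_elements is illustrative and not from the slides:

```python
import xml.etree.ElementTree as ET

# Example document from the slide. The second <sec> wrapper is an assumption
# taken from the tree figure (e5 is drawn as a sec with children e6 and e7).
DOC = """<article>
  <sec>
    <p>XML labeling</p>
    <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
    <p>We can get tag name of each XML element.</p>
  </sec>
  <sec>
    <p>Tree index</p>
    <p>XML index is constructed using the labels</p>
  </sec>
</article>"""


def label_elements(xml_text):
    """Return [(label, tag, text)] with labels e0, e1, ... in document order."""
    root = ET.fromstring(xml_text)
    labeled = []
    # Element.iter() visits the root and all descendants in document order.
    for i, element in enumerate(root.iter()):
        # Join all nested text and normalize whitespace.
        text = " ".join("".join(element.itertext()).split())
        labeled.append((f"e{i}", element.tag, text))
    return labeled


for label, tag, text in label_elements(DOC):
    print(label, tag, text[:40])
```

Every element, from the whole article down to a single paragraph, becomes a candidate retrieval unit; a scoring function would then assign each one a relevance degree.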
Naïve XML-IR System
  Thorough strategy of INEX 2005
    Simply retrieves relevant elements from all elements and ranks them in order of relevance
    Result: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
  [Figure: element tree of the running example with scores]
  The Thorough strategy is considered for system evaluation
  User behavior when browsing search results must be considered
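The Thorough ranking above amounts to a sort by score. A small Python sketch using the scores of the running example (e6 is omitted because the slide shows no score for it; the function name thorough is illustrative):

```python
# Relevance scores of the running example, as shown on the slide.
SCORES = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e7": 0.8,
}


def thorough(scores, k=None):
    """Rank all scored elements by descending score; optionally keep the top k."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if k is None else ranked[:k]


print(thorough(SCORES))
```

Note that the ranking ignores the tree structure entirely, which is why nested elements such as e3 and e1 can both appear near the top.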
Problems of Thorough Retrieval for XML-IR
  Nesting elements
    Browsing both an ancestor element ea and its descendant element ed is useless
      Reading ea, then ed: ed has already been fully seen
      Reading ed, then ea: ea has already been partially seen
  Element size
    Elements retrieved by XML-IR systems vary widely in size
      Large elements, such as article (whole document)
      Small elements, such as p (paragraph)
    The total output size of the top-k elements cannot be controlled simply by giving an integer k
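The nesting problem can be made concrete by listing the ancestor/descendant pairs inside a top-k list. A Python sketch over the running example; the PARENT map encodes the tree from the figure, and the helper names are illustrative:

```python
# Parent of each element in the running example (e0 is the root).
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}

# Thorough ranking of the running example.
RANKED = ["e3", "e7", "e1", "e0", "e2", "e5", "e4"]


def ancestors(element):
    """Return the proper ancestors of an element, nearest first."""
    chain = []
    while element in PARENT:
        element = PARENT[element]
        chain.append(element)
    return chain


def overlapping_pairs(top_k):
    """Return (ancestor, descendant) pairs that both occur in top_k."""
    chosen = set(top_k)
    return [(a, e) for e in top_k for a in ancestors(e) if a in chosen]


print(overlapping_pairs(RANKED[:4]))
```

With k = 2 the result is free of overlap, but with k = 4 the list already contains four nested pairs, so part of the content would be shown more than once.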
Overview of our Approach
  Introduction of the concepts of benefit and reading effort
    Users can control the total output size
    Systems can retrieve non-overlapping elements
Properties of Benefit and Reading Effort (1/2)
  The benefit of an element is the amount of gain about the query obtained by reading the element
  Assumption 1: The benefit of an element is greater than or equal to the sum of the benefits of its child elements
    Information complementation among sibling elements
    ex) For two query terms A and B
      e6 contains topics about A
      e7 contains topics about B
      The benefit of e5 seems to be greater than the sum of the benefits of e6 and e7
  [Figure: e5 (sec) with child elements e6 (p) and e7 (p)]
Properties of Benefit and Reading Effort (2/2)
  The reading effort of an element is the cost of reading the content of the element
  Assumption 2: The reading effort of an element is less than or equal to the sum of the reading efforts of its child elements
    Readability of continuous reading
    ex) Users can read the same content more easily by reading e5 as a whole rather than e6 and e7 separately
  [Figure: e5 (sec) with child elements e6 (p) and e7 (p)]
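Assumptions 1 and 2 can be checked mechanically on a tree. The Python sketch below does so for the running example. The benefit and reading effort values are assumptions, not taken from the slides: they were chosen so that benefit / reading effort reproduces the scores shown in the figures (reading the score as that ratio is itself an assumption), and the values for e6 are placeholders.

```python
# Children of each inner element in the running example.
CHILDREN = {"e0": ["e1", "e5"], "e1": ["e2", "e3", "e4"], "e5": ["e6", "e7"]}

# Hypothetical values (not given on the slides); e6 is a placeholder.
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def check_assumptions(children, benefit, effort):
    """Return a list of (element, property) pairs that violate an assumption."""
    violations = []
    for parent, kids in children.items():
        # Assumption 1: benefit(parent) >= sum of the children's benefits.
        if benefit[parent] < sum(benefit[k] for k in kids):
            violations.append((parent, "benefit"))
        # Assumption 2: effort(parent) <= sum of the children's reading efforts.
        if effort[parent] > sum(effort[k] for k in kids):
            violations.append((parent, "reading effort"))
    return violations


print(check_assumptions(CHILDREN, BENEFIT, EFFORT))
```

An empty list means both assumptions hold everywhere in the tree; lowering the benefit of e5 below that of e7, for example, would be reported as a violation of Assumption 1.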
Overview of our Approach
  Introduction of the concepts of benefit and reading effort
    Users can control the total output size
    Systems can retrieve non-overlapping elements
  Flexible retrieval
    Users specify a threshold for the total amount of reading effort
    Systems return relevant elements that provide larger benefit and that can be read within the specified reading effort
Flexible Retrieval
  Systems calculate benefit and reading effort
  A variant of the knapsack problem
  ex) Threshold of reading effort: 15
    Retrieve {e2, e3} (Total benefit: 11)
  [Figure: element tree of the running example]
Flexible Retrieval
  Systems calculate benefit and reading effort
  A variant of the knapsack problem
  ex) Threshold of reading effort: 20
    Retrieve {e3, e7} (Total benefit: 17)
  [Figure: element tree of the running example]
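For a tree this small, the knapsack variant can be solved by exhaustive search: pick the non-overlapping element set with the largest total benefit whose total reading effort stays within the threshold. The Python sketch below is only an illustration of the problem statement, not the retrieval method of the talk. The benefit and reading effort values are assumptions chosen to be consistent with the totals on the slides (e6 is a placeholder).

```python
from itertools import combinations

PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# Hypothetical values (not given on the slides); e6 is a placeholder.
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def ancestors(element):
    """Return the proper ancestors of an element, nearest first."""
    chain = []
    while element in PARENT:
        element = PARENT[element]
        chain.append(element)
    return chain


def overlaps(a, b):
    """Two elements overlap if one is an ancestor of the other."""
    return a in ancestors(b) or b in ancestors(a)


def best_set(threshold):
    """Exhaustively find the best non-overlapping set within the threshold."""
    best, best_benefit = (), 0
    elements = sorted(BENEFIT)
    for size in range(1, len(elements) + 1):
        for subset in combinations(elements, size):
            if any(overlaps(a, b) for a, b in combinations(subset, 2)):
                continue  # nested elements are not allowed together
            if sum(EFFORT[e] for e in subset) > threshold:
                continue  # exceeds the reading effort threshold
            total = sum(BENEFIT[e] for e in subset)
            if total > best_benefit:
                best, best_benefit = subset, total
    return set(best), best_benefit


print(best_set(15))
print(best_set(20))
```

Under these assumed values the search returns {e2, e3} with benefit 11 for threshold 15 and {e3, e7} with benefit 17 for threshold 20, matching the two examples. The search is exponential in the number of elements, so it only serves as a reference point for small trees.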
Search Result Continuity
  ex) Reading effort 15: retrieve {e2, e3} (benefit: 11)
      Reading effort 20: retrieve {e3, e7} (benefit: 17)
    The running example violates search result continuity
  Search result continuity: the content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r <= r'
  The optimal solution
    is NP-hard to compute (a variant of the knapsack problem)
    may violate search result continuity
  Greedy retrieval algorithm
  [Figure: element tree of the running example]
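The continuity condition can be tested at the element level: every element retrieved for the smaller reading effort must be present in, or nested inside, an element retrieved for the larger one. The Python sketch below makes that simplification (it compares elements rather than raw text content); the function names are illustrative.

```python
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}


def covered(element, result):
    """True if the element itself or one of its ancestors is in result."""
    node = element
    while node is not None:
        if node in result:
            return True
        node = PARENT.get(node)  # None once the root has been passed
    return False


def is_continuous(smaller, larger):
    """True if the content of `smaller` is contained in the content of `larger`."""
    return all(covered(e, larger) for e in smaller)


# Results of the running example for reading effort 15 and 20.
print(is_continuous({"e2", "e3"}, {"e3", "e7"}))
```

For the running example the check fails because e2 is dropped when the threshold grows from 15 to 20, which is exactly the violation noted above.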
Retrieval Algorithm
  Based on the result of the Thorough strategy*
  Adjust the benefit and reading effort of the nesting elements of a retrieved element, and rerank
  Remove content that overlaps because of nesting
  * Simply retrieves relevant elements from all elements and ranks them in order of relevance
  Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
  [Figure: element tree of the running example with scores]
Retrieval Algorithm
  Threshold of reading effort: 40
  Our result: e3 (0.9)
    Amount of benefit: 9
    Amount of reading effort: 10
  Adjust e1 and e0: e1 (0.64 -> 0.5), e0 (0.56 -> 0.48)
  Remaining list: e7 (0.8), e1 (0.5), e0 (0.48), e2 (0.4), e5 (0.35), e4 (0.33)
  [Figure: element tree of the running example with adjusted scores]
Retrieval Algorithm
  Threshold of reading effort: 40
  Our result: e3 (0.9), e7 (0.8)
    Amount of benefit: 9 -> 17
    Amount of reading effort: 10 -> 20
  Adjust and rerank e5 and e0: e5 (0.35 -> 0), e0 (0.48 -> 0.37)
  Remaining list: e1 (0.5), e2 (0.4), e0 (0.37), e4 (0.33), e5 (0)
  [Figure: element tree of the running example with adjusted scores]
Retrieval Algorithm
  Threshold of reading effort: 40
  Our result: e3 (0.9), e7 (0.8), e1 (0.5)
    Amount of benefit: 17 -> 26
    Amount of reading effort: 20 -> 38
  Adjust and rerank e0: e0 (0.37 -> 0.17)
  Remaining list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)
  [Figure: element tree of the running example with adjusted scores]
Retrieval Algorithm
  Threshold of reading effort: 40
  Remove e3 from our result, because its content is contained in e1
  Our result: e7 (0.8), e1 (0.5)
    Amount of benefit: 26
    Amount of reading effort: 38
  Remaining list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0)
  [Figure: element tree of the running example with adjusted scores]
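The walkthrough above can be reproduced with a short greedy loop. The Python sketch below is one possible reading of the steps, not the authors' implementation. Assumptions: the score is benefit / reading effort; the benefit and reading effort values are hypothetical numbers chosen to reproduce the scores and totals on the slides (e6 is a placeholder); descendants of a retrieved element are dropped from the candidate list; and the loop stops at the first candidate that does not fit within the threshold.

```python
PARENT = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# Hypothetical values (not given on the slides); e6 is a placeholder.
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def is_ancestor(a, e):
    """True if a is a proper ancestor of e."""
    while e in PARENT:
        e = PARENT[e]
        if e == a:
            return True
    return False


def greedy_retrieve(threshold):
    """Return (result elements, total benefit, total reading effort)."""
    result, candidates = [], set(BENEFIT)
    total_benefit = total_effort = 0

    def gain(e):
        # Benefit and effort that e adds beyond its already retrieved descendants.
        inside = [x for x in result if is_ancestor(e, x)]
        return (BENEFIT[e] - sum(BENEFIT[x] for x in inside),
                EFFORT[e] - sum(EFFORT[x] for x in inside))

    def score(e):
        b, r = gain(e)
        return b / r if r > 0 else 0.0

    while candidates:
        # Rerank with adjusted scores; sorting first makes ties deterministic.
        best = max(sorted(candidates), key=score)
        b, r = gain(best)
        if b <= 0 or total_effort + r > threshold:
            break  # assumed stopping rule: nothing to gain, or does not fit
        # Retrieved descendants are replaced by the new element.
        result = [x for x in result if not is_ancestor(best, x)] + [best]
        total_benefit += b
        total_effort += r
        # Drop the element and its descendants from the candidate list.
        candidates = {c for c in candidates
                      if c != best and not is_ancestor(best, c)}
    return result, total_benefit, total_effort


print(greedy_retrieve(40))
```

With threshold 40 the loop retrieves e3, then e7, then replaces e3 by e1, ending with e7 and e1 at benefit 26 and reading effort 38, as in the walkthrough. Stopping at the first candidate that does not fit means the steps taken for a smaller threshold are a prefix of the steps for a larger one, which is one way to keep search result continuity.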
Evaluation Metrics
  Based on benefit and reading effort
  b/e graph (benefit/effort graph)
    Comparison with BTIL (Best Thorough Input List)
      The BTIL system is a system that uses actual benefit and reading effort
      Actual benefit is calculated using manually constructed assessments (e.g. INEX)
    We can observe the relative effectiveness in terms of benefit as the specified threshold of reading effort changes
    The implemented system and the BTIL system use the same values for reading effort
For the threshold value 30 of reading effort
  [Figure: two element trees of the running example, one annotated with calculated benefit / reading effort, the other with actual benefit / reading effort]
  BTIL system retrieves {e3, e6}
    Obtained actual benefit is 23
  Implemented system retrieves {e3, e7}
    Obtained actual benefit is 10
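One way to turn this comparison into a single number per threshold is the ratio of the actual benefit obtained by the implemented system to that obtained by the BTIL system. The slides do not spell out the exact quantity plotted in the b/e graph, so the ratio in this Python sketch is an assumption. The per-element actual benefits are also hypothetical: only the totals 23 and 10 come from the example.

```python
# Hypothetical per-element actual benefits; only the totals
# 23 for {e3, e6} and 10 for {e3, e7} come from the example.
ACTUAL_BENEFIT = {"e3": 10, "e6": 13, "e7": 0}


def actual_benefit(elements):
    """Total actual benefit of a set of retrieved elements."""
    return sum(ACTUAL_BENEFIT[e] for e in elements)


def benefit_ratio(system_result, btil_result):
    """Actual benefit of the system relative to the BTIL system."""
    best = actual_benefit(btil_result)
    return actual_benefit(system_result) / best if best else 0.0


print(benefit_ratio({"e3", "e7"}, {"e3", "e6"}))
```

Computing this ratio for a range of thresholds would give one curve per topic, with 1.0 meaning the implemented system matches the BTIL system at that reading effort.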
Examples of b/e Graph using INEX 2005 Test Collection (1/2)
  INEX 2005 test collection: XML document set, topics, assessments
  Calculate actual benefit and reading effort from the assessments
    ex (Exhaustivity): Highly exhaustive (HE) 1, Partially exhaustive (PE) 0.5, Not exhaustive (NE) 0
    rsize: relevant text length (in number of characters)
    size: element length (in number of characters)
  We implemented a system using tf-ief
    ief stands for inverse element frequency
    Satisfies the assumptions for benefit and reading effort
    : parameter
Examples of b/e Graph using INEX 2005 Test Collection (2/2)
  [Figures: b/e graphs for Topic 207 and Topic 206]
  We can observe the relative effectiveness of the implemented systems against the BTIL system
Conclusions and Future Works
  Introduction of benefit and reading effort
    Handling nesting elements
    Variety of element size
  Algorithm for flexible retrieval
    Result elements change depending on the specified reading effort
  System evaluation
  Future Works
    Introduction of switching effort
      Cost of switching a result item in the results list
      Retrieving numerous results increases the cost of browsing
    Integration with user interface