Toshiyuki Shimizu (Kyoto University)

Presentation transcript:

A Ranking Scheme for XML Information Retrieval Based on Benefit and Reading Effort
Toshiyuki Shimizu (Kyoto University), Masatoshi Yoshikawa (Kyoto University)
ICADL 2007, 12th December

XML-IR Systems
There is growing demand for XML Information Retrieval (XML-IR) systems.
Encoding documents in XML lets us identify meaningful document fragments, e.g. sections, subsections and paragraphs in scholarly articles.
Users can then browse only the document fragments relevant to a certain topic.
The simplest form of query for XML-IR is just a set of keywords: simple, intuitively understandable, yet useful, especially for unskilled end users.
This is an active research area, as seen in INEX (INitiative for the Evaluation of XML Retrieval, http://inex.is.informatik.uni-duisburg.de/).

Results of XML-IR Systems
An XML-IR system returns document fragments (elements), each with a relevance degree (score).
Example document:
  <?xml version="1.0"?>
  <article>
    <sec>
      <p>XML labeling</p>
      <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
      <p>We can get tag name of each XML element.</p>
    </sec>
    <sec>
      <p>Tree index</p>
      <p>XML index is constructed using the labels</p>
    </sec>
  </article>
Scores for the query term "XML":
  e0  article  0.56
  e1  sec      0.64  (children: e2, e3, e4)
  e2  p        0.4
  e3  p        0.9
  e4  p        0.33
  e5  sec      0.35  (children: e6, e7)
  e6  p        (no score)
  e7  p        0.8

Naïve XML-IR System
The Thorough strategy of INEX 2005 simply retrieves relevant elements from all elements and ranks them in order of relevance.
Result for the running example: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33).
Thorough is intended for system evaluation. The user's behavior when browsing search results must also be considered.
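As a minimal sketch of the Thorough strategy described on this slide, the ranking can be reproduced by sorting all scored elements by relevance and ignoring nesting. The scores are the ones shown in the running example; the function name `thorough` is hypothetical.

```python
# Relevance scores from the running example (query term "XML").
# e6 has no score on the slide, so it is left out as non-relevant.
scores = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e7": 0.8,
}


def thorough(scores):
    """Rank every scored element by descending relevance, ignoring nesting."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = thorough(scores)
print(ranking)
```

The printed order matches the list on the slide, with nested elements such as e3 and e1 both present.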

Problems of Thorough Retrieval for XML-IR
Nesting elements: browsing both elements of a nested pair is redundant.
  Ancestor element ea → descendant element ed: ed has already been fully seen.
  Descendant element ed → ancestor element ea: ea has already been partially seen.
Element size: elements retrieved by XML-IR systems vary widely in size.
  Large elements, such as article (the whole document).
  Small elements, such as p (a paragraph).
  The total output size of the top-k elements cannot be controlled simply by giving an integer k.
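To make the nesting problem concrete, the following sketch lists the ancestor/descendant pairs that appear together in a Thorough ranking. The tree is the one from the running example; the helper names `ancestors` and `nested_pairs` are hypothetical.

```python
# Parent links of the running example tree (e0 is the root article).
parent = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}


def ancestors(node):
    """Return the ancestors of node, nearest first."""
    out = []
    while node in parent:
        node = parent[node]
        out.append(node)
    return out


def nested_pairs(ranking):
    """Return (ancestor, descendant) pairs that both occur in the ranking."""
    listed = set(ranking)
    return [(a, d) for d in ranking for a in ancestors(d) if a in listed]


top4 = ["e3", "e7", "e1", "e0"]  # top-4 of the Thorough result
pairs = nested_pairs(top4)
print(pairs)
```

In the top-4 list alone, e3 is repeated inside e1 and e0, and e7 and e1 are repeated inside e0, so a user reading all four items sees the same text several times.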

Overview of our Approach
Introduction of the concepts of benefit and reading effort.
  Users can control the total output size.
  Systems can retrieve non-overlapping elements.

Properties of Benefit and Reading Effort (1/2)
The benefit of an element is the amount of information about the query gained by reading the element.
Assumption 1: The benefit of an element is greater than or equal to the sum of the benefits of its child elements.
Rationale: information complementation among sibling elements.
Example: for two query terms A and B, e6 contains topics about A and e7 contains topics about B. The benefit of e5 (the sec element containing e6 and e7) then seems to be greater than the sum of the benefits of e6 and e7.

Properties of Benefit and Reading Effort (2/2)
The reading effort of an element is the cost of reading the content of the element.
Assumption 2: The reading effort of an element is less than or equal to the sum of the reading efforts of its child elements.
Rationale: readability of continuous reading.
Example: users can read the same content more easily by reading e5 than by reading e6 and e7 separately.
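The two assumptions can be checked mechanically over a tree. The sketch below does so for the running example. The benefit and reading effort values are hypothetical: only some of them (e.g. benefit 9 and effort 10 for e3) can be read off the later slides, and the rest were chosen to be consistent with the scores shown there.

```python
# Children of each inner element in the running example.
children = {"e0": ["e1", "e5"], "e1": ["e2", "e3", "e4"], "e5": ["e6", "e7"]}

# Hypothetical benefit / reading effort values (assumptions, see lead-in).
benefit = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
effort = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def violations(children, benefit, effort):
    """List (element, property) pairs that break Assumption 1 or 2."""
    bad = []
    for p, kids in children.items():
        # Assumption 1: benefit(parent) >= sum of the children's benefits.
        if benefit[p] < sum(benefit[k] for k in kids):
            bad.append((p, "benefit"))
        # Assumption 2: effort(parent) <= sum of the children's efforts.
        if effort[p] > sum(effort[k] for k in kids):
            bad.append((p, "effort"))
    return bad


print(violations(children, benefit, effort))
```

An empty list means both assumptions hold for every parent in the tree.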

Overview of our Approach
Introduction of the concepts of benefit and reading effort.
  Users can control the total output size.
  Systems can retrieve non-overlapping elements.
Flexible retrieval.
  Users specify a threshold for the total amount of reading effort.
  The system returns relevant elements that provide larger benefit and that can be read within the specified reading effort.

Flexible Retrieval
Systems calculate the benefit and reading effort of each element.
Choosing the elements to return is a variant of the knapsack problem.
Example: threshold of reading effort 15 → retrieve {e2, e3} (total benefit: 11).
Example: threshold of reading effort 20 → retrieve {e3, e7} (total benefit: 17).
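The knapsack view can be illustrated with a brute-force search over non-overlapping element sets. This is only a sketch for the eight-element example, not the retrieval method of the talk, and the benefit and reading effort values are hypothetical numbers chosen to agree with the totals on the slides.

```python
from itertools import combinations

parent = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# Hypothetical values, chosen to be consistent with the slide totals.
benefit = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
effort = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def is_ancestor(a, d):
    """True if a is a proper ancestor of d."""
    while d in parent:
        d = parent[d]
        if d == a:
            return True
    return False


def overlaps(a, b):
    return is_ancestor(a, b) or is_ancestor(b, a)


def best_set(threshold):
    """Exhaustively find the non-overlapping set with maximum benefit."""
    best, best_benefit = (), 0
    names = sorted(benefit)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue
            if sum(effort[e] for e in combo) > threshold:
                continue
            total = sum(benefit[e] for e in combo)
            if total > best_benefit:
                best, best_benefit = combo, total
    return set(best), best_benefit


print(best_set(15))
print(best_set(20))
```

With these values the search returns {e2, e3} with benefit 11 at threshold 15 and {e3, e7} with benefit 17 at threshold 20, as in the two examples. Exhaustive search is exponential in the number of elements, so it is usable only on toy inputs.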

Search Result Continuity
Running example: reading effort 15 → retrieve {e2, e3} (benefit: 11); reading effort 20 → retrieve {e3, e7} (benefit: 17).
The running example violates search result continuity.
Search result continuity: the content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r <= r'.
Finding the optimal solution is NP-hard (a variant of the knapsack problem), and the optimal solution may violate search result continuity.
We therefore use a greedy retrieval algorithm.
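A small check of the continuity condition might look as follows. It assumes that "content contained" means every element of the smaller result is either in the larger result or nested inside one of its elements; the function names are hypothetical.

```python
parent = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}


def covered(elem, result):
    """True if elem, or one of its ancestors, is in result."""
    node = elem
    while True:
        if node in result:
            return True
        if node not in parent:
            return False
        node = parent[node]


def is_continuous(smaller, larger):
    """Check that the content of `smaller` is contained in `larger`."""
    return all(covered(e, larger) for e in smaller)


# Optimal sets of the running example for thresholds 15 and 20.
print(is_continuous({"e2", "e3"}, {"e3", "e7"}))
# A nested replacement keeps continuity: e3 lies inside e1.
print(is_continuous({"e3", "e7"}, {"e7", "e1"}))
```

The first call reports a violation because e2 disappears when the threshold grows from 15 to 20. The second call succeeds because replacing e3 with its ancestor e1 keeps all previously shown content.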

Retrieval Algorithm
The algorithm is based on the result of the Thorough strategy (which simply retrieves relevant elements from all elements and ranks them in order of relevance).
  Adjust the benefit and reading effort of the elements nested with a retrieved element, and rerank.
  Remove overlapping content caused by nesting.
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33).

Retrieval Algorithm (example run, threshold of reading effort: 40)
Step 1: Retrieve e3 (0.9). Our result: e3. Amount of benefit: 9. Amount of reading effort: 10. Adjust e1 (0.64 → 0.5) and e0 (0.56 → 0.48).

Step 2: Retrieve e7 (0.8). Our result: e3, e7. Amount of benefit: 17. Amount of reading effort: 20. Adjust and rerank e5 (0.35 → 0) and e0 (0.48 → 0.37).

Step 3: Retrieve e1 (0.5). Our result: e3, e7, e1. Amount of benefit: 26. Amount of reading effort: 38. Adjust and rerank e0 (0.37 → 0.17).

Step 4: Remove e3, whose content is nested in e1. Our result: e7 (0.8), e1 (0.5). Amount of benefit: 26. Amount of reading effort: 38. Remaining Thorough list: e2 (0.4), e4 (0.33), e0 (0.17), e5 (0).
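The run shown on these slides can be reproduced with a short greedy sketch. Several points are assumptions, not statements from the talk: the score is taken to be benefit divided by reading effort, the adjustment subtracts the retrieved element's benefit and effort from each ancestor, the loop stops at the first element that does not fit, and the per-element values are hypothetical numbers chosen to match the slide figures.

```python
parent = {
    "e1": "e0", "e5": "e0",
    "e2": "e1", "e3": "e1", "e4": "e1",
    "e6": "e5", "e7": "e5",
}
# Hypothetical values (assumptions); ratios match the slide scores.
BENEFIT = {"e0": 28, "e1": 18, "e2": 2, "e3": 9, "e4": 5, "e5": 8, "e6": 0, "e7": 8}
EFFORT = {"e0": 50, "e1": 28, "e2": 5, "e3": 10, "e4": 15, "e5": 23, "e6": 13, "e7": 10}


def ancestors(node):
    """Yield the ancestors of node, nearest first."""
    while node in parent:
        node = parent[node]
        yield node


def greedy_retrieve(threshold):
    """Return (result elements, total benefit, total reading effort)."""
    benefit, effort = dict(BENEFIT), dict(EFFORT)  # adjusted copies
    result, used, gained = [], 0, 0
    while True:
        # Candidates still add benefit and are not already covered.
        cands = [
            e for e in benefit
            if benefit[e] > 0 and effort[e] > 0 and e not in result
            and not any(a in result for a in ancestors(e))
        ]
        if not cands:
            break
        pick = max(cands, key=lambda e: benefit[e] / effort[e])
        if used + effort[pick] > threshold:
            break  # stop at the first element that does not fit
        used += effort[pick]
        gained += benefit[pick]
        # Remove retrieved elements nested inside the new pick.
        result = [e for e in result if pick not in ancestors(e)]
        result.append(pick)
        # Adjust ancestors: they now add less benefit and cost less effort.
        for a in ancestors(pick):
            benefit[a] -= benefit[pick]
            effort[a] -= effort[pick]
    return result, gained, used


print(greedy_retrieve(40))
print(greedy_retrieve(30))
```

For threshold 40 the sketch ends with e7 and e1, benefit 26 and reading effort 38, as in step 4. Stopping at the first element that does not fit means a larger threshold only extends the same pick sequence, which is what keeps the results continuous.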


Evaluation Metrics
Evaluation is based on benefit and reading effort, using the b/e graph (benefit/effort graph).
The implemented system is compared with BTIL (Best Thorough Input List).
  The BTIL system is the system that uses actual benefit and reading effort.
  Actual benefit is calculated using manually constructed assessments (e.g. INEX).
By changing the specified threshold of reading effort, we can observe the relative effectiveness in terms of benefit.
The same values of reading effort are used for the implemented system and the BTIL system.

Example: threshold value 30 of reading effort
[Figure: the example tree annotated with calculated benefit / reading effort and with actual benefit / reading effort]
The BTIL system retrieves {e3, e6}. Obtained actual benefit is 23.
The implemented system retrieves {e3, e7}. Obtained actual benefit is 10.
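One way to turn such comparisons into points of a b/e graph is sketched below. It assumes that relative effectiveness is the ratio of the implemented system's actual benefit to the BTIL system's actual benefit at the same threshold; the slides do not give the exact definition, and the function name is hypothetical.

```python
def relative_effectiveness(system_benefit, btil_benefit):
    """Return (threshold, ratio) points for thresholds present in both inputs.

    Both arguments map a reading effort threshold to the actual benefit
    obtained at that threshold.
    """
    points = []
    for threshold in sorted(btil_benefit):
        if threshold not in system_benefit:
            continue
        best = btil_benefit[threshold]
        ratio = system_benefit[threshold] / best if best else 0.0
        points.append((threshold, ratio))
    return points


# Values from the slide: at threshold 30 the implemented system obtains
# actual benefit 10 and the BTIL system obtains 23.
points = relative_effectiveness({30: 10}, {30: 23})
print(points)
```

Repeating this for a range of thresholds gives one curve per system or per topic.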

Examples of b/e Graph using INEX 2005 Test Collection (1/2)
The INEX 2005 test collection consists of an XML document set, topics, and assessments.
Actual benefit and reading effort are calculated from the assessments.
  ex (exhaustivity): highly exhaustive (HE) → 1, partially exhaustive (PE) → 0.5, not exhaustive (NE) → 0
  rsize: relevant text length (in number of characters)
  size: element length (in number of characters)
  [Formulas for actual benefit and reading effort, including one parameter, are not preserved in the transcript]
We implemented a system using tf-ief, where ief stands for inverse element frequency. The system satisfies the assumptions for benefit and reading effort.

Examples of b/e Graph using INEX 2005 Test Collection (2/2)
[Figure: b/e graphs for Topic 207 and Topic 206]
We can observe the relative effectiveness of the implemented systems against the BTIL system.

Conclusions and Future Work
Conclusions
  Introduction of benefit and reading effort: handles nesting elements and the variety of element sizes.
  Algorithm for flexible retrieval: the result elements change depending on the specified reading effort.
  System evaluation.
Future work
  Introduction of switching effort: the cost of switching between result items in the results list. Retrieving numerous results increases the cost of browsing.
  Integration with a user interface.