Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada.

Content-Oriented XML Retrieval
– Documents not Data
– Natural Language Queries: “text index compression”
– Ranked Results

Goal
In response to a user query, select and return an appropriately ranked mix of document components… paragraphs, sections, subsections, bibliographic entries, articles, etc.

IEEE Journal Article in XML
Text Compression for Dynamic Document Databases
Alistair Moffat, Justin Zobel, Neil Sharman
Abstract: For compression of text...
INTRODUCTION: Modern document databases contain vast quantities of text... There are good reasons to compress the text stored in a...
REDUCING MEMORY REQUIREMENTS: Method A: The first method...

XML Tree
[Figure: element tree of the article — article → fm (atl, au, abs (p, b)) and bdy (sec (st, ip1, p), sec (st, ss1 (st)), …)]

Ranking Document Components
1) Split into components
2) Treat each as a “document”
3) Rank using standard IR methods
[Figure: component tree with ranks #1, #6, #2, #103, #8, #4]
Problem: Top ranks dominated by related components

e.g. “text index compression”
/co/2000/ry037.xml /article[1]/bdy[1]
/co/2000/ry037.xml /article[1]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[2]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[5]
/tk/1997/k0302.xml /article[1]
/tk/1997/k0302.xml /article[1]/bdy[1]/sec[1]
/tk/1997/k0302.xml /article[1]/bdy[1]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[3]
/tk/1997/k0302.xml /article[1]/bdy[1]/sec[6]
/tp/2000/i0385.xml /article[1]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[7]
/tk/1997/k0302.xml /article[1]/bdy[1]/sec[3]
/tp/2000/i0385.xml /article[1]/bdy[1]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[4]
/tp/2000/i0385.xml /article[1]/bdy[1]/sec[1]
/tk/1997/k0302.xml /article[1]/bm[1]
/tk/1997/k0302.xml /article[1]/bdy[1]/sec[3]/ss1[1]
/co/2000/ry037.xml /article[1]/bdy[1]/sec[5]/ss1[1]

Is overlap always bad?
[Figure: the same XML tree, contrasting a general overview (the article node) with topic-focused components (individual sections)]

Overview
– Baseline retrieval method
– Evaluation methodology
– Re-ranking algorithm

Baseline Retrieval Method
– Treat each component as a separate document.
– Rank using Okapi BM25.
– Tune retrieval parameters to task and collection.
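As a concrete illustration of the first step, here is a minimal sketch (my own code, not the paper's implementation) that enumerates every element of an XML article as a separate “document”, keyed by an XPath-like locator of the kind shown in the ranked list above:

```python
import xml.etree.ElementTree as ET

def component_paths(elem, path):
    """Yield (XPath-like path, concatenated text) for elem and every
    descendant, so each component can be indexed as its own document."""
    yield path, "".join(elem.itertext())
    counts = {}                               # positional index per child tag
    for child in elem:
        counts[child.tag] = counts.get(child.tag, 0) + 1
        yield from component_paths(child, f"{path}/{child.tag}[{counts[child.tag]}]")

article = ET.fromstring(
    "<article><fm><atl>Text Compression...</atl></fm>"
    "<bdy><sec><st>INTRODUCTION</st><p>Modern document databases...</p></sec></bdy></article>")
for path, text in component_paths(article, "/article[1]"):
    print(path, "->", text[:40])
```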

Okapi BM25
Given a query Q, a component x is scored:

$$\sum_{t \in Q} q_t \cdot \frac{(k_1 + 1)\,x_t}{K + x_t} \cdot \log\!\left(\frac{D - D_t}{D_t}\right)$$

where:
D = number of documents in the collection
D_t = number of documents containing t
q_t = frequency that t occurs in the query
x_t = frequency that t occurs in x
K = k_1((1 − b) + b·l_x/l_avg)
l_x = length of x
l_avg = average document length
Parameter settings: k_1 = 10.0, b = 0.80
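A minimal sketch of this scoring function (names and data structures are my own; collection statistics are assumed precomputed):

```python
import math
from collections import Counter

def bm25_score(query_terms, component_terms, doc_freq, num_docs, avg_len,
               k1=10.0, b=0.80):
    """Score a component against a query with the BM25 variant above.
    doc_freq[t] = D_t, num_docs = D, avg_len = l_avg."""
    q = Counter(query_terms)                  # q_t: term frequency in the query
    x = Counter(component_terms)              # x_t: term frequency in the component
    K = k1 * ((1 - b) + b * len(component_terms) / avg_len)
    score = 0.0
    for t, q_t in q.items():
        D_t = doc_freq.get(t, 0)
        if D_t == 0 or D_t >= num_docs:       # idf undefined; skip the term
            continue
        x_t = x[t]
        score += q_t * ((k1 + 1) * x_t / (K + x_t)) * math.log((num_docs - D_t) / D_t)
    return score
```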

Overview
– Baseline retrieval method
– Evaluation methodology
– Re-ranking algorithm

XML Evaluation - INEX 2004
– Third year of the “INitiative for the Evaluation of XML retrieval” (INEX).
– Encourages research into XML information retrieval technology (similar to TREC).
– Over 50 participating groups.

INEX 2004 Content-Only Retrieval Task
– Test collection of articles taken from IEEE journals between 1995 and 2002.
– “Adhoc” queries, such as:
  – text index compression
  – software quality
  – new fortran 90 compiler
– Each group could submit up to three runs consisting of the top 1500 components for each query.

Relevance Assessment
INEX uses two dimensions for relevance:
– exhaustivity: the degree to which a component covers a topic
– specificity: the degree to which a component is focused on a topic
A four-point scale is used in both dimensions (e.g. a (3,3) component is highly exhaustive and highly specific).

INEX 2004 Evaluation Metrics
– Mean average precision (MAP)
– XML cumulated gain (XCG)
– Various quantization functions:
  – strict quantization: (3,3) elements only
  – generalized quantization
  – specificity-oriented generalization
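For concreteness, a quantization function maps an (exhaustivity, specificity) pair to a gain value. The sketch below shows strict quantization as defined above; the generalized weights are illustrative placeholders, not the official INEX 2004 table:

```python
def strict_quant(e, s):
    """Strict quantization: only (3,3) components count as relevant."""
    return 1.0 if (e, s) == (3, 3) else 0.0

def generalized_quant(e, s):
    """Generalized quantization: partial credit for partially relevant
    components. NOTE: these weights are illustrative assumptions."""
    weights = {(3, 3): 1.0, (2, 3): 0.75, (3, 2): 0.75,
               (1, 3): 0.5, (3, 1): 0.5, (2, 2): 0.5}
    return weights.get((e, s), 0.25 if e and s else 0.0)
```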

Overview
– Baseline retrieval method
– Evaluation methodology
– Re-ranking algorithm

Controlling Overlap
Starting with a component ranking generated by the baseline method, elements are re-ranked to control overlap: the scores of components that contain, or are contained within, higher-ranking components are iteratively adjusted.

Controlling Overlap
[Figure: component tree with baseline ranks #1, #2, #4, #6, #10, #87, #203, #21, #863]

Re-ranking Algorithm
1. Report the highest ranking component.
2. Adjust the scores of the unreported components.
3. Repeat steps 1 and 2 until the top m components have been reported.

Adjusting Scores
Unreported components are rescored with the same BM25 formula,

$$\sum_{t \in Q} q_t \cdot \frac{(k_1 + 1)\,x_t}{K + x_t} \cdot \log\!\left(\frac{D - D_t}{D_t}\right)$$

but with the term frequency x_t replaced by an adjusted value:

$$x_t = f_t - \alpha \cdot g_t$$

where:
f_t = frequency that t occurs in component x
g_t = frequency that t has been reported in subcomponents of x
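A minimal sketch of the adjustment (names are my own; the maps are per query term):

```python
def adjusted_tf(f, g, alpha=0.5):
    """Compute x_t = f_t - alpha * g_t for each query term t.
    f[t]: occurrences of t in the component;
    g[t]: occurrences of t already reported in its subcomponents."""
    return {t: f_t - alpha * g.get(t, 0) for t, f_t in f.items()}
```

For example, with α = 0.5 a component with f_t = 13, of which g_t = 6 occurrences have already been reported in a subcomponent, is rescored with x_t = 13 − 0.5·6 = 10, matching the example on the next slide.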

Adjusting Scores (α = 0.5)
[Figure: component #1 (f_t = 6, g_t = 0, x_t = 6) is reported; a component containing it is adjusted from f_t = 13, g_t = 0, x_t = 13 to f_t = 13, g_t = 6, x_t = 10]

Re-ranking Algorithm
1. Maintain tree nodes in a priority queue ordered by score.
2. Report the component at the front of the queue.
3. Propagate adjustments up and down the tree, re-ordering nodes in the queue.
4. Repeat steps 2 and 3 until the top m components have been reported.
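A sketch of the queue-based loop, under stated assumptions: the code and names are mine, score_fn stands in for the BM25 scorer sketched earlier, only the upward (α) adjustment of the basic algorithm is implemented, and double counting of already-reported subcomponent terms is ignored.

```python
import heapq

def rerank(components, parent, f, score_fn, alpha=0.5, m=1500):
    """Greedy re-ranking with a priority queue (lazy deletion of stale
    entries). components: comparable node ids; parent[x]: parent id or
    None; f[x]: query-term frequencies of component x;
    score_fn(tf, x): BM25-style score of x from term frequencies tf."""
    g = {x: {} for x in components}           # terms reported within x so far
    cur = {x: score_fn(f[x], x) for x in components}
    heap = [(-s, x) for x, s in cur.items()]
    heapq.heapify(heap)
    reported, ranking = set(), []
    while heap and len(ranking) < m:
        neg, x = heapq.heappop(heap)
        if x in reported or -neg != cur[x]:   # stale heap entry, skip
            continue
        reported.add(x)
        ranking.append(x)
        a = parent[x]                         # propagate the adjustment upward
        while a is not None:
            for t, f_t in f[x].items():
                g[a][t] = g[a].get(t, 0) + f_t
            tf = {t: max(0.0, f_t - alpha * g[a].get(t, 0))
                  for t, f_t in f[a].items()}
            cur[a] = score_fn(tf, a)
            heapq.heappush(heap, (-cur[a], a))
            a = parent[a]
    return ranking
```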

Mean Average Precision vs. XCG

Extended Re-Ranking Algorithm
– One parameter (α) is not sufficient: a term reported in a descendant (α) should be discounted differently from a term reported in an ancestor (β).
– Let i be the number of times a term has been reported in an ancestor, with
  1 = β_0 ≥ β_1 ≥ β_2 ≥ … ≥ β_n = 0
– Adjusted term frequency: x_t = β_i · (f_t − α·g_t)
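A short sketch of the extended adjustment; the β schedule shown is an assumed example, not values from the paper:

```python
def extended_tf(f_t, g_t, i, alpha=0.5, betas=(1.0, 0.5, 0.25, 0.0)):
    """Extended adjustment x_t = beta_i * (f_t - alpha * g_t):
    alpha discounts occurrences already reported in descendants;
    beta_i discounts components whose ancestors have been reported
    i times (1 = beta_0 >= beta_1 >= ... >= beta_n = 0)."""
    beta_i = betas[min(i, len(betas) - 1)]
    return beta_i * (f_t - alpha * g_t)
```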

Concluding Comments
– Overlap isn’t always bad.
– Overlap should be controlled, not eliminated.
– Current and future work:
  – Modifying k_1 to reflect term frequency adjustments
  – More appropriate evaluation methodology
  – Learning α and β_i

Questions? Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada