INEX: Evaluating content-oriented XML retrieval Mounia Lalmas Queen Mary University of London



Outline
- Content-oriented XML retrieval
- Evaluating XML retrieval: INEX

XML Retrieval
- Traditional IR is about finding documents relevant to a user's information need, e.g. an entire book.
- XML retrieval allows users to retrieve document components that are more focussed on their information needs, e.g. a chapter of a book instead of an entire book.
- The structure of documents is exploited to identify which document components to retrieve.

Structured Documents
- Linear order of words, sentences, paragraphs …
- Hierarchy or logical structure of a book's chapters, sections …
- Links (hyperlinks), cross-references, citations …
- Temporal and spatial relationships in multimedia documents
[Figure labels: Book, Chapters, Sections, Paragraphs; World Wide Web]

Structured Documents
Explicit structure formalised through document representation standards (mark-up languages):
- Layout: LaTeX (publishing), HTML (Web publishing)
- Structure: SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting)
- Content/Semantic: RDF, DAML + OIL, OWL (semantic web)

XML: eXtensible Mark-up Language
- Meta-language (user-defined tags) currently being adopted as the document format language by W3C
- Used to describe content and structure (and not layout)
- Grammar described in DTD (used for validation)
[Example XML document, markup lost in extraction: Structured Document Retrieval; Smith John; Introduction into XML retrieval …]

XML: eXtensible Mark-up Language
Use of XPath notation to refer to the XML structure:
- chapter/title: title is a direct sub-component of chapter
- //title: any title
- chapter//title: title is a direct or indirect sub-component of chapter
- chapter/paragraph[2]: the second paragraph that is a direct sub-component of any chapter
- chapter/*: all direct sub-components of a chapter
[Example XML document, markup lost in extraction: Structured Document Retrieval; Smith John; Introduction into SDR …]
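
The path expressions listed on this slide can be tried with Python's standard xml.etree.ElementTree module, which supports a limited XPath subset. This is a minimal sketch: the document and its element names are hypothetical, and "//title" is written ".//title" because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are assumptions for illustration.
DOC = """
<book>
  <title>Structured Document Retrieval</title>
  <chapter>
    <title>Introduction into SDR</title>
    <paragraph>First paragraph.</paragraph>
    <paragraph>Second paragraph.</paragraph>
    <section>
      <title>Motivation</title>
      <paragraph>Nested paragraph.</paragraph>
    </section>
  </chapter>
</book>
"""

book = ET.fromstring(DOC)


def texts(path):
    """Return the text of every element matched by `path`, relative to <book>."""
    return [element.text for element in book.findall(path)]


# chapter/title: titles that are direct sub-components of a chapter
direct_titles = texts("chapter/title")
# //title: any title (relative form required by ElementTree)
any_titles = texts(".//title")
# chapter//title: titles that are direct or indirect sub-components of a chapter
nested_titles = texts("chapter//title")
# chapter/paragraph[2]: second direct paragraph of any chapter
second_paragraphs = texts("chapter/paragraph[2]")
# chapter/*: all direct sub-components of a chapter
chapter_children = [element.tag for element in book.findall("chapter/*")]
```

A full XPath engine would accept the absolute form as well; the standard library was chosen here only to keep the sketch dependency-free.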

Querying XML documents
- Content-only (CO) queries: 'open standards for digital video in distance learning'
- Content-and-structure (CAS) queries: //article[about(., 'formal methods verify correctness aviation systems')]/body//section[about(., 'case study application model checking theorem proving')]
- Structure-only (SA) queries: /article//*section/paragraph[2]

Content-oriented XML retrieval
Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc.), relevant to the user's information need with regard to both content and structure.

Content-oriented XML retrieval
Retrieve the best components according to content and structure criteria:
- INEX: most specific component that satisfies the query, while being exhaustive to the query
- Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing
- ???

Challenges
- No fixed retrieval unit + nested elements + element types
- How to obtain document and collection statistics?
- Which component is a good retrieval unit?
- Which components contribute best to the content of Article?
- How to estimate? How to aggregate?
[Figure: an Article with Title, Section 1 and Section 2; component term weights 0.9 XML, 0.5 XML, 0.2 XML, 0.4 retrieval, 0.7 authoring; unknown Article weights ?XML, ?retrieval, ?authoring]

Approaches …
vector space model, probabilistic model, Bayesian network, language model, extending DB model, boolean model, natural language processing, cognitive model, ontology, parameter estimation, tuning, smoothing, fusion, phrase, term statistics, collection statistics, component statistics, proximity search, logistic regression, belief model, relevance feedback

Vector space model (IBM Haifa, INEX 2003)
- One index per element type: article index, abstract index, section index, sub-section index, paragraph index
- Each index produces an RSV, which is normalised; the normalised RSVs are then merged
- tf and idf as for fixed and non-nested retrieval units
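
The merge step on this slide can be sketched as follows. The slide says only that each index's RSV is normalised before merging. Dividing by the top score of each index is an assumption made here for illustration, not necessarily the normalisation used by the IBM Haifa system, and all names and scores are hypothetical.

```python
def normalise(run):
    """Scale the RSVs of one index into [0, 1] by dividing by the top score.

    Max-normalisation is an assumption; the slide does not name the method.
    """
    top = max(run.values(), default=0.0)
    if top <= 0.0:
        return {element: 0.0 for element in run}
    return {element: rsv / top for element, rsv in run.items()}


def merge(runs_by_index):
    """Merge per-index runs into one ranking of (score, index, element)."""
    merged = []
    for index_name, run in runs_by_index.items():
        for element, score in normalise(run).items():
            merged.append((score, index_name, element))
    # Highest score first; ties broken by index name, then element id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return merged


# Hypothetical raw RSVs from three of the per-type indices.
runs = {
    "article": {"a1": 12.0, "a2": 6.0},
    "section": {"a1/s1": 3.0, "a2/s3": 1.5},
    "paragraph": {"a1/s1/p2": 0.8},
}
ranking = merge(runs)
```

Normalising first matters because raw scores from a paragraph index and an article index are not on the same scale, so a direct merge would favour whichever index produces larger numbers.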

Language model (University of Amsterdam, INEX 2003)
- Element language model, collection language model, smoothing parameter
- Elements are ranked using element score, element size and article score
- Query expansion with blind feedback
- Ignore elements with […] 20 terms
- A high value of the smoothing parameter leads to an increase in the size of retrieved elements
- Results with smoothing parameter values of 0.9, 0.5 and 0.2 are similar
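
A minimal sketch of scoring an element with a smoothed language model, assuming linear interpolation between the element model and the collection model. The parameter name lam, and the choice that lam weights the element model, are assumptions. The size and article-score components mentioned on the slide are left out.

```python
import math
from collections import Counter


def element_score(query_terms, element_terms, collection_counts, collection_size, lam):
    """Log-likelihood of the query under an element model mixed with the collection model.

    lam weights the element model and (1 - lam) the collection model (assumed).
    """
    counts = Counter(element_terms)
    length = len(element_terms)
    score = 0.0
    for term in query_terms:
        p_element = counts[term] / length if length else 0.0
        p_collection = collection_counts.get(term, 0) / collection_size
        mixed = lam * p_element + (1.0 - lam) * p_collection
        if mixed == 0.0:
            # Term unseen in both the element and the collection.
            return float("-inf")
        score += math.log(mixed)
    return score


# Hypothetical collection statistics and two candidate elements.
collection_counts = {"xml": 10, "retrieval": 5, "authoring": 5, "the": 80}
collection_size = 100
paragraph = ["xml", "retrieval", "xml", "the"]
section = ["the"] * 8 + ["xml", "authoring"]
query = ["xml", "retrieval"]

paragraph_score = element_score(query, paragraph, collection_counts, collection_size, 0.5)
section_score = element_score(query, section, collection_counts, collection_size, 0.5)
```

With these toy numbers the short paragraph that contains both query terms outranks the longer section that contains only one of them.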

Evaluation of XML retrieval: INEX
- Evaluating the effectiveness of content-oriented XML retrieval approaches
- Collaborative effort: participants contribute to the development of the collection (queries, relevance assessments)
- Similar methodology as for TREC, but adapted to XML retrieval
- 40+ participants worldwide
- Workshop in Schloss Dagstuhl in December (20+ institutions)

INEX Test Collection
- Documents (~500MB), which consist of 12,107 articles in XML format from the IEEE Computer Society; 8 million elements
- INEX 2002: … CO and 30 CAS queries; inex2002 metric
- INEX 2003: … CO and 30 CAS queries; CAS queries are defined according to an enhanced subset of XPath; inex2002 and inex2003 metrics
- INEX 2004 is just starting

Tasks
- CO: aim is to decrease user effort by pointing the user to the most specific relevant portions of documents.
- SCAS: retrieve relevant nodes that match the structure specified in the query.
- VCAS: retrieve relevant nodes that may not be the same as the target elements, but are structurally similar.

Relevance in XML
- An element is relevant if it has significant and demonstrable bearing on the matter at hand
- Common assumptions in IR: objectivity, topicality, binary nature, independence
[Figure labels: article, section, paragraph]

Relevance in INEX
- Exhaustivity: how exhaustively a document component discusses the query: 0, 1, 2, 3
- Specificity: how focused the component is on the query: 0, 1, 2, 3
- Relevance: (3,3), (2,3), (1,1), (0,0), …
Examples for an article and its sections:
- All sections relevant: article very relevant
- All sections relevant: article better than sections
- One section relevant: article less relevant
- One section relevant: section better than article
- …

Relevance assessment task
- Completeness: element, parent element, children elements
- Consistency: the parent of a relevant element must also be relevant, although to a different extent; exhaustivity increases going up the tree; specificity decreases going up the tree
- Use of an online interface
- Assessing a query takes a week!
- Average 2 topics per participant
[Figure labels: article, section, paragraph]
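
The consistency rules on this slide lend themselves to an automatic check. The sketch below covers two of them: a relevant child requires a relevant parent, and exhaustivity must not drop when moving from child to parent. The Assessed class and the message wording are hypothetical. The specificity trend is left out because the slide does not state it as a hard constraint.

```python
from dataclasses import dataclass, field


@dataclass
class Assessed:
    """One assessed element with exhaustivity and specificity on the 0-3 scale."""
    name: str
    exhaustivity: int
    specificity: int
    children: list = field(default_factory=list)


def inconsistencies(node):
    """Return human-readable violations found in the subtree rooted at `node`."""
    problems = []
    for child in node.children:
        if child.exhaustivity > 0 and node.exhaustivity == 0:
            problems.append(f"{node.name} is not relevant but its child {child.name} is")
        if child.exhaustivity > node.exhaustivity:
            problems.append(f"{child.name} is more exhaustive than its parent {node.name}")
        problems.extend(inconsistencies(child))
    return problems


# Hypothetical assessments: one consistent tree and one inconsistent tree.
good = Assessed("article", 3, 1, [Assessed("section", 2, 3, [Assessed("paragraph", 1, 3)])])
bad = Assessed("article", 0, 0, [Assessed("section", 2, 3)])

good_problems = inconsistencies(good)
bad_problems = inconsistencies(bad)
```

An assessment interface could run such a check after each judgement and assess the implied elements automatically instead of reporting them.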

Interface
[Screenshot of the online assessment interface; labels: Current assessments, Navigation, Groups]

Assessments
With respect to the elements to assess:
- 26 % assessments on elements in the pool (66 % in INEX 2002)
- 68 % highly specific elements not in the pool
- 7 % elements automatically assessed
INEX 2002: … inconsistent assessments per query for one rule

Metrics
Need to consider:
- Two dimensions of relevance
- Independence assumption does not hold
- No predefined retrieval unit
- Overlap
- Linear vs. clustered ranking
[Figure labels: section, article]

INEX 2002 metric
Quantisation: strict and generalised

Precision as defined by Raghavan89 (based on ESL, the expected search length), where n is estimated
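
The formula on this slide did not survive extraction. The sketch below follows the usual statement of the Raghavan et al. (1989) definition as used for the inex2002 metric, and should be read as an assumption: precision at recall level x is x·n / (x·n + esl), with esl = j + s·i / (r + 1). The function and argument names are hypothetical.

```python
def expected_search_length(j, s, r, i):
    """Expected number of non-relevant components the user has to inspect.

    j: non-relevant components in the ranks before the final rank
    s: relevant components still needed from the final rank
    r: relevant components in the final rank
    i: non-relevant components in the final rank
    """
    return j + s * i / (r + 1)


def precision_at_recall(x, n, j, s, r, i):
    """Probability that a viewed component is relevant at recall level x.

    n is the total number of relevant components (estimated in INEX).
    """
    wanted = x * n
    return wanted / (wanted + expected_search_length(j, s, r, i))


# Hypothetical numbers: 10 relevant components overall, recall level 0.5.
example = precision_at_recall(x=0.5, n=10, j=3, s=2, r=4, i=5)
```

In this example the user wants 5 relevant components and is expected to see 5 non-relevant ones on the way, which gives a precision of 0.5.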

Overlap problem

INEX 2003 metric
Ideal concept space (Wong & Yao 95)
[Figure labels: c, t]

INEX 2003 metric
Quantisation: strict and generalised
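
The quantisation functions themselves were lost from this slide. The sketch below assumes that strict quantisation counts only (3,3) as relevant, and that generalised quantisation maps each (exhaustivity, specificity) pair to a graded score. The table values are recalled from INEX 2003 documentation and should be treated as an assumption, not as the slide's content.

```python
def strict(exhaustivity, specificity):
    """1 only for highly exhaustive and highly specific components."""
    return 1.0 if (exhaustivity, specificity) == (3, 3) else 0.0


# Assumed generalised quantisation table, keyed by (exhaustivity, specificity).
GENERALISED = {
    (3, 3): 1.0,
    (2, 3): 0.75, (3, 2): 0.75, (3, 1): 0.75,
    (1, 3): 0.5, (2, 2): 0.5, (2, 1): 0.5,
    (1, 2): 0.25, (1, 1): 0.25,
    (0, 0): 0.0,
}


def generalised(exhaustivity, specificity):
    """Graded score; raises ValueError for pairs outside the assessment scale."""
    try:
        return GENERALISED[(exhaustivity, specificity)]
    except KeyError:
        raise ValueError(f"invalid assessment ({exhaustivity}, {specificity})") from None
```

For example, an assessment of (2,3) counts as non-relevant under the strict function but receives partial credit under the generalised one.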

Ignoring overlap:

INEX 2003 metric Considering overlap:

INEX 2003 metric
- Penalises overlap by only scoring novel information in overlapping results
- Assume uniform distribution of relevant information
- Issue of stability
- Size considered directly in precision (is it intuitive that large is good or not?)
- Recall defined using exh only
- Precision defined using spec only
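
The formulas for the two variants were lost in extraction. The sketch below is reconstructed from the bullet points above and is an assumption throughout: recall uses quantised exhaustivity only, precision uses quantised specificity weighted by size, and when overlap is considered only the not-yet-seen part of a component is scored (uniform distribution of relevant information). Representing a component as a set of token positions is a hypothetical choice made for the example.

```python
def recall_precision(ranking, total_exhaustivity, penalise_overlap):
    """Compute (recall, precision) for a ranked list of components.

    ranking: list of (positions, exh, spec), where positions is a non-empty set
             of token positions and exh/spec are quantised scores in [0, 1].
    total_exhaustivity: sum of exh over all components in the collection.
    """
    seen = set()
    gained_exh = 0.0
    spec_mass = 0.0
    size_mass = 0.0
    for positions, exh, spec in ranking:
        size = len(positions)
        # Novel part only when overlap is penalised, full size otherwise.
        counted = len(positions - seen) if penalise_overlap else size
        gained_exh += exh * counted / size
        spec_mass += spec * counted
        size_mass += counted
        seen |= positions
    recall = gained_exh / total_exhaustivity if total_exhaustivity else 0.0
    precision = spec_mass / size_mass if size_mass else 0.0
    return recall, precision


# Hypothetical run: a section followed by a paragraph nested inside it.
section = (set(range(10)), 1.0, 0.5)
paragraph = (set(range(5)), 0.5, 1.0)
run = [section, paragraph]

ignoring = recall_precision(run, total_exhaustivity=1.5, penalise_overlap=False)
considering = recall_precision(run, total_exhaustivity=1.5, penalise_overlap=True)
```

In this toy run the nested paragraph adds nothing once overlap is considered, so both recall and precision drop relative to the variant that ignores overlap.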

Alternative metrics
- User-effort oriented measures: Expected Relevant Ratio, Tolerance to Irrelevance
- Discounted Cumulated Gain

Lessons learnt
- Good definition of relevance
- Expressing CAS queries was not easy
- Relevance assessment process must be improved
- Further development on metrics needed
- User studies required

Conclusion
- XML retrieval is not just about the effective retrieval of XML documents, but also about how to evaluate effectiveness
- INEX 2004 tracks: relevance feedback, interactive, heterogeneous collection, natural language query
