Structure/XML Retrieval Mounia Lalmas Department of Computer Science Queen Mary University of London.

Slides:



Advertisements
Similar presentations
INEX: Evaluating content-oriented XML retrieval Mounia Lalmas Queen Mary University of London
Advertisements

Evaluating content-oriented XML retrieval: The INEX initiative Mounia Lalmas Queen Mary University of London
Evaluating XML retrieval: The INEX initiative Mounia Lalmas Queen Mary University of London
XML Retrieval: from modelling to evaluation Mounia Lalmas Queen Mary University of London qmir.dcs.qmul.ac.uk.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Chapter 5: Introduction to Information Retrieval
Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.
Modern Information Retrieval Chapter 1: Introduction
Improved TF-IDF Ranker
Query Languages. Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
XML Ranking Querying, Dagstuhl, 9-13 Mar, An Adaptive XML Retrieval System Yosi Mass, Michal Shmueli-Scheuer IBM Haifa Research Lab.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Information Retrieval in Practice
Search Engines and Information Retrieval
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) Classic Information Retrieval (IR)
Basic IR: Queries Query is statement of user’s information need. Index is designed to map queries to likely to be relevant documents. Query type, content,
IR Challenges and Language Modeling. IR Achievements Search engines  Meta-search  Cross-lingual search  Factoid question answering  Filtering Statistical.
IR Models: Structural Models
DYNAMIC ELEMENT RETRIEVAL IN A STRUCTURED ENVIRONMENT MAYURI UMRANIKAR.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
1 Configurable Indexing and Ranking for XML Information Retrieval Shaorong Liu, Qinghua Zou and Wesley W. Chu UCLA Computer Science Department {sliu, zou,
Chapter 2Modeling 資工 4B 陳建勳. Introduction.  Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
INEX 2003, Germany Searching in an XML Corpus Using Content and Structure INEX 2003, Germany Yiftah Ben-Aharon, Sara Cohen, Yael Grumbach, Yaron Kanza,
1 - Fuhr: Information Retrieval Methods for XML Documents XIRQL: Eine Anfragesprache für Information Retrieval in XML- Dokumenten Norbert Fuhr Universität.
Chapter 5: Information Retrieval and Web Search
Overview of Search Engines
Information Retrieval in Practice
XML Information Retrieval and INEX Norbert Fuhr University of Duisburg-Essen.
INEX : Understanding XML Retrieval Evaluation Mounia Lalmas and Anastasios Tombros Queen Mary, University of London Norbert Fuhr University.
Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
Search Engines and Information Retrieval Chapter 1.
Chapter 7 Web Content Mining Xxxxxx. Introduction Web-content mining techniques are used to discover useful information from content on the web – textual.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 Searching XML Documents via XML Fragments D. Camel, Y. S. Maarek, M. Mandelbrod, Y. Mass and A. Soffer Presented by Hui Fang.
Querying Structured Text in an XML Database By Xuemei Luo.
April 14, 2003Hang Cui, Ji-Rong Wen and Tat- Seng Chua 1 Hierarchical Indexing and Flexible Element Retrieval for Structured Document Hang Cui School of.
1 Information Retrieval Acknowledgements: Dr Mounia Lalmas (QMW) Dr Joemon Jose (Glasgow)
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
ISP 433/533 Week 11 XML Retrieval. Structured Information Traditional IR –Unit of information: terms and documents –No structure Need more granularity.
Controlling Overlap in Content-Oriented XML Retrieval Charles L. A. Clarke School of Computer Science University of Waterloo Waterloo, Canada.
Chapter 6: Information Retrieval and Web Search
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
An Architecture for Emergent Semantics Sven Herschel, Ralf Heese, and Jens Bleiholder Humboldt-Universität zu Berlin/ Hasso-Plattner-Institut.
Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.
Enhancing Cluster Labeling Using Wikipedia David Carmel, Haggai Roitman, Naama Zwerdling IBM Research Lab (SIGIR’09) Date: 11/09/2009 Speaker: Cho, Chin.
Vector Space Models.
Information Retrieval
Relevance Models and Answer Granularity for Question Answering W. Bruce Croft and James Allan CIIR University of Massachusetts, Amherst.
CS798: Information Retrieval Charlie Clarke Information retrieval is concerned with representing, searching, and manipulating.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
XRANK: RANKED KEYWORD SEARCH OVER XML DOCUMENTS Lin Guo Feng Shao Chavdar Botev Jayavel Shanmugasundaram Abhishek Chennaka, Alekhya Gade Advanced Database.
Text Similarity: an Alternative Way to Search MEDLINE James Lewis, Stephan Ossowski, Justin Hicks, Mounir Errami and Harold R. Garner Translational Research.
Chapter 13. Structured Text Retrieval With Mounia Lalmas 무선 / 이동 시스템 연구실 김민혁.
Information Retrieval in Practice
Text Based Information Retrieval
Information Retrieval and Web Search
Information Retrieval and Web Search
Toshiyuki Shimizu (Kyoto University)
Information Retrieval
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Introduction to Information Retrieval
Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web Yizhe Ge.
Information Retrieval and Web Design
CoXML: A Cooperative XML Query Answering System
Presentation transcript:

Structure/XML Retrieval Mounia Lalmas Department of Computer Science Queen Mary University of London

Outline 1.Structured document retrieval (SDR) 2.Queries/information needs in SDR 3.Structured documents / XML 4.Conceptual models 5.Approaches for SDR 6.Approaches for XML retrieval 7.User aspects

Structured Document Retrieval (SDR)  Traditional IR is about finding relevant documents to a user’s information need, e.g. entire book.  SDR allows users to retrieve document components that are more focussed to their information needs, e.g a chapter, a page, several paragraphs of a book instead of an entire book.  The structure of documents is exploited to identify which document components to retrieve. Structure improves precision Exploit visual memory

Queries in SDR  Content-only (CO) queries  Standard IR queries but here we are retrieving document components “London tube strikes”  Content-and-structure (CAS) queries  Put constraints on which types of components are to be retrieved E.g. “Sections of an article in the Times about congestion charges” E.g. Articles that contain sections about congestion charges in London, and that contain a picture of Ken Livingstone, and return titles of these articles”  Inner constraints (support elements), target elements

DocumentsQuery Document representation Retrieval results Query representation IndexingFormulation Retrieval function Relevance feedback Conceptual model for IR

Conceptual model for SDR Structured documentsContent + structure Inverted file + structure index tf, idf, … Matching content + structure Presentation of related components DocumentsQuery Document representation Retrieval results Query representation IndexingFormulation Retrieval function Relevance feedback

Conceptual model for SDR Structured documents Content + structure Inverted file + structure index tf, idf, acc, … Matching content + structure Presentation of related components e.g. acc can be used to capture the importance of the structure query languages referring to content and structure are being developed for accessing XML documents, e.g. XIRQL, NEXI, XQUERY XML is the currently adopted format for structured documents structure index captures in which document component the term occurs (e.g. title, section), as well as the type of document components (e.g. XML tags) additional constraints are imposed from the structure e.g. a chapter and its sections may be retrieved

Structured Documents Book Chapters Sections Paragraphs  In general, any document can be considered structured according to one or more structure-type Linear order of words, sentences, paragraphs … Hierarchy or logical structure of a book’s chapters, sections … Links (hyperlink), cross-references, citations … Temporal and spatial relationships in multimedia documents World Wide Web This is only only another to look one le to show the need an la a out structure of and more a document and so ass to it doe not necessary text a structured document have retrieval on the web is an it important topic of today’s research it issues to make se last sentence..

Structured Documents  The structure can be implicit or explicit  Explicit structure is formalised through document representation standards (Mark- up Languages) — Layout  LaTeX (publishing), HTML (Web publishing) — Structure  SGML, XML (Web publishing, engineering), MPEG-7 (broadcasting) — Content/Semantic  RDF (ontology) World Wide Web This is only only another to look one le to show the need an la a out structure of and more a document and so ass to it doe not necessary text a structured document have retrieval on the web is an it important topic of today’s research it issues to make se last sentence.. SDR … …

XML: eXtensible Mark-up Language  Meta-language (user-defined tags) currently being adopted as the document format language by W3C  Used to describe content and structure (and not layout)  Grammar described in DTD (  used for validation) Structured Document Retrieval Smith John Introduction into SDR …. … …

XML: eXtensible Mark-up Language  Use of XPath notation to refer to the XML structure chapter/title: title is a direct sub-component of chapter //title: any title chapter//title: title is a direct or indirect sub-component of chapter chapter/paragraph[2]: any direct second paragraph of any chapter chapter/*: all direct sub-components of a chapter Structured Document Retrieval Smith John Introduction into SDR …. …

Approaches  Region algebra  Passage retrieval  Aggregation-based  XML retrieval

Region algebra  Manipulates text intervals - “between which positions in the document?”; and uses containment relationships - “in which components?” — Various methods but with similar aims: Simple Concordance List, Generalised Concordance List, Proximal Nodes … — Ranking based on word distances — Suited for CO and CAS queries Structured Document Retrieval Introduction into Structured Document Retrieval … SDR … Query: “document” and “retrieval” Intervals: {(102, 103)(107, 108)} Query: [chapter] containing SDR Intervals: {(103.2, 167.2)} (SIGIR 1992, Baezia-Yates etal 1999, XML retrieval Mihajlovic etal CIKM 2005)

Passage retrieval  Passage: continuous part of a document, Document: set of passages  A passage can be defined in several ways: — Fixed-length e.g. (300-word windows, overlapping) — Discourse (e.g. sentence, paragraph)  e.g. according to logical structure but fixed (e.g. passage = sentence, or passage = paragraph) — Semantic (TextTiling based on sub-topics)  Apply IR techniques to passages — Retrieve passage or document based on highest ranking passage or sum of ranking scores for all passages — Deal principally with CO queries p1 p2 p3 p4 p5 p6 doc (SIGIR 1993, SIGIR 1994)

Aggregation-based  Approaches that exploit the hierarchical structure; particularly suited to XML retrieval  General idea: the representation or rank score of a composite component (e.g. article and section) is defined as the aggregated representation or rank score of its sub-components — Aggregation of retrieval status values — Aggregation of representations paragraph article section p1 is about “XML”, “retrieval” p2 is about “XML”, “authoring” sec3 is then also about “XML” (in fact very much about “XML”), “retrieval”, “authoring” (SIGIR 1997, JDoc 1998, SPIRE 2002, Frisse 1988, Chiaramella etal, 1996, ECIR 2002)

XML Retrieval  Data-centric views of XML documents — XML database retrieval  Document-centric view of XML — XML information retrieval: content-oriented approaches  XML retrieval — Challenges — Approaches

Data Centric XML Documents Thomas Mounia Tony

Document Centric XML Documents Mounia James Bond is the best student in the class. He scored 95 points out of 100. His presentation of Using Materialized Views in Data Warehouse was brilliant. Donald Duck is not a very good student. He scored 20 points…

Database approaches to XML retrieval Relational OO Native Flexibility, expressiveness, complexity Efficiency  Data-oriented retrieval — containment and not aboutness — no relevance-based ranking  Aims/challenges tend to focus on efficiency performance  … but ranking is slowly becoming important too … XQuery (XXL system; SIGMOD 2005, VLDB 2005)

Content-oriented XML retrieval Return document components of varying granularity (e.g. a book, a chapter, a section, a paragraph, a table, a figure, etc), relevant to the user’s information need both with regards to content and structure. SEARCHING = QUERYING + BROWSING

Book Chapters Sections Subsections World Wide Web This is only only another to look one le to show the need an la a out structure of and more a document and so ass to it doe not necessary text a structured document have retrieval on the web is an it important topic of today’s research it issues to make se last sentence.. XML retrieval allows users to retrieve document components that are more focussed, e.g. a subsection of a book instead of an entire book. SEARCHING = QUERYING + BROWSING

Content-oriented XML retrieval Retrieve the best components according to content and structure criteria:  INEX: most specific component that satisfies the query, while being exhaustive to the query  Shakespeare study: best entry points, which are components from which many relevant components can be reached through browsing  ???

Article ?XML,?retrieval ?authoring 0.9 XML 0.5 XML 0.2 XML 0.4 retrieval 0.7 authoring Challenge 1: term weights Title Section 1Section 2 No fixed retrieval unit + nested document components:  how to obtain document and collection statistics (e.g. tf, idf)  which aggregation formalism to use?  inner or outer aggregation?

Article ?XML,?retrieval ?authoring 0.9 XML 0.5 XML 0.2 XML 0.4 retrieval 0.7 authoring Challenge 2: augmentation weights Title Section 1Section 2 Nested document components:  which components contribute best to content of Article (e.g. acc)?  how to estimate (or learn) augmentation weights (e.g. size, number of children, depth)?  how to aggregate term and/or augmentation weights?

Article ?XML,?retrieval ?authoring 0.9 XML 0.5 XML 0.2 XML 0.4 retrieval 0.7 authoring Challenge 3: component weights Title Section 1Section 2 Different types of document components:  which component is a good retrieval unit?  is element size an issue?  how to estimate (or learn) component weights (frequency, user studies, size, depth)?  how to aggregate term, augmentation and/or component weights?

Article XML,retrieval authoring XML XML XML retrieval authoring Challenge 4: overlapping elements Title Section 1Section 2 Nested (overlapping) elements:  Section 1 and article are both relevant to “XML retrieval”  which one to return so that to reduce overlap?  should the decision be based on user studies, size, types, etc?

Approaches … vector space model probabilistic model Bayesian network language model extending DB model Boolean model natural language processing cognitive model ontology parameter estimation tuning smoothing fusion phrase term statistics collection statistics component statistics proximity search logistic regression belief model relevance feedback divergence from randomness machine learning

Vector space model article index abstract index section index sub-section index paragraph index RSVnormalised RSV RSVnormalised RSV RSVnormalised RSV RSVnormalised RSV RSVnormalised RSV merge tf and idf as for fixed and non-nested retrieval units (IBM Haifa, INEX 2003)

Language model element language model collection language model smoothing parameter element score element size element score article score query expansion with blind feedback ignore elements with  20 terms high value of leads to increase in size of retrieved elements rank element (University of Amsterdam, INEX 2003)

Ranker - Merger Topic Processor Filter Indexer Extractor Relevant documents RankerMerger Relevant fragments Fragments augmented with ranking scores TopicResult Indices IEEE Digital Library Ranker 5 Ranker 4 Ranker 3 Ranker 2 Ranker 1 (Ben-Aharon, INEX 2003) –Word Number –IDF –Similarity –Proximity –TFIDF

Merging - Normalisation (Amati et al, INEX 2004)

Probabilistic algebra // article [about(.,"bayesian networks")] // sec [about(., learning structure)]  “Vague” sets — R(…) defines a vague set of elements — label -1 (…) can be defined as strict for SCAS or vague for VCAS  Intersections and Unions are computed as probabilistic “and” and fuzzy-or. ( Vittaut, INEX 2004)

Controlling Overlap Start with a component ranking, elements are re- ranked to control overlap. Retrieval status values (RSV) of those components containing or contained within higher ranking components are iteratively adjusted 1.Select the highest ranking component. 2.Adjust the RSV of the other components. 3.Repeat steps 1 and 2 until the top m components have been selected. (SIGIR 2005)

Controlling overlap (IBM Haifa, INEX 2005) Thorough run Cluster of highly ranked results 1. score N1 > score N2 2. concentration of good nodes 3. Even distribution

Controlling overlap What about a model where overlap is removed not at post-processing? Bayesian network can be used model the structure?

…..  See INEX proceedings, SIGIR XML IR workshops, INEX Information Retrieval special issue 2005, forthcoming TOIS special issue, SIGIR 2004, SIGIR 2005, …  Also have a look in work in the database community, who are also doing ranking, e.g. WebDB, SIGMOD, VLDB, …

User aspects  User study - INEX interactive track  Incorporating user behaviour

Evaluation of XML retrieval: INEX  Evaluating the effectiveness of content-oriented XML retrieval approaches  Collaborative effort  participants contribute to the development of the collection queries relevance assessments  Similar methodology as for TREC, but adapted to XML retrieval

 Investigate behaviour of searchers when interacting with XML components  Content-only Topics — topic type an additional source of context  Background topics / Comparison topics — 2 topic types, 2 topics per type — 2004 INEX topics have added task information  Searchers — “distributed” design, with searchers spread across participating sites Interactive Track in 2004

Topic Example +new +Fortran +90 +compiler How does a Fortran 90 compiler differ from a compiler for the Fortran before it. I've been asked to make my Fortran compiler compatible with Fortran 90 so I'm interested in the features Fortran 90 added to the Fortran standard before it. I'd like to know about compilers (they would have been new when they were introduced), especially compilers whose source code might be available. Discussion of people's experience with these features when they were new to them is also relevant. An element will be judged as relevant if it discusses features that Fortran 90 added to Fortran. new Fortran 90 compiler

Baseline system

Some results  How far down the ranked list?  83 % from rank 1-10  10 % from rank  Query operators rarely used  80 % of queries consisted of 2, 3, or 4 words  Accessing components  ~2/3 was from the ranked list  ~1/3 was from the document structure (ToC)  1 st viewed component from the ranked list  40% article level, 36% section level, 22% ss1 level, 4% ss2 level  ~ 70 % only accessed 1 component per document

Interpretation of structural constraints  Content-only (CO): aim is to decrease user effort by pointing the user to the most specific relevant elements ( )  Strict content-and-structure (SCAS): retrieve relevant elements that exactly match the structure specified in the query (2002, 2003)  Vague content-and-structure (VCAS):  retrieve relevant elements that may not be the same as the target elements, but are structurally similar (2003)  retrieve relevant elements even if do not exactly meet the structural conditions; treat structure specification as hints as to where to look (2004) CO CAS

Relevance in information retrieval relevant  A document is relevant if it “has significant and demonstrable bearing on the matter at hand”.  Common assumptions in information retrieval laboratory experimentation:  Objectivity  Topicality  Binary nature  Independence (Borlund, JASIST 2003)

Relevance in XML retrieval relevant  A document is relevant if it “has significant and demonstrable bearing on the matter at hand”.  Common assumptions in laboratory experimentation:  Objectivity  Topicality  Binary nature  Independence XML retrieval evaluation XML retrieval article ss1 ss2 s1 s2 s3 XML evaluation

Relevance in XML retrieval: INEX  Relevance = (0,0) (1,1) (1,2) (1,3) (2,1) (2,2) (2,3) (3,1) (3,2) (3,3) exhaustivity = how much the section discusses the query: 0, 1, 2, 3 specificity = how focused the section is on the query: 0, 1, 2, 3  If a subsection is relevant so must be its enclosing section,...  Topicality not enough  Binary nature not enough  Independence is wrong XML retrieval evaluation XML retrieval article ss1 ss2 s1 s2 s3 XML evaluation (based on Chiaramella etal, FERMI fetch and browse model 1996)

best Retrieve the best XML elements according to content and structure criteria:  Most exhaustive and most specific = (3,3)  Near misses = (3,3) + (2,3) (1,3) <--- specific  Near misses = (3, 3) + (3,2) (3,1) <-- exhaustive  Near misses = (3, 3) + (2,3) (1,3) (3,2) (3,1) (1,2) … overlapping  Focussed retrieval = no overlapping elements

Two four-graded dimensions of relevance  How to differentiate between (1,3) and (3,3), …?  Several “user models” — Expert and impatient: retrieval of highly exhaustive and specific elements (3,3) — Expert and patient: retrieval of highly specific elements (3,3), (2,3) (1,3) — … — Naïve and has lots of time: retrieval of any relevant elements; i.e. everything apart (0,0)  …..

XML retrieval - The Future  Novel model that are XML specific, and not standard IR models applied to XML, with some post-processing  Aggregation: inner, outer?  So far we have dealt only with syntactical aspect - logical structure - and not much has been done with respect to semantics  What do users want for XML retrieval?

Conclusions  SDR --> now mostly about XML retrieval  Efficiency: — Not just documents, but all its elements  Models — Statistics to be adapted or redefined — Aggregation / combination  User tasks  Link to web retrieval / novelty retrieval  Interface and visualisation  Clustering, categorisation, summarisation  Applications — Intranet, the Internet(?), digital libraries, publishing companies, semantic web, e-commerce

Thank you