“INEX 2005: Playground for XML-retrieval” Sergey Chernov
Why Do We Need XML Retrieval?* *Slide is taken from Prabhakar Raghavan Sergey Chernov, Info Lunch at L3S 22/11/18
A Scenario for Desktop Search
  Xuan searches for "the articles about multimedia conferences and workshops, which are titled 'call for papers' or 'upcoming events' and were recommended by Mounia".
  Query: multimedia workshop /title upcoming events /receivedFrom Mounia
  [Figure: metadata graph around one publication. Node labels: uid:123; Queen Mary Uni; Mounia Lalmas; Mounia; Lalmas; http://inex.is.informatik.uni-duisburg.de/2005/index.html; msgid:00465; Upcoming Events; c:\inex1.8\xml\mu\1998\u40c2.xml; IEEE MULTIMEDIA 1999; 1070-986X; Multimedia Computing and Networking 1999 (MMCN 99); "… This conference … multimedia systems…"; 1998. Edge labels: affiliatedTo, fn, family, given, receivedFrom, accessedFrom, storedFrom, publication, title, type, publishedIn, issn, year, text.]
What is INEX?* *Slide is taken from Norbert Fuhr
INEX in the Pictures
  Paul Ogilvie, Gabriella Kazai, Saadia Malik, Börkur Sigurbjörnsson, Arjen P. de Vries, Ray Larson, Patrick Gallinari, Roelof van Zwol, Birger Larsen, Andrew Trotman, Norbert Fuhr, Mounia Lalmas, Shlomo Geva, Ludovic Denoyer, Benjamin Piwowarski
INEX in Numbers
  community: 58 research groups participated in 2005
  collection: 17000 IEEE articles from 1995-2004, 740 MB
  topics (queries): 87 in total, 40 CO+S and 47 CAS topics
  tracks: 7 (Adhoc, Relevance Feedback, Natural Language Processing, Heterogeneous, Interactive, Document Mining, Multimedia)
  publications over 4 years: >125
  important dates: April (start), November (finish)
Adhoc Track: Collection and Queries
  IEEE collection (journals and transactions)
  Language used for structural conditions: NEXI
  Topics (queries)
    Content-only + Structure (CO+S): structural part is OPTIONAL
    Content and Structure (CAS): structural part is MANDATORY
  Example content: "call for papers" conference workshop +multimedia
  Example structure: //article[about(.//atl,"upcoming events") OR about(.//atl,"call for papers")]//sec[about(., +multimedia conference workshop)]
  Target element: //article//sec
  Support elements:
    //article[about(.//atl,"upcoming events")]
    //article[about(.//atl,"call for papers")]
    //article//sec[about(., +multimedia conference workshop)]
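The split of the example NEXI query into a target element and its about() conditions can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a NEXI parser: the helper names target_path and about_clauses are hypothetical, and the regular expression assumes that no about() clause contains a comma in its path or a closing parenthesis in its content.

```python
import re

# The example structural query from the slide.
QUERY = ('//article[about(.//atl,"upcoming events") OR '
         'about(.//atl,"call for papers")]'
         '//sec[about(., +multimedia conference workshop)]')


def target_path(nexi: str) -> str:
    """Drop every [...] predicate, keeping only the element path."""
    out = []
    depth = 0
    for ch in nexi:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            out.append(ch)
    return "".join(out)


def about_clauses(nexi: str):
    """Return (relative path, content) pairs, one per about(...) clause."""
    pairs = re.findall(r"about\(([^,]*),([^)]*)\)", nexi)
    return [(path.strip(), content.strip()) for path, content in pairs]


if __name__ == "__main__":
    print(target_path(QUERY))
    for path, content in about_clauses(QUERY):
        print(path, "->", content)
```

Stripping the predicates from the example leaves //article//sec, the target element named on the slide; the three about() clauses carry the content conditions.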
Adhoc Track: Relevance Assessment Methodology
  Select the top 1500 components in a topic’s retrieval results
  Assess w.r.t. two dimensions:
    Exhaustivity (E): the extent to which the document component discusses the topic.
    Specificity (S): the extent to which the document component focuses on the topic.
  Assessment labels: Highly exhaustive, Partially exhaustive, Too small
Online Relevance Assessment System X-Rai
Adhoc: CO Retrieval Strategies
  CO.Focussed: find the most exhaustive and specific element in a path. Retrieved elements cannot contain any overlapping elements.
  CO.Thorough: find all highly exhaustive and specific elements. Overlap is considered an interface and results presentation issue.
  CO.FetchBrowse: first identify relevant articles, then identify the most exhaustive and specific elements within the fetched articles.
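The no-overlap requirement of CO.Focussed can be illustrated with a simple greedy filter. This is one possible sketch, not the method of any INEX participant: it assumes each result is a (document id, element path, score) triple, that paths use an XPath-like form with positional indices, and that keeping the highest-scored element on each path is an acceptable heuristic. The names overlaps and remove_overlap and the sample results are hypothetical.

```python
def overlaps(doc_a, path_a, doc_b, path_b):
    """True if both elements are in the same document and one contains the other."""
    if doc_a != doc_b:
        return False
    return (path_a == path_b
            or path_a.startswith(path_b + "/")
            or path_b.startswith(path_a + "/"))


def remove_overlap(results):
    """Greedily keep the best-scored elements, skipping anything that
    overlaps an element that was already kept."""
    kept = []
    for doc, path, score in sorted(results, key=lambda r: r[2], reverse=True):
        if not any(overlaps(doc, path, d, p) for d, p, _ in kept):
            kept.append((doc, path, score))
    return kept


if __name__ == "__main__":
    # Hypothetical ranked elements from one article.
    results = [
        ("u40c2", "/article[1]/bdy[1]/sec[2]", 0.9),
        ("u40c2", "/article[1]/bdy[1]", 0.7),
        ("u40c2", "/article[1]/bdy[1]/sec[2]/p[1]", 0.6),
        ("u40c2", "/article[1]/bdy[1]/sec[3]", 0.5),
    ]
    for row in remove_overlap(results):
        print(row)
```

In the sample, the body element and the paragraph are dropped because they contain, or are contained in, the higher-scored section sec[2]; sec[3] survives because it is a sibling.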
Adhoc: CAS Retrieval Strategies
  VVCAS: target vague, support vague (structural constraints in both the target elements and the support elements are interpreted as vague).
  SVCAS: target strict, support vague.
  VSCAS: target vague, support strict.
  SSCAS: target strict, support strict.
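The naming scheme can be read off mechanically: the first letter gives the interpretation of the target element, the second that of the support elements. A minimal Python sketch, with the function name cas_interpretation chosen here purely for illustration:

```python
def cas_interpretation(strategy: str) -> dict:
    """Decode a CAS strategy name such as 'SVCAS' into its two settings.

    First letter: target element constraint; second letter: support
    element constraints. 'S' means strict, 'V' means vague.
    """
    modes = {"V": "vague", "S": "strict"}
    if (len(strategy) != 5 or not strategy.endswith("CAS")
            or strategy[0] not in modes or strategy[1] not in modes):
        raise ValueError(f"not a CAS strategy name: {strategy!r}")
    return {"target": modes[strategy[0]], "support": modes[strategy[1]]}


if __name__ == "__main__":
    for name in ("VVCAS", "SVCAS", "VSCAS", "SSCAS"):
        print(name, cas_interpretation(name))
```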
Adhoc: Relevance Values (RV)
Adhoc: Metrics
  Consider:
    Two dimensions of relevance
    Independence assumption does not hold
    No predefined retrieval unit
    Overlap
  Extended Cumulative Gain (xCG) and its normalised version (nxCG)
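As a rough illustration of the two measures: xCG at rank i accumulates the gain values of the first i returned elements, and nxCG divides that by the cumulated gain of an ideal ranking at the same rank. The Python sketch below assumes the per-element gain values are already given (for example from a quantization of the E and S assessments) and builds the ideal ranking by simply sorting the gains; it deliberately ignores how overlap is handled when the ideal recall base is constructed. The function names are hypothetical.

```python
from itertools import accumulate


def xcg(gains):
    """Cumulated gain at each rank: xCG[i] = sum of gains[0..i]."""
    return list(accumulate(gains))


def nxcg(run_gains, recall_base_gains):
    """Normalised xCG: the run's cumulated gain divided by the ideal
    cumulated gain at the same rank. The ideal ranking here is just the
    recall-base gains in decreasing order (a simplifying assumption)."""
    run = xcg(run_gains)
    ideal = xcg(sorted(recall_base_gains, reverse=True))
    # zip stops at the shorter list, so ranks beyond it are not reported.
    return [r / i if i else 0.0 for r, i in zip(run, ideal)]


if __name__ == "__main__":
    run_gains = [0.5, 1.0, 0.0]       # gains in the order the system returned them
    recall_base = [1.0, 0.5, 0.0]     # gains of all relevant elements
    print(xcg(run_gains))
    print(nxcg(run_gains, recall_base))
```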
Adhoc: Competition
  [Figure: nxCG curves of the runs in the CO.Thorough task with generalized quantization]
Other Tracks
  Relevance Feedback
    Collection: IEEE
    Goal: investigation of relevance feedback in the context of XML retrieval. The approach should ideally consider not only content but also the structural features of XML documents.
  Interactive
    Goal: investigation of the behaviour of users when interacting with components of XML documents, and evaluation of approaches for XML retrieval that are effective in user-based environments.
  Heterogeneous
    Collection: Berkeley bib, FIZ Karlsruhe, Duisburg-Essen bib, DBLP, HCI resources, QMUL db, ZDNet
    Goal: creation of a heterogeneous test collection, retrieval experiments with a small number of both CO and CAS queries, qualitative analysis of the results.
Other Tracks (continued)
  Multimedia
    Collection: Lonely Planet document collection
    Goal: an evaluation platform/forum for structured document retrieval systems that do not only include text in the retrieval process.
  Document Mining
    Collection: IMDb collection
    Goal: generic tasks of classification and clustering.
  Natural Language Processing
    Collection: Any
    Goal: design and build software that will analyse, understand, and generate results in response to queries that humans express naturally.
Desktop Metadata Missing from INEX
  StoredFrom: Web links as sources of publications
  ReceivedFrom: email activity information, emails containing publications
  EmailAnnotations: email annotations (from sender)
  SearchKeyword: search keywords that were used at a Web search engine to find the document
  OpenLast, MovedFrom: user action history with regard to the publications
  Annotation: user annotations
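One way to picture these attributes is as an extra metadata record attached to each publication file. The Python sketch below is an assumption-laden illustration, not a proposed schema: the class name DesktopMetadata, the field types, the helper recommended_by, and all sample values are hypothetical; only the attribute names come from the list above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DesktopMetadata:
    """Desktop-side metadata for one publication file (types are assumed)."""
    path: str                                   # location of the file on disk
    stored_from: Optional[str] = None           # StoredFrom: Web link the file came from
    received_from: Optional[str] = None         # ReceivedFrom: sender of the email
    email_annotations: List[str] = field(default_factory=list)  # EmailAnnotations
    search_keywords: List[str] = field(default_factory=list)    # SearchKeyword
    open_last: Optional[str] = None             # OpenLast: last time the file was opened
    moved_from: Optional[str] = None            # MovedFrom: previous location
    annotations: List[str] = field(default_factory=list)        # Annotation


def recommended_by(records, sender):
    """Select records whose ReceivedFrom value matches the given sender."""
    return [r for r in records if r.received_from == sender]


if __name__ == "__main__":
    # Hypothetical sample records.
    records = [
        DesktopMetadata(path="paper_a.xml", received_from="Mounia",
                        email_annotations=["worth reading"]),
        DesktopMetadata(path="paper_b.xml", stored_from="http://example.org/b",
                        search_keywords=["multimedia", "workshop"]),
    ]
    print([r.path for r in recommended_by(records, "Mounia")])
```

A filter such as recommended_by corresponds to the "/receivedFrom Mounia" part of the scenario query, which the INEX collection alone cannot answer.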
Challenges for Designing a Dataset for Desktop
  Data obtained through logging
    Pros: real data
    Cons: privacy issues, high level of user cooperation required, low scalability
  Data created through simulations
    Pros: scalable, easy to modify, cheap, fewer restrictions regarding privacy
    Cons: can be based on wrong assumptions
Thanks a lot and Merry Christmas!