XML Ranking Querying, Dagstuhl, 9-13 Mar, 2008
An Adaptive XML Retrieval System
Yosi Mass, Michal Shmueli-Scheuer, IBM Haifa Research Lab

Presentation transcript:

The XML retrieval tasks
Query formulation
  - CO: content only
  - CAS: content and structure (NEXI)
Retrieval tasks
  - Thorough: "find all highly exhaustive and specific elements". Retrieval results can be (possibly overlapping) XML elements of varying granularity that fulfill the query.
  - Focussed: "find the most exhaustive and specific element in a path". No overlap in returned results.

Approaches for XML retrieval
Index full documents
  - Score documents and then components inside the documents
  - Problem: works well for "fetch and browse" but not for the general thorough task
Index only leaf elements
  - Score leaves and propagate scores along the XML tree
  - Problem: the weights used for propagation are set either manually by the user or empirically
Index all elements into the same index
  - Score all possible elements
  - Problem: distorted element-level statistics due to overlapping elements
Can we fix the distorted statistics?

An adaptive XML retrieval system
Split all collection elements into separate indices such that
  - Coverage: each element is indexed in at least one index
  - No overlap: elements in each index do not nest
Run the query on each index
Merge the results into a single result list
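The query-and-merge step can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the slides do not say which IR engine or merge rule was used, so `search_index` is a hypothetical toy tf-idf scorer, and `search_all` simply sorts all hits by raw score. The point it shows is that term statistics (`n`, `df`) are computed separately inside each non-overlapping index.

```python
import math
from collections import Counter


def search_index(index, query_terms):
    """Score the elements of ONE index with a toy tf-idf (hypothetical scorer).

    index maps an element path to its text. Document frequencies are
    computed from this index only, so nested elements never inflate them.
    """
    term_counts = {path: Counter(text.lower().split()) for path, text in index.items()}
    n = len(term_counts)
    hits = []
    for path, tf in term_counts.items():
        score = 0.0
        for term in query_terms:
            df = sum(1 for counts in term_counts.values() if term in counts)
            if df:
                score += tf[term] * math.log(1 + n / df)
        if score > 0:
            hits.append((score, path))
    return hits


def search_all(indices, query_terms):
    """Run the query on every index and merge by raw score (assumed merge rule)."""
    merged = []
    for index in indices:
        merged.extend(search_index(index, query_terms))
    return sorted(merged, key=lambda hit: hit[0], reverse=True)


# Toy collection: index 0 holds whole articles, index 1 holds sections.
toy_indices = [
    {"/article[1]": "xml retrieval with adaptive indexing of xml elements"},
    {
        "/article[1]/bdy[1]/sec[1]": "xml retrieval",
        "/article[1]/bdy[1]/sec[2]": "database systems",
    },
]
results = search_all(toy_indices, ["xml"])
```

A real system would likely need to normalize scores across indices before merging, since each index has different statistics; the sketch leaves that out.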

Split to indices - example
(Figure: element trees of two example documents, with each tree level assigned to one of Index 0 to Index 3.)
Document 1
  Index 0: /article[1]
  Index 1: /article[1]/bdy[1]
  Index 2: /article[1]/bdy[1]/sec[1], /article[1]/bdy[1]/sec[2]
  Index 3: /article[1]/bdy[1]/sec[2]/p[1], /article[1]/bdy[1]/sec[2]/p[3]
Document 2
  Index 0: /article[1]
  Index 1: /article[1]/bdy[1]
  Index 2: /article[1]/bdy[1]/sec[1]
  Index 3: /article[1]/bdy[1]/sec[1]/ss1[1], /article[1]/bdy[1]/sec[1]/ss1[2]
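The two requirements on a split (coverage and no overlap) can be checked mechanically. The sketch below does so for the paths of one of the example documents on this slide. The helper names `is_ancestor`, `has_overlap` and `covers` are hypothetical, and elements are identified by path only; a real collection would presumably key them by document and path.

```python
def is_ancestor(a, b):
    """True if XPath a is a proper ancestor of XPath b."""
    return b.startswith(a + "/")


def has_overlap(index_paths):
    """True if any element in the index is nested inside another one."""
    return any(is_ancestor(a, b) for a in index_paths for b in index_paths)


def covers(indices, elements):
    """True if every element appears in at least one index."""
    indexed = set().union(*indices)
    return all(element in indexed for element in elements)


# Paths of the example document with sec[1], sec[2], p[1] and p[3].
DOC_INDICES = [
    ["/article[1]"],
    ["/article[1]/bdy[1]"],
    ["/article[1]/bdy[1]/sec[1]", "/article[1]/bdy[1]/sec[2]"],
    ["/article[1]/bdy[1]/sec[2]/p[1]", "/article[1]/bdy[1]/sec[2]/p[3]"],
]
DOC_ELEMENTS = [path for index in DOC_INDICES for path in index]

no_overlap_anywhere = not any(has_overlap(index) for index in DOC_INDICES)
fully_covered = covers(DOC_INDICES, DOC_ELEMENTS)
# Putting an article and its body into the same index would violate "no overlap".
bad_index_overlaps = has_overlap(["/article[1]", "/article[1]/bdy[1]"])
```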

An adaptive indexing schema
SplitToIndices(doc, minCompSize, nInd)
  1. Find all leaves in doc that are larger than minCompSize.
  2. If no such leaves are found, return G_0 = {root}.
  3. Let d be the length of the longest path among all those leaves.
  4. Create groups {G_0, …, G_{d-1}}, where each G_i contains the XPath prefixes of length i+1 of all matched leaves.
  5. Remove repeated elements in each group.
  6. Split the groups {G_0, …, G_{d-1}} into indices {I_0, …, I_{nInd-1}} (several strategies are possible).
  7. Return {I_0, …, I_{nInd-1}}.
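The grouping part of SplitToIndices (everything before the groups are distributed over nInd indices) can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `prefix_groups` is a hypothetical name, leaf sizes are passed in as a precomputed dictionary because the slides do not say how size is measured, and group i is taken to hold prefixes of i+1 steps so that G_0 contains the root.

```python
def split_steps(xpath):
    """Split '/article[1]/bdy[1]' into ['article[1]', 'bdy[1]']."""
    return [step for step in xpath.split("/") if step]


def prefix_groups(leaf_sizes, min_comp_size, root):
    """Build the groups G_0..G_{d-1} from the leaves above the size threshold.

    leaf_sizes maps a leaf XPath to its size (unit assumed, e.g. tokens).
    """
    matched = [path for path, size in leaf_sizes.items() if size > min_comp_size]
    if not matched:
        return [[root]]
    depth = max(len(split_steps(path)) for path in matched)
    groups = []
    for length in range(1, depth + 1):
        prefixes = [
            "/" + "/".join(split_steps(path)[:length])
            for path in matched
            if len(split_steps(path)) >= length
        ]
        # dict.fromkeys removes repeated elements while keeping their order
        groups.append(list(dict.fromkeys(prefixes)))
    return groups


# Hypothetical sizes; p[2] falls below the threshold and is not indexed.
example_sizes = {
    "/article[1]/bdy[1]/sec[1]": 50,
    "/article[1]/bdy[1]/sec[2]/p[1]": 30,
    "/article[1]/bdy[1]/sec[2]/p[2]": 5,
    "/article[1]/bdy[1]/sec[2]/p[3]": 20,
}
example_groups = prefix_groups(example_sizes, min_comp_size=10, root="/article[1]")
```

With these made-up sizes the four groups match the four indices listed for the sec[1]/sec[2] document in the split example.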

Examples: cut long paths
Minimal element: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
Split to indices
  index 0: /article[1]
  index 1: /article[1]/body[1]
  index 2: /article[1]/body[1]/section[7]
  index 3: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]
  index 4: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]
  index 5: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]
  index 6: /article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]/tr[1]/td[2]/tr[1]/td[2]
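One rule that reproduces this example is to keep the top three levels and the deepest nInd - 3 levels, dropping the levels in between. That rule is inferred from the listing alone and is only one of the "several strategies" the slides mention; `cut_long_paths` and its `top_levels` parameter are hypothetical names.

```python
def cut_long_paths(groups, n_ind, top_levels=3):
    """Map d prefix groups onto at most n_ind indices (assumed strategy).

    If there are more groups than indices, keep the first top_levels groups
    and the deepest remaining ones; the in-between levels are not indexed.
    """
    if len(groups) <= n_ind:
        return list(groups)
    top = min(top_levels, n_ind)
    bottom = n_ind - top
    return groups[:top] + groups[len(groups) - bottom:]


minimal_element = (
    "/article[1]/body[1]/section[7]/table[1]/tr[1]/td[2]"
    "/tr[1]/td[2]/tr[1]/td[2]"
)
steps = [step for step in minimal_element.split("/") if step]
# One group per prefix length, each holding the single prefix of that length.
path_groups = [["/" + "/".join(steps[:length])] for length in range(1, len(steps) + 1)]
path_indices = cut_long_paths(path_groups, n_ind=7)
```

Under this rule the prefixes of length 4 to 6 (table[1], tr[1], td[2]) end up in no index.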

Experiments
IEEE collection
  - …,000 articles, 700MB
  - Average document length ~41K
  - Average depth …
  - … topics from INEX 2005
Wikipedia collection
  - 660,000 pages, 4.5GB
  - Average document length 6.8K
  - Average depth …
  - … topics from INEX 2006

Coverage
For nInd=7 and minCompSize=10:
  - 87% coverage of the IEEE collection recall base
  - 75% coverage of the Wikipedia collection filtered recall base
The filtered recall base was generated by removing all link elements from the recall base.
We still miss some small elements and some in-between elements that have depth > 7.

Doc pivot
Some low-level indices hold only part of the collection's content, so their statistics are incomplete.
Solution: compensate with the score of the containing document:
  Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)
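The doc pivot formula is a linear interpolation between the element's own score and the score of its containing document. A minimal sketch, assuming docPivot lies in [0, 1] (the slides do not state its range) and using the hypothetical function name `doc_pivot_score`:

```python
def doc_pivot_score(element_score, doc_score, doc_pivot):
    """Score'(e) = docPivot * Score(doc(e)) + (1 - docPivot) * Score(e)."""
    if not 0.0 <= doc_pivot <= 1.0:
        # The [0, 1] range is an assumption, not stated in the slides.
        raise ValueError("doc_pivot is assumed to lie in [0, 1]")
    return doc_pivot * doc_score + (1.0 - doc_pivot) * element_score


# An element scored 0.2 inside a document scored 0.8, weighted equally.
blended = doc_pivot_score(element_score=0.2, doc_score=0.8, doc_pivot=0.5)
```

With doc_pivot = 0 the element keeps its own score, and with doc_pivot = 1 it inherits the document score entirely.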

Elements distribution

Tuning the number of indices
Set minCompSize=10

Tuning the minimum component size
Set number of indices nInd=7

Summary
Adaptive indexing schema
  - Splits XML elements into separate indices
  - Same parameters for different collections
XML retrieval system
  - Achieved by running existing IR engines on each index
Can be used for CAS
Relatively low MAep results
  - Does XML structure reflect any semantic structure?

Thank you!