A Distributed Indexing Strategy for Efficient XML Retrieval Efficiency Issues in Information Retrieval Workshop 30th European Conference on Information.

A Distributed Indexing Strategy for Efficient XML Retrieval
Efficiency Issues in Information Retrieval Workshop
30th European Conference on Information Retrieval (ECIR), Glasgow, GB, March/April 2008
Judith Winter
Institute for Informatics / Telematics Group
J. W. Goethe-University / Frankfurt am Main, Germany

Slide 2: Overview
1. Introduction
2. A search engine for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion

1. Introduction

Slide 3: XML Information Retrieval in Peer-to-Peer Systems
[Diagram combining three areas: XML Retrieval, Information Retrieval, Peer-to-Peer]
Diagram labels: structured documents; more precise search; based on c/s architectures; distributed autonomous peers; growing amount of XML documents; vague queries; relevance ranking
Challenges:
- bandwidth consumption / communication overhead
- only selected information available

Slide 4: System characteristics
- Queries: content-and-structure (CAS)
- Indexing: include structure
  - Fixed limit for posting list sizes; pre-computing of posting lists for popular term combinations -> highly discriminative keys (HDKs)
  - Hybrid indexing: globally or locally (distributing summaries) depending on peer status
  - Pruning posting lists by considering structural information
- Ranking: extended vector space model
- Results/Retrieval units: document or passage retrieval

Slide 5: [Architecture diagram of the search engine; only its labels survive]
- Top-level labels: INFORMATION RETRIEVAL PEER-TO-PEER APPLICATION
- Phases: Indexing; Querying & result presentation
- Components: Graphical User Interface; Indexing component; Retrieval component; Ranking component; Index storage component (local index, distributed index); P2P component (variant of DHT algorithm, Kademlia/Chord)
- Indexes: Document index; Retrieval unit index; Frequent XTerm index; HDK index
- Data labels: local documents (file system); documents d_n; query q; results for q; term statistics for retrieval units(d); frequencies
- P2P network
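The slide names a Kademlia/Chord variant for the P2P component but gives no details. As a rough illustration of how a DHT of the Kademlia family can decide which peer stores the posting list for a key, the sketch below hashes keys and peer names into one identifier space and picks the peer with the smallest XOR distance. The function names, the use of SHA-1 and the peer names are assumptions made for this example, not taken from the presented system.

```python
import hashlib


def dht_id(name: str) -> int:
    """Map a key or peer name to a 160-bit identifier (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def responsible_peer(key: str, peers: list[str]) -> str:
    """Return the peer whose identifier is closest to the key under the XOR metric."""
    key_id = dht_id(key)
    return min(peers, key=lambda peer: dht_id(peer) ^ key_id)


# Hypothetical peers; the index key is written as an XTerm-like string.
peers = ["peer-a", "peer-b", "peer-c"]
owner = responsible_peer("apple\\book\\chapter", peers)
```

A real DHT finds this peer through routing tables in a logarithmic number of hops instead of scanning a global peer list; the scan only keeps the example short.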

Slide 6
- Use of XTerms: (content, structure) tuples
- Rare tuple combinations: Highly Discriminative Keys (HDKs)
- Over 80% multiterm queries -> precomputed key combinations
- If a key is frequent (its frequency exceeds a threshold): combine it with other frequent keys of the same window (e.g. the same XML element)

Example of HDK-based indexing:

apple   \book\chapter   -> dok1 (14.5), dok2 (12.4)
        \magazine\p     -> dok2 (5.3), dok3 (2.7), dok4 (0.7)
chips   \book           -> dok4 (18.4), dok1 (2.3), dok2 (2.1), dok3 (1.5)
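The key-combination rule on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: an XTerm is modelled as a (term, path) tuple, a key is a tuple of XTerms, one XML element is one window, and a key counts as frequent when its document frequency exceeds a hypothetical threshold `df_max`. The function name and the input format are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_hdk_index(elements, df_max):
    """Build single-XTerm keys, then pair up frequent keys that share a window.

    elements: list of (doc_id, path, terms), one entry per XML element (the window).
    Returns a dict mapping a key (tuple of XTerms) to the set of matching doc ids.
    """
    index = defaultdict(set)
    for doc_id, path, terms in elements:
        for term in set(terms):
            index[((term, path),)].add(doc_id)

    # A single-XTerm key is "frequent" if it occurs in more than df_max documents.
    frequent = {key for key, docs in index.items() if len(docs) > df_max}

    # Combine frequent keys that co-occur in the same element into two-XTerm keys.
    pairs = defaultdict(set)
    for doc_id, path, terms in elements:
        window = sorted({((t, path),) for t in terms} & frequent)
        for a, b in combinations(window, 2):
            pairs[a + b].add(doc_id)

    index.update(pairs)
    return dict(index)


path = "\\book\\chapter"
elements = [
    ("dok1", path, ["apple", "chips"]),
    ("dok2", path, ["apple", "chips"]),
    ("dok3", path, ["apple"]),
]
index = build_hdk_index(elements, df_max=1)
```

With `df_max=1` both single keys are frequent, so the index gains a combined key for apple and chips that matches only dok1 and dok2. The sketch keeps the frequent single keys as well; truncating their posting lists is a separate step.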

Slide 7: Pruning posting lists (FrequentXTermIndex)
- Entries sorted by score_t(d_i); choose the k best entries for XTerm t
- Considers document d_i, best retrieval unit ru_best, and peer p_i
- Weighting function w: BM25f-based
- PeerScore: high for peers with good collections regarding t and with good performance metrics
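The transcript does not preserve how the document weight, the weight of the best retrieval unit and the PeerScore are combined into score_t(d_i). The sketch below therefore assumes a simple combination (a weighted average of the two weights, scaled by the peer score) purely to illustrate the top-k pruning step. The `Posting` fields, `combined_score` and the `alpha` parameter are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Posting:
    doc_id: str
    doc_weight: float      # w(t, d_i), e.g. a BM25f-based weight
    best_ru_weight: float  # w(t, ru_best) for the best retrieval unit of d_i
    peer_score: float      # PeerScore of the peer p_i holding d_i


def combined_score(posting: Posting, alpha: float = 0.5) -> float:
    """ASSUMED combination; the slide's actual formula is not preserved."""
    content = alpha * posting.doc_weight + (1 - alpha) * posting.best_ru_weight
    return content * posting.peer_score


def prune_posting_list(postings, k):
    """Keep the k best entries for one XTerm, sorted by descending score."""
    return heapq.nlargest(k, postings, key=combined_score)


postings = [
    Posting("dok2", 12.4, 12.0, 0.5),
    Posting("dok3", 5.3, 5.0, 1.0),
    Posting("dok1", 14.5, 10.0, 1.0),
]
top2 = prune_posting_list(postings, k=2)
```

Only the pruned list has to be shipped to the peer responsible for the XTerm, which is where the bandwidth saving comes from.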

Slide 8: Hybrid indexing
Indexing depending on status of peer:
- Exhaustive indexing: per document
- Quick indexing: per peer (summaries, e.g. tf per peer)
Peer status considers:
- Response times
- Available bandwidth
- Open IP address (vs. NAT-bound)
- Latency
- CPU/Memory
- …
- Online time (65% of the peers joined the system online only once, >20% of all connections lasted <1 minute, 60% of the peers kept active <10 min)
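The choice between the two indexing modes can be sketched as below. The slide does not say which peer status leads to which mode or what thresholds apply, so this example assumes that peers judged reliable are indexed exhaustively (one entry per term and document) while all others publish only a per-peer summary. `PeerStatus`, `is_reliable` and the threshold values are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PeerStatus:
    avg_online_minutes: float
    bandwidth_kbps: float
    nat_bound: bool


def is_reliable(status, min_online_minutes=10.0, min_bandwidth_kbps=256.0):
    """ASSUMED rule: long-lived, well-connected, directly reachable peers."""
    return (status.avg_online_minutes >= min_online_minutes
            and status.bandwidth_kbps >= min_bandwidth_kbps
            and not status.nat_bound)


def index_entries(peer_id, status, docs):
    """docs maps doc_id to its list of terms; returns (term, target, tf) entries."""
    if is_reliable(status):
        # Exhaustive indexing: term frequency per document.
        return [(term, doc_id, tf)
                for doc_id, terms in docs.items()
                for term, tf in Counter(terms).items()]
    # Quick indexing: one summary entry per term for the whole peer.
    totals = Counter()
    for terms in docs.values():
        totals.update(terms)
    return [(term, peer_id, tf) for term, tf in totals.items()]


docs = {"d1": ["apple", "apple", "chips"], "d2": ["apple"]}
stable = PeerStatus(avg_online_minutes=120.0, bandwidth_kbps=1000.0, nat_bound=False)
short_lived = PeerStatus(avg_online_minutes=0.5, bandwidth_kbps=1000.0, nat_bound=False)
exhaustive = index_entries("peer-7", stable, docs)
quick = index_entries("peer-7", short_lived, docs)
```

For the short-lived peer the index receives two summary entries instead of three per-document entries; with realistic collections the gap is much larger, which is the point of treating peers that stay online only briefly more cheaply.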

Slide 9: Outlook on current implementation
- Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
- Indexing based on Terrier (centralized approach for text documents, Uni Glasgow)
- P2P-complex:
  - Based on Kademlia/Chord
  - Collects peer characteristics
  - Adapted to special requirements of XML IR
- Ranking:
  - Extension of the vector space model
  - BM25f-based weighting

Slide 10: A Distributed Indexing Strategy for Efficient XML Retrieval
1. Introduction
2. Architecture for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion