1
A Distributed Indexing Strategy for Efficient XML Retrieval
Efficiency Issues in Information Retrieval Workshop (EIIR) @ 30th European Conference on Information Retrieval (ECIR), Glasgow, GB, March/April 2008
Judith Winter
Institute for Informatics / Telematics Group
J. W. Goethe-University / Frankfurt am Main, Germany
2
Overview
1. Introduction
2. A search engine for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion
3
XML Information Retrieval in Peer-to-Peer Systems
Areas combined: XML Retrieval, Information Retrieval, Peer-to-Peer
Keywords (from the slide diagram): structured documents; more precise search; based on client/server architectures; distributed; autonomous peers; growing amount of XML documents; vague queries; relevance ranking
Challenges: bandwidth consumption / communication overhead; only selected information available
4
System characteristics:
Queries: content-and-structure (CAS)
Indexing: include structure
  Fixed limit for posting list sizes; pre-computing of posting lists for popular term combinations: highly discriminative keys (HDKs)
  Hybrid indexing: globally or locally (distributing summaries) depending on peer status
  Pruning posting lists by considering structural information
Ranking: extended vector space model
Results/Retrieval units: document or passage retrieval
5
[Figure: architecture of the search engine. Recoverable labels: Graphical User Interface (indexing; querying & result presentation). INFORMATION RETRIEVAL layer: indexing component, retrieval component, ranking component, index storage component (local index, distributed index). Index labels: document index, retrieval unit index, frequent XTerm index, HDK index, HDK index frequencies. PEER-TO-PEER layer: P2P component, a variant of a DHT algorithm (Kademlia/Chord), on top of the P2P network. APPLICATION layer: file system with local documents. Data flows: documents d_n, query q, results for q, term statistics for retrieval units(d).]
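The architecture names a variant of a DHT algorithm (Kademlia/Chord) as the P2P component. As a rough illustration of how a distributed index can assign an index key to a peer, the following sketch uses the XOR distance metric known from Kademlia. It is not the presented system's code: the function names `dht_id` and `responsible_peer`, the use of SHA-1 for IDs, and the peer names are assumptions made for this example, and a real Kademlia node finds the closest peer by iterative routing rather than by scanning a global peer list.

```python
import hashlib

def dht_id(name):
    """Map a peer name or index key to a 160-bit integer ID via SHA-1 (assumed ID scheme)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")

def responsible_peer(key, peers):
    """Return the peer whose ID is closest to the key's ID under the XOR metric."""
    key_id = dht_id(key)
    return min(peers, key=lambda peer: dht_id(peer) ^ key_id)

# Hypothetical peers and a key written in the (content, structure) style of the slides.
peers = ["peer-a", "peer-b", "peer-c"]
xterm_key = r"apple\book\chapter"
print(responsible_peer(xterm_key, peers))
```

Because the assignment depends only on the hashed IDs, every peer that knows the same peer set arrives at the same responsible peer for a key.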
6
HDK-based indexing:
Use of XTerms: (content, structure)-tuples
Rare tuple combinations: Highly Discriminative Keys (HDKs)
Over 80% of queries are multiterm queries, hence precomputed key combinations
If a key is frequent (frequency exceeds threshold): combine with other frequent keys of the same window (e.g. same XML element)
Example:
  Content   Structure       Posting list
  apple     \book\chapter   dok1(14.5), dok2(12.4)
            \magazine\p     dok2(5.3), dok3(2.7), dok4(0.7)
  chips     \book           dok4(18.4), dok1(2.3), dok2(2.1), dok3(1.5)
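The key-combination rule on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: a window is modelled as the list of XTerms of one XML element, frequency is counted as the number of documents containing a key, only pairs are built, and the names `build_hdk_index`, `windows`, and `threshold` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_hdk_index(windows, threshold):
    """Build posting sets for single XTerms and for pairs of frequent XTerms.

    windows: list of (doc_id, xterms), where xterms are the (content, structure)
             tuples that occur together in one window (e.g. one XML element).
    threshold: a key is 'frequent' if it occurs in more than this many documents.
    """
    postings = defaultdict(set)
    # Pass 1: single-XTerm keys.
    for doc_id, xterms in windows:
        for xt in set(xterms):
            postings[frozenset([xt])].add(doc_id)
    frequent = {next(iter(key)) for key, docs in postings.items() if len(docs) > threshold}
    # Pass 2: combine frequent XTerms that share a window into two-term keys.
    for doc_id, xterms in windows:
        for a, b in combinations(sorted(set(xterms) & frequent), 2):
            postings[frozenset([a, b])].add(doc_id)
    return dict(postings), frequent

APPLE_CHAPTER = ("apple", r"\book\chapter")
APPLE_MAGAZINE = ("apple", r"\magazine\p")
CHIPS_BOOK = ("chips", r"\book")

windows = [
    ("dok1", [APPLE_CHAPTER, CHIPS_BOOK]),
    ("dok2", [APPLE_CHAPTER, CHIPS_BOOK]),
    ("dok3", [CHIPS_BOOK]),
    ("dok4", [CHIPS_BOOK, APPLE_MAGAZINE]),
]
index, frequent = build_hdk_index(windows, threshold=1)
print(index[frozenset([APPLE_CHAPTER, CHIPS_BOOK])])
```

In this toy collection the two frequent XTerms co-occur only in dok1 and dok2, so the combined key has a much shorter posting list than `CHIPS_BOOK` alone, which is the effect the precomputed combinations aim for.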
7
Pruning posting lists (FrequentXTermIndex):
Entries sorted by score_t(d_i); choose the k best entries for XTerm t
The score considers document d_i, best retrieval unit ru_best, and peer p_i
Weighting function w: BM25F-based
PeerScore: high for peers with good collections regarding t and with good performance metrics
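The pruning step, keeping only the k best entries per XTerm, can be sketched as below. The scoring formula itself is not part of this transcript, so combining the weight and the peer score by multiplication is purely an assumption for illustration, as are the names `Posting`, `score`, and `prune`, and the sample values.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Posting:
    doc_id: str        # document d_i
    best_ru: str       # best retrieval unit ru_best within d_i
    peer_id: str       # peer p_i holding the document
    weight: float      # w(t, ru_best), assumed precomputed (BM25F-style)
    peer_score: float  # PeerScore of p_i, assumed to lie in [0, 1]

def score(posting):
    """Assumed combination: weight scaled by peer score (not the slide's formula)."""
    return posting.weight * posting.peer_score

def prune(postings, k):
    """Return the k highest-scoring postings, best first."""
    return heapq.nlargest(k, postings, key=score)

postings = [
    Posting("dok1", r"\book\chapter", "peer-a", 14.5, 0.5),
    Posting("dok2", r"\book\chapter", "peer-b", 12.4, 1.0),
    Posting("dok3", r"\magazine\p", "peer-b", 2.7, 1.0),
]
top = prune(postings, k=2)
print([p.doc_id for p in top])
```

With these sample values dok2 ranks ahead of dok1 despite its lower weight, which shows how a peer score can reorder entries before the list is cut off at k.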
8
Hybrid indexing:
Indexing depending on status of peer:
  Exhaustive indexing: per document
  Quick indexing: per peer (summaries, e.g. tf per peer)
Peer status considers:
  Response times
  Available bandwidth
  Open IP address (vs. NAT-bound)
  Latency
  CPU/Memory
  …
  Online time (65% of the peers joined the system online only once, >20% of all connections lasted <1 minute, 60% of the peers kept active <10 min)
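The choice between exhaustive and quick indexing can be sketched as follows. The slide does not state which peers receive which treatment, so this example assumes that peers with good status are indexed per document and all others only by a per-peer summary. The `PeerStatus` fields, the thresholds, and the function names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class PeerStatus:
    online_minutes: float
    bandwidth_kbps: float
    nat_bound: bool

def has_good_status(status, min_online=10.0, min_bandwidth=256.0):
    """Assumed rule: long-lived, well-connected, directly reachable peers count as good."""
    return (status.online_minutes >= min_online
            and status.bandwidth_kbps >= min_bandwidth
            and not status.nat_bound)

def index_entries(peer_id, docs, status):
    """Return (term, (peer_id, doc_id, tf)) entries; doc_id is None for summaries."""
    if has_good_status(status):
        # Exhaustive indexing: one entry per term and document.
        return [(term, (peer_id, doc_id, tf))
                for doc_id, terms in docs.items()
                for term, tf in Counter(terms).items()]
    # Quick indexing: one summary entry per term with tf aggregated per peer.
    totals = Counter()
    for terms in docs.values():
        totals.update(terms)
    return [(term, (peer_id, None, tf)) for term, tf in totals.items()]

docs = {"d1": ["apple", "apple", "chips"], "d2": ["apple"]}
exhaustive = index_entries("p1", docs, PeerStatus(120.0, 1000.0, False))
quick = index_entries("p1", docs, PeerStatus(0.5, 1000.0, False))
print(len(exhaustive), len(quick))
```

The summary variant publishes fewer entries, which fits the stated goal of limiting communication overhead for peers that may leave again within minutes.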
9
Outlook on current implementation:
Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
Indexing: based on Terrier (centralized approach for text documents, University of Glasgow)
P2P-complex: based on Kademlia/Chord; collects peer characteristics; adapted to special requirements of XML IR
Ranking: extension of the vector space model, BM25F-based weighting
10
A Distributed Indexing Strategy for Efficient XML Retrieval
1. Introduction
2. Architecture for XML IR in P2P
3. Indexing techniques
4. Outlook on current implementation
5. Questions and discussion