1
Routing of Structured Queries in Large-Scale Distributed Systems
Workshop on Large-Scale Distributed Systems for Information Retrieval (LSDS_IR'08) @ ACM 17th CIKM 2008, Napa Valley, California, USA, Oct 2008.
Judith Winter
Institute for Informatics / Telematics Group
Goethe-University / Frankfurt am Main, Germany
2
Judith Winter: Routing of Structured Queries in Large-Scale Distr. Systems
Overview
1. Introduction
2. Concept & Architecture
3. Routing
4. Evaluation
5. Questions and Discussion
1. Introduction
3
XML Information Retrieval in P2P systems
- Investigate the impact of using structural information when retrieving XML documents in a P2P network
- Challenge: not all information is accessible / scalability issues
- Proposed research: How to perform and improve query routing in a large-scale P2P system by using structural information?
1. Introduction 2. Concept 3. Routing 4. Evaluation
4
XML Information Retrieval in Peer-to-Peer Systems:
Figure: three areas labeled XML-Retrieval, Information Retrieval, and Peer-to-Peer, annotated with: structured documents; more precise search; based on c/s architectures; distributed autonomous peers; growing amount of XML documents; vague queries; relevance ranking.
Challenges:
- no central index
- only selected information available
- bandwidth consumption / communication overhead
- efficiency vs. effectiveness
5
2. Concept & Architecture
6
Concept for a P2P search engine:
- Queries: content-and-structure (CAS)
- Indexing: include structure
- Hybrid indexing: globally or locally (distributing summaries), depending on peer status; index with posting lists (document level) and peer lists (peer level)
- Distributing global information into the DHT
- Ranking: extended vector space model (using structure)
- Results / retrieval units: document or element retrieval
7
Concept for a P2P search engine:
- Routing: use peer lists and posting lists
- Use of pre-computed posting lists for popular term combinations: highly discriminative keys (HDKs)
- Use of pruned posting lists by considering structural information
- Ordering of posting lists by a query-independent score (evidence from document, element, collection, and peer level)
- Select the top k results according to a pre-ranking based on the structural similarity between CAS query and posting key
8
Figure: architecture of the search engine, with the areas INFORMATION RETRIEVAL and PEER-TO-PEER on top of the P2P network. Components named in the figure: GUI (Indexing; Querying & result presentation), Application, Index storage component, Inverted Index, Statistics Index, Document index, Retrieval unit index, Frequent XTerm index, HDK index, Retrieval component, Ranking component, Querying component, Routing component, Similarity calculator, Weighting calculator, Source selector, PeerMetrics calculator, P2P component, SpirixDHT, SimulationDHT, Chord, DL, Local documents.
9
3. Routing
10
Example: q = {apple, \book}
Peers: P0, P1, P2, P3, P4, P5, P6, P7
1. Peer P0 looks for books about apples
2. Id i0 = hash(apple, \book) = hash(apple) is calculated
3. Peer P5, assigned to i0, is located in log(n) hops
4. Query q is sent to P5
5. P5 selects the top k=2 postings for q; these relate to dok1 and dok2
6. Ids i1 = hash(dok1) and i2 = hash(dok2) are calculated, and their peers are located
7. q is sent to P2 and P6, assigned to i1 and i2
8. P2 and P6 calculate relevance for dok1 and dok2 plus their retrieval units (RUs)
9. P2 and P6 send back results to P0
Index entries at P5 (assigned to hash(apple)):
apple, \book: dok1(4.8), dok2(4.1), dok3(3.7)…
apple, \novel: dok2(12.9)
apple, \article\p\sec: ----
Document vectors: Dok1 = (0,1,5,1,3,…), Dok2 = (1,4,0,0,3,…)
Partial results: {(dok2,12.4), (dok2/chap,11.2)} and {(dok1/sec,5.4)}
Merged result at P0: 1. (dok2,12.4) 2. (dok2/chap,11.2) 3. (dok1/sec,5.4)
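The routing steps of this example can be sketched in a few lines of Python. This is a minimal illustration, not the SPIRIX implementation: the function names (`ring_id`, `responsible_peer`, `top_k`, `merge_results`), the SHA-1 hash, the 16-bit identifier ring, and the successor rule for assigning keys to peers are all assumptions chosen for the sketch; only the posting scores and result values are taken from the slide.

```python
import hashlib
from bisect import bisect_left


def ring_id(key, bits=16):
    """Map a string key to an identifier on a ring of size 2**bits (assumed hash)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


def responsible_peer(key_id, peer_ids):
    """Return the first peer id >= key_id, wrapping around the ring (successor rule)."""
    ordered = sorted(peer_ids)
    pos = bisect_left(ordered, key_id)
    return ordered[pos % len(ordered)]


def top_k(postings, k):
    """Select the documents of the k highest-scored postings (step 5)."""
    ranked = sorted(postings, key=lambda p: p[1], reverse=True)
    return [doc for doc, _ in ranked[:k]]


def merge_results(partial_results):
    """Merge the result lists returned by the document peers (step 9)."""
    merged = [hit for hits in partial_results for hit in hits]
    return sorted(merged, key=lambda h: h[1], reverse=True)


# Step 2: the key is hashed on the term only, as in hash(apple, \book) = hash(apple).
peer_ids = [ring_id("P%d" % i) for i in range(8)]  # hypothetical peer identifiers
index_peer = responsible_peer(ring_id("apple"), peer_ids)

# Step 5: posting list for (apple, \book) with the scores from the slide.
docs = top_k([("dok1", 4.8), ("dok2", 4.1), ("dok3", 3.7)], k=2)

# Steps 6-7: locate the peers that hold the selected documents.
doc_peers = {doc: responsible_peer(ring_id(doc), peer_ids) for doc in docs}

# Step 9: merge the partial results shown on the slide.
final = merge_results([[("dok1/sec", 5.4)], [("dok2", 12.4), ("dok2/chap", 11.2)]])
```

In a real DHT the lookup in `responsible_peer` would be resolved hop by hop over the overlay rather than over a locally known list of all peers; the sorted list only stands in for that lookup.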
11
Routing process (figure)
12
Weighting of postings (query-independent, at indexing):
- Entries are sorted by score_t(d_i); choose the k best entries for XTerm t
- The score considers document d_i, best retrieval unit ru_best, and peer p_i
- Weighting function w: BM25e-based
- PeerScore: high for peers with good collections regarding t and with good performance metrics
13
Selection of postings (query-dependent reordering):
Example: q = { (apple, \book\chapter), (chips, \section) }
Term   Structure          Postings                                      sim
apple  \book\chapter      dok1(12.8), dok2(12.4)                        1
apple  \article\p         dok2(25.3), dok3(12.7), dok4(10.7)            0
chips  \book\c1\section   dok4(18.4), dok2(3.1), dok1(2.3), dok3(1.5)   0.7
(sim = structural similarity between the structure in the query and the structure of the posting key)
Final posting list = { dok2 (12.4*1 + 3.1*0.7 = 14.6), dok1 (12.8*1 + 2.3*0.7 = 14.4), dok4 (18.4*0.7 = 12.9), dok3 (1.5*0.7 = 1.1) }
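The reordering in this example amounts to summing, per document, the posting scores weighted by the structural similarity of the matching key. The following sketch reproduces that arithmetic. The names `INDEX`, `QUERY`, `SIM_TABLE`, `slide_sim`, and `rerank` are hypothetical, and the similarity values are simply looked up from the slide, since the similarity function itself is not given there.

```python
from collections import defaultdict

# Posting lists from the slide: term -> structure -> [(document, score), ...]
INDEX = {
    "apple": {
        r"\book\chapter": [("dok1", 12.8), ("dok2", 12.4)],
        r"\article\p": [("dok2", 25.3), ("dok3", 12.7), ("dok4", 10.7)],
    },
    "chips": {
        r"\book\c1\section": [("dok4", 18.4), ("dok2", 3.1), ("dok1", 2.3), ("dok3", 1.5)],
    },
}

# CAS query from the slide: (term, structure) pairs.
QUERY = [("apple", r"\book\chapter"), ("chips", r"\section")]

# Similarity values as stated on the slide (assumed lookup table, not a formula).
SIM_TABLE = {
    (r"\book\chapter", r"\book\chapter"): 1.0,
    (r"\book\chapter", r"\article\p"): 0.0,
    (r"\section", r"\book\c1\section"): 0.7,
}


def slide_sim(query_structure, key_structure):
    """Look up the structural similarity; unknown pairs count as dissimilar."""
    return SIM_TABLE.get((query_structure, key_structure), 0.0)


def rerank(query, index, sim):
    """Sum sim-weighted posting scores per document and sort in descending order."""
    totals = defaultdict(float)
    for term, query_structure in query:
        for key_structure, postings in index.get(term, {}).items():
            weight = sim(query_structure, key_structure)
            if weight == 0.0:
                continue  # postings under dissimilar structures are skipped
            for doc, score in postings:
                totals[doc] += weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rerank(QUERY, INDEX, slide_sim)
```

With these inputs, `ranked` lists the documents in the same order as the final posting list on the slide (dok2, dok1, dok4, dok3); the postings under \article\p contribute nothing because their similarity is 0.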
14
4. Evaluation
15
Implementation:
- Implementation of SPIRIX: Search Engine for P2P Information Retrieval in XML-Documents
- P2P-complex: based on OpenChord; collects peer characteristics; adapted to the special requirements of XML IR
- Preliminary evaluation with the INEX collection
16
Evaluation:
- Evaluation with the INEX collection of 2007: Wikipedia collection, 660,000 documents (4.6 GB)
- 80 CAS queries (out of 123 topics)
- Run on 1 peer with simulationDHT (measurement of #postings)
- Retrieval of the best 1500 results per query
- PL_max set to unlimited (all HDKs are single XTerms)
- Different structural similarity functions
- Simple version of the proposed formulas (document-based)
- Goal: show the effect of using structural hints for routing on
  - efficiency (#postings: 100, 500, 2000 postings)
  - effectiveness (precision at different recall levels)
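As an illustration of the effectiveness measure named above, the sketch below computes interpolated precision at chosen recall levels for one ranked result list. It is a generic, document-based textbook variant and an assumption on my part; the slides do not state which exact precision/recall definition was used, and the function name `interpolated_precision` is hypothetical.

```python
def interpolated_precision(ranked_docs, relevant_docs, recall_levels):
    """Return {recall level: highest precision reached at that recall or beyond}."""
    if not relevant_docs:
        raise ValueError("at least one relevant document is required")
    hits = 0
    points = []  # (recall, precision) measured after each rank position
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant_docs:
            hits += 1
        points.append((hits / len(relevant_docs), hits / rank))
    return {
        level: max((p for r, p in points if r >= level), default=0.0)
        for level in recall_levels
    }


# Toy example (made-up data): two relevant documents, found at ranks 1 and 3.
curve = interpolated_precision(
    ["d1", "d2", "d3", "d4"], {"d1", "d3"}, [0.0, 0.5, 1.0]
)
```

Averaging such curves over all queries gives one precision value per recall level, which can then be compared between runs that use different numbers of postings.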
19
Figure annotations: +7.2%, +8.7%, +5.5%
21
Conclusion:
- Propose to take advantage of XML structure when routing in highly distributed environments such as P2P systems
- Provide an infrastructure for investigating the proposed techniques, which perform routing based on evidence from document, element, collection, and peer level
- For 80 CAS topics of INEX 2007, efficiency and effectiveness could be improved
- Future work to verify the observed improvement:
  - evaluate formulas in full version
  - runs with multimedia topics of INEX 2007; INEX 2008
  - measure bandwidth consumption (incl. #messages, message sizes)
  - run on different peers; split collection
22
5. Questions and Discussion