1
LSDS-IR’08, October 30, 2008
Peer-to-Peer Similarity Search over Widely Distributed Document Collections
Christos Doulkeridis (1), Kjetil Nørvåg (2), Michalis Vazirgiannis (1)
(1) Department of Informatics, Athens University of Economics and Business, Greece
(2) Department of Computer Science, Norwegian University of Science and Technology, Norway
2
Motivation
Application
  – Digital libraries
Given a document (= query), retrieve similar documents, e.g. find papers similar to my research paper
Efficiently locate the subset of peers that store content similar to the query
Challenge
  – Similarity search over widely distributed high-dimensional data
Distributed Information Retrieval
3
Outline
Local peer pre-processing
  – Feature extraction
  – Local clustering
Semantic overlay network (SON) construction
  – Topological zone creation
  – Zone clustering
Super-peer organization of SONs
  – Searching
Experimental evaluation
Conclusions & future work
4
Feature Extraction and Local Document Clustering
Peers store documents
Tokenization / stemming / stop-word removal
Each document is represented by a feature vector (top-k features)
  – Vector Space Model (VSM)
  – F_i = {(f_ij, w_ij)}
Cluster the feature vectors
Result:
  – a set of initial clusters per peer
Each cluster is represented by a feature vector
[Figure label: Peer’s initial clusters]
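The feature-extraction step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the function name `extract_features`, and the normalized term-frequency weighting are all assumptions (the slides do not say how the weights w_ij are computed), and stemming is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "over"}


def extract_features(text, k=3):
    """Return the top-k (feature, weight) pairs of one document.

    Weights are term frequencies normalized by the number of kept tokens.
    This weighting is an assumption; the slides do not specify one.
    Stemming is omitted in this sketch.
    """
    # Tokenize on runs of letters and drop stop words.
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]
    if not tokens:
        return []
    counts = Counter(tokens)
    total = len(tokens)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, count / total) for term, count in ranked[:k]]
```

Keeping only the top-k features bounds the size of every vector a peer later ships to its initiator, which is presumably why the slide stresses "top-k".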
5
Overlay Construction
Multi-phase distributed process
Starting point: unstructured P2P network
Recursive application of 3 steps, until global clusters (SONs) are created
6
Zone Creation
A certain percentage of peers becomes initiators
  – randomly distributed over the network
PROBE-based technique
Partial synchronization
In case of excessive zone sizes
  – zone partitioning
Finally:
Each initiator
  – knows the peer ids in its zone
  – knows neighboring initiators
Each peer knows its initiator
[Figure labels: Initiators, Initiator]
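One possible reading of the PROBE-based zone creation is that every initiator floods probes and each peer joins the initiator whose probe reaches it first. The sketch below simulates that reading with a multi-source breadth-first search. The function name `create_zones`, the adjacency-dict representation, and the tie-breaking order are assumptions; random initiator selection, partial synchronization, and zone partitioning are not modeled.

```python
from collections import deque


def create_zones(adj, initiators):
    """Assign every reachable peer to the initiator whose probe arrives first.

    adj        -- {peer: iterable of neighbor peers} (hypothetical representation)
    initiators -- peers already chosen as initiators (selection is not modeled)
    Returns {peer: initiator}.
    """
    zone_of = {i: i for i in initiators}
    # All initiators start probing at the same time (multi-source BFS).
    queue = deque(sorted(initiators))
    while queue:
        peer = queue.popleft()
        for neighbor in sorted(adj.get(peer, ())):
            if neighbor not in zone_of:
                # First probe wins; later probes are ignored.
                zone_of[neighbor] = zone_of[peer]
                queue.append(neighbor)
    return zone_of
```

Inverting the returned mapping gives each initiator the list of peer ids in its zone, which matches the end state listed on the slide.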
7
Zone Clustering
Initiators
  – collect feature vectors from peers
  – perform intra-zone hierarchical clustering
  – pick cluster representatives
Cluster description
  – CD_i = (C_i, F_i, {P}, R)
Remaining challenge
  – How to bring together similar (remote) clusters?
[Figure label: similar remote clusters]
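As a rough sketch of what an initiator might do, the code below runs a simple agglomerative (bottom-up hierarchical) clustering over the collected vectors and emits one cluster description per resulting cluster. Several things here are assumptions rather than the authors' method: the interpretation of the fields of CD_i given in the comments, the stopping threshold `min_sim`, merging by summing vectors, picking the smallest peer id as representative, and the caller-supplied `similarity` function.

```python
from dataclasses import dataclass


@dataclass
class ClusterDescription:
    cluster_id: int      # C_i (assumed: cluster identifier)
    features: dict       # F_i (cluster feature vector, {feature: weight})
    peers: frozenset     # {P} (assumed: peers contributing to the cluster)
    representative: str  # R   (assumed: peer chosen as cluster representative)


def cluster_zone(peer_vectors, similarity, min_sim=0.5):
    """Agglomerative clustering of (peer_id, vector) pairs inside one zone.

    Repeatedly merges the most similar pair of clusters until no pair
    reaches min_sim (a hypothetical stopping rule).
    """
    clusters = [(dict(vec), frozenset([pid])) for pid, vec in peer_vectors]
    while len(clusters) > 1:
        pairs = [
            (similarity(clusters[i][0], clusters[j][0]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        best_sim, i, j = max(pairs, key=lambda p: p[0])
        if best_sim < min_sim:
            break
        # Merge by summing feature weights (one simple choice among many).
        merged_vec = dict(clusters[i][0])
        for feature, weight in clusters[j][0].items():
            merged_vec[feature] = merged_vec.get(feature, 0.0) + weight
        merged = (merged_vec, clusters[i][1] | clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return [
        ClusterDescription(cid, vec, peers, min(peers))
        for cid, (vec, peers) in enumerate(clusters)
    ]
```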
8
Inter-zone Clustering
[Figure labels: Level 1, Level 2, Level 3, Level 4]
Advantages:
1) Very large networks
2) Efficient
3) Small individual load
9
SON Merging
Create d links among the least-connected peers in merged SONs
[Figure labels: SON 1, SON 2, Super-Peer; shown for d=3]
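The slide does not spell out how the d links are chosen. One plausible reading, sketched below, picks the d peers with the fewest neighbors in each of the two SONs and links them pairwise. The function name `merge_sons`, the adjacency-dict representation, and the pairing rule are assumptions.

```python
def merge_sons(adj, son_a, son_b, d=3):
    """Link two SONs through their least-connected peers.

    adj          -- {peer: set of neighbor peers}, modified in place
    son_a, son_b -- peer ids belonging to each SON
    Returns the list of (peer_a, peer_b) links that were created.
    """
    def least_connected(members):
        # Lowest degree first; peer id breaks ties deterministically.
        return sorted(members, key=lambda p: (len(adj[p]), p))[:d]

    # Degrees are evaluated before any new link is added.
    links = list(zip(least_connected(son_a), least_connected(son_b)))
    for peer_a, peer_b in links:
        adj[peer_a].add(peer_b)
        adj[peer_b].add(peer_a)
    return links
```

Attaching the new links to poorly connected peers spreads the extra routing load away from peers that already have many neighbors.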
10
Searching
Inter-SON routing
Intra-SON routing
Naïve solution: flooding
[Figure label: Q]
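The naïve intra-SON step, flooding, can be illustrated with a breadth-first traversal restricted to the members of one SON. This is only a sketch: the function name `flood` and the graph representation are assumptions, there is no TTL, and inter-SON routing is not modeled.

```python
from collections import deque


def flood(adj, start, members):
    """Flood a query from `start` to every reachable peer inside one SON.

    adj     -- {peer: iterable of neighbor peers} (hypothetical representation)
    members -- peer ids that belong to the SON; other peers are skipped
    Returns the peers in the order they are contacted.
    """
    members = set(members)
    seen = {start}
    order = [start]
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor in sorted(adj.get(peer, ())):
            if neighbor in members and neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order
```

The length of the returned list corresponds to the number of peers contacted inside the SON.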
11
Adaptive Clustering
After global SON creation
  – Broadcast final cluster descriptions to all peers
  – Use zone hierarchy for efficient broadcasting
Each peer can then
  – Reassign its documents to clusters
  – Join the appropriate SONs
Similar to a feedback mechanism
Advantages
  – see experimental results
[Figure label: Final organization]
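The reassignment step at each peer might look like the sketch below: every local document is assigned to the global cluster whose broadcast feature vector it resembles most, and the peer then joins the SONs of the clusters it ended up with. The function names, the dict-based inputs, and the caller-supplied `similarity` function are assumptions for illustration.

```python
def reassign(documents, cluster_vectors, similarity):
    """Map each local document to its most similar global cluster.

    documents       -- {doc_id: feature vector}
    cluster_vectors -- {cluster_id: feature vector}, as received via broadcast
    similarity      -- function(vector, vector) -> float, supplied by the caller
    Returns {doc_id: cluster_id}. Ties go to the smallest cluster id.
    """
    assignment = {}
    for doc_id, vec in documents.items():
        assignment[doc_id] = max(
            sorted(cluster_vectors),
            key=lambda cid: similarity(vec, cluster_vectors[cid]),
        )
    return assignment


def sons_to_join(assignment):
    """A peer joins one SON per cluster that holds at least one of its documents."""
    return set(assignment.values())
```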
12
Experimental Setup
GT-ITM topology generator (1K, 5K peers)
TREC.GOV2 (1M docs), Reuters (810K docs)
Random querying peer
Query: “Given doc X, find the top-k similar docs to X”
Cosine similarity
Similarity threshold T_s, to determine which docs match the query
Metrics
  – Recall
  – Recall@k
  – Precision@k
  – #Contacted peers
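To make the setup concrete, the sketch below computes cosine similarity over sparse vectors, selects the documents that match a query under a threshold T_s, and evaluates recall and precision@k. The metric definitions are the common textbook ones and are assumed here, since the slides do not define them; the function names and the convention of returning 0.0 for an empty relevant set are also assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matching_docs(query, docs, t_s=0.2):
    """Ids of documents whose cosine similarity to the query is at least T_s."""
    return {doc_id for doc_id, vec in docs.items() if cosine(query, vec) >= t_s}


def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved (0.0 if none are relevant)."""
    relevant = set(relevant)
    return len(set(retrieved) & relevant) / len(relevant) if relevant else 0.0


def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked results that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k
```

Here the set returned by `matching_docs` over the whole collection would play the role of the ground truth against which the documents found through the overlay are scored.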
13
Clustering Statistics
Adaptive clustering
  – Decreases the average pair-wise similarity of clusters
  – Increases the average pair-wise similarity of documents within a cluster (not shown here)
14
Search Evaluation
Recall
  – T_s = 0.2
  – Also tried T_s = 0.1
#Contacted Peers
15
Search Evaluation - GOV2/P5000
16
SON-based versus Plain Super-peer
17
Conclusions
We presented a novel approach for P2P similarity search
Peers self-organize into SONs, forming a super-peer network
We showed how a high-quality searching mechanism can be deployed
We presented experiments on 2 large document collections (GOV2 and Reuters) to evaluate our approach
Future work:
  – More efficient inter-SON routing
  – Semantic similarity search using query expansion
  – Use of other clustering algorithms to improve performance
18
Thank you for your attention!
More info:
http://www.db-net.aueb.gr/
http://www.idi.ntnu.no/grupper/db/