PCP2P: Probabilistic Clustering for P2P networks. 32nd European Conference on Information Retrieval, 28th-31st March 2010, Milton Keynes, UK. Odysseas.

Presentation transcript:

PCP2P: Probabilistic Clustering for P2P networks
32nd European Conference on Information Retrieval, 28th-31st March 2010, Milton Keynes, UK
Odysseas Papapetrou*, Wolf Siberski*, Norbert Fuhr#
* L3S Research Center, University of Hannover, Germany
# Universität Duisburg-Essen, Germany

PCP2P: Probabilistic Clustering for P2P Networks, ECIR

Introduction

Why text clustering?
- Find related documents
- Browse documents by topic
- Extract summaries
- Build keyword clouds
- …

Why text clustering in P2P?
- An efficient and effective method for IR in P2P
- New application area: social networking (find peers with related interests)
- When files are distributed, it is too expensive to collect them at a central server

Preliminaries: Distributed Hash Tables (DHTs)
- Functionality of a hash table: put(key, value) and get(key)
- Peers are organized in a ring structure
- DHT lookup: O(log n) messages
- Example: get(key) → hash(key) → 47
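The put/get interface on a ring of peers can be illustrated with a small single-process sketch. It is not a real DHT: there is no networking and no O(log n) routing, and the names `ToyDHT`, `ring_hash`, and `responsible_peer` are hypothetical, introduced only for this illustration. It assumes that a key is stored at the first peer clockwise from hash(key).

```python
import bisect
import hashlib


def ring_hash(key: str, bits: int = 16) -> int:
    """Map a string onto the identifier ring [0, 2**bits)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """Single-process stand-in for a ring-structured DHT (illustrative only)."""

    def __init__(self, peer_names):
        # Each peer gets a position on the ring; peers are kept sorted by id.
        self.ring = sorted((ring_hash(name), name) for name in peer_names)
        self.ids = [peer_id for peer_id, _ in self.ring]
        self.storage = {name: {} for name in peer_names}

    def responsible_peer(self, key: str) -> str:
        # Assumption: the first peer clockwise from hash(key) owns the key.
        idx = bisect.bisect_left(self.ids, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def put(self, key: str, value) -> None:
        self.storage[self.responsible_peer(key)].setdefault(key, []).append(value)

    def get(self, key: str):
        return self.storage[self.responsible_peer(key)].get(key, [])


dht = ToyDHT([f"peer{i}" for i in range(8)])
dht.put("politics", "summary(cluster1)")
dht.put("politics", "summary(cluster7)")
```

A real DHT would route each put or get through O(log n) peers; here the responsible peer is found directly with a binary search over the sorted ring.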

Preliminaries: K-Means
- Create k random clusters
- Compare each document to all cluster vectors/centroids
- Assign the document to the cluster with the highest similarity, e.g., cosine similarity

    allClusters ← initializeRandomClusters(k)
    repeat
        for document d in my documents do
            for Cluster c in allClusters do
                sim ← cosineSimilarity(d, c)
            end for
            assign(d, cluster with max sim)
        end for
    until cluster centroids converge
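The pseudocode can be made concrete with a minimal centralized sketch over sparse term-frequency vectors. Two choices here are assumptions, not taken from the slides: the centroids are seeded with k randomly chosen documents, and convergence is detected when no document changes cluster (which implies the centroids no longer move).

```python
import math
import random


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(docs: list) -> dict:
    """Average the term frequencies of a non-empty list of documents."""
    total = {}
    for doc in docs:
        for term, weight in doc.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(docs) for term, weight in total.items()}


def k_means(docs: list, k: int, max_iters: int = 20, seed: int = 0) -> list:
    """Return a cluster index for every document."""
    rng = random.Random(seed)
    # Assumption: seed the centroids with k randomly chosen documents.
    centroids = [dict(d) for d in rng.sample(docs, k)]
    assignment = [-1] * len(docs)
    for _ in range(max_iters):
        # Compare each document with all centroids and keep the most similar.
        new_assignment = [
            max(range(k), key=lambda c: cosine(doc, centroids[c])) for doc in docs
        ]
        if new_assignment == assignment:  # converged: no document moved
            break
        assignment = new_assignment
        for c in range(k):
            members = [d for d, a in zip(docs, assignment) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = centroid(members)
    return assignment
```

Every document is compared with all k centroids in each iteration, which is the cost that the distributed variants in the following slides try to reduce.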

PCP2P: An unoptimized distributed K-Means
- Assign maintenance of each cluster to one peer: the cluster holders
- Peer P wants to cluster its document d:
  - Send d to all cluster holders
  - Cluster holders compute cosine(d, c)
  - P assigns d to the cluster with the maximum cosine and notifies the cluster holder

Problem
- Each document is sent to all cluster holders
- Network cost: O(|docs| × k)
- Cluster holders get overloaded
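A back-of-the-envelope calculation shows why the O(|docs| × k) cost matters. The first function counts one message per document and cluster holder, as described above. The second is a hypothetical cost model for any scheme that replaces the broadcast with a few DHT lookups followed by a few targeted comparisons; its parameters and the log2 hop count are assumptions for illustration, not figures from the paper.

```python
import math


def dkmeans_messages(num_docs: int, k: int) -> int:
    """Unoptimized distributed K-Means: every document goes to all k holders."""
    return num_docs * k


def prefiltered_messages(num_docs: int, lookup_terms: int,
                         candidates_per_doc: int, num_peers: int) -> int:
    """Rough estimate: a few DHT lookups plus a few comparisons per document."""
    # Assumption: each DHT lookup costs about log2(num_peers) messages.
    hops_per_lookup = max(1, math.ceil(math.log2(num_peers)))
    return num_docs * (lookup_terms * hops_per_lookup + candidates_per_doc)


baseline = dkmeans_messages(10_000, 100)
filtered = prefiltered_messages(10_000, lookup_terms=3,
                                candidates_per_doc=5, num_peers=1024)
```

With these made-up parameters the broadcast needs 1,000,000 messages per iteration, while the lookup-based model needs 350,000, and the gap widens as k grows.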

PCP2P: Approximation to reduce the network cost…
- Compare each document only with the most promising clusters
- Observation: a cluster and a document about the same topic will share some of the most frequent topic terms, e.g., topic "Economy": crisis, shares, financial, market, …
- Use these most frequent terms as rendezvous terms between the documents and the clusters of each topic

PCP2P: Approximation to reduce the network cost…
- Cluster inverted index: frequent cluster terms → summaries
- Cluster summary

E.g., centroid for Cluster 1:

    Term      Frequency
    politics  157
    merkel    149
    obama     121
    sarkozy   110
    world     98
    ...

- Add to "politics": summary(cluster1)
- Add to "merkel": summary(cluster1)

PCP2P: Approximation to reduce the network cost…
- Cluster inverted index: frequent cluster terms → summaries

Centroid for Cluster 2:

    Term      Frequency
    chicken   138
    cream     132
    rizzotto  130
    pasta     109
    pizza

- Add to "chicken": summary(cluster2)
- Add to "cream": summary(cluster2)
- Add to "rizzotto": summary(cluster2)
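Publishing cluster summaries under their most frequent terms can be sketched as follows. A plain dictionary stands in for put(term, summary) on the DHT, and the summary format (top terms with their frequencies) is an assumption, since the slide does not spell out what a summary contains. The function names are hypothetical.

```python
from collections import defaultdict


def top_terms(term_freqs: dict, n: int) -> list:
    """Return the n most frequent terms, most frequent first."""
    return sorted(term_freqs, key=term_freqs.get, reverse=True)[:n]


def publish_cluster(index: dict, cluster_id: str, centroid: dict, n: int) -> None:
    """Register a summary of the cluster under each of its top-n terms."""
    terms = top_terms(centroid, n)
    # Assumption: the summary holds the top terms with their frequencies.
    summary = {"cluster": cluster_id, "top_terms": {t: centroid[t] for t in terms}}
    for term in terms:
        index[term].append(summary)


index = defaultdict(list)  # stands in for put(term, summary) on the DHT
publish_cluster(index, "cluster1",
                {"politics": 157, "merkel": 149, "obama": 121,
                 "sarkozy": 110, "world": 98}, n=2)
publish_cluster(index, "cluster2",
                {"chicken": 138, "cream": 132, "rizzotto": 130, "pasta": 109}, n=3)
```

After these two calls the index holds entries for "politics" and "merkel" (cluster 1) and for "chicken", "cream", and "rizzotto" (cluster 2), matching the examples on the slides.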

PCP2P: Approximation to reduce the network cost…
- Pre-filtering step: efficiently locate the most promising centroids using the DHT and the rendezvous terms
- Look up only the most frequent terms → candidate clusters
- Send d only to these clusters for comparison
- Assign d to the most similar cluster

New document:

    Term      Frequency
    politics  14
    germany   13
    merkel    11
    sarkozy   7
    france    6
    ...

DHT lookups:
- Which clusters published "politics"? cluster1: summary, cluster7: summary
- Which clusters published "germany"? cluster4: summary

Candidate clusters:
- cluster1 → Cos: 0.3
- cluster7 → Cos: 0.2
- cluster4 → Cos: 0.4
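The pre-filtering step can be sketched on the slide's example. The dictionary `index` stands in for the DHT, `candidate_clusters` is a hypothetical helper, and the pairing of the three cosine values with the three clusters follows the order in which they appear on the slide, which is an assumption. In the protocol itself the cluster holders compute the cosines.

```python
def candidate_clusters(index: dict, document: dict, lookup_terms: int) -> list:
    """Collect clusters that published a summary under the document's top terms."""
    top = sorted(document, key=document.get, reverse=True)[:lookup_terms]
    candidates = []
    for term in top:
        for cluster_id in index.get(term, []):  # one DHT get(term) per term
            if cluster_id not in candidates:
                candidates.append(cluster_id)
    return candidates


# Toy index mirroring the lookups on the slide.
index = {"politics": ["cluster1", "cluster7"], "germany": ["cluster4"]}
document = {"politics": 14, "germany": 13, "merkel": 11, "sarkozy": 7, "france": 6}
candidates = candidate_clusters(index, document, lookup_terms=2)

# Cosine values from the slide (cluster pairing assumed from their order).
cosines = {"cluster1": 0.3, "cluster7": 0.2, "cluster4": 0.4}
best = max(candidates, key=cosines.get)
```

Only the three candidate clusters are contacted, and the document ends up in cluster4, the candidate with the highest cosine.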

PCP2P: Approximation to reduce the network cost…
- Probabilistic guarantees in the paper:
  - The optimal cluster will be included in the candidate set with high probability
  - Desired correctness probability → number of top indexed terms per cluster, number of top lookup terms per document
  - The cost is the minimum that satisfies the desired correctness probability

PCP2P: How to reduce comparisons even further… (full comparison step filtering)
- Do not compare with all clusters in the candidate set
- Use the summaries collected from the DHT to estimate the cosine similarity for all clusters in the candidate set
- Use the estimations to filter out unpromising clusters
- Send d only to the remaining clusters
- Assign d to the cluster with the maximum cosine similarity

PCP2P: Full comparison step filtering…
- Estimate the cosine similarity ECos(d, c) for all c in the candidate set
- Send d to the cluster with the maximum ECos and obtain its actual cosine
- Remove all clusters whose ECos is lower than that actual cosine
- Repeat until the candidate set is empty
- Assign d to the best cluster

New document:

    Term      Frequency
    politics  14
    germany   13
    merkel    11
    sarkozy   7
    france    6
    ...

Candidate clusters:
- cluster1: ECos: 0.4
- cluster7: ECos: 0.2
- cluster4: ECos: 0.5

Actual cosines shown on the slide: Cos: 0.38, Cos: 0.37
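The filtering loop can be sketched as below. The ECos values come from the slide. The transcript does not say which cluster each of the two actual cosines belongs to, so pairing 0.38 with cluster4 and 0.37 with cluster1 is an assumption, and the value for cluster7 is invented for illustration. The function name is hypothetical.

```python
def filtered_assignment(estimates: dict, actual_cosine) -> tuple:
    """Return (best cluster, its cosine, clusters actually contacted)."""
    remaining = dict(estimates)
    contacted = []
    best_cluster, best_cos = None, -1.0
    while remaining:
        # Contact the cluster with the highest estimated cosine next.
        cluster = max(remaining, key=remaining.get)
        del remaining[cluster]
        cos = actual_cosine(cluster)
        contacted.append(cluster)
        if cos > best_cos:
            best_cluster, best_cos = cluster, cos
        # Drop every candidate whose estimate is below the cosine just computed.
        remaining = {c: e for c, e in remaining.items() if e >= cos}
    return best_cluster, best_cos, contacted


estimates = {"cluster1": 0.4, "cluster7": 0.2, "cluster4": 0.5}  # ECos from the slide
# Assumed pairing of the slide's two Cos values; cluster7's value is invented.
true_cos = {"cluster4": 0.38, "cluster1": 0.37, "cluster7": 0.15}
best, cos, contacted = filtered_assignment(estimates, true_cos.get)
```

Under these assumptions cluster4 and cluster1 are contacted and cluster7 is never sent the document, because its estimate of 0.2 falls below the first actual cosine. Pruning of this kind is only guaranteed to keep the best cluster when the estimate is an upper bound on the true cosine.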

PCP2P: Full comparison step filtering…
- Two filtering strategies
  - Conservative
    - Compute an upper bound for ECos → always correct
  - Zipf-based
    - Estimate ECos assuming that the cluster terms follow a Zipf distribution
    - Introduces a small number of errors
    - Clusters are filtered out more aggressively → further cost reduction
- Details and proofs are in the paper

Evaluation: objectives
- Clustering quality
  - Entropy and purity
  - Approximation quality (number of misclustered documents)
- Cost and scalability
  - Number of messages, transfer volume
  - Number of comparisons
- Control parameters
  - Number of peers, documents, clusters
  - Desired probabilistic guarantees
- Document collection:
  - Reuters ( documents)
  - Synthetic (up to 1 million), created using generative topic models

Baselines
- LSP2P: state of the art in P2P clustering, based on gossiping
- DKMeans: unoptimized distributed K-Means
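Entropy and purity can be computed from cluster assignments and reference class labels as in the sketch below. It uses the common textbook definitions (size-weighted class entropy per cluster, base-2 logarithm, and the fraction of documents in the majority class of their cluster); the exact variant and normalization used in the paper's evaluation are not stated in the slides, so treat these as assumptions.

```python
import math
from collections import Counter, defaultdict


def purity_and_entropy(cluster_labels: list, class_labels: list) -> tuple:
    """Return (purity, entropy) of a clustering against reference classes."""
    clusters = defaultdict(list)
    for cluster, cls in zip(cluster_labels, class_labels):
        clusters[cluster].append(cls)
    n = len(cluster_labels)
    purity = 0.0
    entropy = 0.0
    for members in clusters.values():
        counts = Counter(members)
        # Purity: share of documents that belong to their cluster's majority class.
        purity += max(counts.values()) / n
        # Entropy of the class distribution inside this cluster (base 2).
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        entropy += (len(members) / n) * h
    return purity, entropy


perfect = purity_and_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = purity_and_entropy([0, 0, 0, 0], ["a", "a", "b", "b"])
```

A clustering that matches the reference classes exactly has purity 1 and entropy 0, while a single cluster mixing two equally sized classes has purity 0.5 and entropy 1 bit.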

Evaluation: clustering quality

[Plots: entropy (lower is better); number of misclustered documents (lower is better)]

- Both the conservative and the Zipf-based strategy closely approximate K-Means
- Conservative is always better than Zipf-based
- The correctness probability is always satisfied
- High dimensionality + large networks → LSP2P is not suitable!

Evaluation: network cost

[Plots: one over correctness probability, one over network size]

- Both conservative and Zipf-based have substantially lower cost than DKMeans
- Zipf-based filters out the clusters more aggressively → more efficient than conservative
- The cost of PCP2P scales logarithmically with network size

Evaluation: network cost/scalability

More results in the paper:
- Quality
  - Independent of network and dataset size
  - Independent of the number of clusters
  - Independent of collection characteristics (Zipf exponent)
- Cost
  - Similar results for transfer volume and number of document-cluster comparisons
  - Cost reduction is even more substantial for a higher number of clusters
  - PCP2P cost decreases with the collection characteristic exponent (the Zipf exponent of the documents)
  - Load balancing does not affect scalability

Conclusions
- Efficient and scalable text clustering for P2P networks with probabilistic guarantees
  - Pre-filtering strategy: rendezvous points on frequent terms
  - Two full-comparison filtering strategies
    - Conservative filtering
    - Zipf-based filtering
- Outperforms the current state of the art in P2P clustering
- Approximates K-Means quality at a fraction of the cost
- Current work
  - Apply the core ideas of PCP2P to different clustering algorithms and to different application scenarios, e.g., more efficient centralized text clustering based on an inverted index

Thank you… Questions?

Load balancing

Load at cluster holders
- Maintaining the cluster centroids (computational)
- Computing cosine similarities (networking + computational)

To avoid overloading, delegate the comparison task:
- Helper cluster holders
  - Include their contact details in the summary
  - Each helper takes over some comparisons
- The number of helpers depends on the cluster size

Additional experiments

Experimental configuration
- Reuters dataset
- peers, 20% churn per iteration