Text-Based Content Search and Retrieval in ad hoc P2P Communities Francisco Matias Cuenca-Acuna Thu D. Nguyen


Motivation
- It is hard to find information in current P2P infrastructures
  - They are designed for name-based search
  - They don't have quality metrics
  - They don't rank results
  - Most are optimized to find popular content
- The current Internet search model has proven effective at locating data
  - Intuitive term-based query model
  - Quality metrics and ranking are critical factors in the success of Internet search engines
  - They help users quickly pinpoint relevant documents in a vast repository

Goals & challenges
- Empower P2P communities with search capabilities similar to Internet search engines
  - No central servers
  - Fault tolerance
- Cannot employ the current model used by Internet search engines
  - No central management and administration
  - Resources are fragmented
  - Peers' behaviors are uncontrolled

Summary of PlanetP
- Nodes maintain an index of their content
  - Represented as Bloom filters
- Indexes and directories are replicated everywhere
- Gossiping keeps peers synchronized
[Figure: two peers, each holding local files, XML snippets, a Bloom filter of keys [K1,..,Kn], and a local directory; the Bloom filters are exchanged by gossiping]
Local Directory
Nickname  Status   IP  Keys
Alice     Online   …   [K1,..,Kn]
Bob       Offline  …   [K1,..,Kn]
Charles   Online   …   [K1,..,Kn]
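A minimal sketch of the kind of Bloom filter a peer could use to summarize its terms. The slides do not give the filter size, the number of hash functions, or the hashing scheme, so `num_bits`, `num_hashes`, and the salted SHA-256 hashing below are assumptions chosen for illustration only.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter summarizing the terms a peer holds."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # num_bits and num_hashes are illustrative defaults, not PlanetP's.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, term):
        # Derive num_hashes bit positions by salting a SHA-256 hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{term}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, term):
        for pos in self._positions(term):
            self.bits |= 1 << pos

    def __contains__(self, term):
        # May report a false positive, never a false negative.
        return all((self.bits >> pos) & 1 for pos in self._positions(term))


alice = BloomFilter()
for word in ["gossip", "bloom", "filter"]:
    alice.add(word)
```

A membership test such as `"gossip" in alice` returns `True` for every inserted term. For a term that was never inserted the answer is usually `False`, but it can occasionally be a false positive, which is the price of the compact summary.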

Content search in PlanetP
[Figure: Diane's query is resolved in stages: local lookup in her local directory (nicknames Alice, Bob, Charles, Diane, Edward, Fred and Gary, each with keys [K1,..,Kn]), rank nodes, contact candidates, rank results, stop]

The Vector Space model
- Documents and queries are represented as k-dimensional vectors
  - Words are weighted according to their relevance to the document
  - Documents are weighted according to their words
- The angle between a query and a document indicates their similarity
[Figure: a document vector and a query vector]
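The angle-based similarity can be illustrated with the cosine of the angle between two term-weight vectors. This is a sketch of the standard cosine measure, not necessarily the exact scoring function used in PlanetP; the sparse dictionaries and the sample weights are made up for the example.

```python
import math


def cosine_similarity(query, document):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * document.get(t, 0.0) for t, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_q * norm_d)


# Hypothetical weights: doc_a points in the same direction as the query,
# doc_b shares only one of the two query terms.
query = {"p2p": 1.0, "search": 1.0}
doc_a = {"p2p": 2.0, "search": 2.0}
doc_b = {"p2p": 1.0, "gossip": 1.0}
```

Here `doc_a` scores higher than `doc_b` because its vector is parallel to the query vector, regardless of its length.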

Weight assignment (TFxIDF)
- Idea
  - Use per-document Term Frequency (TF) to weight words (W_D,t)
  - Use inverse global popularity (IDF) to find good discriminators among the query terms
- Intuition
  - TF indicates how related a document is to a particular concept
  - Inverse Document Frequency (IDF) identifies the words that are good discriminators between documents
- W_D,t = f(Frequency of t in D)
- IDF_t = f(No. documents / Frequency of t across documents)
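The slide leaves the function f unspecified. The sketch below assumes one common logarithmic choice, W_D,t = 1 + log(frequency) and IDF_t = log(1 + N / n_t), and it also assumes that "frequency of t across documents" means the number of documents containing t. Both are assumptions made for illustration; the sample collection is hypothetical.

```python
import math
from collections import Counter


def tf_weight(term, doc_terms):
    """W_D,t, assumed here to be 1 + log(frequency of t in D)."""
    freq = Counter(doc_terms)[term]
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def idf(term, collection):
    """IDF_t, assumed here to be log(1 + N / documents containing t)."""
    containing = sum(1 for doc in collection if term in doc)
    return math.log(1.0 + len(collection) / containing) if containing else 0.0


# Hypothetical three-document collection, already tokenized.
collection = [
    ["p2p", "search", "p2p"],
    ["search", "engine"],
    ["gossip", "search"],
]
```

With this collection, "search" appears in every document and gets a lower IDF than "p2p", which appears in only one, matching the intuition that rare terms discriminate better.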

Node & document ranking in PlanetP
- Unfortunately IDF is not suited for P2P
  - Requires an appearance count for every word in the community
- We introduce the use of the Inverse Peer Frequency
  - IPF_t = f(No. peers / Peers with documents containing t)
  - IPF can be computed with local information
  - IPF is compatible across the community
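A sketch of how IPF could be computed from a local copy of the directory and used to order peers. The slide does not specify f or the node-ranking formula, so the logarithmic form and the "sum of IPF over matching query terms" score are assumptions. Plain Python sets stand in for the per-peer Bloom filters, and the directory contents are hypothetical.

```python
import math


def ipf(term, directory):
    """IPF_t, assumed here to be log(1 + peers / peers whose summary has t)."""
    holders = sum(1 for terms in directory.values() if term in terms)
    return math.log(1.0 + len(directory) / holders) if holders else 0.0


def rank_nodes(query_terms, directory):
    """Order peers by the summed IPF of the query terms they may hold."""
    scores = {}
    for peer, terms in directory.items():
        score = sum(ipf(t, directory) for t in query_terms if t in terms)
        if score > 0:  # peers matching no query term are not candidates
            scores[peer] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical local directory: nickname -> term summary of that peer.
directory = {
    "Alice": {"p2p", "search"},
    "Bob": {"search"},
    "Charles": {"gossip"},
}
```

Calling `rank_nodes(["p2p", "search"], directory)` places Alice ahead of Bob and drops Charles. Everything needed for the computation is in the local directory, so no peer has to be contacted to rank the candidates.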

Stopping condition
- Intuitive idea: stop as soon as k documents are retrieved
  - Not good
  - A node might have a few highly ranked documents and many that have a low rank
- We propose an adaptive approach:
  - Contact nodes one by one and keep a list of the top k documents retrieved
  - Stop contacting candidates when p nodes in a row fail to contribute to the top k
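The adaptive rule can be sketched as a loop over the ranked candidates. The `fetch` callback, the `(score, doc_id)` result format, and the sample scores are hypothetical; only the stopping rule itself (stop after p consecutive nodes that do not change the top k) comes from the slide.

```python
def adaptive_search(ranked_nodes, fetch, k, p):
    """Contact nodes in rank order; stop after p consecutive non-contributors.

    fetch(node) is assumed to return a list of (score, doc_id) pairs.
    """
    top_k = []      # best k (score, doc_id) pairs seen so far
    misses = 0      # consecutive nodes that did not change top_k
    contacted = []
    for node in ranked_nodes:
        contacted.append(node)
        merged = sorted(top_k + fetch(node), reverse=True)[:k]
        if merged == top_k:
            misses += 1
            if misses >= p:
                break
        else:
            top_k = merged
            misses = 0
    return top_k, contacted


# Hypothetical per-node results, with nodes already ordered by rank.
results = {
    "n1": [(0.9, "D11"), (0.8, "D12")],
    "n2": [(0.85, "D21")],
    "n3": [(0.1, "D31")],
    "n4": [(0.95, "D41")],
}
top, contacted = adaptive_search(["n1", "n2", "n3", "n4"], results.get, k=2, p=1)
```

With k=2 and p=1 the search stops at n3, the first node that adds nothing to the top two, so n4 is never contacted. A larger p trades extra messages for a lower chance of missing a good document held by a low-ranked node.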

Evaluation method
- We use five well-known document collections
  - Each collection comes with a set of queries and relevance judgments
  - Here we present results for one (AP89)
- We measure recall and precision
[Table: Trace | Queries | Documents | Number of words | Collection size (MB); row: AP (values missing)]
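For reference, recall and precision against a set of relevance judgments can be computed as below. This is the standard definition of the two measures; the document identifiers in the example are hypothetical and are not taken from the AP89 collection.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against relevance judgments."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: three documents retrieved, four judged relevant.
precision, recall = precision_recall(
    ["D11", "D12", "D21"],
    ["D11", "D21", "D31", "D32"],
)
```

Two of the three retrieved documents are relevant, so precision is 2/3; two of the four relevant documents were found, so recall is 1/2.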

Evaluation method (cont.)
- We use a simulator to test our algorithm
  - Different file distributions
  - Against a central search engine
  - Quantifying the effect of not using an adaptive stopping condition

Results

Results cont.

More results
- Adjusting the stop condition according to the community size and the number of results expected
  - We provide a linear function to determine p
- Recall as the community grows to 1000 (scalability)
- Overlap between PlanetP's results and the ones obtained by using standard TFxIDF
  - 80% on average

Conclusions
- PlanetP matches TFxIDF's performance using the TFxIPF approximation
  - Gives P2P communities search capabilities as powerful as environments with centralized resources
  - TFxIPF is applicable beyond PlanetP
- PlanetP matches TFxIDF's performance regardless of how documents are distributed throughout the community
- Our stopping heuristic limits searches to a small subset of the community, yet allows enough peers to be contacted to guarantee good results

Related Work
- Tapestry, Pastry, Chord and CAN
  - Implement a distributed hash table for P2P environments
  - Oriented towards name-based searches (for FS)
  - They already store all the information needed to implement TFxIPF
- CORI and GlOSS
  - Address the problem of indexing and searching distributed collections of documents
  - They build a centralized index that has total knowledge of word usage, so they don't contact unnecessary nodes

Questions?

Example
- Assume k=2 and p=1
- Documents with a tick have been judged relevant
- Documents with a cross are related but not rele
[Figure: three groups of documents, each marked with a tick or a cross]
  D11 D12 D13 D14
  D21 D22 D23 D24
  D31 D32 D33 D34
- Trivial stop: {D11, D12} {D21, D11}
- Adaptive stop: {D11, D12}