Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer.

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer Networks Jie Lu Jamie Callan School of Computer Science Carnegie Mellon University March 8, 2004

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Peer-to-Peer Networks Peer-to-peer computing –The sharing of computer resources and services (such as contents, processing cycles, and storage space for files) by direct exchange between systems Peer-to-peer architectures leaf node hub (supernode, superpeer, ultrapeer, directory node) pure p2p hierarchical p2p structured p2p

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Information Retrieval in Peer-to-Peer Networks What is it –The search and retrieval of documents satisfying user’s information requests in peer-to-peer networks What activities are involved –Issue information requests –Respond to the requests with documents (document retrieval) –Route requests to other nodes (query routing, resource selection) What architecture to use –Pure peer-to-peer architecture –Structured peer-to-peer architecture (DHT-based) –Hierarchical peer-to-peer architecture What search mechanism to use –Name-based retrieval (simple keyword matching) –Content-based retrieval

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Why Content-Based Retrieval Name-based retrieval only suffices known-item search –Digital objects such as music, video, and software have relatively obvious naming convections and descriptions –The user is familiar with the object being requested –Any copy is as good as any other Search across networks of digital libraries with more varied content requires content-based retrieval –Text documents usually don’t have certain naming conventions and it is often difficult to describe a document in a few words –The user usually doesn’t know whether there are any relevant documents with respect to the information request –There may be zero, one, or more relevant documents for an information request

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Why Hierarchical Peer-to-Peer Networks Centralization –Ensures relatively consistent coverage and speed –Suffers from a single point of failure –Has limited scalability Pure P2P (complete decentralization) –Robust –Less efficient, may overload the network Structured P2P –Most systems only support name-based retrieval –It is not straightforward to adopt more sophisticated retrieval models Hierarchical P2P –Improves efficiency and scalability without sacrificing robustness –The special dedicated hubs can provide more sophisticated services to improve query routing efficiency as well as retrieval accuracy –It is straightforward to adopt various retrieval algorithms

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Related Work Search in pure P2P networks –Gnutella 0.4: Flooding –PlanetP: Each node collects compact summaries of other nodes’ inverted indexes through gossiping (Cuenca-Acuna et al. 2002) –Each node learns a profile for each neighboring node by recording the past queries this node answered (Kalogeraki et al. 2002) –Each node learns and maintains a list of shortcuts to other nodes based on their previous behavior w.r.t. its requests (Sripanidkulchai et al. 2003) Search in structured P2P networks –CAN, Chord, Pastry, Tapestry: Use distributed hash table to locate the hub that holds the key most similar to the request –pSearch: Use the semantic vector (generated by Latent Semantic Indexing) of each document as the key to distribute document index

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Related Work (cont’d) Search in hierarchical P2P networks –Gnutella 0.6, KaZaA, Limewire: Hubs use name-based retrieval to search among the content hashes uploaded by connecting leaf nodes; hubs flood information request to neighboring hubs –JXTA: Hubs conduct XML retrieval against the descriptions registered by connecting leaf nodes; how hubs cooperate to route information request is yet to be defined Resource selection in federated search using a single directory service –CORI: Inference network model –gGlOSS: Vector space model –K-L divergence-based algorithms: Language modelingK-L divergence-based algorithms

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Our Contributions Applied content-based resource selection and document retrieval algorithms to hierarchical peer-to-peer networks and demonstrated their effectiveness in retrieval accuracy and query routing efficiency Provided a method for a hub to automatically learn the contents that other hubs cover without requiring their cooperations Explored a simple pruning technique on resource descriptions to reduce storage costs associated with content-based retrieval Developed a large-scale peer-to-peer testbed from TREC WT10g dataset with millions of documents, thousands of nodes, and tens of thousands of queries –Available at http://www.cs.cmu.edu/~callan/Data

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Retrieval in Hierarchical Peer-to-Peer Networks Two levels of retrieval –At hubs: resource selection –At leaf nodes: document retrieval Three types of retrieval methods –Name-based: check query terms against terms from document names for resource selection and document retrieval –Match-based: match query terms against collection vocabularies for resource selection –Content-based: use collection language models for resource selection and document language models for document retrieval Two selection rules –Query matching rule: how many query terms need to be matched for a query –Number of selected neighbors: query message is routed to how many neighboring nodes

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Resource Description Acquisition Resource description describes the contents that a leaf node contains or a hub covers, it is used by resource selection algorithms to select nodes for query routing Cooperation is required between hubs and leaf nodes –Leaf nodes provide full resource descriptions to hubs –Leaf nodes provide pruned resource descriptions to hubs Resource descriptions of hubs can be acquired cooperatively or uncooperatively –A hub can provide aggregation of its leaf nodes’ resource descriptions to neighboring hubs –A hub can learn automatically a content language model for each of its neighboring hubs –The content model of a hub is learned by recording the query terms of past queries that this node has responded to

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Testbed Contents –TREC WT10g was divided into 11,485 collections based on document URLs –2,500 collections were randomly selected, each of which defined a leaf node in a hierarchical P2P network Topology –Focus on hubs that cover specific types of content –Leaf nodes were clustered into 25 clusters, each of which defined the contents of a single hub (i.e. which leaf nodes were connected to this hub) –The connections between hubs were generated randomly (minimum 1, maximum 7, average 4 neighbors) Queries –Key terms were extracted from each document to serve as a query –15,000 queries were used in our experiments

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Sample Queries of Different Lengths LengthTerms 1sdtech 2malignant hyperthermia 3cardiac surgery; anesthesia 4trade remedy; nafta law 5drug drive collision police investigate 6quarter company revenue increase sybase cash

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Evaluation Methodology Retrieval accuracy –Expensive to obtain relevance judgments for large amounts of automatically-generated queries –The 50 top-ranked documents retrieved from a single large collection (union of all leaf nodes’ contents) were used as baseline to measure how well the P2P network could reproduce this baseline –Measure the accuracy by set-based Recall, Precision and F-score Query routing efficiency –Measure the efficiency by the average number of query messages routed for each query in the network

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Experimental Results Resource Selection and Document Retrieval Algorithms PrecisionRecallF-Score Avg Num Qry Msgs Per Qry NBRS+NBDR (100% qry terms)70.97%4.98%0.093171 NBRS+NBDR (≥50% qry terms)21.88%28.85%0.2489479 MBLNS+CBDR62.12%33.76%0.4375540 Random-MBLNS+CBDR60.94%19.59%0.2965122 CBLNS-F+CBDR71.82%29.65%0.4197116 CBLNS-P+CBDR72.21%29.42%0.4181114 CBLNS-F+HSM+CBDR71.95%28.46%0.407985 CBLNS-P+HSM+CBDR72.42%28.48%0.408883 CBLNS-F+HSR-F+CBDR75.23%28.35%0.411846 CBLNS-F+HSR-P+CBDR74.59%28.13%0.408546 CBLNS-P+HSR-P+CBDR75.48%28.18%0.410446 retrieval accuracy qry routing efficiency name- based match- based content- based

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Conclusions Testbed created for use in our experiments is one of the largest so far for research on peer-to-peer systems –2,500 leaf nodes, 25 hubs, 15,000 queries, total number of documents 1,421,088 Content-based vs. name-based retrieval –Content-based retrieval is far more accurate and efficient than name- based retrieval and flooding methods Content-based vs. match-based resource selection –Content-based resource selection is more effective at restricting the query routing only to those nodes that are very likely to satisfy the user’s information need than match-based resource selection

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Conclusions (cont’d) Content-based hub selection vs. flooding –Content-based hub selection based on learned content models helps improving query routing efficiency without degrading retrieval accuracies –Content-based hub selection based on full/pruned aggregate resource descriptions can greatly improve query routing efficiency without hurting accuracies, but requires larger storage space compared with hub selection based on learned content models Pruned vs. full resource descriptions –Using pruned resource descriptions for content-based resource selection has comparable retrieval accuracies compared with using full resource descriptions, but requires around half the storage space for resource selection index

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer.

Similar presentations

Presentation on theme: "Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer.

Similar presentations

Presentation on theme: "Carnegie Mellon School of Computer Science Copyright © 2004, Carnegie Mellon. All Rights Reserved. Content-Based Retrieval in Hierarchical Peer-to-Peer."— Presentation transcript:

Similar presentations

About project

Feedback