A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval

Presentation transcript:

A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval
Yan Chen (Northwestern University), Zhichen Xu (Yahoo! Inc.), Chengxiang Zhai (University of Illinois at Urbana-Champaign)
ACM SIGIR HDIR 2005

Motivation
- Rapid information growth requires a scalable and robust retrieval architecture
- Problems with a centralized retrieval architecture
  – Hard to maintain freshness of information
  – Single point of failure
- Peer-to-peer (P2P) IR may be a possible solution
  – No need for centralized indexing
  – Easy to maintain freshness of information
  – Resistant to a single point of failure
- Challenge: what should a P2P IR architecture look like?

Term Index vs. Document Index
- Term index
  – Fast query execution
  – Insufficient for supporting sophisticated algorithms such as feedback
  – Hard to update (e.g., adding a document) in a distributed environment
- Document index
  – Easy to update
  – Supports advanced retrieval algorithms
  – Slow query matching

What is the Right Indexing Architecture for P2P IR?

Previous Work: pSearch [Tang et al. 03]
- Based on document indexing
- Addresses the problem of “slow query execution” by
  – Dimension reduction (using LSI)
  – Exploiting distributed hash tables (DHTs)
- Problems
  – Lack of semantic locality (semantically similar documents may be stored on quite different nodes)
  – Slow index generation
  – Hard to add a new concept

Proposed Solution: P2PIR, a Scalable Semantic Indexing Framework for IR
- Semantic locality
  – Achieved through a novel two-phase distributed semantic indexing
  – Documents with similar semantics have their indices stored on nearby nodes
- Flexible tradeoff between search accuracy and efficiency
- Support for sophisticated retrieval methods
  – E.g., feedback and personalized search
- Adaptation to document dynamics
  – Incrementally incorporates new documents/concepts

Background on a Sample DHT: Content-Addressable Network (CAN)
- Two key operations
  – put(key, object)
  – object = get(key)
- Partition a Cartesian space into zones
- Each zone is assigned to a computer
- Neighboring zones are routing neighbors
- Object lookup is done through routing
[Figure: a Cartesian space partitioned into zones labeled A, B, C, D, E]
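The put/get interface and the zone partitioning on this slide can be illustrated with a small sketch. It is not CAN itself: the class name `ToyCAN`, the fixed regular grid of zones, and the SHA-1 based key-to-point mapping are assumptions made for illustration. A real CAN splits zones dynamically as nodes join and forwards each request hop by hop through neighboring zones.

```python
import hashlib


class ToyCAN:
    """Toy CAN-style store (hypothetical): a unit cube cut into a regular grid of zones."""

    def __init__(self, dims=2, zones_per_dim=2):
        self.dims = dims                    # dimensionality of the Cartesian space (at most 5 here)
        self.zones_per_dim = zones_per_dim  # grid resolution along each axis
        self.zones = {}                     # zone coordinates -> {key: object} held by that zone's node

    def key_to_point(self, key):
        """Hash a string key to a point in [0, 1)^dims, using 4 digest bytes per axis."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return tuple(
            int.from_bytes(digest[4 * i:4 * i + 4], "big") / 2 ** 32
            for i in range(self.dims)
        )

    def zone_of(self, point):
        """Return the grid coordinates of the zone that contains the point."""
        return tuple(
            min(int(x * self.zones_per_dim), self.zones_per_dim - 1) for x in point
        )

    def neighbors(self, zone):
        """Zones that share a face with `zone`; these act as routing neighbors."""
        result = []
        for axis in range(self.dims):
            for delta in (-1, 1):
                coord = zone[axis] + delta
                if 0 <= coord < self.zones_per_dim:
                    result.append(zone[:axis] + (coord,) + zone[axis + 1:])
        return result

    def put(self, key, obj):
        """Store obj in the zone that owns the key's point; return that zone."""
        zone = self.zone_of(self.key_to_point(key))
        self.zones.setdefault(zone, {})[key] = obj
        return zone

    def get(self, key):
        """Look the key up in the zone that owns its point; None if absent."""
        zone = self.zone_of(self.key_to_point(key))
        return self.zones.get(zone, {}).get(key)
```

For example, `ToyCAN().put("doc-1", "index entry")` followed by `get("doc-1")` returns the stored object, because both calls hash the key to the same point and therefore to the same zone.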

Routing and Location Properties on DHT
- O(log N) hops are needed to route a key, where N is the number of nodes in the overlay (this bound holds for DHTs such as Chord and Pastry; a d-dimensional CAN needs O(d·N^(1/d)) hops)
- O(log N) maintenance overhead for routing
- Guaranteed success
- Fault-tolerant and robust
- Resilient to DoS attacks
- Becoming increasingly practical for serious use

P2PIR Architecture
- Two-stage document indexing
  – Concept vector generation
  – Index locator generation and placement
- Open for plugging in “feature extraction”, “relevance ranking” and “query refinement”
[Figure: P2PIR architecture. Indexing path: XML doc / text doc, feature extraction, structure-aware feature vector / term vector, concept vector construction, index locator construction, index placement on DHT. Query path: query, index locator generation, search on DHT, relevance ranking, results to user, query refinement. Layers: Applications, P2PIR, P2P DHT system, Internet.]

Assumptions about the Retrieval Models
- Documents and queries are both represented as vectors
  – Naturally occurring in the vector-space model
  – Probabilistic models can be computed as vector matching as well
- Euclidean distances are reasonably accurate in capturing document topic similarity
  – Euclidean distances are only used to prune unpromising documents
  – Final relevance ranking can be based on more accurate retrieval functions

Concept Vector Construction
- Group documents into k clusters based on their feature vectors
- The centroid of each cluster corresponds to a concept
- Given a document d, the similarity between its feature vector and a concept c (e.g., the cosine value between them) defines the weight of d on concept c
- The concept vector of d is composed of its weights on all the concepts
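A minimal sketch of the concept-vector step described on this slide, assuming cosine similarity as the weight (the slide offers it as an example) and plain Python lists as vectors. The function names `cosine` and `concept_vector` are illustrative, not taken from the presentation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_vector(feature_vector, centroids):
    """Weight of a document on each concept, one entry per cluster centroid."""
    return [cosine(feature_vector, centroid) for centroid in centroids]
```

With centroids `[[1, 0], [0, 1]]`, a document with feature vector `[1, 0]` gets the concept vector `[1.0, 0.0]`: it lies entirely on the first concept.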

Two-Stage Semantic Indexing
- Stage 1: Fast dimension reduction
  – Document clustering to identify n*d clusters (d = DHT dimension)
  – Represent each document with a vector in this n*d-dimensional space
- Stage 2: Semantic index locator construction
  – Further partition the n*d clusters into n equal-size, semantically coherent groups, each of size d
  – Each group forms an index locator (key for searching the DHT)

Fast Dimension Reduction
- Regular k-means clustering
  – Randomly start with k centroids
  – Iteratively re-assign documents to clusters and re-compute the centroids
  – Can stop at any time to obtain rough clusters
- Modification
  – Start with k relatively different centroids
- Complexity at each iteration: O(kN), where N >> k is the number of documents
- Can be run on a sample of documents
- Vector(D) = (sim(D, C_1), …, sim(D, C_k))
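The clustering loop on this slide could look roughly like the sketch below. The slide does not say how the "relatively different" starting centroids are chosen; farthest-first selection is used here purely as an assumption, and the function names are illustrative. The `iterations` parameter reflects the point that the loop can be stopped early to obtain rough clusters.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def farthest_first(docs, k):
    """Pick k mutually distant documents as starting centroids (assumed heuristic)."""
    centroids = [docs[0]]
    while len(centroids) < k:
        # Choose the document whose nearest chosen centroid is farthest away.
        nxt = max(docs, key=lambda d: min(sq_dist(d, c) for c in centroids))
        centroids.append(nxt)
    return [list(c) for c in centroids]


def kmeans(docs, k, iterations=5):
    """Run a fixed number of k-means iterations; each costs O(k * N) distance computations."""
    centroids = farthest_first(docs, k)
    assignment = [0] * len(docs)
    for _ in range(iterations):
        # Re-assign every document to its closest centroid.
        assignment = [
            min(range(k), key=lambda j: sq_dist(d, centroids[j])) for d in docs
        ]
        # Re-compute each centroid as the mean of its members (empty clusters keep theirs).
        for j in range(k):
            members = [d for d, a in zip(docs, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Running the same function on a sample of the collection, as the slide suggests, only changes the `docs` argument; the resulting centroids can then be used to compute Vector(D) for every document.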

Index Locator Construction
- Motivation: the dimensionality of concept vectors (e.g., a few hundred) may be much larger than that of the DHT, so it is hard to place an index directly using the concept vector
- Basic idea: break the concept vector into multiple chunks with the same dimensionality as the DHT, where each chunk contains related concepts
- With such a division, each document has only a small number of chunks with non-negligible weights for indexing
- Such chunks are called index locators
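The chunking idea can be sketched as follows. This assumes the concepts have already been ordered so that related concepts sit next to each other (the grouping method itself is not specified on the slide), and it pads the last chunk with zeros when the concept count is not a multiple of d. The name `index_locators` is illustrative.

```python
def index_locators(concept_vec, d):
    """Split a concept vector into consecutive chunks of length d (the DHT dimensionality)."""
    # Zero-pad so that the length is a multiple of d; padding entries carry no weight.
    padded = list(concept_vec) + [0.0] * (-len(concept_vec) % d)
    return [padded[i:i + d] for i in range(0, len(padded), d)]
```

For a six-concept vector and a 3-dimensional DHT, `index_locators([1, 2, 3, 4, 5, 6], 3)` yields two locators, `[1, 2, 3]` and `[4, 5, 6]`.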

Index Placement on DHT
- For each index locator of a document d: if its norm (i.e., the length of the vector) exceeds a certain threshold, we put the index locator of d, along with d's feature vector, on the peer node whose DHT address vector best matches the index locator
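A rough sketch of this placement rule, under two assumptions that the slide leaves open: "matches best" is read as smallest Euclidean distance, and node addresses are given as a plain dictionary. The names `place_indices` and `node_addresses` are hypothetical.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def place_indices(locators, node_addresses, threshold):
    """Return (chunk number, node name) pairs for every locator worth indexing.

    node_addresses maps a node name to its DHT address vector.
    """
    placements = []
    for chunk_no, locator in enumerate(locators):
        if norm(locator) <= threshold:
            continue  # negligible weight on this chunk: no index entry is stored
        # Assumed matching rule: the node with the nearest address vector wins.
        best = min(node_addresses, key=lambda n: euclid(node_addresses[n], locator))
        placements.append((chunk_no, best))
    return placements
```

In a full system, each chosen node would store the locator together with the document's feature vector, so that relevance ranking can later run locally on that node.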

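Illustration of Two-Stage Indexing
[Figure: documents D_1, D_2, …, D_N are mapped onto concepts C_1, C_2, …, C_M, and the concepts are grouped into semantic chunks 1 to k.]
- Doc D_i = (x_1, x_2, …, x_d, x_{d+1}, …, x_{2d}, …, x_M)
- Locator 1 = (x_1, x_2, …, x_d), Locator 2 = (x_{d+1}, …, x_{2d}), …
- In the DHT, each locator points to the original vector of its document: (x_1, x_2, …, x_d) → original vector(D), (x_1, x_2, …, x_d) → original vector(D’), …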

Querying
- Contact any node on the DHT
- Project the query vector to find related concepts, and form the index locators
- Use the index locators to route to the DHT nodes that hold the indices and feature vectors of related documents
- Use the original query vector and document vectors to perform relevance ranking
- This local retrieval process can expand to neighboring DHT nodes until enough relevant results have been identified
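The query procedure can be approximated by the sketch below. Routing is simulated by sorting nodes by the distance between their address and the query's index locator, which stands in for hop-by-hop DHT routing and neighbor expansion. Cosine similarity serves as the relevance function, and `min_score` and `wanted` are assumed parameters; none of these names come from the presentation.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def euclid(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search(query_vec, query_locator, nodes, wanted, min_score):
    """Rank documents node by node, expanding outward until enough hits are found.

    nodes: list of {"address": [...], "docs": {doc_id: feature_vector}}.
    """
    # Visit the best-matching node first, then progressively more distant ones.
    ordered = sorted(nodes, key=lambda n: euclid(n["address"], query_locator))
    hits = []
    for node in ordered:
        for doc_id, vec in node["docs"].items():
            score = cosine(query_vec, vec)  # ranking uses the original vectors
            if score >= min_score:
                hits.append((score, doc_id))
        if len(hits) >= wanted:
            break  # enough relevant results: stop expanding
    hits.sort(reverse=True)
    return [doc_id for _, doc_id in hits[:wanted]]
```

The separation between `query_locator` (used only to decide where to look) and `query_vec` (used to score documents) mirrors the slide's point that locators prune the search while the final ranking can use a more accurate function.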

Adaptation to Corpus Dynamics
- Basic idea: incrementally add new documents/concepts without affecting existing indices, and periodically (very infrequently) rebuild the index locators for all documents
- When a set of new documents emerges, we check
  1. whether they contain new frequently used terms or new heavily weighted terms
  2. whether their concept vectors belong to any existing cluster in the existing semantic space

Adaptation to Corpus Dynamics (II)
- To add and index a new concept c
  – If c belongs to an existing concept chunk whose size is less than the dimensionality of the underlying DHT, we can add c to that chunk by using the next available entry of the index locator
  – Otherwise, we generate a new concept group and a new set of index locators to represent c
- Generate the index locators for the new documents, and deploy their indices on the DHT
- Finally, multicast the addition of the new concept c, and of any new concept group, to all DHT nodes, so that they can route queries about c

Example for Corpus Dynamics
- When new documents on “Bin Laden” appear, we detect it as a new concept related to the concept group “terrorism”
- If the dimensionality of the DHT is 20, and the size of the “terrorism” concept group is 17
  – Just add “Bin Laden” to that group as dimension 18 of the index locator
  – The corresponding index locators of existing documents have weight zero by default on dimension 18, and thus remain the same
- Otherwise, if the “terrorism” concept group is already full, we generate a new concept group for “Bin Laden” (i.e., a new set of index locators)
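The rule illustrated by this example can be written down as a small sketch. The dictionary-of-lists representation of concept groups, the function name `add_concept`, and the naming of the overflow group are assumptions made for illustration only.

```python
def add_concept(groups, group_name, concept, dht_dims):
    """Add a concept to its related group, or open a new group if that one is full.

    groups maps a group name to its list of concepts (one locator dimension each).
    Returns (group name, 1-based dimension) where the concept now lives.
    """
    members = groups.setdefault(group_name, [])
    if len(members) < dht_dims:
        # Use the next free entry of the index locator; existing documents keep
        # an implicit weight of zero there, so their locators are unchanged.
        members.append(concept)
        return group_name, len(members)
    # Group is full: start a new concept group (hypothetical naming scheme).
    new_name = f"{group_name}/{concept}"
    groups[new_name] = [concept]
    return new_name, 1
```

With a 20-dimensional DHT and 17 concepts already in "terrorism", the call returns `("terrorism", 18)`, matching the slide's example. After either outcome, the change would still have to be announced to all DHT nodes so that they can route queries about the new concept.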

Summary
- Propose a scalable semantic indexing framework for peer-to-peer information retrieval: P2PIR
  – Index placement with good semantic locality, leading to good retrieval accuracy and efficiency
  – Tunable framework and flexibility
  – Incremental adaptation to document/concept dynamics
- Prototype and evaluation of P2PIR in progress