Concept-based P2P Search – How to find more relevant documents
Ingmar Weber, Max-Planck-Institute for Computer Science
Joint work with Holger Bast
Torino, December 8th 2004

What's wrong with Google? First... it's centralized.
- Global authority – can we trust it? ("presidential election Ukraine", "buy luxury car")
- Dependency – I need it: a single point of failure
- Size – even Google is only human:
  - unlikely to index everything that's of interest (deep web)
  - infeasible to run expensive algorithms on 8 billion documents
  - difficult to input human knowledge
First answer: peer-to-peer search

What's wrong with Google? Second... it's term-based.
Searching for "matrix factorization"... but what about...
- "matrix decomposition"... or...
- "decompose linear system"... or even...
- "probabilistic latent semantic indexing"
We get some good hits, but far from all. That is no big problem when looking for popular things – then there are still enough good hits. But this is a personal dilemma: mine is not a "Britney Spears"-like query.
Second answer: concept-based search

Peer-to-peer search – Approach 0
- Each peer has a local crawler and index
- Nobody posts any information about local indices
- Search can only be done by (limited) flooding
- No way to know in advance where to find information
- Very low recall for unpopular queries
[Figure: a "matrix factorization" query flooded through the network, hoping to reach the one relevant peer]

Peer-to-peer search – Approach 1
- Again local crawler and index
- Union of all indices stored via a distributed hash table (DHT)
- Each peer responsible for a few terms (i.e., keys in the DHT)
- A joining peer posts his full local index:
  - For each term in his collection he sends the inverted list to the corresponding peer
  - The receiving peer merges this list with his current list
  - The new peer also becomes responsible for some terms
- To search for "matrix factorization", retrieve the corresponding document lists and merge them to give the result ranking
- Nice idea, but infeasible: far too much data traffic even for medium-sized local collections
[Figure: terms (Britney, Spears, Matrix, Factorization, Linear, Decomposition, System) distributed over the peers; inverted lists such as "Linear: doc 10, doc 6, doc 17", "Factorization: doc 9, doc 7, doc 13", "Matrix: doc 7, doc 10, doc 5"; merged ranking "1. doc 7, 2. doc 9, ..."]
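The posting-and-merging step of Approach 1 can be sketched in a few lines. This is a minimal single-process stand-in, not the actual system: the `DHT` class, `post_index`, and `search` names are illustrative, and ranking here is simply by the number of matching query terms.

```python
from collections import defaultdict

class DHT:
    """Toy stand-in for a distributed hash table: term -> merged inverted list."""
    def __init__(self):
        self.table = defaultdict(set)

    def post_index(self, local_index):
        # A joining peer posts its full local index: for each term,
        # the inverted list is merged with the list already stored.
        for term, docs in local_index.items():
            self.table[term] |= set(docs)

    def search(self, query_terms):
        # Retrieve the posting list of each query term and rank
        # documents by how many query terms they contain.
        scores = defaultdict(int)
        for term in query_terms:
            for doc in self.table[term]:
                scores[doc] += 1
        return sorted(scores, key=lambda d: (-scores[d], d))

dht = DHT()
dht.post_index({"matrix": ["doc7", "doc10", "doc5"],
                "factorization": ["doc9", "doc7", "doc13"]})
ranking = dht.search(["matrix", "factorization"])
```

With the example lists from the slide, "doc 7" appears in both posting lists and therefore tops the merged ranking. The infeasibility argument is visible even here: `post_index` ships every full posting list across the network.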

Peer-to-peer search – Approach 2 ("Minerva" [Weikum et al.])
- Again local crawler and index
- Peers share a distributed hash table (DHT); each peer responsible for a few terms (i.e., keys in the DHT)
- For each term we maintain a peer list with statistics
- A joining peer posts only statistics about his terms:
  - For each term in his collection he sends a short statistic (e.g., "Linear: 20 docs, max tf 10") to the corresponding peer
  - The receiving peer merges it with his current peer statistics for this term
  - The new peer also becomes responsible for some terms
- To search for "matrix factorization": retrieve the peer lists and select the most promising peers
- Send the query to these peers; they perform a local search and return their best results
- Merge the results
- This works, but performance heavily depends on term-based peer selection, and recall is still low
[Figure: peer lists "Factorization: peer 9, peer 13", "Matrix: peer 7, peer 9"; selected peers 9 and 7 each return local results for "matrix factorization", merged into "1. doc 7, 2. doc ..."]
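The Minerva-style peer selection can be sketched as follows. This is a simplified, single-process illustration: the per-term statistic is reduced to a document frequency, and `post_statistics` / `select_peers` are hypothetical names, not Minerva's API. The initiator scores each peer by summing its statistics over the query terms and keeps the top-k.

```python
from collections import defaultdict

# term -> {peer: document frequency for that term}
peer_stats = defaultdict(dict)

def post_statistics(peer, local_df):
    # A joining peer posts only a short statistic per term,
    # not the full inverted list.
    for term, df in local_df.items():
        peer_stats[term][peer] = df

def select_peers(query_terms, k=2):
    # Score each peer by the sum of its document frequencies
    # over the query terms, then keep the k most promising peers.
    scores = defaultdict(int)
    for term in query_terms:
        for peer, df in peer_stats[term].items():
            scores[peer] += df
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

post_statistics("peer7", {"matrix": 12, "linear": 5})
post_statistics("peer9", {"matrix": 8, "factorization": 20})
post_statistics("peer13", {"factorization": 3})
best = select_peers(["matrix", "factorization"])
```

With these invented statistics the query "matrix factorization" is routed to peers 9 and 7, matching the slide's example; the query is then executed locally on those peers and the results merged.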

Concept-based search – Improving recall
Basic idea: don't directly compare queries with documents at the word level, but introduce one level of abstraction.
- Query: "probabilistic latent semantic indexing"
- Document: "non-negative matrix factorization"
- Not similar at the word level, but strongly linked at the concept level
- Shared concept: approximately decomposing a non-negative matrix into a product of two smaller non-negative matrices

Concept-based search – How to derive concepts?
- Noise reduction: replace documents by combinations of "simpler" prototypes
- Document clustering: partition the documents into subsets and map the query to the best matching subset
- Query expansion: study co-occurrence patterns of terms and add "suitable" terms to the query
Note: in all cases the automatically derived concepts depend on the individual corpus.
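The query-expansion variant is the easiest of the three to sketch. The code below is a toy illustration, not the talk's method: it counts within-document term co-occurrences over a tiny hand-made corpus and adds, per query term, the most frequently co-occurring non-query term. The corpus, thresholds, and the `expand` function are all invented for the example.

```python
from collections import defaultdict
from itertools import combinations

# Tiny invented corpus: each document as its set of terms.
docs = [
    {"matrix", "factorization", "decomposition"},
    {"matrix", "decomposition", "linear"},
    {"factorization", "decomposition"},
]

# Count how often each ordered pair of terms co-occurs in a document.
cooc = defaultdict(int)
for d in docs:
    for a, b in combinations(sorted(d), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def expand(query, n=1):
    # Add the n terms that co-occur most often with the query terms.
    extra = defaultdict(int)
    for t in query:
        for (a, b), c in cooc.items():
            if a == t and b not in query:
                extra[b] += c
    best = sorted(extra, key=lambda t: (-extra[t], t))[:n]
    return set(query) | set(best)

expanded = expand({"matrix", "factorization"})
```

Here "decomposition" co-occurs with both query terms and gets added, so the expanded query now also matches documents that say "matrix decomposition" – exactly the recall gain the slide is after. Note how corpus-dependent this is: a different corpus yields different expansions.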

Concept-based P2P search – Approach 1
- Again local crawler and index; peers share a distributed hash table (DHT)
- Each peer responsible for a few terms (i.e., keys in the DHT); for each term we maintain a peer list with statistics
- Same peer selection process using query terms as in Minerva
- The only difference: peers locally employ a concept-based retrieval scheme of their choice
- Locally better recall; we still merge the individual results
- This works, but performance still heavily depends on term-based peer selection, and recall is still low

Concept-based P2P search – A blueprint
What we would like to do:
1. Map a query to the most relevant concepts, either individually or with help from peers
2. Find out which peers have documents related to these concepts
3. Send the query to these peers and retrieve a ranking for each concept
4. Merge the results
This is difficult, as we would have to post summaries of concepts which can then be found by others – and it is not clear how to universally and uniquely represent concepts.

A new "concept"-based scheme – Using documents as concepts
Experimentally, the following works very well for a local collection:
1. Precompute a ranking of documents for each document (doc-doc similarities) using cosine similarity
2. For a query, find a few documents which in their combination are most likely to generate the query (EM algorithm)
3. Merge the corresponding rankings to give the final output ranking
Observation 1: the highest ranked documents do not have to contain the query terms.
Observation 2: doc-doc similarities are more content-based than doc-query similarities.
[Figure: for the query "matrix factorization", the precomputed rankings of two selected documents, e.g. "doc 11, doc 7, doc 8, ..." and "doc 5, doc 8, doc 3, ...", are merged into the final ranking "1. doc 8, 2. doc ..."]
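The precompute-and-merge idea can be sketched on a toy collection. Two caveats: the talk selects the query-generating documents with an EM algorithm, whereas this sketch substitutes a simple term-overlap heuristic for that step, and the documents, weights, and function names are all invented for illustration.

```python
import math

# Invented toy collection: doc -> {term: tf weight}.
docs = {
    "d1": {"matrix": 2, "factorization": 1},
    "d2": {"matrix": 1, "decomposition": 2},
    "d3": {"linear": 1, "system": 2, "decomposition": 1},
}

def cosine(u, v):
    num = sum(u[t] * v.get(t, 0) for t in u)
    return num / (math.sqrt(sum(x * x for x in u.values())) *
                  math.sqrt(sum(x * x for x in v.values())))

# Step 1: precompute doc-doc similarities (a ranking per document).
sim = {a: {b: cosine(docs[a], docs[b]) for b in docs} for a in docs}

def search(query, n_seeds=1):
    # Step 2: pick the document(s) best matching the query at the word
    # level (the talk uses an EM algorithm here; we use term overlap).
    seeds = sorted(docs,
                   key=lambda d: -sum(docs[d].get(t, 0) for t in query))[:n_seeds]
    # Step 3: rank ALL documents by similarity to the seeds – the top
    # hits need not contain the query terms at all.
    scores = {d: sum(sim[s][d] for s in seeds) for d in docs}
    return sorted(scores, key=lambda d: -scores[d])

ranking = search(["matrix", "factorization"])
```

Even in this tiny example, "d2" outranks "d3" because of its doc-doc similarity to the seed "d1", illustrating Observation 2: the final ranking is driven by content overlap between documents, not by direct query-term matching.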

Concept-based P2P search – Approach 2
- As before, post short statistics about each term in the collection, but now also about each document (e.g., the number of documents with cosine similarity > 0.7)
- Still use term statistics for the first round of peer selection (as before)
- Send the query to these peers (as before)
- Each selected peer locally selects the documents which are most likely to generate the query
- It sends back only these few documents (term vectors)
- The initiating peer then selects a few of them (EM algorithm)
[Figure: selected peers return candidate documents for "matrix factorization", e.g. "doc 2 w/ doc 5" and "doc 8 w/ doc 5"; the initiator picks doc 5 and doc 8 as "concepts"]

Concept-based P2P search – Approach 2 (continued)
- The peer lists for these selected documents are retrieved and a new set of peers is selected
- We then send these peers our query in terms of the selected documents, i.e., "concepts"
- The peers send back their most relevant documents
- Merge the individual results
[Figure: document peer lists "doc 5: peer 2, peer 1", "doc 8: peer 3, peer 2"; the concept query "doc 5 w/ doc 8" is sent to the selected peers, which return e.g. "doc 10, doc 5, doc 2" and "doc 2, doc 3, doc 8", merged into "1. doc 2, 2. doc ..."]
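The second round above can be sketched as two small functions. Everything here is hard-coded for illustration (peer lists, returned result lists); in the real system the lists come from the DHT and the remote peers, and the merge would use real scores rather than the simple vote count used below.

```python
# Which peers hold documents similar to each "concept" document
# (in the real system this is the per-document peer list in the DHT).
doc_peer_lists = {"doc5": ["peer2", "peer1"], "doc8": ["peer3", "peer2"]}

def peers_for_concepts(concept_docs):
    # New round of peer selection: the union of the peer lists
    # of all selected concept documents.
    return sorted({p for d in concept_docs for p in doc_peer_lists[d]})

def merge_results(per_peer_results):
    # Merge the rankings returned by the peers; here each returned
    # document simply gets one vote per peer that returned it.
    votes = {}
    for result in per_peer_results:
        for doc in result:
            votes[doc] = votes.get(doc, 0) + 1
    return sorted(votes, key=lambda d: (-votes[d], d))

peers = peers_for_concepts(["doc5", "doc8"])
final = merge_results([["doc10", "doc5", "doc2"],
                       ["doc2", "doc3", "doc8"]])
```

With the slide's example lists, "doc 2" is returned by both peers and wins the merge, matching the final ranking shown on the slide.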

Concept-based P2P search – Advantages of our approach
- Users only have to "agree" on documents: no need for a common taxonomy
- If we can find some relevant documents, we can find more => increases recall
- Allows a content-based "More documents like this" button
- Uses non-trivial doc-doc similarities: infeasible to compute for 8 billion documents, but easy for a few thousand