ODISSEA: a Peer-to-Peer Architecture for Scalable Web Search and IR Torsten Suel with C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasundaram.

Presentation transcript:

ODISSEA: a Peer-to-Peer Architecture for Scalable Web Search and IR
Torsten Suel
with C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, K. Shanmugasundaram
CIS Department, Polytechnic University, Brooklyn, NY
(google: “odissea peer”)

Talk Outline:
ODISSEA: architecture, motivation, ideology
 - system design
 - discussion of design choices
 - our vision: open distributed web search architecture
Distributed query processing
 - query execution in large search engines
 - efficient distributed top-k queries
 - experimental results
Open problems and future work

Introduction:
huge amount of work on web search
huge amount of activity in P2P
so, how about P2P (full text) search?
 - to query content in P2P networks
 - to query content located outside P2P network
current engines based on scalable PC clusters
so are many other “giant scale services”
we know how to do file sharing in P2P
how about search engines and large-scale IR?

ODISSEA:
“Open DIStributed Search Engine Architecture”
global indexing and query execution service
 - scalable to size of the web
 - scalable to large query load
 - highly robust
 - open

Global index organization:
avoids broadcasting query to all nodes
faces other problems: updates, long inverted lists
our main technical focus: efficient top-k queries
[Figure: local index organization vs. global index organization]
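The contrast between the two organizations can be sketched in a few lines. This is an illustration only: `node_for_term`, the modulo placement of documents, and the node count are assumptions made for the example, with a plain hash standing in for a real DHT lookup.

```python
import hashlib
from collections import defaultdict

def node_for_term(term, num_nodes):
    """Hypothetical stand-in for a DHT lookup: hash the term to a node id."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def build_global_index(docs, num_nodes):
    """Global organization: each node holds the complete list for its terms."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            nodes[node_for_term(term, num_nodes)][term].append(doc_id)
    return nodes

def build_local_index(docs, num_nodes):
    """Local organization: each node indexes only the documents it stores."""
    nodes = [defaultdict(list) for _ in range(num_nodes)]
    for doc_id in sorted(docs):
        node = nodes[doc_id % num_nodes]  # assumed document placement
        for term in set(docs[doc_id].lower().split()):
            node[term].append(doc_id)
    return nodes

def nodes_to_contact_global(query_terms, num_nodes):
    """Only the nodes owning the query terms are involved."""
    return {node_for_term(t, num_nodes) for t in query_terms}

def nodes_to_contact_local(query_terms, num_nodes):
    """Any node may hold matching documents, so the query is broadcast."""
    return set(range(num_nodes))
```

With a global index a two-term query touches at most two nodes, but those nodes must exchange (possibly long) inverted lists, which is why the talk focuses on efficient top-k processing.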

Two-tier architecture:
scalable lower tier for indexing and query execution
crawling outside system
open interface supporting client-based tools

Applications:
search of content located in P2P network
distributed search in large organizations
as a large-scale web search engine
as global search middleware on top of system of local index structures

Vision: open web search infrastructure
beyond current web search:
 - smart desktop-based search tools
 - browsing assistants, navigational toolbars
 - access lower-level search infrastructure
can we have a common infrastructure?
 - open
 - scalable
 - agnostic
example: Google API (not really)
discussion: “entry barrier to search”
tradeoff/challenge: performance vs. flexibility

Discussion: P2P and massive data
P2P system spectrum:
 - unstructured (Gnutella etc) vs. structured (DHT)
 - rapidly evolving vs. fairly static
massive data apps = fairly static system?
 - limit to how fast we can move data around
 - exception: file sharing (download, then share)
we are at the more stable end of spectrum
failures vs. unavailability
replication and synchronization challenges

Implementation:
based on Pastry DHT
index and objects stored in Berkeley DB
fine-grained postings traffic via P2P links
replication for fault-tolerance
replication based on “object groups”
nodes may be temporarily unavailable
synchronization of nodes upon reentry

Query processing in search engines
inverted index
 - a data structure for supporting text queries
 - like index in a book
[Figure: disks with documents -> indexing -> inverted index]
 aalborg     3452, 11437, ...
 ...
 arm         4, 19, 29, 98, 143, ...
 armada      145, 457, 789, ...
 armadillo   678, 2134, 3970, ...
 armani      90, 256, 372, 511, ...
 ...
 zz          602, 1189, 3209, ...
Boolean queries: (zebra AND armadillo) OR armani
unions/intersections of lists
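A minimal sketch of an inverted index and of the Boolean query from the slide, evaluated as an intersection followed by a union of lists. The tiny document collection and the function names are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):            # ascending ids keep lists sorted
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return index

def intersect(a, b):
    """AND: linear merge of two sorted lists of document ids."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: all ids that appear in either list, sorted."""
    return sorted(set(a) | set(b))

# Hypothetical four-document collection.
docs = {
    1: "zebra armadillo",
    2: "armani suit",
    3: "zebra crossing",
    4: "armadillo zebra armani",
}
index = build_inverted_index(docs)

# (zebra AND armadillo) OR armani
result = union(intersect(index["zebra"], index["armadillo"]), index["armani"])
```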

Ranking in search engines:
scoring function: assigns score to each document with respect to a given query
top-k queries: return k documents with highest score
example: cosine measure
term-based vs. link-based ranking
many other important factors (links, user feedback, $, markup)
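The formula for the cosine measure did not survive in this transcript, so the sketch below uses one common tf-idf variant as an assumption: term weight (1 + ln f) * ln(1 + N / df), query weights fixed at 1, and normalization by the document vector length. It shows a scoring function and a top-k selection, not necessarily the exact measure used in the talk.

```python
import heapq
import math
from collections import Counter

def cosine_scores(query_terms, docs):
    """Score every document that contains at least one query term."""
    n = len(docs)
    tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)

    def weight(count, term):
        # Assumed tf-idf weighting; other variants are equally plausible.
        return (1.0 + math.log(count)) * math.log(1.0 + n / df[term])

    scores = {}
    for d, counts in tf.items():
        norm = math.sqrt(sum(weight(c, t) ** 2 for t, c in counts.items()))
        dot = sum(weight(counts[t], t) for t in set(query_terms) if t in counts)
        if dot > 0:
            scores[d] = dot / norm
    return scores

def top_k(scores, k):
    """Return the k (doc_id, score) pairs with the highest score."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])
```

Computing a score for every matching document, as done here, is exactly the cost that pruning techniques for top-k queries try to avoid.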

Using Pagerank in ranking:
how to combine/add Pagerank score and cosine? (addition)
use PR or log(PR)?
normalize using mean of top-100 in list (Richardson/Domingos)
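One possible reading of the combination above, as a sketch: each component is divided by the mean of its 100 largest values and the two normalized components are added. The slide does not spell out the details, so the normalization scope, the function names, and the assumption that Pagerank values are scaled to be greater than 1 (so that log(PR) is positive) are all illustrative.

```python
import math

def mean_of_top(values, n=100):
    """Mean of the n largest values (of all values if there are fewer than n)."""
    top = sorted(values, reverse=True)[:n]
    return sum(top) / len(top)

def combined_scores(cosine, pagerank, use_log=True, top_n=100):
    """Add normalized cosine and (log) Pagerank scores.

    cosine, pagerank: dicts doc_id -> score over the same documents.
    Assumes Pagerank values > 1 so that log(PR) stays positive.
    """
    link = {d: (math.log(p) if use_log else p) for d, p in pagerank.items()}
    cos_norm = mean_of_top(cosine.values(), top_n)
    link_norm = mean_of_top(link.values(), top_n)
    return {d: cosine[d] / cos_norm + link[d] / link_norm for d in cosine}
```

Taking the logarithm compresses the heavy-tailed Pagerank distribution, so a few very popular pages do not dominate the term-based component.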

Efficient algorithms for top-k queries:
recent work by Fagin and others
FA (Fagin’s algorithm), TA (Threshold algorithm), others
term-based ranking: presort each list by contribution to cosine
Pagerank: (pre)sort by combination of cosine and Pagerank?
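For readers unfamiliar with TA, here is a textbook-style sketch of the Threshold algorithm for a sum of non-negative per-term scores, with missing entries counted as 0 (OR semantics). The in-memory dictionaries stand in for presorted inverted lists with random access; this is not the talk's implementation.

```python
import heapq

def threshold_algorithm(lists, k):
    """lists: one dict doc_id -> score (>= 0) per query term.

    Returns (top-k list of (doc_id, total), depth of sorted access used).
    """
    # Presort each list by score, highest first (sorted access order).
    sorted_lists = [sorted(l.items(), key=lambda p: -p[1]) for l in lists]
    totals = {}                      # full scores of all documents seen so far
    depth = 0
    max_depth = max(len(s) for s in sorted_lists)
    while depth < max_depth:
        threshold = 0.0              # best total any unseen document can reach
        for s in sorted_lists:
            if depth < len(s):
                doc, score = s[depth]
                threshold += score
                if doc not in totals:
                    # Random access: fetch the document's score in every list.
                    totals[doc] = sum(l.get(doc, 0.0) for l in lists)
        depth += 1
        best = heapq.nlargest(k, totals.items(), key=lambda p: p[1])
        if len(best) == k and best[-1][1] >= threshold:
            break                    # no unseen document can enter the top k
    return heapq.nlargest(k, totals.items(), key=lambda p: p[1]), depth
```

The early stop is what saves work on long lists: the scan ends once the k-th best total reaches the threshold formed by the scores at the current scan positions.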

Some results:
centralized setting
120 million crawled pages
Excite query trace
CA = “clairvoyant algorithm”

More details:
most savings for long lists
in fact, cos + log(PR) schemes get better and better

Shortest shorter lists:
some methods increase with length of other list
intersection pretty bad

Medium shorter lists:
only FA with cosine increases with length of longer list
others much better and closer to each other

Distributed implementation:
one round-trip
need to decide right length of prefix to send
can be extended to more than two keywords
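The slide gives only the outline of the protocol, so the following is one possible one-round-trip scheme for two keywords, written as an assumption rather than as the authors' exact method. Node A holds `list_a`, node B holds `list_b`, a result must appear in both lists (AND semantics), and its total score is the sum of its two non-negative scores.

```python
def top_k_by_total(totals, k):
    """Highest totals first; ties broken by smaller document id."""
    return sorted(totals.items(), key=lambda p: (-p[1], p[0]))[:k]

def distributed_top_k(list_a, list_b, k, prefix_len):
    """list_a, list_b: dict doc_id -> score (>= 0); requires prefix_len >= 1."""
    # Step 1 (A -> B): A sends its prefix_len highest-scoring postings.
    prefix = sorted(list_a.items(), key=lambda p: -p[1])[:prefix_len]
    r_min = prefix[-1][1]            # no unsent posting of A scores higher
    sent = {d for d, _ in prefix}

    # Step 2 (on B): look up the received documents in B's list.
    totals = {d: s + list_b[d] for d, s in prefix if d in list_b}
    candidates = top_k_by_total(totals, k)
    r_k = candidates[-1][1] if len(candidates) == k else float("-inf")

    # Step 3 (B -> A): B returns the candidates plus every posting that could
    # still reach r_k when paired with an unsent posting of A.
    reply = {d: s for d, s in list_b.items()
             if d not in sent and s >= r_k - r_min}

    # Step 4 (on A): complete the totals for the returned postings.
    final = dict(candidates)
    for d, s in reply.items():
        if d in list_a:
            final[d] = list_a[d] + s
    return top_k_by_total(final, k)
```

A longer prefix raises r_k and so shrinks B's reply, while a shorter prefix makes the first message cheaper; this is the prefix-length decision the slide refers to.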

Results of distributed implementation:
top-10 queries
cosine (top) and cos + log(PR) (bottom)
8 bytes per posting
TCP performance model for congestion window
prefix length determined by threshold algorithm (TA)

Related Work:
P2P search: JXTA, pSearch, FASD, planetP, others
with global index structure:
 - Gnawali (Chord)
 - Reynolds/Vahdat: Bloom filters
 - Li et al: feasibility of P2P search engines, Bloom filters and other techniques (IPTPS 2003)
Pruning techniques for top-k queries
 - DB Community: Fagin et al now
 - IR Community: since 1980s (Buckley/Lewit SIGIR 85)
 - Persin/Zobel/Sacks-Davis 1996, Anh/Kretser/Moffat
differences: random lookups, # of terms, AND vs. OR

Current Status and Future Work:
system still being built (very basic version done)
working on query optimization
 - integrating Bloom filters and other heuristics
 - optimizing query plans for 2 and more keywords
 - use of statistics
loose ends in evaluation
 - results for three and more terms
 - integrating other measures (e.g., term distance)
replication, synchronization
more info: (google: “odissea peer”)