OverCite: A Distributed, Cooperative CiteSeer Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, Robert Morris MIT Computer Science and.

Slides:



Advertisements
Similar presentations
Searching over Many Sites in Jeremy Stribling Joint work with: Jinyang Li, M. Frans Kaashoek, Robert Morris MIT Computer Science and Artificial Intelligence.
Advertisements

Efficient Event-based Resource Discovery Wei Yan*, Songlin Hu*, Vinod Muthusamy +, Hans-Arno Jacobsen +, Li Zha* * Chinese Academy of Sciences, Beijing.
CSE431 Chapter 7A.1Irwin, PSU, 2008 CSE 431 Computer Architecture Fall 2008 Chapter 7A: Intro to Multiprocessor Systems Mary Jane Irwin (
Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, Hari Balakrishnan MIT and Berkeley presented by Daniel Figueiredo Chord: A Scalable Peer-to-peer.
Piccolo: Building fast distributed programs with partitioned tables Russell Power Jinyang Li New York University.
Peer-to-Peer (P2P) Distributed Storage 1Dennis Kafura – CS5204 – Operating Systems.
Predictive Parallelization: Taming Tail Latencies in
Kademlia: A Peer-to-peer Information System Based on the XOR Metric.
CHORD – peer to peer lookup protocol Shankar Karthik Vaithianathan & Aravind Sivaraman University of Central Florida.
Web Search - Summer Term 2006 III. Web Search - Introduction (Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Web Search – Summer Term 2006 VI. Web Search - Indexing (c) Wolfgang Hürst, Albert-Ludwigs-University.
1 Defragmenting DHT-based Distributed File Systems Jeffrey Pang, Srinivasan Seshan Carnegie Mellon University Phillip B. Gibbons, Michael Kaminsky Intel.
 Copyright 2005 Digital Enterprise Research Institute. All rights reserved. 1 The Architecture of a Large-Scale Web Search and Query Engine.
Web Caching Schemes1 A Survey of Web Caching Schemes for the Internet Jia Wang.
Architecture of the 1st Google Search Engine SEARCHER URL SERVER CRAWLERS STORE SERVER REPOSITORY INDEXER D UMP L EXICON SORTERS ANCHORS URL RESOLVER (CF.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Web Search – Summer Term 2006 III. Web Search - Introduction (Cont.) - Jeff Dean, Google's Systems Lab:
Google Bigtable A Distributed Storage System for Structured Data Hadi Salimi, Distributed Systems Laboratory, School of Computer Engineering, Iran University.
CS 349: WebBase 1 What the WebBase can and can’t do?
Web Search – Summer Term 2006 V. Web Search - Page Repository (c) Wolfgang Hürst, Albert-Ludwigs-University.
Distributed Web Crawling over DHTs Boon Thau Loo, Owen Cooper, Sailesh Krishnamurthy CS294-4.
1 The anatomy of a Large Scale Search Engine Sergey Brin,Lawrence Page Dept. CS of Stanford University.
OverCite: A Cooperative Digital Research Library Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David Karger, Robert Morris, Scott.
ObliviStore High Performance Oblivious Cloud Storage Emil StefanovElaine Shi
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Interposed Request Routing for Scalable Network Storage Darrell Anderson, Jeff Chase, and Amin Vahdat Department of Computer Science Duke University.
Wide-Area Cooperative Storage with CFS Robert Morris Frank Dabek, M. Frans Kaashoek, David Karger, Ion Stoica MIT and Berkeley.
Oasis: Anycast for Any Service Michael J. Freedman Karthik Lakshminarayanan David Mazières in NSDI 2006 Presented by: Sailesh Kumar.
Getting Started with Windows Azure Name Title Microsoft Corporation.
Crawlers and Spiders The Web Web crawler Indexer Search User Indexes Query Engine 1.
M i SMob i S Mob i Store - Mobile i nternet File Storage Platform Chetna Kaur.
1 NETE4631 Using Google Web Services and Using Microsoft Cloud Services Lecture Notes #7.
Parallel and Distributed IR. 2 Papers on Parallel and Distributed IR Introduction Paper A: Inverted file partitioning schemes in Multiple Disk Systems.
Data Structures & Algorithms and The Internet: A different way of thinking.
Chord: A Scalable Peer-to-peer Lookup Protocol for Internet Applications Xiaozhou Li COS 461: Computer Networks (precept 04/06/12) Princeton University.
Full-Text Search in P2P Networks Christof Leng Databases and Distributed Systems Group TU Darmstadt.
1 CS 425 Distributed Systems Fall 2011 Slides by Indranil Gupta Measurement Studies All Slides © IG Acknowledgments: Jay Patel.
Efficient Peer to Peer Keyword Searching Nathan Gray.
WheelFS Jeremy Stribling, Frans Kaashoek, Jinyang Li, Robert Morris MIT CSAIL and New York University.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
Problems in Semantic Search Krishnamurthy Viswanathan and Varish Mulwad {krishna3, varish1} AT umbc DOT edu 1.
LOGO A comparison of two web-based document management systems ShaoxinYu Columbia University March 31, 2009.
CS 347Notes101 CS 347 Parallel and Distributed Data Processing Distributed Information Retrieval Hector Garcia-Molina Zoltan Gyongyi.
Virtual Memory 1 1.
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
Distributed systems [Fall 2015] G Lec 1: Course Introduction.
Infrastructure for Data Warehouses. Basics Of Data Access Data Store Machine Memory Buffer Memory Cache Data Store Buffer Bus Structure.
Concept-based P2P Search How to find more relevant documents Ingmar Weber Max-Planck-Institute for Computer Science Joint work with Holger Bast Torino,
1 Adaptive Parallelism for Web Search Myeongjae Jeon Rice University In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice),
Data Indexing in Peer- to-Peer DHT Networks Garces-Erice, P.A.Felber, E.W.Biersack, G.Urvoy-Keller, K.W.Ross ICDCS 2004.
CS 347Notes081 CS 347: Parallel and Distributed Data Management Notes 08: P2P Systems.
1 Secure Peer-to-Peer File Sharing Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, Hari Balakrishnan MIT Laboratory.
1 NETE4631 Using Google Web Services Lecture Notes #6.
The Anatomy of a Large-Scale Hypertextual Web Search Engine S. Brin and L. Page, Computer Networks and ISDN Systems, Vol. 30, No. 1-7, pages , April.
Cofax Scalability Document Version Scaling Cofax in General The scalability of Cofax is directly related to the system software, hardware and network.
1 Efficient Crawling Through URL Ordering Junghoo Cho Hector Garcia-Molina Lawrence Page Stanford InfoLab.
Statistics Visualizer for Crawler
Efficient Multi-User Indexing for Secure Keyword Search
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Parallel Databases.
CHAPTER 3 Architectures for Distributed Systems
The Anatomy of a Large-Scale Hypertextual Web Search Engine
SCALABLE OPEN ACCESS Hussein Suleman
Building Peer-to-Peer Systems with Chord, a Distributed Lookup Service
WheelFS Jeremy Stribling, Frans Kaashoek, Jinyang Li, Robert Morris
Distributed Hash Tables
The Search Engine Architecture
Consistent Hashing and Distributed Hash Table
Virtual Memory 1 1.
Presentation transcript:

OverCite: A Distributed, Cooperative CiteSeer Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, Robert Morris MIT Computer Science and Artificial Intelligence Laboratory UC Berkeley/New York University Pennsylvania State University

People Love CiteSeer Online repository of academic papers Crawls, indexes, links, and ranks papers Important resource for CS community typical unification of access points and rereliable web services

People Love CiteSeer Too Much Burden of running the system forced on one site Scalability to large document sets uncertain Adding new resources is difficult

What Can We Do? Solution #1: All your © are belong to ACM Solution #2: Donate money to PSU Solution #3: Run your own mirror Solution #4: Aggregate donated resources

Solution: OverCite Clie nt Rest of the talk focuses on how to achieve this

CiteSeer Today: Hardware Two 2.8-GHz servers at PSU Clie nt

CiteSeer Today: Search Search keywords Results meta-data Context

CiteSeer Today: Documents Cached doc Cited by

CiteSeer: Local Resources # documents675,000 Document storage803 GB Meta-data storage45 GB Index size22 GB Total storage870 GB Searches250,000/day Document traffic21 GB/day Total traffic34.4 GB/day

Goals and Challenge Goals –Parallel speedup –Lower burden per site Challenge: Distribute work over wide-area nodes –Storage –Search –Crawling

OverCite’s Approach Storage: –Use DHT for documents and meta-data –Achieve parallelism, balanced load, durability Search: –Divide docs into partitions, hosts into groups –Less search work per host Crawling –Coordinate activity via DHT

The Life of a DownloadThe Life of a Query Clie nt Query Results Page Keywords Hits w/ meta-data, rank and context Meta-data req/resp Group 1 Group 2 Group 3 Group 4 Document Req Document blocks Web-based front end Index DHT storage (Documents and meta-data)

Store Docs and Meta-data in DHT DHT stores papers for durability DHT stores meta-data tables –e.g., document IDs  {title, author, year, etc.} DHT provides load-balance and parallelism Serv er

peer  {1,2,3} DHT  {4} mesh  {2,4,6} hash  {1,2,5} table  {1,2,4} Parallelizing Queries Partition by document Divide the index into k partitions Each query sent to only k nodes Serv er peer  {1} hash  {1,5} table  {1} peer  {2} mesh  {2,6} hash  {2} table  {2} peer  {3} DHT  {4} mesh  {4} table  {4} Part. 1 Part. 2 peer  {1,2,3} mesh  {2} hash  {1,2} table  {1,2} peer  {1,2,3} mesh  {2} hash  {1,2} table  {1,2} DHT  {4} hash  {5} mesh  {4,6} table  {4} DHT  {4} hash  {5} mesh  {4,6} table  {4} Group 1Group 2 Documents 1,5 Documents 2,6 Document 3 Document 4 Documents 1,2,3Documents 4,5,6

Considerations for k If k is small + Send queries to fewer hosts  less latency + Fewer DHT lookups – Less opportunity for parallelism If k is big + More parallelism + Smaller index partitions  faster searches –More hosts  some node likely to be slow –More DHT lookups Current deployment: k = 2

Implementation Storage: Chord/DHash DHT Index: Searchy search engine Web server: OKWS Anycast service: OASIS Event-based execution, using libasync 11,000 lines of C++ code

Deployment 27 nodes across North America –9 RON/IRIS nodes + private machines –47 physical disks, 3 DHash nodes per disk –Large range of disk and memory Map source:

Evaluation Questions Does OverCite achieve parallel speedup? What is the per-node storage burden? What is the system-wide storage overhead?

Configuration Index first 5,000 words/document 2 partitions (k = 2) 20 results per query 2 replicas/block in the DHT

Evaluation Methods 9 Web front end servers 18 index servers 27 DHT servers Clie nt 1 client at MIT 1000 queries from CS trace 128 queries in parallel Group 1 Group 2

More Servers  More Queries/sec 9x servers  7x query throughput CiteSeer serves 4.8 queries/sec (All experiments use 27 DHT servers)

Per-node Storage Burden PropertyIndividual Cost Document/ meta-data storage 18.1 GB Index size6.8 GB Total storage24.9 GB

System-wide Storage Overhead PropertySystem Cost Document/ meta-data storage 18.1 GB * 47 = GB Index size6.8 GB * 27 = GB Total storage GB 4x as expensive as raw underlying data

Future Work Production-level public deployment Distributed crawler Public API for developing new features

Related Work Search on DHTs –Partition by keyword [Li et al. IPTPS ’03, Reynolds & Vadhat Middleware ’03, Suel et al. IWWD ’03] –Hybrid schemes [Tang & Dwarkadas NSDI ’04, Loo et al. IPTPS ’04, Shi et al. IPTPS ’04, Rooter WMSCI ‘05] Distributed crawlers [Loo et al. TR ’04, Cho & Garcia-Molina WWW ’02, Singh et al. SIGIR ‘03] Other paper repositories [arXiv.org (Physics), ACM and Google Scholar (CS), Inspec (general science)]

Summary A system for storing and coordinating a digital repository using a DHT Spreads load across many volunteer nodes Simple to take advantage of new resources Run CiteSeer as a community Implementation and deployment