Searching over Many Sites in Jeremy Stribling Joint work with: Jinyang Li, M. Frans Kaashoek, Robert Morris MIT Computer Science and Artificial Intelligence.

Slides:



Advertisements
Similar presentations
TWO STEP EQUATIONS 1. SOLVE FOR X 2. DO THE ADDITION STEP FIRST
Advertisements

Advanced Piloting Cruise Plot.
Feichter_DPG-SYKL03_Bild-01. Feichter_DPG-SYKL03_Bild-02.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 38.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
Chapter 1 Image Slides Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
1 Building a Fast, Virtualized Data Plane with Programmable Hardware Bilal Anwer Nick Feamster.
28 April 2004Second Nordic Conference on Scholarly Communication 1 Citation Analysis for the Free, Online Literature Tim Brody Intelligence, Agents, Multimedia.
UNITED NATIONS Shipment Details Report – January 2006.
Library 1 Electronic Resources in the EUI Library Veerle Deckmyn, Library Director Aimee Glassel, Electronic Resources Librarian September 2, 2009.
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Jeopardy Q 1 Q 6 Q 11 Q 16 Q 21 Q 2 Q 7 Q 12 Q 17 Q 22 Q 3 Q 8 Q 13
Title Subtitle.
Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×
Local Customization Chapter 2. Local Customization 2-2 Objectives Customization Considerations Types of Data Elements Location for Locally Defined Data.
My Alphabet Book abcdefghijklm nopqrstuvwxyz.
DIVIDING INTEGERS 1. IF THE SIGNS ARE THE SAME THE ANSWER IS POSITIVE 2. IF THE SIGNS ARE DIFFERENT THE ANSWER IS NEGATIVE.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 5 second questions
Who Wants To Be A Millionaire? Decimal Edition Question 1.
ZMQS ZMQS
Internet Search Engine freshness by Web Server help Presented by: Barilari Alessandro.
17 th International World Wide Web Conference 2008 Beijing, China XML Data Dissemination using Automata on top of Structured Overlay Networks Iris Miliaraki.
Solve Multi-step Equations
Re-examining Instruction Reuse in Pre-execution Approaches By Sonya R. Wolff Prof. Ronald D. Barnes June 5, 2011.
Break Time Remaining 10:00.
SE-292 High Performance Computing
4.1 © 2004 Pearson Education, Inc. Exam Managing and Maintaining a Microsoft® Windows® Server 2003 Environment Lesson 4: Organizing a Disk for Data.
Fast Crash Recovery in RAMCloud
Databasteknik Databaser och bioinformatik Data structures and Indexing (II) Fang Wei-Kleiner.
13 Copyright © 2005, Oracle. All rights reserved. Monitoring and Improving Performance.
ABC Technology Project
Building Distributed, Wide-Area Applications with WheelFS Jeremy Stribling, Emil Sit, Frans Kaashoek, Jinyang Li, and Robert Morris MIT CSAIL and NYU.
Cache and Virtual Memory Replacement Algorithms
1 Sizing the Streaming Media Cluster Solution for a Given Workload Lucy Cherkasova and Wenting Tang HPLabs.
1 Undirected Breadth First Search F A BCG DE H 2 F A BCG DE H Queue: A get Undiscovered Fringe Finished Active 0 distance from A visit(A)
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
VOORBLAD.
15. Oktober Oktober Oktober 2012.
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
1 Breadth First Search s s Undiscovered Discovered Finished Queue: s Top of queue 2 1 Shortest path from s.
©2007 First Wave Consulting, LLC A better way to do business. Period This is definitely NOT your father’s standard operating procedure.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
1..
© 2012 National Heart Foundation of Australia. Slide 2.
Chapter 5 Test Review Sections 5-1 through 5-4.
Addition 1’s to 20.
25 seconds left…...
Week 1.
We will resume in: 25 Minutes.
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
SE-292 High Performance Computing Memory Hierarchy R. Govindarajan
Clock will move after 1 minute
Intracellular Compartments and Transport
A SMALL TRUTH TO MAKE LIFE 100%
PSSA Preparation.
Essential Cell Biology
Chen Li ( 李晨 ) Chen Li Scalable Interactive Search NFIC August 14, 2010, San Jose, CA Joint work with colleagues at UC Irvine and Tsinghua University.
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
CFR 250/590 Introduction to GIS, Autumn 1999 Data Search & Import © Phil Hurvitz, find_data 1  Overview Web search engines NSDI GeoSpatial Data.
OverCite: A Distributed, Cooperative CiteSeer Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, Robert Morris MIT Computer Science and.
OverCite: A Cooperative Digital Research Library Jeremy Stribling, Isaac G. Councill, Jinyang Li, M. Frans Kaashoek, David Karger, Robert Morris, Scott.
WheelFS Jeremy Stribling, Frans Kaashoek, Jinyang Li, Robert Morris MIT CSAIL and New York University.
Presentation transcript:

Searching over Many Sites in Jeremy Stribling Joint work with: Jinyang Li, M. Frans Kaashoek, Robert Morris MIT Computer Science and Artificial Intelligence Laboratory

2 Talk Outline Background: CiteSeer OverCites Design (The Search for Distributed Performance) Evaluation (The Performance of Distributed Search) Future (and related) work

3 People Love CiteSeer Online repository of academic papers Crawls, indexes, links, and ranks papers Important resource for CS community typical unification of access points and rereliable web services

4 People Love CiteSeer Too Much Burden of running the system forced on one site Scalability to large document sets uncertain Adding new resources is difficult

5 What Can We Do? Solution #1: Let a big search engine solve it Solution #2: All your © are belong to ACM Solution #3: Donate money to PSU Solution #4: Run your own mirror Solution #5: Aggregate donated resources

6 Solution: OverCite Clie nt Wide area hosts A distributed, cooperative version of CiteSeer Implementation/performance of wide-area search OverCite: A Distributed, Cooperative CiteSeer. Jeremy Stribling, Jinyang Li, Isaac G. Councill, M. Frans Kaashoek, Robert Morris. NSDI, May 2006.

7 CiteSeer Today: Hardware Two 2.8-GHz servers at PSU Clie nt

8 CiteSeer Today: Search Search keywords Results meta-data Context

9 CiteSeer: Local Resources 34.4 GB/dayTotal traffic 21 GB/dayDocument traffic 250,000/day Searches 22 GBIndex size 675,000# documents Current CiteSeer capacity: 4.8 queries/s Users issue 8.3 queries/doc 404 KB/s –Search is the bottleneck 5%Index coverage

10 Talk Outline Background: CiteSeer OverCites Design (The Search for Distributed Performance) Evaluation (The Performance of Distributed Search) Future (and related) work

11 Search Goals for OverCite Distributed search goals: –Parallel speedup –Lower burden per site Challenge: Distribute work over wide-area nodes

12 Search Approach Approach: –Divide docs into partitions, hosts into groups –Less search work per host Same as in cluster solutions, but wide-area Doesnt sacrifice search quality for performance Not explicitly designed for the scale of the Web

13 The Life of a Query Clie nt Query Results Page Keywords Hits w/ meta-data, rank and context Meta-data req/resp Group 1 Group 2 Group 3 Group 4 Web-based front end Index DHT storage (Documents and meta-data)

14 Local Queries Inverted index: words posting lists DB: words position in index Text file: full ASCII text for all documents peer {, } mesh { } hash {, } Hash 3 Mesh 2 Peer 1 Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwyf Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwy Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwyf Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwyf f Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwyf Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwy Kyubgtigfiugwpoifbgwcgfiygi fouryfr ypofwy foypofwyf the peer is Query: peer hash Result: Doc #1 w/ context

15 peer {1,2,3} DHT {4} mesh {2,4,6} hash {1,2,5} table {1,2,4} Parallelizing Queries Partition by document Divide the index into k partitions Each query sent to only k nodes Serv er peer {1} hash {1,5} table {1} peer {2} mesh {2,6} hash {2} table {2} peer {3} DHT {4} mesh {4} table {4} Part. 1 Part. 2 peer {1,2,3} mesh {2} hash {1,2} table {1,2} peer {1,2,3} mesh {2} hash {1,2} table {1,2} DHT {4} hash {5} mesh {4,6} table {4} DHT {4} hash {5} mesh {4,6} table {4} Group 1Group 2 Documents 1,5 Documents 2,6 Document 3 Document 4 Documents 1,2,3Documents 4,5,6

16 Considerations for k If k is small + Fewer hosts less network latency –Less opportunity for parallelism If k is big + More parallelism + Smaller index partitions faster searches –More hosts some node likely to be slow

17 Talk Outline Background: CiteSeer OverCites Design (The Search for Distributed Performance) Evaluation (The Performance of Distributed Search) Future (and related) work

18 Deployment 27 nodes across North America –9 RON/IRIS nodes + private machines –47 physical disks Map source:

19 Evaluation Questions What are the bottlenecks for local queries? Is wide-area search distribution worthwhile? Do more machines mean more throughput?

20 Local Configuration Index first 64K chars/document (78% coverage) 20 results per query One keyword context per query Total of 523,000 unique CiteSeer documents Average over 1000s of CiteSeer queries

21 Local: Index Size vs. Latency Context bottleneck: Disk seeks Search bottleneck: Disk tput and cache hit ratio

22 Local: Memory Performance Smaller index better memory performance

23 Distributed Configuration 1 client at MIT 128 queries in parallel Average over 1000 CiteSeer queries Vary k (number of machines used) Each machine has local index over 1/k docs

24 Distributed: Index Size vs. Tput Throughput improves, despite network latencies (10.2 GB)(5.1 GB)(2.5 GB)(1.2 GB)

25 Talk Outline Background: CiteSeer OverCites Design (The Search for Distributed Performance) Evaluation (The Performance of Distributed Search) Future (and related) work

26 Future Work Will throughput level off or drop as k increases? How would many more nodes affect approach? Push to have a more real system

27 Related Work Search on unstructured P2P –[Gia SIGCOMM 03, FASD 02, Yang et al. 02] Search on DHTs –[Loo et al. IPTPS 04, eSearch NSDI 04, Rooter WMSCI 05] Distributed Web search [Google IEEE Micro 03, Li et al. IPTPS 03, Distributed PageRank VLDB 04 & 06] Other paper repositories [arXiv.org (Physics), ACM and Google Scholar (CS), Inspec (general science)]

28 Summary Distributed search on a wide-area scale Large indexes (> memory) should be distributed Implementation and performance of a prototype