Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval) Ryan Huebsch † Joe Hellerstein †, Nick Lanham †, Boon Thau Loo.

Presentation transcript:

Querying the Internet with PIER (PIER = Peer-to-peer Information Exchange and Retrieval)
Ryan Huebsch †, Joe Hellerstein †, Nick Lanham †, Boon Thau Loo †, Scott Shenker ‡, Ion Stoica †
† UC Berkeley, CS Division
‡ International Computer Science Institute, Berkeley CA
VLDB 9/12/03

2 What is Very Large? Depends on Who You Are
- Scale spectrum: Single Site, Clusters, Distributed (10’s – 100’s), Internet Scale (1000’s – Millions)
- Communities: Database Community, Network Community
- Challenge: How to run DB style queries at Internet Scale!

3 What are the Key Properties?
Lots of data that is:
1. Naturally distributed (where it’s generated)
2. Centralized collection undesirable
3. Homogeneous in schema
4. Data is more useful when viewed as a whole
This is the design space we have chosen to investigate.

4 Who Needs Internet Scale? Example 1: Filenames
- Simple ubiquitous schemas: Filenames, Sizes, ID3 tags
- Born from early P2P systems such as Napster, Gnutella, AudioGalaxy, etc.
- Content is shared by “normal” non-expert users… home users
- Systems were built by a few individuals ‘in their garages’ -> Low barrier to entry

5 Example 2: Network Traces
- Schemas are mostly standardized: IP, SMTP, HTTP, SNMP log formats
- Network administrators are looking for patterns within their site AND with other sites:
  - DoS attacks cross administrative boundaries
  - Tracking virus/worm infections
  - Timeliness is very helpful
- Might surprise you how useful it is: Network bandwidth on PlanetLab (world-wide distributed research test bed) is mostly filled with people monitoring the network status

6 Our Challenge
- Our focus is on the challenge of scale:
  - Applications are homogeneous and distributed
  - Already have significant interest
  - Provide a flexible framework for a wide variety of applications
- Other interesting issues we will not discuss today:
  - Heterogeneous schemas
  - Crawling/Warehousing/etc
- These issues are complementary to the issues we will cover in this talk

7 Four Design Principles (I)
- Relaxed Consistency
  - ACID transactions severely limit the scalability and availability of distributed databases
  - We provide best-effort results
- Organic Scaling
  - Applications may start small, without a priori knowledge of size

8 Four Design Principles (II)
- Natural habitat
  - No CREATE TABLE/INSERT
  - No “publish to web server”
  - Wrappers or gateways allow the information to be accessed where it is created
- Standard Schemas via Grassroots software
  - Data is produced by widespread software providing a de-facto schema to utilize

[Figure labels: Physical Network, Overlay Network, Query Plan, Declarative Queries]

10 Recent Overlay Networks: Distributed Hash Tables (DHTs)
What is a DHT?
- Take an abstract ID space, and partition among a changing set of computers (nodes)
- Given a message with an ID, route the message to the computer currently responsible for that ID
- Can store messages at the nodes
- This is like a “distributed hash table”
  - Provides a put()/get() API
  - Cheap maintenance when nodes come and go
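The put()/get() idea above can be sketched in a few lines of Python. This is a single-process toy, not PIER's DHT: the class name `ToyDHT`, the 32-bit ID space, and the successor-on-a-ring ownership rule are all assumptions made for illustration, and message routing between machines is omitted entirely.

```python
import hashlib
from bisect import bisect_left, insort

ID_BITS = 32  # size of the abstract ID space (assumed for illustration)


def to_id(name: str) -> int:
    """Hash a string into the abstract ID space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % (1 << ID_BITS)


class ToyDHT:
    """Hypothetical in-memory stand-in for a DHT; real systems route over a network."""

    def __init__(self):
        self.node_ids = []  # sorted IDs of the nodes currently present
        self.stores = {}    # node ID -> {key: value} held by that node

    def _owner(self, key: str) -> int:
        # The owner is the first node ID at or after the key's ID, wrapping around.
        if not self.node_ids:
            raise RuntimeError("no nodes in the DHT")
        i = bisect_left(self.node_ids, to_id(key))
        return self.node_ids[i % len(self.node_ids)]

    def put(self, key: str, value) -> None:
        self.stores[self._owner(key)][key] = value

    def get(self, key: str):
        return self.stores[self._owner(key)].get(key)

    def join(self, node_name: str) -> None:
        nid = to_id(node_name)
        if nid in self.stores:
            raise ValueError("node ID collision")
        insort(self.node_ids, nid)
        self.stores[nid] = {}
        # Simplification: re-place every item. A real DHT moves only the
        # keys that fall into the new node's share of the ID space.
        items = [(k, v) for s in self.stores.values() for k, v in s.items()]
        for s in self.stores.values():
            s.clear()
        for k, v in items:
            self.put(k, v)

    def leave(self, node_name: str) -> None:
        nid = to_id(node_name)
        orphaned = self.stores.pop(nid)
        self.node_ids.remove(nid)
        for k, v in orphaned.items():
            self.put(k, v)  # hand the departed node's items to the new owner
```

Usage: after `dht.join("a")` and `dht.put("song.mp3", 1)`, a later `dht.get("song.mp3")` returns the stored value even if other nodes join or leave in between, because ownership is recomputed from the key's ID.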

11 Recent Overlay Networks: Distributed Hash Tables (DHTs)
[Figure: two-dimensional ID space with corners (0,0), (16,0), (0,16), (16,16); Data Key = (15,14)]
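The coordinates on this slide suggest a two-dimensional ID space in which each node owns a rectangular zone and a data key is a point. The following is a minimal sketch of that lookup, assuming four equal quadrants and hypothetical node names; the actual zone layout in the slide's figure is not recoverable from the transcript.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A rectangular region of the ID space owned by one node (half-open bounds)."""
    owner: str
    x_lo: int
    x_hi: int
    y_lo: int
    y_hi: int

    def contains(self, x: int, y: int) -> bool:
        return self.x_lo <= x < self.x_hi and self.y_lo <= y < self.y_hi


# Assumed layout: the 16 x 16 space split into four quadrants.
ZONES = [
    Zone("node-A", 0, 8, 0, 8),
    Zone("node-B", 8, 16, 0, 8),
    Zone("node-C", 0, 8, 8, 16),
    Zone("node-D", 8, 16, 8, 16),
]


def owner_of(x: int, y: int) -> str:
    """Return the node whose zone contains the point (x, y)."""
    for zone in ZONES:
        if zone.contains(x, y):
            return zone.owner
    raise ValueError("point lies outside the ID space")


print(owner_of(15, 14))  # the slide's data key falls in the upper-right quadrant
```

In a real overlay no node holds the full zone list; each node knows only its neighbors and forwards the message toward the key's coordinates.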

12 Recent Overlay Networks: Distributed Hash Tables (DHTs)
Lots of effort is put into making DHTs better:
- Scalable (thousands -> millions of nodes)
- Resilient to failure
- Secure (anonymity, encryption, etc.)
- Efficient (fast access with minimal state)
- Load balanced
- etc.

13 PIER’s Three Uses for DHTs
Single elegant mechanism with many uses:
- Search: Index
  - Like a hash index
- Partitioning: Value (key)-based routing
  - Like Gamma/Volcano
- Routing: Network routing for QP messages
  - Query dissemination
  - Bloom filters
  - Hierarchical QP operators (aggregation, join, etc)
Not clear there’s another substrate that supports all these uses
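The partitioning use can be illustrated with a small sketch: if tuples from two tables are routed by a hash of their join value, tuples that can match end up at the same node. The function names `node_for` and `partition`, and the fixed node count, are assumptions for illustration; a DHT would route on the hashed ID rather than take a modulus over a known node count.

```python
import hashlib


def node_for(value, num_nodes: int) -> int:
    """Map a join value to a node index via a stable hash (illustrative only)."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(tuples, key: str, num_nodes: int):
    """Split tuples into one bucket per node according to the value of `key`."""
    buckets = [[] for _ in range(num_nodes)]
    for t in tuples:
        buckets[node_for(t[key], num_nodes)].append(t)
    return buckets


# R is partitioned on num1 and S on pkey, so equal values meet at the same node.
R = [{"pkey": 1, "num1": 7}, {"pkey": 2, "num1": 8}]
S = [{"pkey": 7}, {"pkey": 8}]
r_parts = partition(R, "num1", 4)
s_parts = partition(S, "pkey", 4)
```

Each node can then join its own bucket of R against its own bucket of S without seeing the rest of either table.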

14 Metrics
We are primarily interested in 3 metrics:
- Answer quality (recall and precision)
- Bandwidth utilization
- Latency
Different DHTs provide different properties:
- Resilience to failures (recovery time) -> answer quality
- Path length -> bandwidth & latency
- Path convergence -> bandwidth & latency
Different QP Join Strategies:
- Symmetric Hash Join, Fetch Matches, Symmetric Semi-Join, Bloom Filters, etc.
- Big Picture: Tradeoff bandwidth (extra rehashing) and latency
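For the answer-quality metric, recall and precision can be computed by comparing the tuples a best-effort query returned against the tuples a complete run would have returned. The helper below is a sketch using the standard definitions; the function name and the convention of returning 1.0 for an empty denominator are assumptions, not something the slides specify.

```python
def recall_precision(returned, expected):
    """Return (recall, precision) for a best-effort answer.

    recall    = correct results returned / results that should exist
    precision = correct results returned / results actually returned
    """
    returned, expected = set(returned), set(expected)
    correct = returned & expected
    recall = len(correct) / len(expected) if expected else 1.0
    precision = len(correct) / len(returned) if returned else 1.0
    return recall, precision


# Three of four expected tuples arrive, plus one spurious tuple.
print(recall_precision([1, 2, 3, 9], [1, 2, 3, 4]))  # (0.75, 0.75)
```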

15 Experimental Setup
A simple join query:
    SELECT R.pkey, S.pkey, R.pad
    FROM R, S
    WHERE R.num1 = S.pkey
      AND R.num2 > constant1
      AND S.num2 > constant2
      AND f(R.num3, S.num3) > constant3
- Table R has 10 times more tuples than S
- num1, num2, num3 are uniformly distributed
- R.num1 is chosen such that 90% of R tuples have one matching tuple in S, while 10% have no matching tuples.
- R.pad is used to ensure all result tuples are 1KB
- Symmetric Hash Join
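As a rough illustration of how a symmetric hash join evaluates this query, the sketch below processes R and S tuples in arrival order within one process. It is not PIER's implementation: in a DHT setting the two hash tables would be spread across nodes, and the dictionary-based tuple format and the function name are assumptions.

```python
def symmetric_hash_join(stream, constant1, constant2, constant3, f):
    """Join R and S tuples as they arrive.

    `stream` yields ("R", tuple) or ("S", tuple) pairs; tuples are dicts.
    Each arriving tuple is inserted into its own table and immediately
    probed against the other table, so results appear without waiting
    for either input to finish.
    """
    r_table = {}  # R tuples keyed by R.num1
    s_table = {}  # S tuples keyed by S.pkey
    for source, t in stream:
        if source == "R":
            if not t["num2"] > constant1:  # selection on R
                continue
            r_table.setdefault(t["num1"], []).append(t)
            pairs = [(t, s) for s in s_table.get(t["num1"], [])]
        else:
            if not t["num2"] > constant2:  # selection on S
                continue
            s_table.setdefault(t["pkey"], []).append(t)
            pairs = [(r, t) for r in r_table.get(t["pkey"], [])]
        for r, s in pairs:
            if f(r["num3"], s["num3"]) > constant3:  # predicate over both tuples
                yield (r["pkey"], s["pkey"], r["pad"])
```

The same result set is produced regardless of the order in which R and S tuples arrive, which is the property that makes the algorithm suitable when neither input is available up front.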

16 Does This Work for Real?
[Figure: Scale-up Performance (1MB source data/node); panels: Simulation, Real Network]

17 Current Status of PIER
- PIER currently runs box-and-arrow dataflow graphs
  - We are working on query optimization
- GUI (called Lighthouse) for query entry and result display
  - Designed primarily for application developers
  - User inputs an optimized physical query plan

18 Future Research
- Routing, Storage and Layering
- Catalogs and Query Optimization
- Hierarchical Aggregations
- Range Predicates
- Continuous Queries over Streams
- Sharing between Queries
- Semi-structured Data

Bonus Material

21 Symmetric Hash Join (SHJ)

22 Fetch Matches (FM)

23 Symmetric Semi Join (SSJ)
- Both R and S are projected to save bandwidth
- The complete R and S tuples are fetched in parallel to improve latency
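The two points above can be sketched as a three-phase procedure: project both tables down to (tuple ID, join key), join the small projections, then fetch the full tuples for each matching pair in parallel. This is a single-process illustration under assumptions: the dictionary stores stand in for remote nodes, the lookups stand in for network fetches, and a plain hash join is used on the projections for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def symmetric_semi_join(r_store, s_store, r_key, s_key):
    """r_store / s_store map a tuple ID to the full tuple (a dict)."""
    # Phase 1: project each table to (tuple ID, join key) to save bandwidth.
    r_proj = [(rid, t[r_key]) for rid, t in r_store.items()]
    s_proj = [(sid, t[s_key]) for sid, t in s_store.items()]

    # Phase 2: join the projections on the join key.
    s_ids_by_key = {}
    for sid, key in s_proj:
        s_ids_by_key.setdefault(key, []).append(sid)
    id_pairs = [(rid, sid)
                for rid, key in r_proj
                for sid in s_ids_by_key.get(key, [])]

    # Phase 3: fetch the complete R and S tuples for each pair in parallel.
    results = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for rid, sid in id_pairs:
            r_future = pool.submit(r_store.__getitem__, rid)
            s_future = pool.submit(s_store.__getitem__, sid)
            results.append((r_future.result(), s_future.result()))
    return results
```

Only tuples that actually participate in the join are shipped in full, at the cost of an extra round of fetches; issuing the R and S fetches together keeps that extra round from doubling the latency.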