1 ICS 214B: Transaction Processing and Distributed Data Management Lecture 18: Data Management in Peer-to-Peer Systems Professor Chen Li Based on slides.

Presentation transcript:

1 ICS 214B: Transaction Processing and Distributed Data Management Lecture 18: Data Management in Peer-to-Peer Systems Professor Chen Li Based on slides developed by Beverly Yang and Hector Garcia-Molina

ICS214BNotes 182 What is P2P? napster gnutella morpheus kazaa bearshare ebay limewire icq fiorana mojo nation jxta united devices open cola uddi process tree can chord ocean store farsite pastry tapestry ? grove netmeeting freenet popular power aim jabber

ICS214BNotes 183 Napster central server join query response get file...

ICS214BNotes 184 Gnutella query

ICS214BNotes 185 PeerCast [Figure: two panels labeled "before" and "after", each showing a UCI source]

ICS214BNotes 186 What is Peer-to-Peer? Definition: Nodes of equal roles exchanging information and services directly Is this a new idea? –IP routing (1970’s) –Mariposa (1980’s) –Distributed Databases! What are people really thinking?

ICS214BNotes 187 Implicit Definition of P2P Scale: millions (billions?) of peers Nature of peers: PC’s Application: lightweight semantics (e.g., file-sharing)

ICS214BNotes 188 P2P vs. Distributed DBMS
Traditional DDBMS Issues:
–Transactions
–Network Partitions
–Distributed Query Optimization
–Interoperation of heterogeneous data sources
–Reliability/failure of nodes
Complex features do not scale

ICS214BNotes 189 P2P vs. Distributed DBMS
Example application: file-sharing
Simple data model and query language
–No complex query optimization
–Easy interoperation
No guarantee on quality of results
–Individual site availability unimportant
Local updates
–No transactions
–Network partitions OK
Simple
Amenable to large-scale network of PCs

ICS214BNotes 1810 Potential Benefits Efficiency: harnessing unused resources Self-organizing Effectively sharing cost of ownership Robustness and availability through replication Anonymity/legal protection

ICS214BNotes 1811 Challenges No authority to enforce behavior Cooperation Unreliability of individual peers Efficiency of distributed operations (absolute resources)

ICS214BNotes 1812 Research Areas Resource Management Security Efficient Search

ICS214BNotes 1813 Resource Management Resource: –Storage/information –CPU processing –bandwidth Issues: –fairness –load balancing

ICS214BNotes 1814 Example: Data Trading [Figure: site 1, site 2, site 3; collections A1, B1, C1, A2, B2, C2; trade labels: B1, A1, B2, A2]

ICS214BNotes 1815 Example: Data Trading [Figure: site 1, site 2, site 3; collections A1, B1, C1, A2, B2, C2; trade labels: B1, A1, C1, A2, C2, B2]

ICS214BNotes 1816 Data Trading
Order of trades impacts availability
Issues:
–Swaps vs. Deeds
–Fixed price vs. bids
–Preference to:
  sites with a lot of space?
  reliable sites?
  “desperate” sites?

ICS214BNotes 1817 Security Issues: –Reputation –Trust –Accountability –Information Preservation –Information Quality –Denial of service attacks Problem: Detecting and punishing bad behavior

ICS214BNotes 1818 Information Preservation Example Policy: make 3 copies of documents [Figure: document A1, "make copies"] What can go wrong?

ICS214BNotes 1819 What Can Go Wrong?
“Bad” site deletes copies
“Bad” site alters copy
“Bad” site publishes fake
“Bad” site makes many copies at other sites
...
[Figure: document A1, "make copies", altered copy A’1, copy A1]

ICS214BNotes 1820 Reputation Systems Peers evaluate each other Good reviews -> Good reputation Bad reviews -> Bad reputation No reviews -> ? Problems –Trustworthiness of reviews –Permanence of identity

ICS214BNotes 1821 Efficiency of Search Problem: finding needle in haystack Efficiency measured in terms of absolute resources consumed

ICS214BNotes 1822 Architecture Hybrid –Centralized index, P2P file storage and transfer Super-peer –A “pure” network of “hybrid” clusters Pure –functionality completely distributed

ICS214BNotes 1823 Goal Develop search techniques for “loose” systems that are Efficient Simple (easy to implement, no hidden costs) Realistically and thoroughly evaluated

ICS214BNotes 1824 Current Techniques: Gnutella Breadth-First Search (BFS) [Figure legend: source, forward query, processed query, found result, forward response]
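The BFS flooding named on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: the graph, the `holders` set, and the function name `bfs_flood` are hypothetical, the depth limit stands in for a TTL, and a node is assumed to forward the query to every neighbor except the one it received it from.

```python
from collections import deque

def bfs_flood(graph, source, holders, ttl):
    """Flood a query from `source` for at most `ttl` hops.

    Returns (results, messages): the (node, depth) pairs that hold a match,
    and the number of query messages sent, duplicates included.
    """
    seen = {source}
    frontier = deque([(source, 0, None)])  # (node, depth, sender)
    results = []
    messages = 0
    while frontier:
        node, depth, sender = frontier.popleft()
        if node != source and node in holders:
            results.append((node, depth))
        if depth == ttl:
            continue  # hop budget exhausted: do not forward further
        for neighbor in graph[node]:
            if neighbor == sender:
                continue  # assumption: never send the query back to its sender
            messages += 1  # redundant copies still cost bandwidth
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1, node))
    return results, messages

# Hypothetical five-node overlay; D and E hold matching files.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
shallow = bfs_flood(graph, "A", {"D", "E"}, ttl=2)
deep = bfs_flood(graph, "A", {"D", "E"}, ttl=3)
```

In this toy overlay the deeper flood finds one more result but also sends more messages, including redundant ones to nodes that have already seen the query.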

ICS214BNotes 1825 Metrics Cost (aggregate) –Bandwidth –Processing Power Quality of Results –Number of results –Satisfaction (true if # results >= X, false otherwise) –Time to satisfaction
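As a rough illustration of the quality metrics above, the following sketch computes satisfaction and time to satisfaction from a list of result arrival times. The function names and the representation of arrivals as plain numbers are assumptions made for this example, and `x` is assumed to be at least 1.

```python
def satisfied(num_results, x):
    """Satisfaction as defined on the slide: true if # results >= X."""
    return num_results >= x

def time_to_satisfaction(arrival_times, x):
    """Time at which the x-th result arrived, or None if never satisfied.

    `arrival_times` holds one timestamp per result, in any order.
    """
    if not satisfied(len(arrival_times), x):
        return None
    return sorted(arrival_times)[x - 1]

# Hypothetical arrival times (seconds after the query was issued).
arrivals = [0.9, 0.2, 0.5]
```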

ICS214BNotes 1826 Iterative Deepening
Interested in satisfaction, not # of results
BFS returns “too many” results -> expensive
Iterative Deepening: common technique to reduce the cost of BFS
–Intuition: A search at a small depth is much cheaper than at a larger depth
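The intuition above can be sketched as a loop over increasing depth limits that stops as soon as the query is satisfied. This is a simplified illustration, not the protocol from the slides: the names `policy`, `holders`, and `results_within` are hypothetical, each iteration simply re-runs a depth-limited search from scratch, and waiting between iterations is not modeled.

```python
from collections import deque

def results_within(graph, source, holders, depth_limit):
    """Nodes in `holders` reachable within `depth_limit` hops of `source`."""
    seen = {source}
    frontier = deque([(source, 0)])
    found = []
    while frontier:
        node, depth = frontier.popleft()
        if node != source and node in holders:
            found.append(node)
        if depth < depth_limit:
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return found

def iterative_deepening(graph, source, holders, policy, x):
    """Try each depth in `policy` in order; stop once at least `x` results exist.

    Returns (depth_used, results); depth_used is None if never satisfied.
    """
    found = []
    for depth_limit in policy:
        found = results_within(graph, source, holders, depth_limit)
        if len(found) >= x:
            return depth_limit, found
    return None, found

# Hypothetical chain overlay A-B-C-D-E; C and E hold matching files.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "E"], "E": ["D"]}
one = iterative_deepening(chain, "A", {"C", "E"}, policy=[1, 2, 4], x=1)
two = iterative_deepening(chain, "A", {"C", "E"}, policy=[1, 2, 4], x=2)
```

When one result is enough, the search stops at depth 2 and never pays for the depth-4 flood.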

ICS214BNotes 1827 Iterative Deepening [Figure legend: source, forward query, processed query, found result, forward response]

ICS214BNotes 1828 Directed BFS Sends query to a subset of neighbors Maintains statistics on neighbors –E.g., ping latency, history of number of results Chooses subset intelligently (via heuristics), to maximize quality of results –E.g., Neighbors with shortest message queue, since long message queue implies neighbor is saturated/dead

ICS214BNotes 1829 Directed BFS [Figure legend: source, forward query, processed query, found result, forward response]

ICS214BNotes 1830 Directed BFS: Heuristics
RAND: (Random)
RES: Returned greatest # results in past
TIME: Had shortest avg. time to satisfaction in past
HOPS: Had smallest avg. # hops for response messages in past
MSG: Sent our client greatest # of messages
QLEN: Shortest message queue
DEG: Highest degree
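One way to picture these heuristics is as sort keys over per-neighbor statistics, with the query sent only to the top-ranked neighbors. The sketch below is an assumption-laden illustration: the `NeighborStats` fields, the `choose_neighbors` function, and the sample numbers are all hypothetical, and RAND is left out because it needs no statistics.

```python
from dataclasses import dataclass

@dataclass
class NeighborStats:
    name: str
    past_results: int      # results returned in the past
    avg_time: float        # average time to satisfaction in the past
    avg_hops: float        # average hops taken by response messages
    messages_sent: int     # messages this neighbor has sent to our client
    queue_length: int      # current message queue length
    degree: int            # number of neighbors this neighbor has

# Each heuristic is a sort key; smaller key means a better neighbor.
HEURISTICS = {
    "RES": lambda s: -s.past_results,
    "TIME": lambda s: s.avg_time,
    "HOPS": lambda s: s.avg_hops,
    "MSG": lambda s: -s.messages_sent,
    "QLEN": lambda s: s.queue_length,
    "DEG": lambda s: -s.degree,
}

def choose_neighbors(stats, heuristic, k):
    """Names of the k best neighbors under the given heuristic."""
    ranked = sorted(stats, key=HEURISTICS[heuristic])
    return [s.name for s in ranked[:k]]

# Hypothetical statistics for three neighbors.
stats = [
    NeighborStats("n1", 40, 2.5, 3.0, 100, 5, 4),
    NeighborStats("n2", 10, 1.0, 2.0, 300, 1, 9),
    NeighborStats("n3", 25, 4.0, 4.0, 50, 3, 2),
]
```

For example, `choose_neighbors(stats, "RES", 2)` picks the two neighbors with the most past results, while `choose_neighbors(stats, "QLEN", 1)` picks the one with the shortest queue.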

ICS214BNotes 1831 Local Indices
Each node maintains index over other nodes’ collections
–r is the radius of the index
–Index covers all nodes within r hops
Can process query at fewer nodes, but get just as many results back
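A minimal sketch of a local index of radius r follows, assuming each node's collection is a set of keywords. The names `nodes_within`, `build_local_index`, and `collections_by_node` are hypothetical, and for simplicity the index here also includes the indexing node's own collection.

```python
from collections import deque

def nodes_within(graph, start, r):
    """All nodes at most r hops from `start`, including `start` itself."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth < r:
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, depth + 1))
    return seen

def build_local_index(graph, collections_by_node, node, r):
    """Map each keyword to the nodes within r hops of `node` that hold it."""
    index = {}
    for member in nodes_within(graph, node, r):
        for keyword in collections_by_node.get(member, ()):
            index.setdefault(keyword, set()).add(member)
    return index

# Hypothetical chain overlay A-B-C-D with keyword collections.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
collections_by_node = {
    "A": {"jazz"},
    "B": {"rock"},
    "C": {"jazz", "blues"},
    "D": {"rock"},
}
index_b = build_local_index(chain, collections_by_node, "B", r=1)
```

With r=1, node B can answer a "jazz" query on behalf of A and C without the query being processed at either of them.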

ICS214BNotes 1832 Local Indices (r=1) [Figure legend: source, forward query, processed query, found result, forward response]

ICS214BNotes 1833 Evaluation Goal: realistic evaluation of techniques Cannot directly evaluate techniques in a real environment Simulation of large-scale distributed systems is hard Use Gnutella as a “laboratory” for gathering data Use analysis driven by query “traces” to project cost

ICS214BNotes 1834 Passive Observation
1. Statistics
–Size of collection
–% redundant messages
2. Sample queries (Q_rep)
[Figure: Gnutella Network; observed messages: Pong, Query]

ICS214BNotes 1835 Gathering Data [Figure: data recorded in response to Query (Q_rep) and Ping messages: # hops traveled, IP address, timestamp, individual result records]

ICS214BNotes 1836 Gathering Data
For each query Q:
L(Q): Length of query string
M(Q,n): # response messages from n hops away
R(Q,n): # results from n hops away
S(Q,n,Z): True if >= Z results received from n hops away
T(Q,Z,W,P): Time to satisfaction
N(Q,n): # nodes n hops away
C(Q,n): # redundant edges n hops away

ICS214BNotes 1837 Example: Trace-driven Cost Projection [Figure legend: source, forward query, processed query, found result, forward response]

ICS214BNotes 1838 Example: Calculating Message Size Use the Gnutella protocol, trace data e.g., Query message consists of: –Gnutella header (22 B) –Options field (2 B) –Query string (L(Q)) –TCP/IP and Ethernet headers (58 B) Total size of Query message for query Q: 82 + L(Q) bytes
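The arithmetic on this slide can be written out directly. The byte counts below are the ones given on the slide; the constant and function names are hypothetical, and L(Q) is assumed here to be the length of the query string in UTF-8 bytes.

```python
# Field sizes as listed on the slide, in bytes.
GNUTELLA_HEADER_BYTES = 22
OPTIONS_FIELD_BYTES = 2
TRANSPORT_HEADERS_BYTES = 58  # TCP/IP and Ethernet headers

FIXED_OVERHEAD_BYTES = (
    GNUTELLA_HEADER_BYTES + OPTIONS_FIELD_BYTES + TRANSPORT_HEADERS_BYTES
)

def query_message_size(query_string):
    """Total size of a Query message: 82 + L(Q) bytes."""
    # Assumption: L(Q) is measured as the UTF-8 byte length of the string.
    return FIXED_OVERHEAD_BYTES + len(query_string.encode("utf-8"))

size = query_message_size("free jazz")  # 9-byte query string
```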

ICS214BNotes 1839 Calculating Cost We know the sizes of each type of message We know # messages sent, for each type of message, for query Q Put together: aggregate bandwidth for Q Similar process to compute aggregate processing power
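Putting the two known quantities together amounts to a weighted sum: for each message type, size times number of messages sent. The sketch below assumes one fixed size per message type; the function name, the message-type labels, and the sample numbers are illustrative only and do not come from the trace data.

```python
def aggregate_bandwidth(message_sizes, message_counts):
    """Total bytes sent for one query.

    `message_sizes` maps message type -> size in bytes;
    `message_counts` maps message type -> number of messages of that type.
    """
    return sum(
        message_sizes[kind] * count for kind, count in message_counts.items()
    )

# Illustrative values only.
sizes = {"query": 91, "response": 150}
counts = {"query": 120, "response": 8}
total_bytes = aggregate_bandwidth(sizes, counts)
```

The same shape of calculation would apply to aggregate processing power, with per-message processing cost in place of per-message size.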

ICS214BNotes 1840 Overall Comparison [Figure: BFS (B), Iterative Deepening (d=5, W=6) (I), Directed BFS (>RES) (D), and Local Indices (r=1) (L) compared on four axes: Time to Satisfy, Prob. of Satisfying, # results, Bandwidth Cost]

ICS214BNotes 1841 Summary: Efficient Search
What we’ve done:
–Proposed techniques to improve performance (kept simple)
–Evaluated techniques using extensive real data
–Improved performance, with tradeoffs
Open issues:
–More efficient!
–Make intelligent use of topology, replication
–Take advantage of heterogeneity (e.g., super-peers)