Data Management in Large-scale P2P Systems

Presentation transcript:

Data Management in Large-scale P2P Systems
Patrick Valduriez, Esther Pacitti
Atlas group, INRIA and LINA, University of Nantes, France

Motivations
- P2P systems
  - Decentralized control, large scale
  - Low-level, simple services: file sharing, computation sharing, communication sharing
- Distributed database systems
  - High-level data management services: queries, transactions, consistency, security, etc.
  - Centralized control, limited scale
- P2P + distributed database: why? how?

Why high-level P2P data sharing?
- Professional community example
  - Medical doctors in a hospital may want to share (some of) their patient data for an epidemiological study
  - They have their own, independent patient descriptions
  - They want to ask queries such as “age and weight of male patients diagnosed with disease X …” over their own descriptions
  - They don’t want to create a database and buy a server

Problem definition
- P2P system
  - No centralized control, very large scale
  - Very dynamic: peers can join and leave the network at any time
  - Peers can be autonomous and unreliable
- Techniques designed for distributed data management no longer apply
  - Too static; they need to be decentralized, dynamic and self-adaptive

Outline
- Data management in distributed systems
- P2P systems
- Data management in P2P systems
- Data management in APPA

Data management basic principle
- Data independence
  - Hide implementation details
- Provision for high-level services
  - Schema
  - Queries (SQL, XQuery)
  - Automatic optimization
  - Transactions
  - Consistency
  - Access control
  - …
[Figure: applications access storage through a logical view (schema)]

Distributed database system (DDBS)
- Distribution transparency
  - Global schema: common data descriptions, distributed data placement
  - Centralized control through global catalog
- Distributed functions
  - Schema mapping
  - Query processing
  - Transaction management
  - Access control
  - Etc.
[Figure: queries and transactions are submitted to the distributed database system, which spans Site 1, Site 2 and Site 3; labels DBMS1, DBMS2]

Scaling up DDBS
- Distributed database systems
  - Enterprise information systems
  - Scale up to tens of databases
- Data integration systems
  - Strong heterogeneity and autonomy of data sources (files, databases, XML documents, ...)
  - Limited functionality (queries)
  - Scale up to hundreds of data sources
- Parallel database systems
  - Focus on high performance and high availability
  - Strong homogeneity
  - Scale up to hundreds of data nodes

A generic P2P system
- A user at a peer may access sharable data at remote peers
[Figure: three peers, each running P2P software over its private and sharable data]

Potential benefits of P2P systems
- Scale up to very large numbers of peers
- Dynamic self-organization
- Load balancing
- Parallel processing
- High availability through massive replication

P2P vs DDBS
                      P2P                        DDBS
Joining the network   Upon peer’s initiative     Controlled by DBA
Queries               No schema, keyword-based   Global schema, static optimization
Query answers         Partial                    Complete
Content location      Using neighbors or DHT     Using directory

Requirements for P2P data management (1)
- Autonomy of peers
  - Peers should be able to join/leave at any time and to control their data with respect to other (trusted) peers
- Query expressiveness
  - Key lookup, keyword search, SQL-like
- Efficiency
  - Efficient use of bandwidth, computing power, storage

Requirements for P2P data management (2)
- Quality of service (QoS)
  - User-perceived efficiency: completeness of results, response time, data consistency, …
- Fault tolerance
  - Efficiency and QoS despite failures
- Security
  - Data access control in the context of very open systems

P2P network topologies
- Unstructured systems, e.g. SETI@home
- Structured (DHT) systems, e.g. CAN, CHORD
- Super-peer (hybrid) systems, e.g. Napster

P2P unstructured network
[Figure: peers 1 to 4, each with its own data, connected by p2p links]
- High autonomy (a peer only needs to know a neighbor to log in)
- Searching by flooding the network: general, inefficient
- High fault tolerance with replication
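
To make the cost of flooding concrete, here is a minimal sketch of a TTL-bounded flood over a toy four-peer topology. The function name `flood_search`, the `neighbors` and `shared` dictionaries and the TTL values are all assumptions made for illustration; they do not come from the slides or from any particular P2P system.

```python
from collections import deque

def flood_search(neighbors, shared, start, key, ttl):
    """Flood a query from `start`; return (peers holding `key`, messages sent)."""
    hits, messages = [], 0
    seen = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        peer, hops_left = frontier.popleft()
        if key in shared.get(peer, ()):
            hits.append(peer)
        if hops_left == 0:
            continue                      # TTL exhausted: do not forward further
        for nxt in neighbors[peer]:
            messages += 1                 # every forwarded copy costs bandwidth
            if nxt not in seen:           # a peer processes a query only once
                seen.add(nxt)
                frontier.append((nxt, hops_left - 1))
    return hits, messages

# Hypothetical topology: peer -> neighbors, and peer -> keys it shares.
neighbors = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
shared = {1: {"a"}, 2: {"b"}, 3: {"c"}, 4: {"b", "d"}}

print(flood_search(neighbors, shared, 1, "b", ttl=2))  # ([2, 4], 6)
print(flood_search(neighbors, shared, 1, "b", ttl=1))  # ([2], 2)
```

With a TTL of 1 the copy held by peer 4 is never found, which illustrates why answers in unstructured networks can be partial; raising the TTL finds it at the price of three times as many messages.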

P2P structured network
- Distributed Hash Table (DHT)
[Figure: peers 1 to 4 connected by p2p links; h(k1)=p1, h(k2)=p2, h(k3)=p3, h(k4)=p4, and each peer stores the data d(k) of its key]
- Efficient exact-match search: O(log n) for put(key,value), get(key)
- Limited autonomy since a peer is responsible for a range of keys
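
The put/get interface can be sketched with a single-process toy in which every peer owns the range of hash values up to its own identifier. The class `ToyDHT`, the helper `h` and the 16-bit identifier space are assumptions for illustration. The sketch also uses a global view of the ring, which a real DHT does not have: there, the O(log n) bound refers to routing hops between peers, which this toy does not model.

```python
import bisect
import hashlib

def h(value, bits=16):
    """Hash a key or peer name onto a small identifier ring (illustrative size)."""
    digest = hashlib.sha1(str(value).encode()).digest()
    return int.from_bytes(digest, "big") % (1 << bits)

class ToyDHT:
    def __init__(self, peer_names):
        self.ring = sorted((h(name), name) for name in peer_names)
        self.ids = [pid for pid, _ in self.ring]
        self.store = {name: {} for name in peer_names}   # per-peer local storage

    def responsible_peer(self, key):
        # First peer whose identifier is >= h(key), wrapping around the ring.
        i = bisect.bisect_left(self.ids, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key, value):
        self.store[self.responsible_peer(key)][key] = value

    def get(self, key):
        return self.store[self.responsible_peer(key)].get(key)

dht = ToyDHT(["peer1", "peer2", "peer3", "peer4"])
dht.put("k1", "d(k1)")
print(dht.responsible_peer("k1"), dht.get("k1"))
```

The peer that stores a key is dictated by the hash function rather than chosen by the peer, which is the loss of autonomy the slide refers to.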

Super-peer network
[Figure: peers 1 to 4, each with its own data, attached to super-peers; links labeled sp2sp, sp2p and p2sp]
- Super-peers can perform complex functions (meta-data management, indexing, access control, etc.)
  - Efficiency and QoS
- Restricted autonomy
- SP = single point of failure => use several

P2P systems comparison
Requirements      Unstructured   DHT   Super-peer
Autonomy          high           low   avg
Query exp.
Efficiency
QoS
Fault-tolerance
Security

Data management in P2P systems
- Current research focuses on
  - Decentralized schema mappings
    - PeerDB: unstructured network, keyword search only
  - Extending DHT for complex querying
    - PIER: exact-match and join queries
  - Query reformulation
    - Edutella: super-peer, RDF-based schemas
    - Piazza: graph of pair-wise schema mappings
- Replication generally limited to static read-only files
  - P-Grid addresses updates in structured networks

Data management in APPA (Atlas P2P Architecture)
- Objectives
  - Scalability, availability and performance
- Main features
  - Network-independent architecture
  - Layered, service-based architecture
  - Replication with semantics-based reconciliation
  - Decentralized schema management
  - Schema-based query support and optimization
  - Peer data caching
- Prototype on JXTA
  - Network-independent P2P services

Network-independent APPA
- Advanced Services: Query Processing, Replication, Cache Management, Security
- Basic Services: Group Membership Management, Consensus Management, P2P Data Management, Peer Management, Peer Communication
- P2P Network: Key-based Storage and Retrieval, Peer ID Assignment, Peer Linking
- Internet ...

Different APPA architectures
[Figure: APPA on a DHT network and on a super-peer network; labels: peer, super-peer, advanced services, basic services, P2P network, P2P data, local data]

Schema management in APPA
- Takes advantage of the collaborative nature of the applications
- Peers that wish to cooperate agree on a Common Schema Description (CSD)
- Given 2 CSD relation definitions, an example of peer mapping at peer p is:
    p:r(A,B,D) ⊆ csd:r1(A,B,C), csd:r2(C,D,E)
- Peer mappings stored as P2P data
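
The peer mapping in this slide can be written down as plain data, which is what makes it storable as P2P data. The sketch below is an assumption-laden illustration: the classes `Relation` and `PeerMapping`, their methods, and the `p2p_data` dictionary standing in for a key-based store are hypothetical names, not APPA's API. It records only which CSD relations a peer relation maps to and does not model the formal semantics of the mapping.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass(frozen=True)
class Relation:
    owner: str          # "p" for the peer, "csd" for the common schema
    name: str
    attributes: tuple

    def __str__(self):
        return f"{self.owner}:{self.name}({','.join(self.attributes)})"

@dataclass(frozen=True)
class PeerMapping:
    peer_relation: Relation
    csd_relations: tuple

    def covers_peer_relation(self):
        """True if every peer attribute appears in at least one CSD relation."""
        available = {a for rel in self.csd_relations for a in rel.attributes}
        return set(self.peer_relation.attributes) <= available

    def join_attributes(self):
        """Attributes shared by pairs of CSD relations (candidate join columns)."""
        shared = set()
        for left, right in combinations(self.csd_relations, 2):
            shared |= set(left.attributes) & set(right.attributes)
        return sorted(shared)

mapping = PeerMapping(
    Relation("p", "r", ("A", "B", "D")),
    (Relation("csd", "r1", ("A", "B", "C")), Relation("csd", "r2", ("C", "D", "E"))),
)

# Stand-in for a key-based P2P store: the mapping is published under a key.
p2p_data = {}
p2p_data["mapping:" + str(mapping.peer_relation)] = mapping

print(mapping.covers_peer_relation())  # True
print(mapping.join_attributes())       # ['C']
```

In the slide's example, attribute C is what links r1 and r2, while A, B and D are the attributes the peer relation exposes.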

Replication in APPA
- Small-world assumption: peers work in smaller groups with time locality
- Lazy multi-master replication
  - n peers can update the same replica
  - Improves read performance and availability
- Replica divergence solved by distributed log-based reconciliation
  - Exploits the P2P data management service
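
The following sketch shows the general shape of lazy multi-master replication with log-based reconciliation. It is not APPA's algorithm: APPA is described as using semantics-based reconciliation, whereas this toy substitutes a simple deterministic replay of the merged logs ordered by (logical clock, peer id), so the last update in that order wins. The names `Replica` and `reconcile` are assumptions for illustration.

```python
class Replica:
    def __init__(self, peer_id):
        self.peer_id = peer_id
        self.data = {}
        self.log = []       # entries: (clock, peer_id, key, value)
        self.clock = 0

    def update(self, key, value):
        """Apply an update locally and log it; nothing is propagated yet (lazy)."""
        self.clock += 1
        self.data[key] = value
        self.log.append((self.clock, self.peer_id, key, value))

def reconcile(replicas):
    """Merge all logs and replay them in one deterministic order on every replica."""
    # Values must be hashable here because duplicate entries are removed with a set.
    merged = sorted({entry for r in replicas for entry in r.log})
    state = {}
    for _clock, _peer, key, value in merged:
        state[key] = value
    top = max((entry[0] for entry in merged), default=0)
    for r in replicas:
        r.data, r.log, r.clock = dict(state), list(merged), top
    return state

a, b = Replica("p1"), Replica("p2")
a.update("x", 1)
b.update("x", 2)          # concurrent update of the same item: replicas diverge
b.update("y", 3)
print(a.data, b.data)     # {'x': 1} {'x': 2, 'y': 3}
print(reconcile([a, b]))  # {'x': 2, 'y': 3}
```

Between reconciliations each master answers reads from its local copy, which is where the read-performance and availability gains come from; the price is temporary divergence.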

Query processing in APPA
- Given a SQL-like query on the peer schema, performs
  - Query reformulation: maps the query onto CSD schemas
  - Query matching: finds relevant peers
  - Query optimization: selects best peers, taking replication into account
  - Query decomposition and execution: exploits parallelism
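
The four phases can be sketched as a small pipeline. Everything named here is an assumption for illustration: the dictionaries `PEER_TO_CSD` and `ADVERTISED`, the cost numbers, the peer names and the four function names are hypothetical and do not reflect APPA's actual interfaces. Queries are reduced to lists of relation names to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical metadata: peer relation -> CSD relations it maps to.
PEER_TO_CSD = {"r": ["r1", "r2"]}
# Hypothetical advertisements: CSD relation -> {peer holding a replica: estimated cost}.
ADVERTISED = {"r1": {"peer2": 5, "peer3": 2}, "r2": {"peer4": 1}}

def reformulate(peer_relations):
    """Query reformulation: rewrite peer relations into CSD relations."""
    return [csd for rel in peer_relations for csd in PEER_TO_CSD[rel]]

def match(csd_relations):
    """Query matching: find the peers that advertise each CSD relation."""
    return {rel: ADVERTISED.get(rel, {}) for rel in csd_relations}

def optimize(candidates):
    """Query optimization: among the replicas, keep the cheapest peer per relation."""
    return {rel: min(peers, key=peers.get) for rel, peers in candidates.items() if peers}

def execute(plan, run_subquery):
    """Decomposition and execution: one subquery per relation, run in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = {rel: pool.submit(run_subquery, peer, rel) for rel, peer in plan.items()}
        return {rel: future.result() for rel, future in futures.items()}

plan = optimize(match(reformulate(["r"])))
print(plan)  # {'r1': 'peer3', 'r2': 'peer4'}
print(execute(plan, lambda peer, rel: f"{rel}@{peer}"))
```

In this sketch a relation that no peer advertises is silently dropped from the plan, so the answer is partial rather than an error; a fuller design would report the missing part to the user.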

Conclusion
- Advanced P2P applications will need high-level data management services
- Various P2P networks will improve
  - Network independence is crucial to exploit and combine them
- Many technical issues
- Important to characterize the applications that can most benefit from P2P with respect to other distributed architectures