Distributed Architectures

Introduction
- Computing everywhere:
  - Desktop, laptop, palmtop
  - Cars, cellphones
  - Shoes? Clothing? Walls?
- Connectivity everywhere:
  - Rapid growth of bandwidth in the interior of the net
  - Broadband to the home and office
  - Wireless technologies such as CDMA, satellite, laser
- There is a need for persistent storage.
- Is it possible to provide an Internet-based distributed storage system?

System Requirements
- Key requirements:
  - Be able to deal with the intermittent connectivity of some computing devices
  - Information must be kept secure from theft and denial-of-service attacks.
  - Information must be extremely durable.
  - Information must be separate from location.
  - Access to information should be uniform and highly available.

Implications of Requirements
- Don't want to worry about backup
- Don't want to worry about obsolescence
- Need lots of resources to make data secure and highly available, BUT don't want to own them
  - Outsourcing of storage already becoming popular
- Pay a monthly fee and your "data is out there"
  - Simple payment interface: one bill from one company

Global Persistent Store
- The persistent store should be characterized as follows:
  - Transparent: permits behavior to be independent of the devices themselves
  - Consistent: allows users to safely access the same information from many different devices simultaneously
  - Reliable: devices can be rebooted or replaced without losing vital configuration information

Applications
- Groupware and personal information management tools
  - Examples: calendar, email, contact lists, distributed design tools
  - Must allow for concurrent updates from many people
  - Users must see an ever-progressing view of shared information even when conflicts occur.
- Digital libraries and repositories for scientific data
  - Require massive quantities of storage.
  - Replication for durability and availability is desirable.

Questions about Persistent Information
- Where is persistent information stored?
  - Want: geographic independence for availability, durability, and freedom to adapt to circumstances
- How is it protected?
  - Want: encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
- Can we make it indestructible?
  - Want: redundancy with continuous repair and redistribution for long-term durability
- Is it hard to manage?
  - Want: automatic optimization, diagnosis and repair
- Who owns the aggregate resources?
  - Want: a utility infrastructure!

OceanStore: Utility-based Infrastructure
(Figure: a federation of providers such as Pac Bell, Sprint, IBM, AT&T, and a Canadian OceanStore exchanging capacity.)
- Transparent data service provided by a federation of companies:
  - Monthly fee paid to one service provider
  - Companies buy and sell capacity from each other

OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
- How many files in the OceanStore?
  - Assume 10^10 people in the world
  - Say 10,000 files/person (very conservative?)
  - So 10^14 files in OceanStore!
  - If files average 1 GB (ok, a stretch), we get roughly 1 mole of bytes!
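
Writing the arithmetic out, using the slide's own assumptions (10^10 users, 10^4 files per user, 10^9 bytes per file):

```latex
10^{10}\ \text{users} \times 10^{4}\ \tfrac{\text{files}}{\text{user}} = 10^{14}\ \text{files},
\qquad
10^{14}\ \text{files} \times 10^{9}\ \tfrac{\text{bytes}}{\text{file}} = 10^{23}\ \text{bytes}
\approx \tfrac{1}{6}\, N_A
```

which is within an order of magnitude of Avogadro's number (6.02 x 10^23), hence "a mole of bytes."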

OceanStore Goals
- Untrusted infrastructure
  - Only clients can be trusted
  - Servers can crash, or leak information to third parties
  - Most of the servers are working correctly most of the time
  - A class of trusted servers can carry out protocols on the clients' behalf (financially liable for the integrity of data)
- Nomadic data access
  - Data can be cached anywhere, anytime (promiscuous caching)
  - Continuous introspective monitoring to locate data close to the user

OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
- Separate information from location
  - Locality is only an optimization (an important one!)
  - Wide-scale coding and replication for durability
- All information is globally identified
  - Unique identifiers are hashes over names and keys
  - A single uniform lookup interface replaces DNS, server location, and data location
  - No centralized namespace required

OceanStore Assumptions
- Untrusted infrastructure:
  - The OceanStore is composed of untrusted components
  - Only ciphertext within the infrastructure
- Responsible party:
  - Some organization (i.e., a service provider) guarantees that your data is consistent and durable
  - Not trusted with the content of data, merely its integrity
- Mostly well-connected:
  - Data producers and consumers are connected to a high-bandwidth network most of the time
  - Exploit multicast for quicker consistency when possible
- Promiscuous caching:
  - Data may be cached anywhere, anytime
- Optimistic concurrency via conflict resolution:
  - Avoid locking in the wide area
  - Applications use an object-based interface for updates

Naming
- At the lowest level, OceanStore objects are identified by a globally unique identifier (GUID).
- A GUID is a pseudorandom fixed-length bit string.
- The GUID does not contain any location information and is not human readable.
  - Passwd vs. 12agfs237fundfhg666abcdefg999ldfnhgga
- Generation of GUIDs is adapted from the concept of a self-certifying pathname, which inherently specifies all information necessary to communicate securely with remote file servers (e.g., network address, public key).

Naming
- Create hierarchies using "directory" objects.
- To allow arbitrary directory hierarchies to be built, OceanStore allows directories to contain pointers to other directories.
- A user can choose several directories as "roots" and secure those directories through external methods, such as a public key authority.
- These root directories are only roots with respect to the clients using them. The system has no root.
- An object's GUID is a secure hash (160-bit SHA-1) of the owner's key and some human-readable name.
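
A minimal sketch of that GUID construction: hash the owner's public key together with a human-readable name. The function and variable names are illustrative, not OceanStore's actual API; Python's hashlib stands in for the system's SHA-1 implementation.

```python
import hashlib

def make_guid(owner_public_key: bytes, name: str) -> str:
    """Illustrative sketch: GUID = SHA-1(owner's key || human-readable name).

    The GUID is self-certifying: anyone holding the owner's public key and
    the name can recompute and verify it, yet the 160-bit string itself
    carries no location information.
    """
    h = hashlib.sha1()
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.hexdigest()  # 160 bits = 40 hex digits

# Example: the same key and name always yield the same GUID.
print(make_guid(b"-----BEGIN PUBLIC KEY----- ...", "Passwd"))
```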

Naming
- Is 160 bits enough?
  - Good enough for now
  - Requires over 2^80 unique objects before collisions become worrisome.
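
For reference, the 2^80 figure is the birthday bound for a 160-bit hash: with n uniformly random GUIDs,

```latex
P_{\text{collision}} \approx \frac{n(n-1)}{2 \cdot 2^{160}}
```

which stays negligible until n approaches 2^80.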

Access Control
- Restricting readers:
  - To prevent unauthorized reads, we encrypt all data in the system that is not completely public and distribute the encryption key to those readers with read permission.
  - To revoke read permission:
    - Data should be deleted from all replicas
    - Data should be re-encrypted
    - New keys should be distributed
  - Clients can still access old data until it is deleted from all replicas.
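
A sketch of that revocation flow, assuming Python's cryptography package (Fernet) as a stand-in for whatever symmetric scheme a real deployment would use; the ReplicatedObject class and its methods are hypothetical, not OceanStore's API.

```python
from cryptography.fernet import Fernet

class ReplicatedObject:
    """Hypothetical sketch: revoke a reader by re-encrypting under a new
    key and distributing that key only to the remaining readers."""

    def __init__(self, plaintext: bytes, readers: set[str]):
        self.key = Fernet.generate_key()
        self.ciphertext = Fernet(self.key).encrypt(plaintext)
        self.readers = readers                 # holders of self.key

    def revoke(self, reader: str) -> None:
        plaintext = Fernet(self.key).decrypt(self.ciphertext)
        self.readers.discard(reader)
        self.key = Fernet.generate_key()       # new key for remaining readers
        self.ciphertext = Fernet(self.key).encrypt(plaintext)   # re-encrypt
        # Distribute self.key to self.readers only. Until every replica
        # replaces its cached copy, the revoked reader can still decrypt
        # old ciphertext with the old key -- exactly the caveat above.
```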

Access Control
- Restricting writers:
  - All writes are signed to prevent unauthorized writes.
  - Validity is checked against Access Control Lists (ACLs).
  - The owner can securely choose the ACL x for an object foo by providing a signed certificate that translates to "Owner says use ACL x for object foo."
  - If A trusts B, B trusts C, and C trusts D, then does A trust D? Depends on the policy! Needs work.

Data Routing and Location
- Entities are free to reside on any of the OceanStore servers.
- This provides maximum flexibility in selecting policies for replication, availability, caching and migration.
- It makes the process of locating and interacting with objects more complex.

Routing
- Address an object by its GUID
  - Message label: GUID plus a random number; no destination IP address.
  - OceanStore routes the message to the closest replica of the GUID.
- OceanStore combines data location and routing:
  - No name service to attack
  - Saves one round trip for location discovery

Two-Tiered Approach to Routing
- Fast, probabilistic search:
  - Built from attenuated Bloom filters
  - Why? This works well if frequently accessed items reside close to where they are being used.
- Slow, guaranteed search based on the Plaxton mesh data structure used for the underlying routing infrastructure:
  - Messages route from node to node along the distributed data structure until the destination is discovered.

Attenuated Bloom Filter
- Fast, probabilistic search algorithm based on Bloom filters.
  - A Bloom filter is a randomized data structure.
  - Strings are stored using multiple hash functions.
  - It can be queried to check the presence of a string.
  - Membership queries result in rare false positives but never false negatives.

Attenuated Bloom Filter
- Bloom filter:
  - Assume hash functions h_1, h_2, ..., h_k.
  - A Bloom filter is a bit vector of length w.
  - On input x, the Bloom filter computes the k hash functions and sets the bits at positions h_1(x), h_2(x), ..., h_k(x).
(Figure: an element X hashed by h_1 through h_k into positions of a bit vector.)

Attenuated Bloom Filter
- To determine whether the set represented by a Bloom filter contains a given element, that element is hashed and the corresponding bits in the filter are examined.
- If all of the bits are set, the set may contain the object.
- The false-positive probability decreases exponentially with a linear increase in the number of hash functions and memory.
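
A minimal Bloom filter sketch matching this description. Deriving the k hash functions from salted SHA-1 digests is an implementation choice of this example, not something the slides prescribe.

```python
import hashlib

class BloomFilter:
    def __init__(self, w: int, k: int):
        self.w, self.k = w, k
        self.bits = [False] * w          # bit vector of length w

    def _hashes(self, item: str):
        # Derive h_1..h_k by salting SHA-1 with the function index.
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d, "big") % self.w

    def add(self, item: str) -> None:
        for pos in self._hashes(item):
            self.bits[pos] = True

    def __contains__(self, item: str) -> bool:
        # May return a false positive, but never a false negative.
        return all(self.bits[pos] for pos in self._hashes(item))

bf = BloomFilter(w=1024, k=4)
bf.add("some-guid")
assert "some-guid" in bf      # always true once added
print("other-guid" in bf)     # usually False; rarely a false positive
```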

Attenuated Bloom Filter
- An attenuated Bloom filter is an array of d Bloom filters.
- Associate each neighbor link with an attenuated Bloom filter.
- The first filter in the array summarizes documents available from that neighbor (one hop along the link).
- The i-th Bloom filter is the union of the Bloom filters of all nodes that are a distance of i hops along the link.

Attenuated Bloom Filter
(Figure: nodes n1 through n4; rounded boxes are local Bloom filters, unrounded boxes are the attenuated Bloom filters on each link; numbered labels 1-5 mark the steps of the query traced in the example below.)

Attenuated Bloom Filter
- Example: assume an object whose GUID hashes to bits 0, 1, and 3 is being searched for. The steps are as follows:
  1) The local Bloom filter for n1 shows that it does not have the object.
  2) The attenuated Bloom filter at n1 is used to determine which neighbor may have the object.
  3) The query moves to n2, whose Bloom filter indicates that it does not have the document.
  4) Examining the attenuated Bloom filter shows that n4 doesn't have the document but that n3 may.
  5) The query is forwarded to n3, which verifies that it has the object.

Attenuated Bloom Filter
- Routing: performing a location query:
  - The querying node examines the first level of each of its neighbors' attenuated Bloom filters.
  - If one of the filters matches, it is likely that the desired data item is only one hop away, and the query is forwarded to the matching neighbor closest to the current node in network latency.
  - If no filter matches, the querying node looks for a match in the second level of every filter.
  - If a match is found, the query is forwarded to the matching neighbor of lowest latency.
  - If the data can't be found, fall back to the deterministic search.
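
A sketch of that query step, reusing the BloomFilter class from the earlier sketch; the data layout (a map from neighbor id to latency plus a d-level filter array) is illustrative.

```python
def route_query(guid: str, neighbors: dict) -> str | None:
    """neighbors maps neighbor id -> (latency_ms, [level-1 filter, ..., level-d filter]).

    Returns the neighbor to forward the query to, or None to fall back
    to the deterministic (Plaxton) search.
    """
    if not neighbors:
        return None
    depth = len(next(iter(neighbors.values()))[1])   # d, the attenuation depth
    for level in range(depth):                       # shallowest level first
        matches = [(latency, n)
                   for n, (latency, filters) in neighbors.items()
                   if guid in filters[level]]
        if matches:
            return min(matches)[1]   # lowest-latency matching neighbor
    return None                      # no match anywhere: deterministic search
```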

Deterministic Routing and Location
- Based on Plaxton trees.
- A Plaxton tree is a distributed tree structure in which every node is the root of a tree.
- Every server in the system is assigned a random node identifier (node-ID).
  - Each object has a root node.
  - f(GUID) = RootID
  - The root node is responsible for storing the object's location.

Deterministic Routing and Location
- Local routing maps, referred to as neighbor maps, are used by each node.
- Nodes keep "nearest" neighbor pointers differing in one digit.
- Each node is connected to other nodes via neighbor links of various levels.
- Level-1 edges from a given node connect to the 16 closest nodes (in network latency) with different values in the lowest digit of their addresses.

Deterministic Routing and Location
- Level-2 edges connect to the 16 closest nodes that match in the lowest digit and have different second digits, and so on.
- Neighbor links provide a route from every node to every other node in the system:
  - Resolve the destination node address one digit at a time, using a level-1 edge for the first digit, a level-2 edge for the second, etc.
- A Plaxton mesh refers to a set of trees in which there is a tree rooted at every node.

Deterministic Routing and Location
- Example: a route from ID 0694 resolves one suffix digit per hop, e.g. 0694 → 0692 → 0642 → 0442 → …
(Figure: the neighbor map for node 0642, one column per level: level-1 entries xxx0-xxx7 differ in the lowest digit; level-2 entries xx02-xx72 share the lowest digit; level-3 entries x042-x742 share the last two digits; level-4 entries 0642-7642 share the last three digits. From "Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing," J. Kubiatowicz and A. Joseph, UC Berkeley Technical Report.)

Global Routing Algorithm
- Move to the closest node to you that shares at least one more digit with the target destination.
- E.g.: 0325 → B4F8 → 9098 → 7598 → 4598, using a level-1 (L1) link for the first hop, an L2 link for the second, and so on.

Deterministic Routing and Location
- This routing method guarantees that any existing unique node in the system will be found in at most log_b N logical hops, in a system with an N-size namespace using IDs of base b.
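
A sketch of digit-by-digit resolution under simplifying assumptions: IDs are fixed-length strings, every neighbor-table entry is populated, and digits are resolved left to right (Plaxton/Tapestry actually resolve suffix digits and use surrogate routing when an entry is empty).

```python
def next_hop(current_id: str, target_id: str, neighbor_table) -> str:
    """neighbor_table[level][digit] -> a nearby node that agrees with
    current_id in the first `level` digits and has `digit` next."""
    # Find the first digit position where current and target disagree.
    level = 0
    while level < len(target_id) and current_id[level] == target_id[level]:
        level += 1
    if level == len(target_id):
        return current_id                       # arrived at the target
    return neighbor_table[level][target_id[level]]

def route(source_id: str, target_id: str, tables: dict) -> list:
    """Route in at most log_b N hops, resolving one digit per hop."""
    node, hops = source_id, [source_id]
    while node != target_id:
        node = next_hop(node, target_id, tables[node])
        hops.append(node)
    return hops
```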

Deterministic Routing and Location
- A server S publishes that it has an object O by routing a message to the "root node" of O.
- The publishing process consists of sending a message toward the root node.
- At each hop along the way, the publish message stores location information in the form of a mapping.
  - These mappings are simply pointers to the server S where O is being stored.

Deterministic Routing and Location
- During a location query, clients send messages to objects. A message destined for O is initially routed toward O's root.
- At each step, if the message encounters a node that contains the location mapping for O, it is immediately redirected to the server containing the object.
- Otherwise, the message is forwarded one step closer to the root.
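
A sketch of publish and locate over the mesh. Here root_of is a stand-in for the f(GUID) = RootID mapping from the earlier slide, route_path can be the route function from the previous sketch, and the pointer store is illustrative.

```python
import hashlib

def root_of(guid: str) -> str:
    """Stand-in for f(GUID) = RootID: map an object to its root node."""
    return hashlib.sha1(guid.encode()).hexdigest()[:4].upper()

def publish(server_id: str, guid: str, pointers: dict, route_path) -> None:
    """Server S announces object O: every hop toward O's root caches O -> S."""
    for node in route_path(server_id, root_of(guid)):
        pointers.setdefault(node, {})[guid] = server_id

def locate(client_id: str, guid: str, pointers: dict, route_path):
    """Route toward O's root, redirecting at the first cached mapping."""
    for node in route_path(client_id, root_of(guid)):
        hit = pointers.get(node, {}).get(guid)
        if hit is not None:
            return hit          # server holding a replica of O
    return None                 # reached the root without finding a mapping
```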

Deterministic Routing and Location
- Where multiple copies of data exist in Plaxton, each node en route to the root node stores the locations of all replicas.
- The replica chosen may be defined by the application.

Achieving Fault Tolerance
- If each object has a single root, then that root becomes a single point of failure, the potential subject of denial-of-service attacks, and an availability problem.
- This is addressed by having multiple roots, which can be achieved by hashing each GUID with a small number of different salt values.
- The result maps to several different root nodes, thus gaining redundancy and making it difficult to target for denial of service.
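
A sketch of deriving multiple roots by salting; the salt set and the digest-to-node mapping are illustrative choices of this example.

```python
import hashlib

def root_ids(guid: str, num_roots: int = 4) -> list[str]:
    """Derive several root IDs for one GUID by hashing it with a small
    set of salts; each digest maps to a different root node, so no single
    failed or attacked root isolates the object."""
    return [hashlib.sha1(f"{salt}:{guid}".encode()).hexdigest()
            for salt in range(num_roots)]

# A query can try each root in turn (or in parallel) until one responds.
print(root_ids("12agfs237fundfhg666abcdefg999ldfnhgga"))
```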

Achieving Fault Tolerance
- This scheme is sensitive to corruption in the links and pointers.
- Bad links can be immediately detected, and routing can continue by jumping to a random neighbor node.
  - Each tree spans every node, hence any node should be able to reach the root.

Update Model and Consistency Resolution
- Depending on the application, there could be a high degree of write sharing.
- We do not really want to do wide-area locking.

Update Model and Consistency Resolution
- The process of conflict resolution starts with a series of updates, chooses a total order among them, then applies them atomically in that order.
- The easiest way to compute this order is to require that all updates pass through a master replica.
- All other replicas can be thought of as read caches.
- Problems:
  - Does not scale
  - Vulnerable to denial-of-service (DoS) attacks
  - Requires trusting a single replica

Update Model and Consistency Resolution
- The notion of distributed consistency solves the DoS and trust issues.
- Replace the master replica with a primary tier of replicas.
- These replicas cooperate with one another in a Byzantine agreement protocol to choose the final commit order for updates.
- Problem:
  - The message traffic needed by Byzantine agreement protocols is high.

Update Model and Consistency Resolution
- Basically, a two-tier architecture is used.
- The primary tier is where replicas use Byzantine agreement.
- A secondary tier acts as a distributed read/write cache.
- The secondary tier is kept up to date via "push" or "pull".

The Path of an Update
- A client creates and signs an update against an object.
- The client sends the update to the primary tier and several random secondary-tier replicas.
- While the primary tier applies the update against the master copy, the secondary tier propagates the update epidemically.
- The primary tier's decision is passed to all replicas.

Update Model and Consistency Resolution
- The secondary tier of replicas communicates among itself and with the primary tier using a protocol that implements an epidemic algorithm.
- Epidemic algorithms propagate updates to all replicas in as few messages as possible.
  - A process P picks another server Q at random and subsequently exchanges updates with Q.
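
One round of the epidemic exchange in sketch form; the update keys and the push-pull exchange are illustrative (real anti-entropy protocols are more careful about what they ship).

```python
import random

def epidemic_round(p_updates: dict, all_replicas: list) -> None:
    """P picks a random peer Q and the two exchange missing updates.

    Updates are keyed by (timestamp, client_id), so every replica can
    sort them into the same tentative serialization order.
    """
    q_updates = random.choice(all_replicas)   # Q's update store
    # Push-pull: each side keeps its own copy and learns what it lacks.
    for key, update in list(p_updates.items()):
        q_updates.setdefault(key, update)
    for key, update in list(q_updates.items()):
        p_updates.setdefault(key, update)

def tentative_order(updates: dict) -> list:
    """Optimistic timestamp order used by the secondary tier until the
    primary tier's Byzantine-agreed commit order arrives."""
    return [updates[k] for k in sorted(updates)]
```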

The Path of an Update
(Figure: an update flowing from the client through the primary and secondary tiers.)

Update Model and Consistency Resolution
- Note that secondary replicas contain both tentative and committed data.
- The epidemic-style communication pattern is used to quickly spread tentative commits among the secondary replicas and to pick a tentative serialization order based on optimistically timestamping the updates.

Byzantine Agreement
- A set of processes needs to agree on a value after one or more processes have proposed what that value (decision) should be.
- Processes may be correct, crashed, or they may exhibit arbitrary (Byzantine) failures.
- Messages are exchanged on a one-to-one basis, and they are not signed.

Problems of Agreement
- The general goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps.
- What if processes exhibit Byzantine failures?
- This is often compared to armies in the Byzantine Empire, in which conspiracies, intrigue, and untruthfulness were alleged to be common in ruling circles.
- A traitor is analogous to a fault.

Byzantine Agreement
- We will illustrate by example with 4 generals, one of whom is a traitor.
- Step 1:
  - Every general sends a (reliable) message to every other general announcing his troop strength.
  - Loyal generals tell the truth.
  - Traitors tell every other general a different lie.
  - Example: general 1 reports 1K troops, general 2 reports 2K troops, general 3 lies to everyone (giving x, y, z respectively), and general 4 reports 4K troops.

Byzantine Agreement
(Figure: step 1, each general announces his troop strength; the traitor sends a different lie to each general.)

- Step 2:
  - The results of the announcements of step 1 are collected together in the form of vectors.

Byzantine Agreement
(Figure: step 2, the vector each general assembles from the announcements.)

- Step 3:
  - Every general passes his vector from the previous step to every other general.
  - Each general thus receives three vectors, one from each other general.
  - General 3 hasn't stopped lying: he invents 12 new values, a through l.

Byzantine Agreement
(Figure: step 3, the vectors relayed between generals, including the traitor's invented values.)

- Step 4:
  - Each general examines the i-th element of each of the newly received vectors.
  - If any value has a majority, that value is put into the result vector.
  - If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.

Byzantine Agreement
- With m faulty processes, agreement is possible only if 2m+1 processes function correctly, for a total of 3m+1 processes.
- If messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible if even one process is faulty.
  - Why? Slow processes are indistinguishable from crashed ones.
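
A sketch of steps 3 and 4 from a global vantage point, using the slide's numbers (generals 1, 2, and 4 are loyal; general 3 is the traitor). Each general takes a majority over the vectors relayed by the other three.

```python
from collections import Counter

def byzantine_round(reported: dict) -> dict:
    """reported[g][i] = the value general g relays for general i's troop
    strength (the step-3 vectors). For each i, take the majority over
    what the other generals relayed (step 4); no majority -> UNKNOWN."""
    generals = list(reported)
    result = {}
    for i in generals:
        voters = [g for g in generals if g != i]
        votes = Counter(reported[g][i] for g in voters)
        value, count = votes.most_common(1)[0]
        result[i] = value if count > len(voters) / 2 else "UNKNOWN"
    return result

# Loyal generals relay the truth; general 3 relays arbitrary lies.
reported = {
    1: {1: 1000, 2: 2000, 3: "x", 4: 4000},
    2: {1: 1000, 2: 2000, 3: "y", 4: 4000},
    3: {1: "a",  2: "b",  3: "z", 4: "c"},   # the traitor
    4: {1: 1000, 2: 2000, 3: "w", 4: 4000},
}
print(byzantine_round(reported))
# {1: 1000, 2: 2000, 3: 'UNKNOWN', 4: 4000} -- the loyal majority wins,
# and no value is ever agreed on for the traitor.
```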

Introspection
- OceanStore uses introspection, which mimics adaptation in biological systems.
- Observation modules monitor the activity of a running system and keep a historical record of system behavior.
- Sophisticated analysis extracts patterns from these observations.
- Why?
  - Cluster recognition
  - Replica management

Status
- System written in Java; runs on Unix.
- Applications used:
- Deployed on approximately 100 host machines at 40 sites in the US, Canada, Europe, Australia and New Zealand.

Prototype Applications
- File server (ran the Andrew file system)
- Primary replicas at UCB, Stanford, Intel Berkeley
  - Reads are faster
  - Writes are slower
- The cost of a write is primarily associated with using Byzantine agreement.
- They modified the Byzantine agreement protocol to use signatures, which does speed up writes, but not as much as hoped.
  - Better for large writes than small writes.

Why Important?
- This body of work introduces the concept of a cooperative utility for global-scale persistent storage.
- It discusses interesting approaches to the problems of data availability and survivability in the face of disaster.
- It investigates the use of introspection for optimization of a self-organizing system.