1
Distributed Architectures
2
Introduction
- Computing everywhere:
  - Desktop, laptop, palmtop
  - Cars, cellphones
  - Shoes? Clothing? Walls?
- Connectivity everywhere:
  - Rapid growth of bandwidth in the interior of the net
  - Broadband to the home and office
  - Wireless technologies such as CDMA, satellite, laser
- There is a need for persistent storage.
- Is it possible to provide an Internet-based distributed storage system?
3
System Requirements
- Key requirements:
  - Be able to deal with the intermittent connectivity of some computing devices
  - Information must be kept secure from theft and denial-of-service attacks
  - Information must be extremely durable
  - Information must be separate from location
  - Access to information should be uniform and highly available
4
Implications of Requirements
- Don't want to worry about backup
- Don't want to worry about obsolescence
- Need lots of resources to make data secure and highly available, BUT don't want to own them
  - Outsourcing of storage is already becoming popular
- Pay a monthly fee and your "data is out there"
  - Simple payment interface: one bill from one company
5
Global Persistent Store
- The persistent store should be:
  - Transparent: permits behavior to be independent of the devices themselves
  - Consistent: allows users to safely access the same information from many different devices simultaneously
  - Reliable: devices can be rebooted or replaced without losing vital configuration information
6
Applications
- Groupware and personal information management tools
  - Examples: calendars, email, contact lists, distributed design tools
  - Must allow for concurrent updates from many people
  - Users must see an ever-progressing view of shared information even when conflicts occur
- Digital libraries and repositories for scientific data
  - Require massive quantities of storage
  - Replication for durability and availability is desirable
7
Questions about Persistent Information
- Where is persistent information stored?
  - Want: geographic independence for availability, durability, and freedom to adapt to circumstances
- How is it protected?
  - Want: encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity
- Can we make it indestructible?
  - Want: redundancy with continuous repair and redistribution for long-term durability
- Is it hard to manage?
  - Want: automatic optimization, diagnosis, and repair
- Who owns the aggregate resources?
  - Want: a utility infrastructure!
8
OceanStore: Utility-based Infrastructure
(Figure: a federation of service providers, e.g., Pac Bell, Sprint, IBM, and AT&T, around a Canadian OceanStore.)
- Transparent data service provided by a federation of companies:
  - Monthly fee paid to one service provider
  - Companies buy and sell capacity from each other
9
OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
- How many files in the OceanStore?
  - Assume 10^10 people in the world
  - Say 10,000 files/person (very conservative?)
  - So 10^14 files in OceanStore!
  - If files are 1 GB each (OK, a stretch), that is 10^23 bytes, on the order of a mole of bytes (a mole is about 6 x 10^23)!
10
OceanStore Goals
- Untrusted infrastructure
  - Only clients can be trusted
  - Servers can crash or leak information to third parties
  - Most of the servers are working correctly most of the time
  - A class of trusted servers can carry out protocols on the clients' behalf (financially liable for the integrity of data)
- Nomadic data access
  - Data can be cached anywhere, anytime (promiscuous caching)
  - Continuous introspective monitoring to locate data close to the user
11
OceanStore: Everyone's Data, One Big Utility
"The data is just out there"
- Separate information from location
  - Locality is only an optimization (an important one!)
  - Wide-scale coding and replication for durability
- All information is globally identified
  - Unique identifiers are hashes over names & keys
  - A single uniform lookup interface replaces DNS, server location, and data location
  - No centralized namespace required
12
OceanStore Assumptions
- Untrusted infrastructure:
  - OceanStore is comprised of untrusted components
  - Only ciphertext within the infrastructure
- Responsible party:
  - Some organization (e.g., a service provider) guarantees that your data is consistent and durable
  - Not trusted with the content of data, merely its integrity
- Mostly well-connected:
  - Data producers and consumers are connected to a high-bandwidth network most of the time
  - Exploit multicast for quicker consistency when possible
- Promiscuous caching:
  - Data may be cached anywhere, anytime
- Optimistic concurrency via conflict resolution:
  - Avoid locking in the wide area
  - Applications use an object-based interface for updates
13
Naming
- At the lowest level, OceanStore objects are identified by a globally unique identifier (GUID).
- A GUID is a pseudorandom, fixed-length bit string.
- The GUID contains no location information and is not human readable.
  - passwd vs. 12agfs237fundfhg666abcdefg999ldfnhgga
- Generation of GUIDs is adapted from the concept of a self-certifying pathname, which inherently specifies all information necessary to communicate securely with remote file servers (e.g., network address, public key).
14
Naming
- Create hierarchies using "directory" objects.
- To allow arbitrary directory hierarchies to be built, OceanStore allows directories to contain pointers to other directories.
- A user can choose several directories as "roots" and secure those directories through external methods, such as a public key authority.
- These root directories are only roots with respect to the clients using them; the system itself has no root.
- An object's GUID is a secure hash (160-bit SHA-1) of the owner's key and some human-readable name.
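A minimal sketch of this GUID generation in Java, assuming SHA-1 over the owner's key bytes concatenated with the name; class and method names are illustrative, not OceanStore's API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class GuidExample {
    // GUID = SHA-1(owner's public key || human-readable name):
    // 160 bits, location-free, and not human readable.
    static String makeGuid(byte[] ownerPublicKey, String name) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        sha1.update(ownerPublicKey);
        sha1.update(name.getBytes(StandardCharsets.UTF_8));
        byte[] digest = sha1.digest(); // 20 bytes = 160 bits
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] key = "example-public-key-bytes".getBytes(StandardCharsets.UTF_8);
        System.out.println(makeGuid(key, "passwd"));
    }
}
```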
15
Naming
- Is 160 bits enough?
  - Good enough for now.
  - Requires over 2^80 unique objects before collisions become worrisome.
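The 2^80 figure is the standard birthday bound for a 160-bit hash; a rough sketch of the estimate:

```latex
% Birthday bound for an L-bit hash over n objects:
\[
  P_{\text{collision}} \approx \frac{n^2}{2^{L+1}}, \qquad
  L = 160,\; n = 2^{80} \;\Longrightarrow\;
  P \approx \frac{(2^{80})^2}{2^{161}} = \frac{1}{2}.
\]
% Well below n = 2^{80}, the probability is negligible.
```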
16
Access Control
- Restricting readers:
  - To prevent unauthorized reads, all data in the system that is not completely public is encrypted, and the encryption key is distributed to those readers with read permission.
  - To revoke read permission:
    - Data should be deleted from all replicas
    - Data should be re-encrypted
    - New keys should be distributed
    - Clients can still access the old data until it is deleted from all replicas
17
Access Control
- Restricting writers:
  - All writes are signed to prevent unauthorized writes.
  - Validity is checked against Access Control Lists (ACLs).
  - The owner can securely choose the ACL x for an object foo by providing a signed certificate that translates to "Owner says use ACL x for object foo."
  - If A trusts B, B trusts C, and C trusts D, does A trust D? It depends on the policy! Needs work.
18
Data Routing and Location
- Entities are free to reside on any of the OceanStore servers.
- This provides maximum flexibility in selecting policies for replication, availability, caching, and migration.
- It also makes the process of locating and interacting with objects more complex.
19
Routing
- Address an object by its GUID
  - Message label: GUID plus a random number; no destination IP address.
  - OceanStore routes the message to the closest replica of that GUID.
- OceanStore combines data location and routing:
  - No name service to attack
  - Saves one round-trip for location discovery
20
Two-Tiered Approach to Routing
- Fast, probabilistic search:
  - Built from attenuated Bloom filters
  - Why? This works well if frequently accessed items reside close to where they are being used.
- Slow, guaranteed search based on the Plaxton mesh data structure used for the underlying routing infrastructure:
  - Messages route from node to node along the distributed data structure until the destination is discovered.
21
Attenuated Bloom Filter
- Fast, probabilistic search algorithm based on Bloom filters.
  - A Bloom filter is a randomized data structure.
  - Strings are stored using multiple hash functions.
  - It can be queried to check the presence of a string.
  - Membership queries result in rare false positives but never false negatives.
22
Attenuated Bloom Filter
- Bloom filter:
  - A Bloom filter is a bit vector of width w.
  - On input, the filter computes k hash functions h_1, h_2, ..., h_k, each selecting one bit position to set.
(Figure: an input X hashed by h_1, h_2, h_3, ..., h_k, setting k bits of the vector to 1.)
23
Attenuated Bloom Filter
- To determine whether the set represented by a Bloom filter contains a given element, the element is hashed and the corresponding bits in the filter are examined.
- If all of those bits are set, the set may contain the object.
- The false-positive probability decreases exponentially with a linear increase in the number of hash functions and memory.
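A minimal Bloom filter sketch in Java matching this description; the double-hashing scheme used to derive the k hash functions is an illustrative assumption:

```java
import java.util.BitSet;

public class BloomFilter {
    private final BitSet bits;
    private final int w, k; // bit-vector width and number of hash functions

    BloomFilter(int w, int k) { this.bits = new BitSet(w); this.w = w; this.k = k; }

    // Derive the i-th hash from two base hashes (standard double hashing;
    // the constants are illustrative, not from OceanStore).
    private int hash(String s, int i) {
        int h1 = s.hashCode();
        int h2 = 0x9E3779B9 ^ h1;
        return Math.floorMod(h1 + i * h2, w);
    }

    void add(String s) {
        for (int i = 0; i < k; i++) bits.set(hash(s, i)); // set k bits
    }

    boolean mayContain(String s) {
        for (int i = 0; i < k; i++)
            if (!bits.get(hash(s, i))) return false; // definite miss: no false negatives
        return true; // possible hit: may be a false positive
    }
}
```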
24
Attenuated Bloom Filter
- Attenuated Bloom filter:
  - An array of d Bloom filters; one array is associated with each neighbor link.
  - The first filter in the array summarizes documents available from that neighbor (one hop along the link).
  - The i-th Bloom filter is the union of the Bloom filters of all nodes at a distance of i hops along that link.
25
Attenuated Bloom Filter
(Figure: four nodes n_1 ... n_4 with their local Bloom filters shown as rounded boxes and their attenuated Bloom filters as unrounded boxes; numbered arrows 1, 2, 3, 4a, 4b, 5 trace the query described on the next slide.)
26
Attenuated Bloom Filter
- Example: assume an object whose GUID hashes to bits 0, 1, and 3 is being searched for. The steps are as follows:
  1) The local Bloom filter of n_1 shows that n_1 does not have the object.
  2) The attenuated Bloom filter at n_1 is used to determine which neighbor may have the object.
  3) The query moves to n_2, whose local Bloom filter indicates that it does not have the document.
  4) Examining the attenuated Bloom filter shows that n_4 does not have the document (a) but that n_3 may (b).
  5) The query is forwarded to n_3, which verifies that it has the object.
27
Attenuated Bloom Filter
- Routing: performing a location query (a sketch of this lookup follows the list).
  - The querying node examines the first level of each of its neighbors' attenuated Bloom filters.
  - If one of the filters matches, it is likely that the desired data item is only one hop away, and the query is forwarded to the matching neighbor closest to the current node in network latency.
  - If no filter matches, the querying node looks for a match in the second level of every filter.
  - If a match is found, the query is forwarded to the matching neighbor of lowest latency.
  - If the data still cannot be found, fall back to the deterministic search.
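A sketch of this level-by-level lookup, reusing the BloomFilter sketch above; the Neighbor type and its latency field are assumptions for illustration:

```java
import java.util.List;

public class AttenuatedLookup {
    static class Neighbor {
        String id;
        int latencyMs;            // network latency to this neighbor
        BloomFilter[] levels;     // levels[0] = one hop away, levels[1] = two hops, ...
    }

    // Returns the neighbor to forward the query to, or null to fall back
    // to the deterministic (Plaxton) search.
    static Neighbor nextHop(String guid, List<Neighbor> neighbors, int depth) {
        for (int level = 0; level < depth; level++) {
            Neighbor best = null;
            for (Neighbor n : neighbors) {
                if (n.levels[level].mayContain(guid)
                        && (best == null || n.latencyMs < best.latencyMs)) {
                    best = n; // matching neighbor of lowest latency at this level
                }
            }
            if (best != null) return best;
        }
        return null; // no filter matched at any level
    }
}
```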
28
Deterministic Routing and Location
- Based on Plaxton trees.
- A Plaxton tree is a distributed tree structure where every node is the root of a tree.
- Every server in the system is assigned a random node identifier (node-ID).
  - Each object has a root node: f(GUID) = RootID.
  - The root node is responsible for storing the object's location.
29
Deterministic Routing and Location
- Each node uses local routing maps, referred to as neighbor maps.
- Nodes keep pointers to the "nearest" neighbors differing in one digit.
- Each node is connected to other nodes via neighbor links of various levels.
- Level-1 edges from a given node connect to the 16 closest nodes (in network latency) with different values in the lowest digit of their addresses.
30
Deterministic Routing and Location
- Level-2 edges connect to the 16 closest nodes that match in the lowest digit and have different second digits, etc.
- Neighbor links provide a route from every node to every other node in the system:
  - Resolve the destination node address one digit at a time, using a level-1 edge for the first digit, a level-2 edge for the second, etc.
- A Plaxton mesh refers to a set of trees in which a tree is rooted at every node.
31
Deterministic Routing and Location
- Example: route from ID 0694 to 5442, resolving one low-order digit per hop:
  0694 -> 0692 -> 0642 -> 0442 -> 5442
- Lookup map for node 0642 (x = wildcard digit; rows shown for MSB-unequal digits 0 through 7):
  Level 1 (0 digits equal): xxx0 xxx1 xxx2 xxx3 xxx4 xxx5 xxx6 xxx7
  Level 2 (1 digit equal):  xx02 xx12 xx22 xx32 xx42 xx52 xx62 xx72
  Level 3 (2 digits equal): x042 x142 x242 x342 x442 x542 x642 x742
  Level 4 (3 digits equal): 0642 1642 2642 3642 4642 5642 6642 7642
(From "Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing," J. Kubiatowicz and A. Joseph, UC Berkeley Technical Report.)
32
Global Routing Algorithm
- Move to the closest node to you that shares at least one more digit with the target destination.
- E.g., routing from 0325 to 4598: 0325 -> B4F8 -> 9098 -> 7598 -> 4598, following L1, L2, L3, and L4 links and matching one more trailing digit at each hop (other nodes in the figure: 1598, 0098, 3E98, 2BB8, 87CA, 2118, D598).
33
Deterministic Routing and Location
- This routing method guarantees that any existing unique node in the system will be found in at most log_b N logical hops, in a system with an N-size namespace using IDs of base b.
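A hedged sketch of the digit-at-a-time next-hop choice (suffix matching, as in the Tapestry example above); the neighbor-map layout is an assumption for illustration:

```java
public class PlaxtonRouting {
    // neighborMap[n][d] = ID of the closest node that shares the last n
    // digits with this node and has digit d in position n+1 from the
    // right (a "level n+1" edge); null if no such node is known.
    static String nextHop(String selfId, String targetId, String[][] neighborMap) {
        int digits = selfId.length();
        int n = 0; // number of trailing digits already matched
        while (n < digits
                && selfId.charAt(digits - 1 - n) == targetId.charAt(digits - 1 - n)) {
            n++;
        }
        if (n == digits) return selfId; // IDs identical: we are the destination
        int d = Character.digit(targetId.charAt(digits - 1 - n), 16);
        return neighborMap[n][d]; // resolves one more digit: <= log_b N hops total
    }
}
```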
34
Deterministic Routing and Location
- A server S publishes that it has an object O by routing a message to the "root node" of O.
- The publishing process consists of sending a message toward the root node.
- At each hop along the way, the publish message stores location information in the form of a mapping.
  - These mappings are simply pointers to the server S where O is being stored.
35
Deterministic Routing and Location
- During a location query, clients send messages to objects. A message destined for O is initially routed toward O's root.
- At each step, if the message encounters a node that contains the location mapping for O, it is immediately redirected to the server containing the object.
- Otherwise, the message is forwarded one step closer to the root.
36
Deterministic Routing and Location
- Where multiple copies of data exist in Plaxton, each node en route to the root node stores the locations of all replicas.
- The replica chosen may be defined by the application.
37
Achieving Fault Tolerance
- If each object has a single root, that root becomes a single point of failure, a potential target of denial-of-service attacks, and an availability problem.
- This is addressed by having multiple roots, achieved by hashing each GUID with a small number of different salt values.
- The result maps to several different root nodes, gaining redundancy and making the object difficult to target for denial of service.
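A small sketch of deriving multiple root IDs by hashing the GUID with salt values; the salt encoding and root count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class SaltedRoots {
    // Derive several root IDs for one GUID: rootId_i = SHA-1(GUID || salt_i).
    static String[] rootIds(String guid, int numRoots) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        String[] roots = new String[numRoots];
        for (int salt = 0; salt < numRoots; salt++) {
            sha1.update(guid.getBytes(StandardCharsets.UTF_8));
            sha1.update((byte) salt);      // small salt value
            byte[] d = sha1.digest();      // digest() also resets the instance
            StringBuilder hex = new StringBuilder();
            for (byte b : d) hex.append(String.format("%02x", b));
            roots[salt] = hex.toString();  // maps to a distinct root node
        }
        return roots;
    }
}
```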
38
Achieving Fault Tolerance
- This scheme is sensitive to corruption in the links and pointers.
- Bad links can be immediately detected, and routing can continue by jumping to a random neighbor node.
  - Each tree spans every node, so any node should be able to reach the root.
39
Update Model and Consistency Resolution
- Depending on the application, there can be a high degree of write sharing.
- We do not really want to do wide-area locking.
40
Update Model and Consistency Resolution
- The process of conflict resolution starts with a series of updates, chooses a total order among them, then applies them atomically in that order.
- The easiest way to compute this order is to require that all updates pass through a master replica.
- All other replicas can be thought of as read caches.
- Problems:
  - Does not scale
  - Vulnerable to denial-of-service (DoS) attacks
  - Requires trusting a single replica
41
Update Model and Consistency Resolution
- The notion of distributed consistency solves the DoS/trust issues.
- Replace the master replica with a primary tier of replicas.
- These replicas cooperate with one another in a Byzantine agreement protocol to choose the final commit order for updates.
- Problem:
  - The message traffic needed by Byzantine agreement protocols is high.
42
Update Model and Consistency Resolution
- Basically, a two-tier architecture is used.
- The primary tier is where replicas run Byzantine agreement.
- A secondary tier acts as a distributed read/write cache.
- The secondary tier is kept up to date via "push" or "pull".
43
The Path of an Update
- The client creates and signs an update against an object.
- The client sends the update to the primary tier and several random secondary-tier replicas.
- While the primary tier applies the update against the master copy, the secondary tier propagates the update epidemically.
- The primary tier's decision is passed to all replicas.
44
Update Model and Consistency Resolution
- The secondary tier of replicas communicates among itself and with the primary tier using a protocol that implements an epidemic algorithm.
- Epidemic algorithms propagate updates to all replicas in as few messages as possible.
  - A process P picks another server Q at random and subsequently exchanges updates with Q (see the sketch below).
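A minimal anti-entropy sketch of that exchange step; updates are reduced to plain strings, and the random-partner policy is an assumption for illustration:

```java
import java.util.*;

public class EpidemicReplica {
    final String name;
    final Set<String> updates = new HashSet<>(); // updates this replica has seen

    EpidemicReplica(String name) { this.name = name; }

    static final Random rng = new Random();

    // One epidemic round: pick a random peer Q and exchange updates,
    // so both replicas end up with the union of their update sets.
    void gossipOnce(List<EpidemicReplica> peers) {
        EpidemicReplica q = peers.get(rng.nextInt(peers.size()));
        if (q == this) return; // picked ourselves; try again next round
        Set<String> fromQ = new HashSet<>(q.updates);
        q.updates.addAll(updates);
        updates.addAll(fromQ);
    }
}
```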
45
The Path of an Update
46
Update Model and Consistency Resolution
- Note that secondary replicas contain both tentative and committed data.
- The epidemic-style communication pattern is used to quickly spread tentative updates among the secondary replicas and to pick a tentative serialization order, based on optimistically timestamping the updates.
47
Byzantine Agreement
- A set of processes need to agree on a value after one or more processes have proposed what that value (decision) should be.
- Processes may be correct, crashed, or may exhibit arbitrary (Byzantine) failures.
- Messages are exchanged on a one-to-one basis and are not signed.
48
Problems of Agreement
- The general goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue, and to establish that consensus within a finite number of steps.
- What if processes exhibit Byzantine failures?
- This is often compared to armies in the Byzantine Empire, in which conspiracies, intrigue, and untruthfulness were alleged to be common in ruling circles.
- A traitor is analogous to a fault.
49
Byzantine Agreement
- We will illustrate by example with 4 generals, one of whom is a traitor.
- Step 1:
  - Every general sends a (reliable) message to every other general announcing his troop strength.
  - Loyal generals tell the truth.
  - Traitors tell every other general a different lie.
  - Example: general 1 reports 1K troops, general 2 reports 2K troops, general 3 lies to everyone (giving x, y, z respectively), and general 4 reports 4K troops.
51
Byzantine Agreement
52
- Step 2:
  - The results of the announcements of step 1 are collected together in the form of vectors (e.g., general 1 collects the vector (1, 2, x, 4)).
53
Byzantine Agreement
54
- Step 3:
  - Every general passes his vector from the previous step to every other general.
  - Each general thus receives three vectors, one from each other general.
  - General 3 hasn't stopped lying: he invents 12 new values, a through l.
55
Byzantine Agreement
56
- Step 4:
  - Each general examines the i-th element of each of the newly received vectors.
  - If any value has a majority, that value is put into the result vector.
  - If no value has a majority, the corresponding element of the result vector is marked UNKNOWN.
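A small sketch of this majority step; the UNKNOWN marker and vector representation are illustrative assumptions:

```java
import java.util.*;

public class MajorityVote {
    static final String UNKNOWN = "UNKNOWN";

    // vectors: the troop-strength vectors relayed by the other generals;
    // n is the vector length (number of generals).
    static String[] decide(List<String[]> vectors, int n) {
        String[] result = new String[n];
        for (int i = 0; i < n; i++) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] v : vectors) counts.merge(v[i], 1, Integer::sum);
            result[i] = UNKNOWN; // default when no value has a strict majority
            for (Map.Entry<String, Integer> e : counts.entrySet())
                if (e.getValue() > vectors.size() / 2) result[i] = e.getKey();
        }
        return result;
    }
}
```

With three relayed vectors per general, a value needs to appear at least twice to win, so the single traitor's invented values never displace a loyal general's reported strength.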
57
Byzantine Agreement
- With m faulty processes, agreement is possible only if 2m+1 processes function correctly, for a total of 3m+1 processes.
- If messages cannot be guaranteed to be delivered within a known, finite time, no agreement is possible if even one process is faulty.
- Why? Slow processes are indistinguishable from crashed ones.
58
Introspection
- OceanStore uses introspection, which mimics adaptation in biological systems.
- Observation modules monitor the activity of the running system and keep a historical record of system behavior.
- Sophisticated analysis is used to extract patterns from these observations.
- Why?
  - Cluster recognition
  - Replica management
59
Status
- System written in Java, running on Unix.
- Applications used: e-mail.
- Deployed on approximately 100 host machines at 40 sites in the US, Canada, Europe, Australia, and New Zealand.
60
Prototype Applications
- File server (ran the Andrew file system)
- Primary replicas at UCB, Stanford, Intel, Berkeley
  - Reads are faster
  - Writes are slower
- The cost of a write is primarily the cost of running Byzantine agreement.
- They modified the Byzantine agreement protocol to use signatures, which does speed up writes, though not as much as hoped.
  - Better for large writes than for small writes.
61
Why Important?
- This body of work introduces the concept of a cooperative utility for global-scale persistent storage.
- It discusses interesting approaches to the problems of data availability and survivability in the face of disaster.
- It investigates the use of introspection to optimize a self-organizing system.