1 CS 268: Lecture 19 A Quick Survey of Distributed Systems 80 slides in 80 minutes!

2 Agenda  Introduction and overview  Data replication and eventual consistency  Bayou: beyond eventual consistency  Practical BFT: fault tolerance  SFS: security

3 To Paraphrase Lincoln....  Some of you will know all of this  And all of you will know some of this  But not all of you will know all of this.....

4 Why Distributed Systems in 268?  You won’t learn it any other place....  Networking research is drifting more towards DS  DS research is drifting more towards the Internet It has nothing to do with the fact that I’m also teaching DS......but for those of you co-enrolled, these slides will look very familiar!

5 Two Views of Distributed Systems  Optimist: A distributed system is a collection of independent computers that appears to its users as a single coherent system  Pessimist: “You know you have one when the crash of a computer you’ve never heard of stops you from getting any work done.” (Lamport)

6 History  First, there was the mainframe  Then there were workstations (PCs)  Then there was the LAN  Then wanted collections of PCs to act like a mainframe  Then built some neat systems and had a vision for future  But the web changed everything

7 Why?  The vision of distributed systems: -Enticing dream -Very promising start, theoretically and practically  But the impact was limited by: -Autonomy (fate sharing, policies, cost, etc.) -Scaling (some systems couldn’t scale well)  The Internet (and the web) provided: -Extreme autonomy -Extreme scale -Poor consistency (nobody cared!)

8 Recurring Theme  Academics like: -Clean abstractions -Strong semantics -Things that prove they are smart  Users like: -Systems that work (most of the time) -Systems that scale -Consistency per se isn’t important  Eric Brewer had the following observations

9 A Clash of Cultures  Classic distributed systems: focused on ACID semantics -A: Atomic -C: Consistent -I: Isolated -D: Durable  Modern Internet systems: focused on BASE -Basically Available -Soft-state (or scalable) -Eventually consistent

10 ACID vs BASE  ACID: -Strong consistency for transactions highest priority -Availability less important -Pessimistic -Rigorous analysis -Complex mechanisms  BASE: -Availability and scaling highest priorities -Weak consistency -Optimistic -Best effort -Simple and fast

11 Why Not ACID+BASE?  What goals might you want from a shared-data system? -C, A, P  Strong Consistency: all clients see the same view, even in the presence of updates  High Availability: all clients can find some replica of the data, even in the presence of failures  Partition-tolerance: the system properties hold even when the system is partitioned

12 CAP Theorem [Brewer]  You can only have two out of these three properties  The choice of which feature to discard determines the nature of your system

13 Consistency and Availability  Comment: -Providing transactional semantics requires all functioning nodes to be in contact with each other (no partition)  Examples: -Single-site and clustered databases -Other cluster-based designs  Typical Features: -Two-phase commit -Cache invalidation protocols -Classic DS style

14 Consistency and Partition-Tolerance  Comment: -If one is willing to tolerate system-wide blocking, then can provide consistency even when there are temporary partitions  Examples: -Distributed locking -Quorum (majority) protocols  Typical Features: -Pessimistic locking -Minority partitions unavailable -Also common DS style Voting vs primary copy

15 Partition-Tolerance and Availability  Comment: -Once consistency is sacrificed, life is easy….  Examples: -DNS -Web caches -Coda -Bayou  Typical Features: -TTLs and lease cache management -Optimistic updating with conflict resolution -This is the “Internet design style”

16 Voting with their Clicks  In terms of large-scale systems, the world has voted with their clicks: -Consistency less important than availability and partition-tolerance

17 Data Replication and Eventual Consistency

18 Replication  Why replication? -Volume of requests -Proximity -Availability  Challenge of replication: consistency

19 Many Kinds of Consistency  Strict: updates happen instantly everywhere  Linearizable: updates happen in timestamp order  Sequential: all updates occur in the same order everywhere  Causal: on each replica, updates occur in a causal order  FIFO: all updates from a single client are applied in order

20 Focus on Sequential Consistency  Weakest model of consistency in which data items have to converge to the same value everywhere  But hard to achieve at scale: -Quorums -Primary copy -TPC -...

21 Is Sequential Consistency Overkill?  Sequential consistency requires that at each stage in time, the operations at a replica occur in the same order as at every other replica  Ordering of writes causes the scaling problems!  Why insist on such a strict order?

22 Eventual Consistency  If all updating stops then eventually all replicas will converge to the identical values

23 Implementing Eventual Consistency Can be implemented with two steps: 1. All writes eventually propagate to all replicas 2. Writes, when they arrive, are written to a log and applied in the same order at all replicas Easily done with timestamps and “undo-ing” optimistic writes
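The two steps above can be sketched in a few lines of Python (a toy illustration of mine, not code from the lecture): each replica appends incoming writes to a log, keeps the log sorted by timestamp, and rebuilds its state by replaying the log, which is the simplest way to "undo" optimistic writes that arrived out of order. Two replicas that see the same writes in different orders still converge.

```python
import dataclasses
from typing import Any, Dict, List, Tuple

@dataclasses.dataclass(frozen=True)
class Write:
    timestamp: Tuple[int, str]   # (logical clock, server id) gives a total order
    key: str
    value: Any

class Replica:
    """Toy replica: applies writes optimistically, re-sorts on late arrivals."""
    def __init__(self) -> None:
        self.log: List[Write] = []
        self.state: Dict[str, Any] = {}

    def receive(self, w: Write) -> None:
        if w in self.log:            # duplicate delivery during gossip is harmless
            return
        self.log.append(w)
        self.log.sort(key=lambda x: x.timestamp)
        # The "undo" of optimistic writes is trivial here: rebuild the state by
        # replaying the whole log in timestamp order.
        self.state = {}
        for entry in self.log:
            self.state[entry.key] = entry.value

# Two replicas see the same writes in different orders...
a, b = Replica(), Replica()
w1 = Write((1, "p"), "x", "old")
w2 = Write((2, "q"), "x", "new")
a.receive(w1); a.receive(w2)
b.receive(w2); b.receive(w1)
assert a.state == b.state == {"x": "new"}   # ...and still converge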

24 Update Propagation  Rumor or epidemic stage: -Attempt to spread an update quickly -Willing to tolerate incomplete coverage in return for reduced traffic overhead  Correcting omissions: -Making sure that replicas that weren’t updated during the rumor stage get the update

25 Rumor Spreading: Push  When a server P has just been updated for data item x, it contacts some other server Q at random and tells Q about the update  If Q doesn’t have the update, then it (after some time interval) contacts another server and repeats the process  If Q already has the update, then P decides, with some probability, to stop spreading the update
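Here is a small simulation of that push scheme (the parameter names and the 25% stop probability are illustrative choices of mine, not values from the lecture); it shows that some residual fraction of servers never hears the rumor, which is what the anti-entropy phase later has to clean up.

```python
import random

def push_gossip(n_servers: int = 1000, stop_prob: float = 0.25, seed: int = 1) -> float:
    """Simulate push rumor spreading; return S, the fraction never updated."""
    rng = random.Random(seed)
    updated = {0}          # server 0 has just been updated for item x
    active = [0]           # servers still pushing the rumor
    while active:
        p = active.pop(rng.randrange(len(active)))
        q = rng.randrange(n_servers)          # contact some other server at random
        if q in updated:
            # Partner already had the update: stop pushing with some probability.
            if rng.random() >= stop_prob:
                active.append(p)
        else:
            updated.add(q)
            active.extend([p, q])             # both keep spreading the rumor
    return 1 - len(updated) / n_servers

print(push_gossip())       # some residual fraction S > 0 is left for anti-entropy
```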

26 Performance of Push Scheme  Not everyone will hear! -Let S be fraction of servers not hearing rumors -Let M be number of updates propagated per server  S = exp{-M}  Note that M depends on the probability of continuing to push rumor  Note that S(M) is independent of algorithm to stop spreading

27 Pull Schemes  Periodically, each server Q contacts a random server P and asks for any recent updates  P uses the same algorithm as before in deciding when to stop telling rumor  Performance: better (next slide), but requires contact even when no updates

28 Variety of Pull Schemes  When to stop telling rumor: (conjectures) -Counter: S ~ exp{-M^3} -Min-counter: S ~ exp{-2^M} (best you can do!)  Controlling who you talk to next -Can do better  Knowing N: -Can choose parameters so that S << 1/N  Spatial dependence

29 Finishing Up  There will be some sites that don’t know after the initial rumor spreading stage  How do we make sure everyone knows?

30 Anti-Entropy  Every so often, two servers compare complete datasets  Use various techniques to make this cheap  If any data item is discovered to not have been fully replicated, it is considered a new rumor and spread again

31 We Don’t Want Lazarus!  Consider server P that goes offline  While offline, data item x is deleted  When server P comes back online, what happens?

32 Death Certificates  Deleted data is replaced by a death certificate  That certificate is kept by all servers for some time T that is assumed to be much longer than required for all updates to propagate completely  But every death certificate is kept by at least one server forever
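A minimal sketch of the tombstone idea, with hypothetical names of my own: a delete writes a timestamped death certificate rather than erasing the entry, so anti-entropy with a stale replica cannot resurrect the deleted item. (Expiring certificates after time T, and keeping every certificate forever at one server, is omitted here.)

```python
from typing import Dict, Tuple

TOMBSTONE = object()       # stands in for a death certificate

class Store:
    """Each entry maps key -> (timestamp, value); a delete writes a tombstone."""
    def __init__(self) -> None:
        self.items: Dict[str, Tuple[float, object]] = {}

    def put(self, key: str, value: object, ts: float) -> None:
        self.items[key] = (ts, value)

    def delete(self, key: str, ts: float) -> None:
        self.items[key] = (ts, TOMBSTONE)     # keep the certificate, don't erase

    def anti_entropy(self, other: "Store") -> None:
        """Pull any newer entries (including tombstones) from the other replica."""
        for key, (ts, val) in other.items.items():
            if key not in self.items or self.items[key][0] < ts:
                self.items[key] = (ts, val)

# Server P goes offline holding x; x is deleted elsewhere in the meantime.
p, q = Store(), Store()
p.put("x", "v1", ts=1.0)
q.put("x", "v1", ts=1.0)
q.delete("x", ts=2.0)
p.anti_entropy(q)          # the tombstone wins: x does not rise from the dead
assert p.items["x"][1] is TOMBSTONE
```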

33 Bayou

34 Why Bayou?  Eventual consistency: strongest scalable consistency model  But not strong enough for mobile clients -Accessing different replicas can lead to strange results  Bayou was designed to move beyond eventual consistency -One step beyond CODA -Session guarantees -Fine-grained conflict detection and application-specific resolution

35 Why Should You Care about Bayou?  Subset incorporated into next-generation WinFS  Done by my friends at PARC

36 Bayou System Assumptions  Variable degrees of connectivity: -Connected, disconnected, and weakly connected  Variable end-node capabilities: -Workstations, laptops, PDAs, etc.  Availability crucial

37 Resulting Design Choices  Variable connectivity  Flexible update propagation -Incremental progress, pairwise communication (anti-entropy)  Variable end-nodes  Flexible notion of clients and servers -Some nodes keep state (servers), some don’t (clients) -Laptops could have both, PDAs probably just clients  Availability crucial  Must allow disconnected operation -Conflicts inevitable -Use application-specific conflict detection and resolution

38 Components of Design  Update propagation  Conflict detection  Conflict resolution  Session guarantees

39 Updates  Client sends update to a server  Identified by a triple: -<commit-stamp, time-stamp, server-ID>  Updates are either committed or tentative -Commit-stamps increase monotonically -Tentative updates have commit-stamp = inf  Primary server does all commits: -It sets the commit-stamp -Commit-stamp different from time-stamp

40 Update Log  Update log in order: -Committed updates (in commit-stamp order) -Tentative updates (in time-stamp order)  Can truncate committed updates, and only keep db state  Clients can request two views: (or other app-specific views) -Committed view -Tentative view
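A rough sketch of such a log in Python (the structure and names are mine, not Bayou's): committed writes sort first in commit-stamp order, tentative writes carry an infinite commit-stamp and sort after them in time-stamp order, and the committed and tentative views are just replays of different prefixes of the ordered log.

```python
import math
from dataclasses import dataclass
from typing import Any, Dict, List

TENTATIVE = math.inf       # tentative writes carry commit-stamp = infinity

@dataclass
class BayouWrite:
    commit_stamp: float    # set by the primary when the write commits
    time_stamp: int
    server_id: str
    key: str
    value: Any

def ordered(log: List[BayouWrite]) -> List[BayouWrite]:
    """Committed writes first (commit-stamp order), then tentative (time-stamp order)."""
    return sorted(log, key=lambda w: (w.commit_stamp, w.time_stamp, w.server_id))

def view(log: List[BayouWrite], committed_only: bool) -> Dict[str, Any]:
    state: Dict[str, Any] = {}
    for w in ordered(log):
        if committed_only and w.commit_stamp == TENTATIVE:
            break          # committed view stops at the first tentative write
        state[w.key] = w.value
    return state

log = [
    BayouWrite(1, 5, "A", "room", "531"),           # committed by the primary
    BayouWrite(TENTATIVE, 7, "B", "room", "405"),   # still tentative
]
print(view(log, committed_only=True))    # {'room': '531'}
print(view(log, committed_only=False))   # {'room': '405'}
```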

41 Bayou System Organization

42 Anti-Entropy Exchange  Each server keeps a version vector: -R.V[X] is the latest timestamp from server X that server R has seen  When two servers connect, exchanging the version vectors allows them to identify the missing updates  These updates are exchanged in the order of the logs, so that if the connection is dropped the crucial monotonicity property still holds -If a server X has an update accepted by server Y, server X has all previous updates accepted by that server
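As a sketch (the data layout is my own simplification): each write is tagged with the server that accepted it and that server's accept-stamp; the receiver's version vector tells the sender exactly which suffix of each server's writes is missing, and sending them in log order preserves the prefix property described above.

```python
from typing import Dict, List, Tuple

# Each write is tagged with the server that accepted it and that server's
# accept-stamp; logs are kept in accept order per server.
Write = Tuple[str, int, str]            # (server_id, accept_stamp, payload)

def version_vector(log: List[Write]) -> Dict[str, int]:
    vv: Dict[str, int] = {}
    for server, stamp, _ in log:
        vv[server] = max(vv.get(server, 0), stamp)
    return vv

def anti_entropy(sender_log: List[Write], receiver_log: List[Write]) -> None:
    """Send every write the receiver has not yet seen, in sender-log order."""
    recv_vv = version_vector(receiver_log)
    for write in sender_log:            # log order preserves the prefix property
        server, stamp, _ = write
        if stamp > recv_vv.get(server, 0):
            receiver_log.append(write)

p_log: List[Write] = [("P", 1, "w1"), ("P", 2, "w2")]
a_log: List[Write] = [("A", 1, "w3")]
anti_entropy(p_log, a_log)
print(version_vector(a_log))            # {'A': 1, 'P': 2}
```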

43 Requirements for Eventual Consistency  Universal propagation: anti-entropy  Globally agreed ordering: commit-stamps  Determinism: writes do not involve information not contained in the log (no time-of-day, process-ID, etc.)

44 Example with Three Servers P [0,0,0] A [0,0,0] B [0,0,0] Version Vectors

45 All Servers Write Independently P [8,0,0] A [0,10,0] B [0,0,9]

46 P and A Do Anti-Entropy Exchange P [8,10,0] A [8,10,0] B [0,0,9] [8,0,0] [0,10,0]

47 P Commits Some Early Writes P [8,10,0] A [8,10,0] B [0,0,9] [8,10,0]

48 P and B Do Anti-Entropy Exchange P [8,10,9] A [8,10,0] B [8,10,9] [8,10,0] [0,0,9]

49 P Commits More Writes P [8,10,9]

50 Bayou Writes  Identifier (commit-stamp, time-stamp, server-ID)  Nominal value  Write dependencies  Merge procedure

51 Conflict Detection  Write specifies the data the write depends on: -Set X=8 if Y=5 and Z=3 -Set Cal(11:00-12:00)=dentist if Cal(11:00-12:00) is null  These write dependencies are crucial in eliminating unnecessary conflicts -If file-level detection was used, all updates would conflict with each other

52 Conflict Resolution  Specified by merge procedure (mergeproc)  When conflict is detected, mergeproc is called -Move appointments to open spot on calendar -Move meetings to open room
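A toy version of a Bayou write with an attached dependency check and merge procedure, using the calendar example (all names here are hypothetical, not Bayou's actual API): the update is applied only if its precondition still holds; otherwise the write's own mergeproc decides what to do instead.

```python
from typing import Callable, Dict, Optional

Calendar = Dict[str, Optional[str]]       # time slot -> who holds it (None = free)

def bayou_write(db: Calendar,
                dependency_check: Callable[[Calendar], bool],
                update: Callable[[Calendar], None],
                mergeproc: Callable[[Calendar], None]) -> None:
    """Apply the update if its precondition holds; otherwise run the merge procedure."""
    if dependency_check(db):
        update(db)
    else:
        mergeproc(db)

def take_first_free_slot(db: Calendar, who: str) -> None:
    """Example mergeproc: move the appointment to any open spot on the calendar."""
    for slot in db:
        if db[slot] is None:
            db[slot] = who
            return

calendar: Calendar = {"11:00": None, "12:00": None}

def book(slot: str, who: str) -> None:
    # "Set Cal(11:00)=dentist if Cal(11:00) is null"
    bayou_write(calendar,
                dependency_check=lambda db: db[slot] is None,
                update=lambda db: db.update({slot: who}),
                mergeproc=lambda db: take_first_free_slot(db, who))

book("11:00", "dentist")
book("11:00", "meeting")      # conflict detected; mergeproc moves it to 12:00
print(calendar)               # {'11:00': 'dentist', '12:00': 'meeting'}
```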

53 Session Guarantees  When a client moves around and connects to different replicas, strange things can happen -Updates you just made are missing -Database goes back in time -Etc.  Design choice: -Insist on stricter consistency -Enforce some “session” guarantees  SGs ensured by client, not by distribution mechanism

54 Read Your Writes  Every read in a session should see all previous writes in that session

55 Monotonic Reads and Writes  A later read should never be missing an update present in an earlier read  Same for writes

56 Writes Follow Reads  If a write W followed a read R at a server X, then at every other server Y -If W is in Y’s database then any writes relevant to R are also there

57 Supporting Session Guarantees  Responsibility of “session manager”, not servers!  Two sets: -Read-set: set of writes that are relevant to session reads -Write-set: set of writes performed in session  Causal ordering of writes -Use Lamport clocks
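A sketch of how a session manager might enforce these guarantees (the class and method names are mine, not Bayou's): it records the writes relevant to the session as version vectors and only uses a server whose own version vector dominates them.

```python
from typing import Dict

VV = Dict[str, int]    # version vector: server_id -> highest accept-stamp seen

def dominates(server_vv: VV, required: VV) -> bool:
    """True if the server has seen every write recorded in `required`."""
    return all(server_vv.get(k, 0) >= v for k, v in required.items())

class SessionManager:
    """Tracks relevant writes; refuses servers that would violate the guarantees."""
    def __init__(self) -> None:
        self.read_set: VV = {}     # writes relevant to reads in this session
        self.write_set: VV = {}    # writes performed in this session

    def ok_to_read(self, server_vv: VV) -> bool:
        # Read Your Writes + Monotonic Reads
        return dominates(server_vv, self.write_set) and dominates(server_vv, self.read_set)

    def ok_to_write(self, server_vv: VV) -> bool:
        # Writes Follow Reads + Monotonic Writes
        return dominates(server_vv, self.read_set) and dominates(server_vv, self.write_set)

session = SessionManager()
session.write_set = {"P": 8}                       # we wrote through server P
print(session.ok_to_read({"P": 8, "A": 10}))       # True: replica has our write
print(session.ok_to_read({"A": 10}))               # False: would miss our own write
```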

58 Practical Byzantine Fault Tolerance Only a high-level summary

59 The Problem  Ensure correct operation of a state machine in the face of arbitrary failures  Limitations: -no more than f failures, where n>3f -messages can’t be indefinitely delayed

60 Basic Approach  Client sends request to primary  Primary multicasts request to all backups  Replicas execute request and send reply to client  Client waits for f+1 replies that agree Challenge: make sure replicas see requests in order

61 Algorithm Components  Normal case operation  View changes  Garbage collection  Recovery

62 Normal Case  When primary receives request, it starts 3-phase protocol  pre-prepare: accepts request only if valid  prepare: multicasts prepare message and, if 2f prepare messages from other replicas agree, multicasts commit message  commit: commit if 2f+1 agree on commit
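The quorum arithmetic is the heart of the normal case; the sketch below (a simplification with hypothetical names, ignoring signatures, view changes, and watermarks) just tallies matching prepare and commit messages per (view, sequence number, digest), and shows the client-side rule of waiting for f+1 matching replies.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

class ReplicaState:
    """Toy tally of the PBFT normal-case quorums for one (view, seq, digest)."""
    def __init__(self, f: int) -> None:
        self.f = f
        self.prepares: Dict[Tuple, Set[str]] = defaultdict(set)
        self.commits: Dict[Tuple, Set[str]] = defaultdict(set)

    def on_prepare(self, key: Tuple, sender: str) -> bool:
        self.prepares[key].add(sender)
        # Prepared once 2f prepare messages from other replicas agree
        # (the matching pre-prepare check is omitted in this sketch).
        return len(self.prepares[key]) >= 2 * self.f

    def on_commit(self, key: Tuple, sender: str) -> bool:
        self.commits[key].add(sender)
        # Committed once 2f + 1 matching commit messages have been collected.
        return len(self.commits[key]) >= 2 * self.f + 1

def client_accepts(replies: Dict[str, str], f: int) -> bool:
    """The client acts once f + 1 replicas return the same result."""
    counts: Dict[str, int] = defaultdict(int)
    for result in replies.values():
        counts[result] += 1
    return max(counts.values(), default=0) >= f + 1

f = 1                                   # tolerate one faulty replica (n = 3f + 1 = 4)
print(client_accepts({"r0": "ok", "r1": "ok", "r3": "bad"}, f))   # True
```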

63 View Changes  Changes primary  Required when primary malfunctioning

64 Communication Optimizations  Send only one full reply: rest send digests  Optimistic execution: execute prepared requests  Read-only operations: multicast from client, and executed in current state

65 Most Surprising Result  Very little performance loss!

66 Secure File System (SFS)

67 Secure File System (SFS)  Developed by David Mazieres while at MIT (now NYU)  Key question: how do I know I’m accessing the server I think I’m accessing?  All the fancy distributed systems performance work is irrelevant if I’m not getting the data I wanted  Several current stories about why I believe I’m accessing the server I want to access

68 Trust DNS and Network  Someone I trust hands me server name:  Verisign runs root servers for .com, directs me to DNS server for foo.com  I trust that packets sent to/from DNS and to/from server are indeed going to the intended destinations

69 Trust Certificate Authority  Server produces certificate (from, for example, Verisign) that attests that the server is who it says it is.  Disadvantages: -Verisign can screw up (which it has) -Hard for some sites to get meaningful Verisign certificate

70 Use Public Keys  Can demand proof that server has private key associated with public key  But how can I know that public key is associated with the server I want?

71 Secure File System (SFS)  Basic problem in normal operation is that the pathname (given to me by someone I trust) is disconnected from the public key (which will prove that I’m talking to the owner of the key).  In SFS, tie the two together. The pathname given to me automatically certifies the public key!

72 Self-Certifying Path Name  LOC: DNS or IP address of server, which has public key K  HID: Hash(LOC,K)  Pathname: local pathname on server  Format: /sfs/LOC:HID/pathname  Example: /sfs/sfs.vu.sc.nl:ag62hty4wior450hdh63u623i4f0kqere/home/steen/mbox
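A minimal illustration of the idea (real SFS specifies exactly which fields are hashed and uses its own base-32 encoding; the hash choice and truncation below are mine): the HID is derived from the location and the server's public key, so the path itself tells the client which key the server must prove it holds.

```python
import hashlib

def host_id(loc: str, public_key: bytes) -> str:
    """HID = Hash(LOC, K); real SFS uses SHA-1 with its own base-32 encoding."""
    return hashlib.sha1(loc.encode() + public_key).hexdigest()[:32]

def self_certifying_path(loc: str, public_key: bytes, pathname: str) -> str:
    return f"/sfs/{loc}:{host_id(loc, public_key)}{pathname}"

# Whoever hands me this path has implicitly told me which key to expect:
print(self_certifying_path("sfs.vu.sc.nl", b"<server public key bytes>", "/home/steen/mbox"))
# A client accepts the server only if hashing LOC together with the key the
# server presents reproduces the HID embedded in the path.
```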

73 SFS Key Point  Whatever directed me to the server initially also provided me with enough information to verify their key  This design separates the issue of who I trust (my decision) from how I act on that trust (the SFS design)  Can still use Verisign or other trusted parties to hand out pathnames, or could get them from any other source

74 SUNDR  Developed by David Mazieres  SFS allows you to trust nothing but your server -But what happens if you don’t even trust that? -Why is this a problem?  P2P designs: my files on someone else’s machine  Corrupted servers: sourceforge hacked -Apache, Debian, Gnome, etc.

75 Traditional File System Model  Clients send read and write requests to server  Server responds to those requests  Client/Server channel is secure, so attackers can’t modify requests/responses  But no way for clients to know if server is returning correct data  What if server isn’t trustworthy?

76 Byzantine Fault Tolerance  Can only protect against a limited number of corrupt servers

77 SUNDR Model V1  Clients send digitally signed requests to server  Server returns log of these requests -Server doesn’t compute anything -Server doesn’t know any keys  Problem: server can drop some updates from log, or reorder them

78 SUNDR Model V2  Have clients sign log, not just their own request  Only bad thing a server can do is a fork attack: -Keep two separate copies, and only show one to client 1 and the other to client 2  This is hopelessly inefficient, but various tricks can solve the efficiency problem
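A sketch of why signing the whole log limits the server to fork attacks (the class and field names are hypothetical, and real SUNDR is far more efficient than this): each client remembers the last log it signed and accepts a new log only if it extends that one, so the only remaining cheat is to show different clients diverging logs, i.e. a fork.

```python
import hashlib
import json
from typing import Dict, List

def log_digest(log: List[Dict[str, str]]) -> str:
    """Stand-in for the client's signature over the entire operation log."""
    return hashlib.sha256(json.dumps(log, sort_keys=True).encode()).hexdigest()

class SundrClient:
    def __init__(self) -> None:
        self.last_seen: List[Dict[str, str]] = []   # the log this client last signed

    def check_and_sign(self, server_log: List[Dict[str, str]]) -> bool:
        # The server may withhold new operations, but it cannot drop or reorder
        # anything this client already signed without being caught here.
        if server_log[:len(self.last_seen)] != self.last_seen:
            return False
        self.last_seen = list(server_log)
        return True

c = SundrClient()
log = [{"op": "write", "file": "a", "by": "client1"}]
assert c.check_and_sign(log)          # accepted; client "signs" log_digest(log)
assert not c.check_and_sign([])       # server tried to rewrite history: rejected
print(log_digest(log)[:16])
```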