Ken Birman, Cornell University. Sept 24, 2009. Cornell Dept of Computer Science Colloquium.

Presentation transcript:

Ken Birman, Cornell University


The “realtime web”: simple ways to create and share collaboration and social-network applications.
- Examples: Live Objects, Google “Wave”, JavaScript/AJAX, Silverlight, JavaFX, Adobe Flex and AIR, etc.

- Cloud computing entails building massive distributed systems
- They use replicated data, sharded relational databases, parallelism
- Brewer’s “CAP theorem”: must sacrifice Consistency for Availability & Performance
- Cloud providers believe this theorem
- My view: we gave up on consistency too easily. Long ago, we knew how to build reliable, consistent distributed systems.

- Partly, superstition…
- … albeit backed by some painful experiences

Don’t believe me? Just ask the people who really know…

- As described by Randy Shoup at LADIS 2008, thou shalt…
  1. Partition Everything
  2. Use Asynchrony Everywhere
  3. Automate Everything
  4. Remember: Everything Fails
  5. Embrace Inconsistency

- Werner Vogels is CTO at Amazon.com…
- His first act? He banned reliable multicast*!
- Amazon was troubled by platform instability
- Vogels decreed: all communication via SOAP/TCP
- This was slower… but stability matters more than speed

* Amazon was (and remains) a heavy pub-sub user

- The key to scalability is decoupling, the loosest possible synchronization
- Any synchronized mechanism is a risk
- His approach: create a committee. Anyone who wants to deploy a highly consistent mechanism needs committee approval…
- They don’t meet very often

- Applications are structured as stateless tasks
- Azure decides when and how much to replicate them, and can pull the plug as often as it likes
- Any consistent state lives in backend servers running SQL Server… but the application design tools encourage developers to run locally if possible

Consistency technologies just don’t scale! (P2P 2009, Seattle, Washington, Sept 11, 2009)

- This is the common thread
- All three of them (and Microsoft too) really build massive data centers that work
- And all are opposed to “consistency mechanisms”

A consistent distributed system will often have many components, but users observe behavior indistinguishable from that of a single-component reference system.
[Figure: reference model vs. implementation]

- They reason this way:
- Systems that make guarantees put those guarantees first and struggle to achieve them
- For example, any reliability property forces a system to retransmit lost messages, use acks, etc.
- But modern computers often become unreliable as a symptom of overload… so these consistency mechanisms will make things worse, by increasing the load just when we want to ease off!
- So consistency (of any kind) is a “root cause” for meltdowns, oscillations, thrashing

- Transactions that update replicated data
- Atomic broadcast or other forms of reliable multicast protocols
- Distributed 2-phase locking mechanisms

- Our systems become “eventually” consistent but can lag far behind reality
- Thus application developers are urged not to assume consistency, and to avoid anything that will break if inconsistency occurs

- Synchronous runs: indistinguishable from a non-replicated object that saw the same updates (like Paxos)
- Virtually synchronous runs are indistinguishable from synchronous runs
[Figure: a non-replicated reference execution, a synchronous execution, and a virtually synchronous execution, all applying the updates A=3, B=7, B=B-A, A=A+1]
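
To make the “indistinguishable” claim concrete, here is a minimal, self-contained C# sketch (simulated in memory, not the Isis API): every replica applies the same updates in the same order, so each one ends in the same state as the non-replicated reference execution, A=4 and B=4 for the sequence in the figure.

    using System;
    using System.Collections.Generic;

    // Simulated replicas applying one totally ordered update log (illustration only).
    class Replica
    {
        public int A, B;
    }

    class VirtualSynchronyDemo
    {
        static void Main()
        {
            // The ordered update sequence from the figure.
            var updates = new List<Action<Replica>> {
                r => r.A = 3,
                r => r.B = 7,
                r => r.B = r.B - r.A,
                r => r.A = r.A + 1,
            };

            var reference = new Replica();                                        // non-replicated reference
            var replicas = new[] { new Replica(), new Replica(), new Replica() }; // the replicated object

            foreach (var u in updates)
            {
                u(reference);
                foreach (var r in replicas) u(r);   // same updates, same order at every replica
            }

            // Every replica is indistinguishable from the reference: A=4, B=4.
            Console.WriteLine($"reference: A={reference.A} B={reference.B}");
            foreach (var r in replicas)
                Console.WriteLine($"replica:   A={r.A} B={r.B}");
        }
    }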

- During the 1990s, Isis was a big success
- The French Air Traffic Control System, the New York Stock Exchange, and the US Navy AEGIS are some blue-chip examples that used (or still use!) Isis
- But there were hundreds of less high-profile users
- However, it was not a huge commercial success
- The focus was on server replication, and in those days few companies had big server pools

- Leaving a collection of weaker products that, nonetheless, were sometimes highly toxic
- For example, publish-subscribe message bus systems that use IPMC are notorious for massive disruption of data centers!
- Among systems with strong consistency models, only Paxos is widely used in cloud systems (but its role is strictly for locking)

- Inconsistency causes bugs
- Clients would never be able to trust servers… a free-for-all
- Weak or “best effort” consistency?
- Strong security guarantees demand consistency
- Would you trust a medical electronic-health-records system or a bank that used “weak consistency” for better scalability?
[Cartoon: Jason Fane Properties and Tommy Tenant, Sept 2009. “My rent check bounced? That can’t be right!”]

- To reintroduce consistency we need:
- A scalable model
  ▪ Should this be the Paxos model? The old Isis one?
- A high-performance implementation
  ▪ Can handle massive replication for individual objects
  ▪ Can handle massive numbers of objects
  ▪ Won’t melt down under stress
  ▪ Not prone to oscillatory instabilities or resource exhaustion problems

- I’m reincarnating group communication!
- Basic idea: imagine the distributed system as a world of “live objects,” somewhat like files
- They float in the network and hold data when idle
- Programs “import” them as needed at runtime
  ▪ The data is replicated, but every local copy is accurate
  ▪ Updates and locking via distributed multicast; reads are purely local; failure detection is automatic & trustworthy

- A library… highly asynchronous…

    Group g = new Group("/amazon/something");
    g.register(UPDATE, myUpdtHandler);
    g.cast(UPDATE, "John Smith", new_salary);

    public void myUpdtHandler(string empName, double salary) {
        // runs at every member when the multicast is delivered
        ....
    }

- Just ask all the members to do “their share” of the work:

    Replies = g.query(LOOKUP, "Name=*Smith");
    g.callback(myReplyHndlr, Replies, typeof(double));

    public void lookup(string who) {
        // divide the work into viewSize() chunks;
        // this replica searches chunk # getMyRank()
        reply(myAnswer);
    }

    public void myReplyHndlr(double[] whatTheyFound) {
        ...
    }

    Group g = new Group("/amazon/something");
    g.register(LOOKUP, myLookup);

    Replies = g.query(LOOKUP, "Name=*Smith");
    g.callback(myReplyHndlr, Replies, typeof(double));

    public void myLookup(string who) {
        // divide the work into viewSize() chunks;
        // this replica searches chunk # getMyRank()
        ...
        reply(myAnswer);
    }

    public void myReplyHndlr(double[] fnd) {
        foreach (double d in fnd)
            avg += d;
        ...
    }

- The group is just an object
- The user doesn’t experience sockets… marshalling… preprocessors… protocols…
- As much as possible, they just provide arguments as if this were a kind of RPC, but with no preprocessor
- Sometimes they provide a list of types and Isis does a callback
- Groups have replicas… handlers… a “current view” in which each member has a “rank”

- Can’t we just use Paxos?
- In recent work (a collaboration with MSR Silicon Valley) we’ve merged the models. Our model “subsumes” both…
- This new model is more flexible:
  ▪ Paxos is really used only for locking.
  ▪ Isis can be used for locking, but can also replicate data at very high speeds, with dynamic membership, and support other functionality.
  ▪ Isis 2 will be much faster than Paxos for most group replication purposes (1000x or more)

[Building a Dynamic Reliable Service. Ken Birman, Dahlia Malkhi and Robbert van Renesse. Available as a 2009 technical report; in submission to SOCC 10 and ACM Computing Surveys.]

- Unbreakable TCP connections that terminate in groups
  ▪ [Burgess ’10] describes Robert Burgess’ new r-TCP solution
- Groups that use some form of state machine replication scheme
- State transfer and persistence
- Locking and other coordination paradigms (see the sketch after this list)
- 2PC and transactional 1-copy serializability
- Publish-subscribe with topic or content filtering (or both)
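
As an illustration of how a coordination paradigm such as locking can be layered over ordered group multicast, here is a minimal, self-contained C# sketch (the message names and the in-memory delivery loop are invented for this example; this is not the Isis 2 locking service): because every member sees LOCK and RELEASE messages in the same total order, all members agree on who holds the lock at every point.

    using System;
    using System.Collections.Generic;

    // Sketch: a lock built on totally ordered delivery. Each member runs the same
    // handler over the same delivery sequence, so the lock holder is agreed upon.
    class LockOverOrderedMulticast
    {
        static readonly Queue<string> waiters = new Queue<string>();  // requests, in delivery order
        static string holder = null;

        static void OnDeliver(string kind, string member)
        {
            if (kind == "LOCK")
            {
                if (holder == null) holder = member;   // first request in the total order wins
                else waiters.Enqueue(member);
            }
            else if (kind == "RELEASE" && member == holder)
            {
                holder = waiters.Count > 0 ? waiters.Dequeue() : null;
            }
            string shown = holder == null ? "(none)" : holder;
            Console.WriteLine($"after {kind} from {member}: holder = {shown}");
        }

        static void Main()
        {
            // One possible total order of delivered messages.
            OnDeliver("LOCK", "A");     // A acquires the lock
            OnDeliver("LOCK", "B");     // B waits
            OnDeliver("RELEASE", "A");  // the lock passes to B
            OnDeliver("RELEASE", "B");  // the lock is free again
        }
    }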

- Isis 2 has a lot in common with an operating system and is internally very complex
  ▪ The distributed communication layer manages multicast, flow control, reliability, and failure sensing
  ▪ Agreement protocols track group membership, maintain group views, and implement virtual synchrony
  ▪ Infrastructure services build messages, handle callbacks, and keep groups healthy

- To scale really well we need to take full advantage of the hardware: IPMC
- But IPMC was the root cause of the oscillation shown on the prior slide

- Traditional IPMC systems can overload the router and melt down
- The issue is that routers have a small “space” for active IPMC addresses
- In [Vigfusson et al. ’09] we show how to use optimization to manage the IPMC space
- In effect, it merges similar groups while respecting limits on the routers and switches
[Graph: the system melts down at ~100 groups]

- Algorithm by Vigfusson and Tock [HotNets 09, LADIS 2008, submission to Eurosys 10]
- Uses a k-means clustering algorithm
- The generalized problem is NP-complete
- But the heuristic works well in practice

- Assign IPMC and unicast addresses such that:
  ▪ receiver filtering stays below a threshold (hard constraint)
  ▪ network traffic is minimized
  ▪ the number of IPMC addresses stays below a limit (hard constraint)
- Prefers sender load over receiver load
- Intuitive control knobs as part of the policy

[Figure sequence: topics in “user-interest” space. Each group, e.g. the FGIF Beer Group (1,1,1,1,1,0,1,0,1,0,1,1) or Free Food (0,1,1,1,1,1,1,0,0,1,1,1), is a binary vector of member interest. Groups that lie close together can be merged onto one IPMC address at some filtering cost (bounded by a MAX) plus a sending cost; groups that don’t fit are mapped to unicast. A simplified sketch of the merging step follows.]
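
A simplified, self-contained C# sketch of that merging step (a greedy stand-in for the k-means heuristic, not the Dr. Multicast code; the interest vectors, the similarity threshold, and the address budget below are invented for illustration): topics whose receiver sets overlap strongly share one physical IPMC address, and clusters beyond the hard address budget fall back to unicast.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class GroupMerging
    {
        // Jaccard similarity of two binary interest vectors.
        static double Similarity(int[] a, int[] b)
        {
            int union = 0, intersection = 0;
            for (int i = 0; i < a.Length; i++)
            {
                if (a[i] == 1 || b[i] == 1) union++;
                if (a[i] == 1 && b[i] == 1) intersection++;
            }
            return union == 0 ? 1.0 : (double)intersection / union;
        }

        static void Main()
        {
            var topics = new Dictionary<string, int[]> {
                { "FGIF Beer Group", new[] { 1,1,1,1,1,0,1,0,1,0,1,1 } },
                { "Free Food",       new[] { 0,1,1,1,1,1,1,0,0,1,1,1 } },
                { "Ops Alerts",      new[] { 1,0,0,0,0,0,0,1,0,0,0,0 } },
            };
            const double mergeThreshold = 0.6;   // bounds the tolerated receiver filtering
            const int ipmcBudget = 1;            // hard limit on physical IPMC addresses

            // Greedily place each topic in the first cluster it resembles closely enough.
            var clusters = new List<List<string>>();
            foreach (var topic in topics)
            {
                var home = clusters.FirstOrDefault(
                    c => Similarity(topics[c[0]], topic.Value) >= mergeThreshold);
                if (home != null) home.Add(topic.Key);
                else clusters.Add(new List<string> { topic.Key });
            }

            // Clusters within the budget get an IPMC address; the rest use unicast.
            for (int i = 0; i < clusters.Count; i++)
            {
                string transport = i < ipmcBudget ? "IPMC 239.1.1." + (i + 1) : "unicast (UDP)";
                Console.WriteLine(transport + ": " + string.Join(", ", clusters[i]));
            }
        }
    }

Here the “FGIF Beer Group” and “Free Food” vectors overlap enough to be merged onto one multicast address, while the dissimilar “Ops Alerts” topic exceeds the budget and is served by unicast.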

- Processes use “logical” IPMC addresses
- Dr. Multicast transparently maps these to true IPMC addresses or to 1:1 UDP sends
[Figure: processes bound to logical IPMC (L-IPMC) addresses; the heuristic maps each logical group to a physical multicast address or to point-to-point sends]
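
A minimal sketch of that mapping layer (hypothetical names and addresses, not the Dr. Multicast implementation): each logical group resolves either to a physical IPMC address or to its member list, and a send either multicasts once or degenerates into one UDP send per member.

    using System;
    using System.Collections.Generic;

    // Sketch of a logical-to-physical multicast mapping (illustration only).
    class LogicalMulticastMap
    {
        class Mapping
        {
            public string PhysicalIpmc;                       // e.g. "239.1.1.1", or null if unmapped
            public List<string> Members = new List<string>(); // unicast fallback targets
        }

        static readonly Dictionary<string, Mapping> map = new Dictionary<string, Mapping>();

        static void Send(string logicalGroup, string message)
        {
            var m = map[logicalGroup];
            if (m.PhysicalIpmc != null)
                Console.WriteLine("IPMC send to " + m.PhysicalIpmc + ": " + message);
            else
                foreach (var member in m.Members)             // fall back to 1:1 UDP sends
                    Console.WriteLine("UDP send to " + member + ": " + message);
        }

        static void Main()
        {
            map["/amazon/updates"] = new Mapping { PhysicalIpmc = "239.1.1.1",
                                                   Members = { "10.0.0.1", "10.0.0.2" } };
            map["/amazon/rare"]    = new Mapping { Members = { "10.0.0.3", "10.0.0.4" } };

            Send("/amazon/updates", "price change");   // one physical multicast
            Send("/amazon/rare",    "audit event");    // two unicasts
        }
    }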

- We looked at various group scenarios
- Most of the traffic is carried by <20% of the groups
- For IBM WebSphere, Dr. Multicast achieves an 18x reduction in physical IPMC addresses

[Dr. Multicast: Rx for Data Center Communication Scalability. Ymir Vigfusson, Hussam Abu-Libdeh, Mahesh Balakrishnan, Ken Birman, and Yoav Tock. LADIS, November. Full paper submitted to Eurosys 10.]

- For small groups, reliable multicast protocols directly ack/nack the sender
- For large ones, use the QSM technique: tokens circulate within a tree of rings
  ▪ Acks travel around the rings and aggregate over the members they visit (an efficient token encodes the data)
  ▪ This scales well even with many groups
- Isis 2 uses this mode for groups with more than 25 members, with each ring containing ~25 nodes

[Quicksilver Scalable Multicast (QSM). Krzys Ostrowski, Ken Birman, and Danny Dolev. Network Computing and Applications (NCA ’08), July 08, Boston.]
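
The aggregation idea can be shown with a small, self-contained C# sketch (an illustration of ring-style ack aggregation, not QSM’s actual token format): as the token passes each member it folds in that member’s highest contiguous received sequence number, so one compact token summarizes the acknowledgements of the whole ring instead of one ack per receiver.

    using System;

    // Illustration: a token circulating a ring accumulates the minimum acked sequence
    // number, giving the sender one aggregate ack per ring.
    class RingAckAggregation
    {
        class Token { public int MinAcked = int.MaxValue; }

        static void Main()
        {
            // Highest contiguous sequence number each ring member has received (simulated).
            int[] highestReceived = { 41, 42, 40, 42, 41 };

            var token = new Token();
            foreach (int acked in highestReceived)                 // the token travels around the ring
                token.MinAcked = Math.Min(token.MinAcked, acked);  // each member folds in its state

            // Everything up to MinAcked is stable ring-wide; the sender can garbage-collect
            // those messages and retransmit anything newer where needed.
            Console.WriteLine("ring-wide aggregate ack: " + token.MinAcked);   // prints 40
        }
    }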

- Needed to prevent bursts of multicast from overrunning receivers
- The AJIL protocol imposes limits on the IPMC rate
  ▪ AJIL monitors the aggregated multicast rate
  ▪ Uses optimization to apportion bandwidth
  ▪ If the limit is exceeded, the user perceives a “slower” multicast channel

[Ajil: Distributed Rate-limiting for Multicast Networks. Hussam Abu-Libdeh, Ymir Vigfusson, Ken Birman, and Mahesh Balakrishnan (Microsoft Research, Silicon Valley). Cornell University TR, Dec 08.]
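
To give the flavor of such rate control, here is a generic token-bucket limiter as a self-contained C# sketch (not the AJIL protocol itself, which apportions an aggregate budget across senders by optimization): a sender that exceeds its share simply stalls briefly, so the application perceives a slower multicast channel rather than losing messages.

    using System;
    using System.Threading;

    // Generic token-bucket rate limiter used to illustrate AJIL-style rate control.
    class MulticastRateLimiter
    {
        readonly double ratePerSecond;   // this group's share of the aggregate budget
        readonly double burst;           // maximum bucket size
        double tokens;
        DateTime last = DateTime.UtcNow;

        public MulticastRateLimiter(double ratePerSecond, double burst)
        {
            this.ratePerSecond = ratePerSecond;
            this.burst = burst;
            tokens = burst;
        }

        // Blocks until one message worth of budget is available.
        public void AcquireSendSlot()
        {
            while (true)
            {
                var now = DateTime.UtcNow;
                tokens = Math.Min(burst, tokens + (now - last).TotalSeconds * ratePerSecond);
                last = now;
                if (tokens >= 1.0) { tokens -= 1.0; return; }
                Thread.Sleep(10);   // back off; the caller sees a "slower" channel
            }
        }

        static void Main()
        {
            var limiter = new MulticastRateLimiter(ratePerSecond: 100, burst: 10);
            for (int i = 0; i < 50; i++)
            {
                limiter.AcquireSendSlot();
                Console.WriteLine("multicast #" + i + " sent at " + DateTime.UtcNow.ToString("HH:mm:ss.fff"));
            }
        }
    }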

- AJIL reacts rapidly to load surges and stays close to its targets (and we’re improving it steadily)
- This makes it possible to eliminate almost all IPMC message loss within the datacenter!

Challenges and solutions:

- Challenge: Distributed computing is hard and our target developers have limited skills.
  Solution: Make group communication look as natural to the developer as building a .NET GUI.
- Challenge: Raw performance is critical to success.
  Solution: Consistency at the “speed of light,” by using lossless IPMC to send updates.
- Challenge: IPMC can trigger resource exhaustion and loss by entering “promiscuous” mode, overrunning receivers.
  Solution: Optimization-based management of IPMC addresses reduces the number of IPMC groups 100:1; the AJIL flow-control scheme prevents overload.
- Challenge: Users will generate massive numbers of groups, not just high rates of events.
  Solution: Aggregation, aggregation, aggregation… all automated and transparent to users.
- Challenge: Reliable protocols in massive groups result in ack implosions.
  Solution: For big groups, deploy hierarchical ack/nack rings (an idea from Quicksilver).
- Challenge: Many existing group communication systems are insecure.
  Solution: Use replicated group keys to secure membership and sensitive data.
- Challenge: What about C++ and Python on Linux?
  Solution: Port the platform to Linux with Mono, then offer C++/Python support using remoting.

- Isis 2 is coming soon… initially on .NET
- Developers will think of distributed groups very much as they think of objects in C#
  ▪ A friendly, easy-to-understand model
  ▪ And under the surface, theoretically rigorous
  ▪ Yet fast and secure too
- All the complexities of distributed computing are swept into this library… users have a very insulated and easy experience

- .NET supports ~40 languages, all of which can call Isis 2 directly
- On Linux, we’ll do a Mono port and then build an outboard server that offers a remoted library interface
- C++ and other Linux languages/applications will simply run off this server, unless they are comfortable running under Mono, of course

- The code extensively leverages:
  ▪ The reflection capabilities of C#, even when called from one of the other .NET languages
  ▪ The component architecture of .NET, which means that users will already have the right “mindset”
  ▪ Powerful prebuilt data types such as HashSets
- All of this makes Isis 2 simpler and more robust: roughly a 3x improvement compared to the older C/C++ version of Isis!
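
To show why reflection matters here, the following self-contained C# sketch (invented names, not Isis 2 code) dispatches an incoming “message” by inspecting the registered handler’s parameter types and converting each marshalled field before invoking it, which is what lets a group call look like an ordinary typed RPC without any preprocessor.

    using System;
    using System.Reflection;

    // Sketch: reflection-based dispatch of a received message to a typed handler.
    class ReflectionDispatchDemo
    {
        static Delegate registeredHandler;

        static void Register(Delegate handler) { registeredHandler = handler; }

        static void Deliver(object[] marshalledFields)
        {
            MethodInfo m = registeredHandler.GetMethodInfo();
            ParameterInfo[] ps = m.GetParameters();

            // Convert each field to the declared parameter type before invoking.
            object[] args = new object[ps.Length];
            for (int i = 0; i < ps.Length; i++)
                args[i] = Convert.ChangeType(marshalledFields[i], ps[i].ParameterType);

            registeredHandler.DynamicInvoke(args);
        }

        static void MyUpdtHandler(string empName, double salary)
        {
            Console.WriteLine("update: " + empName + " -> " + salary);
        }

        static void Main()
        {
            Register((Action<string, double>)MyUpdtHandler);
            // In this toy example the fields arrive as strings "off the wire".
            Deliver(new object[] { "John Smith", "100000.0" });
        }
    }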

- I’m building this system (myself) as a sabbatical project… the code is mostly written
- The goal is to run this system on 500 to 500,000 node systems, with millions of object groups
- The initial byte-code-only version will be released under a FreeBSD license.