Tackling Challenges of Scale in Highly Available Computing Systems
Ken Birman, Dept. of Computer Science, Cornell University

Members of the group: Ken Birman, Robbert van Renesse, Einar Vollset, Krzysztof Ostrowski, Mahesh Balakrishnan, Maya Haridasan, Amar Phanishayee.

Our topic: computing systems are growing larger and more complex, and we hope to use them in an increasingly “unattended” manner. We peek under the covers of the toughest, most powerful systems that exist, then ask: can we discern a research agenda?

Some “factoids”: companies like Amazon, Google, and eBay are running data centers with tens of thousands of machines, with credit card companies, banks, brokerages, and insurance companies close behind, and the rate of growth is staggering. Meanwhile, a new rollout of wireless sensor networks is poised to take off.

How are big systems structured? Typically as a “data center” of web servers handling some human-generated traffic and some automatic traffic from web-service clients. The front-end servers are connected to a pool of clustered back-end application “services”. All of this is load-balanced and multi-ported, with extensive use of caching for improved performance and scalability. Publish-subscribe is very popular.

A glimpse inside eStuff.com: pub-sub combined with point-to-point communication technologies like TCP. [Diagram: “front-end applications” feeding a tier of load-balanced (LB) services.]

Hierarchy of sets: a set of data centers, each having a set of services, each structured as a set of partitions, each consisting of a set of programs running in a clustered manner on a set of machines… raising the obvious question: how well do platforms support hierarchies of sets?

A RAPS of RACS (Jim Gray). RAPS: a reliable array of partitioned subservices. RACS: a reliable array of cloned server processes. [Diagram: Ken Birman searching for “digital camera”; the partition map (“Pmap”) sends “B-C” to {x, y, z}, a set of equivalent replicas; here, y gets picked, perhaps based on load. The RAPS is the set of RACS.]
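To make the routing step concrete, here is a minimal sketch of the idea just described: hash a request key onto a partition (a RACS), then pick one of its cloned replicas. The class names, the SHA-1 partition map, and the load-based tie-break are all illustrative assumptions, not code from any real data center.

```python
# Minimal sketch of routing a request through a RAPS of RACS.
# All names here are illustrative, not from any real system.
import hashlib

class RACS:
    """A reliable array of cloned servers: equivalent replicas of one partition."""
    def __init__(self, partition_key, replicas):
        self.partition_key = partition_key      # e.g. the key range "B-C"
        self.replicas = replicas                # e.g. ["x", "y", "z"]
        self.load = {r: 0 for r in replicas}    # hypothetical load counters

    def pick_replica(self):
        # Pick the least-loaded clone; any replica would give the same answer.
        return min(self.replicas, key=lambda r: self.load[r])

class RAPS:
    """A reliable array of partitioned subservices: a set of RACS."""
    def __init__(self, racs_list):
        self.racs_list = racs_list

    def partition_for(self, request_key):
        # Hash the request key onto one of the partitions (the partition map).
        h = int(hashlib.sha1(request_key.encode()).hexdigest(), 16)
        return self.racs_list[h % len(self.racs_list)]

# Usage: a search lands on one partition, then on one of its clones.
service = RAPS([RACS("A", ["u", "v"]), RACS("B-C", ["x", "y", "z"])])
racs = service.partition_for("digital camera")
replica = racs.pick_replica()
print(f"route query to partition {racs.partition_key}, replica {replica}")
```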

RAPS of RACS in Data Centers

Technology needs? Programs will need a way to find the “members” of the service, apply the partitioning function to find contacts within a desired partition, manage resources dynamically (adapting RACS size and the mapping to hardware), and detect faults. Within a RACS we also need to replicate data for scalability and fault tolerance, and to load-balance or parallelize tasks.
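One way to read this list is as the interface a platform would have to expose to such programs. The sketch below is purely hypothetical; the method names and the split into two interfaces are assumptions, not an API of QuickSilver or any other system.

```python
# Hypothetical interface capturing the needs listed above; purely illustrative.
from abc import ABC, abstractmethod

class ServiceMembership(ABC):
    @abstractmethod
    def members(self):
        """Return all processes currently belonging to the service."""

    @abstractmethod
    def contacts(self, partition_key):
        """Apply the partitioning function; return contacts in that partition."""

    @abstractmethod
    def on_failure(self, callback):
        """Register a fault-detection callback invoked when a member is suspected."""

class RACSManager(ABC):
    @abstractmethod
    def resize(self, partition_key, n_replicas):
        """Adapt RACS size and its mapping onto hardware."""

    @abstractmethod
    def replicate(self, partition_key, data):
        """Replicate data within the RACS for scalability and fault tolerance."""
```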

Scalability makes this hard! Membership (within a RACS, of the service, of the services in a data center); communication (point-to-point and multicast); resource management (the pool of machines, the set of services, the subdivision into RACS); fault-tolerance; consistency.

… hard in what sense? Sustainable workload often drops at least linearly in system size, and this happens because overheads grow worse than linearly (quadratic is common). Reasons vary… but share a pattern: the frequency of “disruptive” events rises with scale, and the protocols have the property that the whole system is impacted when these events occur.

QuickSilver project: we’ve been building a scalable infrastructure addressing these needs. It consists of some existing technologies, notably Astrolabe and gossip “repair” protocols, and some new technology, notably a new publish-subscribe message bus and a new way to automatically create a RAPS of RACS for time-critical applications.

Gossip 101: suppose that I know something. I’m sitting next to Fred, and I tell him; now 2 of us “know”. Later, he tells Mimi and I tell Anne; now 4. This is an example of a push epidemic. Push-pull occurs if we exchange data.
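As a toy illustration of a push epidemic (a sketch only, not project code), assume each process that knows the rumor tells one randomly chosen peer per round:

```python
# A minimal push-gossip sketch: each round, every process that "knows" the
# rumor tells one randomly chosen peer. Push-pull would also pull data back.
import random

def push_gossip(n_processes, rounds, seed=0):
    rng = random.Random(seed)
    knows = {0}                                 # initially only process 0 knows
    for _ in range(rounds):
        newly_told = set()
        for p in knows:
            peer = rng.randrange(n_processes)   # pick a random peer to tell
            newly_told.add(peer)
        knows |= newly_told
    return knows

print(len(push_gossip(n_processes=1000, rounds=10)))  # most of the 1000 know
```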

Gossip scales very nicely: participants’ loads are independent of system size, network load is linear in system size, and information spreads in log(system size) time. [Plot: % of processes infected vs. time.]
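A quick simulation makes the logarithmic spreading time visible while each process still sends only one message per round; the parameters and the comparison column are illustrative, not measurements from any real deployment:

```python
# Illustrative experiment: rounds for a push epidemic to reach every process
# grow logarithmically in N, while per-process load stays at one message/round.
import math
import random

def rounds_to_full_spread(n, seed=1):
    rng = random.Random(seed)
    knows = {0}
    rounds = 0
    while len(knows) < n:
        knows |= {rng.randrange(n) for _ in knows}   # each knower tells one peer
        rounds += 1
    return rounds

for n in (100, 1000, 10000):
    # Print log2(n) alongside as a rough reference point, not an exact model.
    print(n, rounds_to_full_spread(n), round(math.log2(n), 1))
```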

Gossip in distributed systems: we can gossip about membership (we need a bootstrap mechanism, but can then discuss failures and new members), and we can gossip to repair faults in replicated data (“I have 6 updates from Charlie”). If we aren’t in a hurry, we can gossip to replicate data too.

Bimodal Multicast (ACM TOCS 1999). Send multicasts to report events; some messages don’t get through; periodically, but not synchronously, gossip about messages. [Diagram: two processes compare digests: “the gossip source has a message from Mimi that I’m missing, and he seems to be missing two messages from Charlie that I have”; “Could you send me a copy of Mimi’s 7th message?”; “Here are some messages from Charlie that might interest you.”; Mimi’s 7th message was “The meeting of our Q exam study group will start late on Wednesday…”]
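The gossip repair exchange can be sketched in a few lines. This is a simplification in the spirit of the protocol, not the Bimodal Multicast implementation; it ignores garbage collection, rate limiting, and the unreliable transport:

```python
# Sketch of the gossip "repair" exchange in the spirit of Bimodal Multicast:
# peers compare message digests and solicit whatever they are missing.

class Process:
    def __init__(self, name):
        self.name = name
        self.messages = {}                 # (sender, seqno) -> payload

    def multicast_receive(self, sender, seqno, payload):
        self.messages[(sender, seqno)] = payload

    def digest(self):
        return set(self.messages)          # which (sender, seqno) pairs I hold

    def gossip_with(self, peer):
        mine, theirs = self.digest(), peer.digest()
        for key in mine - theirs:          # push what the peer is missing
            peer.messages[key] = self.messages[key]
        for key in theirs - mine:          # solicit what I am missing
            self.messages[key] = peer.messages[key]

# Usage: p2 missed Mimi's 7th message; one gossip round repairs the gap.
p1, p2 = Process("p1"), Process("p2")
p1.multicast_receive("Mimi", 7, "study group starts late on Wednesday")
p2.gossip_with(p1)
print(p2.messages[("Mimi", 7)])
```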

Stock Exchange Problem: reliable multicast is too “fragile”. Most members are healthy… but one is slow.

The problem gets worse as the system scales up. [Plot: average throughput on non-perturbed members vs. perturb rate, for the virtually synchronous Ensemble multicast protocols, at group sizes 32, 64, and …]

Bimodal multicast with perturbed processes: bimodal multicast scales well, whereas traditional multicast throughput collapses under stress.

Bimodal Multicast imposes a constant overhead on participants. Many optimizations and tricks are needed, but nothing that isn’t practical to implement; the hardest issues involve “biased” gossip to handle LANs connected by WAN long-haul links. Reliability is easy to analyze mathematically using epidemic theory: we use the theory to derive optimal parameter settings, and the theory also lets us predict behavior. Despite the simplified model, the predictions work!
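For readers who want a feel for the epidemic analysis alluded to here, the standard recurrences from the epidemic-algorithms literature (not the exact Bimodal Multicast model) track the probability \(p_i\) that a given process is still missing a message after gossip round \(i\) in an \(n\)-process system:

\[
\text{push:}\quad p_{i+1} \;=\; p_i\left(1-\tfrac{1}{n}\right)^{\,n(1-p_i)} \;\approx\; p_i\,e^{-(1-p_i)},
\qquad
\text{pull:}\quad p_{i+1} \;=\; p_i^{\,2}.
\]

Both recurrences drive \(p_i\) toward zero extremely fast once most processes have the message, which is why a small, fixed number of gossip rounds per message yields the “bimodal” delivery distribution and lets one solve for parameters that push the residual loss probability below a chosen target.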

Kelips: a distributed “index” supporting Put(“name”, value) and Get(“name”). Kelips can do lookups with one RPC and is self-stabilizing after disruption.

Kelips: take a collection of “nodes”.

Kelips: map nodes to affinity groups, with peer membership through a consistent hash. [Diagram: affinity groups 0 … √N-1, each with about √N members.]

Kelips affinity group view: each node keeps pointers to the other members of its own affinity group, a small table with rows of (id, hbeat, rtt in ms). [Diagram: node 110 knows about other members of its group, such as 230 and 30…]

Kelips contact pointers: in addition to its affinity group view, each node keeps a small table of contacts in the other groups, with rows of (group, contactNode). [Diagram: node 202 is a “contact” for node 110 in group 2.]

Kelips resource tuples: a gossip protocol replicates data cheaply. Each node also stores resource tuples, rows of (resource, info) such as (cnn.com, 110). “cnn.com” maps to group 2, so node 110 tells group 2 to “route” inquiries about cnn.com to it.

Kelips lookup: to look up “cnn.com”, just ask some contact in group 2; it returns “110” (or forwards your request). (IP2P, ACM TOIS, submitted.)
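A compressed sketch of the insert/lookup path may help. The contact table, the use of SHA-1, and the way one process holds the group’s replicated tuples below are illustrative simplifications of the protocol described on these slides, not the Kelips code:

```python
# Kelips-style lookup sketch (illustrative; the real protocol layers gossip,
# heartbeats, and weighted contact selection underneath all of this).
import hashlib
import math

def group_of(name, k):
    """Consistent-hash a node id or resource name onto one of k affinity groups."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % k

n_nodes = 100
k = int(math.sqrt(n_nodes))          # ~sqrt(N) affinity groups of ~sqrt(N) nodes

# One node's soft state: a contact per group, plus replicated resource tuples.
contacts = {g: f"node-in-group-{g}" for g in range(k)}   # hypothetical contacts
resource_tuples = {}                                     # filled by gossip

def insert(resource, home_node):
    g = group_of(resource, k)
    # Gossip would replicate this tuple among the ~sqrt(N) members of group g.
    resource_tuples[(g, resource)] = home_node

def lookup(resource):
    g = group_of(resource, k)
    # One hop: ask any contact in group g; here we read the replicated tuple directly.
    return resource_tuples.get((g, resource)), contacts[g]

insert("cnn.com", home_node=110)
print(lookup("cnn.com"))    # (110, the contact we would have asked)
```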

Kelips: per-participant loads are constant and the space required grows as O(√N). It finds an object in “one hop”, whereas most other DHTs need log(N) hops, and it isn’t disrupted by churn either; most other DHTs are seriously disrupted when churn occurs and might even “fail”.

Astrolabe: distributed monitoring (ACM TOCS 2003). [Table: one row per host (swift, falcon, cardinal) with columns Name, Load, Weblogic?, SMTP?, Word Version, …] A row can have many columns, but the total size should be kilobytes, not megabytes. A configuration certificate determines what data is pulled into the table (and it can change).

State merge: the core of the Astrolabe epidemic. [Animation over three slides: swift.cs.cornell.edu and cardinal.cs.cornell.edu each hold a copy of the table (Name, Time, Load, Weblogic?, SMTP?, Word Version); they gossip, exchange rows, and each keeps, for every row, the copy with the newer timestamp.]
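A few lines of code capture the merge rule, assuming a per-row timestamp field as shown in the tables; the row contents are invented for illustration:

```python
# Sketch of the Astrolabe state merge: when two agents gossip, each keeps,
# for every row, whichever copy carries the larger timestamp.

def merge(table_a, table_b):
    """Each table maps host name -> row dict containing a 'Time' field."""
    merged = dict(table_a)
    for name, row in table_b.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

swift_copy    = {"swift":    {"Time": 2011, "Load": 2.0},
                 "cardinal": {"Time": 2201, "Load": 3.5}}
cardinal_copy = {"swift":    {"Time": 2271, "Load": 1.8},
                 "cardinal": {"Time": 2201, "Load": 3.5}}

# After gossip, both agents hold the same, freshest rows.
print(merge(swift_copy, cardinal_copy))
```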

Scaling up… and up… With a stack of domains, we don’t want every system to “see” every domain; the cost would be huge. So instead, we’ll see a summary. [Figure: many copies of the per-host table (Name, Time, Load, Weblogic?, SMTP?, Word Version), one per domain.]

Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers. [Figure: leaf tables for San Francisco (swift, falcon, cardinal) and New Jersey (gazelle, zebra, gnu), and a parent table with one row per region (SF, NJ, Paris) holding Avg Load, WL contact, and SMTP contact.] An SQL query “summarizes” the data, and the dynamically changing query output is visible system-wide.

(1) The query goes out… (2) each zone computes locally… (3) the results flow to the top level of the hierarchy. [Figure: the same hierarchy of San Francisco, New Jersey, and Paris tables.]

The hierarchy is virtual… the data is replicated. [Figure: the same hierarchy of regional tables.] (ACM TOCS 2003)
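The summarization step could look roughly like the sketch below. In the real system the aggregation is an SQL query carried in the configuration certificate; here it is hand-coded and the host rows are invented:

```python
# Sketch of how an Astrolabe-style aggregation query summarizes child zones
# into one row of the parent table; values and column choices are illustrative.

def summarize(zone_name, rows):
    """Collapse a leaf zone's table into a single parent-level row."""
    avg_load = sum(r["Load"] for r in rows.values()) / len(rows)
    wl_contact = next((name for name, r in rows.items() if r["Weblogic"]), None)
    smtp_contact = next((name for name, r in rows.items() if r["SMTP"]), None)
    return {"Name": zone_name, "AvgLoad": round(avg_load, 2),
            "WLcontact": wl_contact, "SMTPcontact": smtp_contact}

san_francisco = {"swift":  {"Load": 2.0, "Weblogic": True,  "SMTP": False},
                 "falcon": {"Load": 1.5, "Weblogic": False, "SMTP": True}}
new_jersey    = {"gazelle": {"Load": 4.1, "Weblogic": True, "SMTP": True}}

parent_table = [summarize("SF", san_francisco), summarize("NJ", new_jersey)]
print(parent_table)   # one dynamically recomputed row per child zone
```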

Astrolabe: load on participants grows, in the worst case, as log_rsize(N), and most participants see a constant, low load. It is incredibly robust and self-repairing. Information becomes visible in log time, and the system can reconfigure or change the aggregation query in log time, too. Well matched to data mining.

QuickSilver: current work. One goal is to offer scalable support for Publish(“topic”, data) and Subscribe(“topic”, handler), with each topic associated with a protocol stack and its properties: many topics, hence many protocol stacks (communication groups). QuickSilver scalable multicast is running now and demonstrates this capability in a web services framework. The primary developer is Krzys Ostrowski.
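The API shape is the familiar publish/subscribe pattern. The toy bus below shows only that shape and is not the QuickSilver implementation, which maps each topic onto its own protocol stack with ordering, reliability, and flow-control properties:

```python
# Minimal publish/subscribe sketch matching the API shape above.
from collections import defaultdict

class MessageBus:
    def __init__(self):
        self.handlers = defaultdict(list)    # topic -> subscribed callbacks

    def subscribe(self, topic, handler):
        self.handlers[topic].append(handler)

    def publish(self, topic, data):
        # A real implementation would run the topic's protocol stack
        # (ordering, reliability, flow control); we just invoke handlers.
        for handler in self.handlers[topic]:
            handler(data)

bus = MessageBus()
bus.subscribe("trades", lambda data: print("got", data))
bus.publish("trades", {"symbol": "XYZ", "price": 42})
```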

Tempest: this project seeks to automate a new drag-and-drop style of clustered application development, with an emphasis on time-critical response. You start with a relatively standard web service application having good timing properties (inheriting from our data class); Tempest automatically clones services, places them, load-balances, and repairs faults. It uses the Ricochet protocol for time-critical multicast.

Ricochet: the core protocol underlying Tempest. It delivers a multicast with probabilistically strong timing properties (three orders of magnitude faster than the prior record!) and with probability-one reliability, if desired. The key idea is to use FEC and to exploit patterns of numerous, heavily overlapping groups. Available for download from Cornell as a library (coded in Java).
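As a toy illustration of the FEC idea only (this is not Ricochet, which spreads repair traffic across the many overlapping groups), an XOR parity packet over a small window lets a receiver recover any single lost packet in that window:

```python
# Toy XOR-parity FEC illustration; packet contents and window size are invented.

def xor_parity(packets):
    """XOR a list of equal-length byte strings into one repair packet."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, b in enumerate(pkt):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity):
    """If exactly one packet from the window is lost, XOR recovers it."""
    return xor_parity(received + [parity])

window = [b"pkt-1...", b"pkt-2...", b"pkt-3..."]
parity = xor_parity(window)
lost = window[1]
print(recover([window[0], window[2]], parity) == lost)   # True
```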

Our system will be used in massive data centers, distributed data mining, sensor networks, grid computing, and the Air Force “Services Infosphere”.

Our platform in a datacenter

Next major project? We’re starting a completely new effort whose goal is to support a new generation of mobile platforms that can collaborate, learn, and query a surrounding mesh of sensors using wireless ad-hoc communication. Stefan Pleisch has worked on the mobile query problem; Einar Vollset and Robbert van Renesse are building the new mobile platform software. Epidemic gossip remains our key idea…

Summary: our project builds software, software that real people will end up running, but we tell users when it works and prove it! The focus lately is on scalability and QoS, combining theory, engineering, experiments, and simulation. For scalability, we set probabilistic goals and use epidemic protocols. But the outcome will be real systems that we believe will be widely used.