Using Self-Regenerative Tools to Tackle Challenges of Scale
Ken Birman, QuickSilver Project, Cornell University

2 The QuickSilver team
At Cornell:
Birman/van Renesse: core platform
Gehrke: content filtering technology
Francis: streaming content delivery
At Raytheon:
DiPalma/Work: military scenarios
With help from: the AFRL JBI team in Rome, NY

3 Technical Overview
Objective: overcome communications challenges that plague (and limit) current GIG/NCES platforms
For example, dramatically reduce time-critical event delivery delays and speed up the event filtering layer
Do this even when sustaining damage or when under attack
Existing COTS publish-subscribe technology, particularly in Web Services (SOA) platforms:
Not designed for these challenging settings; scales poorly
Very expensive to own/operate, easily disabled
Forces the military to “hack around” limitations, else major projects can stumble badly (the Navy’s CEC effort is an example)
Problem identified by the Air Force JBI team in Rome, NY
But also a major concern for companies like Google and Amazon
Charles Holland, Assistant DDR&E, highlighted the topic as a top priority

4 Technical Overview
The GIG/NCES vision centers on reliable communication protocols, like publish-subscribe
Underlying protocols are old… they hit limits 15 years ago!
Faster hardware has helped… but only a little
Peer-to-peer epidemic protocols (“gossip”) have never been applied in such systems
We’re fusing these with more conventional protocols, and achieving substantial improvements
This also makes our system robust and self-repairing
Existing systems take an all-or-nothing approach to reliability; under stress, we often get nothing
Probabilistic guarantees enable better solutions
But we need provable guarantees of quality

5 Major risks, mitigation
Building a big platform fast… despite profound technical hurdles
But we are not constrained by an existing product to sell
Already demonstrated some solutions in the SRS seedling
Users demand standards: we’re extending the Web Services architecture and tools
Focus on real needs of military users: we work closely with AFRL (JBI) and Raytheon (Navy)
What about the baseline (scenario II), quantitative metrics and goals? Deferred until the last 15 minutes of the talk

6 Expected major achievement?
QuickSilver will represent a breakthrough technology for building new GIG/NCES applications
… applications that operate reliably even under stress that cripples existing COTS solutions
… that need far less hand-holding from the application developer, deployment team, and systems administrator (saving money!)
… and that enable powerful new information-enabled applications for uses like self-managed sensor networks, new real-time information tools for urban warfare, and control and exploitation of autonomous vehicles
We’ll enable the military to take GIG concepts into domains where commercial products just can’t go!

7 Our topic: GIG and NCES platforms
Military computing systems are growing
… larger,
… and more complex,
… and must operate “unattended”
With existing technology they
… are far too expensive to develop
… require much too much time to deploy
… are insecure and too easily disrupted
QuickSilver: brings SRS concepts to the table

8 How are big systems structured?
Typically a “data center” of web servers
Some human-generated traffic
Some automatic traffic from WS clients
Front-end servers connected to a pool of back-end application “services” (new applications on clusters and wrapped legacy applications)
Publish-subscribe very popular
Sensor networks have similarities, although they lack this data center “focus”

9 GIG/NCES (and SOA) vision
Pub-sub combined with point-to-point communication technologies like TCP
Clients connect via “front-end interface systems”
(Diagram: clients reach a tier of load-balancing (LB) services, which front legacy applications behind wrappers.)

10 Big sensor networks?
QuickSilver will also be useful in, e.g., sensor networks
We’re focused on a fixed mesh of sensors using wireless ad-hoc communication, mobile query sources, and QuickSilver as the middleware

11 How to build big systems today?
The programmer is on his own!
Expected to use GIG/NCES standards, based on Service Oriented Architectures (SOAs)
No support for this architecture as a whole
Focus is on isolated aspects, like legacy wrappers
Existing SOAs focus on single client, single server
No attention to performance, stability, scale
The structure of the data center is overlooked!
Results in high costs and lower-quality solutions

12 Drill down: An example
Many services (not all) will be RAPS of RACS
RAPS: a reliable array of partitioned services
RACS: a reliable array of cluster-structured server processes
Example: General Pershing searching for “Faluja SITREP h”
Pmap “Faluja”: {x, y, z} (equivalent replicas)
Here, y gets picked, perhaps based on load
(Diagram: a RAPS composed of a set of RACS x, y, z.)
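The lookup on this slide can be sketched in a few lines. This is an illustrative sketch only: the names pmap and pick_replica, and the least-loaded choice, are assumptions for exposition, not QuickSilver's actual API.

```python
# Hypothetical sketch of the RAPS/RACS lookup described above.
# A partition map sends a topic to its set of equivalent replicas
# (a RACS); one replica is then picked, here by lowest load.

def pmap(topic: str, partitions: dict) -> list:
    """Map a topic to its set of equivalent replicas (a RACS)."""
    return partitions[topic]

def pick_replica(replicas: list, load: dict) -> str:
    """Pick one replica to serve the request, e.g. the least loaded."""
    return min(replicas, key=lambda r: load.get(r, 0.0))

partitions = {"Faluja": ["x", "y", "z"]}
load = {"x": 0.9, "y": 0.2, "z": 0.7}
print(pick_replica(pmap("Faluja", partitions), load))  # -> y
```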

13 Multiple datacenters
Services are hosted at data centers but accessible system-wide
Logical partitioning of services; logical services map to a physical resource pool (pmap, l2P map), perhaps many to one
One application can be a source of both queries and updates
Operators can control pmap, l2P map, and other parameters
Large-scale multicast used to disseminate updates
(Diagram: query and update sources reaching services hosted in Data center A and Data center B, each with its own server pool.)

14 Problems you must solve by hand
Membership: within RACS; of the service; services in data centers
Communication: multicast; streaming media
Resource management: pool of machines; set of services; subdivision into RACS
Fault-tolerance
Consistency

15 Replication
The unifying “concept” here?
Replication within a clustered service
“Notification” in publish-subscribe apps
Replicated system configuration data
Replication of streaming media
Existing platforms lack replication tools, or provide them only in small-scale forms

16 QuickSilver vision
We’ll develop a new generation of solutions that
At its core offers scalable replication
Is presented to the user through GIG/NCES interfaces (Web Services, CORBA)
Is fast, stable, and self-managed, and self-repairs when disrupted

17 Core challenges
To solve our problem…
Reduce the big challenge to smaller ones
Tackle these using new conceptual tools
Then integrate solutions into a publish-subscribe platform
And apply to high-value scenarios

18 Milestones (9/04 to 12/05)
Scalable reliable multicast (many receivers, “groups”)
Time-critical event notification
Management and self-repair
Streaming real-time media data
Scalable content filtering
Integrate into Core Platform
Phases: develop baselines and overall architecture; solve key subproblems; integrate into platform; deliver to early users

19 Large scale makes it hard! Want…
Reliability
Performance: publish rates, latency, recovery time
Scalability: # participants, # topics, subscription and failure rates
Self-tuning
Nice interfaces
Structured solution: detecting regularities, introducing some structure, sophisticated methods, re-adjusting dynamically

20 Techniques
Detecting overlap patterns: IP multicast, buffering, aggregation, routing
Gossip (structured): receivers forwarding data, flow control
Reconfiguring upon failure: self-monitoring, reconfiguring for speed-up
Modular structure: reusable hot-plug modules
The system: ~65,000 lines of C#, modular architecture, testing on a cluster

21 Drill down: How will we do it?
Combine scalable multicast…
Uses peer-to-peer gossip to enhance the reliability of a scalable multicast protocol
Achieves dramatic scalability improvements
… with a scalable “groups” framework
Uses gossip to take many costly aspects of group management “offline”
Slashes the costs of huge numbers of groups!

22 Reliable multicast is too “fragile”
Most members are healthy… but one is slow

23 Performance drops with scale
(Chart: average throughput on non-perturbed members vs. perturb rate, for virtually synchronous Ensemble multicast protocols at group sizes 32, 64, and 128.)

24 Gossip 101
Suppose that I know something
I’m sitting next to Fred, and I tell him. Now 2 of us “know”
Later, he tells Mimi and I tell Anne. Now 4 of us know
This is an example of a push epidemic
Push-pull occurs if we exchange data
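The push epidemic above is easy to simulate: each round, every node that knows the rumor tells one random peer, so knowledge roughly doubles until nearly everyone is infected. A minimal sketch (function name and parameters are illustrative):

```python
import random

def push_gossip_rounds(n: int, seed: int = 42) -> int:
    """Simulate a push epidemic over n nodes: each round, every
    infected node tells one uniformly random peer. Returns the
    number of rounds until all n nodes are infected."""
    random.seed(seed)
    infected = {0}          # node 0 starts out knowing the rumor
    rounds = 0
    while len(infected) < n:
        for _ in list(infected):
            infected.add(random.randrange(n))  # push to a random peer
        rounds += 1
    return rounds

# Spreads to 1000 nodes in O(log n) rounds, as the next slide notes.
print(push_gossip_rounds(1000))
```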

25 Gossip scales very nicely
Participants’ loads are independent of system size
Network load is linear in system size
Information spreads in log(system size) time
(Chart: % infected vs. time, the classic S-shaped epidemic curve.)

26 Gossip in distributed systems
We can gossip about membership
Need a bootstrap mechanism, but then discuss failures, new members
Gossip to repair faults in replicated data
“I have 6 updates from Charlie”
If we aren’t in a hurry, gossip to replicate data too

27 Bimodal Multicast (ACM TOCS 1999)
Send multicasts to report events
Some messages don’t get through
Periodically, but not synchronously, gossip about messages
Example exchange: “Gossip source has a message from Mimi that I’m missing. And he seems to be missing two messages from Charlie that I have.” “Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi’s 7th message?” “Mimi’s 7th message was ‘The meeting of our Q exam study group will start late on Wednesday…’”
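The exchange in the example above is an anti-entropy round: two peers compare message digests and each pulls what it is missing. The sketch below is illustrative only; the names and the dict-based "log" are stand-ins, not Bimodal Multicast's wire protocol.

```python
# Sketch of one anti-entropy gossip round between two peers.
# Each log maps (sender, sequence number) -> message body.

def anti_entropy(my_log: dict, peer_log: dict) -> None:
    """Each side pulls the messages it is missing from the other."""
    for msg_id in peer_log.keys() - my_log.keys():
        my_log[msg_id] = peer_log[msg_id]    # e.g. I pull Mimi's 7th
    for msg_id in my_log.keys() - peer_log.keys():
        peer_log[msg_id] = my_log[msg_id]    # peer pulls Charlie's two

a = {("mimi", 7): "Q exam study group will start late on Wednesday"}
b = {("charlie", 1): "first report", ("charlie", 2): "second report"}
anti_entropy(a, b)
print(sorted(a) == sorted(b))  # -> True: both logs now agree
```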

28 Bimodal multicast in the baseline scenario
Bimodal multicast scales well
Baseline multicast: throughput collapses under stress

29 Bimodal Multicast Summary
Imposes a constant overhead on participants
Many optimizations and tricks needed, but nothing that isn’t practical to implement
Hardest issues involve “biased” gossip to handle LANs connected by WAN long-haul links
Reliability is easy to analyze mathematically using epidemic theory
Use the theory to derive optimal parameter settings
Theory also lets us predict behavior
Despite the simplified model, the predictions work!

30 So we have part of our solution
To multicast in many groups:
Map down to IP multicast in popular overlap regions
Multicast unreliably
Then, in the background:
Use gossip to repair omissions
Also for flow control (rate based) and surge handling (deals with bursty traffic)

31 Techniques
Detecting overlap patterns: IP multicast, buffering, aggregation, routing
Gossip (structured): receivers forwarding data, flow control
Reconfiguring upon failure: self-monitoring, reconfiguring for speed-up
Modular structure: reusable hot-plug modules
The system: ~65,000 lines of C#, modular architecture, testing on a cluster

32 Other components of QuickSilver?
Astrolabe: developed during the seedling; a hierarchical distributed database; it also uses gossip… and is used for self-organizing, scalable, robust distributed management and control
Slingshot: uses FEC for low-latency time-critical event notification
ChunkySpread: focus is on streaming media
Event Filter: rapidly scans the event stream to identify relevant data

33 State Merge: Core of the Astrolabe epidemic
swift.cs.cornell.edu and cardinal.cs.cornell.edu gossip and merge their copies of the table, each keeping the fresher row (by Time) for every host
(Table: one row per host, with columns Name, Time, Load, Weblogic?, SMTP?, and Word Version, for hosts swift, falcon, and cardinal.)

34 Scaling up… and up…
With a stack of domains, we don’t want every system to “see” every domain
Cost would be huge
So instead, we’ll see a summary
(Diagram: many copies of the per-host table, as seen at cardinal.cs.cornell.edu, collapse into a single summarized view.)

35 Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers
An SQL query “summarizes” the data in each zone
Dynamically changing query output is visible system-wide
(Diagram: leaf tables for San Francisco (swift, falcon, cardinal) and New Jersey (gazelle, zebra, gnu), each with columns Name, Load, Weblogic?, SMTP?, and Word Version, roll up into a root table with rows SF, NJ, and Paris and columns Avg Load, WL contact, and SMTP contact.)
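The roll-up on this slide can be sketched as a per-zone aggregation: each zone's leaf rows are reduced to one summary row, and only summaries are visible higher in the hierarchy. The function name, column names, and the particular aggregates below are illustrative stand-ins for Astrolabe's SQL-based mechanism.

```python
# Illustrative sketch of Astrolabe-style zone summarization:
# leaf rows (one per host) reduce to a single summary row per
# zone, e.g. average load plus one SMTP contact for the zone.

def summarize(zone: str, rows: list) -> dict:
    """Aggregate a zone's leaf rows into one summary row."""
    contacts = [r["name"] for r in rows if r["smtp"]]
    return {
        "zone": zone,
        "avg_load": sum(r["load"] for r in rows) / len(rows),
        "smtp_contact": contacts[0] if contacts else None,
    }

sf = [{"name": "swift", "load": 2.0, "smtp": True},
      {"name": "falcon", "load": 1.0, "smtp": False}]
nj = [{"name": "gazelle", "load": 4.0, "smtp": True}]

# The "root table" seen system-wide holds only the summaries.
root = [summarize("SF", sf), summarize("NJ", nj)]
print(root[0]["avg_load"])  # -> 1.5
```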

36 Astrolabe “compared” to multicast
Both use gossip in similar ways
But here, data comes from all nodes in a system, not just a few sources
Rates are low… hence overhead is low…
… but invaluable when orchestrating adaptation and self-repair
Astrolabe is extremely robust to disruption
Hierarchy is self-constructed, self-healing

37 Remaining time: 2 baselines
First focuses on latency of real-time event notification
Second on speed of event filtering
Both involve key elements of QuickSilver, and both are easy to compare with the prior state of the art

38 Slingshot
Time-critical event notification protocol
Idea: probabilistic real-time goals
Pay a higher overhead but reduce the frequency of missed deadlines
Already yielding multiple orders of magnitude improvements in latency and throughput!

39 Redefining Time-Critical
Probabilistic guarantees: with x% overhead, y% of data is delivered within t seconds
Data ‘expires’: stock quotes, location updates
Urgency-sensitive: new data is prioritized over old
Application runs in COTS settings, co-existing with other non-time-critical applications on the same machine

40 Time-Critical Eventing
Eventing: publishers publish events to topics, which are then received by subscribers
Applications characterized by many-to-many flow of small, discrete units of data
Scalability dimensions: number of topics; number of publishers and subscribers per topic; degree of subscription overlap

41 Slingshot: Receiver-Based FEC
Topics are mapped to multicast groups
Publishers multicast events unreliably
Subscribers constantly exchange error-correction packets for message-history suffixes
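The recovery idea behind receiver-based FEC can be illustrated with a single XOR parity packet over a history suffix: a receiver missing exactly one packet of that suffix can reconstruct it from the parity. This is purely a sketch of the idea; Slingshot's actual encoding and packet exchange are not shown.

```python
# Sketch of XOR-based repair: a parity packet over a suffix of
# message history lets a receiver recover one missing message.
# (All packets here are assumed to be the same length.)

def xor_parity(packets: list) -> bytes:
    """Build one XOR parity packet over a suffix of history."""
    out = bytearray(len(packets[0]))
    for p in packets:
        for i, byte in enumerate(p):
            out[i] ^= byte
    return bytes(out)

def recover(received: list, parity: bytes) -> bytes:
    """XOR the received packets with the parity to reconstruct
    the single missing packet of the suffix."""
    return xor_parity(received + [parity])

suffix = [b"msg1", b"msg2", b"msg3"]   # the history suffix
parity = xor_parity(suffix)            # exchanged among subscribers
print(recover([suffix[0], suffix[2]], parity))  # -> b'msg2'
```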

42 Slingshot: Tunable Reliability

43 Slingshot: Scalability in Topics

44 A second baseline
Scalable Stateful Content Filtering (Gehrke)
Arises when deciding which events to deliver to the client system
Usually pub-sub is “coarse grained”, then does content filtering
A chance to apply security policy and prune unwanted data… but can be slow

45 Model and problem statement
Model:
An event is a set of (attribute, value) pairs
Example: an event reporting the location of a vehicle: {(Type, “Tank”), (Latitude, 10), (Longitude, 25)}
A subscription is a set of predicates on event attributes (conjunctive semantics)
Example: a subscription looking for tanks in the area: {(Type = “Tank”), (8 < Latitude < 12)}
Equality and range predicates
Problem:
Given: a (large) set of subscriptions, S, and a stream of events, E
Find: for each event e in E, the set of subscriptions whose predicates are satisfied by e
Scalability: with the event rate; with the number of subscriptions
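The stateless matching semantics above can be sketched with a linear scan over predicates; a real filtering engine would index the subscriptions rather than test each one, so treat the names and structure here as illustrative.

```python
# Sketch of conjunctive subscription matching: an event is a dict
# of attribute/value pairs, a subscription is a list of predicates,
# and an event matches only if every predicate holds.

def matches(event: dict, predicates: list) -> bool:
    """Conjunctive semantics: all predicates must be satisfied."""
    return all(pred(event) for pred in predicates)

# The subscription from the slide: tanks with 8 < Latitude < 12.
tank_nearby = [
    lambda e: e.get("Type") == "Tank",
    lambda e: 8 < e.get("Latitude", -999) < 12,
]

event = {"Type": "Tank", "Latitude": 10, "Longitude": 25}
print(matches(event, tank_nearby))  # -> True
```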

46 What About State?
Model:
An event is a set of (attribute, value) pairs
Example: an event reporting the location of a vehicle: {(Type, “Tank”), (Latitude, 10), (Longitude, 25)}
A subscription is now a query over sequences of events
Example: a subscription looking for adversaries with suspicious behavior: “Notify me if an enemy first visits location A and then location B”
Subscriptions need to maintain state across events
Problem:
Given: a (large) set of stateful subscriptions, S, and a stream of events, E
Find: for each event e in E, the set of subscriptions whose predicates are satisfied by e

47 Managing State
Use a linear finite state automaton with self-loops to encapsulate state
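Such an automaton, for the “first A, then B” subscription from the previous slide, might look like the sketch below. Each state self-loops until its predicate fires, then advances; class and method names are illustrative, not the actual system's API.

```python
# Sketch of a linear finite state automaton with self-loops for a
# stateful subscription: one predicate per state, advance on match,
# self-loop otherwise, accept when the last state is passed.

class StatefulSubscription:
    def __init__(self, predicates):
        self.predicates = predicates  # one predicate per state
        self.state = 0                # index of the next predicate

    def feed(self, event) -> bool:
        """Consume one event; return True once the sequence matched."""
        if self.state < len(self.predicates) and \
                self.predicates[self.state](event):
            self.state += 1           # advance
        return self.state == len(self.predicates)

# "Notify me if enemy first visits location A and then location B"
sub = StatefulSubscription([
    lambda e: e["loc"] == "A",
    lambda e: e["loc"] == "B",
])
for ev in [{"loc": "C"}, {"loc": "A"}, {"loc": "A"}, {"loc": "B"}]:
    fired = sub.feed(ev)
print(fired)  # -> True: A was seen, then later B
```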

48 Baseline System Architecture App Server

49 Experimental Results Y axis in Log scale!

50 Putting it all together
Scalable reliable multicast (many receivers, “groups”)
Time-critical event notification
Management and self-repair
Streaming real-time media data
Scalable content filtering
Integrate into Core Platform
Phases (9/04 to 12/05): develop baselines and overall architecture; solve key subproblems; integrate into platform; deliver to early users

51 Will QuickSilver solve our problem?
(Revisiting the multi-datacenter picture: query and update sources, pmap, l2P map, and server pools in Data centers A and B.)
Scalable multicast used to update system-wide parameters and management controls
Within and between groups, we need stronger reliability properties and higher speeds; groups are smaller but there are many of them
We need a way to monitor and manage the collection of services in our data center: a good match to Astrolabe
We need a way to monitor and manage the machines in the server pool… another good match to Astrolabe
We’re exploring the limits beyond which a strong (non-probabilistic) replication scheme is needed in clustered services; QuickSilver will support virtual synchrony too

52 DoD “Typical” Baseline Data - 1
According to a study by the Congressional Budget Office for the Department of the Army in 2003, bandwidth demands for the Army alone will exceed bandwidth supply by a factor of between 10:1 and 30:1 by the year (“The Army’s Bandwidth Bottleneck,” a CBO report, August 2003)
The growth rates, data volumes, and characterization of networked transactions described in a DCGS Block 10.2 Navy study are consistent with the CBO study. In many cases the DCGS-N study predicts earlier bandwidth saturation, given the disparate rates of growth in total network capacity compared to the technological innovation that will necessarily increase demand
Throughput requirements of 3-10 Mbps for imagery data, 200 Kbps-1 Mbps for other forms (see next slide)

53 DoD “Typical” Baseline Data - 2
Inputs and their “typical” DoD scenario values:
System size: 100s-1000s of nodes
System topology: hierarchical networks with “Bridges” using LAN/WAN, SATCOM, and wireless RF (LOS & BLOS)
Event type: multiple situational awareness updates (binary/text/XML); plans and reports (text); imagery
Event rate: SA updates, 100s/sec (size = 1 KB per entity); plans and reports, aperiodic/sporadic (size = 10 KB); imagery, aperiodic/continuous (size = 50 MB)
Perturbation rate
Since most of this sort of data is short-lived yet requires processing in a time-valued ordering scheme…

54 DoD Challenges for SRS ( ) – Network Oriented Granular, Scalable Redundancy: USN FORCEnet Source: NETWARCOM Official FORCEnet World Wide Web site

55 DoD Challenges for SRS ( ) – Network Oriented Granular, Scalable Redundancy: Ground Sensor Netting Source: Raytheon Company © 2004 Raytheon Company. All Rights Reserved. Unpublished Work