1
Using Self-Regenerative Tools to Tackle Challenges of Scale Ken Birman QuickSilver Project Cornell University
2
2 The QuickSilver team At Cornell: Birman/van Renesse: Core platform Gehrke: Content filtering technology Francis: Streaming content delivery At Raytheon: DiPalma/Work: Military scenarios With help from: AFRL JBI team in Rome NY
3
3 Technical Overview. Objective: overcome communications challenges that plague (and limit) current GIG/NCES platforms; for example, dramatically improve time-critical event delivery delays and the speed of the event filtering layer, and do this even when sustaining damage or under attack. Existing COTS publish-subscribe technology, particularly in Web Services (SOA) platforms, was not designed for these challenging settings: it scales poorly, is very expensive to own and operate, and is easily disabled. This forces the military to “hack around” limitations, else major projects can stumble badly (the Navy’s CEC effort is an example). The problem was identified by the Air Force JBI team in Rome NY, but it is also a major concern for companies like Google and Amazon. Charles Holland, Assis. DDR&E, highlighted the topic as a top priority.
4
4 Technical Overview. The GIG/NCES vision centers on reliable communication protocols, like publish-subscribe, but the underlying protocols are old… they hit limits 15 years ago! Faster hardware has helped, but only a little. Peer-to-peer epidemic protocols (“gossip”) have never been applied in such systems; we’re fusing these with more conventional protocols and achieving substantial improvements. This also makes our system robust and self-repairing. Existing systems take an all-or-nothing approach to reliability; under stress, we often get nothing. Probabilistic guarantees enable better solutions, but we need provable guarantees of quality.
5
5 Major risks, mitigation. Building a big platform fast, despite profound technical hurdles; but we are not constrained by an existing product to sell, and we have already demonstrated some solutions in the SRS seedling. Users demand standards: we’re extending the Web Services architecture and tools. Focus on real needs of military users: we work closely with AFRL (JBI) and Raytheon (Navy). What about the baseline (scenario II) and quantitative metrics and goals? Deferred until the last 15 minutes of the talk.
6
6 Expected major achievement? QuickSilver will represent a breakthrough technology for building new GIG/NCES applications… applications that operate reliably even under stress that cripples existing COTS solutions… that need far less hand-holding from the application developer, deployment team, and systems administrator (saving money!)… and that enable powerful new information-enabled applications for uses like self-managed sensor networks, new real-time information tools for urban warfare, and control and exploitation of autonomous vehicles. We’ll enable the military to take GIG concepts into domains where commercial products just can’t go!
7
7 Our topic: GIG and NCES platforms. Military computing systems are growing larger and more complex, and must operate “unattended.” With existing technology they are far too expensive to develop, require much too much time to deploy, and are insecure and too easily disrupted. QuickSilver brings SRS concepts to the table.
8
8 How are big systems structured? Typically a “data center” of web servers Some human-generated traffic Some automatic traffic from WS clients Front-end servers connected to a pool of back-end application “services” (new applications on clusters and wrapped legacy applications) Publish-subscribe very popular Sensor networks have similarities although they lack this data center “focus”
9
9 GIG/NCES (and SOA) vision. Pub-sub combined with point-to-point communication technologies like TCP. Clients connect via “front-end interface systems” to a pool of LB services; legacy applications are reached through wrappers. [Diagram: clients, front-end interfaces, replicated LB services, wrapped legacy apps.]
10
10 Big sensor networks? QuickSilver will also be useful in, e.g., sensor networks We’re focused on fixed mesh of sensors using wireless ad-hoc communication, mobile query sources, QuickSilver as the middleware
11
11 How to build big systems today? The programmer is on his own! He is expected to use GIG/NCES standards, based on Service Oriented Architectures (SOAs), yet there is no support for this architecture as a whole; the focus is on isolated aspects, like legacy wrappers. Existing SOAs focus on a single client and a single server, with no attention to performance, stability, or scale. The structure of the data center is overlooked! This results in high costs and lower quality solutions.
12
12 Drill down: An example. Many services (not all) will be RAPS of RACS. RAPS: a reliable array of partitioned services. RACS: a reliable array of cluster-structured server processes. Example: General Pershing searching for “Faluja SITREP 11-22-04 0900h”. Pmap “Faluja”: {x, y, z} (equivalent replicas); here, y gets picked, perhaps based on load. [Diagram: the RAPS is a set of RACS, labeled x, y, z.]
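To make the pmap idea above concrete, here is a minimal sketch (hypothetical names and data, not the QuickSilver implementation) of routing a query key to one of several equivalent replicas in a RACS, picking the least-loaded one:

```python
# Hypothetical pmap-style routing: a key ("Faluja") maps to a set of equivalent
# replicas (a RACS), and one replica is picked by current load. Names and
# structures are illustrative, not the QuickSilver implementation.

partition_map = {
    "Faluja": ["x", "y", "z"],      # equivalent replicas for this partition
    "Mosul":  ["p", "q"],
}

replica_load = {"x": 0.82, "y": 0.31, "z": 0.65, "p": 0.10, "q": 0.55}

def route(key: str) -> str:
    """Return the least-loaded replica serving the partition for `key`."""
    replicas = partition_map[key]
    return min(replicas, key=lambda r: replica_load[r])

print(route("Faluja"))   # picks "y", the lightest-loaded replica
```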
13
13 Multiple datacenters. Services are hosted at data centers but accessible system-wide. Logical partitioning of services: logical services map to a physical resource pool (via the pmap and l2P map), perhaps many to one. One application can be a source of both queries and updates. Operators can control the pmap, l2P map, and other parameters. Large-scale multicast is used to disseminate updates. [Diagram: query and update sources, data centers A and B, server pool.]
14
14 Problems you must solve by hand: membership (within a RACS, of the service, of services in data centers); communication (multicast, streaming media); resource management (pool of machines, set of services, subdivision into RACS); fault-tolerance; consistency.
15
15 Replication The unifying “concept” here? Replication within a clustered service “Notification” in publish-subscribe apps. Replicated system configuration data Replication of streaming media Existing platforms lack replication tools or provide them in small-scale forms
16
16 QuickSilver vision. We’ll develop a new generation of solutions that, at their core, offer scalable replication; are presented to the user through GIG/NCES interfaces (Web Services, CORBA); and are fast, stable, self-managed, and self-repairing when disrupted.
17
17 Core challenges. To solve our problem: reduce the big challenge to smaller ones; tackle these using new conceptual tools; then integrate the solutions into a publish-subscribe platform; and apply the result to high-value scenarios.
18
18 Milestones (timeline 9/04 through 12/05): scalable reliable multicast (many receivers, “groups”); time-critical event notification; management and self-repair; streaming real-time media data; scalable content filtering; integration into the core platform. Phases: develop baselines and the overall architecture, solve key subproblems, integrate into the platform, deliver to early users.
19
19 Large scale makes it hard! Want: reliability; performance (publish rates, latency, recovery time); scalability (number of participants, number of topics, subscription or failure rates); self-tuning; nice interfaces. Structured solution: detecting regularities, introducing some structure, sophisticated methods, re-adjusting dynamically.
20
20 Techniques: detecting overlap patterns; IP multicast; buffering; aggregation and routing; gossip (structured); receivers forwarding data; flow control; reconfiguring upon failure; self-monitoring; reconfiguring for speed-up; modular structure; reusable hot-plug modules. The system: ~65,000 lines of C#, modular architecture, testing on a cluster.
21
21 Drill down: How will we do it? Combine scalable multicast… Uses Peer to Peer gossip to enhance reliability of a scalable multicast protocol Achieves dramatic scalability improvements … with a scalable “groups” framework Uses gossip to take many costly aspects of group management “offline” Slashes costs of huge numbers of groups!
22
22 Reliable multicast is too “fragile”: most members are healthy… but one is slow.
23
23 Performance drops with scale. [Figure: virtually synchronous Ensemble multicast protocols; average throughput on non-perturbed members vs. perturb rate, for group sizes 32, 64, 96, and 128.]
24
24 Gossip 101 Suppose that I know something I’m sitting next to Fred, and I tell him Now 2 of us “know” Later, he tells Mimi and I tell Anne Now 4 This is an example of a push epidemic Push-pull occurs if we exchange data
25
25 Gossip scales very nicely. Participants’ loads are independent of size; network load is linear in system size; information spreads in log(system size) time. [Figure: fraction infected, 0.0 to 1.0, vs. time.]
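The log(system size) claim is easy to see with a toy simulation. The sketch below is purely illustrative (it is not QuickSilver code) and makes the simplifying assumption that every node that knows the rumor pushes it to one uniformly random peer per round:

```python
# Illustrative push-epidemic simulation: each round, every node that knows the
# rumor tells one peer chosen uniformly at random.
import math
import random

def push_rounds(n: int, seed: int = 0) -> int:
    random.seed(seed)
    infected = {0}                           # one node starts with the rumor
    rounds = 0
    while len(infected) < n:
        rounds += 1
        for _ in list(infected):             # snapshot: new nodes gossip next round
            infected.add(random.randrange(n))
    return rounds

n = 10_000
print("rounds:", push_rounds(n))             # typically around 20 for n = 10,000
print("log2(n) + ln(n) ~", round(math.log2(n) + math.log(n), 1))
```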
26
26 Gossip in distributed systems We can gossip about membership Need a bootstrap mechanism, but then discuss failures, new members Gossip to repair faults in replicated data “I have 6 updates from Charlie” If we aren’t in a hurry, gossip to replicate data too
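As a rough illustration of the “I have 6 updates from Charlie” style of repair, the hedged sketch below reconciles two replicas by comparing per-source update counts and pulling whatever is missing; the data structures are assumptions, not the actual protocol state:

```python
# Hypothetical anti-entropy exchange: each replica keeps a per-source update
# list; during gossip, peers compare lengths and pull the updates they lack.
def anti_entropy(a: dict, b: dict) -> None:
    """a and b map source -> list of updates; reconcile both in place."""
    for source in set(a) | set(b):
        ua, ub = a.setdefault(source, []), b.setdefault(source, [])
        if len(ua) < len(ub):
            ua.extend(ub[len(ua):])          # a pulls the updates it is missing
        elif len(ub) < len(ua):
            ub.extend(ua[len(ub):])          # b pulls the updates it is missing

replica1 = {"Charlie": ["u1", "u2", "u3", "u4", "u5", "u6"]}
replica2 = {"Charlie": ["u1", "u2"], "Mimi": ["m1"]}
anti_entropy(replica1, replica2)
assert replica1 == replica2                  # both now hold all known updates
```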
27
27 Bimodal Multicast (ACM TOCS 1999). Send multicasts to report events; some messages don’t get through. Periodically, but not synchronously, gossip about messages. [Example exchange: “The gossip source has a message from Mimi that I’m missing, and he seems to be missing two messages from Charlie that I have.” “Here are some messages from Charlie that might interest you. Could you send me a copy of Mimi’s 7th message?” “Mimi’s 7th message was: The meeting of our Q exam study group will start late on Wednesday…”]
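A skeletal version of this pattern (best-effort multicast plus periodic gossip repair) is sketched below. It illustrates the idea only; it is not the Bimodal Multicast implementation from the TOCS paper, and the loss rate, group size, and data structures are arbitrary assumptions:

```python
# Illustrative bimodal-multicast skeleton: a lossy multicast delivers most
# messages; periodic background gossip repairs the rest.
import random

def lossy_multicast(members, msg, loss=0.1):
    for m in members:
        if random.random() > loss:
            m["log"][msg["seq"]] = msg              # delivered
        # else: silently lost; gossip will repair it later

def gossip_round(members):
    for m in members:
        peer = random.choice(members)
        # exchange digests (sets of sequence numbers), then fill gaps both ways
        for seq in set(m["log"]) - set(peer["log"]):
            peer["log"][seq] = m["log"][seq]
        for seq in set(peer["log"]) - set(m["log"]):
            m["log"][seq] = peer["log"][seq]

members = [{"log": {}} for _ in range(50)]
for seq in range(100):
    lossy_multicast(members, {"seq": seq, "body": f"event {seq}"})
for _ in range(5):                                   # a few background gossip rounds
    gossip_round(members)
print(min(len(m["log"]) for m in members))           # almost surely 100 after repair
```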
28
28 Bimodal multicast in baseline scenario Bimodal multicast scales well Baseline multicast: throughput collapses under stress
29
29 Bimodal Multicast summary. Imposes a constant overhead on participants. Many optimizations and tricks are needed, but nothing that isn’t practical to implement; the hardest issues involve “biased” gossip to handle LANs connected by WAN long-haul links. Reliability is easy to analyze mathematically using epidemic theory: we use the theory to derive optimal parameter settings, and the theory also lets us predict behavior. Despite the simplified model, the predictions work!
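As a flavor of the epidemic-theory analysis (illustrative, not the paper's exact derivation): if k of n members already hold a message and each gossips to one peer chosen uniformly at random, a member that still lacks the message is missed by all k gossips with probability

```latex
\[
  \Pr[\text{still missing after the round}]
    \;=\; \left(1 - \frac{1}{n}\right)^{k} \;\approx\; e^{-k/n}.
\]
```

Once a constant fraction of the group holds the message, this residual probability falls exponentially with each additional round, which is the kind of relationship that lets overhead be traded against a quantifiable delivery guarantee.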
30
30 So we have part of our solution. To multicast in many groups: map down to IP multicast in popular overlap regions; multicast unreliably; then, in the background, use gossip to repair omissions. Gossip is also used for flow control (rate based) and surge handling (deals with bursty traffic).
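The slide's mention of rate-based flow control could be realized in many ways; one common, generic mechanism is a token bucket, sketched below purely as an assumption about how a sender might pace its multicasts (it is not QuickSilver's flow-control algorithm):

```python
# Generic token-bucket rate limiter, one common way to implement rate-based
# flow control for a multicast sender (an assumption, not QuickSilver's scheme).
import time

class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst          # tokens/sec, max burst size
        self.tokens, self.last = burst, time.monotonic()

    def try_send(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                                  # caller should back off or queue

bucket = TokenBucket(rate=500, burst=50)              # ~500 sends/sec, bursts of 50
if bucket.try_send():
    pass                                              # hand the packet to IP multicast
```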
31
31 Techniques: detecting overlap patterns; IP multicast; buffering; aggregation and routing; gossip (structured); receivers forwarding data; flow control; reconfiguring upon failure; self-monitoring; reconfiguring for speed-up; modular structure; reusable hot-plug modules. The system: ~65,000 lines of C#, modular architecture, testing on a cluster.
32
32 Other components of QuickSilver? Astrolabe: developed during the seedling; a hierarchical distributed database that also uses gossip… and is used for self-organizing, scalable, robust distributed management and control. Slingshot: uses FEC for low-latency, time-critical event notification. ChunkySpread: focus is on streaming media. Event Filter: rapidly scans the event stream to identify relevant data.
33
33 State Merge: Core of Astrolabe epidemic

swift.cs.cornell.edu's copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1976  2.7   1          0      4.1
  cardinal  2201  3.5   1          1      6.0

cardinal.cs.cornell.edu's copy:
  Name      Time  Load  Weblogic?  SMTP?  Word Version
  swift     2011  2.0   0          1      6.2
  falcon    1971  1.5   1          0      4.1
  cardinal  2201  3.5   1          0      6.0
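The merge the slide depicts is timestamp-based: for each row, the gossiping agents keep whichever copy carries the newer Time value. A minimal sketch, with hypothetical data and no claim to match Astrolabe's actual code:

```python
# Illustrative Astrolabe-style state merge: each agent holds one row per
# machine; gossiping agents keep, for every row, the copy with the newer Time.
def merge(mine: dict, theirs: dict) -> dict:
    """Rows are keyed by machine name; each row carries a 'Time' field."""
    merged = dict(mine)
    for name, row in theirs.items():
        if name not in merged or row["Time"] > merged[name]["Time"]:
            merged[name] = row
    return merged

swift_view = {
    "falcon":   {"Time": 1976, "Load": 2.7},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
cardinal_view = {
    "falcon":   {"Time": 1971, "Load": 1.5},
    "cardinal": {"Time": 2201, "Load": 3.5},
}
print(merge(swift_view, cardinal_view)["falcon"])   # keeps the Time=1976 row
```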
34
34 Scaling up… and up… With a stack of domains, we don’t want every system to “see” every domain; the cost would be huge. So instead, we’ll see a summary. [Diagram: the per-machine table from the previous slide, replicated at many leaf domains.]
35
35 Build a hierarchy using a P2P protocol that “assembles the puzzle” without any servers. An SQL query “summarizes” the data, and the dynamically changing query output is visible system-wide.

Region summary:
  Name   Avg Load  WL contact    SMTP contact
  SF     2.6       123.45.61.3   123.45.61.17
  NJ     1.8       127.16.77.6   127.16.77.11
  Paris  3.1       14.66.71.8    14.66.71.12

San Francisco leaf domain:
  Name      Load  Weblogic?  SMTP?  Word Version
  swift     2.0   0          1      6.2
  falcon    1.5   1          0      4.1
  cardinal  4.5   1          0      6.0

New Jersey leaf domain:
  Name     Load  Weblogic?  SMTP?  Word Version
  gazelle  1.7   0          0      4.5
  zebra    3.2   0          1      6.2
  gnu      0.5   1          0      6.2
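A minimal sketch of the per-region aggregation that the summary table represents (roughly a GROUP BY over the leaf rows); names and structures are illustrative assumptions, not Astrolabe's aggregation API:

```python
# Illustrative per-region aggregation, roughly what the slide's SQL summary
# computes over the leaf-domain tables.
from statistics import mean

leaf_domains = {
    "SF": [{"name": "swift", "load": 2.0},
           {"name": "falcon", "load": 1.5},
           {"name": "cardinal", "load": 4.5}],
    "NJ": [{"name": "gazelle", "load": 1.7},
           {"name": "zebra", "load": 3.2},
           {"name": "gnu", "load": 0.5}],
}

# Roughly: SELECT region, AVG(load) FROM machines GROUP BY region
summary = {region: round(mean(row["load"] for row in rows), 1)
           for region, rows in leaf_domains.items()}
print(summary)    # {'SF': 2.7, 'NJ': 1.8}
```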
36
36 Astrolabe “compared” to multicast Both used gossip in similar ways But here, data comes from all nodes in a system, not just a few sources Rates are low… hence overhead is low… … but invaluable when orchestrating adaptation and self-repair Astrolabe is extremely robust to disruption Hierarchy is self-constructed, self-healing
37
37 Remaining time: 2 baselines First focuses on latency of real-time event notification Second on speed of event filtering Both involve key elements of QuickSilver and both are easy to compare with prior state-of-the-art
38
38 Slingshot: a time-critical event notification protocol. Idea: probabilistic real-time goals; pay a higher overhead but reduce the frequency of missed deadlines. Already yielding multiple order-of-magnitude improvements in latency and throughput!
39
39 Redefining Time-Critical Probabilistic Guarantees: With x% overhead, y% data is delivered within t seconds. Data ‘expires’: stock quotes, location updates Urgency-Sensitive: New data is prioritized over old Application runs in COTS settings, co-existing with other non-time-critical applications on the same machine
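One way to picture the “data expires” and urgency-sensitive rules is a delivery queue that drops stale events and serves the newest first. The sketch below is a hypothetical illustration of that policy, not Slingshot's scheduler:

```python
# Hypothetical urgency-sensitive delivery queue: expired events are dropped,
# and newer events are delivered before older ones.
import heapq
import itertools
import time

class UrgencyQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()

    def publish(self, event, ttl: float):
        now = time.monotonic()
        # newest first: a larger publish time sorts earlier via negation
        heapq.heappush(self._heap, (-now, next(self._seq), now + ttl, event))

    def next_event(self):
        now = time.monotonic()
        while self._heap:
            _, _, deadline, event = heapq.heappop(self._heap)
            if deadline >= now:
                return event          # still fresh: deliver it
            # otherwise the data has "expired" (stale quote, old location): drop it
        return None

q = UrgencyQueue()
q.publish({"track": 17, "lat": 10, "lon": 25}, ttl=2.0)
print(q.next_event())
```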
40
40 Time-Critical Eventing Eventing: Publishers publish events to topics, which are then received by subscribers Applications characterized by many-to-many flow of small, discrete units of data Scalability Dimensions: number of topics numbers of publishers and subscribers per topic degree of subscription overlap
41
41 Slingshot: Receiver-Based FEC Topics are mapped to multicast groups Publishers multicast events unreliably Subscribers constantly exchange error correction packets for message history suffixes
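A toy example of receiver-side FEC using a single XOR parity packet over a suffix of the message history: a subscriber that lost exactly one message in that suffix can rebuild it locally from the parity, without waiting for a retransmission. This only illustrates the principle; it is not Slingshot's actual coding scheme:

```python
# Tiny XOR-parity illustration of receiver-side FEC: one parity packet over the
# last k messages lets a receiver that lost exactly one of them reconstruct it.
from functools import reduce

def xor_parity(packets):
    size = max(len(p) for p in packets)
    padded = [p.ljust(size, b"\0") for p in packets]
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), padded)

history = [b"event-1", b"event-2", b"event-3", b"event-4"]
parity = xor_parity(history)                            # exchanged between subscribers

received = [history[0], history[1], history[3]]         # event-3 was lost
recovered = xor_parity(received + [parity]).rstrip(b"\0")
print(recovered)                                        # b'event-3'
```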
42
42 Slingshot: Tunable Reliability
43
43 Slingshot: Scalability in Topics
44
44 A second baseline Scalable Stateful Content Filtering (Gehrke) Arises when deciding which events to deliver to the client system Usually pub-sub is “coarse grained”, then does content filtering A chance to apply security policy, prune unwanted data… but can be slow
45
45 Model and problem statement. Model: an event is a set of (attribute, value) pairs; example: an event notifying the location of a vehicle, {(Type, “Tank”), (Latitude, 10), (Longitude, 25)}. A subscription is a set of predicates on event attributes (conjunctive semantics); example: a subscription looking for tanks in the area, {(Type = “Tank”), (8 < Latitude < 12)}. Equality and range predicates are supported. Problem: given a (large) set of subscriptions, S, and a stream of events, E, find, for each event e in E, the set of subscriptions whose predicates are satisfied by e. Scalability: with the event rate and with the number of subscriptions.
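A minimal matcher for exactly this model (events as attribute/value pairs, subscriptions as conjunctions of equality and range predicates) is sketched below. It checks subscriptions one by one; a naive sketch, since the real filtering engine must do better than a linear scan to meet the scalability goals on this slide:

```python
# Minimal matcher for the slide's model: an event is a dict of (attribute, value)
# pairs; a subscription is a conjunction of equality and range predicates.
def matches(event: dict, subscription: list) -> bool:
    for attr, op, value in subscription:
        if attr not in event:
            return False
        if op == "=" and event[attr] != value:
            return False
        if op == "range" and not (value[0] < event[attr] < value[1]):
            return False
    return True          # every predicate held (conjunctive semantics)

event = {"Type": "Tank", "Latitude": 10, "Longitude": 25}
sub = [("Type", "=", "Tank"), ("Latitude", "range", (8, 12))]
print(matches(event, sub))          # True

def match_all(event, subscriptions):
    """Return the ids of all subscriptions satisfied by this event."""
    return [sid for sid, s in subscriptions.items() if matches(event, s)]

print(match_all(event, {"s1": sub, "s2": [("Type", "=", "UAV")]}))   # ['s1']
```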
46
46 What About State? Model: an event is a set of (attribute, value) pairs; example: an event notifying the location of a vehicle, {(Type, “Tank”), (Latitude, 10), (Longitude, 25)}. A subscription is a query over sequences of events; example: a subscription looking for adversaries with suspicious behavior, “Notify me if the enemy first visits location A and then location B.” Subscriptions need to maintain state across events. Problem: given a (large) set of stateful subscriptions, S, and a stream of events, E, find, for each event e in E, the set of subscriptions whose predicates are satisfied by e.
47
47 Managing State. Use a linear finite state automaton with self-loops to encapsulate the state.
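A sketch of such a linear automaton with self-loops, for a sequence subscription like “first location A, then location B”: each stage waits (self-loop) on non-matching events and advances on a match. Illustrative only; the predicates and event shapes are assumptions:

```python
# Sketch of a linear automaton with self-loops for a stateful subscription such
# as "notify me if an enemy first visits location A and then location B".
class SequenceSubscription:
    def __init__(self, stages):
        self.stages = stages      # list of predicates, one per stage
        self.state = 0            # index of the stage we are waiting to satisfy

    def feed(self, event) -> bool:
        """Feed one event; return True once the final stage has been reached."""
        if self.state < len(self.stages) and self.stages[self.state](event):
            self.state += 1       # advance; otherwise stay put (the self-loop)
        return self.state == len(self.stages)

sub = SequenceSubscription([
    lambda e: e.get("location") == "A",
    lambda e: e.get("location") == "B",
])
events = [{"location": "C"}, {"location": "A"}, {"location": "C"}, {"location": "B"}]
print(any(sub.feed(e) for e in events))     # True: A was seen, then later B
```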
48
48 Baseline System Architecture App Server
49
49 Experimental Results Y axis in Log scale!
50
50 Putting it all together (timeline 9/04 through 12/05): scalable reliable multicast (many receivers, “groups”); time-critical event notification; management and self-repair; streaming real-time media data; scalable content filtering; integration into the core platform. Phases: develop baselines and the overall architecture, solve key subproblems, integrate into the platform, deliver to early users.
51
51 Will QuickSilver solve our problem? Services are hosted at data centers but accessible system-wide. Logical partitioning of services: logical services map to a physical resource pool (via the pmap and l2P map), perhaps many to one. One application can be a source of both queries and updates; operators can control the pmap, l2P map, and other parameters. Large-scale multicast is used to disseminate updates, and scalable multicast is used to update system-wide parameters and management controls. Within and between groups, we need stronger reliability properties and higher speeds; groups are smaller but there are many of them. We need a way to monitor and manage the collection of services in our data center: a good match to Astrolabe. We need a way to monitor and manage the machines in the server pool: another good match to Astrolabe. We’re exploring the limits beyond which a strong (non-probabilistic) replication scheme is needed in clustered services; QuickSilver will support virtual synchrony too. [Diagram: query and update sources, data centers A and B, server pool.]
52
52 DoD “Typical” Baseline Data - 1. According to a study by the Congressional Budget Office for the Department of the Army in 2003, bandwidth demands for the Army alone will exceed bandwidth supply by a factor of between 10:1 and 30:1 by the year 2010 (The Army’s Bandwidth Bottleneck, a CBO report, August 2003, http://www.cbo.gov/ftpdoc.cfm?index=4500&type=1). The growth rates, data volumes, and characterization of networked transactions described in a DCGS Block 10.2 Navy study are consistent with the CBO study. In many cases the DCGS-N study predicts earlier bandwidth saturation, given the disparate rates of growth in total network capacity compared to the technological innovation that will necessarily increase demand. Throughput requirements are 3-10 Mbps for imagery data and 200 Kbps-1 Mbps for other forms (see next slide).
53
53 DoD “Typical” Baseline Data - 2

  Input              “Typical” DoD scenario
  System size        100s-1000s of nodes
  System topology    Hierarchical networks with “bridges” using LAN/WAN, SATCOM, and wireless RF (LOS & BLOS)
  Event type         Multiple situational awareness updates (binary/text/XML); plans and reports (text); imagery
  Event rate         Multiple SA: 100s/sec (size = 1KB per entity); plans & reports: aperiodic/sporadic (size = 10KB); imagery: aperiodic/continuous (size = 50MB)
  Perturbation rate

Most of this sort of data is short-lived yet requires processing in a time-valued ordering scheme.
54
54 DoD Challenges for SRS (2010-2025) – Network Oriented Granular, Scalable Redundancy: USN FORCEnet. Source: NETWARCOM Official FORCEnet World Wide Web site, http://forcenet.navy.mil/fnep/FnEP_Brief.zip
55
55 DoD Challenges for SRS (2010-2025) – Network Oriented Granular, Scalable Redundancy: Ground Sensor Netting Source: Raytheon Company © 2004 Raytheon Company. All Rights Reserved. Unpublished Work