Ensemble: A Tool for Building Highly Assured Networks Professor Kenneth P. Birman Cornell University

Ensemble Project Goals Provide a powerful and flexible technology for “hardening” distributed applications by introducing security and reliability properties Make the technology available to DARPA investigators and the Internet community Apply Ensemble to help develop a prototype of the Highly Assured Network

Today Review the recent past for the effort –Emphasis was middleware –About minutes total Then focus on 1997 goals and milestones –More attention to security opportunities and standards –Shift emphasis to lower levels of the network –Ensemble “manages” protocol stacks, servers

Why Ensemble? With the Isis Toolkit and the Horus system, we demonstrated that virtually synchronous process groups could be a powerful tool But Isis was inflexible and monolithic Ensemble is layered and can hide behind various interfaces (C, C++, Java, Tcl/Tk…) Ensemble is coded in ML, which facilitates automated code transformations

Key Idea in Ensemble: Process Groups Processes within a network cooperate in groups Group tools support group communication (multicast), membership, and failure reporting Embedded beneath interfaces specialized to different uses –Cluster-style server management –WAN architecture of connected servers –Groups of PC clients for “groupware”, CSCW

Group Members could be interactive processes or automated applications

Processes Communicate Through Identical Multicast Protocol Stacks [Figure: three processes, each running the same protocol stack of encrypt / vsync / ftol layers]

Superimposed Groups in Application With Multiple Subsystems [Figure: two overlapping process groups, each member running an encrypt / vsync / ftol stack] Yellow group for video communication, orange for control and coordination

Layered Microprotocols in Ensemble Interface to Ensemble is extremely flexible Ensemble manages the group abstraction; group semantics (membership, actions, events) are defined by a stack of modules [Figure: a protocol stack built from encrypt / vsync / filter / sign / ftol modules] Ensemble stacks plug-and-play modules to give design flexibility to the developer (sketched below)
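
The plug-and-play composition can be pictured as ordinary function composition: each microprotocol transforms events on the way down (sends) and on the way up (deliveries), and reordering or swapping modules reconfigures the stack. A minimal OCaml sketch of that idea, with toy layers and event types of our own invention; this is not Ensemble's actual interface:

    (* Toy model of plug-and-play microprotocol layers; the names and
       types are illustrative, not Ensemble's real API. *)

    type event =
      | Send of string      (* payload headed down toward the network *)
      | Deliver of string   (* payload headed up toward the application *)

    (* A layer transforms events on the send (down) and delivery (up) paths. *)
    type layer = { down : event -> event; up : event -> event }

    (* Stand-in "encrypt" layer: reverses the payload both ways. *)
    let reverse s =
      String.init (String.length s) (fun i -> s.[String.length s - 1 - i])
    let encrypt = {
      down = (function Send m -> Send (reverse m) | e -> e);
      up   = (function Deliver m -> Deliver (reverse m) | e -> e);
    }

    (* Stand-in "ftol" layer: stamps a sequence number on the way down
       and strips it on the way up, so gap detection could be layered in. *)
    let seq = ref 0
    let ftol = {
      down = (function
        | Send m -> incr seq; Send (string_of_int !seq ^ ":" ^ m)
        | e -> e);
      up = (function
        | Deliver m ->
            (match String.index_opt m ':' with
             | Some i -> Deliver (String.sub m (i + 1) (String.length m - i - 1))
             | None -> Deliver m)
        | e -> e);
    }

    (* A stack is a list of layers: sends traverse it top-down,
       deliveries bottom-up. *)
    let send_path stack e = List.fold_left (fun e l -> l.down e) e stack
    let deliver_path stack e = List.fold_left (fun e l -> l.up e) e (List.rev stack)

    let () =
      let stack = [encrypt; ftol] in
      match send_path stack (Send "hello") with
      | Send wire ->
          Printf.printf "on the wire: %s\n" wire;
          (match deliver_path stack (Deliver wire) with
           | Deliver m -> Printf.printf "delivered: %s\n" m
           | _ -> ())
      | _ -> ()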

Why Process Groups? Used for replication, load-balancing, and transparent fault-tolerance in servers Useful for secure multicast key management Can support flexible firewalls and filters Groups of clients in a conference share media flows, agree on who is involved and what they are doing, manage security keys and QoS, etc. WAN groups for adaptive, partitionable systems

Virtual Synchrony Model [Figure: timeline for processes p, q, r, s, t. G0 = {p,q}; r and s request to join; r, s added with state transfer, giving G1 = {p,q,r,s}; p fails (crash), giving G2 = {q,r,s}; t requests to join and is added with state transfer, giving G3 = {q,r,s,t}] … to date, the only widely adopted model for consistency and fault-tolerance in highly available networked applications
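
The heart of the model is that every surviving member observes the same sequence of membership views, and each message is delivered in the view in which it was sent. The timeline above, rendered as data in a toy OCaml form (types and values are illustrative only):

    (* The slide's view sequence as data: in virtual synchrony, all
       surviving members see the same sequence of views. Illustrative. *)

    type view = { id : int; members : string list }

    let views = [
      { id = 0; members = ["p"; "q"] };               (* G0 *)
      { id = 1; members = ["p"; "q"; "r"; "s"] };     (* G1: r, s join (state xfer) *)
      { id = 2; members = ["q"; "r"; "s"] };          (* G2: p fails *)
      { id = 3; members = ["q"; "r"; "s"; "t"] };     (* G3: t joins (state xfer) *)
    ]

    let () =
      List.iter
        (fun v -> Printf.printf "G%d = {%s}\n" v.id (String.concat "," v.members))
        views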

Horus/Ensemble Performance A major focus for Van Renesse Over U-Net: 85,000 to 100,000 small multicasts per second, saturating a 155Mbit ATM link, with end-to-end latencies as low as 65µs We obtain this high performance by “protocol compilation” of our stacks Ensemble is coded in ML, which facilitates automated code transformations

Getting those impressive numbers First we had to move off the standard UNIX communication stack: the problem is that UNIX does so much copying that latency and throughput are always very poor We used U-Net, a zero-copy communication stack from Thorsten von Eicken’s group; it runs on UNIX and NT

But U-Net Didn’t Help Very Much Layers have intrinsic costs: –Each tends to assume that it will run “by itself”, hence each has its own header format; even a single bit will need to be padded to 32 or 64 bits –Many layers only touch a small percentage of messages, yet each layer “sees” every message –Little opportunity for amortization of costs

Overhead [Figure: a message whose data payload is preceded by separate headers for the encrypt, vsync, and ftol layers]

Van Renesse: Reorganizing Layers First create a notion of virtual headers –A layer says “I need 2 bits and an 8-bit counter” –Dynamically (at run time), the Horus system “compiles” the layers and builds shared message headers –Each layer accesses its fields through macros –Then separate the fields into often-changing, rarely-changing, and static header information; send the static part once, the rarely-changing part only when it changes, and the dynamic part on every message (a sketch of the layout compilation follows)
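
A minimal sketch of the layout-compilation step, with invented declaration and accessor types: Horus works at bit granularity and exposes the fields through C macros, whereas this OCaml toy packs one-byte fields and returns accessor functions instead:

    (* Layers declare the header fields they need; a tiny "compiler"
       assigns offsets in one shared header and hands back accessors.
       Illustrative only: the real system packs at bit granularity and
       also classifies fields as static / rarely changing / dynamic. *)

    type field_decl = { layer : string; field : string; bytes : int }
    type accessor = { get : Bytes.t -> int; set : Bytes.t -> int -> unit }

    let compile (decls : field_decl list) : int * (string * accessor) list =
      let off = ref 0 in
      let accs =
        List.fold_left
          (fun acc d ->
            let o = !off in
            off := !off + d.bytes;
            (* one-byte fields keep the sketch simple *)
            let get buf = Char.code (Bytes.get buf o) in
            let set buf v = Bytes.set buf o (Char.chr (v land 0xff)) in
            (d.layer ^ "." ^ d.field, { get; set }) :: acc)
          [] decls
      in
      (!off, List.rev accs)

    let () =
      (* e.g. a vsync layer wants a counter, an ftol layer wants flags *)
      let size, accs =
        compile [ { layer = "vsync"; field = "seq";   bytes = 1 };
                  { layer = "ftol";  field = "flags"; bytes = 1 } ]
      in
      let hdr = Bytes.make size '\000' in
      let seq = List.assoc "vsync.seq" accs in
      seq.set hdr 42;
      Printf.printf "header: %d bytes, vsync.seq = %d\n" size (seq.get hdr)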

Impact of header optimizations? The average message in Horus used to carry a hundred bytes or more of header data Now the true size of the header drops by 50% due to the compaction opportunity The highly dynamic header is just a few bytes One bit signals the presence of the “rarely changing” header information

Next step: Code restructuring View each original Horus layer as having 3 parts: –“Pre” computation (can be done before seeing the message) –Data-touching computation (needs to see the message) –“Post” computation (can be delayed until the message is sent) Move the “pre” computation to after the “post” and do both off the critical path The effect is to slash latencies on the critical path (sketched below)
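
A toy OCaml sketch of the restructuring, with invented stage types: only the data-touching stage runs on the critical path, while the post-work for message k and the pre-work for message k+1 run together afterwards:

    (* Split a layer into pre / data / post stages and keep one unit of
       pre-work "banked", so the per-message critical path is a single
       data-touching call. Names and stages are illustrative. *)

    type 'p staged = {
      pre  : unit -> 'p;              (* runnable before the message exists *)
      data : 'p -> string -> string;  (* must see the message: critical path *)
      post : unit -> unit;            (* deferrable until after the send *)
    }

    let seq = ref 0
    let sent = ref 0
    let layer = {
      pre  = (fun () -> incr seq; !seq);                (* precompute a stamp *)
      data = (fun n msg -> string_of_int n ^ ":" ^ msg);
      post = (fun () -> incr sent);                     (* deferred bookkeeping *)
    }

    let banked = ref (layer.pre ())

    let send msg =
      let wire = layer.data !banked msg in  (* critical path: data only *)
      (* ... hand wire to the network here ... *)
      layer.post ();                        (* off critical path: post(k) *)
      banked := layer.pre ();               (* ... and pre(k+1) *)
      wire

    let () =
      print_endline (send "alpha");
      print_endline (send "beta");
      Printf.printf "sent: %d\n" !sent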

Three stages to a layer [Figure: a layer divided into pre-computation, data-touching computation, and post-computation]

Restructured layer [Figure: for message k, only the data-touching computation remains on the critical path; the post-computation for message k runs together with the pre-computation for message k+1]

Final step: Batch messages Look for places where lots of messages pass by Combine (if safe) into groups of messages blocked for efficient use of the network Effect is to amortize costs over many messages at a time … but a problem emerges: all of this makes Horus messy, much less modular
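
A minimal batching sketch, assuming an invented flush threshold and newline framing: small sends accumulate and go out as one network write, amortizing the per-message costs. A real implementation would also flush on a timer and verify that combining is safe for the layers involved:

    (* Accumulate small messages and send them as one blocked batch. *)
    let pending : string list ref = ref []
    let batch_size = 8

    let flush () =
      match !pending with
      | [] -> ()
      | msgs ->
          let batch = String.concat "\n" (List.rev msgs) in
          pending := [];
          (* one network send for the whole batch *)
          Printf.printf "sending batch: %d msgs, %d bytes\n"
            (List.length msgs) (String.length batch)

    let send msg =
      pending := msg :: !pending;
      if List.length !pending >= batch_size then flush ()

    let () =
      for i = 1 to 20 do send (Printf.sprintf "msg-%d" i) done;
      flush ()  (* drain the tail *)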

Ensemble: Move to ML The idea now is to offer a C/C++/Java interface but build the stack itself in ML NuPrl can manipulate the ML stacks offline Hayden exploits this to obtain the same performance as in Horus but with less complexity

Example: Partial Evaluation Idea Annotate the Ensemble stack components with indications of the critical path: –Green messages always go left; red messages go right –For green messages, this loop only loops once –… etc. Now NuPrl can partially evaluate a stack: once for “message is green”, once for “red” (illustrated below)
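
A hand-worked OCaml illustration of what the specialization produces, with invented layer bodies: the generic stack tests the color inside every layer, while the specialized residues hoist that test into a single top-level branch:

    (* Generic code branches on color at each layer. *)
    type color = Green | Red

    let layer (c, msg) =
      match c with
      | Green -> (c, msg ^ "+g")   (* green messages "go left" *)
      | Red   -> (c, msg ^ "+r")   (* red messages "go right" *)

    let generic_stack m = layer (layer m)

    (* Residues a partial evaluator would emit: the per-layer color
       tests are gone. *)
    let green_stack msg = msg ^ "+g" ^ "+g"
    let red_stack msg = msg ^ "+r" ^ "+r"

    (* One if above two compacted stacks, equivalent to the original. *)
    let optimized (c, msg) =
      match c with
      | Green -> (c, green_stack msg)
      | Red   -> (c, red_stack msg)

    let () =
      assert (generic_stack (Green, "m") = optimized (Green, "m"));
      assert (generic_stack (Red, "m") = optimized (Red, "m"));
      print_endline "specialized stacks agree with the generic stack"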

Why are two stacks better than one? Now have an if statement above two machine-generated stacks: if green … else (red) … Each stack may be much compacted; the critical path is drastically shorter Also enables inline code expansion The result is a single highly optimized stack that is provably equivalent to the original stack! Ensemble performance is even better than Horus

Friedman: Performance isn’t enough Is this blinding performance fast enough for a demanding real-time use? Finding: yes, if Ensemble is used “very carefully” and extra effort is invested; no, if Ensemble is just slapped into place

IN coprocessor example [Figure: an SS7 switch attached to a cluster of external adapter (EA) and query element (QE) processors] The switch asks for help when an 800-number call is sensed External adapter (EA) processors run the query protocol Query element (QE) processors do the 800-number lookup (an in-memory database) Goals: scalable memory without loss of processing performance as the number of nodes is increased A primary-backup scheme is adapted (using small Horus process groups) to provide fault-tolerance with real-time guarantees

Traditional Realtime Approach [Figure sequence: 1. Request received in duplicate by the EA’s 2. Request multicast to selected QE’s 3. QE’s multicast the reply 4. EA’s forward the reply]

Criticism? Heavy overheads to obtain fault-tolerance No “batching” of requests Obvious match with group communication, but the overheads are prohibitive Likely performance? A few hundred requests per second, with delays of 4-6 seconds to “fail over” when a node is taken offline

Friedman’s Realtime Approach [Figure sequence over the EA and QE processors] Ensemble is used to monitor the status (live/faulty, load) of the processing elements; the EA’s hold this data The EA’s batch requests; the primary sends a group at a time to a single QE A QE or EA could fail, and Ensemble needs a few seconds to report this The QE replies to both EA’s, and they forward the result If half of the deadline elapses, the backup EA retries with some other QE; that QE replies, and the EA forwards the reply within the deadline Consistency of the replicated data is key to the correctness of this scheme (a sketch of the retry rule follows)
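
A loose OCaml sketch of the deadline-driven retry rule, using the standard Unix module for timing; the QE names, the stand-in query function, and the 100ms budget (taken from the next slide) are all invented, and a real EA would overlap queries with failure reports from Ensemble:

    (* Try one QE; if no reply arrives and at least half the deadline's
       budget remains, retry against another QE. Stand-in code only. *)
    let deadline = 0.100  (* seconds: the 100ms response target *)

    (* Hypothetical lookup: pretend qe1 is down. *)
    let query qe _number = if qe = "qe1" then None else Some "route-5551212"

    let lookup qes number =
      let rec try_qes budget = function
        | [] -> None
        | qe :: rest ->
            let t0 = Unix.gettimeofday () in
            let reply = query qe number in
            let elapsed = Unix.gettimeofday () -. t0 in
            (match reply with
             | Some r -> Some r
             | None ->
                 (* no reply: retry elsewhere only while at least half
                    the deadline's budget is still left *)
                 if budget -. elapsed > deadline /. 2.0
                 then try_qes (budget -. elapsed) rest
                 else None)
      in
      try_qes deadline qes

    let () =
      match lookup ["qe1"; "qe2"] "800-555-1212" with
      | Some r -> Printf.printf "answered within deadline: %s\n" r
      | None -> print_endline "deadline missed"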

Friedman’s Work Uses Horus/Ensemble to “manage” the cluster Designs special protocols based on Active Messages for batch-style handling of requests Demonstrates 20,000+ “calls” per second even during failures and restarts of nodes, with 98%+ of responses within the 100ms deadline Scalable memory, computing, and the ability to upgrade components are big wins

Broader Project Goals for 1997 Increased emphasis on integration with security standards and the emerging world of Quality of Service guarantees More use of Ensemble to manage protocol stacks external to our system Explore adaptive behavior, especially for secure networks or secured subsystems Emphasis on four styles of computing system

Secure Real-Time Cluster Servers This work extends Friedman’s real-time server architecture to deal with IP fail-over Think of a TCP connection to a cluster server that remains up even if the compute node fails Our effort also deals with session key management so that security properties are preserved as fail-over takes place Goal: a “tool kit” in the Ensemble distribution

Secure Adaptive Networks This work uses Ensemble to manage a subgroup of an Ensemble process group, or a set of “external” communication endpoints Goal is to demonstrate that we can exploit this to dynamically secure a network application that must adapt to changing conditions Can also download protocol stacks at runtime, a form of Active Network behavior

Secure Adaptive Networks [Figure: Ensemble tracks membership in the “core” group; subgroup membership is automatically managed, e.g. the members that “have an ATM link” or are “cleared for sensitive data”]

Secure Adaptive Networks Paper on initial work: “Maestro”, a tool for management of subgroups of a group The initial version didn’t address security issues Now extending it to integrate with our security layers; it will automatically track and manage subgroups

Probabilistic Quality of Service Developing new protocols that scale better by relaxing reliability guarantees These are easiest to understand as having probabilistic quality-of-service properties Our first solution of this sort is now working experimentally; it seems extremely tolerant of the transient misbehavior that seriously degrades performance in Isis and Horus/C

Four target computing environments The network layer itself: Ensemble to coordinate use of IPv6 or RSVP in multicast settings; we see this as a prototype Highly Assured Network Server clustering and fault-tolerance Wide-area file systems and server networks that tolerate partitioning failures User-level tools for building group conferencing and collaboration tools

Deliverables From Effort Ensemble is already available for UNIX platforms, and the port to NT is nearly complete Working with BBN to integrate with AQuA for use in the Quorum program (Gary Koob) R/T cluster tools and WAN partitioning tools available by mid-summer Adaptive and probabilistic tools by late this year