A FIT Event Broker for trustworthy infrastructure monitoring and management António Casimiro University of Lisbon Faculty of Sciences LASIGE – Navigators.

Slides:



Advertisements
Similar presentations
Tivoli Software from IBM Storage Resource Management Webcast
Advertisements

Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Distributed Systems Major Design Issues Presented by: Christopher Hector CS8320 – Advanced Operating Systems Spring 2007 – Section 2.6 Presentation Dr.
Database Architectures and the Web
Microsoft ® System Center Configuration Manager 2007 R3 and Forefront ® Endpoint Protection Infrastructure Planning and Design Published: October 2008.
Reliability on Web Services Presented by Pat Chan 17/10/2005.
Objektorienteret Middleware Presentation 2: Distributed Systems – A brush up, and relations to Middleware, Heterogeneity & Transparency.
Extensible Networking Platform IWAN 2005 Extensible Network Configuration and Communication Framework Todd Sproull and John Lockwood
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Rheeve: A Plug-n-Play Peer- to-Peer Computing Platform Wang-kee Poon and Jiannong Cao Department of Computing, The Hong Kong Polytechnic University ICDCSW.
The Architecture Design Process
EEC-681/781 Distributed Computing Systems Lecture 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University
1© Copyright 2015 EMC Corporation. All rights reserved. SDN INTELLIGENT NETWORKING IMPLICATIONS FOR END-TO-END INTERNETWORKING Simone Mangiante Senior.
Barracuda Networks Confidential1 Barracuda Backup Service Integrated Local & Offsite Data Backup.
BFT3W'091 Intrusion Tolerance: The Killer App for BFT (?) Alysson Bessani, Miguel Correia, Paulo Sousa, Nuno Ferreira Neves, Paulo Veríssimo Universidade.
 Cloud computing  Workflow  Workflow lifecycle  Workflow design  Workflow tools : xcp, eucalyptus, open nebula.
Fault and Intrusion Tolerant (FIT) Event Broker & BFT-SMaRt A. Casimiro, D. Kreutz, A. Bessani, J. Sousa, I. Antunes, P. Veríssimo University of Lisboa,
Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.
Fault Tolerance via the State Machine Replication Approach Favian Contreras.
Michael Ernst, page 1 Collaborative Learning for Security and Repair in Application Communities Performers: MIT and Determina Michael Ernst MIT Computer.
Gil EinzigerRoy Friedman Computer Science Department Technion.
Active Monitoring in GRID environments using Mobile Agent technology Orazio Tomarchio Andrea Calvagna Dipartimento di Ingegneria Informatica e delle Telecomunicazioni.
Computer Science Open Research Questions Adversary models –Define/Formalize adversary models Need to incorporate characteristics of new technologies and.
Cluster Reliability Project ISIS Vanderbilt University.
Wireless Networks Breakout Session Summary September 21, 2012.
Distributed Systems: Concepts and Design Chapter 1 Pages
9 September 2008CIS 340 # 1 Topics reviewTo review the communication needs to support the architectures variety of approachesTo examine the variety of.
21/05/2010 AU DEPARTMENT OF COMPUTER SCIENCE FACULTY OF SCIENCE AARHUS UNIVERSITY TATIONpRESEN The homeport system Jeppe Brønsted, Post Doc, Phd Aarhus.
Distributed Database Systems Overview
Clever Framework Name That Doesn’t Violate Copyright Laws MARCH 27, 2015.
Advanced Computer Networks Topic 2: Characterization of Distributed Systems.
Practical Byzantine Fault Tolerance
SCALABLE EVOLUTION OF HIGHLY AVAILABLE SYSTEMS BY ABHISHEK ASOKAN 8/6/2004.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
Distributed Computing Environment (DCE) Presenter: Zaobo He Instructor: Professor Zhang Advanced Operating System Advanced Operating System.
TTCN-3 MOST Challenges Maria Teodorescu
 Apache Airavata Architecture Overview Shameera Rathnayaka Graduate Assistant Science Gateways Group Indiana University 07/27/2015.
Oracle's Distributed Database Bora Yasa. Definition A Distributed Database is a set of databases stored on multiple computers at different locations and.
Distributed System Concepts and Architectures 2.3 Services Fall 2011 Student: Fan Bai
Robustness in the Salus scalable block store Yang Wang, Manos Kapritsos, Zuocheng Ren, Prince Mahajan, Jeevitha Kirubanandam, Lorenzo Alvisi, and Mike.
Cracow Grid Workshop ‘06 17 October 2006 Execution Management and SLA Enforcement in Akogrimo Antonios Litke Antonios Litke, Kleopatra Konstanteli, Vassiliki.
Amit Warke Jerry Philip Lateef Yusuf Supraja Narasimhan Back2Cloud: Remote Backup Service.
Shuman Guo CSc 8320 Advanced Operating Systems
VMware vSphere Configuration and Management v6
Abstract A Structured Approach for Modular Design: A Plug and Play Middleware for Sensory Modules, Actuation Platforms, Task Descriptions and Implementations.
Rob Davidson, Partner Technology Specialist Microsoft Management Servers: Using management to stay secure.
Internet of Things. IoT Novel paradigm – Rapidly gaining ground in the wireless scenario Basic idea – Pervasive presence around us a variety of things.
Computer Simulation of Networks ECE/CSC 777: Telecommunications Network Design Fall, 2013, Rudra Dutta.
GRID ANATOMY Advanced Computing Concepts – Dr. Emmanuel Pilli.
Intrusion Tolerant Distributed Object Systems Joint IA&S PI Meeting Honolulu, HI July 17-21, 2000 Gregg Tally
Cisco Consulting Services for Application-Centric Cloud Your Company Needs Fast IT Cisco Application-Centric Cloud Can Help.
Internet of Things. Creating Our Future Together.
Efficient Opportunistic Sensing using Mobile Collaborative Platform MOSDEN.
Langley Research Center An Architectural Concept for Intrusion Tolerance in Air Traffic Networks Jeffrey Maddalon Paul Miner {jeffrey.m.maddalon,
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
AMSA TO 4 Advanced Technology for Sensor Clouds 09 May 2012 Anabas Inc. Indiana University.
Chapter 1 Characterization of Distributed Systems
BChain: High-Throughput BFT Protocols
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S
Distributed Cache Technology in Cloud Computing and its Application in the GIS Software Wang Qi Zhu Yitong Peng Cheng
Connected Maintenance Solution
Connected Maintenance Solution
Grid Computing.
Exploring Azure Event Grid
Providing Secure Storage on the Internet
Concept of VLAN (Virtual LAN) and Benefits
Principles of Computer Security
Quality-aware Middleware
Design.
Presentation transcript:

A FIT Event Broker for trustworthy infrastructure monitoring and management António Casimiro University of Lisbon Faculty of Sciences LASIGE – Navigators group 61 st IFIP WG 10.4 Meeting, January, 2012

Outline Motivation: the TRONE project Goals and challenges FIT broker architecture and operation Other work in TRONE Conclusions 61st IFIP WG 10.4 Meeting, January

The TRONE project Develop innovative solutions for Network Operation, Administration and Management Proactive hazard reduction: architectural robustness Reactive hazard reduction: detection and recovery Achieve trustworthy network operation Solutions for dynamic dependability & security enforcement Deal with increasing levels of accidental and malicious faults Diagnosis, detection Prevention/tolerance Automatic reconfiguration Provide architectural solutions and resilient components 61st IFIP WG 10.4 Meeting, January Focus of this presentation

Why we need TRONE? Trustworthy and Resilient Operations in a Network Environment Technology push: Next Generation Networks, Cloud Computing Need for seamless integration of new and heterogeneous technologies Consumer pull: More demanding requirements Increased QoS and QoP (fast is not enough!) The confluence of these forces leads to: Increased operational risks Inadequate network operation and management 61st IFIP WG 10.4 Meeting, January

Example scenario: Portugal Telecom Cloud Computing Infrastructure Cloud computing environments: Will be, in the next few years, the hottest topic in Portugal Telecom services portfolio Present a set of new challenges regarding security controls Same infrastructure is used by clients with strong security awareness and others that do not share this awareness The reach and impact of an attack is potentially greater than in traditional IT infrastructures 61st IFIP WG 10.4 Meeting, January 2012 Portugal Telecom Reference Architecture 5

Example scenario: Portugal Telecom Cloud Computing Infrastructure Focusing on a specific problem: Centralized monitoring approach 61st IFIP WG 10.4 Meeting, January

Goals and challenges Overarching goals: To provide support for trustworthy and resilient monitoring of cloud/datacenter infrastructures To achieve improved Quality of Protection while considering Quality of Service (performance) needs Some specific challenges: Deal with large flows of information (events) Support different kinds of events (e.g. criticality) Low intrusiveness and easy integration 61st IFIP WG 10.4 Meeting, January

Fault and Intrusion Tolerant (FIT) Event Broker 8

Assumptions System entities: Probes, event collectors/brokers, consoles Some event processing may be done by collectors Fully connected network E.g., all the entities lie in the same monitoring VLAN Partially synchronous system Clocks may be used to timestamp events Faults Some FIT brokers may fail in a Byzantine way (e.g. be attacked) We do not require/enforce clients (probes/consoles) to be correct If this is a problem for monitoring, then it must also be solved 61st IFIP WG 10.4 Meeting, January

Baseline design options Topic-based Publish-Subscribe paradigm Good fit to considered scenarios State Machine Replication Active replication is better for Byzantine fault tolerance f out of n replicas of a FIT Broker may fail in a Byzantine way Public-key cryptography Client authentication, avoid attacks from malicious probes Event channels with support for QoP and QoS Differentiated event handling, on a channel basis Differentiated fault-tolerance support (e.g. crash only or BFT) Possible support for ordering, urgency and other requirements 61st IFIP WG 10.4 Meeting, January

FIT Monitoring system: Overview 61st IFIP WG 10.4 Meeting, January

FIT Monitoring system: High level architectural view 61st IFIP WG 10.4 Meeting, January

FIT Event Broker Basic concepts Event Channels Fundamental abstraction to differentiate event flows within the event broker An event channel is identified (within the system) by a TAG The characteristics of event channels are set by means of a CLASS attribute Several event channels may be created, defining communication domains An event channel may be used to transmit specific kinds of events For instance: network, storage, security threats, … CLASS Defines the desired QoS and QoP for an event channel Fault-tolerance (e.g. whether the event channel/service should be tolerant to crash faults or to Byzantine faults) Ordering (e.g. whether the events transmitted through the event channel should be delivered in the same order to all subscribers, or any order is acceptable) Priority (e.g. whether the event flow is allocated more bandwidth within the broker, or events may be processed before events from other channels within the broker) Several channels with the same CLASS may coexist 61st IFIP WG 10.4 Meeting, January

FIT Event Broker Interface Create event channel In: TAG and CLASS Register to channel In: TAG Publish event In: EVENT Subscribe to channel In: TAG Receive event Out: EVENT Destroy event channel In: TAG 61st IFIP WG 10.4 Meeting, January

FIT Event Broker Internal state From the SMR perspective, it is important to identify the relevant state that needs to be maintained consistent across replicas Data related to the broker configuration Existing channels and their CLASS Registered publishers and subscribers Data related to events Events that are ready to be delivered 61st IFIP WG 10.4 Meeting, January All client input that affects the state of the FIT broker state (e.g. channel and subscription data, some events) must be handled as a state machine command

FIT Event Broker Operation Depending on requirements (determined by the channel CLASS), input events are handled by different protocols within the FIT broker Crash-resilient channel, no order requirements No consistency among replicas is needed Simple forwarding protocols High performance – adequate to most periodic and non-critical monitoring events Byzantine-resilient channel Agreement is needed Performance implications – adequate to critical monitoring events 61st IFIP WG 10.4 Meeting, January

FIT Event Broker Internal event processing 61st IFIP WG 10.4 Meeting, January FIT Event Broker (replicas) Replica n Replica 1 Replica 2 PROBE CONSOLEVoter CLASS? Forward Prot Urgency Prot Agreem Prot INOUT queues Voter component is always needed to handle responses from replicas

FIT Event Broker Crash-tolerant instantiation 61st IFIP WG 10.4 Meeting, January

FIT Event Broker BFT instantiation 61st IFIP WG 10.4 Meeting, January We use BFT-SMaRt as a fundamental building block in the implementation of the FIT event broker

BFT-SMaRt Java-based platform for BFT SMR, available at Actively being developed and improved in our group BFT SMR “common” features State machine programming model n ≥ 3f+1 replicas required Many bugs (but not as many as competitors) Advanced features Replica recovery (state transfer) Reconfigurations Extensible API: e.g. custom voter 61st IFIP WG 10.4 Meeting, January

BFT Replication Library Programming interface A very simple and constrained API for implementing state machine replication 61st IFIP WG 10.4 Meeting, January

Using SMaRt Service invocation 61st IFIP WG 10.4 Meeting, January PROBE FIT Broker state Agreement on order performed by SMaRt

Using SMaRt Execution and response 61st IFIP WG 10.4 Meeting, January Commands are delivered to the FIT broker, which updates the state/queues and replies Voting on client side

The FIT Broker is currently being implemented On-going work on SMaRt integration Evaluation: Throughput Aim is to deal with 40K events/sec Resilience Measure performance under attack Verify recovery and reconfiguration capabilities 61st IFIP WG 10.4 Meeting, January Implementation & Evaluation

61st IFIP WG 10.4 Meeting, January 2012 Other work in TRONE 25

Failure Diagnosis Overview Diagnosing problems Creates major headaches for administrators Worsens as scale and system complexity grows Goal: automate failure diagnosis and get proactive Failure detection and prediction Problem determination (“automated fingerpointing”) Problem visualization How: Instrumentation plus statistical analysis 61st IFIP WG 10.4 Meeting, January

Failure Diagnosis Goals and Non-Goals Goals of the failure diagnosis algorithm Lightweight and Transparent: use metrics that can be collected with minimum overhead and without modifying the applications Scalable: low complexity so that it can scale to several nodes used in the cloud computing infrastructure Versatile: Should work well with all the different kinds of applications that might run on the cloud computing infrastructure Non-goals (for now) Tracing problem down to offending line of code Online implementation 61st IFIP WG 10.4 Meeting, January

Multi-homing Objectives: Improved communication resilience Client multi-homing at the transport layer Use of Stream Control Transport Protocol (SCTP) Interaction between SCTP and the FIT Event Broker for trustworthy handling of SCTP monitoring data 61st IFIP WG 10.4 Meeting, January

Conclusions TRONE will contribute to improve the resilience of Portugal Telecom’s datacenter monitoring Excellent opportunity to Design practically apply and verify the effective benefits of our solutions 61st IFIP WG 10.4 Meeting, January

61st IFIP WG 10.4 Meeting, January 2012 Thank you! Visit us at 30