FLARe: a Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time and Embedded Systems Dr. Aniruddha S. Gokhale

Presentation transcript:

FLARe: a Fault-tolerant Lightweight Adaptive Real-time Middleware for Distributed Real-time and Embedded Systems
Dr. Aniruddha S. Gokhale (Co-Advisor)
Dr. Douglas C. Schmidt (Advisor)
Jaiganesh Balasubramanian
Department of Electrical Engineering and Computer Science
Vanderbilt University, Nashville, TN, USA
Middleware 2007 Doctoral Symposium (MDS 2007), Newport Beach, CA, USA

2 Focus: Distributed Real-time Embedded (DRE) Systems
- Stringent simultaneous QoS demands, e.g., “never die,” soft real-time, etc.
- Predominantly stateless; tolerates weaker consistency if stateful
- Distributed Object Computing middleware used to design and develop DRE systems
- Support for highly available systems (e.g., FT-CORBA)
- End-to-end predictable behavior for requests (e.g., RT-CORBA)
- Goal is to provide real-time fault tolerance to DRE systems
- FT uses redundancy; RT assured by resource management

3 Determining the Replication Scheme for DRE Systems
Active replication
- client requests multicast and executed at all the replicas
- strong state consistency
- deterministic behavior of replicas
- very fast recovery
- resource-expensive
Passive replication
- low resource/execution overhead
- better suited for weaker consistency
- no restrictions on deterministic behavior
- enables making tradeoffs between FT and resource consumption
- applies to a class of soft real-time DRE systems
Passive replication better suited for our purpose
Goal is to provide RT+FT for DRE systems using passive replication

4 Challenges: Using Passive Replication for DRE Systems
Challenge 1: Maintain real-time performance of applications at all times
Focus: Real-time performance after failover
- Decision-making algorithms used for electing a new primary
- Client response times depend on the loads of the processor hosting the failover target
- Task deadlines are met if the CPU utilization is under a threshold
- Failure could affect multiple clients: failover to multiple processors
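The utilization-threshold condition on this slide can be made concrete with a small sketch. The slides do not say which threshold FLARe uses; the example below assumes, purely for illustration, the classic Liu and Layland bound for rate-monotonic scheduling of n periodic tasks, U <= n(2^(1/n) - 1). The task sets and function names are hypothetical.

```python
def rm_utilization_bound(n_tasks: int) -> float:
    """Liu and Layland bound for n periodic tasks under rate-monotonic scheduling."""
    if n_tasks < 1:
        raise ValueError("need at least one task")
    return n_tasks * (2 ** (1 / n_tasks) - 1)


def utilization(tasks):
    """Total CPU utilization of (execution_time, period) pairs."""
    return sum(c / t for c, t in tasks)


def can_host_failover(existing, incoming):
    """True if the host stays under the assumed bound after taking over the incoming tasks."""
    combined = list(existing) + list(incoming)
    return utilization(combined) <= rm_utilization_bound(len(combined))


# Hypothetical task sets: (execution time, period) in milliseconds.
host_tasks = [(10, 100), (20, 200)]
failed_primary_tasks = [(30, 100)]
print(can_host_failover(host_tasks, failed_primary_tasks))
```

The bound is a sufficient condition only: a task set that exceeds it may still be schedulable, so a real admission test could be less conservative.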

5 Challenges: Using Passive Replication for DRE Systems
Challenge 2: Fast failover on client side
Focus: Faster and predictable failover
- Client-side middleware could maintain a static list of references
- Round-robin approach of trying out different references
- Faster failover, but not necessarily an appropriate failover target
- No RT guarantee after failover
- Client-side middleware needs to be updated with references based on dynamic operating conditions

6 Challenges: Using Passive Replication for DRE Systems
Challenge 3: FT+RT in spite of resource overloads
Focus: Dynamic reconfigurations and overload management
- Long running systems: continued operation through many failures
- Periodic loss of resources: simultaneous failures
- Graceful degradation of applications
- Operate higher priority applications at all times
- Overload management: predictable and fast
- Alternate degraded and assured functionality

7 Challenges: Using Passive Replication for DRE Systems
Challenge 4: Resource-aware stateful replication
Focus: State consistency in stateful DRE systems
- State transfer requires CPU and network reservations
- Support for resource-constrained operations
- Different consistency models: strong, weak, and no consistency
- Adapt consistency of certain tolerant applications depending on available resources
- Utility optimizations: better state consistency for higher priority applications

8 Our Approach: FLARe RT-FT Middleware
FLARe = Fault-tolerant Lightweight Adaptive Real-time Middleware
Transparent and Fast Failover
- Redirection using client-side portable interceptors
- catches COMM_FAILURE exceptions and transparently throws LOCATION_FORWARD exceptions
- Failure detection can be improved with better protocols, e.g., SCTP
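The redirection idea on this slide can be illustrated with a language-neutral sketch. This is not the CORBA portable interceptor API and not FLARe's code; `CommFailure`, `Replica`, and `FailoverInterceptor` are hypothetical stand-ins that only show how a client-side hook can catch a communication failure and retry against a backup without the caller noticing.

```python
class CommFailure(Exception):
    """Stand-in for a CORBA COMM_FAILURE system exception."""


class Replica:
    """Hypothetical server replica that either answers or fails."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive

    def invoke(self, request):
        if not self.alive:
            raise CommFailure(self.name)
        return f"{self.name} handled {request}"


class FailoverInterceptor:
    """Hypothetical client-side hook that hides a failed primary from the caller."""

    def __init__(self, primary, backups):
        self.current = primary
        self.backups = list(backups)

    def send(self, request):
        while True:
            try:
                return self.current.invoke(request)
            except CommFailure:
                if not self.backups:
                    raise  # no replica left, so surface the failure
                # Similar in spirit to forwarding the request to a new location.
                self.current = self.backups.pop(0)


primary = Replica("primary", alive=False)
backup = Replica("backup-1")
client = FailoverInterceptor(primary, [backup])
print(client.send("get_status"))
```

The caller only ever sees the successful reply; the failed attempt against the primary is absorbed inside `send`.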

9 Our Approach: FLARe RT-FT Middleware
Real-time performance after failover
- monitor CPU utilizations at hosts where backups are deployed
- adaptive failover target selection algorithms operated by a resource manager
- failover target chosen as the least loaded host among those hosting the backups
- better chance to provide RT performance
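A minimal sketch of least-loaded target selection, assuming the resource manager holds a recent CPU utilization sample per host. The function name, host names, and the 0.7 threshold are illustrative assumptions, not values from the slides.

```python
def select_failover_target(backup_hosts, cpu_utilization, threshold=0.7):
    """Pick the least loaded host that runs a backup.

    Returns (host, under_threshold), or (None, False) if no host has a sample.
    The threshold value is an assumption made for this sketch.
    """
    candidates = [h for h in backup_hosts if h in cpu_utilization]
    if not candidates:
        return None, False
    best = min(candidates, key=lambda h: cpu_utilization[h])
    return best, cpu_utilization[best] < threshold


# Hypothetical monitoring snapshot: host name -> CPU utilization in [0, 1].
samples = {"host-a": 0.82, "host-b": 0.35, "host-c": 0.60}
print(select_failover_target(["host-a", "host-b", "host-c"], samples))
```

Returning the threshold flag alongside the host lets a caller distinguish "best available" from "expected to keep meeting deadlines".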

10 Our Approach: FLARe RT-FT Middleware
Predictable failover
- failover target decisions computed periodically by the resource manager
- conveyed to client-side middleware agents (forwarding agents)
- agents work in tandem with portable interceptors
- redirect clients quickly and predictably to appropriate targets
- agents periodically/proactively updated when targets change
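The division of labor on this slide can be sketched as follows: the resource manager recomputes targets on each monitoring period and pushes an update only when a target changes, so that at failure time the forwarding agent answers from a local table. All class and method names are hypothetical, and the selection rule (least loaded host) is carried over from the previous slide.

```python
class ForwardingAgent:
    """Hypothetical client-side agent holding precomputed failover targets."""

    def __init__(self):
        self.targets = {}
        self.updates_received = 0

    def update(self, service, host):
        self.targets[service] = host
        self.updates_received += 1

    def next_target(self, service):
        # Constant-time lookup at failure time; no decision is made here.
        return self.targets.get(service)


class ResourceManager:
    """Hypothetical manager that recomputes targets once per monitoring period."""

    def __init__(self, backups, agents):
        self.backups = backups      # service name -> hosts running its backups
        self.agents = agents
        self.published = {}

    def recompute(self, cpu_utilization):
        for service, hosts in self.backups.items():
            target = min(hosts, key=lambda h: cpu_utilization[h])
            if self.published.get(service) != target:
                self.published[service] = target
                for agent in self.agents:
                    agent.update(service, target)


agent = ForwardingAgent()
manager = ResourceManager({"tracker": ["h1", "h2"]}, [agent])
manager.recompute({"h1": 0.5, "h2": 0.2})   # publishes h2
manager.recompute({"h1": 0.5, "h2": 0.2})   # unchanged, nothing is sent
manager.recompute({"h1": 0.1, "h2": 0.2})   # publishes h1
print(agent.next_target("tracker"), agent.updates_received)
```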

11 Current Progress
- Initial prototype of FLARe developed using The ACE ORB (TAO)
- Stateless FT using passive replication
- Implemented a resource-aware adaptive failover target selection algorithm
- Compared and contrasted the performance of the FT middleware when using static failover strategies versus adaptive failover strategies
- Significant reduction in client response times and system utilization
FLARe is open-source and available at

12 Proposed Research and Expected Milestones
Overload management
- investigate overload management algorithms that do not degrade application QoS
- minimum client disturbance
- implemented within the resource manager
- extreme resource constrained operating conditions: investigate opportunities to change implementations and reduce overloads
- Utility optimizations: when to degrade QoS and when not to
- RT/FT trade-offs
Deadline: March 2008
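One simple way to picture overload management with alternate implementations is a greedy plan that switches the lowest priority applications to a cheaper variant until the host is back under its utilization threshold. This is an illustrative sketch, not the algorithm proposed in the research; the `App` fields, the percent units, and the example numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    priority: int        # larger value means more important
    full_util: int       # CPU utilization in percent, full implementation
    degraded_util: int   # CPU utilization in percent, degraded implementation


def plan_degradation(apps, threshold):
    """Degrade lowest priority apps first until total utilization <= threshold.

    Returns (names of degraded apps, resulting total utilization in percent).
    """
    total = sum(a.full_util for a in apps)
    degraded = []
    for app in sorted(apps, key=lambda a: a.priority):
        if total <= threshold:
            break
        total -= app.full_util - app.degraded_util
        degraded.append(app.name)
    return degraded, total


# Hypothetical workload on one overloaded host.
apps = [
    App("tracker", priority=3, full_util=40, degraded_util=20),
    App("logger", priority=1, full_util=30, degraded_util=10),
    App("display", priority=2, full_util=30, degraded_util=20),
]
print(plan_degradation(apps, threshold=70))
```

Integer percentages are used to avoid floating-point rounding at the threshold comparison; the highest priority application keeps its full implementation whenever degrading the others is enough.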

13 Proposed Research and Expected Milestones
State Synchronization
- View state synchronization as an aperiodic scheduling problem
- more slack available: more time available to synchronize state
- slack always devoted to higher priority applications
- availability of slack: support for different consistency management schemes (e.g., weak, strong, none)
- application informs middleware when to synchronize state
Deadline: July 2008
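The slack-driven view of state synchronization can be sketched as a per-period allocation: available slack is handed out in priority order, and applications that request no consistency are skipped. The tuple layout, the cost figures, and the greedy policy are assumptions made for illustration; the slides only state the goals.

```python
def schedule_state_sync(slack_ms, apps):
    """Allocate one period's slack to state transfers, highest priority first.

    apps: iterable of (name, priority, sync_cost_ms, consistency), where
    consistency is "strong", "weak", or "none" (assumed labels).
    Returns (names synchronized this period, unused slack in ms).
    """
    synced = []
    remaining = slack_ms
    for name, _priority, cost, consistency in sorted(apps, key=lambda a: -a[1]):
        if consistency == "none":
            continue              # this application never transfers state
        if cost <= remaining:
            synced.append(name)
            remaining -= cost
    return synced, remaining


# Hypothetical applications sharing 10 ms of slack in the current period.
apps = [
    ("nav", 3, 6, "strong"),
    ("telemetry", 2, 5, "weak"),
    ("ui", 1, 3, "weak"),
    ("cache", 4, 1, "none"),
]
print(schedule_state_sync(10, apps))
```

In this sketch a transfer that does not fit is simply postponed to a later period, which is acceptable only for applications that tolerate weaker consistency; a strongly consistent application would need its slack guaranteed by other means.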

14 Proposed Research and Expected Milestones
Network Reservations
- View real-time fault-tolerance as an end-to-end scheduling problem
- network reservations are required for state transfers
- without reservations, no predictability
- middleware-mediated mechanisms to use external network QoS mechanisms such as DiffServ
- network monitors for alternate routes in the presence of failures (leverage existing network research)
Deadline: September 2008
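As a small illustration of how software can request DiffServ treatment, the sketch below marks a UDP socket's outgoing packets with a DSCP code point through the standard `IP_TOS` socket option. It assumes a platform that exposes `socket.IP_TOS` (for example Linux), and whether routers honor the marking depends entirely on network configuration. The helper names are hypothetical and this is not FLARe's mechanism.

```python
import socket

EF_DSCP = 46  # Expedited Forwarding code point (RFC 3246)


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit DSCP value in the upper bits of the IP TOS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    return dscp << 2


def open_marked_udp_socket(dscp: int = EF_DSCP) -> socket.socket:
    """Create a UDP socket whose outgoing packets carry the given DSCP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))
    return sock


print(dscp_to_tos(EF_DSCP))
```

A state-transfer channel could be opened with `open_marked_udp_socket()` so that its traffic is classified separately from best-effort traffic, provided the network is configured for it.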

15 Concluding Remarks
- Passive replication: a promising approach for DRE systems
- Resource-aware adaptive fault-tolerance: required for adapting passive replication for DRE system requirements
- Adaptive algorithms required for trading off RT versus FT requirements
- Middleware transparently supports FT for applications and works in conjunction with adaptive algorithms to take care of RT requirements as well
FLARe is open-source and available at