An Algorithm for Automatically Obtaining Distributed and Fault Tolerant Static Schedules. Alain Girault, Hamoudi Kalla, Yves Sorel, Mihaela Sighireanu.


An Algorithm for Automatically Obtaining Distributed and Fault Tolerant Static Schedules. Alain Girault, Hamoudi Kalla, Yves Sorel, Mihaela Sighireanu. San Francisco, USA, June 23, 2003. POP ART team & OSTRE team.

Outline
- Introduction
- Modeling distributed real-time systems
- Problem: how to introduce fault-tolerance?
- The proposed solution for fault-tolerance: principles and example
- Simulations
- Conclusion and future work

1. Introduction: model of the algorithm
A high-level program is compiled into an algorithm model. Together with the architecture specification, the distribution constraints, the execution times, the real-time constraints and the failure specification, it is fed to the fault-tolerant distribution and scheduling heuristic, which produces a fault-tolerant distributed static schedule; a code generator then turns this schedule into fault-tolerant distributed code.

2. Modeling distributed real-time systems
a. Algorithm model: a data-flow graph where "I1" and "I2" are input operations, "A", "B" and "C" are computation operations, and "O" is the output operation.
b. Architecture model: "P1", "P2" and "P3" are processors; "m1", "m2" and "m3" are communication links.
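The two models above can be written down as plain adjacency structures, sketched here in Python. The exact edge sets are assumptions reconstructed from the example steps later in the deck (the transcript shows only the node names), and the endpoints of links m1, m2, m3 are likewise assumed:

```python
# Hypothetical encoding of the slide's two graphs (edge sets are assumed,
# reconstructed from the example steps later in the deck).
algorithm_graph = {  # operation -> set of predecessor operations
    "I1": set(), "I2": set(),        # input operations
    "A": {"I1", "I2"}, "B": {"I1"},  # computation operations
    "C": {"A", "B"},
    "O": {"C"},                      # output operation
}
architecture_graph = {  # communication link -> pair of processors it connects
    "m1": ("P1", "P2"), "m2": ("P1", "P3"), "m3": ("P2", "P3"),
}
# input operations are exactly the nodes with no predecessors
inputs = {o for o, preds in algorithm_graph.items() if not preds}
```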

3. Problem: how to introduce fault-tolerance?
Problem: find a distributed schedule of the algorithm graph on the architecture graph which is tolerant to processor failures.

4. The proposed solution for fault-tolerance
Solution: a list scheduling heuristic which uses active software replication of operations and communications.
Assumption: processors are assumed to be fail-silent.
The heuristic tolerates a number of processor failures Npf ≥ 1.

4. The proposed solution for fault-tolerance: Principles (1)
Each operation/communication is replicated at least Npf+1 times, on different processors/links of the architecture graph.
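A minimal sketch of this replication principle, assuming a simple round-robin placement (the real heuristic instead chooses the processors with the schedule-pressure cost function introduced below; `replicate` and its round-robin policy are illustrative, not the authors' code):

```python
from itertools import cycle

def replicate(operations, processors, npf):
    """Assign each operation to Npf + 1 distinct processors, so that at least
    one replica survives any Npf fail-silent processor failures."""
    assert len(processors) >= npf + 1, "need at least Npf + 1 processors"
    rr = cycle(processors)
    placement = {}
    for op in operations:
        replicas = set()
        while len(replicas) < npf + 1:   # pick the next npf+1 distinct processors
            replicas.add(next(rr))
        placement[op] = replicas
    return placement

placement = replicate(["I1", "I2", "A", "B", "C", "O"], ["P1", "P2", "P3"], npf=1)
# with Npf = 1, every operation runs on 2 distinct processors out of P1, P2, P3
```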

4. The proposed solution for fault-tolerance: Principles (2) [figure]

4. The proposed solution for fault-tolerance: Principles (3) [figure]

4. The proposed solution for fault-tolerance: Principles (4)
The schedule pressure σ is used as a cost function to select the best processor p for each operation o. [The defining formula of σ appeared as a figure on the slide.]

5. Heuristic
1. O_cand := { o | o is an input operation }; O_sched := ∅;
2. While O_cand ≠ ∅ do
a. Compute the schedule pressure σ for each operation o of O_cand on each processor p, and keep the smallest Npf+1 results;
b. Select the best candidate operation o_best, the one with the greatest schedule pressure σ(o_best, p);
c. Schedule o_best on each processor p computed at step a; the communications implied by this schedule are replicated Npf+1 times and scheduled on parallel links;
d. Try to minimise the start time of o_best on each processor p computed at step a by replicating its predecessors on p [Ahmad et al.];
e. Update the lists: O_cand := O_cand − { o_best } ∪ { o | o ∈ succs(o_best) and preds(o) ⊆ O_sched }; O_sched := O_sched ∪ { o_best };
end while;
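The loop above can be turned into a small executable sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: communication costs and link scheduling (step c) are ignored, step d is omitted, and the schedule pressure σ is approximated by a replica's earliest finish time; the `dag`/`durations` encoding is hypothetical.

```python
def schedule(dag, durations, processors, npf):
    """dag: op -> set of predecessor ops; durations: (op, processor) -> time."""
    succs = {o: {s for s, ps in dag.items() if o in ps} for o in dag}
    cand = {o for o, ps in dag.items() if not ps}  # input operations
    scheduled, placement = set(), {}
    finish = {p: 0.0 for p in processors}          # per-processor clock
    op_end = {}                                    # op -> latest replica end time

    def ready_time(o):  # earliest time all predecessors of o have finished
        return max((op_end[q] for q in dag[o]), default=0.0)

    while cand:
        # step a: for each candidate, keep the Npf+1 "cheapest" processors
        best_procs = {
            o: sorted(processors,
                      key=lambda p: max(finish[p], ready_time(o)) + durations[o, p]
                      )[:npf + 1]
            for o in cand
        }
        # step b: select the most urgent candidate (greatest approximated pressure)
        def pressure(o):
            p = best_procs[o][-1]  # worst of the kept processors
            return max(finish[p], ready_time(o)) + durations[o, p]
        o_best = max(cand, key=pressure)
        # step c (simplified): replicate o_best on its Npf+1 selected processors
        ends = []
        for p in best_procs[o_best]:
            start = max(finish[p], ready_time(o_best))
            finish[p] = start + durations[o_best, p]
            ends.append(finish[p])
        placement[o_best] = best_procs[o_best]
        op_end[o_best] = max(ends)
        # step e: update the candidate list with newly ready successors
        scheduled.add(o_best)
        cand.remove(o_best)
        cand |= {s for s in succs[o_best] if dag[s] <= scheduled}
    return placement
```

With the deck's six-operation graph and Npf = 1, every operation ends up replicated on two distinct processors.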

5. Example
Npf = 1: the number of fail-silent processor failures that the system must tolerate.
Architecture graph: processors P1, P2, P3 and communication links m1, m2, m3. Algorithm graph: operations I1, I2, A, B, C, O.

5. Example, Step 1
O_sched = ∅; O_cand = { I1, I2 }. Nothing is scheduled yet on P1, P2, P3 or on the links m1, m2, m3.

5. Example, Step 2 (1)
Schedule I1 on P1 and P2. O_sched: { } → { I1 }; O_cand: { I1, I2 } → { I2, B }.

5. Example, Step 2 (2)
Schedule I2 on P1 and P2. O_sched: { I1 } → { I1, I2 }; O_cand: { I2, B } → { A, B }.

5. Example, Step 2 (3)
O_sched = { I1, I2 }; O_cand = { A, B }. I1 and I2 are each scheduled on P1 and P2.

5. Example, Step 2.a (3)
σ(A, { P1, P2, P3 }) = { 7, 10, 9 }; σ(B, { P1, P2, P3 }) = { 9, 6, 8 }.
Keeping the smallest Npf+1 = 2 results: σ(A, { P1, P3 }) = { 7, 9 }; σ(B, { P2, P3 }) = { 6, 8 }.
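Replaying steps 2.a and 2.b with the numbers from this slide. The rule "keep the smallest Npf+1 pressures per operation, then pick the operation whose kept pressure is greatest" is my reading of the slide's Min/Max annotations:

```python
NPF = 1
pressures = {  # schedule pressure of each candidate on each processor (slide values)
    "A": {"P1": 7, "P2": 10, "P3": 9},
    "B": {"P1": 9, "P2": 6, "P3": 8},
}
# step 2.a: keep, per operation, the Npf+1 = 2 smallest pressures
kept = {o: dict(sorted(pp.items(), key=lambda kv: kv[1])[:NPF + 1])
        for o, pp in pressures.items()}
# kept == {"A": {"P1": 7, "P3": 9}, "B": {"P2": 6, "P3": 8}}

# step 2.b: select the candidate whose kept pressure is greatest
o_best = max(kept, key=lambda o: max(kept[o].values()))
# o_best == "A", to be scheduled on P1 and P3
```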

5. Example, Step 2.b (3)
The greatest schedule pressure selects A: σ(A, { P1, P3 }) = { 7, 9 }. O_sched = { I1, I2 }; O_cand = { A, B }.

5. Example, Step 2.c (3)
Schedule A on P1 and P3, the two processors kept at step 2.a.

5. Example, Step 2.d (3)
Replicate I2 on P3, to minimise the start time of the replica of A on P3.

5. Example, Step 2.e (3)
O_sched: { I1, I2 } → { I1, I2, A }; O_cand: { A, B } → { B }.

6. Simulations
Aim: compare the proposed heuristic with the HBP heuristic [Hashimoto et al. 2002].
Assumptions: an architecture with fully connected processors; number of fail-silent processor failures Npf = 1.
Simulation parameters: the communication-to-computation ratio (CCR), defined as the average communication time divided by the average computation time, CCR = 0.1, 0.5, 1, 2, 5 and 10; the number of operations N = 10, 20, ..., 80.
Comparison parameter:
Overhead = (length(HTBR or HBP) − length(HTBR without fault-tolerance)) / length(HTBR without fault-tolerance) × 100%
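The overhead metric, written as a small helper (on the slide, the French word "longueur" means "length", i.e. the schedule length produced by the heuristic):

```python
def overhead(ft_length, base_length):
    """Relative schedule-length penalty of a fault-tolerant schedule (HTBR or
    HBP) against HTBR without fault tolerance, in percent."""
    return (ft_length - base_length) / base_length * 100.0

# e.g. a fault-tolerant schedule of length 130 against a baseline of length 100
overhead(130.0, 100.0)  # -> 30.0 (percent)
```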

Impact of the number of operations [plots: no processor failure / one processor fails]

Impact of the communication-to-computation ratio [plots: no processor failure / one processor fails]

7. Conclusion and future work
Result: a new scheduling heuristic based on the active replication strategy. It produces a static distributed schedule of a given algorithm on a given distributed architecture, tolerant to Npf processor failures.
Future work: a fault-tolerant scheduling heuristic that also tolerates communication link failures, and that maximises the system's reliability.