Transparent Fault-Tolerant Java Virtual Machine
  Roy Friedman & Alon Kama
  Computer Science, Technion

FT-JVM Goals
  Fault-tolerant environment for executing Java applications
    Apps should execute without interruption, overcoming failures of individual machines
    Apps should not have to be modified in order to run on the system
  Highly Reliable
    Fault-tolerance can be extended by utilizing more machines
  Low Maintainability
    Recovery upon failure of individual machines should be swift
  Transparency
    Failures should be masked and the transition to another machine should be transparent

Fault Tolerance by Replication
  Replication: coordinating a set of replicas of the computation on processors that fail independently
    Potential for a dramatic decrease in Mean Time To Repair (MTTR)
    Achieve t-fault-tolerance, where t is the number of backup replicas (t + 1 copies in total)
    Increased cost of hardware for duplication of effort
    Overhead and complexity of maintaining consistency
  Replication + Transparency (masking of failures, maintaining the illusion of a single copy) = High availability

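The link between a shorter repair time and high availability can be made explicit with the standard steady-state availability formula. The formula is not on the slide, and MTTF (mean time to failure) is introduced here only as the usual counterpart of MTTR:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
```

With purely illustrative numbers, MTTF = 1000 h and MTTR = 10 h give A ≈ 0.990. If failover to a replica cuts the effective MTTR to 0.01 h, A rises to about 0.99999 with no change in how often machines fail.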
Replication for Java
  Replication at the Java Virtual Machine level
    Replication at this level is cost-effective, portable, and transparent to the application developer and the user
  Approach extends Bressoud & Schneider (1995), who implemented active replication below the operating system
    T. Bressoud and F. Schneider. Hypervisor-based Fault-Tolerance. SOSP-15, 1995.

Design of the FT-JVM
  Replication requires deterministic execution, which is difficult to achieve because of:
    Preemptive context switches
    Lock contention in SMP
    I/O availability differences
    Environment-specific attributes
  Changes made to the VM:
    Deterministic thread scheduling
    Deterministic thread switching
    Non-deterministic operations relay their information to the replication module

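As a rough illustration of deterministic thread switching, the sketch below replaces timer-driven preemption with a switch after a fixed number of executed operations, and picks the next thread in fixed round-robin order. The class and method names (DeterministicScheduler, SimThread, run) and the choice of "operations" as the counting unit are assumptions made for this example; the slides do not say what the FT-JVM actually counts. Threads are simulated as queues of operation labels so the example stays self-contained.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: simulated threads are queues of operation labels. */
final class DeterministicScheduler {

    /** A simulated thread: a name plus the operations it still has to run. */
    static final class SimThread {
        final String name;
        final Deque<String> ops = new ArrayDeque<>();

        SimThread(String name, String... operations) {
            this.name = name;
            ops.addAll(Arrays.asList(operations));
        }
    }

    /**
     * Runs the threads round-robin, switching after at most `quantum`
     * operations instead of on a timer interrupt, and returns the trace.
     * The same input always produces the same interleaving.
     */
    static List<String> run(List<SimThread> threads, int quantum) {
        if (quantum <= 0) {
            throw new IllegalArgumentException("quantum must be positive");
        }
        Deque<SimThread> runQueue = new ArrayDeque<>(threads);
        List<String> trace = new ArrayList<>();
        while (!runQueue.isEmpty()) {
            SimThread current = runQueue.poll();
            for (int i = 0; i < quantum && !current.ops.isEmpty(); i++) {
                trace.add(current.name + ":" + current.ops.poll());
            }
            if (!current.ops.isEmpty()) {
                runQueue.add(current); // unfinished thread goes to the tail
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        List<SimThread> threads = Arrays.asList(
                new SimThread("A", "a1", "a2", "a3"),
                new SimThread("B", "b1"));
        // prints [A:a1, A:a2, B:b1, A:a3]
        System.out.println(run(threads, 2));
    }
}
```

Because the switch points depend only on the operation count and the queue order, a primary and a backup that start from the same state would produce the same trace, which is the property replication relies on.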
Design of the FT-JVM
  Replication module:
    One replication engine per processor, on both the primary and the backups
    Data packages are passed to the engine on the primary and retrieved from it on the backups
    Threads waiting for I/O now yield instead, to be re-scheduled at specific intervals
    I/O is checked at the beginning of a frame; a frame is delimited by X context switches or by the lack of schedulable application threads
  [Figure: message exchange between primary and backup, with labels "Frame n data", "End frame", "ACK", "Frame n+1 data"]

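The frame exchange can be pictured with a small record-and-replay sketch. It assumes one plausible reading of the figure: the primary logs the results of non-deterministic operations during a frame, ships them at the end of the frame, and continues only after the backup acknowledges. All names here (FrameReplication, Frame, Primary, Backup, nextValue, endFrame) are hypothetical, and a direct method call stands in for the network link.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch of frame-based record and replay. */
final class FrameReplication {

    /** Results of the non-deterministic operations executed in one frame. */
    static final class Frame {
        final int number;
        final List<Long> values;

        Frame(int number, List<Long> values) {
            this.number = number;
            this.values = new ArrayList<>(values); // defensive copy
        }
    }

    /** Backup side: stores frames and replays recorded values in order. */
    static final class Backup {
        private final Deque<Long> pending = new ArrayDeque<>();
        private int lastAcked = -1;

        /** Accepts the next frame and returns its number as the ACK. */
        int receive(Frame frame) {
            if (frame.number != lastAcked + 1) {
                throw new IllegalStateException("frame out of order");
            }
            pending.addAll(frame.values);
            lastAcked = frame.number;
            return lastAcked;
        }

        /** Replays a recorded value instead of asking the local environment. */
        long nextValue() {
            Long value = pending.poll();
            if (value == null) {
                throw new IllegalStateException("no recorded value available");
            }
            return value;
        }
    }

    /** Primary side: performs the real operation and logs its result. */
    static final class Primary {
        private final LongSupplier source;
        private final Backup backup;
        private final List<Long> current = new ArrayList<>();
        private int frameNumber = 0;

        Primary(LongSupplier source, Backup backup) {
            this.source = source;
            this.backup = backup;
        }

        long nextValue() {
            long value = source.getAsLong();
            current.add(value);
            return value;
        }

        /** Ships the frame and proceeds only once the backup has ACKed it. */
        void endFrame() {
            int ack = backup.receive(new Frame(frameNumber, current));
            if (ack != frameNumber) {
                throw new IllegalStateException("missing ACK for frame " + frameNumber);
            }
            current.clear();
            frameNumber++;
        }
    }

    public static void main(String[] args) {
        Backup backup = new Backup();
        Primary primary = new Primary(System::nanoTime, backup);
        long seenByPrimary = primary.nextValue();
        primary.endFrame();
        // The backup observes the primary's clock reading, not its own.
        System.out.println(seenByPrimary == backup.nextValue());
    }
}
```

In this sketch the backup never consults its own clock, so both replicas see identical values for every logged operation. A real implementation would also have to handle lost messages and primary failure in the middle of a frame, which are left out here.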
Performance Results
SMP Raytrace
Conclusion
  Ideal for long-running, low-I/O Java applications
  Only a small performance degradation even for frequent synchronization between replicas (e.g. every second)
  Quick detection and recovery from failure