Supporting Fault-Tolerance in Streaming Grid Applications


1 Supporting Fault-Tolerance in Streaming Grid Applications
Qian Zhu, Liang Chen, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2008 Conference, April 15th, Miami, Florida

2 Data Streaming Applications
Computational Steering: interactively control scientific simulations
Computer Vision Based Surveillance: track people and monitor critical infrastructure; images captured by multiple cameras
Online Network Intrusion Detection: analyze connection request logs and identify unusual patterns

3 Fault-Tolerance
Definition: the ability of a system to respond gracefully to an unexpected hardware or software failure
Fault-Tolerance in Grid Applications:
    Redundancy-based fault-tolerance
    Checkpointing-based fault-tolerance

4 Fault-Tolerance in Data Streaming Applications
Fault-Tolerance is Important for Data Stream Processing:
Distributed data sources
Pipelined, real-time processing and long-running nature
Frequent and large-volume data transfers
Dynamic and unpredictable resource availability

5 Overview of GATES Middleware
Distributed data stream processing
Automatic resource discovery
Achieve the best accuracy while maintaining the real-time constraints (self-adaptation algorithm)
Easy to use (Java, XML, Web Services)
Our previous work: HPDC04, SC06, IPDPS06

6 Outline
Motivation and Introduction
Overall Design for Fault-Tolerance
Experimental Evaluation
Related Work
Conclusion

7 Overall Design for Fault-Tolerance
Design Alternatives:
Redundancy-based fault-tolerance; drawbacks: resource requirements, synchronization of states for all replicas
Checkpointing-based fault-tolerance; drawbacks: platform dependent, large-volume checkpoints

8 Our Proposed Approach
Light-weight Summary Structure (LSS): locally updated each processing round, transferred to remote nodes
Heartbeat-based fault detection (a minimal sketch follows below)
Failure recovery using LSS
Other issues and enhancements: data backup buffer, efficient resource allocation algorithm
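The slides do not spell out how the heartbeat mechanism is implemented. The Java sketch below only illustrates the general technique; the class HeartbeatMonitor, its method names, and the failure threshold are assumptions, not part of the GATES middleware.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative heartbeat-based failure detector (hypothetical, not GATES code).
// Nodes periodically report heartbeats; a monitor flags nodes whose last
// heartbeat is older than the failure threshold and triggers LSS-based recovery.
class HeartbeatMonitor {
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();
    private final long failureThresholdMs;

    HeartbeatMonitor(long failureThresholdMs) {
        this.failureThresholdMs = failureThresholdMs;
    }

    // Called whenever a heartbeat message arrives from a node.
    void recordHeartbeat(String nodeId) {
        lastHeartbeat.put(nodeId, System.currentTimeMillis());
    }

    // Invoked periodically to find nodes that have stopped sending heartbeats.
    void checkForFailures() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > failureThresholdMs) {
                onFailure(entry.getKey());
            }
        }
    }

    // Hypothetical hook; in the approach described here, this is where
    // failure recovery from the replicated LSS would start.
    void onFailure(String nodeId) {
        System.out.println("Node " + nodeId + " presumed failed");
    }
}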

9 Definition of Light-weight Summary Structure (LSS)
Data stream processing structure (below)
Summary information accumulated in each processing loop iteration
A small memory size

...
while (true) {
    read_data_from_streams();
    process_data();
    accumulate_intermediate_results();
    reset_auxiliary_structures();
}

10 LSS: An Example
Application: Counting Samples. Data from a data source is processed to compute the m most frequent numbers (for example, the 10 most frequent numbers).
counting-lss:
    int: value of m
    int array: the m most frequent numbers
    int array: corresponding frequencies
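The counting-lss structure above can be expressed as a small class. The following Java sketch is illustrative only: the class name CountingLSS, the accumulate method, and the simplified replace-the-least-frequent merge policy are assumptions, not taken from the actual counting-samples implementation.

// Illustrative LSS for counting samples: value of m, the m most frequent
// numbers seen so far, and their corresponding frequencies.
class CountingLSS {
    final int m;
    final int[] values;
    final int[] frequencies;

    CountingLSS(int m) {
        this.m = m;
        this.values = new int[m];
        this.frequencies = new int[m];
    }

    // Merge one round's count for a value into the summary; if the value is
    // not tracked yet, it replaces the least frequent entry when its count
    // is higher (a simplification of the real counting-samples algorithm).
    void accumulate(int value, int count) {
        for (int i = 0; i < m; i++) {
            if (frequencies[i] > 0 && values[i] == value) {
                frequencies[i] += count;
                return;
            }
        }
        int min = 0;
        for (int i = 1; i < m; i++) {
            if (frequencies[i] < frequencies[min]) {
                min = i;
            }
        }
        if (count > frequencies[min]) {
            values[min] = value;
            frequencies[min] = count;
        }
    }
}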

11 Using LSS for Fault-Tolerance
Much smaller memory size than that of the application: auxiliary structures are reset at the end of each iteration
Approximate processing on data streams

12 Using LSS for Fault-Tolerance (cont'd)
Comparing LSS-based fault-tolerance to checkpointing in grid environments:
Much smaller memory size than that of the application
A small amount of data is lost during failure recovery
LSS is independent of platforms

13 GATES Implementation for Fault-Tolerance
Application:
    // Initialize auxiliary structures
    initialize_auxiliary_structures();
    // Get an LSS instance from GATES
    counting-lss lss = GATES.getLSS("counting-lss");
    // Process streaming data
    while (true) {
        // If the input buffer is invalid, stop processing
        if (inBuffer.getInputBufferStatus() == INVALID)
            break;
        read_data_from_streams();
        process_data();
        accumulate_intermediate_results_to_LSS(lss);
        update_local_LSS(lss);
    }

GATES:
    // Monitor service
    if (local LSS updated)
        send_LSS_to_Candidates(lss);
    // Replication service
    remote_store_LSS(lss);

14 Failure Recovery Procedure
Diagram: failure recovery procedure.
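The recovery procedure appears only as a diagram in the original slides. Based on the surrounding slides (heartbeat detection, LSS replication to candidate nodes, and the data backup buffer), the outline below is a rough sketch of the likely steps; every class, method, and helper name here is hypothetical.

// Illustrative outline of LSS-based failure recovery (hypothetical names).
class FailureRecovery {
    void recover(String failedNode) {
        // 1. Pick a replacement from the pre-computed candidate nodes.
        String candidate = selectCandidateNode(failedNode);
        // 2. Restore the most recently replicated LSS on that node.
        byte[] lss = loadReplicatedLSS(failedNode);
        deployStage(candidate, lss);
        // 3. Re-send the data buffered since the last acknowledgment,
        //    then resume normal stream processing.
        replayBackupBuffer(failedNode, candidate);
    }

    // Stubs standing in for middleware functionality.
    String selectCandidateNode(String failedNode) { return "candidate-node"; }
    byte[] loadReplicatedLSS(String failedNode) { return new byte[0]; }
    void deployStage(String node, byte[] lss) { }
    void replayBackupBuffer(String from, String to) { }
}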

15 Other Issues and Enhancements
Data Backup Buffer Data is stored in the backup buffer until acknowledgment is received Obsolete data in the backup buffer will be replaced Efficient Resource Allocation Algorithm Candidate nodes Dijkstra’s shortest path algorithm IPDPS 2008

16 Outline
Motivation and Introduction
Overall Design for Fault-Tolerance
Experimental Evaluation
Related Work
Conclusion

17 Streaming Applications
Counting Samples (count-samps): determines the n most frequent numbers; LSS: the m most frequent numbers
Clustering Evolving Streams (clustream): groups data into n clusters; LSS: m micro-clusters
Distributed Frequency Counting (dist-freq-counting): finds the most frequent itemset with a threshold; LSS: the most frequent itemset with the threshold

18 Goals for the Experiments
Show that LSS uses a small amount of memory
Evaluate the overhead of using LSS for fault-tolerance
Show the impact on accuracy

19 Experiment Setup and Datasets
64-node computing cluster
Simulated different inter-node bandwidths
Datasets:
    count-samps: data generated by a simulator
    clustream: KDD-CUP’99 Network Intrusion Detection dataset
    dist-freq-counting: IBM synthetic data generator

20 Memory Usage of LSS
Value of m    Size of LSS (KB)    Size of count-samps (KB)
20            6                   954
80                                1149
160           36                  1432
200           48                  1662
LSS only occupied approximately 0.6%, 1.7%, 2.5%, and 2.9%, respectively, of the memory used by the entire application. LSS consumed 0.9% of the memory of the clustream application and 1.1% of the dist-freq-counting application.
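To illustrate the first row: with m = 20, the LSS occupies 6 KB out of the 954 KB used by count-samps, i.e., 6 / 954 ≈ 0.6% of the application's memory; the remaining percentages are computed in the same way.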

21 Using LSS for Fault-Tolerance: Performance
Chart: execution time of count-samps (values annotated on the chart: 4%, 7%, 10%).

22 Using LSS for Fault-Tolerance: Performance
Chart: execution time of clustream (value annotated on the chart: 2.5%).

23 Using LSS for Fault-Tolerance: Performance
Chart: execution time of dist-freq-counting (value annotated on the chart: 3.5%).

24 Using LSS for Fault-Tolerance: Accuracy
Chart: accuracy of count-samps (values annotated on the chart: 1%, 6%).

25 Using LSS for Fault-Tolerance: Accuracy
Chart: accuracy of clustream.

26 Outline
Motivation and Introduction
Overall Design for Fault-Tolerance
Experimental Evaluation
Related Work
Conclusion

27 Related Work
Application-Level Checkpointing: Bronevetsky et al. (PPoPP03, ASPLOS04, SC06)
Replication-based Fault Tolerance: Abawajy et al. (IPDPS04), Murty et al. (HotDep06), Hwang et al. (ICDE05), Zheng et al. (Cluster04)
Fault Tolerance in Distributed Data Stream Processing: Balazinska et al. (SIGMOD05, ICDE05)

28 Outline
Motivation and Introduction
Overall Design for Fault-Tolerance
Experimental Evaluation
Related Work
Conclusion

29 Conclusion
Use of LSS to enable efficient failure recovery
Use of additional buffers to control data loss
Efficient resource allocation algorithm
Modest overhead associated with fault detection and failure recovery
Small loss of accuracy

30 Thank you!

