Team 6: Slackers
18-749: Fault-Tolerant Distributed Systems
Team Members: Puneet Aggarwal, Karim Jamal, Steven Lawrance, Hyunwoo Kim, Tanmay Sinha
Team Members
URL: http://www.ece.cmu.edu/~ece749/teams-06/team6/
Overview
Baseline Application
Baseline Architecture
FT-Baseline Goals
FT-Baseline Architecture
Fail-Over Mechanisms
Fail-Over Measurements
Fault Tolerance Experimentation
Bounded “Real-Time” Fail-Over Measurements
FT-RT-Performance Strategy
Other Features
Conclusions
Baseline Application
Baseline Application: What is Park 'n Park?
A system that manages the information and status of multiple parking lots
Keeps track of how many spaces are available in each lot and on each of its levels
Recommends other available lots nearby when the current lot is full
Allows drivers to enter and exit lots and to move up and down levels once inside a parking lot
Baseline Application: Why is it interesting?
Easy to implement
Easy to distribute over multiple systems
Potential for having multiple clients
The middle tier can be made stateless
Hasn't been done before in this class
And most of all... who wants this?
Baseline Application: Development Tools
Java
  – Familiarity with the language
  – Platform independence
CORBA
  – Long story (to be discussed later...)
MySQL
  – Familiarity with the package
  – Free!!!
  – Available on the ECE cluster
Linux, Windows, and OS X
  – No one has the same system nowadays
Eclipse, Matlab, CVS, and PowerPoint
  – Powerful tools in their target markets
Baseline Application: High-Level Components
Client
  – Provides an interface to interact with the user
  – Creates an instance of a Client Manager
Server
  – Manages the Client Manager Factory
  – Handles CORBA functions
Client Manager
  – Part of the middle tier
  – Manages various client functions
  – Unique to each client
Client Manager Factory
  – Part of the middle tier
  – Factory for Client Manager instances
Database
  – Stores the state for each client
  – Stores the state of the parking lots (i.e., occupancy of lots and levels, distances to other parking lots)
Naming Service
  – Allows a client to obtain a reference to a server
Baseline Architecture
Baseline Architecture: High-Level Components
[Architecture diagram: Client, middleware (Server hosting the Client Manager Factory and Client Managers), Naming Service, and Database, connected by the numbered data flow below]
1. Create instance (the server creates the Client Manager Factory)
2. Register name (the server registers with the naming service)
3. Contact naming service (the client looks up the server)
4. Create client manager instance (the client asks the factory)
5. Create instance (the factory creates a Client Manager)
6. Invoke service method (the client calls the Client Manager)
7. Request data (the Client Manager queries the database)
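As a rough illustration of steps 3-6, a client-side bootstrap might look like the sketch below. The IDL-derived types (ClientManagerFactory, ClientManager), their method signatures, and the binding name "ParkNParkServer" are assumptions; only the flow comes from the diagram.

```java
import org.omg.CORBA.ORB;
import org.omg.CosNaming.NamingContextExt;
import org.omg.CosNaming.NamingContextExtHelper;

public class ParkClient {
    public static void main(String[] args) throws Exception {
        int clientId = 1;                       // illustrative client ID
        ORB orb = ORB.init(args, null);

        // Step 3: contact the naming service
        NamingContextExt naming = NamingContextExtHelper.narrow(
                orb.resolve_initial_references("NameService"));

        // Step 4: resolve the server's factory (binding name is a placeholder)
        ClientManagerFactory factory = ClientManagerFactoryHelper.narrow(
                naming.resolve_str("ParkNParkServer"));

        // Steps 5-6: create this client's manager and invoke a service method
        ClientManager manager = factory.createClientManager(clientId);
        manager.enterLot(clientId, 1);          // e.g., enter lot #1
    }
}
```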
FT-Baseline Goals
FT-Baseline Goals: Main Goals
Replicate the entire middle tier in order to make the system fault-tolerant; the middle tier includes
  – Client Manager
  – Client Manager Factory
  – Server
No need to replicate the naming service, replication manager, and database because of the added complexity and limited development time
Maintain the stateless nature of the middle tier by storing all state in the database
For the fault-tolerant baseline application
  – 3 replicas of the servers run on clue, chess, and go
  – The naming service (boggle), replication manager (boggle), and database (previously on mahjongg, now on girltalk) run on the sacred servers
  – These have not been replicated and are single points of failure
FT-Baseline Goals: FT Framework
Replication Manager
  – Responsible for checking the liveness of the servers
  – Performs fault detection and recovery of servers
  – Can handle an arbitrary number of server replicas
  – Can be restarted
Fault Injector
  – kill -9
  – A script that periodically kills the primary server
  – Added in the RT-FT-Baseline implementation
FT-Baseline Architecture
FT-Baseline Architecture: High-Level Components
[Architecture diagram: the baseline components plus a Replication Manager that poke()s the servers and bind()s/unbind()s their names in the naming service]
1. Create instance (the server creates the Client Manager Factory)
2. Register name (the server registers with the naming service)
3. Notify the replication manager of the server's existence
4. Contact naming service (the client looks up the primary)
5. Create client manager instance (the client asks the factory)
6. Create instance (the factory creates a Client Manager)
7. Invoke service method
8. Request data (the Client Manager queries the database)
Fail-Over Mechanism
Fail-Over Mechanism: Fault-Tolerant Client Manager
Resides on the client side
Invokes service methods on the Client Manager on behalf of the client
Responsible for fail-over (see the sketch below)
  – Detects faults by catching exceptions
  – If an exception is thrown during a service invocation, it gets the primary server reference from the naming service and retries the failed operation using the new server reference
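A minimal sketch of that catch-and-retry loop, shown for one illustrative operation; resolvePrimary() stands in for the naming-service lookup, and the fields and method signatures are assumptions rather than the project's actual interface:

```java
import org.omg.CORBA.COMM_FAILURE;
import org.omg.CORBA.OBJECT_NOT_EXIST;

// Fields assumed on the wrapper class: ClientManager clientManager; int clientId;
// On a fail-over-triggering exception, re-resolve the primary server from the
// naming service and retry the same operation against the new reference.
public int enterLot(int lotNumber) throws Exception {
    while (true) {
        try {
            return clientManager.enterLot(clientId, lotNumber);
        } catch (COMM_FAILURE | OBJECT_NOT_EXIST | ServiceUnavailableException e) {
            // The primary failed: fetch the new primary's reference and
            // recreate this client's manager before retrying.
            clientManager = resolvePrimary().createClientManager(clientId);
        }
    }
}
```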
Fail-Over Mechanism: Replication Manager
Detects faults using a method called "poke" (a sketch of the loop follows)
Maintains a dynamic list of active servers
Restarts failed or corrupted servers
Performs naming service maintenance
  – Unbinds the names of crashed servers
  – Rebinds the name of the primary server
Uses a most-recently-active policy to choose a new primary if the primary server experiences a fault
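Put together, the replication manager's main loop plausibly looks like the sketch below. The poll interval, the ServerEntry type, the "end of list is most recently active" ordering, and the restart/rebind helpers are all placeholders for details the slides leave out:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the replication manager's detection and recovery loop.
class ReplicationManagerSketch {
    static final long POKE_INTERVAL_MS = 1000;        // assumed poll period
    private final List<ServerEntry> activeServers = new ArrayList<>();
    private ServerEntry primary;

    void run() throws InterruptedException {
        while (true) {
            for (ServerEntry s : new ArrayList<>(activeServers)) {
                try {
                    s.poke();                         // liveness + DB check
                } catch (Exception e) {
                    activeServers.remove(s);          // drop from active list
                    unbindName(s);                    // naming maintenance
                    restart(s);                       // relaunch the replica
                }
            }
            if ((primary == null || !activeServers.contains(primary))
                    && !activeServers.isEmpty()) {
                // Most-recently-active policy: promote the replica that most
                // recently answered a poke (assumed to sit at the list's end).
                primary = activeServers.get(activeServers.size() - 1);
                rebindPrimaryName(primary);
            }
            Thread.sleep(POKE_INTERVAL_MS);
        }
    }

    interface ServerEntry { void poke() throws Exception; }
    private void unbindName(ServerEntry s) { /* naming.unbind(...) */ }
    private void rebindPrimaryName(ServerEntry s) { /* naming.rebind(...) */ }
    private void restart(ServerEntry s) { /* re-exec the server process */ }
}
```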
Fail-Over Mechanism: The Poke Method
"Pokes" the server periodically
Checks not only whether the server is alive, but also whether the server's database connectivity is intact or corrupted
Throws an exception in case of a fault (e.g., the server can't connect to the database)
The replication manager handles these faults accordingly
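On the server side, poke() can be as small as a throwaway database round-trip, so a dead process and a corrupted database connection both surface as exceptions to the replication manager. A sketch, assuming a JDBC connection field and reusing the ServiceUnavailableException listed on the following slide:

```java
// Server-side poke(): returning normally proves both that the process is
// alive and that its database connection still works.
public void poke() throws ServiceUnavailableException {
    try (java.sql.Statement stmt = dbConnection.createStatement()) {
        stmt.execute("SELECT 1");     // cheap query to exercise the connection
    } catch (java.sql.SQLException e) {
        // Database connectivity is broken: report it rather than limp along.
        throw new ServiceUnavailableException();
    }
}
```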
Fail-Over Mechanism: Exceptions Handled
COMM_FAILURE: CORBA exception
OBJECT_NOT_EXIST: CORBA exception
SystemException: CORBA exception
Exception: Java exception
AlreadyInLotException: the client is already in a lot
AtBottomLevelException: the car cannot move to a lower level because it is on the bottom level
AtTopLevelException: the car cannot move to a higher level because it is on the top level
InvalidClientException: the ID provided by the client doesn't match the ID stored in the system
LotFullException: thrown when the lot is full
LotNotFoundException: the lot number was not found in the database
NotInLotException: the client's car is not in a lot
NotOnExitLevelException: the client is not on an exit level of the lot
ServiceUnavailableException: thrown when an unrecoverable database exception or some other error prevents the server from successfully completing a client-requested operation
Fail-Over Mechanism: Response to Exceptions
Get a new server reference and retry the failed operation when one of the following occurs
  – COMM_FAILURE
  – OBJECT_NOT_EXIST
  – ServiceUnavailableException
Report the error to the user and prompt for the next command when one of the following occurs
  – AlreadyInLotException
  – AtBottomLevelException
  – AtTopLevelException
  – LotFullException
  – LotNotFoundException
  – NotInLotException
  – NotOnExitLevelException
The client terminates when one of the following occurs
  – InvalidClientException
  – SystemException
  – Exception
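These three responses map directly onto ordered catch clauses. A sketch for one representative command, with the application exception types assumed to be generated from the project's IDL and only a few of each category shown:

```java
import org.omg.CORBA.COMM_FAILURE;
import org.omg.CORBA.OBJECT_NOT_EXIST;
import org.omg.CORBA.SystemException;

// Fields assumed: ClientManager clientManager; int clientId;
// One user command dispatched through the three response categories.
void moveUpLevel() {
    try {
        clientManager.moveUpLevel(clientId);
    } catch (COMM_FAILURE | OBJECT_NOT_EXIST | ServiceUnavailableException e) {
        failOverAndRetry("moveUpLevel");         // 1: new reference, retry
    } catch (AtTopLevelException | NotInLotException e) {
        System.err.println(e.getMessage());      // 2: report, prompt again
    } catch (InvalidClientException e) {
        System.exit(1);                          // 3: client terminates
    } catch (SystemException e) {
        System.exit(1);                          // 3: any other CORBA failure
    } catch (Exception e) {
        System.exit(1);                          // 3: anything else
    }
}
```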
Fail-Over Mechanism: Server References
The client obtains a reference to the primary server when
  – it initially starts
  – it notices that the server has crashed or been corrupted (e.g., COMM_FAILURE, ServiceUnavailableException)
When the client notices that there is no primary server reference in the naming service, it displays an appropriate message and then terminates
RT-FT-Baseline Architecture
RT-FT-Baseline Architecture: High-Level Components
[Architecture diagram: the FT-Baseline components plus a Testing Manager that launches the other processes; the Replication Manager poke()s the servers and bind()s/unbind()s their names in the naming service]
1. Create instance (the server creates the Client Manager Factory)
2. Register name (the server registers with the naming service)
3. Notify the replication manager of the server's existence
4. Contact naming service (the client looks up the primary)
5. Create client manager instance (the client asks the factory)
6. Create instance (the factory creates a Client Manager)
7. Invoke service method
8. Request data (the Client Manager queries the database)
Fault Tolerance Experimentation
Fault Tolerance Experimentation: The Fault-Free Run - Graph 1
While the mean latency stayed almost constant, the maximum latency varied
Fault Tolerance Experimentation: The Fault-Free Run - Graph 2
This demonstrates conformance with the magical 1% theory
Fault Tolerance Experimentation: The Fault-Free Run - Graph 3
Mean latency increases as the reply size increases
Fault Tolerance Experimentation: The Fault-Free Run - Conclusions
Our data conforms to the magical 1% theory, indicating that outliers account for less than 1% of the data points
We hope that this helps with Tudor's research
Bounded “Real-Time” Fail-Over Measurements
Bounded “Real-Time” Fail-Over Measurements: The Fault-Induced Run - Graph
High latency is observed during faults
Bounded “Real-Time” Fail-Over Measurements: The Fault-Induced Run - Pie Chart
The client's fault-recovery timeout causes most of the latency
Bounded “Real-Time” Fail-Over Measurements: The Fault-Induced Run - Conclusions
There is an observable latency spike when a fault occurs
Most of the latency was caused by the client's fault-recovery timeout
The second-highest contributor was the time the client has to wait for its client manager to be restored on the new server
FT-RT-Performance Strategy
FT-RT-Performance Strategy: Reducing Fail-Over Time
Implemented strategies
  – Adjust the client's fault-recovery timeout
  – Use IOGRs and cloning-like strategies
  – Pre-create TCP/IP connections to all servers
Other strategies that could potentially be implemented
  – Database connection pool
  – Load balancing
  – Removing the client ID consistency check
Measurements after Strategies: Adjusting the Waiting Time
The following graphs are for different values of the wait time at the client end
This is the time the client waits in order to give the replication manager sufficient time to update the naming service with the new primary
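In other words, the tunable is just a pause between detecting the fault and re-resolving the primary. A sketch of where it sits, with the constant standing in for the values swept in the plots below (0-4500 ms); the helper names follow the earlier sketches and are assumptions:

```java
// Fields assumed: int clientId;
// Client-side fault recovery: wait so the replication manager has time to
// rebind the new primary in the naming service, then re-resolve and retry.
static final long FAILOVER_WAIT_MS = 4000;   // swept from 0 to 4500 ms below

ClientManager recover() throws Exception {
    Thread.sleep(FAILOVER_WAIT_MS);                          // the tunable
    return resolvePrimary().createClientManager(clientId);   // as sketched earlier
}
```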
Measurements after Strategies: Plot for 0 ms waiting time
Measurements after Strategies: Plot for 500 ms waiting time
Measurements after Strategies: Plot for 1000 ms waiting time
Measurements after Strategies: Plot for 2000 ms waiting time
Measurements after Strategies: Plot for 2500 ms waiting time
Measurements after Strategies: Plot for 3000 ms waiting time
Measurements after Strategies: Plot for 3500 ms waiting time
Measurements after Strategies: Plot for 4000 ms waiting time
Measurements after Strategies: Plot for 4500 ms waiting time
Measurements after Strategies: Observations after Adjusting Wait Times
The best results are seen with a 4000 ms wait time
Even though the fail-over time is much lower for smaller wait values, those runs show a significant amount of jitter
The jitter occurs because the client does not yet get the updated primary from the naming service
Since our primary concern is bounded fail-over, we chose the setting with the least jitter rather than the one with the lowest latencies
The average recovery time is reduced by a decent amount (from about 5-6 s to 4.5-5 s with the 4000 ms wait time)
Measurements after Strategies: Implementing IOGRs
IOGR: Interoperable Object Group Reference
With this strategy, the client gets the list of all active servers from the naming service
The client refreshes this list only when all the servers in the list have failed
The following graphs were produced after this strategy was implemented
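A sketch of that group-reference loop: walk the cached list of server references on failure, and go back to the naming service only when the whole list is exhausted. The Server type, the field, and the helper names are placeholders:

```java
import java.util.List;
import org.omg.CORBA.SystemException;

// Field assumed: populated at startup from the naming service.
private List<Server> cachedServers;

// Try each cached server in turn; refresh the cache from the naming
// service only when every server in the list has failed.
ClientManager obtainManager() throws Exception {
    while (true) {
        for (Server s : cachedServers) {
            try {
                return s.createClientManager(clientId);
            } catch (SystemException e) {
                // this replica is dead; fall through to the next one
            }
        }
        cachedServers = fetchActiveServers();   // naming-service round-trip
    }
}
```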
Measurements after Strategies: Plot after the IOGR strategy (same axis)
Measurements after Strategies: Plot after the IOGR strategy (different axis)
Measurements after Strategies: Pie chart after the IOGR strategy
Measurements after Strategies: Observations after the IOGR Strategy
The recovery time is significantly reduced, from 5-6 seconds to less than half a second
The time to get the new primary from the naming service is eliminated
Most of the remaining time is spent obtaining a client manager object
The plot on the different axis shows some jitter because, when all the servers in the client's list are dead, the client still has to go to the naming service
Measurements after Strategies: Implementing Open TCP/IP Connections
This strategy was implemented because, after the IOGR strategy, most of the remaining time was spent establishing a connection with the next server and getting the client manager
With this strategy, the client maintains open TCP/IP connections to all the servers
The time to create a connection is therefore saved
The following graphs were produced after the open-connections strategy was implemented
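Since a CORBA ORB typically opens the transport connection lazily on the first request, one plausible way to keep connections warm is to touch every cached reference once at startup. The sketch below reuses poke() as the cheap call, which assumes clients are allowed to invoke it; the deck does not say exactly how the connections were pre-created:

```java
// Touch every replica once so the TCP/IP connections already exist when a
// fail-over happens; the connection-establishment cost moves to startup.
void preconnectAll() {
    for (Server s : cachedServers) {
        try {
            s.poke();   // first request forces the ORB to open the connection
        } catch (Exception e) {
            // a currently-dead replica simply stays unconnected for now
        }
    }
}
```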
Measurements after Strategies: Plot after maintaining open connections (same axis, 1 client)
Measurements after Strategies: Plot after maintaining open connections (different axis, 1 client)
Measurements after Strategies: Pie chart after maintaining open connections (1 client)
Measurements after Strategies: Plot after maintaining open connections (same axis, 10 clients)
Measurements after Strategies: Plot after maintaining open connections (different axis, 10 clients)
Measurements after Strategies: Pie chart after maintaining open connections (10 clients)
Measurements after Strategies: Observations after Implementing Open Connections (1 Client)
The recovery time is reduced compared to the cloning strategy
The maximum time is still spent obtaining a client manager object
There is noticeable jitter when observed on the different axis
Measurements after Strategies: Observations after Implementing Open Connections (10 Clients)
A significant reduction in fail-over time is observed
The maximum time is still spent obtaining a client manager object
A significant amount of time is also spent waiting to acquire a thread lock
Other Features
Other Features: The Long Story - EJB 3.0
It's actually not that long of a story... we tried to use EJB 3.0 and failed miserably. The End.
The main issues with using EJB 3.0 are
  – It is a new technology, so documentation on it is very sparse
  – It is still evolving and changing, which can cause problems (e.g., JBoss 4.0.3 vs. 4.0.4)
  – Development and deployment are significantly different from EJB 2.1, which introduces a new learning curve
  – It is not something that can be learned in one weekend...
Other Features: Bells and Whistles
The replication manager can be restarted
The replication manager can handle an arbitrary number of servers
Any server can be added and removed dynamically because nothing is hard-coded
Cars can magically teleport in and out of parking lots (for testing robustness)
Clients can manually corrupt the server's database connection (for testing robustness)
The client uses the Java Reflection API to consolidate the fault detection and recovery code (see the sketch below)
Sun's CORBA implementation is prevented from spewing exception stack traces at the user
Highly modularized dependency structure in the code (as proved by Lattix LDM)
Other stuff that we can't remember...
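One plausible shape for that Reflection trick is a java.lang.reflect.Proxy whose handler wraps every Client Manager call in the same catch-and-retry logic, so the fail-over code exists in exactly one place. The interface name and the recovery helper are assumptions; the slides only say that Reflection was used to consolidate the code:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

// A single InvocationHandler gives every ClientManager method the same
// fault-detection and recovery behavior without per-method boilerplate.
class FailOverHandler implements InvocationHandler {
    private Object target;                       // current CORBA reference

    FailOverHandler(Object target) { this.target = target; }

    public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
        while (true) {
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                Throwable cause = e.getCause();
                if (!(cause instanceof org.omg.CORBA.COMM_FAILURE
                        || cause instanceof org.omg.CORBA.OBJECT_NOT_EXIST)) {
                    throw cause;                  // not a fail-over trigger
                }
                target = reconnectToNewPrimary(); // placeholder recovery step
            }
        }
    }

    private Object reconnectToNewPrimary() {
        /* naming-service lookup, as in the earlier sketches */
        return target;
    }
}

// Usage: wrap the real reference once, then call through the proxy.
// ClientManager ft = (ClientManager) Proxy.newProxyInstance(
//         ClientManager.class.getClassLoader(),
//         new Class<?>[] { ClientManager.class },
//         new FailOverHandler(realManager));
```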
Other Features: Lessons Learned
It's difficult to implement real-time behavior, fault tolerance, and high performance, especially if they are not factored into the architecture from the start
Choose an application that will let you easily apply the concepts learned in the class
Don't waste time on bells and whistles until you have time to do so
Run your measurements before other teams hog and crash the server
Set up your own database server
Kill the server in a way that flushes the logs before the server dies
Catch and handle as many exceptions as possible
It's a good thing that we did not use JBoss!
Use the girltalk server, because no one else is going to use that one...
Other Features: Most Painful Lessons Learned
1. The EJB concept takes time to learn and use
2. EJB 3.0 introduces another learning curve
3. JBoss provides many, many configuration options, which makes deploying an application a challenging task
4-10. Don't try to learn the concepts of EJB...
...and EJB 3.0...
...and JBoss...
...all at the same time...
...in one weekend...
...especially when the project is due the following Monday...
...!!!!!!!!!!!!!!!!!!!!
Conclusions
Conclusions: If We Had the Time-Turner!!
We would start right away with CORBA (100+ hours were a little too much)
And we just found out... we would have counted the number of invocations in the experiments before submitting
Conclusions: ...the final word.
Yeayyyyyyyyyyyyyyyyyyyyyyyyyy!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Thank You.