Team 4: 18-749: Fault-Tolerant Distributed Systems Bryan Murawski Meg Hyland Jon Gray Joseph Trapasso Prameet Shah Michael Mishkin.

Slides:



Advertisements
Similar presentations
Configuration management
Advertisements

Snejina Lazarova Senior QA Engineer, Team Lead CRMTeam Dimo Mitev Senior QA Engineer, Team Lead SystemIntegrationTeam Telerik QA Academy SOAP-based Web.
MCTS GUIDE TO MICROSOFT WINDOWS 7 Chapter 10 Performance Tuning.
Netscape Application Server Application Server for Business-Critical Applications Presented By : Khalid Ahmed DS Fall 98.
Chapter One The Essence of UNIX.
Copyright 2004 Monash University IMS5401 Web-based Systems Development Topic 2: Elements of the Web (g) Interactivity.
Team 1: Box Office : Analysis of Software Artifacts : Dependability Analysis of Middleware JunSuk Oh, YounBok Lee, KwangChun Lee, SoYoung Kim,
7.1 © 2004 Pearson Education, Inc. Exam Planning, Implementing, and Maintaining a Microsoft Windows Server 2003 Active Directory Infrastructure.
1 SWE Introduction to Software Engineering Lecture 22 – Architectural Design (Chapter 13)
Team 2: The House Party Blackjack Mohammad Ahmad Jun Han Joohoon Lee Paul Cheong Suk Chan Kang.
Presented by IBM developer Works ibm.com/developerworks/ 2006 January – April © 2006 IBM Corporation. Making the most of Creating Eclipse plug-ins.
Team 6: Slackers 18749: Fault Tolerant Distributed Systems Team Members Puneet Aggarwal Karim Jamal Steven Lawrance Hyunwoo Kim Tanmay Sinha.
1 Philippe. Team 3: Spam’n’Beans : Analysis of Software Artifacts : Dependability Analysis of Middleware Gary Ackley Andrew Boyer Charles.
Hands-On Microsoft Windows Server 2003 Administration Chapter 5 Administering File Resources.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 8: Implementing and Managing Printers.
Guide to Linux Installation and Administration, 2e1 Chapter 6 Using the Shell and Text Files.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment, Enhanced Chapter 8: Implementing and Managing Printers.
16: Distributed Systems1 DISTRIBUTED SYSTEM STRUCTURES NETWORK OPERATING SYSTEMS The users are aware of the physical structure of the network. Each site.
70-290: MCSE Guide to Managing a Microsoft Windows Server 2003 Environment Chapter 8: Implementing and Managing Printers.
PRASHANTHI NARAYAN NETTEM.
MCDST : Supporting Users and Troubleshooting a Microsoft Windows XP Operating System Chapter 5: User Environment and Multiple Languages.
Passage Three Introduction to Microsoft SQL Server 2000.
Overview SAP Basis Functions. SAP Technical Overview Learning Objectives What the Basis system is How does SAP handle a transaction request Differentiating.
Client/Server Architectures
Object Oriented Databases by Adam Stevenson. Object Databases Became commercially popular in mid 1990’s Became commercially popular in mid 1990’s You.
QCDgrid Technology James Perry, George Beckett, Lorna Smith EPCC, The University Of Edinburgh.
Server Load Balancing. Introduction Why is load balancing of servers needed? If there is only one web server responding to all the incoming HTTP requests.
1 The Google File System Reporter: You-Wei Zhang.
Christopher Jeffers August 2012
Basics of Web Databases With the advent of Web database technology, Web pages are no longer static, but dynamic with connection to a back-end database.
MCTS Guide to Microsoft Windows 7
Oracle10g RAC Service Architecture Overview of Real Application Cluster Ready Services, Nodeapps, and User Defined Services.
Chapter Fourteen Windows XP Professional Fault Tolerance.
University of Management & Technology 1 Operating Systems & Utility Programs.
WebLogic Versus JBoss.
Eric Keller, Evan Green Princeton University PRESTO /22/08 Virtualizing the Data Plane Through Source Code Merging.
Enterprise JavaBeans. What is EJB? l An EJB is a specialized, non-visual JavaBean that runs on a server. l EJB technology supports application development.
Section 1: Introducing Group Policy What Is Group Policy? Group Policy Scenarios New Group Policy Features Introduced with Windows Server 2008 and Windows.
9 Chapter Nine Compiled Web Server Programs. 9 Chapter Objectives Learn about Common Gateway Interface (CGI) Create CGI programs that generate dynamic.
Csi315csi315 Client/Server Models. Client/Server Environment LAN or WAN Server Data Berson, Fig 1.4, p.8 clients network.
Python – Part 1 Python Programming Language 1. What is Python? High-level language Interpreted – easy to test and use interactively Object-oriented Open-source.
Ch 1. A Python Q&A Session Spring Why do people use Python? Software quality Developer productivity Program portability Support libraries Component.
Project Overview Graduate Selection Process Project Goal Automate the Selection Process.
Team 5: Virtual Online Blackjack : Analysis of Software Artifacts : Dependability Analysis of Middleware Philip Bianco John Robert Vorachat.
Project Overview Graduate Selection Process Project Goal Automate the Selection Process.
1 Chapter Overview Performing Configuration Tasks Setting Up Additional Features Performing Maintenance Tasks.
Debugging and Profiling With some help from Software Carpentry resources.
Chapter 10 Chapter 10: Managing the Distributed File System, Disk Quotas, and Software Installation.
Apache JMeter By Lamiya Qasim. Apache JMeter Tool for load test functional behavior and measure performance. Questions: Does JMeter offers support for.
11 CLUSTERING AND AVAILABILITY Chapter 11. Chapter 11: CLUSTERING AND AVAILABILITY2 OVERVIEW  Describe the clustering capabilities of Microsoft Windows.
SOAP-based Web Services Telerik Software Academy Software Quality Assurance.
High Availability in DB2 Nishant Sinha
Lesson 3-Touring Utilities and System Features. Overview Employing fundamental utilities. Linux terminal sessions. Managing input and output. Using special.
© FPT SOFTWARE – TRAINING MATERIAL – Internal use 04e-BM/NS/HDCV/FSOFT v2/3 JSP Application Models.
IT System Administration Lesson 3 Dr Jeffrey A Robinson.
The Project Presentation April 28, : Fault-Tolerant Distributed Systems Team 7-Sixers Kyu Hou Minho Jeung Wangbong Lee Heejoon Jung Wen Shu.
Performance Testing Test Complete. Performance testing and its sub categories Performance testing is performed, to determine how fast some aspect of a.
Distributed Logging Facility Castor External Operation Workshop, CERN, November 14th 2006 Dennis Waldron CERN / IT.
SPI NIGHTLIES Alex Hodgkins. SPI nightlies  Build and test various software projects each night  Provide a nightlies summary page that displays all.
VMware Certified Professional 6-Data Center Virtualization Beta 2V0-621Exam.
 Project Team: Suzana Vaserman David Fleish Moran Zafir Tzvika Stein  Academic adviser: Dr. Mayer Goldberg  Technical adviser: Mr. Guy Wiener.
Monitoring Dynamic IOC Installations Using the alive Record Dohn Arms Beamline Controls & Data Acquisition Group Advanced Photon Source.
11 DEPLOYING AN UPDATE MANAGEMENT INFRASTRUCTURE Chapter 6.
Architecture Review 10/11/2004
Jean-Philippe Baud, IT-GD, CERN November 2007
Introduction to Distributed Platforms
Netscape Application Server
Introduction to J2EE Architecture
SDMX IT Tools SDMX Registry
Presentation transcript:

Team 4: : Fault-Tolerant Distributed Systems Bryan Murawski Meg Hyland Jon Gray Joseph Trapasso Prameet Shah Michael Mishkin

2 Team Members Meg Hyland BrYan Murawski Joe Trapasso Prameet Shah Michael Mishkin Jonathan Gray

3 Baseline Application  System Description – EJBay is a distributed auctioning system that allows users to buy and sell items in an auction plaza  Baseline Applications – A user can create, login, update, logout, view other users’ account information. – A user can post, view, search, post a bid, view bid history of auctions. – Application Exceptions: DuplicateAccount, InvalidAuction, InvalidBid, InvalidUserInfo, InvalidUserPass, UserNotLoggedIn  Why is it Interesting? – A service used by many commercial vendors.  Configuration – Operating System Server & Client: Linux – Language Java SDK – Middleware Enterprise Java Beans – Third-party Software Database: MySQL Application Server: JBoss IDE: XEmacs, Netbeans

4 Baseline Application – Configuration Selection Criteria  Operating System: Linux – Easier to use, since ECE clusters are configured. – System is managed and backed up nightly by Computing Services.  Enterprise Java Beans (EJB) – Popular technology in the industry. – Every members’ preference.  MySQL – World’s most popular open source database. – Easy to install and use. – Couple of group members knew it well.  JBoss – Easily available on the servers. – Environment that was used in previous projects.  XEmacs – Most commonly learned text editor. – Members were familiar with syntax.  Netbeans – Easy to install and incorporates tab completion. – Allows you to see available functions within a class.

5 Baseline Architecture

6 Experimental Evaluation – Architecture  Unmodified Server Application  New Automated Client – Experimental variables taken as command-line inputs – Performs specified number of invocations and dies  Central Library of MATLAB scripts – One script to read in data from all probes – Others scripts each responsible for a specific graph

7 Experimental Evaluation – Results  Expected results – Increasing clients yield increasing latency – Most time spent in Middleware – “Magical 1%” – Slightly longer latencies in non-standard reply size cases  Actual results – Memory / Heap problems – Java optimizations changing behavior of code Shorter latency in non-standard reply size cases – Database INSERTs take much longer than SELECTs – Only exhibited “Magical 1%” to some extent – Very high variability and some unusual/unexpected results During test runs close to deadline; very high server/database loads

8 Experimental Evaluation – Original Latency – First set of experiments revealed unusual characteristics at high load – Default Java heap-size was not large enough – Garbage collector ran constantly after ~4500 requests w/ 10 clients

9 Experimental Evaluation – Improved Latency – Increased heap from default to 300MB

10 Experimental Evaluation – Improved Latency – Mean and 99% Latency area graph only loosely exhibited the “Magic 1%” behavior

11 Fault-Tolerance Framework  Replicate servers – Passive replication – Stateless servers – Allow for up to 14 replicas One for each machine in the Games cluster (minus ASL and Mahjongg)  Sacred Machines – Clients – Replication Manager – Naming Service – Fault Injector – Database  Elements of Fault-tolerance Framework – Replication Manager Heartbeat Fault detector Automatic recovery (maintenance of number of replicas) – Fault Injector

12 FT-Baseline Architecture

13 Replication Manager  Responsible for launching and maintaining servers  Heartbeats replicas periodically – 500ms period  Differentiates between crash faults and process faults – Crash fault: Server is removed from the active list – Process fault: Process is killed and restarted  Catches port binding exceptions – A server is already running on the current machine  remove from active list  Maintains global JNDI – Updating server references for clients – Indicates which server is primary/secondary – Keeps a count of the number of times any primary has failed  Advanced Features – Allows the user to see the current status of all replicas – Allows the user to see the bindings in the JNDI

14 Fault Injector  2 Modes  Manual Fault Injection – Runs a “kill -9” on a user specified server  Periodic Fault Injection – Prompts user to set up a kill timer Base period Max jitter about the base period Option to only kill primary replica, or a random replica

15 Mechanisms for Fail-Over  Replication Manager detected fail-over – Detects that a heartbeat thread failed – Kills the associated server – Checks cause of death – Launches new replica – If no active servers are free, the replication manager will print a message, kill all servers and exit  Client detected fail-over – Receives a RemoteException – Queries naming service for a new primary Previously accessed JNDI directly – Required a pause for JNDI to be corrected Sometimes this resulted in multiple failover attempts – When JNDI was not ready after predetermined wait time

16 Round Trip Client Latency w/Faults Average Latency for all Invocations – ms

17 Fail-Over Measurements – Half fault time is client delay waiting for JNDI to be updated – Rest of time spent between detection and correction in Rep Manager – This discrepancy between delay-time and correction time is the major target for improvement

18 RT-FT-Baseline Architecture Improvements  Target fault-detection and correction time in Replication Manager – Tweaking heartbeat frequency and heartbeat monitor frequency – Improvements in interactions with JNDI Additional parameters to specify primary server Update JNDI by modifying entries rather than rebuilding each time  Target fail-over time in client – Client pre-establishes connections to all active servers – Background thread queries JNDI and maintains updated list – On fail-over, client immediately fails-over to next active server No delay waiting for Replication Manager to update JNDI Background thread will synchronize client’s server list once it has been updated by the Replication Manager

19 RT-FT-Baseline Architecture

20 RT-FT- Post-Improvement Performance Old 1 Client Measurements Avg. Latency for all Invocations: ms Avg. Latency during a Fault: 4544ms New 1 Client Measurements Avg. Latency for all Invocations: ms Avg. Latency during a Fault: ms (82.2% Improvement)

21 RT-FT- Post-Improvement Performance – 4 Clients New 4 Client Measurements Avg. Latency for all Invocations: ms Avg. Latency during a Fault: ms

22 RT-FT- Post-Improvement Performance  More even distribution of time  Client reconnect time still dominates, but is a much smaller number

23 Special Features  Experimental Evaluation – Utilized JNI for microsecond precision timers – Maintained a central library of MATLAB processing scripts – Perl and shell scripts to automate entire process  Fault-Tolerant Baseline – Powerful Replication Manager that starts, restarts, and kills servers – Integrated command-line interface for additional automation – Fault-Injector with dual-modes  Fault-Case Performance – New client functionality to pre-establish all connections – Contents of JNDI directly correlated to actual status of servers Online, offline, booting

24 Open Issues  Problems launching multiple servers concurrently from Rep Manager – Many attempts to address/debug this issue with only some success – If multiple faults occur within short period of time, some servers may die unexpectedly  Improved Client Interface – GUI or Web-Based  Additional Application Features – Allow deletion of accounts, auctions, and bids – Security! – Improved search functionality

25 Conclusions  What we have learned – Stateless middle tier requires less overhead – XML has poor documentation. XDoclet would have been a good tool to use. – Running experiments takes an extremely long time. Automating test scripts increases throughput.  What we accomplished – A robust fault-tolerant system with a fully automated Replication Manager – Fully automated testing and evaluation platform  What we would do differently – Spending more time with XDoclet to reduce debugging – Use one session bean instead of separating functionality into two