Reliable Web Services: Methodology, Experiment and Modeling International Conference on Web Services (ICWS 2007) Pat. P. W. Chan, Michael R. Lyu Department.

Presentation transcript:

Reliable Web Services: Methodology, Experiment and Modeling
International Conference on Web Services (ICWS 2007)
Pat. P. W. Chan, Michael R. Lyu, Department of Computer Science and Engineering, The Chinese University of Hong Kong
Miroslaw Malek, Department of Computer Science and Engineering, Humboldt University Berlin, Germany
Presented by Pat Chan

Outline: Introduction to Web Services; Problem Statement; Methodologies for Web Service Reliability; New Reliable Web Service Paradigm; Optimal Parameters; Experimental Results and Discussion; Conclusion.
Outline (notes): Introduction; Problem Statement; Methodologies for Web Service Reliability; New Reliable Web Service Paradigm; Road Map for Experiment; Experimental Results and Discussion; Conclusion; Future Directions.

Introduction
Service-oriented computing is becoming a reality. The problems of service dependability, security and timeliness are becoming critical. We propose experimental settings and offer a roadmap to dependable Web services.
With service-oriented computing becoming a reality, there is an increasing demand for dependability. Service-oriented Architectures (SOA) are based on a simple model of roles. Every service may assume one or more roles, such as service provider, broker or user (requestor). This model simplifies interoperability, as only standard communication protocols and simple broker-request architectures are needed to facilitate the exchange (trade) of services. Not surprisingly, the use of services, especially Web services, has become common practice. Services are expected to dominate the software industry within the next five years. As services begin to pervade all aspects of life, the problems of service dependability, security and timeliness are becoming critical, and appropriate solutions need to be found. Several fault tolerance approaches have been proposed for Web services in the literature [1, 2, 3, 4], but the field still requires solid theory, appropriate models, effective design paradigms, a practical implementation, and in-depth experimentation for building highly dependable Web services [5, 6]. In this paper, we propose such experimental settings and offer a roadmap to dependable Web services. We compose a list of parameters that are closely related to evaluating quality of service from the dependability perspective. We focus on service availability and timeliness and consider them a cornerstone of our approach. Security, which can also be viewed as a part of dependability, is beyond the scope of this paper.

Problem Statement
Fault-tolerant techniques: replication and diversity.
Replication is one of the efficient ways of providing reliable systems by time or space redundancy. It increases the availability of distributed systems; key components are re-executed or replicated; it protects against hardware malfunctions or transient system faults.
Another efficient technique is design diversity: employ software systems or services designed independently by different programming teams to defend against permanent software design faults.
We focus on the analysis of the replication techniques when applied to Web services. A generic Web service system with spatial as well as temporal replication is proposed and investigated.
There are many fault-tolerant techniques that can be applied to Web services, including replication and diversity. Replication is one of the efficient ways of providing reliable systems by time or space redundancy. Redundancy has long been used as a means of increasing the availability of distributed systems, with key components being re-executed (replication in time) or replicated (replication in space) to protect against hardware malfunctions or transient system faults. Another efficient technique is design diversity. By having different programming teams design software systems or services independently, diversity provides the ultimate resort in defending against permanent software design faults. In this paper, we focus on the analysis of the replication techniques when applied to Web services. We analyze the performance and the availability of Web services using spatial and temporal redundancy and study the tradeoffs between them. A generic Web service system with spatial as well as temporal replication is proposed and investigated.

Proposed Paradigm
[Diagram labels: client (port); UDDI registry; WSDL; replication manager with RR algorithm / voting and WatchDog; IIS; Web service replicas, each with IIS, application and database.]
Register. Look up. Get WSDL. Invoke Web service.
Keep checking the availability of all the Web services. If a Web service fails, update the availability list of Web services. Update the WSDL.
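The interplay between the WatchDog and round-robin dispatch can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the talk: the class `ReplicationManager`, its methods `poll` and `invoke`, and the replica names are hypothetical, and replicas are plain Python callables standing in for Web service endpoints.

```python
from typing import Callable, Dict, List


class ReplicationManager:
    """Toy replication manager: tracks replica availability, dispatches round-robin."""

    def __init__(self, replicas: Dict[str, Callable[[str], str]]):
        # Hypothetical: each replica is a callable standing in for a Web service.
        self.replicas = replicas
        self.available: List[str] = list(replicas)  # availability list
        self._next = 0  # round-robin cursor

    def poll(self, is_alive: Callable[[str], bool]) -> None:
        """Watchdog step: re-check every replica and refresh the availability list."""
        self.available = [name for name in self.replicas if is_alive(name)]

    def invoke(self, request: str) -> str:
        """Send the request to the next available replica in round-robin order."""
        if not self.available:
            raise RuntimeError("no Web service replica available")
        name = self.available[self._next % len(self.available)]
        self._next += 1
        return self.replicas[name](request)


def make_service(name: str) -> Callable[[str], str]:
    # Stand-in service that tags the request with the replica name.
    return lambda request: f"{name}:{request}"


rm = ReplicationManager({n: make_service(n) for n in ("ws1", "ws2", "ws3")})
down = {"ws2"}                            # pretend ws2 has failed
rm.poll(lambda name: name not in down)    # watchdog removes ws2 from the list
results = [rm.invoke("route") for _ in range(4)]
print(results)
```

In this sketch a failed replica only stops receiving requests after the next `poll`, which is why the polling frequency matters for the number of failures.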

Proposed Paradigm
Two modes over the Web service replicas (each with IIS, application and database): round robin, and parallel N-version with voting on the majority result.
Adv: fully uses the resources.
[Diagram labels: client; voting; majority result.]
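A majority voter for the parallel N-version mode might look like the sketch below. The slide does not state the voting rule, so two points are assumptions: a strict majority is counted over all replicas asked (not only those that answered), and `None` stands for a replica that failed or timed out. The function name `majority_vote` is hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None.

    None entries stand for replicas that failed or timed out (assumption).
    """
    answers = [r for r in results if r is not None]
    if not answers:
        return None
    value, count = Counter(answers).most_common(1)[0]
    # Strict majority over all replicas asked, not just those that answered.
    return value if count > len(results) / 2 else None


# Five replicas: three agree, one disagrees, one gives no answer.
winner = majority_vote(["route A", "route A", "route B", None, "route A"])
print(winner)
```

Counting the majority over all replicas is the more conservative choice: a single surviving replica cannot win the vote on its own.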

Experiments
A series of experiments is designed and performed to evaluate the reliability of the Web service with spatial replication, reboot and retry.

Testing system
Best Route Finding. Provides traveling suggestions for users, given a starting point and a destination. The system needs to provide the best route and the price for the users.

System Architecture

Experimental Setup
Examine the computation to communication ratio.
Examine the request frequency so as to limit the load of the server to 75%.
Fix the following parameters: computation to communication ratio (e.g. 10:1); request frequency.

Experimental Setup
Communication time : computation time: 143:14 (10:1)
Request frequency: 1 request per min
Load: 78.5%
Timeout period of retry: 1 min
Timeout for Web service in RM: 1 s (Web service specific)
Polling frequency: 10 requests per min
Number of replicas: 5
Max number of retries: (not given)
Round-robin rate: 1 s

Experiment Parameters
Fault mode: temporary (fault probability: 0.01); permanent (fault probability: 0.001)
Experiment time: 5 days (7200 requests)
Measures: number of failures; average response time (ms)
Failure definition: 5 retries are allowed. If there is still no correct result from the Web service after 5 retries, the request is counted as a failure.
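The failure definition above can be sketched as a retry wrapper. This is an illustration only: the names `call_with_retries` and `make_flaky` are hypothetical, a result of `None` stands for "no correct result", and "5 retries" is read here as one initial call plus up to five retries (six attempts in total), which is an assumption. Timeouts between retries are left out.

```python
from typing import Callable, Optional, Tuple

MAX_RETRIES = 5  # from the failure definition


def call_with_retries(service: Callable[[], Optional[str]],
                      max_retries: int = MAX_RETRIES) -> Tuple[Optional[str], int]:
    """Call the service once, then retry up to max_retries times.

    Returns (result, attempts). A None result after the last retry
    counts as one failure.
    """
    attempts = 0
    for _ in range(1 + max_retries):
        attempts += 1
        result = service()
        if result is not None:
            return result, attempts
    return None, attempts


def make_flaky(failures_before_success: int) -> Callable[[], Optional[str]]:
    """Stand-in service that fails a fixed number of times, then answers."""
    state = {"calls": 0}

    def service() -> Optional[str]:
        state["calls"] += 1
        return None if state["calls"] <= failures_before_success else "route"

    return service


recovered = call_with_retries(make_flaky(2))    # temporary fault, clears quickly
failed = call_with_retries(make_flaky(10))      # fault outlasts all retries
print(recovered, failed)
```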

Experimental Result with Round-robin (failures / response time in ms)
Experiments:
1: Single server
2: Single server with retry
3: Single server with reboot (continuous no response for 3 requests)
4: Single server with retry and reboot
5: Spatial replication (RR)
6: Hybrid approach, RR + retry
7: Hybrid approach, RR + reboot
8: All-round, spatial + retry (5 times) + reboot
Case | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Temp | 705 / 190 | 0 / 223 | 723 / 231 | 0 / 238 | 711 / 187 | 0 / 233 | 726 / 188 | 0 / 231
Perm | 6144 / -- | 6337 / -- | 1064 / -- | 5 / 2578 | 5637 / -- | 5532 / -- | 152 / 187 | 0 / 191
Normal case (only six values survive in the transcript, in order): 0 / 183, 0 / 193, 0 / 190, 0 / 187, 0 / 188, 0 / 195
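To put the Perm failure counts of the round-robin table in proportion, the sketch below converts them to failure ratios. It assumes each run issued the 7200 requests stated in the experiment parameters (5 days at 1 request per minute), and the configuration labels are a reading of the flattened table header, so both are assumptions rather than figures quoted from the talk.

```python
# 5 days at 1 request per minute, as stated in the experiment parameters.
TOTAL_REQUESTS = 5 * 24 * 60

# Perm-row failure counts from the round-robin table; labels are a
# reading of the table header (assumption).
perm_failures = {
    "single server": 6144,
    "single server + retry": 6337,
    "single server + reboot": 1064,
    "single server + retry + reboot": 5,
    "RR": 5637,
    "RR + retry": 5532,
    "RR + reboot": 152,
    "RR + retry + reboot": 0,
}

failure_ratio = {name: n / TOTAL_REQUESTS for name, n in perm_failures.items()}
for name, ratio in failure_ratio.items():
    print(f"{name:32s} {ratio:7.2%}")
```

Under these assumptions, the configurations without reboot lose most requests under permanent faults, while the configurations that include reboot bring the ratio down to a few percent or less.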

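Experimental Result with N-Version (failures / response time in ms)
Experiments:
1: Single server
2: Single server with retry
3: Single server with reboot (continuous no response for 3 requests)
4: Single server with retry and reboot
5: Spatial replication (voting)
6: Hybrid approach, voting + retry
7: Hybrid approach, voting + reboot
8: All-round, spatial + retry (5 times) + reboot
Case | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Perm | 3861 / -- | 3864 / -- | 614 / -- | 3 / 4027 | 1544 / 323 | 1546 / 324 | 63 / 324 | 0 / 323
Normal case (only six values survive in the transcript, in order): 0 / 318, 0 / 320, 0 / 315, 0 / 319, 0 / 322, 0 / 321
Temp (only five values survive in the transcript, in order): 429 / 321, 0 / 356, 423 / 364, 0 / 325, 0 / 324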

Varying the parameters: number of tries; timeout period for retry in single server; timeout period for retry in our paradigm; polling frequency; number of replicas; load of server.

Number of tries
[Table: number of tries versus number of failures in Temp and number of failures in Perm. Surviving values, in transcript order: 95 76 1 2 3 4 5.]

Timeout period for retry in single server
[Table: timeout period versus number of failures in Temp and Perm. Surviving values, in transcript order: 95 7265 2 7156 5 7314 6 6890 7 189 8 82 9 11 10 12 14 16 18.]

Timeout period for retry in single server
[Figure: number of failures versus timeout period.]
If the timeout period for retry is too short, the server does not reboot in time to provide the service again.

Timeout period for retry in our paradigm
[Table: timeout period for retry (s) versus number of failures in Temp and Perm. Surviving values, in transcript order: 2 81 5 10 20.]
When the timeout period for retry is too short, the replica cannot reboot in time and the number of failures increases. Another cause of failure is that the replication manager cannot respond in time to switch the primary Web service. There are 5 tries, and the reboot time for a server is around 50 seconds.

Polling frequency
[Table: polling frequency (number of requests per min) versus number of failures in Temp and in Perm. Surviving values, in transcript order: 7124 1 811 2 30 5 12 10 15 213 254 20 1124 1023.]
If the polling frequency is low, the replication manager cannot respond in time, so requests are still sent to the failed server and cause failures in the system. When the polling frequency increases, the situation improves: the replication manager can respond in time and the number of failures is reduced.

Polling frequency
[Figure: number of failures versus polling frequency.]

Number of Replicas
[Table: number of replicas versus number of failures in Temp and number of failures in Perm. Surviving values, in transcript order: No replica 91 8152 2 356 3 4.]

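Load of Web Server
Load of the web server (%) | Number of failures in Temp | Number of failures in Perm
70 | (not given) | (not given)
75 | (not given) | (not given)
80 | 2 | 3
85 | 10 | 14
90 | 512 | 528
95 | 3214 | 3125
100 | 8792 | 8845
110 | 8997 | 8994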

Summary of Parameters
Number of tries = 2
Timeout period for retry in single server = 10 s
Timeout period for retry in our paradigm = 5 s
Polling frequency = 10 requests per min
Number of replicas = 3
Load of server < 75%

Petri-Net (Four identical replicas)

Petri-Net (N-version Web service with voting)

Reliability Model
[Figure 8: Markov chain model. (a) States S, S-1, S-2, ..., S-j, ..., S-n and F. (b) States P1 and P2. Transition labels: λ*, λN, λ1, λ2, μ*c2, μ1c2, μ2c2, (1-c1)μ1, (1-c1)μ2, (1-c2)μ1, (1-c2)μ2.]
C1 = P(RM responds in time); C2 = P(reboot succeeds)
We develop the reliability model of the proposed Web service paradigm as a Markov chain. The model is shown in Figure 8. In Figure 8(a), the state s represents the normal execution state of the system with n Web service replicas. When an error occurs and the primary Web service fails, the system either moves to another state s-j, which represents the system with n-j working replicas remaining, if the replication manager responds in time (probability c1), or goes to the failure state F otherwise. λ* denotes the error rate at which recovery cannot complete in this state, and c1 represents the probability that the replication manager responds in time to switch to another Web service. When the failed Web service is repaired, the system returns to the previous state, s-j+1. μ* denotes the rate at which successful recovery is performed in this state, and c2 represents the probability that the failed Web service server reboots successfully. If a Web service fails, the system switches to another Web service. When all Web services have failed, the system enters the F state. λn is the network failure rate. States s-1 to s-n represent the working states of the n Web service replicas, and the reliability model of each Web service is shown in Figure 8(b). Two types of failures are simulated in our experiments: P1, resource problem (server busy), and P2, entry point failure (server reboot). If a failure occurs in the Web service, either it is repaired at rate μ1 or μ2 with conditional probability c1, or the error cannot be recovered and the system moves to the next state s-j-1 with one fewer Web service replica available. If the replication manager cannot respond in time, the system goes to the failure state. From the graph, two formulas can be obtained:
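How such a chain yields a reliability curve can be illustrated numerically. The sketch below is a simplified stand-in, not the model or the formulas from the talk: it keeps only the system-level states (n down to 1 working replicas, plus an absorbing F), assumes a network failure sends any state to F, and integrates the Kolmogorov forward equations with explicit Euler steps. The rates λ*, μ*, λn and c1 are taken from the parameter table; the value used for c2 and the time unit are assumptions, and the function name `reliability` is hypothetical.

```python
from typing import List


def reliability(n: int, lam: float, mu: float, lam_net: float,
                c1: float, c2: float, t_end: float, dt: float = 0.01) -> float:
    """R(t_end) = 1 - P(F) for a simplified replica chain with absorbing F.

    State index i means n - i working replicas; index n is the failure state F.
    """
    F = n
    Q: List[List[float]] = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        nxt = i + 1                      # one fewer replica; equals F for the last one
        Q[i][nxt] += lam * c1            # failure, RM switches in time
        Q[i][F] += lam * (1 - c1)        # failure, RM too slow
        Q[i][F] += lam_net               # network failure (assumed to hit every state)
        if i > 0:
            Q[i][i - 1] += mu * c2       # successful reboot restores one replica
        Q[i][i] = -sum(Q[i][j] for j in range(n + 1) if j != i)
    p = [1.0] + [0.0] * n                # start with all n replicas working
    for _ in range(int(round(t_end / dt))):
        dp = [sum(p[i] * Q[i][j] for i in range(n + 1)) for j in range(n + 1)]
        p = [p[j] + dt * dp[j] for j in range(n + 1)]
    return 1.0 - p[F]


# lam, mu, lam_net, c1 from the parameter table; c2 = 0.9 is a placeholder.
params = dict(lam=0.025, mu=0.286, lam_net=0.02, c1=0.9, c2=0.9)
for replicas in (1, 3):
    curve = [round(reliability(replicas, t_end=t, **params), 4) for t in (0, 10, 50)]
    print(replicas, curve)
```

In this simplified chain, adding replicas lowers the rate at which the system reaches F, because most failures lead to a degraded working state instead of F.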

Reliability Model
ID | Description | Value
λn | Network failure rate | 0.02
λ* | Web service failure rate | 0.025
λ1 | Resource problem rate | 0.142
λ2 | Entry point failure rate | 0.150
μ* | Web service repair rate | 0.286
μ1 | Resource problem repair rate | 0.979
μ2 | Entry point failure repair rate | (not given)
C1 | Probability that the RM responds in time | 0.9
C2 | Probability that the server reboots successfully | (not given)

Outcome (SHARPE)

Conclusion
Surveyed replication and design diversity techniques for reliable services. Proposed a hybrid approach to improving the availability of Web services. Carried out a series of experiments to evaluate the availability and reliability of the proposed Web service system. Optimal parameters are obtained.
In the paper, we surveyed replication and design diversity techniques for reliable services and proposed a hybrid approach to improving the availability of Web services. Furthermore, we carried out a series of experiments to evaluate the availability and reliability of the proposed Web service system. From the experiments, we conclude that both temporal and spatial redundancy are important to improving the availability of the Web service. In the future, we plan to test the proposed scheme with a wider variety of fault injection scenarios. Moreover, we will also evaluate the schemes with design diversity techniques.