Reliability Week 11 - Lecture 2

What do we mean by reliability?
– Correctness: the system/application does what it has to do correctly.
– Availability: the system is available within the agreed time frame.
– Consistency: the system provides much the same response time on each occasion.

Service Level Agreement
Reliability and performance requirements are usually built into a Service Level Agreement (SLA). An SLA defines the level of service the organisation and the users can expect from the DIS. It is negotiated between the organisation and the service provider, be that the internal IT department or an outside body.
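As an illustration of how an availability clause in an SLA translates into numbers, the sketch below computes a downtime budget and a measured availability figure. The 99.9% target, the 30-day month and the 90 minutes of downtime are assumed values for the example only; the lecture does not give any figures.

```python
# Hypothetical SLA figures; none of these numbers come from the lecture.

def allowed_downtime_minutes(target_pct: float, period_hours: float) -> float:
    """Downtime budget, in minutes, for an availability target over a period."""
    return period_hours * 60 * (1 - target_pct / 100)


def measured_availability_pct(downtime_minutes: float, period_hours: float) -> float:
    """Availability actually achieved, as a percentage of the agreed period."""
    total_minutes = period_hours * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


hours_in_month = 30 * 24  # assumed: 30-day month, service agreed for 24x7

budget = allowed_downtime_minutes(99.9, hours_in_month)
achieved = measured_availability_pct(90, hours_in_month)

print(f"A 99.9% target allows about {budget:.1f} minutes of downtime per month")
print(f"90 minutes of downtime gives {achieved:.3f}% availability")
```

If the agreed time frame is business hours only, the same functions apply with a smaller `period_hours`.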

All components affect reliability
Any component can affect the reliability of the whole system, but each component can affect different aspects: correctness, availability and consistency.
We will look at:
– Application software
– System software (O/S, DBMS and middleware)
– Server hardware
– Network
– Storage
– Change management and problem management

Application Software
– A failure in application software can affect availability for a few, some or all customers.
– It is the main area for bugs, particularly if developed in-house or modified.
– It can affect correctness and consistency if changes to application software are not rigorously tested.

System software (DBMS, O/S, etc.)
– System software failures generally affect availability for all customers on a server.
– Operating at high utilisation (90-95% of capacity) can affect reliability.
– Parts of the system that are not often used can become active (e.g. queuing logic).

Server hardware
– Hardware failure will affect availability for all users on the server.
– One server supporting an application/database is a Single Point of Failure (to be avoided).
– Server problems can affect consistency (e.g. failure of one processor in a multi-processor server will affect performance).

Networks - LAN
– LAN failures will affect availability for a few or many users.
– Changes to routers, switches or cabling can affect availability.
– LAN component failures/changes generally affect availability and consistency.

Networks - WAN
– It is a purchased service, controlled by an external company.
– WAN failure will generally affect all users (e.g. an ISP failure will affect all access to the Internet).
– It requires: careful selection of supplier; sufficient capacity for peak loads; a carefully negotiated SLA; capable network management.

Planning for Reliability
– Managing problems and changes
– Planning for application and system software reliability
– Planning for hardware reliability
– Planning for disaster recovery

Managing Problems/Changes
The cause of every problem MUST be determined and then resolved (or it will simply return again and again to affect availability).
All application and system software changes MUST:
– be reviewed by a committee before implementation
– have been thoroughly tested
– have a back-out plan
– be APPROVED by all affected parties
– be implemented outside normal availability periods

Planning System Reliability
– Server selection and operating system must fit the scale of the operation.
– A regular system software update plan should be followed to fix bugs and implement new features.
– The update plan should be fully investigated, because an update may introduce new bugs, may cause problems for applications, or may introduce performance problems.

Planning Application Reliability
– Reliability starts in design: how the objects and components are packaged and how the interfaces are designed.
– Software package selection must place high weight on reliability factors (availability etc.).
– Implementations need formal processes: test plans, testing techniques and test scripts.

Planning for Hardware Reliability
– Build in redundancy and avoid single points of failure (even within hardware items).
– Use servers with multiple processors and hot-swap capability.
– Use server clusters if appropriate.
– Build redundancy and alternate routes into the network. The LAN can be controlled by the organisation.
– Disks have many mechanical parts and will fail often. Use RAID or redundancy whenever possible.
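The effect of redundancy on availability can be sketched with a little arithmetic. The example below assumes that component failures are independent and uses made-up availability figures (0.99 per component); neither assumption comes from the lecture, and real failures are often correlated.

```python
from math import prod

# Assumption: components fail independently of each other.
# The availability figures below are hypothetical.

def series_availability(parts):
    """Every part is needed, so the availabilities multiply."""
    return prod(parts)


def redundant_availability(parts):
    """The service survives while at least one of the redundant parts works."""
    return 1 - prod(1 - a for a in parts)


single_server = 0.99
cluster = redundant_availability([single_server, single_server])
chain = series_availability([0.99, 0.99])  # e.g. one server behind one network link

print(f"Single server:         {single_server:.4f}")
print(f"Two-server cluster:    {cluster:.4f}")
print(f"Server + network link: {chain:.4f}")
```

Under these assumptions a chain of single components is less available than any one of them, while a redundant pair is more available than either member, which is the argument for removing single points of failure.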

RAID
– Redundant Arrays of Independent Disks
– Groups of drives are linked to a special controller.
– They appear as a single logical drive.
– They take advantage of multiple physical drives to store data redundantly.
– There are six different RAID approaches, numbered 0 to 5.

RAID 0
– Data striping, block oriented
– No redundancy: no protection from disk loss
– Reads and writes for contiguous blocks overlap, giving improved performance
– No space overhead

RAID 1
– Disk mirroring: all data is written to two disks
– Full data protection
– Improved read access
– Doubles the disk space required
– Easy to implement, easy to recover

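RAID 5
– Data striping, block oriented, with distributed parity
– Full error protection against the loss of a single disk, but slower to recover than RAID 1
– Slow write, good read performance
– 25% overhead in disk space (one disk's worth of parity in a four-disk array)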
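The parity idea behind RAID 5 can be sketched in a few lines: the parity block is the XOR of the data blocks in a stripe, so any one missing block can be rebuilt from the others. This is a simplified illustration of a single stripe with made-up data; a real controller also rotates the parity block across the disks, which is not modelled here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_overhead(n_disks: int) -> float:
    """Fraction of total capacity used by parity in an n-disk RAID 5 array."""
    return 1 / n_disks


# One stripe spread over three data disks (hypothetical contents).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second disk, then rebuild it from the survivors and parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])                       # the lost block is recovered
print(f"{parity_overhead(4):.0%} overhead with four disks")
```

The rebuild has to read every surviving disk in the stripe, which is one reason recovery is slower than copying from a mirror.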

Planning for Business Continuance (or Disaster Recovery)
– Planning to continue business in the event of a disaster is a design job.
– Examples: 1993 and 9/11.
– Consider all scenarios, plan the recovery approach, test and document.
– Common causes are fires (Sydney), floods (Brisbane) or back-hoes.
– Test recovery regularly (every 3-6 months).

Performance Week 11 - Lecture 2

Why is Performance Important?
– DIS systems have potential for performance issues.
– New systems almost always require performance tuning.
– DIS performance affects user productivity.
– Performance is a measure of value for money.

A simple test
In most systems, what is likely to be the highest priority for users?
– Improved functionality
– Improved reliability
– Improved performance

Performance Measures
– Response time: the time taken to complete a task or transaction.
– Throughput: the amount of work (transactions) that can be completed in a set time period (second or hour).
– The relationship between the two is generally inverse (although not always).
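Both measures can be derived from the same transaction log. The sketch below assumes a hypothetical log of (start, end) timestamps in seconds; the log format and the figures are invented for illustration.

```python
from statistics import mean

def response_times(transactions):
    """Elapsed time of each transaction, in seconds."""
    return [end - start for start, end in transactions]


def throughput_per_second(transactions):
    """Completed transactions divided by the length of the observation window."""
    first_start = min(start for start, _ in transactions)
    last_end = max(end for _, end in transactions)
    return len(transactions) / (last_end - first_start)


# Hypothetical log: (start_second, end_second) for four transactions.
log = [(0.0, 0.5), (0.2, 1.0), (1.0, 1.4), (1.5, 2.0)]

print(f"Mean response time: {mean(response_times(log)):.2f} s")
print(f"Throughput:         {throughput_per_second(log):.1f} transactions/s")
```

Note that the first two transactions overlap in time, so throughput is not simply the reciprocal of the mean response time.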

Concurrency is the answer
[Figure: slow response time with high throughput contrasted with fast response time with low throughput, plotted against time]
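One standard way to see why concurrency matters is Little's Law, which relates the average number of transactions in progress (N), throughput (X) and response time (R) as N = X × R. The lecture does not name this law; it is brought in here only as an illustration, and the figures are invented.

```python
def throughput(concurrent_transactions: float, response_time_s: float) -> float:
    """Little's Law rearranged: X = N / R, in transactions per second."""
    return concurrent_transactions / response_time_s


# Hypothetical figures for two ways of running the same workload.
one_at_a_time = throughput(1, 0.5)   # fast response, no concurrency
many_at_once = throughput(20, 2.0)   # slower response, 20 transactions in progress

print(f"Fast response, 1 in progress:  {one_at_a_time:.0f} transactions/s")
print(f"Slow response, 20 in progress: {many_at_once:.0f} transactions/s")
```

With these assumed numbers the concurrent system gives each user a slower response but completes five times as much work per second.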

A user requires consistency, then speed.
– A user wants a transaction to run consistently. The faster, the better.
– A user sees response time at the PC or terminal.
– A user is not concerned with the entire infrastructure that supports a transaction.
– IT staff see response time only in their domain of responsibility (server, database, network etc.).

Difficult to measure total response time
– How do you add together web server + application server + database server + network?
– Do you get statistics from each group?
– Will each group maintain statistics in the same format?
– You need to measure total response time and the response time in each area (server, database etc.).
– New network monitors may be able to provide statistics closer to what you need.
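A minimal sketch of combining per-area statistics is shown below. It assumes each group reports one timing per transaction but in its own unit, and that the areas are traversed one after another so the timings can simply be added; the tier names follow the slide, while the figures and the reporting format are invented.

```python
# Hypothetical readings for one transaction: (value, unit) as each group reports it.
readings = {
    "web server": (40, "ms"),
    "application server": (0.12, "s"),
    "database server": (200, "ms"),
    "network": (0.09, "s"),
}

UNIT_TO_MS = {"ms": 1, "s": 1000}


def to_milliseconds(tier_readings):
    """Convert every reading to a common unit before combining them."""
    return {tier: value * UNIT_TO_MS[unit] for tier, (value, unit) in tier_readings.items()}


def total_and_shares(tier_ms):
    """Total response time and each tier's share of it (assumes no overlap between tiers)."""
    total = sum(tier_ms.values())
    return total, {tier: ms / total for tier, ms in tier_ms.items()}


total_ms, shares = total_and_shares(to_milliseconds(readings))
print(f"Total response time: {total_ms:.0f} ms")
for tier, share in shares.items():
    print(f"  {tier}: {share:.0%}")
```

If tiers work in parallel or the groups sample different transactions, a simple sum like this will not match what the user sees, which is why an end-to-end measurement is still needed.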

Improving performance
– You can add more resources (faster servers, faster disks, faster networks etc.) to improve response time and throughput.
– However, performance improvements may not be proportional to the additional resources.
– A 100% increase in resources may only bring, say, a 70% performance improvement.
– This is the question of scalability.
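The 100%/70% example can be expressed as a scaling efficiency. The sketch below uses the lecture's illustrative percentages; the baseline of 100 transactions per second and the function names are assumptions made for the example.

```python
def scaling_efficiency(resource_increase_pct: float, performance_increase_pct: float) -> float:
    """Performance gained per unit of resource added (1.0 would be perfectly proportional)."""
    return performance_increase_pct / resource_increase_pct


def projected_throughput(base_tps: float, resource_increase_pct: float, efficiency: float) -> float:
    """Throughput after adding resources, assuming the efficiency stays constant."""
    return base_tps * (1 + resource_increase_pct / 100 * efficiency)


efficiency = scaling_efficiency(100, 70)              # the lecture's example figures
after = projected_throughput(100, 100, efficiency)    # assumed baseline: 100 transactions/s

print(f"Scaling efficiency: {efficiency:.2f}")
print(f"Throughput after doubling resources: {after:.0f} transactions/s")
```

In practice the efficiency usually falls as more resources are added, so a constant value is only a first approximation.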

Monitoring Performance
– Performance is a process, not a task.
– Performance should be constantly monitored.
– The cost of monitoring must be weighed against the cost of doing nothing.
– Performance tuning should be carried out to correct performance problems.
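A monitoring check can be as simple as summarising recent response times and comparing them with a target. The sketch below is one possible approach, not a prescribed method: the samples, the one-second target and the choice of the 95th percentile are all assumptions, and the standard deviation is included as a rough indicator of consistency.

```python
import math
from statistics import mean, pstdev

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_response_times(samples, target_seconds, pct=95):
    """Summarise the samples and flag a breach of the (assumed) response-time target."""
    observed = percentile(samples, pct)
    return {
        "mean": mean(samples),
        "spread": pstdev(samples),   # larger spread means less consistent response
        "percentile": observed,
        "breach": observed > target_seconds,
    }


# Hypothetical samples in seconds: mostly steady, with one slow transaction.
samples = [0.4, 0.5, 0.5, 0.6, 0.5, 0.4, 0.6, 0.5, 0.5, 3.0]
report = check_response_times(samples, target_seconds=1.0)
print(report)
```

The mean alone would hide the single slow transaction here, whereas the percentile and the spread both expose it.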
