1
Aviation reliability: Programs & calculation (Chapter 19 text)
2
Reliability Definition (in statistical terms): 'the probability of failure-free operation of an item in a specified environment for a specified amount of time'. Examples: If eight delays and cancellations are experienced in 200 flights, then 96% of the airline's flights were dispatched on time. Effective February 15, 2007, the FAA ruled that US-registered ETOPS-207 operators can fly over most of the world provided that the IFSD (in-flight shutdown) rate is no more than 1 per 100,000 engine hours. This limit is more stringent than ETOPS-180 (2 per 100,000 engine hours).
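A minimal sketch (not part of the original slides) showing how the two figures quoted above are computed; the function names and structure are illustrative assumptions only.

```python
# Illustrative only: dispatch reliability and IFSD rate, using the example
# figures quoted on this slide.

def dispatch_reliability(flights: int, delays_and_cancellations: int) -> float:
    """Fraction of scheduled flights dispatched on time."""
    return (flights - delays_and_cancellations) / flights

def ifsd_rate_per_100k(shutdowns: int, engine_hours: float) -> float:
    """In-flight shutdowns per 100,000 engine hours."""
    return shutdowns / engine_hours * 100_000

print(dispatch_reliability(200, 8))        # 0.96 -> 96% on-time dispatch
print(ifsd_rate_per_100k(1, 100_000))      # 1.0, the ETOPS beyond-180 limit
print(ifsd_rate_per_100k(2, 100_000))      # 2.0, the ETOPS-180 limit
```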
3
Two main approaches to reliability in the aviation industry: The first approach is overall airline reliability, which essentially means dispatch reliability, that is, how often the airline achieves an on-time departure of its scheduled flights. The reasons for delay are categorized as maintenance, procedures, personnel, flight operations, air traffic control (ATC), etc. The second approach is to treat reliability as a set of programs specifically designed to address maintenance problems, whether or not they cause delays, and to provide analysis of and corrective actions for those items so as to improve the overall reliability of the equipment. This contributes to dispatch reliability as well as to the overall operation.
4
Reliability Program (for maintenance): A set of rules and practices for managing and controlling a maintenance program. The main function is to monitor the performance of the vehicles and their associated equipment and call attention to any need for corrective action. Additional functions: monitor the effectiveness of those corrective actions; provide data to justify adjusting the maintenance interval or maintenance program procedure as appropriate.
5
Maintenance programs have four types of reliability: statistical reliability, historical reliability, event-oriented reliability, and dispatch reliability.
6
Statistical reliability: based upon the collection and analysis of 'events' such as failure, removal, and repair rates of systems or components.
7
Historical reliability: comparison of current event rates with those of past experience. Commonly used when new equipment is introduced and no established statistics are available.
8
Event-oriented reliability: events such as bird strikes, hard landings, in-flight shutdowns (IFSDs), lightning strikes, or other incidents that do not occur on a regular basis and therefore produce no usable statistical or historical data. For ETOPS, the FAA designated certain events to be tracked under an 'event-oriented reliability program'. Each occurrence of these events must be investigated to determine the cause and prevent recurrence. IFSD causes include, for example: flameout, internal failure, crew-initiated shutdown, foreign object ingestion, icing, and inability to obtain and/or control desired thrust.
9
Dispatch reliability: measurement of an airline's operation with respect to on-time departure. It receives considerable attention from regulatory authorities (e.g. the FAA), airlines, and passengers. It is really just a special form of the event-oriented reliability approach.
10
Danger of misinterpreted reliability data (1): A pilot experiences a rudder control problem two hours before arriving at an airport. He writes up the problem in the aircraft logbook and reports it by radio to the flight operations unit at the airport. Upon arrival, the maintenance crew check the log, find the write-up, and begin troubleshooting. The repair takes a little longer than the scheduled turnaround time and causes a delay. Since maintenance is at work and the rudder is the problem, the delay is charged to the maintenance department. However, if the flight operations unit had passed the problem to maintenance two hours before landing, the maintenance people could have spent that time on troubleshooting analysis and the delay could have been prevented. In other words, a change in airline procedure could have avoided the delay. A good reliability program should prevent the same delay in the future by altering the procedure, regardless of who or what is to blame.
11
Danger of misinterpreted reliability data (2): If there were 12 write-ups of rudder problems during the month and only one of them caused a delay, there are actually two problems to investigate: 1. the delay, which may or may not have been caused by the rudder problems; 2. the 12 rudder write-ups, which may in fact point to an underlying maintenance problem. The dispatch delay constitutes one problem and the rudder system malfunction constitutes another. They may overlap, but they are two different problems. The delay is an event-oriented reliability issue that must be investigated on its own; the 12 rudder problems should be addressed separately as a statistical (or historical) reliability problem.
12
Elements of a Reliability Program: 1. Data collection; 2. Problem area alerting; 3. Data display; 4. Data analysis; 5. Corrective actions; 6. Follow-up analysis; 7. Monthly report.
13
Data Collection: allows the operator to compare present performance with the past. Typical data types are: 1. flight time and cycles for each aircraft; 2. cancellations and delays over 15 minutes; 3. unscheduled component removals; 4. unscheduled engine removals; 5. in-flight shutdowns of engines; 6. pilot reports or logbook write-ups; 7. cabin logbook write-ups; 8. component failures (shop maintenance); 9. maintenance check package findings; 10. critical failures.
14
Problem detection: alerting systems. Alerting systems quickly identify areas where performance is significantly different from normal, so that possible problems can be investigated. Standards (alert levels) for event rates are set according to past performance.
15
Problem detection 2: setting & adjusting alert levels. Alert levels are recalculated periodically (typically yearly), and false alarms are filtered out.
16
Quality Control Charts and the Seven Run Rule: A control chart is a graphic display of data that illustrates the results of a process over time. It helps prevent defects and allows you to determine whether a process is in control or out of control. The seven run rule states that if seven data points in a row are all below the mean, all above the mean, all increasing, or all decreasing, then the process needs to be examined for non-random problems.
17
Figure: control chart of a 12" ruler (production process example).
18
Control Chart (continued): The output of a production process will fluctuate. The causes of fluctuation can be purely random, or non-random due to a desirable or undesirable process change. Control charts graph and measure process data against control limits, and can distinguish random variation from assignable (non-random) causes. We cannot adjust random variation out of a process; process adjustments for random variation are neither necessary nor desirable. That is over-adjustment, or tampering, and it makes the process worse. We can and must investigate assignable (non-random) causes. Points outside the control limits are evidence of process problems. The analyst must investigate every out-of-control point for an assignable cause, and must record the findings and any corrective actions. For example, a tool adjustment, a change in Formal Technical Review format, or replacing worn tooling may correct the problem.
19
Pattern analysis of a Control Chart: the 7-run rule. The 7-run rule is used to filter out the random variation in a production process and to show the 'trends' caused by assignable (non-random) causes that require investigation and possibly corrective action. 7-run-rule patterns: seven points above the mean value; seven points below the mean value; seven points all increasing; or seven points all decreasing. These patterns are indicators of non-random problems, which can be a symptom of a process that is out of control (see the sketch below).
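A minimal sketch (not from the slides) of a seven-run-rule check; the function name, window logic, and example data are illustrative assumptions.

```python
# Flag any run of seven consecutive points that are all above the mean,
# all below it, all increasing, or all decreasing.
from typing import List

def seven_run_violations(values: List[float], run: int = 7) -> List[str]:
    mean = sum(values) / len(values)
    findings = []
    for i in range(len(values) - run + 1):
        window = values[i:i + run]
        if all(v > mean for v in window):
            findings.append(f"points {i}-{i+run-1}: all above the mean")
        elif all(v < mean for v in window):
            findings.append(f"points {i}-{i+run-1}: all below the mean")
        if all(window[j] < window[j+1] for j in range(run - 1)):
            findings.append(f"points {i}-{i+run-1}: strictly increasing")
        elif all(window[j] > window[j+1] for j in range(run - 1)):
            findings.append(f"points {i}-{i+run-1}: strictly decreasing")
    return findings

# Example: a gradual upward drift trips the rule even though no single
# point is outside the control limits.
print(seven_run_violations([10, 10.2, 10.3, 10.5, 10.6, 10.8, 11.0, 11.1]))
```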
20
To develop a Control Chart to determine project stability: Plot the individual metric values on a chart. Compute the mean of the metric values and plot that line. Plot the Upper Control Limit (UCL) and Lower Control Limit (LCL). Compute a standard deviation as (UCL - mean)/3. Plot lines one and two standard deviations above and below the mean. If any of these standard-deviation lines falls below 0.0, it need not be plotted unless the metric being evaluated can take values less than 0.0. The standard-deviation lines are then drawn on the control chart (a sketch of these steps follows).
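A hedged sketch following the steps above; the function name, the assumed UCL/LCL values, and the example metric values are illustrative only.

```python
# Given metric values and assumed control limits, derive the standard
# deviation as (UCL - mean) / 3 and report the 1-sigma and 2-sigma bands.
def control_chart_lines(values, ucl, lcl):
    mean = sum(values) / len(values)
    sigma = (ucl - mean) / 3          # step from the slide: sigma = (UCL - mean) / 3
    bands = {
        "mean": mean,
        "UCL": ucl, "LCL": lcl,
        "+1s": mean + sigma, "-1s": mean - sigma,
        "+2s": mean + 2 * sigma, "-2s": mean - 2 * sigma,
    }
    # Per the slide, omit any band that falls below 0.0 when the metric
    # cannot be negative.
    return {k: v for k, v in bands.items() if v >= 0.0}

print(control_chart_lines([1.2, 0.9, 1.1, 1.4, 1.0], ucl=1.8, lcl=0.4))
```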
21
Other pattern analysis of a Control Chart: the 7-run rule is just one method of filtering false alarms. Other pattern-analysis methods include, but are not limited to: a metric value lying outside the UCL or LCL; 2 out of 3 successive metric values lying more than 2 standard deviations away from the mean; 4 out of 5 successive metric values lying more than 1 standard deviation away from the mean; and others.
22
Reliability: Basic Calculation & Application
23
CHARACTERIZING FAILURE OCCURRENCES IN TIME. Four general ways: 1. time of failure; 2. time interval between failures; 3. cumulative failures experienced up to a given time; 4. failures experienced in a time interval.
24
Time-based failure specification

Failure number   Failure time (sec)   Failure interval (sec)
1                10                   10
2                19                   9
3                32                   13
4                43                   11
25
Failure-based failure specification

Time (sec)   Cumulative failures   Failures in interval
30           2                     2
60           5                     3
90           7                     2
120          8                     1
26
TABLE 3: Typical probability distribution of failures

Failures in time period   Probability   Value x probability
0                         0.10          0
1                         0.18          0.18
2                         0.22          0.44
3                         0.16          0.48
4                         0.11          0.44
5                         0.08          0.40
6                         0.05          0.30
7                         0.04          0.28
8                         0.03          0.24
9                         0.02          0.18
10                        0.01          0.10
Mean failures: 3.04
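A quick check (not from the text) that the mean of the Table 3 distribution is 3.04 failures; the variable names are illustrative.

```python
# Expected value of the discrete failure distribution in Table 3.
probs = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]
mean_failures = sum(k * p for k, p in enumerate(probs))
print(round(mean_failures, 2))   # 3.04
```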
27
TIME VARIATION. Mean value function: represents the average cumulative failures associated with each time point. Failure intensity function: the rate of change of the mean value function (its derivative with respect to time), i.e. the number of failures per unit time.
28
Probability distributions at times tA and tB

Failures in time period   Probability (elapsed time tA = 1 hr)   Probability (elapsed time tB = 5 hr)
0                         0.10                                   0.01
1                         0.18                                   0.02
2                         0.22                                   0.03
3                         0.16                                   0.04
4                         0.11                                   0.05
5                         0.08                                   0.07
6                         0.05                                   0.09
7                         0.04                                   0.12
8                         0.03                                   0.16
9                         0.02                                   0.13
10                        0.01                                   0.10
11                        0                                      0.07
12                        0                                      0.05
13                        0                                      0.03
14                        0                                      0.02
15                        0                                      0.01
Mean failures: 3.04 (at tA) and 7.77 (at tB)
29
Figure: the mean value function (mean failures versus time) and the failure intensity function (failures/hr versus time), with the elapsed times tA = 1 hr and tB = 5 hr marked.
30
f(x): probability density function; F(x): cumulative distribution function. Example: when we toss three coins there are eight possible outcomes. If we let x be the number of coins that land head side up (H = head side up, T = tail side up):

Event   HHH   HHT   HTH   THH   HTT   THT   TTH   TTT
x       3     2     2     2     1     1     1     0

x       0     1     2     3
f(x)    1/8   3/8   3/8   1/8
F(x)    1/8   1/2   7/8   1
31
DISCRETE FAILURE FUNCTION: f(t), the failure density function over a time interval [t1, t2], is defined as the ratio of the number of failures occurring in the interval to the size of the original population, divided by the length of the time interval:

f(t) = [n(t1) - n(t2)] / [N x (t2 - t1)]

where N is the size of the original population and n(t) is the number of survivors at time t. f(t) measures the overall speed at which failures are occurring.
32
DISCRETE HAZARD FUNCTION: Z(t), the failure rate or hazard function, is the probability that a failure occurs in some time interval [t1, t2], given that the system has survived up to time t1. It is the ratio of the number of failures occurring in the interval to the number of survivors at the start of the interval, divided by the length of the time interval:

Z(t) = [n(t1) - n(t2)] / [n(t1) x (t2 - t1)]

Z(t) measures the instantaneous speed of failure.
33
FAILURE CURVES (figure not reproduced).
34
PROBABILITY OF SUCCESS: F(t) is the probability of failure (the cumulative distribution); R(t) is the probability of success (the reliability); F(t) + R(t) = 1.
35
DISCRETE FUNCTION EXAMPLE. Table (data not reproduced): failure data for 10 hypothetical electrical components, listed as failure number versus operating time in hours. The worked calculations on the next slide use the first three failures, which occur at 8, 20, and 34 h.
36
Worked example, using f(t) = [n(t1) - n(t2)] / [N x (t2 - t1)] and Z(t) = [n(t1) - n(t2)] / [n(t1) x (t2 - t1)]:

Interval t1-t2 (h)   Failure density f(t)/hr                  Hazard rate Z(t)/hr                     F(t)   R(t)   MTTF           Overall MTTF
0-8                  (10-9)/10/(8-0)  = 1/10/8  = 0.0125      (10-9)/10/(8-0) = 1/10/8  = 0.0125      1/10   9/10   8 x 10 = 80    8/1 x 10 = 80
8-20                 (9-8)/10/(20-8)  = 1/10/12 = 0.0083      (9-8)/9/(20-8)  = 1/9/12  = 0.0093      2/10   8/10   12 x 9 = 108   20/2 x 10 = 100
20-34                (8-7)/10/(34-20) = 1/10/14 = 0.0071      (8-7)/8/(34-20) = 1/8/14  = 0.0089      3/10   7/10   14 x 8 = 112   34/3 x 10 = 113
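A hedged sketch (not from the text) that recomputes the f(t), Z(t), F(t), and R(t) columns above from the observed failure times; the variable names and layout are illustrative.

```python
# Recompute the worked example for the 10 hypothetical components.
failure_times = [8, 20, 34]      # first three failures, in hours
N = 10                           # original population size

t_prev, survivors = 0, N
for t in failure_times:
    dt = t - t_prev
    f = 1 / (N * dt)             # failure density over the interval
    z = 1 / (survivors * dt)     # hazard rate: failures / survivors at interval start
    survivors -= 1
    F = (N - survivors) / N      # cumulative probability of failure
    R = 1 - F                    # reliability
    print(f"{t_prev:>2}-{t:<2} h  f={f:.4f}/h  Z={z:.4f}/h  F={F:.1f}  R={R:.1f}")
    t_prev = t
```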
37
Achieving a reliable system (ref. Ian Sommerville, Software Engineering 7e, Ch. 20). Three basic strategies to achieve reliability: Fault Avoidance: build fault-free systems from the start. Fault Tolerance: build facilities into the system so that it can continue operating even when faults occur. Fault Detection: use software validation techniques to discover faults before the system is put into operation. For most systems, fault avoidance and fault detection suffice to provide the required level of reliability.
38
Implementing Fault Avoidance: availability of a formal and unambiguous system specification; adoption of a quality philosophy by developers. Developers should be expected to write bug-free systems …
39
Implementing Fault Tolerance: even if we could somehow build a fault-free system, we would still need fault tolerance in critical systems. Fault-free does not mean failure-free: fault-free means that the system correctly meets its specification, but the specification may be incomplete, faulty, or unaware of some requirement of the environment. We can never conclusively prove that a system is fault-free.
40
Aspects of Fault Tolerance: Failure Detection: the system must be able to detect that its current state has caused, or will cause, a failure. Damage Assessment: the system must detect what damage the failure has caused. Fault Recovery: the system must change its state to a known "safe" state; it can correct the damaged state (forward error recovery, harder) or restore a previous known "safe" state (backward error recovery, easier). Fault Repair: modifying the system so that the failure does not recur. Many software failures are transient and need no repair; normal processing can resume after fault recovery.
41
Implementing Fault Tolerance in Hardware: Triple-Modular Redundancy (TMR). The hardware unit is replicated three (or more) times and the outputs of the units are compared; if one unit fails, its output is ignored. The Space Shuttle is a classic example. (Diagram: Machine 1, Machine 2, and Machine 3 feeding an output comparator.)
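An illustrative majority-voter sketch for TMR (not from the slides); the function names, the injected fault, and the example units are assumptions for demonstration only.

```python
# Run three replicated units and take the value at least two of them agree on.
from collections import Counter
from typing import Callable, Sequence

def tmr_vote(units: Sequence[Callable[[float], float]], x: float) -> float:
    outputs = [u(x) for u in units]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: all three units disagree")
    return value   # a single faulty unit is simply outvoted

# Example: unit_b is faulty, but the voter still returns the correct result.
unit_a = lambda x: x * 2
unit_b = lambda x: x * 2 + 1   # fault injected for illustration
unit_c = lambda x: x * 2
print(tmr_vote([unit_a, unit_b, unit_c], 21))   # 42
```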
42
Implementing Fault Tolerance (2), using software: N-version programming. Have multiple teams build different versions of the software and then execute them in parallel. This assumes the teams are unlikely to make the same mistakes, which is not necessarily a valid assumption if the teams all work from the same specification …
43
N-Version Programming: a commonly used approach in railway signaling, aircraft systems, and reactor protection systems.
44
System Configuration for a Failure Event Diagram: Divide the system into a hierarchical set of components whose reliabilities are known, or are easy to estimate or measure. Each component is represented as a switch: if the component is functioning the switch is viewed as CLOSED, and if not functioning, as OPEN. System success occurs if there is a continuous path through the configuration. The components are described as combinations of two types, AND and OR configurations, with failures assumed independent. The reliability relationship between the components is expressed with a failure event diagram.
45
EVENT DIAGRAM (figure): event diagram for an AND-OR configuration, with components A, B, and C in parallel (OR) followed in series (AND) by D and E.
46
EVENT EXPRESSION: Rs = (A + B + C) * D * E, where Rs = reliability of the system = probability of system success. Here '+' denotes the OR of the component success events and '*' denotes AND (see the truth table below).
47
TRUTH TABLE

A   B   A+B [OR]   A*B [AND]
1   1   1          1
1   0   1          0
0   1   1          0
0   0   0          0
48
.AND. CONFIGURATION: Rs = R1 x R2, where R1 and R2 are the reliabilities of components C1 and C2. For n components arranged in logical .AND.: Rs = R1 x R2 x ... x Rn. (Diagram: R1 and R2 in series.)
49
.OR. CONFIGURATION: Rs = reliability of the system; R1, R2, ... Rn = reliabilities of components 1, 2, ... n. In this case it is easier to work with the probability of failure F: Fs = (1 - Rs); F1 = (1 - R1); F2 = (1 - R2); Fs = F1 * F2 = (1 - R1) * (1 - R2); so Rs = 1 - Fs = 1 - [(1 - R1) * (1 - R2) * ... * (1 - Rn)] for n components arranged in logical .OR. (Diagram: R1 and R2 in parallel.) A short sketch combining the .AND. and .OR. rules follows.
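A hedged sketch (not in the original) of the series (.AND.) and parallel (.OR.) reliability rules, applied to the A/B/C/D/E event diagram above; the component reliability values are illustrative assumptions.

```python
from functools import reduce

def r_and(*r):
    """All components must work: Rs = R1 * R2 * ... * Rn."""
    return reduce(lambda a, b: a * b, r, 1.0)

def r_or(*r):
    """At least one component must work: Rs = 1 - (1-R1)(1-R2)...(1-Rn)."""
    return 1.0 - reduce(lambda a, b: a * (1.0 - b), r, 1.0)

# R_S = (A + B + C) * D * E with illustrative component reliabilities.
A = B = C = 0.90
D = E = 0.99
print(r_and(r_or(A, B, C), D, E))   # about 0.979
```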
50
Reliability Acronyms: MTBF - Mean Time Between Failures; MTTF - Mean Time To Failure; MTTR - Mean Time To Repair; MTBF = MTTF + MTTR. Many people consider MTBF to be far more useful than measuring the fault rate per LOC (line of code).
51
Difference between Reliability and Availability: reliability means no failures in the interval, say, 0 to t; availability means only that the system is up at time t.
52
AVAILABILITY: A(T) = Time(up) / [Time(up) + Time(down)]
53
AVAILABILITY: A(T) = Time(up) / [Time(up) + Time(down)] = MTTF / (MTTF + MTTR) = MTTF / MTBF, where MTTF = Mean Time To Fail, MTTR = Mean Time To Repair, and MTBF = MTTF + MTTR = Mean Time Between Failures.
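A tiny illustrative helper (not from the text) applying the availability formula; the MTTF and MTTR figures are assumed for the example.

```python
# Availability from MTTF and MTTR.
def availability(mttf_hours: float, mttr_hours: float) -> float:
    return mttf_hours / (mttf_hours + mttr_hours)

# Example with assumed figures: 500 h between failures, 2 h average repair.
print(availability(500, 2))   # ~0.996, i.e. about 99.6% availability
```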
54
Failure rate and reliability during the random failure period: Rs = e^(-t/m) = e^(-λt), where Rs = probability of failure-free operation for a period >= t; e = 2.718; t = specified period of failure-free operation; m = mean time between failures (MTBF); λ = failure rate = 1/m.
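A hedged sketch (not from the slides) of the exponential reliability formula above; the function name and example figures are illustrative. The second line ties directly to the 37% point made two slides below.

```python
# R(t) = exp(-t / MTBF) = exp(-lambda * t) during the random-failure period.
import math

def reliability(t_hours: float, mtbf_hours: float) -> float:
    return math.exp(-t_hours / mtbf_hours)

print(reliability(100, 1000))    # ~0.905: 90.5% chance of 100 failure-free hours
print(reliability(1000, 1000))   # ~0.368: only ~37% chance of surviving one full MTBF
```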
55
SYSTEM CONFIGURATION: Suppose we have Qp components with constant failure intensities, whose reliabilities are measured over a common time period, and assume that all must function successfully for system success. Then the system failure intensity is the sum of the component failure intensities:

λ_system = λ_1 + λ_2 + ... + λ_Qp

where the λ_k are the component failure intensities.
56
Confusion surrounds MTBF: if the failure rate is constant, the probability that a product will operate without failure for a time equal to or greater than its MTBF is only 37%, since R(MTBF) = e^(-1) ≈ 0.37. This follows from the exponential distribution and is contrary to the intuitive feeling that there is a 50-50 chance of exceeding the MTBF.
57
Does a computer hard disk with an MTBF rating of 1 million hours mean that the average unit will run for about 114 years before it fails? Hard disks have not even existed for 114 years, so how could we know that one can be used for that long? The MTBF of a drive is obtained by running a large quantity of drives and multiplying the number of units by the running hours per failure observed in the batch. For example, when a disk manufacturer batch-tested 1500 units of a hard disk and achieved an average of 30 days of operation between individual unit failures, the MTBF of the disk is 1500 x 30 x 24 hours ≈ 1 million hours (for the period of the test).