Most people have some concept of reliability from everyday life. For example, people may discuss how reliable their washing machine has been over the time they have owned it. Similarly, a car that rarely needs to go to the garage for repairs during its lifetime would be said to have been reliable. Reliability can be described as quality over time. Quality is associated with workmanship and manufacturing, so if a product does not work, or breaks as soon as you buy it, you would consider it to be of poor quality. If, however, parts of the product wear out over time sooner than you expect, this would be termed poor reliability. The difference between quality and reliability is therefore one of time, and more specifically of product lifetime.
Reliability is associated with unexpected failures of products or services, and understanding why these failures occur is key to improving reliability. The main reasons why failures occur include:
• The product is not fit for purpose or, more specifically, the design is inherently incapable.
• The item may be overstressed in some way.
• Failures can be caused by wear-out.
• Failures might be caused by variation.
• Wrong specifications may cause failures.
• Misuse of the item may cause failure.
The objectives of reliability engineering are:
• To apply engineering knowledge to prevent or reduce the likelihood or frequency of failures;
• To identify and correct the causes of failures that do occur;
• To determine ways of coping with failures that do occur;
• To apply methods for estimating the likely reliability of new designs, and for analysing reliability data.
• Reliability is a measure of uncertainty, so estimating reliability means using statistics and probability theory.
• Reliability is quality over time.
• Reliability must be designed into a product or service.
• The most important aspect of reliability is to identify the causes of failure and eliminate them in design if possible; otherwise, identify ways of accommodating them.
• Reliability is defined as the ability of an item to perform a required function without failure under stated conditions for a stated period of time.
• The costs of unreliability can be damaging to a company.
The bath-tub curve is a representation of the reliability performance of components or non-repaired items. It describes the reliability performance of a large sample of homogeneous items entering the field at some start time (usually zero). If we observe the items over their lifetime without replacement, then we can observe three distinct shapes or periods. The infant mortality or early failures portion shows that the population will initially experience a high hazard function that starts to decrease. This is also known as the burn-in or debugging period.
After the initial phase, when the weak components have been weeded out and mistakes corrected, the remaining population reaches a relatively constant hazard function period, known as the useful life period. The final portion of the bath-tub curve is called the wear-out phase; this is when the hazard function increases with time.
Three continuous life distributions are commonly used in reliability engineering: the exponential, Weibull and lognormal distributions.
The Exponential Distribution
When an item is subject to failures that occur at random intervals, and the expected number of failures is the same for equally long periods of time, the distribution of failures is said to fit an exponential distribution. With rate parameter λ, the PDF, CDF and survival function are given as:
f(t) = λ e^{-λt}
F(t) = 1 - e^{-λt}
R(t) = e^{-λt}
Notice that the hazard function is not a function of time; it is a constant equal to λ. For repaired items, λ is called the failure rate and 1/λ is called the mean time between failures (MTBF), sometimes denoted θ. An important point to note is that 63.2% of items will have failed by time t = θ, since 1 - e^{-1} ≈ 0.632. The failure rate can be estimated as the total number of failures divided by the total operating time. The exponential distribution is the most commonly used distribution in reliability engineering and models the useful life portion of the bath-tub curve.
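The relationships above can be checked numerically. The following is a minimal Python sketch, not a definitive implementation: the function names and the field data (5 failures in 10,000 operating hours) are hypothetical and chosen only for illustration.

```python
import math

def exponential_reliability(t, failure_rate):
    """Survival function R(t) = exp(-lambda * t) for a constant hazard."""
    return math.exp(-failure_rate * t)

def exponential_cdf(t, failure_rate):
    """Fraction of items expected to have failed by time t."""
    return 1.0 - exponential_reliability(t, failure_rate)

def estimate_failure_rate(total_failures, total_operating_time):
    """Point estimate: total failures divided by total operating time."""
    return total_failures / total_operating_time

# Hypothetical field data: 5 failures observed over 10,000 operating hours.
lam = estimate_failure_rate(5, 10_000.0)   # failures per hour
mtbf = 1.0 / lam                            # theta, in hours
failed_by_mtbf = exponential_cdf(mtbf, lam) # 1 - e^-1, about 0.632

print(f"failure rate = {lam} per hour, MTBF = {mtbf} hours")
print(f"fraction failed by t = MTBF: {failed_by_mtbf:.3f}")
```

Note that the fraction failed at t = θ does not depend on the value of λ, which is why the 63.2% figure holds for any exponentially distributed item.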
The Weibull Distribution
This distribution allows for a non-constant hazard function. The survival function is given by:
R(t) = exp[-(t/η)^β]
where β is the shape parameter and η is the scale parameter or characteristic life. The characteristic life is the life at which 63.2% of the population will have failed.
• When β = 1, the hazard function is constant and the data can therefore be modelled by an exponential distribution with η = 1/λ.
• When β < 1, we get a decreasing hazard function.
• When β > 1, we get an increasing hazard function.
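The effect of the shape parameter can be illustrated with a short Python sketch. It assumes the standard two-parameter Weibull hazard function h(t) = (β/η)(t/η)^(β-1), which is not written out in the text above; the function names and the value η = 1000 hours are hypothetical.

```python
import math

def weibull_reliability(t, beta, eta):
    """Two-parameter Weibull survival function R(t) = exp(-(t/eta)^beta)."""
    return math.exp(-((t / eta) ** beta))

def weibull_hazard(t, beta, eta):
    """Weibull hazard h(t) = (beta/eta) * (t/eta)^(beta - 1), for t > 0."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 1000.0  # hypothetical characteristic life, in hours

for beta in (0.5, 1.0, 3.0):
    early = weibull_hazard(100.0, beta, eta)
    late = weibull_hazard(900.0, beta, eta)
    if math.isclose(early, late):
        trend = "constant"
    elif late < early:
        trend = "decreasing"
    else:
        trend = "increasing"
    # At t = eta the fraction failed is 1 - e^-1 regardless of beta.
    failed_at_eta = 1.0 - weibull_reliability(eta, beta, eta)
    print(f"beta = {beta}: hazard is {trend}, "
          f"fraction failed at eta = {failed_at_eta:.3f}")
```

The three shape values correspond to the three regions of the bath-tub curve: early failures, useful life and wear-out.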
When β is approximately 3.5, the Weibull distribution closely approximates the Normal distribution. There is also a three-parameter version of the Weibull distribution; the third parameter is called the location parameter, and is sometimes called the failure-free time or the minimum life.
The Lognormal Distribution
The lognormal distribution is more versatile than the normal distribution and is often a better fit to reliability data, such as for populations with wear-out characteristics. It also does not have the Normal's disadvantage of extending below zero, i.e. it is always positive. Both the lognormal distribution and the normal distribution are used to describe situations in which the hazard function is increasing.
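Modelling system reliability
Series systems
In a series system every item must work for the system to work. Assuming independent items, the reliability of this model is calculated by:
RS = RA * RB * … * RZ
If the component lifetimes can be assumed to be exponentially distributed, then the system reliability can be calculated as:
RS(t) = e^{-(λA + λB + … + λZ)t}
The failure rate of the system is calculated by adding the component failure rates together, i.e.
λS = λA + λB + … + λZ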
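A series calculation of this kind can be sketched in a few lines of Python. The function names and the three component failure rates are hypothetical, and the components are assumed to be independent with constant failure rates.

```python
import math

def series_reliability(reliabilities):
    """Series system: the product of the component reliabilities."""
    return math.prod(reliabilities)

def series_failure_rate(failure_rates):
    """Constant-hazard series system: component failure rates add."""
    return sum(failure_rates)

def series_reliability_exponential(failure_rates, t):
    """R_S(t) = exp(-(sum of failure rates) * t)."""
    return math.exp(-series_failure_rate(failure_rates) * t)

# Hypothetical component failure rates, in failures per hour.
rates = [1e-4, 2e-4, 5e-5]
mission_time = 1000.0  # hours

system_rate = series_failure_rate(rates)
r_system = series_reliability_exponential(rates, mission_time)

# Same answer from multiplying the individual component reliabilities.
r_product = series_reliability(
    [math.exp(-rate * mission_time) for rate in rates]
)

print(f"system failure rate = {system_rate:.2e} per hour")
print(f"system reliability at {mission_time} h = {r_system:.3f}")  # roughly 0.705
```

Summing the failure rates of every part in this way is the basis of the parts count method.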
Active redundancy
One of the most common forms of redundancy is the parallel reliability model, in which two independent items are operating but the system can operate successfully as long as one of them is working. The reliability of the system is equal to the probability of item A or item B surviving, i.e.
RS = RA + RB - (RA * RB)
And for the constant hazard function case:
RS(t) = e^{-λA t} + e^{-λB t} - e^{-(λA + λB)t}
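The parallel model can be sketched in Python as follows. This is an illustration under the stated independence assumption; the function names and the numerical values are hypothetical. The general form 1 - (1 - RA)(1 - RB) expands to RA + RB - RA * RB for two items.

```python
import math

def parallel_reliability(reliabilities):
    """Active parallel system: it fails only if every item fails."""
    unreliability = 1.0
    for r in reliabilities:
        unreliability *= (1.0 - r)
    return 1.0 - unreliability

def parallel_reliability_exponential(rate_a, rate_b, t):
    """Two-item active redundancy with constant hazard functions."""
    r_a = math.exp(-rate_a * t)
    r_b = math.exp(-rate_b * t)
    return r_a + r_b - r_a * r_b

# Hypothetical item reliabilities over the same mission.
r_two_items = parallel_reliability([0.9, 0.8])  # 0.9 + 0.8 - 0.72
print(f"two-item parallel reliability = {r_two_items:.2f}")

# Hypothetical failure rates (per hour) and mission time (hours).
r_t = parallel_reliability_exponential(1e-3, 2e-3, 100.0)
print(f"constant-hazard parallel reliability at 100 h = {r_t:.4f}")
```

The redundant pair is more reliable than either item alone, which is the reason for adding the second item.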
M-out-of-N redundancy
In some active parallel redundant configurations, m out of the n items may be required to be working for the system to function. The reliability of an m-out-of-n system with n identical independent items, each of reliability R, is given by:
RS = Σ_{i=m}^{n} C(n, i) R^i (1 - R)^{n-i}
where C(n, i) = n! / (i! (n - i)!) is the number of ways of choosing i working items from n.
There are other system reliability models, including the standby redundancy situation, but those above are the simplest.
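The m-out-of-n calculation is a binomial sum, which can be sketched in Python as below. The function name and the example values (a 2-out-of-3 system with item reliability 0.9) are hypothetical; the items are assumed identical and independent.

```python
from math import comb, isclose

def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, independent items
    (each with reliability r) are working."""
    return sum(
        comb(n, i) * r**i * (1.0 - r) ** (n - i)
        for i in range(m, n + 1)
    )

# Hypothetical 2-out-of-3 system with item reliability 0.9:
# 3 * 0.9^2 * 0.1 + 0.9^3 = 0.243 + 0.729
r_2oo3 = m_out_of_n_reliability(2, 3, 0.9)
print(f"2-out-of-3 reliability = {r_2oo3:.3f}")

# Limiting cases: m = 1 is the active parallel model, m = n is series.
print(isclose(m_out_of_n_reliability(1, 2, 0.9), 0.9 + 0.9 - 0.9 * 0.9))
print(isclose(m_out_of_n_reliability(3, 3, 0.9), 0.9**3))
```

The two limiting cases are a useful sanity check, since they must agree with the series and parallel formulas given earlier.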
• PDF, CDF, reliability function and hazard function.
• Bath-tub curve: infant mortality, useful life and wear-out.
• The exponential distribution is the most widely used and has a constant hazard function.
• The Weibull distribution, through its shape parameter, can model decreasing and increasing hazard functions. When β = 1 it is equal to the exponential distribution.
• The characteristic life is the time by which 63.2% of the population will have failed.
• Series systems modelling is used for estimating system reliability by the parts count method.
Reliability tools and techniques
Top-down method
• An undesirable single event, or system success, at the highest level of interest (the top event) should be defined.
• Contributory causes of that event at all levels are then identified and analysed.
• Analysis starts at the highest level of interest and proceeds to successively lower levels.
• It is an event-oriented method.
• It is useful during the early conceptual phase of system design.
Some examples of top-down methods include fault tree analysis (FTA), reliability block diagram (RBD) and Markov analysis.
Bottom-up method
• Identify fault modes at the component level.
• For each fault mode, the corresponding effect on performance is deduced for the next higher system level.
• The resulting fault effect becomes the fault mode at the next higher system level, and so on.
• Successive iterations result in the eventual identification of the fault effects at all functional levels up to the system level.
• The method is rigorous in identifying all single fault modes.
• Initially it may be qualitative.
Some examples of bottom-up methods include event tree analysis (ETA), FMEA and hazard and operability study (HAZOP).
FRACAS
A Failure Reporting and Corrective Action System (FRACAS) is a closed-loop, coordinated system that is used to manage failures throughout the product life cycle. It records information about failures of a product, including when and where they occurred, and it also ensures that the details of corrective actions are documented.
Reliability management
Key aspects of reliability management include:
• Corporate level involvement
• Reliability as an integral part of product development, not a parallel activity
• Reliability procedures integrated into the design process
• Reliability built into the programme plan, with a reliability plan produced
• Ownership of the reliability plan within the design team
A reliability plan should contain the following:
• Statement of reliability requirement
• Organisation for reliability
• Reliability activities to be performed and why
• Timing of major activities
• Management of suppliers
• Standards and company procedures to be used
Mean Time Between Failures (MTBF) is a reliability term for the average operating time between failures of a product. Its reciprocal, the failure rate, is often quoted as the number of failures per million hours. MTBF is the most common inquiry about a product's life span, and is important in the decision-making process of the end user. MTBF is more important for industries and integrators than for consumers. Most consumers are price driven and will not take MTBF into consideration, and the data is often not readily available to them. On the other hand, when equipment such as media converters or switches must be installed in mission-critical applications, MTBF becomes very important.
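Under the constant failure rate assumption, an MTBF in hours and a failure rate in failures per million hours are reciprocals of one another, so converting between them is a one-line calculation. A minimal Python sketch follows; the function names and the figure of 4 failures per million hours are hypothetical.

```python
def mtbf_hours_from_failures_per_million_hours(failures_per_million_hours):
    """Convert a failure rate (failures per 10^6 hours) to MTBF in hours."""
    return 1_000_000.0 / failures_per_million_hours

def failures_per_million_hours_from_mtbf(mtbf_hours):
    """Convert an MTBF in hours to failures per 10^6 operating hours."""
    return 1_000_000.0 / mtbf_hours

# Hypothetical datasheet value: 4 failures per million hours.
mtbf = mtbf_hours_from_failures_per_million_hours(4.0)
print(f"MTBF = {mtbf} hours")  # 250000.0 hours

# Converting back recovers the original failure rate.
print(failures_per_million_hours_from_mtbf(mtbf))
```

An MTBF of this size is a population average, not a prediction that any single unit will run for that long before failing.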
Mean Time To Repair (MTTR) is the average time needed to repair a failed hardware module. In an operational system, repair generally means replacing a failed hardware part, so hardware MTTR can be viewed as the mean time to replace a failed hardware module. Taking too long to repair a product drives up the cost of the installation in the long run, because of the downtime until the new part arrives and the window of time that may be required to schedule the installation. To keep MTTR low, many companies purchase spare products so that a replacement can be installed quickly.
Mean Time To Failure (MTTF) is a basic measure of reliability for non-repairable systems. It is the mean time expected until the first failure of a piece of equipment. MTTF is a statistical value and is meant to be the mean over a long period of time and a large number of units. Technically, MTBF should be used only in reference to a repairable item, while MTTF should be used for non-repairable items. However, MTBF is commonly used for both repairable and non-repairable items.
System configurations: series; parallel; k-out-of-m network; series and parallel network; bridge network.