
How to Improve Power Supply MTBF and Reliability

Posted by Dulcie on January 18, 2017
Posted in: Power Supply Basics

Power supply reliability is important – no one wants their production line, measurement instrument, communications system or electronic product to stop working prematurely due to a failure. How is reliability defined, what methods can designers use to improve reliability and how do you ensure that a selected power supply product will meet expectations?

Defining MTBF and MTTF
The MTBF (Mean Time Between Failures) is a parameter that is widely used for determining the reliability of a power supply, but it is also often misunderstood and misused as a determining factor. A manufacturer’s datasheet quoting an MTBF figure of 300,000 hours does not indicate that the power supply will last that long before failing. Operating 24 hours a day, 365 days a year, a power supply accumulates 8,760 operating hours per year. 300,000 hours is therefore over 34 years, and very few pieces of equipment installed in 1982 are still operating reliably.

MTBF is the mean time elapsed between two successive failures, and is a substantially different concept from MTTF (Mean Time To Failure). MTTF is a statistical parameter referring to the time elapsed between the start of operation and the first failure. It assumes that the mathematical model of reliability is the exponential function, which seems sensible given real-world examples. The MTTF value represents the time by which 63 percent of the samples under test will have suffered their first failure. This number is often used to estimate how many units need to be stocked as replacements.

As this sounds a more reasonable method of determining how long a power supply will operate, why then is the MTBF number referred to so frequently? To explain this, one needs to examine figure 1, representing the pattern of the “failure rate” over a period of time.

Figure 1 – The typical bathtub curve


The bathtub curve represents the typical development of failures over time. There are three distinct zones: Infant mortality; Constant (random) failure; and Wear-out failure. In the constant failure rate zone, failures occur in a random order – making it difficult to predict which failure mode will occur – but the rate of failures is predictable. It is in this zone where the mathematical models for MTBF and MTTF converge, a reason why the MTBF number is more widely considered.

Unfortunately, this does not guarantee that a power supply will be reliable. A power supply with an MTBF of 80,000 hours (approximately 9 years) will not necessarily last for 80,000 hours. The figure also scales with the number of units in service: when two units are considered, the expected time to the first failure becomes 9/2 = 4.5 years, or 9/4 = 2.25 years for four units.
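The scaling with the number of units follows from the constant failure rate assumption: the failure rates of identical units running together simply add. A minimal Python sketch of that arithmetic is given here; the function name is illustrative and not taken from any library.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation


def mean_time_to_first_failure(mtbf_hours: float, units: int) -> float:
    """Expected hours until the first failure among `units` identical supplies.

    Assumes a constant failure rate, so the rates of the units add.
    """
    if units < 1:
        raise ValueError("units must be at least 1")
    return mtbf_hours / units


for n in (1, 2, 4):
    years = mean_time_to_first_failure(80_000, n) / HOURS_PER_YEAR
    print(f"{n} unit(s): about {years:.2f} years to the first failure")
```

The unrounded results are slightly higher than the 9, 4.5 and 2.25 years quoted, because 80,000 hours is a little more than nine years.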

When life testing is conducted, the probability of a failure can be calculated. Suppose we test 100 power supplies for one year (8,760 hrs) and during this time ten fail. The estimate of MTBF is the total number of device hours (876,000) divided by the number of failures, which gives 87,600 hours per failure.
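The life-test estimate can be written as a short Python sketch. It assumes every unit accumulates the full test time, which is how the 876,000 device hours in the example are obtained; the function name is illustrative.

```python
def estimate_mtbf(units_tested: int, test_hours: float, failures: int) -> float:
    """Point estimate of MTBF: accumulated device hours per observed failure.

    Assumes each unit contributes the full test time (failed units are
    repaired or replaced), matching the worked example in the text.
    """
    if failures < 1:
        raise ValueError("at least one failure is needed for a point estimate")
    return units_tested * test_hours / failures


print(estimate_mtbf(100, 8_760, 10))  # 87600.0
```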

The probability that a power supply will survive to its calculated MTBF is 36.8 percent, no matter what the MTBF number is: $R(t) = e^{-t/\mathrm{MTBF}} = e^{-1} \approx 0.368$ when $t = \mathrm{MTBF}$.
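The exponential model is easy to evaluate for any operating time. The sketch below, with an illustrative function name, prints the surviving and failed fractions after one year, after five years and at the MTBF itself for the 300,000 hour figure used earlier.

```python
import math


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of surviving to t_hours, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 300_000
for t in (8_760, 5 * 8_760, mtbf):
    r = reliability(t, mtbf)
    print(f"t = {t:>7} h: surviving {r:.1%}, failed {1 - r:.1%}")
```

At t = MTBF the failed fraction is about 63 percent, the same figure quoted for MTTF.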

A system made up of “n” separate parts is defined statistically as a series system; this is common practice for a power supply that is made up of multiple components. The system is defined as functioning when all the parts are working, and will be in a state of failure when just one single part turns out to be defective.

Each individual part making up the system is independent of the others, whether for reliability or for its failure rate. For simplicity the mathematical model used to describe the reliability as a function of the failure rate is of an exponential type.

In a series system, the total reliability of the system is equal to the product of the individual reliabilities of the parts making up the series. The total failure rate, though, is equal to the sum of the individual failure rates. Because each reliability is a number smaller than one, the product will certainly be lower than the reliability of any single part, and the total failure rate will certainly be higher than that of any single part.
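A short Python sketch makes the series rule concrete. The part names and failure rates below are hypothetical values chosen only for illustration, and the exponential model is assumed for every part.

```python
import math

# Hypothetical part failure rates in FIT (failures per 1e9 hours).
PART_RATES_FIT = {
    "mosfet": 2131.2,
    "controller_ic": 150.0,
    "output_capacitor": 400.0,
}


def part_reliability(rate_fit: float, t_hours: float) -> float:
    """Reliability of one part under the exponential model."""
    return math.exp(-rate_fit * 1e-9 * t_hours)


def series_failure_rate(rates_fit) -> float:
    """Total failure rate of a series system: the sum of the part rates."""
    return sum(rates_fit)


def series_reliability(rates_fit, t_hours: float) -> float:
    """Total reliability of a series system: the product of part reliabilities."""
    return math.prod(part_reliability(r, t_hours) for r in rates_fit)


rates = list(PART_RATES_FIT.values())
t = 5 * 8_760  # five years of continuous operation
print(f"total failure rate: {series_failure_rate(rates):.1f} FIT")
print(f"system reliability at {t} h: {series_reliability(rates, t):.4f}")
print(f"best single part at {t} h: {max(part_reliability(r, t) for r in rates):.4f}")
```

The product of the part reliabilities equals the reliability computed from the summed failure rate, which is why the two statements of the series rule are equivalent.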

Improving reliability
Product and/or system reliability should be a key focal point during the design and development process. If it is not, issues cannot be identified and concerns cannot be assessed while the design concept is taking shape. It is too late to consider reliability implications towards the end of the development.

For a power supply to be reliable, it has to be simple. Design work aimed at simplicity will produce a power supply that is more reliable than a complex one. For example, a basic single output low power converter will have greater calculated reliability than a multiple output high power converter. The addition of protection circuits, though, will increase the actual life of the power supply.

In the early stages of development, the power supply circuit should be subdivided into two macro-blocks: critical applications and non-critical applications. This will assist the designer to look at component selection and derating coefficients.

Critical applications consist of areas where a failure will cause the power supply to stop functioning. Non-critical areas are the auxiliary applications. For the critical applications, the parts must be of the highest quality, and the use of components that deteriorate over time, such as electrolytic capacitors, fans and relays, should be minimised.

Figure 2 shows a possible derating plan for a given component in a non-critical application, and figure 3 represents the same component in a critical application.


Figure 2 – Non-critical derating plan


Figure 3 – Critical derating plan

The y-axis shows S, a coefficient indicating the stress on the component, and the x-axis shows the working temperature of the component. Zone A is the permissible zone; zone Q is the zone where a failure of the component may be problematic; and zone R is the prohibited zone.

Two of the accepted standards for performing reliability predictions are MIL-HDBK-217 and Bellcore/Telcordia Technical Reference TR-332. Both of these empirical prediction methods have several assumptions in common – constant failure rate, the use of thermal and stress acceleration factors, quality factors, and use conditions. They are both based on models developed from statistical curve fitting of historical failure data, which may have been collected in the field, in-house or from manufacturers. Probably the most widely known and used reliability prediction handbook is MIL-HDBK-217.

The military standards (MIL-HDBK-217F, MIL-HDBK-251M, MIL-HDBK-781A and MIL-HDBK-338B) give some of the best indications of the corrective coefficients applicable to the failure rates of the various components. These coefficients depend on the conditions of use and on the temperature. The standards also provide information on reliability tests and on “design reliability”.

For example, the mathematical model of the overall failure rate for a power MOSFET according to MIL-HDBK-217F is its basic failure rate multiplied by the particular Temperature, Application, Quality and Environment factors. It should be noted that reliability is a dimensionless number between 0 and 1. The failure rate, though, is commonly expressed in FIT, particularly in the electronics industry. The FIT (failures in time) of a device is the number of failures that can be expected in one billion (10^9) hours of operation. Considering a power MOSFET suitable for a 250W power supply (basic failure rate is 12 FIT), working at a temperature approaching 100 degC (thermal coefficient is 3.7), with quality factor JANTX (according to MIL-S-19500 is 8) and an environmental factor GF (Ground Fixed is 6.0), the total failure rate will be 2.1312 failures/10^6 h.

Once the thermal, quality and environmental factors are considered, the basic failure rate of 12 FIT has increased to 2,131.2 FIT. As MTTF is the inverse of the failure rate, this results in an MTTF for the power MOSFET of about 470,000 hours. The initial MTTF, however, without the environmental and thermal factors, was about 83,000,000 hours.

To improve the reliability of the MOSFET and hence the power supply, the designer could choose to increase the thermal derating of the part. By significantly cooling the part down and setting the working temperature of the MOSFET at 80 degC rather than 100 degC, the thermal coefficient would go from 3.7 to 2.7, making the MTTF 643,000 hours – over a 36 percent improvement.
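The MOSFET arithmetic can be reproduced with a few lines of Python. This is a sketch under stated assumptions rather than an implementation of the handbook: the function names are illustrative, only the three factors with quoted values are multiplied, and the environment factor is taken as 6.0 because that value reproduces the quoted 2,131.2 FIT, 470,000 hour and 643,000 hour figures.

```python
def failure_rate_fit(base_fit: float, pi_temperature: float,
                     pi_quality: float, pi_environment: float) -> float:
    """Base failure rate scaled by the multiplicative correction factors."""
    return base_fit * pi_temperature * pi_quality * pi_environment


def mttf_hours(rate_fit: float) -> float:
    """MTTF is the inverse of the failure rate; 1 FIT = 1 failure per 1e9 h."""
    return 1e9 / rate_fit


# Assumed factors: quality 8 as quoted, environment 6.0 (see the note above).
rate_100c = failure_rate_fit(12, 3.7, 8, 6.0)  # MOSFET at about 100 degC
rate_80c = failure_rate_fit(12, 2.7, 8, 6.0)   # MOSFET derated to 80 degC

print(f"base only: MTTF {mttf_hours(12):,.0f} h")
print(f"100 degC : {rate_100c:,.1f} FIT, MTTF {mttf_hours(rate_100c):,.0f} h")
print(f"80 degC  : {rate_80c:,.1f} FIT, MTTF {mttf_hours(rate_80c):,.0f} h")

improvement = mttf_hours(rate_80c) / mttf_hours(rate_100c) - 1
print(f"MTTF improvement from derating: {improvement:.1%}")
```

Because the factors are multiplicative, the improvement depends only on the ratio of the two thermal coefficients, 3.7 / 2.7.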

The reduction in the failure rate, under pre-set operating and environmental conditions, is possible only through careful analysis of the Electrical Stress and Thermal Stress factors for correct dimensioning of the power MOSFET. This is possible by means of a check carried out through the derating plans, illustrated in Figures 2 and 3, suitably constructed on the basis of the application and type of component.

The Technical Reference TR-332 standard was originally developed by Bellcore, now known as Telcordia, for the telecommunication industry, where manufacturers deemed that MIL-HDBK-217 was a little too harsh. Its international popularity is growing, but the MIL-HDBK-217 calculation will typically show a higher failure rate than the Bellcore/Telcordia standard for the same system. This difference between the standards stems from the original intended use of the MIL standard for aerospace, military and other mission critical applications.

The gathering and analysis of data
It is possible to calculate the reliability figures, with a good degree of accuracy, using data gathered from HALT (Highly Accelerated Life Test) trials. The number of samples to test, though, is usually limited, the number of test stations fixed and the amount of time available to run tests restricted. This affects the confidence interval within which the designer wants the calculated parameter to fall. The analysis requires the chi-squared distribution, χ²(a, b), a function of two variables that is tabulated in most statistics textbooks and available in most spreadsheet programs.

For the sake of simplicity, the calculation of a one-sided lower confidence limit is considered in two special cases: truncated trials with replacement and truncated trials without replacement. The first case mainly concerns calculation of the true MTTF from field returns, though this method is unsatisfactory as the manufacturer does not know how long the units have been operating, nor the operating conditions. The second case concerns the HALT trials.
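As an illustration of how the chi-squared function enters the calculation, the Python sketch below computes a one-sided lower confidence limit on MTTF using the widely used textbook relation 2T / χ²(confidence, 2r + 2) for a test stopped at a fixed time, where T is the total accumulated device hours and r the number of failures. That relation and all function names are assumptions of this sketch, not taken from the article. The chi-squared quantile is found by bisection on the closed-form distribution that exists for even degrees of freedom, so only the standard library is needed.

```python
import math


def chi2_cdf_even(x: float, dof: int) -> float:
    """Chi-squared CDF for even degrees of freedom (closed-form Poisson sum)."""
    if dof < 2 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = x / 2.0
    term = total = 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even(p: float, dof: int) -> float:
    """Value x such that chi2_cdf_even(x, dof) == p, found by bisection."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mttf_lower_limit(total_device_hours: float, failures: int,
                     confidence: float) -> float:
    """One-sided lower confidence limit on MTTF for a time-truncated test."""
    dof = 2 * failures + 2
    return 2.0 * total_device_hours / chi2_quantile_even(confidence, dof)


# 100 units for one year with ten failures, at 90 percent confidence.
print(f"{mttf_lower_limit(876_000, 10, 0.90):,.0f} h")
```

For the earlier life-test example the lower limit comes out at roughly 57,000 hours, noticeably below the 87,600 hour point estimate, which shows how a small number of failures widens the uncertainty.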

Once the HALT process has been completed, and the appropriate corrective actions have been taken, a custom production screening process can be created to identify process defects. Highly Accelerated Stress Screening, also known as HASS, applies all stresses simultaneously. Based on HALT limits, HASS stress levels evaluate the power supply near its operating limits.

What to look for in a power supply
Start with the desired useful life of the power supply. This will depend upon the application and often on the cost of downtime or repair. Review any reliability data that the manufacturer offers. If only an MTBF figure is given, ask for the detailed calculations, particularly if the number seems abnormally high.

In Japan, the JEITA RCR-9102 standard is often used. Originally based on MIL-HDBK-217F, the standard has been revised several times, and MTBF numbers calculated to RCR-9102B are half those in the original standard.

As a note of caution, electrolytic capacitor life should also be considered, particularly at higher ambient temperatures. Several power supply manufacturers state capacitor life in their reliability data. Ensure that if the power supply has both convection and forced air ratings, the life is stated for the desired cooling method.

Reputable manufacturers will be able to provide detailed calculations – including the effect of ripple current that can shorten capacitor life due to excessive internal heating if not properly considered at the design stage.
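To get a feel for how strongly temperature affects electrolytic capacitor life, a common rule of thumb is that life roughly doubles for every 10 degC reduction below the rated temperature. The rule is not taken from this article, and the Python sketch below, with an illustrative function name and example values, ignores ripple current heating. A manufacturer’s detailed calculation should take precedence.

```python
def capacitor_life_hours(rated_life_hours: float, rated_temp_c: float,
                         actual_temp_c: float) -> float:
    """Estimated life using the 10 degC doubling rule of thumb.

    Assumption: life doubles for every 10 degC below the rated temperature.
    Self-heating from ripple current is not modelled.
    """
    return rated_life_hours * 2 ** ((rated_temp_c - actual_temp_c) / 10)


# Example values only: a 5,000 hour, 105 degC part running at 65 degC.
print(capacitor_life_hours(5_000, 105, 65))  # 80000.0
```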

TDK-Lambda UK
www.emea.lambda.tdk.com/uk