Why Reliability and Maintenance Professionals Keep Talking Past Each Other


A review of public discussions on professional society websites and social media reveals recurring patterns in debates among reliability and maintenance professionals.  While this article focuses on two types of engineering training, it does not imply that non-engineering reliability and maintenance personnel are excluded; rather, it concentrates on the mindsets most often in conflict.

The result is a communication challenge within the industry itself – one that foreshadows the difficulty the reliability and maintenance community faces in communicating with management.

Reliability and maintenance professionals spend a significant amount of time debating methods, metrics, and best practices.  Whether these debates occur among organizations, individuals, or even business sectors, a pattern appears to emerge.

The debate isn’t about competence; it’s about fundamentally different ways of defining reliability.

Mechanical and electrical engineers often view reliability through the lens of deterministic design and absolutes – think stress-strain calculations, fatigue curves, and generous safety factors.  Reliability engineers, by contrast, draw on industrial engineering and are expected to embrace probability and statistics: hazard rates, survival functions, and life-data analysis.

Miscommunication occurs not only when speaking with the C-suite but also within the reliability community itself: the same acronym, such as MTBF, may have very different meanings depending on who you ask.  This article examines why these polarized discussions occur, where the biases originate, and how to bridge the gap.

Where the Divide Starts: Training And Bias

Most mechanical and electrical engineering programs train students to design against worst-case loads.  Fatigue design uses curves generated from laboratory tests: specimens are loaded at constant stress levels (or per standardized test protocols) until failure, and the resulting stress versus cycles-to-failure data are used to derive design curves after applying reduction factors and safety margins.

Fatigue failures can occur at stress levels well below expected values because the published curves generally represent some type of mean value for a given sample size.  To cope with this uncertainty, and depending on economic and other factors, a safety factor may be applied.

Deterministic methods have limitations: curves are derived from standard specimens and therefore do not account for real-world variability arising from manufacturing defects or differing usage conditions.  More importantly, the curves do not include the probabilistic nature of fatigue.

For a given stress level, cycles to failure can vary widely.  Because random scatter is hidden behind safety factors, mechanical and electrical engineers are not accustomed to thinking in terms of distributions, hazard rates, or failure probability.  As a result, many view probability theory with suspicion, partly because the language and mathematics associated with probability, and related definitions, are unfamiliar.

Reliability engineering emerged from the need to anticipate when and why systems fail.  Reliability is defined as the probability that an item performs its required function under specified conditions for a stated period.  Reliability engineers, therefore, treat time-to-failure as a random variable.

Reliability engineers rely on statistical tools such as life-data (Weibull) analysis, in which failure times are collected, fitted to a probability distribution, and then used to estimate life parameters, plan warranties, and identify failure modes.  The hazard function expresses the conditional probability of failure in an interval, given survival up to that time, and plotting hazard rates often reveals distinct reliability curves (e.g., bathtub, random, wear-out) with variations of infant-mortality, useful-life, and wear-out regions.
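As a rough illustration, a Weibull fit by median-rank regression (a common hand method for life-data analysis) can be sketched in a few lines of Python; the failure times below are hypothetical, not drawn from any real dataset:

```python
import math

def weibull_fit(times):
    """Estimate Weibull shape (beta) and scale (eta) by median-rank
    regression: least squares of ln(-ln(1 - F)) against ln(t)."""
    t = sorted(times)
    n = len(t)
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        f = (i - 0.3) / (n + 0.4)   # Benard's median-rank approximation
        xs.append(math.log(ti))
        ys.append(math.log(-math.log(1.0 - f)))
    mx, my = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))
    eta = math.exp(mx - my / beta)  # from intercept = -beta * ln(eta)
    return beta, eta

def hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

failures = [410, 680, 950, 1200, 1600]  # hypothetical hours to failure
beta, eta = weibull_fit(failures)
```

A shape parameter above one suggests wear-out (rising hazard), near one suggests random failures, and below one suggests infant mortality – which is exactly the distinction the reliability curves above encode.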

Reliability engineers also compute metrics such as mean time to failure (MTTF) by integrating the reliability function.  They distinguish between reliability (probability of failure-free operation over an interval) and availability (probability of operating on demand).  To them, metrics such as mean time between failures (MTBF) and mean time to repair (MTTR) describe random variables that require appropriate statistical treatment, not merely averages.
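The MTTF-by-integration idea can be checked numerically against the Weibull closed form; this is a generic sketch with arbitrary parameters, not tied to any particular asset:

```python
import math

def weibull_reliability(t, beta, eta):
    """R(t) = exp(-(t/eta)**beta): probability of surviving to time t."""
    return math.exp(-(t / eta) ** beta)

def mttf_numeric(beta, eta, upper=None, steps=20000):
    """MTTF = integral of R(t) dt from 0 to infinity, approximated
    by the trapezoidal rule over a finite horizon."""
    upper = upper or 10.0 * eta  # tail beyond this is negligible here
    dt = upper / steps
    total = 0.0
    for i in range(steps):
        total += 0.5 * dt * (weibull_reliability(i * dt, beta, eta)
                             + weibull_reliability((i + 1) * dt, beta, eta))
    return total

def mttf_closed_form(beta, eta):
    """For a Weibull distribution, MTTF = eta * Gamma(1 + 1/beta)."""
    return eta * math.gamma(1.0 + 1.0 / beta)
```

For beta = 1 (the exponential case) the MTTF simply equals eta; for other shapes the gamma-function correction applies, which is one reason a single "mean life" number hides the shape of the distribution.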

The divide is not merely numerical; it is educational.  Academic publications ask whether the courses offered in engineering departments are deterministic or probabilistic in orientation, usually noting that textbooks often neglect real-world problems involving random variables and that design courses emphasize code compliance rather than reliability assessment.  This lack of exposure frequently leads to bias in the use of MTBF, MTTR, and other key performance indicators (KPIs).

KPI Confusion: The Language of Metrics

KPIs are supposed to provide common ground, yet they are often the source of misunderstanding.  For many in operations and maintenance, MTBF is the default reliability metric – “our pump has an 18-month MTBF, so schedule the overhaul accordingly.”  However, reliability engineers caution that MTBF is simply the mean of the time-to-failure distribution and can be heavily skewed by early or late failures.  Proper analysis splits the data by failure mode and uses appropriate statistical models to capture variability.
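The skew problem is easy to demonstrate with two hypothetical failure-time samples that share an identical MTBF but describe very different machines:

```python
import statistics

# Hypothetical failure times in hours.  Both pumps have the same arithmetic
# MTBF, yet an MTBF-based overhaul schedule would serve them very differently.
pump_a = [880, 900, 920, 940, 960]    # tight cluster: predictable wear-out
pump_b = [100, 150, 300, 1050, 3000]  # early failures mixed with long survivors

mtbf_a = statistics.mean(pump_a)      # 920 h
mtbf_b = statistics.mean(pump_b)      # also 920 h
spread_a = statistics.stdev(pump_a)
spread_b = statistics.stdev(pump_b)   # dozens of times larger
```

Scheduling pump B’s overhaul at its 920-hour MTBF would arrive after most of its early failures and long before its survivors wear out; splitting the data by failure mode, as recommended above, exposes the difference the mean conceals.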

Maintenance View vs Reliability View

Mechanical and electrical engineers may misinterpret MTBF as a deterministic life guarantee.  A reliability engineer sees MTBF as a comparative metric: if pump A has a longer MTBF than pump B under similar conditions, it may be more reliable.  But using MTBF to plan replacements without considering context and distribution can be misleading.  Such miscommunication fuels debates on social media and discussion forums.

MTTR, P-F Curves, and Other KPI

Mean time to repair (MTTR) is, in industrial engineering, the average time between a system’s functional failure and its return to service.  Reliability professionals use a variety of distributions to model repair times, whereas maintenance teams often focus on the arithmetic mean.  Some industry practices use a variation of MTTR that covers only the time from corrective-action initiation to completion, rather than from fault occurrence to return to service.

The P-F curve illustrates the interval between the detection of a potential failure and the occurrence of functional failure.  Reliability engineers emphasize that condition-monitoring techniques must be selected based on failure modes and asset criticality within the reliability-centered maintenance (RCM) process. 

To someone outside the field of reliability, the P-F curve may appear theoretical or irrelevant.  Similarly, RCM customizes maintenance tasks based on asset criticality and failure modes; yet many maintenance teams default to fixed preventive maintenance intervals or run-to-failure strategies, leading to disagreements about cost and value.
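As a deliberately simplified illustration of why the P-F interval drives inspection frequency, the sketch below treats each inspection inside the P-F window as an independent detection opportunity; the interval lengths and per-inspection detection probability are assumptions for illustration, not recommendations:

```python
def detection_probability(pf_interval_days, inspection_interval_days, p_detect):
    """Probability that at least one inspection catches the potential
    failure inside the P-F window, assuming independent inspections
    each with detection probability p_detect (a simplification)."""
    inspections_in_window = int(pf_interval_days // inspection_interval_days)
    # If the route is less frequent than the P-F interval, the fault can
    # progress to functional failure with no inspection at all.
    return 1.0 - (1.0 - p_detect) ** inspections_in_window
```

With a hypothetical 90-day P-F interval and an 80%-effective technique, a 30-day route gives three chances (about 99% overall detection), while a 90-day route gives only one (80%) – the kind of context that makes the P-F curve practical rather than theoretical.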

Metrics don’t create clarity on their own; only context turns numbers into decisions.

One reason discussions become polarized is metric overload.  Reliability professionals track MTBF, MTTR, availability, failure probabilities, risk priority numbers from FMEA, and many other KPIs.  Operations managers and the C-suite may have limited bandwidth to digest them. 

Conversely, focusing solely on a few high-level metrics can obscure root causes.  Successful programs, some of which are covered in ‘Physical Asset Management for the Executive’ (Penrose, 2008), have shown that using a few targeted KPIs can significantly improve decision-making and that those metrics must be tailored to the audience and objectives.

The rise of predictive maintenance and artificial intelligence has added fuel to debates.  Many vendors claim that their machine-learning (ML) models can predict failures more accurately than human experts.  However, ML algorithms can only recognize patterns they were trained on; if a fault falls outside the training set, the model may miss it. 

An expert system built on physics and experience may identify issues that an ML model misses.  Discussions around AI frequently turn into arguments about which method is ‘better,’ when the real problem is training bias: a model is only as good as the data and knowledge embedded within it.

Communication: It’s Not Just the C-Suite

Many reliability and maintenance professionals lament that senior management doesn’t understand reliability.  While communicating value to executives is important, the reliability community itself often fails to communicate effectively across disciplines.  Consider these common scenarios:

Three common scenarios

  • A reliability engineer presents Weibull plots and hazard rates to a maintenance manager who only wants to know when the next shutdown is required. The presentation fails because it doesn’t translate statistics to actionable maintenance windows.
  • A mechanical engineer designs a component with a factor of safety of two and calls it ‘reliable.’ The reliability engineer knows that a factor of safety doesn’t indicate a probability of failure; reliability must be evaluated from distributions.  The word ‘reliable’ means different things to each person.
  • A maintenance planner uses MTBF as a scheduling tool, unaware that the underlying data follow multiple failure distributions. Reliability staff criticize the misuse, but don’t explain how to interpret the metric in practical terms.
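The second scenario can be made concrete: two designs with the identical factor of safety can carry very different failure probabilities once scatter is considered.  A sketch assuming normally distributed stress and strength, with illustrative numbers only:

```python
import math

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def failure_probability(mu_strength, sd_strength, mu_stress, sd_stress):
    """Stress-strength interference for normal distributions:
    P(failure) = P(stress > strength)."""
    z = (mu_strength - mu_stress) / math.hypot(sd_strength, sd_stress)
    return phi(-z)

# Both designs have a factor of safety of 2 (mean strength 400, mean stress 200),
# but very different scatter:
p_tight = failure_probability(400.0, 20.0, 200.0, 20.0)  # low scatter
p_wide = failure_probability(400.0, 80.0, 200.0, 80.0)   # high scatter
```

Here the low-scatter design fails with essentially negligible probability, while the high-scatter design fails a few percent of the time – with the same safety factor.  That is the gap between the two meanings of ‘reliable.’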

These examples illustrate communication gaps at multiple levels.  They stem from differing definitions, hidden assumptions, and the use of jargon without explanation.  When these gaps overlap with strong opinions (“MTBF is useless!” vs. “MTBF is the industry standard!”), discussions quickly become polarized.

Recognize that deterministic and probabilistic mindsets, and the biases that come with them, are products of education and experience.  A mechanical engineer’s reliance on stress curves and safety factors is not ignorance but adherence to what they were taught.  A reliability engineer’s focus on distributions and probabilities reflects their training.  Respecting each of these perspectives is the first step toward constructive dialogue.

Most reliability conflicts aren’t about right or wrong – they’re about how people were taught to think.

Rather than discarding MTBF, explain its limitations: it’s a mean value that can be skewed and should not be used alone to plan maintenance.  Use Weibull analysis to show the spread of failure times and translate that into recommended inspection intervals. 

For MTTR, present not only the average but also the 95th percentile, which indicates the time by which most repairs should be completed.  When discussing P-F curves, relate them to familiar inspection tasks and how early detection for failure modes reduces unplanned downtime.
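The MTTR point can be sketched with a handful of hypothetical work-order times; the nearest-rank percentile used here is one simple convention among several:

```python
import math
import statistics

# Illustrative repair times in hours (hypothetical work-order data).
repair_hours = [1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 6.0, 18.0]

def nearest_rank_percentile(data, q):
    """Nearest-rank percentile: smallest value with at least q percent
    of the sample at or below it."""
    s = sorted(data)
    k = math.ceil(q / 100.0 * len(s))
    return s[max(k, 1) - 1]

mean_mttr = statistics.mean(repair_hours)              # 4.55 h
p95_mttr = nearest_rank_percentile(repair_hours, 95)   # 18.0 h
```

Quoting only the 4.55-hour mean would badly mislead a planner: one job in this sample took 18 hours, and the 95th percentile communicates that risk directly.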

Reliability is a strategic function designed to meet business objectives, while maintenance is tactical.  Rather than arguing over metrics, align reliability activities with stakeholder-valued outcomes: increased uptime, reduced safety incidents, and lower total cost of ownership.  When the conversation focuses on value, it becomes easier to select relevant metrics and justify investments.  This may mean that KPIs vary depending on stakeholders within the same organization.

Encourage mechanical and electrical engineers to take courses in reliability statistics and probabilistic design.  Likewise, reliability engineers should understand the fundamentals of design and the constraints imposed by manufacturing and operations.

Simulation-based reliability assessment, such as Monte Carlo techniques and AI modeling tools, can help bridge the gap; it demonstrates how random variables such as stress and material strength interact to produce a probability of failure.  Education programs should integrate deterministic and probabilistic approaches to ensure that graduates speak a common language.
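A minimal Monte Carlo sketch of the stress-strength interaction described above, with assumed (purely illustrative) distribution parameters:

```python
import random

def monte_carlo_pof(n=200_000, seed=42):
    """Monte Carlo estimate of failure probability: sample stress and
    strength from their assumed distributions and count the interference
    events where stress exceeds strength."""
    rng = random.Random(seed)  # fixed seed for a repeatable sketch
    failures = 0
    for _ in range(n):
        strength = rng.gauss(400.0, 50.0)        # e.g. material strength, MPa
        stress = rng.lognormvariate(5.2, 0.25)   # skewed service stress, MPa
        if stress > strength:
            failures += 1
    return failures / n
```

Unlike the deterministic safety-factor view, the simulation makes the randomness of both load and capacity explicit, and it works even when the stress distribution is skewed and no closed-form answer exists – a concrete way for the two mindsets to meet.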

For existing professionals, conferences should consider non-commercial workshops (e.g., professional society or academic) on these topics.

Adopt predictive analytics and expert systems not because they are trendy but because they solve specific problems.  Recognize the bias inherent in ML/AI models and supplement them with logic and physics-based diagnostics and expert knowledge.  Share lessons learned – both successes and failures – so that the community can build trust in new technologies without hype.

Conclusion

Polarized discussions in reliability and maintenance often stem from deeply rooted biases in training and communication rather than from insurmountable disagreements.  Mechanical and electrical engineers design against worst-case loads and rely upon safety factors; reliability engineers model distributions and probabilities; operation managers need actionable schedules; and executives need business value – all as generic assumptions.  Each group uses its own terminology and metrics.  Without translation, these differences create friction.

The solution is not to declare one approach superior but to build bridges: respect diverse backgrounds, translate metrics into context, align reliability initiatives with business goals, provide cross-disciplinary education, and deploy technology thoughtfully.

By doing so, the reliability community can move beyond polarized debates and work together to ensure assets perform as intended, safely and economically.  The result will be more meaningful conversations – not just with the C-suite but among the engineers, technicians, and analysts who keep systems running.

Author

  • Howard Penrose

    Howard W. Penrose, Ph.D., CMRP, CEM, CMVP, is president of MotorDoc® LLC, a Veteran-Owned Small Business. He chairs standards at American Clean Power (2022-25), previously led SMRP (2018), and has been active with IEEE since 1993. He represents the USA for CIGRE machine standards (2024-28) and serves on NEMA rail electrification standards (2024+). A former Senior Research Engineer at the University of Chicago, he’s a 5-time UAW-GM Quality Award winner. His work spans GM and John Deere hybrids, Navy machine repair, and high-temperature motors. He holds certifications in reliability, energy, M&V, and data science from Kennedy-Western, Stanford, Michigan, AWS, and IBM.
