Reliability and safety are often discussed as inherently linked, but the reality is more complex. While a highly reliable system reduces unexpected failures and disruptions—thereby decreasing the likelihood of hazardous situations—it does not automatically ensure safety.
Safety is a system-level property that depends on broader factors such as design, procedures, and human factors. This article explores the nuanced relationship between reliability and safety, challenging the assumption of a direct correlation and inviting discussion on how these concepts interact in real-world operations.
Challenging the Assumption: Does Reliability Guarantee Safety?
The following excerpt from Nancy Leveson’s Engineering a Safer World challenges the conventional wisdom that improving reliability alone guarantees safety. This perspective provides a foundation for exploring how reliability contributes to but does not solely determine overall system safety.
Assumption 1: Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur. (p. 7)
This assumption is widespread in engineering and other fields but flawed.
Safety is a system-level property, not a component-level property. Controlling safety requires a holistic, system-wide approach rather than focusing solely on individual component reliability.
Revised Assumption 1: High reliability is neither necessary nor sufficient for safety. (p. 13)
From my 30+ years in the reliability field, I see a connection between the two—but it is not a direct correlation. A system can be highly reliable yet still unsafe, and conversely, a safe system can be unreliable.
However, in my experience, a reliable operation is inherently safer than an unreliable one. Fewer unexpected stops, starts, and deviations mean fewer instances where operators must react quickly to regain control. Stability in operations naturally reduces the risk of accidents and errors.
The graphic above was from a thought-provoking post by Dustin Etchison that expresses a correlation between Safety and Reliability.
The Role of Root Cause Analysis in Safety and Reliability
I believe the Safety world holds an inaccurate view of present-day RCA and therefore treats all RCA as a commodity, equivalent to the 5-Whys with its limited capabilities (a linear method that identifies a single root cause). I also believe that how well we solve failures (losses resulting from deviations from an acceptable standard) directly impacts the safety of our workforce.
True root cause analysis is not about blame—it’s about understanding intent.
Contrary to popular belief, true RCA does NOT stop at blaming someone (based on a decision resulting in a bad outcome) but rather at understanding the reasoning behind their decision (their intent) at the time. Delving into a person’s intent for their decision will often involve uncovering flawed organizational systems, restraining paradigms, cultural norms, and other sociotechnical influences.
I’ve talked to many people about this topic, and most feel strongly that there is a definite correlation between Reliability and Safety. However, we also recognize that it is unlikely to be a direct correlation, because an operation can be reliable yet unsafe, and vice versa.
We also strongly believe that when we experience unexpected conditions (upsets), we test the boundaries of our safety controls and are at higher risk of experiencing a safety incident. Steady-state (reliable) operations are typically less prone to such elevated risks.
Industry Perspectives: Examining the Data
Ron Moore’s article, ‘A Reliable Plant is a Safe Plant, is a Cost Effective Plant,’ is the only one I have seen thus far based on studies conducted at actual, specific plant operations over a designated period. The focus of the studies mentioned is to draw the links between Reliability, Safety, and Costs. Here are a few of Ron’s conclusions:
I asked Ron Moore to comment on the quotes above from Ms. Leveson. Here are his thoughts:
“Bob, below are my initial thoughts. I haven’t read her (Leveson) paper, so these are based on the quotes. I’m also assuming you’ve read the paper ‘A reliable plant is a safe plant is a cost-effective plant.’
The data I shared in the paper is only a fraction of what I have, but it’s all consistent with what I’ve shown in the paper. I have six different sets of data from other companies demonstrating that as OEE improves, the injury rate declines. I have three sets of data relating reactive maintenance and injuries, along with PM/PdM and injuries.
When the OEE improves, safety improves.
My initial comments on the quotes you’ve provided from Ms. Leveson are provided below.
“Assumption 1: Safety is increased by increasing system or component reliability. If components or systems do not fail, then accidents will not occur. (p. 7)”. This assumption is one of the most pervasive in engineering and other fields. The problem is that it is not true. Safety is a system property, not a component property, and must be controlled at the system level, not the component level.
This appears to be an incorrect interpretation or characterization of the data. My data says that safety is improved by improving system reliability (and, by inference, component reliability). If you reduce the failures, both component and system level, you reduce the exposure to the risk of injury and, therefore, the probability of injury.
However, I agree that it does not mean that accidents will not occur since accidents are caused by any number of variables, some of which are not controlled by reliability excellence. I also agree that safety is a system property, not a component property, and must be controlled at the system level.
In my view, one of the best, if not the best, measures for reliability is OEE/AU, a system-level measure. Reliability isn’t just about maintenance, but her statements/assumptions seem to imply that it is. Indeed, my data says that maintenance typically only controls some 10% of the loss of production capacity captured in the OEE measure.
Moreover, reliability is driven by our practices in design, procurement, stores, installation and startup, operation, and maintenance, all of which contribute positively or negatively to system-level reliability (not just equipment or components). Reducing the number of defects in these practices, both within each function and cooperatively as a team, will improve reliability and reduce the risk of injury while reducing costs and environmental incidents.
New Assumption 1: High reliability is neither necessary nor sufficient for safety. (p.13)”
I think this is a really bad assumption, even risky. It’s perplexing why anyone would say this. Why wouldn’t you want high reliability, particularly if it reduces risk – the risk of injury, the risk of high costs, and the risk of environmental incidents? This assumption may depend on her definition or view of reliability being driven by maintenance.
Reliability should not be driven by maintenance. Maintenance is a support function to the overall plant and production process.
I have data on manufacturing businesses that improve safety without commensurate improvement in reliability. However, they reach a point where additional improvements in safety do not appear to be achievable because the system has reached a statistically stable state.
For example, you can improve safety by improving personal behavior – wear your PPE, do your lock-out/tag-out properly, etc. However, once you do this exceptionally well, you must reduce the exposure to the risk of injury; that is, you must improve process reliability (not just equipment) to achieve further gains.”
Additional note from Ron Moore: “I scanned through Chapter 2 of Leveson’s book, and we may be talking in two different languages, or perhaps the same language, but different dialects. I agree that safety and reliability are different properties and that a reliable system can be unsafe, and a safe system can be unreliable. She gives good examples.
When I talk about reliability, I’m generally not using the standard definition she repeats in the book. I’m thinking about the ability of the business (the system in this case)—the refinery, steel mill, or chemical plant—to deliver its product in a timely, cost-effective, and safe manner.
What I’ve observed in the data is that when the OEE improves, safety improves; when reactive is reduced, OEE improves, and safety improves (but reactive events are not typically caused by maintenance, only performed by them). As practices in operations and maintenance improve, OEE and costs improve, and so on.
OEE graphic courtesy of Bruce Hawkins (How Reliability Impacts Shareholder Value Presentation at SMRP Symposium, Bruce Hawkins, Dir. of Technical Excellence, Emerson Operational Certainty).
A caution I would insert here is that correlation is not necessarily cause and effect. Anyway, the examples she uses (the ones I read) are what I think of as sub-systems; from that context, I can see her point and agree.
Moreover, she makes a good point about using FMEA and the like. It’s really hard to capture all the complexity in a large system (a plant or combination of plants and other functions in a business) using those techniques.
I’ve said for many years that leadership, culture, teamwork, and employee engagement are more important than any particular analysis tool, but they are also important for engaging people in solving problems. My book, What Tool? When?, provides my thoughts on this.”
— End of Ron Moore’s comments —
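Ron's comments lean on OEE (Overall Equipment Effectiveness) as a system-level measure of reliability. For readers less familiar with the metric, it is commonly calculated as availability times performance times quality. The Python sketch below illustrates that common definition only. The `ShiftRecord` type, its field names, and the sample numbers are hypothetical and are not drawn from Ron's data or his own calculation method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftRecord:
    """Hypothetical production record for one shift (illustrative only)."""
    planned_minutes: float      # scheduled production time
    downtime_minutes: float     # unplanned stops within the scheduled time
    ideal_cycle_minutes: float  # ideal time to make one unit
    total_units: int            # all units produced, good or bad
    good_units: int             # units that passed quality checks


def oee(record: ShiftRecord) -> float:
    """Return OEE as a fraction using the common A x P x Q definition."""
    run_minutes = record.planned_minutes - record.downtime_minutes
    if record.planned_minutes <= 0 or run_minutes <= 0:
        raise ValueError("planned time must exceed downtime and be positive")
    if record.total_units <= 0 or not 0 <= record.good_units <= record.total_units:
        raise ValueError("unit counts are inconsistent")

    # Availability: share of scheduled time the process actually ran.
    availability = run_minutes / record.planned_minutes
    # Performance: actual output versus ideal output for the time run.
    performance = (record.ideal_cycle_minutes * record.total_units) / run_minutes
    # Quality: share of produced units that were good.
    quality = record.good_units / record.total_units
    return availability * performance * quality


if __name__ == "__main__":
    # Made-up numbers: an 8-hour shift with one hour of unplanned downtime.
    shift = ShiftRecord(
        planned_minutes=480,
        downtime_minutes=60,
        ideal_cycle_minutes=1.0,
        total_units=378,
        good_units=360,
    )
    print(f"OEE: {oee(shift):.2%}")
```

With these made-up inputs, availability is 87.5%, performance is 90%, and quality is about 95.2%, giving an OEE of 75%. The point of the sketch is that downtime, slow running, and defects all reduce the same number, which is why OEE reflects far more than maintenance performance.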
An old friend, Ramesh Gulati, kindly provided me with 12 years of additional field data from the Arnold Engineering Development Complex (AEDC) that further supports the conclusions of Ron Moore’s data described above. This graph shows a decrease in injury rates correlating to a decrease in PM backlogs and Unscheduled Downtime.
The Bigger Picture: Integrating Reliability and Safety for Better Outcomes
The relationship between reliability and safety is complex, but the evidence suggests a strong correlation. While reliability alone does not guarantee safety, reducing failures and unexpected disruptions significantly lowers risk.
As industry leaders like Ron Moore and Ramesh Gulati have shown, improving operational reliability consistently aligns with improved safety outcomes. The challenge lies in adopting a holistic approach that integrates system-level reliability with proactive safety strategies to create a genuinely secure working environment.