How Root Cause Analysis Fits Into a Proactive Reliability Strategy

Introduction to the 4 Quadrants

I recently had a discussion with one of my colleagues in which I attempted to articulate why I believed that a company’s maintenance organization, asset management, and inventory management need to be at a certain level of maturity before they can effectively use root cause analysis (RCA).

My idea was to boycott RCAs for a while and focus on tasks that could cast a wider net for defect elimination and target some of the low-hanging fruit I knew was there. My main sticking point is that if you do not know your equipment, do not have the right parts, and do not have an effective maintenance program, most RCAs will simply identify gaps in these three areas.

Then I said to him, “So, knowing this, why not actively identify and fill in these foundational gaps instead of reacting to them?”

He looked at me and said, “Of course, that makes sense, but root cause analysis sounds sexy and not many people know, want to know, or want to admit how big the gaps are.”

So, then the question becomes: are RCAs proactive, are they reactive, or is it the maturity of the organization’s processes that determines whether RCAs can be used successfully? I believe that, in most organizations, RCAs are an ineffective firefighting exercise that is crushing the will to live of some poor reliability engineer.

If you have read Stephen Covey’s book “The 7 Habits of Highly Effective People,” you will remember that he says to “put first things first” and identifies the 4 quadrants. I have recreated his matrix here, modified for the maintenance department. We will use this tool to help answer the question above.

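[Figure: the 4 quadrants matrix, modified for the maintenance department. Quadrant 1: urgent and important. Quadrant 2: important but not urgent. Quadrant 3: urgent but not important. Quadrant 4: not important and not urgent.]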

Please take a moment to review the matrix above. The firefighter will typically live in quadrants 1 and 3, the procrastinator will gravitate toward quadrant 4, and the proactive person will do what they can to spend their time in quadrant 2. The rest of this article relies on this matrix and the meaning of each of its 4 quadrants.
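The quadrant logic comes down to two yes/no questions: is the task urgent, and is it important? A minimal Python sketch of that classification follows. The task names are hypothetical and exist only to illustrate the idea.

```python
from enum import Enum


class Quadrant(Enum):
    Q1 = "urgent and important"
    Q2 = "important but not urgent"
    Q3 = "urgent but not important"
    Q4 = "not important and not urgent"


def classify(urgent: bool, important: bool) -> Quadrant:
    """Map a task's urgency and importance onto one of the 4 quadrants."""
    if important:
        return Quadrant.Q1 if urgent else Quadrant.Q2
    return Quadrant.Q3 if urgent else Quadrant.Q4


# Hypothetical maintenance tasks, tagged (urgent, important) by a planner.
tasks = {
    "Repair failed production pump": (True, True),
    "Audit PM coverage on critical assets": (False, True),
    "Rush request with no production impact": (True, False),
    "Reorganize an unused storage shelf": (False, False),
}

for name, (urgent, important) in tasks.items():
    quadrant = classify(urgent, important)
    print(f"{quadrant.name} ({quadrant.value}): {name}")
```

The hard part in practice is not the mapping but answering the "important" question honestly, which is where the rest of this article spends its time.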

In his book, Stephen Covey states that quadrant 2 activities effectively reduce quadrant 1 activities. Building on this, I would like to lead off with the following observation. It is a fairly obvious statement, but I want to put it in front of you now.

Proactive work reduces reactive work. Reactive work reduces the time available for proactive work.

Knowing this, we must consciously choose to do proactive activities over reactive ones, or we will spiral into the reactive abyss.  

Root Cause Analysis (RCA)

What is an RCA?

Root cause analysis is a systematic investigation into a failure to identify the underlying cause of the failure. From my experience, most people believe that RCAs will always lead to an apparent root cause, but in practice, they often do not. The best we may be able to do is to identify some probable cause(s) for the failure that can be addressed.

This is why it is vitally important, as part of any RCA, to include a follow-up at some future point to confirm that whatever actions were taken met the defined objective.

When conducting an RCA, we must be very careful to ensure that we fully understand the actions that result from it and that they are both appropriate and achievable with the staff and resources we have access to. What I mean by this is:

  • We do not want to recommend a million-dollar fix to a five-dollar problem (inappropriate).
  • We do not want to assign vaguely scoped tasks. Get into the details: poorly defined tasks often turn out to be bigger than expected (unachievable).

This can very easily happen in the boardroom when tasks are being identified. Most of them will be written as a one-liner on the whiteboard, and everyone at the time will agree that that is what needs to be done. This is where the conversation will stop.

The one-liner will define neither the objectives that mark the task as complete nor the expected labor needed to meet those objectives, so resources cannot be allocated adequately. This lack of detail leads to a growing task backlog and eventually to tasks that never get completed, rendering the RCA a waste of time.
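One way to catch a whiteboard one-liner before it reaches the task register is to require the two missing details up front. The sketch below is a hypothetical Python structure, not a prescribed template. The field names and the example task are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RcaTask:
    description: str
    completion_objective: str = ""      # how we will know the task is done
    estimated_labor_hours: float = 0.0  # expected labor to meet the objective

    def missing_details(self) -> list[str]:
        """List the details a one-liner typically leaves undefined."""
        missing = []
        if not self.completion_objective.strip():
            missing.append("completion objective")
        if self.estimated_labor_hours <= 0:
            missing.append("labor estimate")
        return missing


# A task as it might be written on the whiteboard (hypothetical).
one_liner = RcaTask("Fix pump alignment")

# The same task after the details have been worked out (hypothetical values).
defined = RcaTask(
    "Fix pump alignment",
    completion_objective="Alignment within tolerance, rechecked after 30 days",
    estimated_labor_hours=16,
)

print(one_liner.missing_details())
print(defined.missing_details())
```

A task that still reports missing details is not ready to be assigned, no matter how much agreement there was in the room.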

People often believe that the actions that result from an RCA should prevent the failure from ever recurring. In most instances, this is a fool’s errand. A more realistic aim is to reduce the probability of occurrence to some predefined acceptable level (e.g., 1 failure in 5 years as opposed to 1 per year) and/or to mitigate the consequence when it does occur (e.g., 1 hour of production loss as opposed to 12 hours). Realizing this will produce better task definitions and more attainable outcomes.
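To see what these two levers are worth, the example figures above can be combined into an expected annual production loss. This is a rough Python sketch that assumes the loss is simply the hours lost per failure divided by the years between failures. Real failure behavior is rarely this tidy.

```python
def expected_annual_loss_hours(years_between_failures: float,
                               hours_lost_per_failure: float) -> float:
    """Average production hours lost per year for one failure mode."""
    return hours_lost_per_failure / years_between_failures


# Scenarios built from the example figures in the text.
scenarios = {
    "Baseline (1 per year, 12 h lost)": (1, 12),
    "Lower probability (1 in 5 years, 12 h lost)": (5, 12),
    "Mitigated consequence (1 per year, 1 h lost)": (1, 1),
    "Both (1 in 5 years, 1 h lost)": (5, 1),
}

for label, (years, hours) in scenarios.items():
    loss = expected_annual_loss_hours(years, hours)
    print(f"{label}: {loss:.1f} h/year")
```

Neither lever eliminates the failure, yet in this example either one cuts the expected loss substantially, which is why a predefined acceptable level is a more useful target than zero.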

What drives the decision to start an RCA?

All of the above seems manageable. As long as we avoid the fool’s errand, take the time to define the task objectives well, and understand the labor component so we can assign the appropriate resources, we should have a successful RCA, right?

To answer this, we need to understand three things:

  1. What quadrant(s) are RCAs generated from?
  2. What quadrant do the RCA tasks fall into?
  3. What quadrant is our maintenance organization primarily working from?

What quadrants are RCAs generated from?

From my experience, RCAs are typically the outcome of quadrant 1 or quadrant 3 activities: an unexpected failure that leads to unacceptable downtime and production losses, a safety incident, or a flavor-of-the-day issue. Some of this is fine. No matter how good your organization is, you are always going to have quadrant 1 and 3 activities.

However, if RCAs are to effectively reduce quadrant 1 activities, they cannot be primarily generated from them. As Stephen Covey explained, quadrant 2 activities are what effectively reduce your quadrant 1 activities.

What quadrant(s) do RCA tasks fall into?

So, what quadrant do our RCA tasks fall into? Well… if we have done an excellent job in identifying the tasks, they will fall primarily in quadrant 2, and some may be things that need to get done immediately, which will fall into quadrant 1.

If we have not done a good job completing the RCA, we may find some tasks fall in quadrants 3 and 4; however, if we get lucky, they may fall in quadrants 1 and 2.

Does anyone really want to rely on luck when assigning valuable resources?

This can happen relatively quickly if we overwhelm our available resources with RCAs and the failures are not investigated thoroughly enough (hence my earlier comment about the poor reliability engineer whose will to live is being crushed). If this happens, you are, figuratively speaking, using what resources you have to put a load of lipstick on the pig.

The lipstick is the assigned tasks that are, at best, unknowingly not important or urgent (quadrant 4 activities mistaken for quadrant 2 activities), and, at worst, not important but deemed urgent (quadrant 3 activities mistaken for quadrant 1 activities). These tasks will have no meaningful effect on reducing the probability of future failure, nor will they reduce the impact of failure when it occurs. Even worse, they are now competing for your maintenance resources.

What quadrant is our maintenance organization primarily working from?

It is imperative to understand the current state of the maintenance department and what quadrant drives their workload.

  • What types of tasks are they consumed with on a daily basis?
  • How is the bulk of their work being generated?
  • How is the morale within the group? How do they feel about the maintenance program?
  • Is there high staff turnover?

The answers to these questions will give you an understanding of what quadrant dictates the bulk of their time and whether they have the resources to be able to complete the RCA tasks (quadrant 2) that get generated.

As described in the table, quadrant 1 and 3 activities, especially in a reactive environment, are likely already overwhelming your resources. Some departments, depending on the maturity of their asset and inventory management, and maintenance programs, may be perpetually stuck in a quadrant 1 and 3 death spiral.

Note that age and maturity are not always synonymous. In a department stuck in this spiral, there will be very little room to do any quadrant 2 activities, including assigned RCA tasks. This will lead to a growing task backlog, or to tasks that are slowly modified and eventually removed once it becomes clear that they are not achievable with the available resources. They may also disappear once it is discovered that they would be a waste of time and resources (quadrants 3 and 4 masquerading as quadrants 1 and 2).

Aside from the task management issues noted above, this reactive environment may also be recognized if RCAs are started and the initial investigation identifies that the failure being investigated has occurred multiple times in the past, and possibly, that previous RCAs had investigated it. Another clear indicator is when the RCA finds that the failure could have been avoided or mitigated through a basic care mechanism that should already have been in place (a foundational gap).   
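The repeat-failure indicator is easy to check before an RCA even starts, provided the work history is recorded in some structured form. The following Python sketch counts how often the same failure has occurred on the same asset. The asset IDs, failure descriptions, and record layout are all hypothetical.

```python
from collections import Counter

# Hypothetical work-order history: (asset_id, failure_mode) per repair.
history = [
    ("PUMP-101", "seal leak"),
    ("PUMP-101", "seal leak"),
    ("CONV-007", "belt tracking"),
    ("PUMP-101", "seal leak"),
    ("CONV-007", "bearing failure"),
]


def recurring_failures(records, min_count=2):
    """Return the (asset, failure mode) pairs seen at least min_count times."""
    counts = Counter(records)
    return {key: n for key, n in counts.items() if n >= min_count}


for (asset, mode), n in recurring_failures(history).items():
    print(f"{asset}: '{mode}' has occurred {n} times")
```

If a failure shows up here with a count of three or four, the more useful question is often why the basic care or earlier actions did not hold, rather than what caused this particular event.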

So, the conundrum becomes: if the maintenance team is heavily loaded with primarily quadrant 1 activities, do not expect them to be able to focus on quadrant 2 tasks unless those tasks are explicitly made quadrant 1 tasks, either through some sort of failure or through an upper management demand to prioritize them. Neither scenario is an effective long-term way to generate effective RCA outcomes because, at some point, something’s got to give. Any system, including systems of people or teams, has a maximum load rating.

If you exceed it, the cracks will start to show.

What causes the system to get overloaded?

This is a good time to stop and consider how we might get overloaded with RCAs in the first place. To be frank, this happens when we do not understand the labor required to complete the RCA itself. A properly completed RCA can take several weeks or months of task tracking and follow-ups, with the initial investigation taking several days or weeks to complete. Investigative tasks may include, but are not limited to:

  • Documenting the failure site (pictures, video)
  • Collecting the physical pieces for analysis
  • Gathering witness statements
  • System and parts research
  • Identifying the total cost of the failure, including secondary and tertiary damages
  • Equipment research and identifying/analyzing maintenance data, including:
    • planned maintenance routines
    • parts history
    • work/repair history
    • work instructions, operator instructions
    • remote monitoring data, etc.
  • Identifying similar equipment failures that have occurred within the organization or in industry.
  • Conferring with system experts and vendors
  • Arranging and hosting meetings to gain tribal knowledge and to identify, flesh out, and assign appropriate, achievable tasks.

Note: this can require considerable time depending on how easily accessible the information and people are.

Assuming the RCA was carried out effectively, once the investigation has been completed and all tasks have been assigned, the reliability engineer will be responsible for the task register and for following the progress of each task objective through to completion.

Sometimes, roadblocks will need to be removed or tasks will need to be reimagined to make them achievable. Once the tasks are completed, the reliability engineer is also responsible for scheduling and completing the later (sometimes much later, a year or more) analysis to confirm the tasks were effective and, if they were not, for reopening the case.

Following the task objectives through to completion is an extremely important part of the process. Otherwise, the quadrant 2 tasks can be shelved and stagnate to make room for more pressing quadrant 1 activities, raising the risk of recurrence and nullifying or reducing the value of having carried out the RCA in the first place.

So… to carry out a proper RCA through to completion, we should be looking to assign anywhere from 3-4 weeks of part-time work to the reliability engineer per RCA. This would limit a reliability engineer to effectively handling 12-17 RCAs per year, depending on how heavily loaded they are with other tasks.
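A rough check of this kind of capacity estimate can be done in a few lines of Python. The sketch assumes that RCAs are handled one after another and that somewhere between 48 and 52 weeks of the year are available. Both are illustrative assumptions, so treat the output as an order-of-magnitude guide rather than a staffing rule.

```python
import math


def rcas_per_year(available_weeks: float, weeks_per_rca: float) -> int:
    """Whole RCAs one engineer can carry through in a year."""
    return math.floor(available_weeks / weeks_per_rca)


# Effort per RCA comes from the text; available weeks are an assumption.
for available_weeks in (48, 52):
    for weeks_per_rca in (3, 4):
        count = rcas_per_year(available_weeks, weeks_per_rca)
        print(f"{available_weeks} weeks available, "
              f"{weeks_per_rca} weeks per RCA: {count} RCAs/year")
```

Under these assumptions the results run from 12 to 17 RCAs per year, and any time committed to other duties pushes the number lower.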

This does not include the additional workload the tasks place on the maintenance and production staff. Some tasks that come out of RCAs look more like small projects than tasks that can be quickly picked off throughout a single workday. As stated before, it is vitally important to thoroughly understand the objectives that specify when the task has been completed and the expected labor required to complete the task, to ensure appropriate resources are available. 

How do we avoid overloading the system?

It is crucial to avoid overloading the system with RCAs. Doing so risks generating a workload that appears productive on the surface but ultimately fails to deliver meaningful results. Overloading the system will:

  • Intensify the issue of overloading your staff,
  • Create limited or negative outcomes,
  • Unnecessarily inflate your budget,
  • Compete with higher-priority quadrant 2 tasks, and
  • Undermine confidence and willingness to engage in the process.

If we deem that we have the resources to successfully carry out effective RCAs, we need to be very careful in choosing when to step into the RCA investigation itself. This should be done in two steps:

  1. Ensure we are not taking on more RCAs than our available resources allow. This could be in the form of limiting both the number of RCAs per year and the number per month or per week. If the RCAs are primarily driven by quadrant 1 activities, the reactive nature could easily trigger multiple RCAs in a very short time frame, quickly overloading the system.
  2. Create a sanity check or exit calculation, for example a matrix that compares consequences and probabilities. An item of medium-high production consequence but low probability of recurrence may be documented without proceeding to a full RCA investigation. A different decision may be made for safety or environmental reasons, and depending on the RCAs already in the system, it may be necessary to exit or simplify an existing RCA to make room for another. There are likely vastly different risk appetites, and possibly different documentation regulations, that dictate when to initiate an RCA in each of these three buckets (production, safety, and environmental).

[Figure: consequence versus probability matrix]
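The sanity check described in step 2 can be sketched as a small scoring function. The version below is a hypothetical Python example: the three-level scale, the multiplication of consequence by probability, and the per-bucket thresholds are all assumptions chosen to illustrate different risk appetites, not recommended values.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}

# Hypothetical thresholds: a lower number means a lower tolerance for risk.
RCA_THRESHOLD = {"production": 4, "safety": 2, "environmental": 3}


def screen_event(consequence: str, probability: str,
                 bucket: str = "production") -> str:
    """Decide whether an event warrants an RCA or documentation only."""
    score = LEVELS[consequence] * LEVELS[probability]
    if score >= RCA_THRESHOLD[bucket]:
        return "start RCA"
    return "document only"


# High production consequence, low probability of recurrence.
print(screen_event("high", "low", "production"))

# The same probability, but a safety event with medium consequence.
print(screen_event("medium", "low", "safety"))

# Medium consequence and medium probability in the production bucket.
print(screen_event("medium", "medium", "production"))
```

Whatever scale is used, agreeing on the thresholds in advance is what keeps the decision from being remade under pressure each time something breaks.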

We also need to set up strict rules/sanity checks for any identified tasks and follow-ups to ensure they are both appropriate and achievable.

Conclusion

Are RCAs proactive or reactive? It depends…

At this point, we are getting a better understanding of why the answer to our question is “it depends.”

Utterly reactive

When a maintenance organization primarily operates in quadrants 1 (urgent and important tasks) and 3 (urgent but not important tasks), it indicates low maturity in asset management, inventory management, and maintenance programs. Consequently, Root Cause Analysis (RCA) investigations initiated in this state will repeatedly expose significant foundational gaps within the organization.

In this scenario, RCAs function reactively, addressing failures one at a time without a clear understanding of the broader foundational structure. This reactive approach quickly leads to overwhelm, making RCAs ineffective at best.

By contrast, stepping into quadrant 2 (important but not urgent tasks) and proactively auditing and fixing the systems themselves is far more effective. This approach addresses foundational issues in advance, mitigating the consequences of failures and significantly reducing their likelihood in the first place.

Truly proactive

When a maintenance program and its supporting systems are effectively designed to support maintenance work, it reflects a truly mature system. In such an environment, the RCAs identify smaller gaps, defects, and oversights, gradually refining an already effective system.

Unlike reactive programs, RCAs here are not primarily driven by quadrant 1 activities; instead, they are balanced between quadrant 1 and quadrant 2 tasks. This approach incorporates proactive auditing processes and predictive analysis, enabling the organization to actively engage in 12 to 17 RCAs per year without artificially limiting them.

In this scenario, Failure Modes and Effects Analysis (FMEA) or Reliability Centered Maintenance (RCM) activities may take precedence over RCAs. These approaches focus on improving critical equipment uptime and addressing persistent “bad actors” in the system, driving overall performance improvements.

This proactive methodology enables the maintenance team to effectively handle and complete the quadrant 2 tasks that arise, ensuring that the program continually evolves and sustains its effectiveness.

Are there better options?

For the proactive

If your existing maintenance and support programs are effective, your maintenance department primarily operates in quadrant 2, and your RCAs are generated from a balance between breakdowns and quadrant 2 exercises, your RCAs should perform as expected.

However, if issues arise, ensure the following:

  • RCAs are given adequate time for effective execution.
  • Investigators have the appropriate training, experience, and support.
  • Tasks are clearly defined, achievable, and include specific objectives, so assignees know when they’re completed.
  • Labor estimates are realistic and account for available resources.
  • Task tracking, verification of completion, and follow-ups to evaluate longer-term objectives—such as mitigating consequences and reducing recurrence probability—are conducted to a high standard.

This approach will sustain and gradually improve an already functional maintenance program.

For the reactive

If your program is primarily reactive, limiting RCAs to prioritize quadrant 2 activities becomes essential. These activities cast a wider net, addressing system gaps before they cause more issues.

While it may not be feasible to abandon RCAs entirely—especially for high-consequence, high-recurrence events—it is advisable to limit them to what is absolutely necessary. Too many RCAs will have little effect on failure/consequence mitigation. They may create more harm to the organization than good by competing with already strained resources and reducing people’s confidence in the process. It will be seen as another exercise in pencil-whipping and will result in a collective “arghhh…” whenever it is mentioned or initiated.

In this scenario, consider increasing staffing and splitting the maintenance department into two work streams:

  • Quadrant 1 Stream – Focused on frontline firefighting and immediate issues.
  • Quadrant 2 Stream – Dedicated to fixing system gaps and creating long-term efficiencies (repairing the foundation).

With this setup, daily operations can continue as usual. However, the quadrant 2 group’s proactive efforts will gradually reduce quadrant 1 activities, leading to fewer downtime events and faster repair times, thereby increasing machine availability.

From experience, much of the firefighting stems from missing or inaccessible information, turning what should be 5-minute jobs into hours—or even days—of extra work or rework. Addressing these foundational gaps transforms these extended tasks back into the 5-minute jobs they were meant to be, freeing up resources for valuable quadrant 2 efforts.

Targeting Efficiency Through Quadrant 2

Quadrant 2 exercises can range from simple to advanced gap analysis audits. The overarching goal is to enhance maintenance efficiency, uptime, and consequence mitigation by:

  • Identifying and addressing maintenance data gaps.
  • Making critical information easily accessible.
  • Ensuring equipment receives basic care.
  • Tackling “failures waiting to happen” in bulk rather than one at a time.

This proactive approach results in:

  • Reduced quadrant 1 activities.
  • Lower overall workloads for the maintenance department.
  • Fewer RCAs.

By focusing efforts on repairing the foundation and supporting proactive maintenance strategies, your organization can transition from firefighting to long-term efficiency and success.

Practical Tools and Templates for Better RCA

Quadrant 2 Activities to Fix Your Maintenance Foundation

Quadrant 2 tasks focus on proactive audits and gap analyses to strengthen your maintenance foundation. These activities can help identify and resolve critical inefficiencies.

  1. Asset Register (Targets Maintenance Efficiency)
    • Do we have an asset register? Is our basic equipment information (make, model, serial number, capacity) readily available for use?
    • Is the register up to date, or are we still assigning time and inventory space to equipment that no longer exists?
    • Are there multiple assets of the same make and model? Identical equipment can create significant efficiencies, such as shared parts BOMs, documentation, maintenance routines, and training programs.
  2. Planned Maintenance (PM) or Condition-Based Maintenance (CbM) Audits (Targets Uptime)
    • Is there adequate PM/CbM coverage across registered equipment? Are we providing basic care to the equipment we rely on for production?
    • Do we have defined maintenance strategies, and are they appropriately selected for each piece of equipment? Strategies may include:
      • Condition-based monitoring
      • Remote monitoring
      • Basic care/manufacturer-recommended maintenance
      • Time-based discard/rebuilds
      • Run-to-failure
      • Operator checks
    • Are PM tasks appropriate and effective? More advanced audits should assess:
      • Are tasks addressing specific failure modes?
      • Are tasks redundant?
      • Is the frequency appropriate?
  3. Bill of Materials (BOM) (Targets Maintenance Efficiency: Improves Planning, Repair Time, and Inventory Optimization)
    • BOM gap analysis: Does every piece of equipment have an associated BOM?
    • Critical BOM gap analysis: Are essential components included? For example:
      • Pumps should include motors and couplings.
      • Conveyors should include belts, motors, and drive components.
    • Are the critical items needed for quick repairs documented in the equipment BOM?
  4. Inventory Analysis (Targets Uptime)
    • Are replacement parts providing their expected lifespan?
      • Could vendors recommend higher-quality parts or alternatives better suited to the application?
    • Are inventory failure patterns indicating:
      • Premature failures due to installation issues or training gaps?
      • Consistent failure age, suggesting opportunities for time-based discard tasks or rebuilds?
  5. Equipment Mean Time Between Failure (MTBF) and Bad Actor Analysis (Targets Uptime)
    • MTBF/bad actor analysis typically ties back to the audits mentioned above, but may also reveal:
      • Operational or training issues.
      • Equipment running beyond rated capacity, which may identify bottlenecks in the process.
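Two of the audits above lend themselves to a quick first pass once the data can be exported: the PM and BOM coverage checks, and the MTBF/bad actor ranking. The Python sketch below uses hypothetical asset IDs and figures, and it assumes MTBF is calculated as operating hours divided by the number of failures.

```python
def coverage_gaps(register: set[str], covered: set[str]) -> list[str]:
    """Registered assets that are missing from a coverage list."""
    return sorted(register - covered)


def mtbf_hours(operating_hours: float, failure_count: int) -> float:
    """Mean time between failures; infinite if nothing has failed yet."""
    if failure_count == 0:
        return float("inf")
    return operating_hours / failure_count


# Hypothetical exports from the asset register, PM program, and BOM lists.
asset_register = {"PUMP-101", "PUMP-102", "CONV-007", "FAN-220"}
assets_with_pm = {"PUMP-101", "CONV-007", "FAN-300"}
assets_with_bom = {"PUMP-101", "PUMP-102"}

print("No PM/CbM coverage:", coverage_gaps(asset_register, assets_with_pm))
print("No BOM:", coverage_gaps(asset_register, assets_with_bom))
# PMs pointing at equipment that is not in the register (possibly retired).
print("PM without asset:", coverage_gaps(assets_with_pm, asset_register))

# Hypothetical (operating hours, failure count) per asset.
failure_data = {
    "PUMP-101": (8000, 8),
    "PUMP-102": (8000, 1),
    "CONV-007": (6000, 4),
}

# Lowest MTBF first: these are the candidate bad actors.
bad_actors = sorted(failure_data, key=lambda a: mtbf_hours(*failure_data[a]))
for asset in bad_actors:
    print(f"{asset}: MTBF {mtbf_hours(*failure_data[asset]):.0f} h")
```

Neither check replaces a proper audit, but both surface gaps in bulk, which is the point of working in quadrant 2 rather than finding them one failure at a time.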

This list is not exhaustive but provides actionable ideas to refine your maintenance foundation. By prioritizing these proactive quadrant 2 activities, organizations can enhance efficiency, reduce downtime, and achieve sustainable improvements.

Author

  • Mike Arsenault

    Mike Arsenault is a seasoned maintenance and reliability professional with more than 20 years of experience across MRO, asset management, and project leadership. A former Marine Engineering Technician in the Royal Canadian Navy, he has worked on submarines, complex industrial systems, and enterprise-scale asset strategies. His background spans reliability engineering, project management, fabrication and design, and dual Red Seal certifications as an Industrial Mechanic/Millwright and Industrial Electrician. As founder of Rockland Physical Asset Management, he helps small and mid-sized businesses build strong maintenance foundations through better data, documentation, and disciplined asset management practices.
