How MTTR Insights Drive the Shift from Reactive to Proactive Maintenance

MTTR (Mean Time To Repair) is the average duration needed to restore a failed system or component to full functionality after a breakdown.

As with MTBF and MTTF, an “average” can be a very dangerous number when examining MTTR data. The time to recover from a failure may be highly variable, depending on both the failure mode and the level at which the failure is defined and evaluated: equipment unit, sub-unit, component/maintainable item, or part, as defined by ISO 14224.
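To illustrate why a single average can mislead, here is a minimal Python sketch that compares an overall MTTR with MTTR broken out by failure mode. The failure modes, repair durations, and the function name `mttr_by_failure_mode` are all hypothetical, invented purely for illustration; real data would come from your own work-order history.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

# Hypothetical repair records: (failure_mode, hours_to_repair).
# The modes and durations are invented for illustration only.
repairs = [
    ("seal leak", 2.0),
    ("seal leak", 3.0),
    ("seal leak", 2.5),
    ("bearing seizure", 10.0),
    ("bearing seizure", 30.0),
    ("bearing seizure", 14.0),
]


def mttr_by_failure_mode(records):
    """Group repair times by failure mode and summarize each group."""
    grouped = defaultdict(list)
    for mode, hours in records:
        grouped[mode].append(hours)
    return {
        mode: {
            "count": len(hours),
            "mttr": mean(hours),
            "median": median(hours),
            "spread": pstdev(hours),  # population standard deviation
        }
        for mode, hours in grouped.items()
    }


# One average across every repair, regardless of failure mode.
overall_mttr = mean(hours for _, hours in repairs)
summary = mttr_by_failure_mode(repairs)

print(f"overall MTTR: {overall_mttr:.2f} h")
for mode, stats in summary.items():
    print(f"{mode}: MTTR {stats['mttr']:.2f} h, "
          f"median {stats['median']:.2f} h, spread {stats['spread']:.2f} h")
```

With this made-up data, the overall MTTR is 10.25 hours, yet the two failure modes average 2.5 and 18 hours. The blended figure describes neither of them, which is the hazard the paragraph above warns about.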

In addition, it’s important to note that MTTR is significantly affected by an organization’s planning, scheduling, coordination, and execution of maintenance work.

Reactive vs. Condition-Based Maintenance: Assessing the Impact

Reactive maintenance jobs are MTTR killers. When a machine fails without warning, the organization may not be well-positioned to execute needed repairs promptly. This is especially true if the job requires one or more of the following:

Erection of scaffolding or completing pre-work preparation
Expediting parts or materials (particularly those with long lead times)
Machining or fabricating parts
Building or repairing the foundation or other civil/structural elements
Procuring special tools or equipment, e.g., a large capacity crane, to complete the job
Hiring specialized or scarce labor

In a reactive scenario, where the machine fails without warning, all of those activities must be completed while the machine is down and production is slowed or halted (Figure 1). In that scenario, all aspects of the planning, preparation, and execution of work are “internal” activities, meaning they’re completed while the machine is down and not producing.

Figure 1. In a reactive maintenance job, all tasks required to prepare for and execute corrective actions must be completed while the machine is offline.

Making matters worse, reactive jobs often involve a great deal of collateral or secondary damage, which increases the scope of the work needed to prepare for and execute corrective maintenance actions. Larger jobs are more costly and generally take more time, which further drives up the MTTR and the variability about the mean.

Compounding the problem is that in reactive situations, we’re often required to “make do” with sub-optimal parts, which will work in a pinch but aren’t ideal. This can adversely affect reliability in both the short and long term.

In the short term, using sub-optimal parts or materials sets the stage for another failure. In the long term, if the organization lacks accurate bills of materials (BOMs) and operates as a like-for-like “parts changer,” the improvised sub-optimal parts become the de facto standard and are installed again the next time the same failure occurs.

This vicious cycle can repeat many times before a root-cause analysis (RCA) is scheduled to resolve what appears to be a mysterious new repeat-failure problem.

In truly desperate situations, we also may be forced to raid the “junkyard” of failed components to scavenge parts.

Advance warning transforms repair chaos into order.

Inspection rounds, condition monitoring such as vibration analysis, oil analysis, ultrasonic analysis, infrared thermographic analysis, motor analysis, etc., and nondestructive testing (NDT) can provide advance warning of failure and significantly reduce the Time To Repair (TTR) for a maintenance job.

Advance warning enables planners to separate internal activities, i.e., those that can only be completed while the machine is stopped, from “external” activities that can be completed as preparatory work while the machine is running.

This approach can considerably reduce MTTR and the associated adverse impact on production time (Figure 2). It also decreases variability about the MTTR for standardized repair jobs in response to commonly observed failure modes.

Figure 2. Advance warning enables advance planning and reduces machine downtime and MTTR.
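The internal/external distinction can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not a planning tool: the task names, durations, and the `needs_machine_stopped` flag are hypothetical, and the tasks are assumed to run one after another with no overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    hours: float
    needs_machine_stopped: bool  # True marks "internal" work


# Hypothetical job plan; task names and durations are illustrative only.
job = [
    Task("expedite replacement parts", 24.0, False),
    Task("erect scaffolding", 6.0, False),
    Task("stage crane and special tools", 4.0, False),
    Task("remove and replace failed component", 8.0, True),
    Task("align, test, and prep for startup", 3.0, True),
]


def reactive_downtime(tasks):
    """No warning: every task is performed while the machine is down."""
    return sum(t.hours for t in tasks)


def planned_downtime(tasks):
    """Advance warning: only internal tasks consume downtime,
    because external tasks are finished while the machine still runs."""
    return sum(t.hours for t in tasks if t.needs_machine_stopped)


print(f"reactive downtime: {reactive_downtime(job):.1f} h")
print(f"planned downtime:  {planned_downtime(job):.1f} h")
```

For this invented job, the reactive case totals 45 hours of downtime while the planned case totals 11 hours. The repair work itself is identical; only the timing of the preparatory tasks changes.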

Activity-Level MTTR Tracking for Enhanced Maintenance Insights

In addition to tracking MTTR at appropriate levels in the ISO 14224 hierarchy and by different failure modes, tracking machine downtime by maintenance activity makes sense.

For example, how much time was spent completing pre-work while the machine was shut down? How much time was spent finding parts? How much time was spent finding tools? How much time was spent finding people? How much time was spent awaiting permits?
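One possible way to answer these questions is to log downtime against activity categories and total the hours for each. The sketch below assumes a hypothetical log format of (activity, hours) pairs; the category names and figures are invented for illustration, and the function names are not from any particular maintenance system.

```python
from collections import Counter

# Hypothetical downtime log for one repair job: (activity, hours).
# Categories and durations are invented for illustration only.
downtime_log = [
    ("awaiting permits", 1.5),
    ("finding parts", 4.0),
    ("finding tools", 1.0),
    ("active repair", 6.0),
    ("finding parts", 2.0),
    ("testing and restart prep", 1.5),
]


def downtime_by_activity(log):
    """Total the downtime hours recorded against each activity."""
    totals = Counter()
    for activity, hours in log:
        totals[activity] += hours
    return totals


def downtime_shares(totals):
    """Express each activity's hours as a fraction of total downtime."""
    grand_total = sum(totals.values())
    return {activity: hours / grand_total for activity, hours in totals.items()}


totals = downtime_by_activity(downtime_log)
shares = downtime_shares(totals)

# List activities from largest to smallest share of downtime.
for activity, hours in totals.most_common():
    print(f"{activity}: {hours:.1f} h ({shares[activity]:.0%})")
```

In this made-up log, finding parts consumes as many hours as the active repair itself (6 of 16 hours each), which would point toward kitting and materials management as the first place to look.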

Tracking performance at the maintenance activity level can reveal deficiencies and opportunities to improve planning, preparation, materials management, and other aspects of the corrective maintenance process. Ideally, we detect impending failures before they occur and plan and schedule accordingly, and we must strive to predict these events to the best of our ability.

Doing so is not always possible. Unplanned events will occur, so we must expect the unexpected and prepare for it as proactively as possible, with standard parts kits and standardized corrective work plans that let us address reactive jobs as effectively as possible.

Can Repair Time Be Too Low?

We generally think of MTTR as a metric that we want to minimize. For the most part, that’s true. However, the active maintenance time where we’re executing repairs, testing, and prepping for startup can be an exception.

In a reactive maintenance scenario, when the team is under the gun to get the machine back up, running, and producing, there is often much pressure to hurry the maintenance process. This can lead to shortcuts, particularly in the area of precision maintenance.

If, among other things, we circumvent precision fastening practices (for threaded fasteners, welds, etc.), alignment of shafts, alignment of pulleys and sheaves, tensioning of belts, precision balancing of rotating assemblies, proper lubrication, and contamination control, we’re setting the stage for the next failure. The same is true if we fail to test properly before bringing a machine back online.

Transform downtime into productive work to slash repair times.

Your best strategy for driving down TTR is to convert internal work that can only be completed while the machine is down to external work that can be completed while the machine is running. Shortcuts that compromise maintenance quality rarely pay off. Moreover, in some cases, hurry-up pressure can produce safety shortcuts, which is never a good thing.

Mean Time To Repair (MTTR) is an essential measure of an organization’s preparedness to complete maintenance work. It also reveals how effectively the organization inspects equipment to uncover potential failures early enough to complete the necessary preparatory work, so that downtime is limited to active repair time and the testing needed to get the machine back online.

Like any average, though, MTTR can be a misleading number. Analyzing MTTR at the proper asset hierarchy level and on a failure-mode-by-failure-mode basis is important. It’s also helpful to evaluate the time required to execute each stage of the maintenance job, from diagnosis to testing and prep for restart, to find opportunities to improve efficiency.