Why Run to Failure Maintenance Backfires on Critical Assets

by Reliable Media, Alison Field


Some equipment genuinely belongs on a run to failure strategy. Light bulbs, for instance. Disposable filters. Components where the replacement cost is trivial and the failure consequence is negligible. The problem starts when organizations apply that same logic to assets where the risks of run to failure maintenance are anything but trivial.

A run to failure approach means operating equipment with no scheduled intervention until it breaks. For the right assets, that makes economic sense. For the wrong ones, it creates a cascade of unplanned downtime, emergency spending, and safety exposure that compounds over months and years.

What Run to Failure Maintenance Risks Actually Look Like

The theory behind run to failure sounds efficient: why spend money maintaining something that works fine? The reality is that “works fine” has an expiration date, and most organizations have no idea when that date arrives.

Unplanned failures carry costs that planned maintenance avoids:

Emergency labor runs 1.5x to 3x regular labor rates, depending on the shift and the urgency.
Expedited parts shipping adds 30% to 200% over standard procurement prices.
Production losses during unplanned downtime can reach $5,000 to $50,000 per hour in process industries.
Collateral damage to adjacent components multiplies the repair scope. A failed bearing that goes undetected destroys the shaft, the seal, and sometimes the housing. Early failure detection prevents exactly this kind of escalation.
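To see how those multipliers stack up on a single job, here is a minimal sketch in Python. The `RepairJob` fields and the example figures (16 labor hours, a $60 rate, $4,000 in parts, 8 hours of downtime) are hypothetical. Only the multiplier ranges and the $5,000 per hour downtime figure come from the list above. The sketch also assumes a planned repair fits into a scheduled outage and causes no production loss, and it leaves collateral damage out entirely.

```python
from dataclasses import dataclass


@dataclass
class RepairJob:
    """Hypothetical inputs for one repair; every figure is illustrative."""
    labor_hours: float
    labor_rate: float              # regular hourly rate, $
    parts_cost: float              # standard procurement price, $
    downtime_hours: float          # outage length if the failure is unplanned
    downtime_cost_per_hour: float  # production loss, $ per hour


def planned_cost(job: RepairJob) -> float:
    """Scheduled repair: regular labor and standard parts.

    Assumption: the work fits a planned outage, so no production loss.
    """
    return job.labor_hours * job.labor_rate + job.parts_cost


def unplanned_cost(job: RepairJob, labor_multiplier: float,
                   parts_premium: float) -> float:
    """Emergency repair: premium labor, expedited parts, lost production."""
    labor = job.labor_hours * job.labor_rate * labor_multiplier
    parts = job.parts_cost * (1 + parts_premium)
    downtime = job.downtime_hours * job.downtime_cost_per_hour
    return labor + parts + downtime


job = RepairJob(labor_hours=16, labor_rate=60.0, parts_cost=4000.0,
                downtime_hours=8, downtime_cost_per_hour=5000.0)

print(f"planned:          ${planned_cost(job):,.0f}")               # $4,960
print(f"unplanned (low):  ${unplanned_cost(job, 1.5, 0.30):,.0f}")  # $46,640
print(f"unplanned (high): ${unplanned_cost(job, 3.0, 2.00):,.0f}")  # $54,880
```

Even at the low end of each range, the downtime term dominates. That is why the labor and parts premiums, painful as they are, usually matter less than how long the line stays down.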

A chemical plant in Louisiana ran its cooling water pumps to failure for three years. The maintenance team argued the pumps were “redundant” and the backup would cover any failure. It did, until both pumps failed within the same week. The resulting production loss exceeded $400,000. The planned maintenance that would have prevented both failures was estimated at $12,000 annually.
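The arithmetic behind that example is worth spelling out. The short sketch below uses only the figures quoted above and makes two assumptions: the $12,000 estimate would have held steady for all three years, and the $400,000 is treated as a lower bound that excludes repair costs.

```python
annual_pm_cost = 12_000      # estimated planned maintenance, $ per year
years_deferred = 3           # pumps ran to failure for three years
production_loss = 400_000    # lower bound: the loss "exceeded" this figure

# What the plant avoided spending by skipping planned maintenance.
skipped_pm_spend = annual_pm_cost * years_deferred

# What the decision cost after subtracting those "savings".
net_loss = production_loss - skipped_pm_spend

# Dollars lost per dollar "saved".
ratio = production_loss / skipped_pm_spend

print(f"skipped maintenance spend: ${skipped_pm_spend:,}")  # $36,000
print(f"net loss:                  ${net_loss:,}")          # $364,000
print(f"loss per dollar saved:     {ratio:.1f}x")           # 11.1x
```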

Redundancy covers single failures. Run to failure strategies create the conditions for multiple failures to cluster.

That clustering effect catches many organizations off guard. Equipment in the same service, purchased at the same time, operating under the same conditions, tends to fail in waves. One pump failure is an inconvenience. Three pump failures in the same month is a crisis.

When Run to Failure Maintenance Risks Become Safety Hazards

Cost overruns get management’s attention. Safety incidents get regulators’ attention. Run to failure strategies applied to the wrong equipment create both.

Pressure relief valves, emergency shutdown systems, fire suppression components, and structural supports share a common characteristic: their failure has consequences that extend beyond production losses. Applying a run to failure approach to safety-critical equipment violates most regulatory frameworks and every reasonable engineering standard.

Equipment that protects people should never be on a strategy that waits for failure as the trigger for action.

Even on non-safety-critical assets, run to failure approaches generate indirect safety risks. A failed gearbox on a conveyor can send debris across a work area. A seized fan motor can overheat electrical enclosures. A broken coupling can release stored rotational energy in unpredictable directions.

These secondary failure modes rarely appear in the risk assessment that justified the run to failure decision in the first place. The original analysis considered the direct consequence of the asset stopping. It rarely considered what happens when the asset stops violently, at full speed, with no warning.

Insurance carriers have started paying attention to this pattern. Facilities with high percentages of run to failure assets face elevated premiums and, in some cases, coverage exclusions for equipment that lacked documented maintenance strategies. The underwriters’ logic is straightforward: if you chose to let it fail, you accepted the risk, and they’d prefer you carry it alone.

Identifying Equipment That Belongs on a Proactive Strategy

The fix starts with honest asset criticality analysis. Every piece of equipment in the facility needs a clear classification based on three factors:

Consequence of failure: safety impact, environmental impact, production impact, repair cost.
Failure predictability: can vibration analysis, oil analysis, thermography, or other condition monitoring technologies detect degradation before failure occurs?
Failure pattern: does the equipment exhibit wear-out characteristics (increasing failure rate over time) or random failure patterns?

Equipment with high consequences, detectable degradation, and wear-out patterns belongs on a predictive maintenance strategy. Equipment with low consequences, random failure modes, and cheap replacement costs can stay on run to failure.
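Those two rules translate into a simple first-pass screen. The sketch below is one possible encoding, not a standard method: the `Asset` fields, the yes/no simplification of each factor, and the `NEEDS_REVIEW` bucket for mixed cases are all assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    PREDICTIVE = "predictive maintenance"
    RUN_TO_FAILURE = "run to failure"
    NEEDS_REVIEW = "needs case-by-case review"  # assumption: mixed cases


@dataclass
class Asset:
    """Hypothetical yes/no simplification of the classification factors."""
    name: str
    high_consequence: bool        # safety, environmental, production, repair cost
    degradation_detectable: bool  # condition monitoring can see it coming
    wears_out: bool               # increasing failure rate over time
    cheap_to_replace: bool


def recommend(asset: Asset) -> Strategy:
    """First-pass screen based on the two clear-cut cases in the text."""
    if asset.high_consequence and asset.degradation_detectable and asset.wears_out:
        return Strategy.PREDICTIVE
    if (not asset.high_consequence and not asset.wears_out
            and asset.cheap_to_replace):
        return Strategy.RUN_TO_FAILURE
    # Anything in between gets a human decision rather than a default.
    return Strategy.NEEDS_REVIEW


pump = Asset("cooling water pump", True, True, True, False)
bulb = Asset("light bulb", False, False, False, True)

print(pump.name, "->", recommend(pump).value)
print(bulb.name, "->", recommend(bulb).value)
```

The review bucket is deliberate. An asset with high consequences but no detectable degradation, for example, fits neither rule cleanly and deserves a documented decision instead of a silent default.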

Run to failure works when the cost of preventing failure exceeds the cost of allowing it. For most rotating equipment, that math favors prevention by a wide margin.
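That comparison can be written down as a rough break-even check. The function below is a sketch under stated assumptions: failure cost is modeled as a simple expected value (failures per year times cost per failure), and `residual_failure_fraction` is a hypothetical parameter for the share of failures that still occur despite prevention. All example numbers are invented.

```python
def run_to_failure_is_cheaper(failures_per_year: float,
                              cost_per_failure: float,
                              annual_prevention_cost: float,
                              residual_failure_fraction: float = 0.0) -> bool:
    """Return True when preventing failure costs more than allowing it.

    Assumption: expected annual failure cost = rate * cost per failure.
    """
    cost_of_allowing = failures_per_year * cost_per_failure
    cost_of_preventing = (annual_prevention_cost
                          + cost_of_allowing * residual_failure_fraction)
    return cost_of_preventing > cost_of_allowing


# Hypothetical light fixture: fails twice a year, $15 to swap,
# while an inspection program would cost $200 a year.
print(run_to_failure_is_cheaper(2, 15, 200))             # True

# Hypothetical process pump: one failure every two years at $60,000,
# $4,000 a year of monitoring that still misses 20% of failures.
print(run_to_failure_is_cheaper(0.5, 60_000, 4_000, 0.2))  # False
```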

The analysis usually reveals that 10% to 20% of a plant’s assets are legitimately suited for run to failure. The rest need some form of proactive care, whether that’s time-based preventive maintenance, condition-based monitoring, or a combination of both.

Plants that skip this classification step often default to the worst possible approach: informal run to failure on everything, with reactive maintenance disguised as strategy. The maintenance team knows certain equipment will fail. Management knows certain equipment will fail. Nobody documents the decision or calculates the expected cost, so the organization absorbs repeated emergency expenses without ever connecting them to the policy that caused them.

Moving Away from Reactive Operations

Transitioning from run to failure to proactive maintenance requires a phased approach. Attempting to implement condition monitoring on every asset simultaneously overwhelms the maintenance team and the budget.

Start with the top 20 assets by criticality. Implement condition monitoring on those first. Build the skills, the data collection routines, and the analysis capability on a manageable scope before expanding.
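One way to plan that rollout is to rank assets by criticality and cut the list into waves of 20. The sketch below assumes each asset already has a numeric criticality score. The scoring scale and the asset names are hypothetical, and the wave size simply mirrors the "top 20" starting point.

```python
def rollout_waves(scores: dict[str, float],
                  wave_size: int = 20) -> list[list[str]]:
    """Split assets into rollout waves, most critical first.

    `scores` maps asset name to a criticality score (higher = more critical).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [ranked[i:i + wave_size] for i in range(0, len(ranked), wave_size)]


# Hypothetical plant with 45 scored assets.
scores = {f"asset-{i}": float(i) for i in range(45)}
waves = rollout_waves(scores)

print(len(waves))       # 3
print(waves[0][:3])     # ['asset-44', 'asset-43', 'asset-42']
print(len(waves[-1]))   # 5
```

Only the first wave gets condition monitoring at the start. Later waves wait until the team has the skills and routines to absorb them.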

The goal is controlled migration. Plants that try to go from fully reactive to fully predictive in six months usually end up back where they started.

Track the transition with one simple metric: the share of work orders that are unplanned. A plant running 80% unplanned work is deeply reactive. Bringing that number down to 60% within the first year, then 40% by year two, represents meaningful progress. World-class operations run at 90% planned work (10% unplanned) or better, but that takes years of sustained effort.
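Computing that metric takes little more than a flag on each work order. The sketch below assumes every work order carries a planned/unplanned marker. The input format and the "in transition" label are assumptions, while the 80% and 90% planned reference points come from the paragraph above.

```python
from typing import Iterable


def unplanned_share(planned_flags: Iterable[bool]) -> float:
    """Percentage of work orders that were unplanned.

    Each item is True for a planned work order, False for an unplanned one.
    """
    flags = list(planned_flags)
    if not flags:
        raise ValueError("no work orders to evaluate")
    unplanned = sum(1 for planned in flags if not planned)
    return 100.0 * unplanned / len(flags)


def describe(share: float) -> str:
    """Label a plant against the reference points quoted in the text."""
    if share >= 80:
        return "deeply reactive"
    if share <= 10:
        return "world-class range"
    return "in transition"  # assumption: everything in between


# Hypothetical month: 2 planned and 8 unplanned work orders.
month = [True] * 2 + [False] * 8
share = unplanned_share(month)
print(f"{share:.0f}% unplanned: {describe(share)}")  # 80% unplanned: deeply reactive
```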

Budget conversations shift dramatically once the data exists. When leadership can see that reactive maintenance consumed $2.3 million last year while the proposed condition monitoring program costs $180,000 annually, the approval discussion changes from “Can we afford this?” to “How did we justify the alternative for so long?”
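The break-even point for those two figures is easy to check. The sketch below assumes the $2.3 million is the full reactive spend and that the program's only benefit is reducing it, which understates the case if downtime and safety gains are counted separately.

```python
reactive_spend = 2_300_000   # last year's reactive maintenance, $
program_cost = 180_000       # proposed condition monitoring, $ per year

# Fraction of reactive spend the program must eliminate to pay for itself.
break_even_reduction = program_cost / reactive_spend

print(f"break-even reduction in reactive spend: {break_even_reduction:.1%}")  # 7.8%
```

In other words, the program covers its own cost if it removes roughly one dollar in every thirteen of reactive spending.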

The maintenance team needs training to support the transition. Technicians accustomed to reactive work require different skills for proactive strategies: data collection techniques, basic analysis interpretation, and the discipline to follow condition-based work orders even when the equipment appears to be running normally. That training investment often pays for itself within the first year through reduced emergency callouts alone.

The run to failure maintenance risks your organization carries today represent decisions made (or avoided) in the past. Changing those decisions takes leadership commitment, honest asset classification, and the patience to build proactive capability one system at a time. The equipment will tell you what it needs. The question is whether you’re listening before or after it fails.

Authors

Reliable Media

Reliable Media simplifies complex reliability challenges with clear, actionable content for manufacturing professionals.
Alison Field

Alison Field captures the everyday challenges of manufacturing and plant reliability through sharp, relatable cartoons. Follow her on LinkedIn for daily laughs from the factory floor.