Stop Setting Condition Monitoring Alarm Thresholds and Forgetting Them


The “install and wait” approach to condition monitoring is a classic trap. You’ve got sensors on your assets. You’ve got dashboards populating. You’ve got alarm thresholds configured. It feels proactive. But if your only trigger for action is an alarm, you are essentially waiting for the machine to tell you it’s already failing.

That’s a reactive culture wearing a digital costume.

An alarm tells you that a threshold was crossed. It doesn’t tell you why. Without trending the data leading up to that point, you can’t distinguish between a sudden catastrophic break and a slow, manageable wear pattern that’s been developing for months. And that distinction is everything when it comes to planning your response.
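To make the distinction concrete, here's a minimal sketch (Python, with made-up bearing-temperature readings) of the kind of triage that trending enables: check whether the excursion built up gradually or arrived as a step, then estimate the drift rate.

```python
import numpy as np

def classify_excursion(readings, alarm_value, step_ratio=0.5):
    """Rough triage of an alarm: gradual wear trend vs. sudden step change.

    readings: equally spaced historical values leading up to the alarm.
    alarm_value: the value that tripped the threshold.
    step_ratio: fraction of the total excursion allowed to arrive in the
                final jump before we call it a step change (illustrative).
    """
    readings = np.asarray(readings, dtype=float)
    excursion = alarm_value - readings[0]
    last_jump = alarm_value - readings[-1]

    # How much of the excursion arrived in the very last reading?
    if excursion > 0 and last_jump / excursion > step_ratio:
        return "sudden step change -- inspect immediately"

    # Otherwise estimate the drift rate with a least-squares slope.
    t = np.arange(len(readings))
    slope = np.polyfit(t, readings, 1)[0]
    return f"gradual trend (~{slope:.2f} units/sample) -- plan the repair"

# Hypothetical weekly bearing-temperature readings leading up to an alarm.
history = [61.0, 61.4, 62.1, 62.9, 63.8, 64.9, 66.1, 67.5]
print(classify_excursion(history, alarm_value=68.5))
```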


The Data Graveyard Problem

I’ve seen many new programs try to “boil the ocean” by installing sensors everywhere at once, without properly implementing the program or having a champion to drive it. The result is predictable: the entire project gets scrapped before it yields value.

With the vast amounts of data we can generate today, we’ve in some cases thrown out common sense and years of asset reliability training and tribal knowledge. People have fallen into a false sense of safety and reliability because they have access to data. But access to data and the ability to act on it are two very different things.

An unprioritized condition monitoring program quickly becomes a data graveyard. And data graveyards produce very specific symptoms:

  • Alarm fatigue. Too many alerts, too often, with too little meaning. The team stops responding because they’ve been burned too many times by nuisance alarms.
  • Analysis paralysis. So much data that analysts can’t prioritize what matters. Everything looks urgent, so nothing gets the attention it deserves.
  • Economic waste. Sensors, software licenses, analyst time, and IT infrastructure are all being consumed without producing actionable outcomes.
  • False busy-ness. The team looks productive because they’re reviewing data and generating reports. But the equipment keeps failing because the data isn’t being converted into maintenance actions.

The end result is diminishing return on investment and dilution of maintenance resources. More data didn’t equal more reliable. It just created more noise.

More data doesn’t equal more reliable. Quality of data over quantity is paramount.

Dashboards, Alerts, and Experts: Know the Difference

There’s a saying I subscribe to: dashboards tell you the history, alerts tell you the emergency, but experts tell you the future. Understanding those three layers is critical to building a condition monitoring program that actually works.

Dashboards

Dashboards are visual, high-level summaries. They provide a big-picture view of your assets at any given moment. They’re excellent for trend analysis (seeing a bearing temperature slowly creep up over six months), system context (recognizing that a motor is running hot because the pump it’s driving is clogged), and accountability (giving management a single source of truth for plant health).

But dashboards are passive. They require someone to look at them. If nobody checks the screen, a developing problem goes unnoticed. And too much data on a dashboard leads to visual noise, where users start ignoring subtle changes because the display is overwhelming.

Alerts

Alerts are reactive notifications triggered when a specific threshold is crossed. They're efficient: technicians don't waste time watching monitors, and they're notified immediately via SMS, email, or the CMMS when a known issue appears.

But alerts suffer from two chronic problems. First, alert fatigue: if thresholds are set too low, nuisance alarms cause staff to start ignoring or silencing them. Second, lack of context: an alert tells you that a limit was reached, but it rarely explains why or identifies the root cause.
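One standard mitigation for the fatigue problem, sketched below with illustrative thresholds, is to build hysteresis and persistence into the alert logic so that a single noisy sample can neither trip nor clear an alarm.

```python
class DebouncedAlarm:
    """Alarm with hysteresis and persistence to suppress nuisance trips.

    set_point / clear_point: separate trip and reset thresholds (hysteresis).
    persistence: consecutive out-of-band samples required to change state.
    All numbers used here are illustrative, not recommended settings.
    """
    def __init__(self, set_point, clear_point, persistence=3):
        assert clear_point < set_point
        self.set_point = set_point
        self.clear_point = clear_point
        self.persistence = persistence
        self.active = False
        self._streak = 0

    def update(self, value):
        """Feed one reading; returns True while the alarm is active."""
        if not self.active:
            # Count consecutive samples at or above the set point.
            self._streak = self._streak + 1 if value >= self.set_point else 0
            if self._streak >= self.persistence:
                self.active, self._streak = True, 0
        else:
            # Only clear after sustained readings below the lower threshold.
            self._streak = self._streak + 1 if value <= self.clear_point else 0
            if self._streak >= self.persistence:
                self.active, self._streak = False, 0
        return self.active

# One noisy spike does not trip the alarm; a sustained excursion does.
alarm = DebouncedAlarm(set_point=7.1, clear_point=6.0, persistence=3)
for v in [5.2, 7.4, 5.1, 7.2, 7.3, 7.5, 7.2]:
    print(v, alarm.update(v))
```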

The Human Element

Data alone doesn’t fix machines. The most sophisticated AI can flag a vibration spike, but it takes an expert analyst to bridge the gap between “there’s a noise” and “replace the inboard bearing next Tuesday.”

Sensors fail. Wires break. Machines have “normal” quirks. An analyst identifies false positives (like a sensor picking up vibration because a truck drove by) before a work order gets wasted. An alert might say “high temperature.” The analyst looks at the dashboard history and determines it’s not a failing motor but a cooling fin clogged with dust.

Instead of just reporting a problem, experts provide a prescription: “The drive-end bearing is showing signs of outer-race wear. We have four weeks of life remaining. Order the part now and schedule the swap during the next planned outage.” That prescription gets turned into a formal maintenance action through a high-priority work order in the CMMS, complete with the specific tools and parts required for the job.
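A sketch of what that hand-off might look like as structured data (the field names, asset ID, and part number here are hypothetical, not any particular CMMS vendor's schema):

```python
from dataclasses import dataclass, field

@dataclass
class WorkOrder:
    """Hypothetical CMMS work-order payload; fields are illustrative."""
    asset_id: str
    priority: str
    finding: str
    recommended_action: str
    deadline_weeks: int          # remaining useful life estimated by analyst
    parts: list = field(default_factory=list)
    tools: list = field(default_factory=list)

wo = WorkOrder(
    asset_id="PUMP-07-MTR",
    priority="HIGH",
    finding="Drive-end bearing: outer-race wear signature in vibration data",
    recommended_action="Replace drive-end bearing at next planned outage",
    deadline_weeks=4,
    parts=["6309-2RS bearing"],
    tools=["bearing puller", "induction heater"],
)
print(wo)
```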


The Variable Speed Problem: Why Generic Alarms Fail

Setting alarm thresholds on a constant-speed, constant-load system (like a fixed-speed fan) is relatively straightforward. The vibration or temperature should always be roughly the same. If it changes, it’s highly likely a fault is occurring.

Setting alarms on variable-speed or process-heavy equipment is a completely different challenge. You’re trying to hit a moving target. In a variable system, the data changes constantly just because the machine is doing its job.

Consider a Variable Frequency Drive (VFD). As the VFD increases motor speed, the natural vibration amplitude increases as well. If you set a static alarm high enough to account for high-speed operation, you’ll miss a failure occurring at low speeds because the fault signal won’t be strong enough to trip the high-speed alarm. You’ve created a blind spot at the bottom end of the operating range.
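One common way around this, sketched below with made-up speeds and limits, is to bin readings by operating speed and give each band its own baseline-derived limit, so a low-speed reading is judged against low-speed normals rather than a one-size-fits-all threshold:

```python
# Speed-banded vibration alarms: each operating band gets its own limit,
# derived from its own baseline. All numbers below are illustrative.
BANDS = [
    # (min_rpm, max_rpm, alarm_limit_mm_s)
    (0,    900,  2.0),
    (900,  1500, 3.5),
    (1500, 3600, 6.0),
]

def banded_alarm(rpm, vibration_mm_s):
    for lo, hi, limit in BANDS:
        if lo <= rpm < hi:
            return vibration_mm_s > limit
    return True  # outside the known operating envelope: flag for review

# A 3.0 mm/s reading is normal at full speed but alarming at low speed.
print(banded_alarm(rpm=3000, vibration_mm_s=3.0))  # False: within band
print(banded_alarm(rpm=600,  vibration_mm_s=3.0))  # True: low-speed fault
# A single static limit set for high speed (say 6.0) would miss the second case.
```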

VFDs and varying loads also introduce what’s called smearing. If the speed changes during the few seconds a sensor is taking a reading, the frequency peaks smear across the graph. This makes it impossible for automated software to identify the specific frequency of a failing part. The data looks muddled, and the alarm logic can’t parse it.
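You can reproduce the effect with a few lines of synthetic data (an illustrative sketch, not a field measurement): the same fault tone produces a crisp spectral peak at constant speed but smears into a broad band when the frequency drifts during the capture:

```python
import numpy as np

fs, T = 5120.0, 4.0                      # sample rate (Hz), capture length (s)
t = np.arange(0, T, 1 / fs)

steady = np.sin(2 * np.pi * 120.0 * t)   # constant-speed fault tone at 120 Hz
# Speed ramps ~10% during the capture: instantaneous frequency 120 -> 132 Hz.
phase = 2 * np.pi * (120.0 * t + (12.0 / (2 * T)) * t**2)
ramping = np.sin(phase)

freqs = np.fft.rfftfreq(t.size, 1 / fs)

def peak_width(signal, level=0.5):
    """Width (Hz) of the spectral content above `level` x the peak magnitude."""
    mag = np.abs(np.fft.rfft(signal))
    above = freqs[mag > level * mag.max()]
    return above.max() - above.min()

print(f"steady peak width:  {peak_width(steady):.2f} Hz")   # narrow spike
print(f"ramping peak width: {peak_width(ramping):.2f} Hz")  # smeared band
```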

| Constant Speed/Load Systems | Variable Speed/Load Systems |
| --- | --- |
| Vibration and temperature are predictable. Deviations strongly indicate faults. | Vibration and temperature change with operating conditions. Normal variation can mask faults. |
| Static alarm thresholds work well when properly baselined. | Static alarms create blind spots at different operating speeds. |
| Automated fault detection is reliable. | Speed changes cause frequency smearing, confusing automated analysis. |
| Analyst involvement needed primarily for diagnosis. | Analyst involvement critical for both alarm design and interpretation. |

This is where the skill of the analyst becomes essential. To set alarms successfully on variable equipment, you don’t just need a highly skilled analyst. You need that analyst to be deeply familiar with the process variables that affect machine vibration. Speed, load, temperature, product type: all of these change the baseline, and the alarm strategy has to account for that.

The A.I.R.U. Framework for High-Quality Alerts

I subscribe to the principle that high-quality alerts should meet four criteria. I call it the A.I.R.U. framework:

  • Actionable. The alert should trigger a specific response. If nobody can do anything about it when it fires, it shouldn’t be an alert.
  • Informative. The alert should provide enough context (or link to enough data) that an analyst can determine root cause without starting from scratch.
  • Relevant. The alert should be tied to a known failure mode on a specific asset. Generic, broad-spectrum alarms dilute the program’s credibility.
  • Unique. The alert should tell you something you don’t already know. Redundant alarms from multiple parameters on the same fault just add noise.

If an alert doesn’t pass all four criteria, it’s a candidate for elimination or redesign. Every nuisance alarm that stays in the system erodes the team’s trust in condition monitoring. Every alert that meets the A.I.R.U. standard reinforces it.
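As a practical audit aid (my own sketch, not a formal tool, with hypothetical alert fields), the four questions can be encoded as a checklist and every configured alarm run through it:

```python
from dataclasses import dataclass

@dataclass
class AlertDefinition:
    name: str
    response_procedure: str | None   # Actionable: what do we do when it fires?
    linked_data: bool                # Informative: context for root cause?
    failure_mode: str | None         # Relevant: tied to a known failure mode?
    duplicates: list                 # Unique: other alarms covering same fault

def airu_audit(alert: AlertDefinition) -> list[str]:
    """Return the A.I.R.U. criteria this alert fails."""
    failures = []
    if not alert.response_procedure:
        failures.append("Actionable: no defined response")
    if not alert.linked_data:
        failures.append("Informative: no context for root cause")
    if not alert.failure_mode:
        failures.append("Relevant: not tied to a failure mode")
    if alert.duplicates:
        failures.append("Unique: redundant with " + ", ".join(alert.duplicates))
    return failures

noisy = AlertDefinition("MTR-12 high vib", None, False, None, ["MTR-12 high temp"])
print(airu_audit(noisy) or "passes all four criteria")
```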

Start with Criticality, Not Coverage

The temptation is to instrument everything and alarm everything. Resist it. Implement a criticality matrix and prioritize your alarm strategy around your most critical assets first. Get those thresholds right, train your analysts on the process variables, and build the workflow from alert to work order. Then expand.
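A criticality matrix can start as simply as consequence times likelihood. The sketch below, with hypothetical assets and illustrative 1-to-5 scores, ranks a small fleet and takes the top slice for the first round of alarm design:

```python
# Simple criticality matrix: score = consequence x likelihood (1-5 scales).
# Assets and scores are hypothetical.
assets = {
    "BLR-01 boiler feed pump":   (5, 4),
    "CMP-03 instrument air":     (4, 3),
    "FAN-18 roof exhaust":       (2, 2),
    "CNV-07 packaging conveyor": (3, 4),
}

ranked = sorted(assets.items(),
                key=lambda kv: kv[1][0] * kv[1][1],  # consequence * likelihood
                reverse=True)

# Focus the alarm strategy on the top slice first, then expand.
for name, (consequence, likelihood) in ranked[:2]:
    print(f"{name}: criticality {consequence * likelihood}")
```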

A focused program with well-designed alarms on 50 critical assets will outperform a sprawling program with generic alarms on 500 assets every time. The first produces action. The second produces noise.


Lead Time Is the Metric That Matters

The value of condition monitoring is not measured by the ability to detect a fault, but by the lead time provided to respond to it. Lead time is the difference between a controlled shutdown and a catastrophic failure. It transforms an emergency into a scheduled task, protecting both the safety of your people and the integrity of your production commitments.

Your alarm strategy is the mechanism that creates or destroys that lead time. Set alarms carelessly and you’ll either get warnings too late or so many false warnings that the real ones get ignored. Design them with discipline (applying the A.I.R.U. framework, accounting for variable operating conditions, and backing them with expert analysis) and you’ll build a program that gives your maintenance team the time they need to act before failure dictates the schedule.

So go look at your alarm setup. If the thresholds haven’t been reviewed since the sensors were installed, you’ve got work to do. Start with your critical assets. Apply A.I.R.U. Put an expert analyst at the center of the process. And remember: more data doesn’t equal more reliable. Quality over quantity. Every time.

Author

  • Matthew Knuth

    Matthew is a maintenance and reliability professional with over 25 years’ field experience in condition monitoring and reliability engineering. As Director of Reliability Solutions at Uptime Solutions, he designs and implements tailored programs that boost facility reliability and profitability. His expertise spans asset reliability, vibration, ultrasonic and oil analysis, alignment, balancing, and other CM technologies. Matthew holds certifications as a CAT III Vibration Analyst, and CMRP, CMRT, and CRL credentials, with a foundation in electromechanical engineering. He is also passionate about sharing his knowledge through education and public speaking.
