Process machines are critical to the profitability of processes. Safe, efficient, and reliable machines are essential for maintaining dependable manufacturing processes that produce marketable, on-spec products at the desired production rate. As the stewards of process machinery, we wish to keep our equipment in serviceable condition.
Troubling machinery symptoms, such as vibration, high temperatures, low flow conditions, etc., are often encountered around production sites. When these symptoms are detected (Figure 1), the site should have a clear process for assessing these conditions to decide if they are systemic, transitory, or due to the deterioration of a machine or its ancillary components.
Troubleshooting keeps machines running – failure analysis explains why they failed
To optimize reliability, sites must provide both troubleshooting and failure analysis functions for operations. However, managers should be aware that troubleshooting is a different undertaking than performing failure analysis. In this brief article, I will outline the key differences between these two job tasks.

Figure 1: Decision tree broken down into “Field Assessment” and “Shop Repair” segments.
One of the most challenging decisions for those working in the field is whether an operating machine should be shut down due to a perceived problem or allowed to keep running at current operating conditions.
If they wrongly recommend a repair, the remaining useful machine life is wasted; however, if they are right, they can save the organization from severe consequences, such as product releases, fires, or costly secondary machine damage. This economic balancing act is at the heart of all machinery assessments.
The reader may ask: What is the difference between field troubleshooting and other analysis methods such as a root cause analysis, failure analysis, and a root cause failure analysis? Let us briefly review the differences:
Field troubleshooting is a process of determining the cause of an apparent machine problem, i.e., a symptom, while the machine is still operating under actual process conditions. Field troubleshooting efforts typically focus on a specific machine under a full process load, utilizing a proven body of technical and historical knowledge.
The body of knowledge may be in the form of troubleshooting tables, manufacturer’s information, or the site’s operating history. Keep in mind that process machinery can only be truly evaluated in service and under full load, i.e., in situ. To my knowledge, there are no testing facilities available that can evaluate process machinery under full process loads and with actual process fluids.
Field troubleshooting assesses the mechanical integrity of a machine in process service to determine whether observed symptoms are the result of an actual machine fault or a process-related issue.
Here are examples of troubleshooting opportunities:
Example #1: Pump flow has fallen well below its rated level.
Example #2: Pump thrust bearing is running 20°F hotter than it was last month.
Example #3: A pump seems to be vibrating excessively.
The site needs to know if these machine symptoms are outside normal limits. If so, the plant needs to adjust field operating conditions, investigate piping restrictions, inspect for broken supports, align the pump and driver, correct other conditions that lead to the symptom, or plan for a pump repair.
Troubleshooting requires a cause-and-effect understanding of the type of machine being assessed to determine if there is a problem. If field conditions seem to be outside normal limits, the troubleshooter must ask himself: What possible field condition, machine issue, or issues can cause the symptoms I am seeing?
Here are some common examples of cause-and-effect thinking:
- If a bearing is running hot, the troubleshooter needs to know what can cause the symptom to occur. A loss of cooling water flow or low oil level might cause it.
- If a centrifugal pump is exhibiting low flow, the troubleshooter needs to know what can cause that symptom to occur. It might be caused by a pinched discharge valve causing excessive pump back pressure.
- If a centrifugal pump driver has a high power draw, the troubleshooter needs to know what could cause this symptom to occur. It may be caused by:
  - The wrong impeller(s) being installed.
  - Running the pump at too high a speed. Since a pump’s power draw is proportional to the cube of its speed, we need to check the pump speed to ensure it is running close to its rated speed.
  - Running the pump above the design flow due to a control issue or inadequate backpressure on the pump.
  - The density or viscosity of the pumped fluid being higher than expected.
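The cube relationship between pump speed and power draw noted in the list above can be checked with quick arithmetic. The following is a minimal Python sketch, assuming the same impeller and the same fluid. The function name and the example speeds are hypothetical and chosen only for illustration.

```python
def power_ratio_from_speed(actual_rpm, rated_rpm):
    """Estimate actual/rated power from the speed ratio.

    Assumes power scales with the cube of speed, which holds only for
    the same impeller handling the same fluid.
    """
    if actual_rpm <= 0 or rated_rpm <= 0:
        raise ValueError("speeds must be positive")
    return (actual_rpm / rated_rpm) ** 3


# Hypothetical example: a pump rated at 3600 rpm found running at 3780 rpm.
ratio = power_ratio_from_speed(3780, 3600)
print(f"5% overspeed -> about {(ratio - 1) * 100:.1f}% more power")
# prints: 5% overspeed -> about 15.8% more power
```

Even a small overspeed produces a noticeable rise in power draw, which is why checking speed against the rated value is a quick and worthwhile field test.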
By eliminating unlikely causes or disproving possible causes by testing, the most likely sources of the problem can be found. This cause-and-effect thinking is gradually developed with training and practice. Over time, this type of thinking will become second nature.
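The elimination approach described above can be sketched as a simple lookup against a troubleshooting table. This is only an illustrative Python sketch: the table holds just the example causes named in this article, not a complete troubleshooting table, and the names `CAUSE_TABLE` and `remaining_causes` are hypothetical.

```python
# Symptom -> candidate causes, limited to the examples given in the text.
CAUSE_TABLE = {
    "hot bearing": [
        "loss of cooling water flow",
        "low oil level",
    ],
    "low flow": [
        "pinched discharge valve",
    ],
    "high power draw": [
        "wrong impeller installed",
        "pump speed too high",
        "flow above design point",
        "fluid density or viscosity higher than expected",
    ],
}


def remaining_causes(symptom, ruled_out):
    """Return the candidate causes that field checks have not yet disproved."""
    candidates = CAUSE_TABLE.get(symptom, [])
    return [cause for cause in candidates if cause not in ruled_out]


# Speed and impeller have been checked and found correct.
checked = {"pump speed too high", "wrong impeller installed"}
print(remaining_causes("high power draw", checked))
```

Each field check that disproves a cause shortens the list, leaving the most likely sources of the problem for closer attention.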
Failure analysis is the process of collecting and analyzing physical data to determine the cause of a failure. Physical causes of failure include corrosion, bearing fatigue, and shaft fatigue, among others.
Failure analyses can only be conducted after a machine is disassembled and the failed component is identified and removed. Failure is defined as a condition in which a component’s operating state falls outside its intended design range and the component can no longer safely or efficiently perform its intended duty.
Root cause failure analysis (RCFA) methodology attempts to solve complex problems by identifying and correcting their root causes rather than simply addressing the component damage. The RCFA methodology enables an organization to delve deeper into a failure or series of failures to identify latent issues.
To further clarify the differences between these analysis approaches, we recommend the following line of questioning:
- The field troubleshooter must first ask: Do I fully understand the machine or subsystem that needs to be analyzed? If the complexity is beyond the troubleshooter’s abilities, they should get help. At this point, management may decide to conduct a Root Cause Analysis (RCA).
- Suppose the field troubleshooter decides to tackle the problem at hand. In that case, they should then ask: “Are the observed symptoms caused by a failing machine, a correctable fault, or by undesirable process conditions?” If it is a process-related problem, changes can be made before permanent machine damage occurs. If a fault is deemed correctable, adjustments or minor repairs can be made to restore the machine to a serviceable condition quickly.
If the machine fails, either a failure analysis or root cause failure analysis must be performed, depending on the extent and cost of the failure. The failure analyst asks different types of questions depending on the level of detail desired:
- The failure analyst asks the question: “What is the physical mechanism or sequence of events that caused a given component to fail?” If the failure mechanism is clearly understood, perhaps design or procedural changes may be implemented to avert future failures.
- The root cause failure analyst asks the question: “Are there hidden factors, such as unknown design, repair, operational, and other organizational issues, contributing to the observed machine problems?” If there are latent factors suspected but unidentified, perhaps an interdisciplinary team can identify the key factor or factors and address them to avoid future failures.
A key distinction between field troubleshooting and failure analysis methods is that initial troubleshooting usually only requires basic field knowledge, tools, and instruments. In contrast, most failure analyses require some engineering expertise to perform.
Train the field to spot problems—leave the deep forensics to the experts.
From experience, we know that machinery failure investigations require knowledge of materials science, failure mechanisms, machine design, machinery performance, heat transfer, and other related fields. For this reason, I believe that while most field personnel should be capable of performing basic field troubleshooting, in-depth failure investigations should be left to engineers and experienced technicians.
It makes sense to encourage operators, mechanics, and field supervisors to learn basic troubleshooting skills so they can assist in the initial assessment of machinery and help determine if repairs are required.
Figure 1 shows a simple decision tree that depicts how common machinery field problems can be addressed. The troubleshooter begins at the top of the tree when a symptom is first detected. At this point, the troubleshooter assesses the situation and then picks one of the possible paths forward:
- Do nothing.
- Modify process conditions.
- Adjust machine, i.e., balance, align, or lubricate machine as required.
- Plan to repair.
If a machine repair is deemed necessary, the maintenance organization should then estimate the repair and outage costs. If the total cost (parts and labor) of the failure is less than $10,000, then the repair should be performed without any additional type of analysis.
If the total cost of repair is estimated to be greater than $10,000 but less than $50,000, a failure analysis should be conducted on the failed parts to understand the nature of the failure. Finally, if the total cost of failure exceeds $50,000, then a root cause failure analysis is justified and should be conducted.
Note: The decision tree presented here is only one of many possible tools that can be used to address machinery field problems. Each organization can and should develop its own customized decision tree to meet its specific needs. For example, the cost breakpoints used in this example can be customized to satisfy your organization’s process and management goals.
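The cost breakpoints in the repair branch of the decision tree can be expressed as a short function with adjustable thresholds. The sketch below is in Python, and the names `Analysis` and `select_analysis` are hypothetical. The text does not say how a cost of exactly $10,000 or $50,000 should be handled, so the sketch assumes that boundary values fall into the more thorough tier.

```python
from enum import Enum


class Analysis(Enum):
    REPAIR_ONLY = "repair without additional analysis"
    FAILURE_ANALYSIS = "failure analysis"
    RCFA = "root cause failure analysis"


def select_analysis(total_cost, fa_breakpoint=10_000, rcfa_breakpoint=50_000):
    """Pick the analysis level from the estimated total cost in dollars.

    Defaults are the example breakpoints from the article. Assumption:
    a cost exactly equal to a breakpoint goes to the higher tier.
    """
    if total_cost < 0:
        raise ValueError("total_cost must be non-negative")
    if fa_breakpoint >= rcfa_breakpoint:
        raise ValueError("fa_breakpoint must be below rcfa_breakpoint")
    if total_cost < fa_breakpoint:
        return Analysis.REPAIR_ONLY
    if total_cost < rcfa_breakpoint:
        return Analysis.FAILURE_ANALYSIS
    return Analysis.RCFA


for cost in (4_000, 25_000, 80_000):
    print(f"${cost:,}: {select_analysis(cost).value}")
```

A site with different economics can pass its own breakpoints, for example `select_analysis(20_000, fa_breakpoint=25_000, rcfa_breakpoint=100_000)`, without changing the logic.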
Scouts versus Doctors
Troubleshooting
We should think of a troubleshooter as a scout (see Table 1) who goes into the field to identify and assess process machines exhibiting unwanted symptoms. The troubleshooter must possess the necessary knowledge, skills, and motivation to determine whether the machine is operating safely or not.
If not, the troubleshooter must then survey the overall installation and operating conditions to determine if the observed issues can be resolved by making process changes or adjusting the machine. If his attempts fail to correct the issue, the troubleshooter will recommend scheduling a machine repair.
Performing Failure Analyses
In contrast, we should think of a failure analyst as a doctor (see Table 1) who determines what led to the failure and if a similar failure can be prevented in the future. The failure analyst’s job begins once the process machine is removed and taken to the shop. Upon disassembly, the failure analyst begins looking for clues that will explain why the failure occurred.
The first layer of inspection is usually conducted by a mechanic, who determines which components failed and documents the type of damage they sustained. He may also employ the 5 Whys method to understand the chain of events leading to the failure.
Although the mechanic’s primary goal is to repair the machine, he may be asked to determine the reason for failure, depending on the economic losses sustained due to the failure. If it is evident that the machine has reached its normal end of life, then he will proceed with the repair and document his findings.
Failure analysis doesn’t fix the machine – it prevents the next failure.
However, if it is evident from inspection that the machine failed prematurely, then the mechanic and/or machinery specialist should perform a more detailed root cause analysis to understand and document the failure.
For more complex failures, the mechanic and machinery specialist should collaborate to determine if a latent root cause is involved in the failure. Ideally, they will uncover the root cause of the failure and eliminate it to ensure a long operating life.
Table 1 provides a comparison of troubleshooters and failure analysts. Notice the difference in skill levels, training, and tools.

Table 1: Comparison of a Troubleshooter versus a Failure Analyst
I recommend that sites with critical machinery teach field personnel basic machinery inspection and troubleshooting methods. Since operators tend to make up the lion’s share of the staff working at petrochemical sites, it makes sense to develop them into “field scouts.”
They are always in the field, understand machine functions, and recognize normal operation. Having an army of trained operators in the field will help keep machinery out of harm’s way and avoid unnecessary repairs.
Rudimentary troubleshooting by operators should be the first step in a series of analysis steps to identify and correct an abnormal situation in the field before permanent machine damage occurs (Figure 1). By acting quickly, the underlying problem can be identified and corrected, allowing the machine to return to normal operation promptly.
How Troubleshooting and Failure Analysis Work Together
Periodic field inspections are crucial for ensuring the reliable operation of machinery. The decision tree in Figure 1 illustrates how informed machine decisions usually begin with some sort of troubleshooting or assessment effort. Field personnel can be considered “field scouts” who inspect machinery and determine if repairs are required.
In contrast, machinery professionals can be considered “doctors” who oversee repairs to determine the cause of the failure and whether it could have been prevented. These two functions work in tandem to avoid unnecessary repairs and enhance machinery reliability.