Do you have the Knack for Effective Troubleshooting?
By George Anderson, ESEP
Systems engineers are potentially the best candidates for being or becoming good troubleshooters.
The INCOSE Handbook identifies maintenance as a technical process. Within this framework, fault identification, or troubleshooting, is included as a sub-activity. Troubleshooting is a form of problem-solving that requires experience and often special skills such as forensic engineering.
I have been fortunate over my career to have been a troubleshooter in the aviation, electronics, ordnance, and mechanical systems fields.
Here, I offer selected stories as examples of problems that illustrate how systems thinking, principles, and intuition can play an important role in troubleshooting.
Story 1. The lawnmower conundrum
Informative troubleshooting tales do not need to be about large, complicated systems. Sometimes a malfunction in a simple system can be difficult to identify.
Some years ago, I had a power lawnmower that provided several years of trouble-free operation. I also had a hobby of rehabbing mowers and knew a great deal about mower diagnostics. At some point, my mower began surging and eventually could not reach operating speed. I measured and inspected everything I knew to be suspect and found nothing. Unlike most mowers at the time, this one used a solid-state ignition system. These were so new that no one understood much about their failure modes, and the manufacturer touted them as almost failure-proof.
Skeptical of this, I attached a neon bulb to the ignition and started the mower. Immediately the problem revealed itself. The bulb flashed until the engine tried to speed up, and then the flashing cut out. The ignition system was failing only at high rpm. After I replaced the expensive ignition module, the mower again ran normally and is still cutting my grass today, 43 years later.
This was a good outcome, but I was consumed with curiosity as to what caused a supposedly long-lived ignition module to fail. Its components were sealed against my prying eyes by a mass of potting compound that took me hours to grind away. When I reached the circuit components, I traced the failure to a single capacitor. Had it been accessible, this capacitor would have cost about $0.75 to purchase and taken ten minutes to replace.
My discovery led me to challenge a friend to repair these failed modules and sell them to mower repair shops for a fraction of the cost of a new part. We later discovered that there was no shortage of defective solid-state ignitions to repair. These parts cost over $100 to replace vs. about $5.00 for the older system. From lawnmowers to dishwashers, consumers have paid enormously since then for unreliable new electronic or digital systems that drive life cycle costs beyond the 20-fold increase suggested by this example. In fact, there is a growing trend of consumers buying older appliances to enjoy better performance, reliability, and longevity. For example, surviving examples of my 1973 mower are currently selling for around $280.
After this experience, I became suspicious of capacitors used in other applications and over the years was able to quickly find failures in electronic power supplies and computers as well as aircraft electronics. My fixation on capacitors probably saved me many man-hours following conventional isolation processes.
It took around ten years after this event for the solid-state ignition, along with its reliability flaws, to become part of automobile engine design, and with it came very painful difficulties for owners.
Story 2. A significant loss
A few years ago, my relatively new car started having sudden engine shutdowns while driving. This was scary and unsafe. When the Volvo dealer started replacing expensive parts with no logical plan, I had another shop noted for troubleshooting investigate. After months of testing, they could not find the source of the problem. I reluctantly had to sell the car because all the diagnostic efforts failed to identify the fault or even the part of the system that was failing. Several years later, I learned that a failing camshaft position sensor had given false information to the car’s electronic ignition and caused the engine stalls. The fact that the fault was intermittent made troubleshooting nearly impossible. To make matters worse, the then-new On-Board Diagnostic (OBD) system provided false information as to the cause of the shutdowns. This last revelation was probably the biggest lesson for me, namely that new systems designed to assist you can be terribly flawed.
As time went on, I encountered similar flaws in other digital systems in aircraft. Electronic fuel controls for jet engines sometimes shut engines down in flight at awkward moments. One aircraft lost both engines when the fuel control detected that the g limit of the airplane was being exceeded. While the accident cause remains controversial, there was no question that the design requirements for the digital fuel control needed to be revisited.
Story 3. Boeing’s troubleshooting efforts on the B-737 and B-737 MAX
As a National Transportation Safety Board (NTSB) accident investigator, I had exposure to the long-running forensic examination of the Boeing 737 rudder hydraulic power control unit (PCU) malfunctions. This occurred in the 1991 to 2003 time frame, and the outcome again proved that randomly occurring malfunctions can be very difficult to locate and understand.
Almost 20 years later, two Boeing 737 MAX crashes again brought forensic and troubleshooting skills before the public eye.
In the earlier B-737 rudder investigation, a subtle and random fault led to the crash of two aircraft and the temporary loss of control of countless others over a ten-year period. The NTSB spent many man-hours in forensic investigation before an alleged whistleblower from Boeing and follow-up testing allowed them to unequivocally identify the failure.
The two recent crashes of the Boeing 737 MAX place Boeing’s troubleshooting and forensic engineering skills back under scrutiny. The facts appear to be that in both crashes a single sensor malfunction was able to exert great authority over the aircraft pitch control and, against the pilots’ best efforts, cause an unrecoverable dive. It was easy for the accident investigators to discover the sensor failure as a cause, but Boeing initially blamed both crashes on pilot error. These seemingly opposing positions were not necessarily incompatible.
The Boeing position addressed pilot performance during an emergency and the expected response. This shift of perspective was needed, but the troubleshooting efforts found the single component whose malfunction started the rapid movement of the aircraft flight controls. Pilot interactions notwithstanding, the ensuing hardware investigation has shown that a redesign of the system to prevent this type of malfunction is possible. In the meantime, the FAA has withdrawn its airworthiness certification of the Boeing 737 MAX and is conducting further evaluation of the system design requirements.
Story 4. The miniature rocket that failed acceptance testing
A success story of the 1960s, the Gyrojet miniature rocket is still used today by the military as a distress flare. An important requirement is the ability to function after sustained immersion. In 1987, I was involved in a troubleshooting exercise that sought to correct testing failures of the hermetic seal that kept the propellant dry.
While there were many materials that would make a good hermetic seal, many interfered with the rocket initiation process. It was likely that the problem was a change in the existing materials, and this was confirmed. Expert engineers and consultants tried to find a solution, but all suggestions resulted in test failures. It seemed that we needed to look at solutions outside the thinking of the aerospace and ordnance industries.
Thinking about other industries, I happened to remember that my next-door neighbor was the president of a local can-producing factory. I asked him, as a food preservation expert, what other techniques might be successful, and he introduced me to the Crown Cork & Seal Company.
This company was at the time the leader in hermetic sealing technology and was the inventor of the ubiquitous bottle cap. They made a recommendation that solved the sealing and launch problem. Sometimes troubleshooting means finding the right expert rather than the right material.
Story 5. The C-130 incident that almost killed me
In this story I was at best an observer. I was a 2nd Lt. copilot on a USAF C-130E crew departing from Travis AFB, CA to Hickam AFB, HI. The night departure was normal until about an hour after takeoff when the engineer left his seat and scanned the wings with his flashlight. He thought he saw fuel streaming out of both wings.
With the weak light of the flashlight, he could not be sure but was concerned enough to check the fuel panel. The fuel tank quantity gauges gave no indication that fuel was being lost, and no pumps in the fuel dump system were in the on position. The aircraft commander was baffled and wanted to believe that, because there was no confirmation from the fuel panel, the fuel stream was somehow an illusion.
Fortunately, there was an Aldis lamp on board that provided 10 times the light output of the flashlight, and this enabled several crew members to confirm that the dump masts were discharging fuel at a rate that the tech order noted could exceed 500 lbs. per minute.
The Aldis lamp was designed and placed on the aircraft to be used as a signaling light, and its improvised use as a spotlight to illuminate the wing was a great example of problem solving when time was critical.
There was panic now as we turned back to Travis, no longer having faith in our fuel gauges or knowing exactly how much fuel we had left.
Obviously, we returned safely and successfully troubleshot the failure. The immediate cause was that, before departure, maintenance had repaired a loose wire on a connector plug and somehow attached the wire to the wrong pin. The result was that all the fuel dump pumps were powered during flight but did not activate on the ground.
This was because the pumps’ circuits were connected to power through the touchdown switch that activated when weight was on the landing gear. Adding to the crew’s in-flight diagnosis problem, many of the aircraft fuel gauges were inoperative or unreliable.
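The reason the fault hid from every ground check can be shown with a small logic sketch. This is an illustration only: the function name, its parameters, and the simplified fault model (the misplaced wire feeding the pumps from a contact that is live only when the aircraft is airborne) are my assumptions for the purpose of the example, not the actual C-130 circuit.

```python
def dump_pump_powered(dump_switch_on: bool,
                      weight_on_wheels: bool,
                      miswired: bool) -> bool:
    """Return True if a fuel dump pump receives power (hypothetical model)."""
    if miswired:
        # Assumed fault: power reaches the pumps through the touchdown
        # switch, so they run whenever the aircraft is off the ground,
        # regardless of the cockpit dump switch position.
        return not weight_on_wheels
    # Assumed normal behavior: pumps follow the cockpit dump switch.
    return dump_switch_on


# Walk through the states the crew and maintenance would have seen.
for label, on_ground in (("ground check", True), ("in flight", False)):
    powered = dump_pump_powered(dump_switch_on=False,
                                weight_on_wheels=on_ground,
                                miswired=True)
    print(f"{label}: dump switch OFF, pumps powered = {powered}")
```

In this simplified model, any test performed with weight on the landing gear looks normal, and the cockpit switch position never reflects what the pumps are doing. The only observation that exposes the fault is a direct look at the dump masts in flight, which is what the crew did.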
This incident was not the last of the accidents and incidents that involved miswired Cannon plugs or inoperable fuel gauges. In 1990, a Lockheed C-5A crashed on takeoff, killing most of those on board, when a thrust reverser opened just after the aircraft left the ground. A Cannon plug repair was the likely cause.
A C-130 in another incident involving a Cannon plug lost all four engines as the plane lifted off the runway but recovered safely.
Fuel gauges have also been an issue in other incidents. Canadian airlines have had two passenger aircraft run out of fuel and be forced to glide to alternate runways. Air Canada Flight 143 in 1983 and Air Transat Flight 236 in 2001 were both forced to make emergency landings when their engines failed. It appears that these pilots variously misinterpreted fuel gauges, made unwarranted assumptions about fuel system malfunctions, and ignored ground refueling records. Pilots need to troubleshoot fuel system anomalies and land immediately when problems cannot be identified and solved.
The stories above represent challenges that were difficult to solve by conventional thinking. Troubleshooting is truly a think-outside-the-box approach to problem solving, as well as a matter of learning to apply or even improve on the practices of forensic science. Our systems will always be evolving and exhibiting new instances of emergent behavior that will require patience and troubleshooting skills to understand and effectively control.
Additionally, because of my troubleshooting experiences, I have learned a great deal about the unreliability, excessive cost, and safety vulnerabilities of modern systems. We of course need progress, but how many of the negative aspects of new designs can we eliminate? Systems engineers must take on this challenge of creating better designs and using good troubleshooting techniques to provide lessons learned for the benefit of all. Awareness of capacitors and their shortcomings would be a good start.