In his seminal book The Limits of Safety, Scott D. Sagan contrasted two schools of thought to assess the safety performance of the United States nuclear weapon system during the era of the Cuban Missile Crisis: high reliability theory and normal accidents theory. One of the basic tenets regarding high reliability organizations is that they are able and willing to learn from events. Sagan, however, found that the U.S. military often failed to learn from the incidents it had experienced. He attributed this failure of organizational learning in part to what March, Sproull, and Tamuz called “the ambiguity of interpretation.” This ambiguity stems from the fact that events in which safety margins were reduced, but which did not end in an accident, can be interpreted either as proof of the robustness of the safety barriers or as “near misses”, that is, as events in which an accident would have occurred had it not been for some random element. The first interpretation supports the perception that the system in question is fundamentally safe; the second interpretation supports the perception that the system is not safe enough.
According to Sagan, the U.S. military interpreted its incidents in the main as proof of the robustness of its systems. Obviously, no accidental nuclear war and no accidental nuclear detonation occurred. Considering this safety record of more than fifty accident-free years, the U.S. nuclear weapon system could be considered to be very safe. However, there were incidents in which the margins were decreased, for example when a B-52 loaded with nuclear weapons crashed close to an airbase at Thule, Greenland, but the nuclear bombs aboard did not explode. Two interpretations of such events are possible. On the one hand, it can be argued that events such as the accident at Thule prove that the system is fundamentally safe, that so many redundancies and safety devices were built into the system that even a catastrophic aircraft accident did not cause a nuclear detonation. This is what March et al. called the “reality of safety in the guise of danger.” On the other hand, of course, it can be argued that the safety margins were reduced so much that the system was unsafe and that an accident could have resulted had it not been for some additional (random) element. In the example of the Thule accident, the nuclear bombs aboard the B-52 could theoretically have exploded if exposed to a shock wave or impact forces in a certain way. March et al. called this interpretation the “reality of danger in the guise of safety.” The first interpretation, however, is not conducive to organizational learning and to safety improvements. If the system is considered safe, then what else is there to do but to pat one another on the back and congratulate ourselves on a safety job well done?
Almost 25 years have passed since Sagan’s book was first published, but the ambiguity of interpretation is something that can be seen frequently even today. Take the following event as an example: A flight crew was picking up a long-range business jet from a third-party maintenance provider. During their initial cockpit checks, they found that five circuit breakers for essential systems had been pulled, which they diligently reported to their company. Their report, however, initially did not cause much concern. While the company’s maintenance department would normally have followed up with the external maintenance provider to establish why the circuit breakers had not been reset and what could be done to prevent recurrence, the maintenance management of this particular fleet decided against that. They argued that pulling the circuit breakers was perfectly normal when preparing an aircraft for maintenance (which it is; however, not resetting the circuit breakers before releasing the aircraft back to line flying is not). But most importantly, they pointed out that the pilots’ checklist, as one of the very first items, includes checking the circuit breakers. Taking it even further, they pointed out that this safety barrier worked very well, as evidenced by this very event. As a consequence, no further actions were considered to be necessary, not even sharing the crew’s report with the maintenance provider.
But which interpretation is correct, or at least more accurate than the other? In my humble opinion, interpreting the removal of safety barriers and the reduction in safety margins as a safety success is off the mark. As Todd Conklin pointed out, safety is not the absence of incidents, but rather the presence of controls and defences. Furthermore, there is a clear difference between the (temporary) breach of a safety barrier and its intentional or tacitly accepted disregard. In the latter case, we can expect that the safety barrier will subsequently be removed, either formally or informally. Hence, interpreting such an event as a safety success could easily be the starting point of what Sidney Dekker called the drift into failure: the slow and often imperceptible reduction of margins and the transition to a lower state of safety. In the case at hand, the organization’s maintenance department did not intend to follow up with the maintenance provider. They were content that the second safety barrier, the pilot checklist, was sufficient to provide the desired level of safety, and that re-establishing the first safety barrier, the requirement for maintenance providers to always deliver aircraft with the circuit breakers reset, was unnecessary. Paradoxically, they would have removed a safety barrier because it had been breached. Fortunately, the lack of action was noticed by the organization’s safety department and promptly addressed.
Nevertheless, the event demonstrates that the danger in the guise of safety, and the pitfalls it creates, is alive and well. We must be careful not to take any safety success at face value. Instead, we must investigate which interpretation of events an alleged safety success is based on, always being mindful that the road to failure may be paved with safety successes that are nothing more than one-sided interpretations.
I would love to hear your thoughts on the topic.
In traditional software systems, one “fault tolerance” approach aimed at mitigating component failure is redundancy: adding a secondary, identical component that can be used in case the primary one fails.
One issue with this approach is that it introduces a number of additional complications, such as an additional component to monitor for health and failover mechanisms that need to be in place. Another issue arises in scenarios where the components are in an “active-active” configuration, where both primary and secondary are in production. In this case, one or both of the components can experience degradation of some kind and that behavior can go unnoticed, because the overall outcome is not degraded. As long as one of the components is up and running at any given point in time, all can be deemed to be ok.
This is called “masking” and it is one of the most insidious modes to be in. This sounds very much like the ‘ambiguity of interpretation’: the existence of a secondary component and the associated infrastructure to support it both masks deeper issues and provides continuity in specific failures.
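To make the masking effect concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `Replica` class, the function names, and the boolean `healthy` flag are all hypothetical and do not come from any particular monitoring tool. It shows how an aggregate "is the service up?" check can report success while a per-component view reveals that the redundancy margin is already gone.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str      # hypothetical identifier for one redundant component
    healthy: bool  # assumed result of that component's own health probe


def service_is_up(replicas):
    """Aggregate view: the service counts as up if any replica is healthy."""
    return any(r.healthy for r in replicas)


def degraded_replicas(replicas):
    """Per-component view: names of the replicas that are currently failing."""
    return [r.name for r in replicas if not r.healthy]


def redundancy_margin(replicas):
    """Number of further replica failures the service could still absorb."""
    healthy_count = sum(r.healthy for r in replicas)
    return max(healthy_count - 1, 0)


# Active-active pair in which the primary has silently degraded.
pair = [Replica("primary", healthy=False), Replica("secondary", healthy=True)]

print(service_is_up(pair))      # aggregate check still reports success
print(degraded_replicas(pair))  # per-component check exposes the failure
print(redundancy_margin(pair))  # no spare capacity is left
```

An alert that only watches `service_is_up` corresponds to the "safety success" reading of the event; alerting on `degraded_replicas` or on a shrinking `redundancy_margin` corresponds to treating the same event as a near miss.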
That’s a very interesting perspective, and I think it is just another example of why adding redundancy alone doesn’t solve safety issues.
It looks like the Grenfell Tower Fire is another example! See https://www.researchgate.net/publication/319123150_The_Grenfell_Tower_Fire
Hi David, I read your article with great interest. Where in particular do you think the ambiguity comes in?
A broken lighthouse is more dangerous than a reef.
A very nice way of putting it, thanks!
If “interpreting the removal of safety barriers and the reduction in safety margins as a safety success is off the mark”, what would you then consider a safety success? If degraded barriers are backed up by additional standing barriers, isn’t this a demonstration of a system’s resilience?