Resilience and the Pembroke refinery explosion

Pembroke Refinery is an oil processing facility on the Milford Haven Waterway, in Wales. It was the site of a multiple-fatality explosion in 2011, and of the grounding of the Sea Empress in 1996, which released a major oil spill into the Pembrokeshire Coast National Park. Prior to both of these events, Pembroke Refinery made national headlines with another explosion and fire, fortunately non-fatal.

Early on the morning of 24th July, 1994, a dry electrical storm was raging above the refinery. Just before 9am, lightning struck one of the processing units, starting a fire. The lightning and subsequent fire did not directly cause any permanent harm, but they were the trigger for the events which followed. They created a disturbance in the normal operation of the refinery, and signalled that this was not a normal day at the office.

As part of the emergency shutdown, hydrocarbon flow was halted through part of the plant, the cracking unit. None of the vessels in this plant were supposed to be completely empty, even in shutdown, so a series of valves began to close. These closed valves prevented any of the vessels from completely draining. Once the plant was restarted and flow resumed, the valves were supposed to open again. One valve in particular, which we’ll call VALVE B, stayed closed. VALVE B enabled flow from a tank called the debutaniser into another tank called the naphtha splitter. With VALVE B closed, and flow restarted, the debutaniser began to fill up, whilst the naphtha splitter was being starved.

To make matters worse, the control system was getting incorrect signals from VALVE B. According to the control system, VALVE B was open, and everything was working normally. Inside the control room there were status indications for all of the equipment, and a separate monitor with a list of all the alarms, but there was no overview display of the plant. What I imagine when I think of a control room is a screen with a picture of all the tanks joined together, showing how full each one is, and how much is flowing between them. If the operators had had such a picture, they could have quickly worked out that even though VALVE B was supposedly open, nothing was flowing from the debutaniser to the naphtha splitter. The technology allowed such graphics to be displayed, but had been configured only to show raw data and bar charts. The operators could also only call up part of the overall process at a time. On the debutaniser information screen they could see that pressure was increasing. There was nothing calling their attention to the naphtha splitter, so no reason to bring up that particular screen, or to compare it to the debutaniser.
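
To make that concrete, here is a minimal sketch of the kind of consistency check such an overview display, or its underlying logic, could have performed. It is purely illustrative: the function name, flow values and threshold are invented, and nothing here is taken from the actual Pembroke control system.

```python
# Illustrative only: a simple consistency check of the kind an overview display
# or its alarm logic could perform. The function name, flow units and threshold
# are invented; they are not from the Pembroke control system.

def check_valve_flow_consistency(valve_reported_open, measured_flow, min_expected_flow=5.0):
    """Return a warning string if the reported valve state contradicts the measured flow."""
    if valve_reported_open and measured_flow < min_expected_flow:
        return "Valve reports OPEN but downstream flow is near zero: suspect a stuck valve or a bad signal"
    if not valve_reported_open and measured_flow >= min_expected_flow:
        return "Valve reports CLOSED but flow is present: suspect an instrument fault"
    return None  # reported state and measurement agree, nothing to flag


# Roughly the situation the operators faced with VALVE B: the control system
# said "open", but (hypothetically) almost nothing was flowing downstream.
warning = check_valve_flow_consistency(valve_reported_open=True, measured_flow=0.2)
if warning:
    print(warning)
```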

As the debutaniser filled up with liquid, pressure increased and had to be relieved. The operators vented the debutaniser into the blow-down and flare system three times. Unlike at Texas City, where a similar overspill caused a much worse explosion, all of the equipment was connected to flare towers. Rather than have a separate tower for each piece of equipment, there were towers for different types of product. The “sweet” tower dealt with light hydrocarbons. The “sour” tower dealt with gases that had significant amounts of hydrogen sulfide, and the “acid” tower dealt with mixed material that needed processing before it could be safely burned off.

Just like at Texas City, though, the blow-down system was designed primarily to handle overflow gases, not large amounts of liquid. Within each unit there was a drum for capturing liquid and allowing gas to proceed into the arterial pipework and eventually to the flare. Once this drum was full, liquid would enter the pipework. Because there were only three towers serving many pieces of equipment, this piping was fairly complicated. Even in an overflow situation, pressure needed to be carefully managed so that there was a smooth flow to the tower. On 24th July, as the liquid from the third venting passed through one of four 90-degree elbow bends, the pipework failed. It didn’t help that changes a few years earlier to reduce the environmental impact of the plant had prevented automatic removal of liquid from the blowdown system. It didn’t help that the pipes, not meant to handle liquid in the first place, were known to be corroded.

Twenty tons of hydrocarbon were released, a vapour cloud formed, and an explosion quickly followed. There were no fatalities. Partly this was due to luck, but to be fair to the plant management there was also solid contingency planning in place, and facilities in range of the explosion were designed to cope with blast damage.

 

Investigation

It took two and a half days to put out the fire, but only a few hours for representatives of the Health and Safety Executive (the United Kingdom body responsible for investigating accidents of this type) to arrive on the scene.

The site was quite dangerous to investigate due to the extensive damage, but the control rooms were mostly intact. In fact they wouldn’t have been badly damaged at all if the earlier lightning strikes hadn’t disabled the air-conditioning system, requiring the protective door to be left open.

The proximate causes of the accident were identified as two process imbalances. There was more liquid going into the debutaniser than coming out of it, and more heat going in than coming out. Both of these speak to a loss of control. Some variation in chemical processes is normal, even desirable. Too much variation, particularly when two parameters drift badly in the wrong direction at the same time, leads to an unsafe situation.
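
As a toy illustration of those two imbalances, expressed as simple in-minus-out calculations (all figures are invented for the example, not taken from the report):

```python
# Toy illustration of the two imbalances the investigators identified. All
# numbers are invented; they are not figures from the investigation report.

liquid_in, liquid_out = 30.0, 18.0   # tonnes per hour entering / leaving the debutaniser
heat_in, heat_out = 12.0, 9.0        # megawatts of heating in / cooling out

mass_imbalance = liquid_in - liquid_out   # positive => the vessel is steadily filling
heat_imbalance = heat_in - heat_out       # positive => temperature and pressure keep rising

if mass_imbalance > 0 and heat_imbalance > 0:
    print("Both balances drifting the wrong way at once: pressure will keep climbing.")
```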

The report also found that the actions of the operators contributed to these imbalances. They didn’t quite understand what was going on, and so they didn’t take appropriate action to correct the drift. It would be easy (and very wrong) to stop there.

 

Explanation

None of the operators went to work that day intending to do a bad job. In fact, they were displaying considerable expertise in coping with a highly unusual situation. They had contained a dangerous fire without anyone being hurt and without damage to the plant, and they were in the process of restoring production. However, they were trying to manage within a plant and organisation with very low resilience.

“Resilience” is the capacity of something to cope with disturbance. When measuring the physical resilience of a component, we consider the amount that it distorts in response to stress, and how quickly and completely it returns to its original shape. Resilience is a positive measure of safety, in the sense that it considers the presence of good things rather than the absence of bad things.

The physical plant at the refinery had very limited flexibility. In a simple flow cycle, it is very important that the system always has more capacity to remove pressure, mass and heat than it does to introduce pressure, mass and heat. This can be achieved by having a second control loop which shuts off inputs if the outputs are not flowing.
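
As a rough sketch of that second control loop, under the simplifying assumption of a single feed valve and a single outlet flow measurement (the class, names and threshold below are invented for illustration):

```python
# A minimal sketch of an interlock that trips the feed closed whenever the
# outlet stops flowing, so the system can never put in more than it can take
# out. Names and thresholds are invented; this is not the refinery's logic.

class FeedInterlock:
    def __init__(self, min_outlet_flow):
        self.min_outlet_flow = min_outlet_flow
        self.feed_enabled = True

    def update(self, outlet_flow):
        """Close the feed when the outlet is not flowing; reopen it when flow recovers."""
        self.feed_enabled = outlet_flow >= self.min_outlet_flow
        return self.feed_enabled


interlock = FeedInterlock(min_outlet_flow=2.0)
print(interlock.update(outlet_flow=0.0))  # False: outlet blocked, so the feed is tripped closed
print(interlock.update(outlet_flow=5.0))  # True: outlet flowing again, feed re-enabled
```

The design choice is the important part: the loop errs on the side of starving the vessel rather than letting it fill, so removal capacity always exceeds input.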

The design of the control room disempowered the operators. They had enough information to follow carefully written procedures, but they were not expected to adapt or improvise, and so the system didn’t provide them with the situational awareness needed to show initiative. It wasn’t just the lack of an overview display: there was also an alarm system that just dumped a long list of unprioritised warnings on the operators, and unreliable instruments which meant that the operators had to form complicated mental models encompassing both the disturbed processes and the incorrect reporting of those processes.

Their training and their equipment didn’t let them step back and form a clear picture of what was going on in time to react appropriately.

Finally, and quite literally, if the elbow bend hadn’t been corroded it might have flexed and returned to shape instead of shattering.

The lightning strike provided an initial disturbance to the system. The equipment and operators were put under unusual stress. Good business and good safety required a return to normal operating conditions as quickly and smoothly as possible. It didn’t have to be a lightning strike which caused the disturbance, and it was probably impossible to enumerate and protect against every single thing that could go wrong. Instead, the overall system needed the positive features of resilience so that it could respond to any disturbance.

Some of these features could have been pure hardware – a blowdown system with more capacity, and less crucial timing and balance. Modifications to the plant to reduce its environmental footprint had actually made this part of the system far less resilient. Other missing resilient features related to the support provided to the operators. Displays that provide increased situational awareness, and alarm systems which take care of prioritisation and interpretation so that operators can focus on the big picture, also increase resilience.
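
As a hedged sketch of what that prioritisation might look like, with invented alarm tags, messages and priority levels rather than anything from the Pembroke alarm list:

```python
# A sketch of "prioritisation and interpretation" in an alarm system: rather
# than dumping every warning in arrival order, rank them so the operator sees
# the most safety-critical items first. All tags and priorities are invented.

from dataclasses import dataclass

@dataclass
class Alarm:
    tag: str
    message: str
    priority: int  # 1 = immediate safety threat, 2 = needs attention, 3 = informational

raw_alarm_flood = [
    Alarm("FI-102", "Low flow to naphtha splitter", 2),
    Alarm("TI-210", "Control room air-conditioning fault", 3),
    Alarm("PI-105", "Debutaniser pressure high", 1),
    Alarm("LI-107", "Blowdown drum level high", 1),
]

# Present the highest-priority alarms first instead of a raw chronological dump.
for alarm in sorted(raw_alarm_flood, key=lambda a: a.priority):
    print(f"P{alarm.priority} {alarm.tag}: {alarm.message}")
```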

Whilst the report does not discuss training in depth, it does mention that the team inside the control room were flexible and multi-skilled. This is normally a marker of resilience. Individuals could switch roles to cover for missing staff, and managers were in the habit of “helping out” during upset situations. Unfortunately this individual flexibility didn’t translate into team resilience. Decisions were made on an individual, reactive basis without co-ordination.

Reading between the lines, the gap in trust and understanding was not between line management and control room operators, but between the designers and the whole operational team. All of the operators, including the line managers, were provided with the information that the designers thought they needed to have to operate the plant in its intended fashion.

In 1984, Charles Perrow would have called this a “Normal Accident”. The tight coupling and interactive complexity of the system meant that the operators were not able to comprehend the problems adequately, so they made things worse instead of better. Resilience, one of the themes of Safety Differently and Safety II thinking, gives us a way of managing safety that can prevent such accidents. It is good to anticipate hazards and design systems and people that can cope with them, but it isn’t enough. We also need to look at safety as a positive attribute of our systems and people.

This post is a modified version of a segment from DisasterCast Episode 42.

Image: Colin Bell/geograph.org.uk

 

6 Comments

  1. John Wilkinson

    Hi Drew, while I like the summary account I think there are other factors e.g. supervision (this is a necessary distributed function even in a fully self-managed team, but I don’t think they were such a full team), which also contributed, and I am not sure what an appeal to ‘resilience’ here adds – while appreciating there is a fuller version elsewhere which I have not yet seen. In my simple view, there are a range of performance shaping / influencing factors which make error more or less likely on the day, and the operators were faced with a catalogue of these. So to add assurance against error the design and operation needed to be optimal, and clearly they were not. The risk of using ‘resilience’ as an explanation is that it may add nothing i.e. have no explanatory power, only descriptive at best. If we are adding something we need to be clear exactly what that is. Meanwhile I’d say that improved HCI and control room arrangements, better understanding and awareness of what could go wrong and improved training (the site developed a suite of full high fidelity simulators after the incident but sadly other refineries did not all follow suit) would look like a good start to fixing things. And if that’s resilience in a system then I’m for it of course but I’m not seeing the ‘extra’ that it should bring. You can tell I’m ripe for conversion, but not just yet I’m afraid… 🙂

  2. Drew Rae

    John, you’ve hit right at the heart of the epistemological problem of safety research. “Explanatory” power isn’t quite precise as a characterisation, but it is a good short-hand for the difference between safety and, say, bridge design. Almost all safety concepts aren’t so much theories as “meta-narratives”. They are patterns that we can see in the stories, and there’s no particular reason to prefer one pattern over another.

    The idea of an error-encouraging or error-discouraging system due to a set of performance shaping factors is equally a pattern. In the small (i.e. in a controlled experiment) performance shaping factors have predictive power. As explanations for accidents they are just another meta-narrative.

    You’re quite right that I could have emphasised all of those factors, and made this a story about operator error. I could equally have made this a story about the design of the plant, showing the accident as a downstream consequence of design decisions.

    What’s special about resilience? Not a lot empirically, yet. There are two reasons to prefer it over more negative views of safety, though.

    The first, which is my main interest, is that in principle it _can_ have explanatory power. It doesn’t yet, because we don’t have a way of measuring resilience in advance of disturbances in a way that is both reliable and predictive. Contrast this with risk assessment approaches where we equally don’t have ways of measuring safety in advance that are reliable and predictive, but where there are very good reasons to expect that we never will.

    The second reason is a value judgement. Explaining an accident as the consequence of human error is a morally questionable narrative with real-world negative consequences. Adding systemic explanations for those errors builds upon that narrative without changing its fundamental nature, and has repeatedly failed to take away the real-world negatives of the human error worldview.

    What I was trying to do with this story – and it is just a story, as all accident explanations are – was to use resilience as the meta-narrative. You’re completely right that it doesn’t “add” anything – it guides the narrative. My challenge to you would be to recognise that there is no “real” story or explanation to be added to. There are always other factors that can be discussed, but the choice of which ones to include or emphasise is a storytelling choice, not an objective one.

  3. David Provan

    Hi Drew, Great summary of an incident that is similar in nature to Texas City and also Longford here in Australia. I think that information plays a critical role in resilience within tightly coupled complex socio-technical systems. In this example the alarm system and displays within the control room provide this information for interpretation and action by operators. In my experience in gas plants it becomes extremely hazardous when the technical system has low levels of resilience (engineering redundancy), the people have low levels of resilience (competence, anticipation, risk competence), and they are connected by incomplete or unreliable information. Safety in these environments may most quickly and efficiently be improved by focussing on the information rather than the technical system or the people. More complete, timely, accurate and risk-prioritised information about the functioning of the technical system, made available to operators, should lead to improved speed and adequacy of intervention in the system during unplanned operating conditions. That maintains the margin of safety … Or you could employ more safety managers and hope.

  4. Drew Rae

    David,
    Thanks. I think information is a really interesting lens to look at these things with. Unfortunately it is one of the factors that often gets obscured by the process of accident reporting. To create an intelligible report, all the different perceptions get turned into a single narrative, filled with “didn’t realise” and “should have known” – all of which are flags that the story was very different seen with different information.

    At Pembroke the local information situation could have been much better engineered. I’d love to know the full story behind that. If you’d asked the operators before the accident, would they have said that the displays were frustrating to use? Safety aside, did they go home to play Doom or Myst on their Pentium PCs and complain about the low-tech displays at work?

    Sometimes fixing the information situation does require significant engineering change. At Texas City there was no reliable way to see what was happening, due in part to under-maintenance.

    In this case, though, they had all the technology there – they just hadn’t configured it to be useful. Did they know that? Had they conducted a human factors analysis when they set it up? Had they involved the operators in the control room layout design?

    Improving the information was a quick way to improve resilience, but I don’t know why they hadn’t already done it. That’s one of the research questions I’m looking into at the moment (not with Pembroke though – the details I’m interested in aren’t the details that the investigators at the time were interested in).

  5. David Provan

    The short answer is un-finished design and/or drift. The long answer is far more messy, as you describe, and whilst the “why hadn’t they?” seems so logical in hindsight, it was so rational not to do before. I have some very relevant stories specific to control rooms in gas plants for another time.

  6. Bill Mullins

    Effective understanding for operations in Perrow’s tightly coupled systems requires an ecological sensibility. I would contend that Resilience can and should be viewed as a very real, substantive, and emergent property of the complex adaptive System of Systems which is the institution in conjunction with the as built and maintained plant.

    Requisite diversity of perspective upon the operation is a critical feature and is practically a predictor of institutional resilience. One good table top exercise built around this actual case will give a high confidence insight as to how Resilient the current management is.

    Much work has been done by ecologists to model the dynamic properties of such SoS – while not predictive in the engineering sense, these institutions do have properties which are forecast-able with considerable utility. The features of such SoS which are indicative of the state of Resilience (not a scalar variable, but a Figure of Merit type measure of mitigative capacity in the face of unanticipated sequences) can be measured by means of peer-to-peer comparisons across populations of like SoS.

    That is what good regulatory operational oversight provides on a sampling basis. There are even industries where these comparisons are provided in the mode of self-assessment (e.g. within a large corporation, or via an industry consortium such as INPO/WANO in the nuclear power world).

    A caution I would offer involves the employment of the compounding meme “meta-“; if one intends to employ systems thinking then the goal for recognition purposes seems to be gaining that Ah Hah where the listener recognizes that a System of Systems is more than the Sum of the component systems effects. Walmart is a complicated collection of component systems, but its management works very intentionally to avoid it drifting into CAS behavior.

    For contrast, we can follow an Aircraft Carrier-Air Wing SoS through its progressive preparation for deployment and then on through the forward-deployment cycle right on to the return to home port and decoupling of the two major function components.

    We now have the relevant architecture to describe Resilience, and population-pattern ways in which it can be measured. What needs to be accepted is that variation in Whetherspace is variable in kind in the same manner as Weather forecasting. While we can reduce the frequency of major accidents more dependably than we can eliminate tornadoes, we are still in the business of forecasting.
