MSc-IT Study Material
June 2010 Edition

Computer Science Department, University of Cape Town

Safety critical systems

A safety critical system is one which can have catastrophic consequences if it fails. So far we have been talking about productivity: the amount of value a system adds. In the world of safety critical systems 'reliability' is crucial. Reliability is a measure of how often and how dramatically a system fails.

In an ideal world we would want no accidents to occur at all, but in the real world accidents will happen, so safety critical system engineers put a great deal of effort into ensuring the safety of their systems. Essentially they attempt to trade off the severity of a possible accident against the likelihood of that accident occurring. A severe accident is acceptable if it is extremely unlikely to happen; conversely, a likely accident is acceptable if it is not very severe. Safety engineers work to mitigate accidents by a combination of making them less likely and making them less severe.
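
The sketch below illustrates this trade-off in the simplest possible terms, treating risk as severity multiplied by likelihood; the hazards and all of the numbers are invented for demonstration and are not drawn from any real safety analysis.

```python
# Toy illustration of the severity/likelihood trade-off: risk is treated
# as severity multiplied by likelihood. Hazard names and numbers are
# invented for demonstration only.

hazards = {
    # hazard: (severity on a 1-10 scale, expected occurrences per year)
    "uncontained engine failure": (9, 0.0001),
    "spurious warning light":     (2, 0.5),
}

for name, (severity, likelihood) in hazards.items():
    risk = severity * likelihood
    print(f"{name}: risk = {risk:.4f}")

# A severe but very rare hazard can score lower than a mild but frequent
# one; mitigation can work on either factor - making an accident less
# likely, or making it less severe.
```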

However, there have been a number of very high-profile fatal accidents recently where the failure was put down to 'operator error'. We will present arguments that 'operator error' is a misnomer: it is 'system error', and much of the time errors involving users are predictable and preventable.

First consider the following two cases:

The Kegworth air disaster

On 8th January 1989, the British Midland aircraft G-OBME was flying from London's Heathrow Airport to Belfast. The aeroplane was a twin-engined Boeing 737 with a new computerised set of cockpit controls. One of the blades in the left engine detached and the crew began emergency action. The plane could be flown safely on one engine, so the procedure was to shut down the damaged engine and fly on the working one. The crew informed ground staff and an emergency landing was scheduled at East Midlands Airport, near Kegworth.

The crew were aware that one of the engines was damaged, not only because of instrument readings in the cockpit but because of a loud juddering. The captain throttled down and shut off one of the engines, and the juddering stopped. He had, however, shut down the good engine; the juddering had stopped only by an unfortunate fluke of aerodynamics. The aeroplane flew on powered solely by its damaged engine. The instrument panel displayed this fact, but not in a way that any of the crew noticed. On final approach to the runway the damaged engine failed completely and the plane began to fall out of the sky. The crew realised the error and tried to restart the good engine, but could not do so in time. The plane crash-landed onto the M1 motorway just short of the runway at East Midlands Airport. 47 passengers were killed and 74 people were seriously injured. The captain survived.

In the aftermath, initial reports stated that both engines had failed, and the tabloid press proclaimed the captain a hero for managing to get the plane so close to the airport and landing it in such a way that many passengers actually survived. After the truth came out in the crash enquiry, the same tabloid press denounced the captain and placed the blame firmly with him.

The Paddington rail crash

On 5th October 1999 a slow-moving commuter train left Paddington station in London heading for Bedwyn in Wiltshire. Just outside Paddington there is a large bank of signals on a gantry over the line. The signal for the particular line that the commuter train was on was at red. It was also notoriously difficult to see, as it was partially obscured by overhead electrical cables. The driver of the commuter train passed the red signal, and his train was crossing over several lines when it was struck head on by a high-speed intercity train travelling to London from Cheltenham. 31 people, including both drivers, were killed in the collision, and many more were severely injured by the fireball which swept through the front carriages of the intercity train. The accident closed Paddington station, one of the busiest in London, for over a week, inconveniencing over a million travellers.

Subsequently it became clear that 'SPAD' (signal passed at danger) events were quite common, and that systems recommended years earlier which would have prevented such accidents had not been implemented. The signalling systems were claimed to be so unreliable that drivers often knowingly passed through red lights; had they not, the rail system would most likely have ground to a halt.

What is interesting is the reporting of these two terrible accidents. In the case of the Kegworth crash, blame was firmly placed with the pilot and crew, and they were vilified. Ten years later, however, attention was drawn much more to the system that the train driver had to use, and the consensus opinion was that blame should lie with the system, and that the system needed modifying.

At the time of writing, the Paddington rail crash enquiry is just getting under way. It will be interesting to see whether it comes to the 'operator error' conclusion this time.

Operator error

Safety engineers collect and analyse a huge amount of data about the technology they use to build safety critical systems. The hardware they use will be thoroughly tested, with known and documented failure tolerances. They use the most rigorous software development techniques to ensure that requirements for the system are collected and that those requirements actually reflect what is wanted of the system (requirements gathering is a notoriously tricky business). The software is then rigorously developed to meet those requirements. At all stages in the process thorough testing and validation are employed. The engineers expend prodigious effort in ensuring that they 'build the right system, and build the system right'. Many experts with experience of systems similar to the one being developed are consulted, and a considerable amount of time and money is expended in order to get a system that is certifiably and demonstrably correct.

All too often, at this point this bullet-proof technology is handed to a hapless operator who does the wrong thing with it and causes a serious accident. In subsequent enquiries the failure is put down to 'operator error', which is held to be in some way unavoidable. The failure cannot lie with the technology, particularly after all the effort that was put into ensuring its safety. 'Operator error' is a very convenient get-out clause for the system developers: it allows them to shift blame away from their product and onto its operators.

When questioned, many developers make it clear that they do not consider users to be part of the systems they develop; they consider users to be external to their systems. Therefore, when users cause errors, the blame for those errors lies comfortably outside their system.

A widening of what a system is

What is needed is a wider conception of what a system is. Consider a common safety critical system: the car. Cars are very dangerous, so much so that the law insists that their users (drivers) undergo training and pass examinations before being allowed to use them. A car without a driver is not a complete system; unless it is rolled down a hill it will not do anything very dangerous on its own. Therefore a safety analysis of a car which does not take its driver into account would be a fairly sterile exercise.

But many safety critical systems are put in place which do not take their users into account at all. Not only should users be brought into safety analyses, but much more detail about the environment that forms part of the system also needs to be considered.

There are a variety of reasons why users are not generally included in safety analyses. Safety analysts like dealing with numbers. Given a piece of hardware, they like to know how many times it is likely to break down in the next ten years, and the chances are there will be data available to tell them. They also like to know the cost of each breakdown, and there may be data telling them this too. They can then multiply the cost of a breakdown by the number of times breakdown is expected to occur and produce a quantitative estimate of how much running that piece of hardware for the next ten years will cost.
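
As a rough illustration of that arithmetic, here is a minimal sketch; the failure rate and cost figures are invented for demonstration and are not drawn from any real reliability data.

```python
# Hypothetical expected-cost estimate for a piece of hardware over ten
# years: expected number of breakdowns multiplied by the cost of each one.
# All figures are invented for illustration.

breakdowns_per_year = 0.3    # from (hypothetical) reliability data
cost_per_breakdown = 12_000  # repair, downtime, etc. (hypothetical)
years = 10

expected_cost = breakdowns_per_year * years * cost_per_breakdown
print(f"Estimated ten-year breakdown cost: {expected_cost:,.0f}")  # 36,000
```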

Such data does not exist for humans, or is very suspect where it does. Because data about human behaviour do not fit easily into the processes developed for reasoning about hardware and software, they tend to be ignored. Information about human behaviour is also not 'black and white': arguments about the behaviour of automated computers can give yes or no answers, but arguments about human behaviour will be much less deterministic. Because of this, taking account of users in safety analyses is rare.

The fact that questions about human behaviour cannot generally be answered with a simple yes or no does not mean, however, that no answers can be given. Psychology has developed many models of human behaviour that accurately describe and predict human performance. In particular, there are many valid models of human perception that could have pointed to problems with the display configuration in the cockpit of the airliner that crashed near Kegworth.

The problem is to define design processes that can make use of the models of human behaviour developed by psychologists and sociologists.

Review Question 4

How do safety engineers work to reduce the impact of accidents? Why is it crucial to reduce the likelihood of accidents happening?

The answer to this question can be found at the end of the chapter.

Review Question 5

What differentiates the failure of a normal interactive system from the failure of a safety critical system?

The answer to this question can be found at the end of the chapter.

Responsibility for safety critical systems

In the previous section we discussed responsibility for interactive systems. We suggested that developers should move towards being responsible to users as well as to customers. Responsibility for safety critical systems should spread even wider. The failure of a normal interactive system impinges on the user and possibly the user's employer: people who have, to a certain extent, agreed to use the system. The failure of a safety critical system impinges on a much wider stage and can affect people who have in no way made a decision to be part of the system.

Consider the following hypothetical example:

I step out onto a road crossing with a car about 200 yards away heading towards me. The car's anti-lock braking (ABS) system fails; the driver tries to avoid me but strikes me. The car is a hire car. The ABS system was developed by a subcontractor to the car manufacturer.

Who is responsible for the accident?

You are invited to discuss these issues in Discussion 2.