I was reviewing the Therac-25 disaster, and a few interesting things stood out concerning risk analysis, lessons which transcend a multitude of other areas. In a nutshell, the Therac-25 was a radiation therapy machine operated under software control, and a combination of assumptions, processes, operator UI issues, business pressures, and design factors resulted in its failure. Namely, it delivered a massive overdose of radiation under certain conditions, which led to a painful and debilitating death for many.
While the actual systemic failure involves a multitude of arenas, I just want to look at the risk analysis aspect today.
First, a couple of quotes:
William Ruckelshaus, two-time head of the US Environmental Protection Agency: “risk assessment data can be like the captured spy; if you torture it long enough, it will tell you anything you want to know.” (“Risk in a Free Society,” Risk Analysis, Vol. 4, No. 3, 1984, pp. 157-162.)
E.A. Ryder of the British Health and Safety Executive: risk assessment “should only be played in private between consenting adults, as it is too easy to be misinterpreted.” (“The Control of Major Hazards: The Advisory Committee’s Third and Final Report,” Transcript of Conf. European Major Hazards, Oyez Scientific and Technical Services and Authors, London, 1984.)
Next, regulatory bodies regulate what they know, especially in the arena of risk analysis. That leads to holes, and in plugging those holes, many unintended consequences can result. For example, microswitches were added to verify the positioning of the shields (for lack of a better term), and a risk analysis put the probability of an incorrect dosage at nearly zero. First off, the risk analysis only looked at the failure of the switches themselves, and did not take into account the software monitoring them. Secondly, it was suggested that a potentiometer be added to measure exact position, but this was discounted because a pot’s failure rate is much higher than that of a combination of switches. However, had an analog input been added, it would have required new code, or better yet a completely different positioning system, and there is a fair chance that would have prevented the problem. Ideally, a separate hardware sensor and shutdown system would have been the way to go, albeit too costly for something the risk analysis folks said should never occur.
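To make the cross-check idea concrete, here is a minimal sketch in C of how a discrete microswitch and an analog pot might both gate the beam. The names, thresholds, and readings are all made up for illustration, and this is emphatically not the actual Therac-25 logic; the point is just that disagreement between independent sensors gets treated as a fault, never resolved in favor of either one:

```c
/* Hypothetical sketch: cross-checking redundant position sensors
 * before permitting the beam. All names and values are illustrative. */
#include <stdbool.h>
#include <stdio.h>

/* Discrete microswitch: closed when the shield/turntable is seated. */
typedef enum { SWITCH_OPEN, SWITCH_CLOSED } switch_state_t;

/* Assume the pot reads 0..1023 and "seated" is a narrow band. */
#define POT_SEATED_MIN 500
#define POT_SEATED_MAX 524

/* Beam is permitted only when BOTH independent sensors agree the
 * mechanism is in position. */
bool position_verified(switch_state_t sw, int pot_reading)
{
    bool switch_ok = (sw == SWITCH_CLOSED);
    bool pot_ok = (pot_reading >= POT_SEATED_MIN &&
                   pot_reading <= POT_SEATED_MAX);

    if (switch_ok != pot_ok) {
        /* Sensors disagree: a sensor has failed or the mechanism is
         * mid-travel. Fail safe and flag it for investigation. */
        fprintf(stderr, "position sensor disagreement (sw=%d pot=%d)\n",
                sw, pot_reading);
        return false;
    }
    return switch_ok && pot_ok;
}

int main(void)
{
    /* Both agree, seated: beam may be enabled. */
    printf("agree, seated:   %d\n", position_verified(SWITCH_CLOSED, 512));
    /* Switch stuck closed, pot says mid-travel: fault, beam stays off. */
    printf("disagree:        %d\n", position_verified(SWITCH_CLOSED, 300));
    /* Both agree, unseated: beam stays off. */
    printf("agree, unseated: %d\n", position_verified(SWITCH_OPEN, 100));
    return 0;
}
```

Even this toy version shows why the pot was worth the extra code: two dissimilar sensors with different failure modes catch the stuck-switch case that a switch-only analysis declared nearly impossible.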
The end result: the assumptions and prior experience behind the risk analysis, combined with new technology (i.e., software control), led to the conclusion that things were ok. This, combined with a number of other systemic/process failures, resulted in a number of people dying and significant injury to others.
There are too many factors to assign the failure to a single cause, but one thing that can be learned from this is that risk analysis is based upon experience, and that experience may have yet to catch all the potential failure modes.
Granted, I am a huge fan of risk analysis, and it is a good tool for sure… but it is a limited tool, and should not be depended upon 100%. Just as AAA ratings and credit default swaps turned out to be not so great in the credit world, so too can risk analysis in the product development world.
I have a saying that I use with my flight students: if something seems fishy, it probably is. STOP and investigate it. Had the Therac team stopped when the first event occurred, and rather than relying on their risk analysis and the fact that they couldn’t duplicate the failure in the lab, investigated on site with full monitoring, it’s possible they might have caught this earlier.
That’s a warning for all of us… sure, no one likes to go to a customer site and potentially spend many thousands of dollars to catch that 1:10,000,000 event that seems to happen way too frequently, but it’s a lot better than having people die at worst, or major recalls and financial troubles at best. There are so many things one simply cannot duplicate outside of real-life, on-site testing, whether it be risk models, user experiences, or operating practices. In mission-critical applications, it’s just too important not to investigate.