Horizon IT trial has a focus on probability theory – but it’s seemingly impossible events that cause some of the worst failures of complex systems

By Tony Collins

Can probability theory explain a single one of the Post Office’s major incidents?

Analysis and comment

One focus of the latest High Court hearings over the Post Office Horizon system has been the likelihood or otherwise of known bugs causing losses for which sub-postmasters were held responsible.

The Post Office argues that Horizon is robust and that it has countermeasures in place to ensure any errors with potentially serious consequences are detected and corrected.

So reliable is Horizon – thousands of people use it daily without lasting problems – that the Post Office has expressed no doubts about blaming sub-postmasters for losses shown on the system.

But sub-postmasters argue that they did not steal any money and that spurious losses were shown on a system that was demonstrably imperfect at times.

The arguments and counter-arguments have left journalist Nick Wallis, who is covering the trial, with this impression …

“What you can’t do is actually get a sense of whether Horizon’s bugs, errors and defects caused discrepancies for which subpostmasters have been held liable…

“The expert answer to whether Horizon is responsible for causing discrepancies in branch accounts appears to be ‘possibly’ or ‘possibly not’.”

As Wallis also points out, there may not be enough information on which to make a definitive judgement.

The Post Office’s own expert has referred to …

“levels of depth and complexity in the way Horizon actually works which the experts have not been able to plumb …”

The expert agreed that it was usually difficult to make categorical negative statements of the form: x or y never happened.

This uncertainty raises questions about how any judge can decide whether Horizon did or did not cause the losses complained of in the litigation.

As part of its case, the Post Office has used probability theory – a branch of maths – to help demonstrate the robustness of Horizon.

This was some of the evidence given in court …

“And we have to take 50,000, we divide it by 3 million, and what I get from that is I can cancel all the thousands out and I get 32 x 50/3, so that is about 500. So it is consistent with one occurrence of a bug to each claimant branch during their tenure.”

Another piece of evidence …

“the chances of the bug occurring in a Claimants’ branch would be about 2 in a million.”

The expert made it clear that statistics are not a substitute for hard facts.
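For readers who want to follow the arithmetic in the quoted evidence, here is a minimal sketch. What each figure measures is an assumption made for illustration – the transcript excerpt does not spell it out – but the cancellation the expert describes works like this:

```python
# A sketch of the scaling arithmetic quoted above. The interpretation of the
# figures is assumed for illustration; the transcript excerpt does not say
# what each number measures.

bug_occurrences_network = 50_000     # assumed: estimated bug occurrences network-wide
branch_months_network = 3_000_000    # assumed: total branch-months network-wide
claimant_branch_months = 32_000      # assumed: the "32" once the thousands cancel

# If bugs strike branches uniformly at random, the expected number of
# occurrences across the claimant branches is a simple proportion:
expected = bug_occurrences_network / branch_months_network * claimant_branch_months
print(expected)  # 533.3..., i.e. "about 500" - roughly one per claimant branch

# The same proportional reasoning lies behind the "2 in a million" figure:
# a per-branch probability computed by dividing occurrences by opportunities.
```

The arithmetic itself is trivial; the contested part is the assumption behind it – that bugs are distributed uniformly and independently across branches, rather than concentrated in particular branches by some common cause.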

But what happens when the seemingly impossible occurs?

Adding to the “levels of depth and complexity in the way Horizon actually works which the experts have not been able to plumb” is a further layer of possible uncertainty: whether bugs or complex system-related issues affected branch accounts in ways that were not detected, or that were not considered possible as part of a complex sequence of apparently random events.

Below is a list of some aircraft accidents involving technology or complex system-related problems where the seemingly impossible or the unanticipated happened.

Because these accidents involved sequences of events that had not been considered possible, or were not deemed a serious risk, the system operators (the pilots) had not been trained to mitigate the consequences.

In many of the accidents, pilots were blamed initially but investigators found, sometimes after years of tests of systems and equipment, that the aircraft was at fault.

The Post Office in the High Court hearings has compared the robustness of Horizon to aircraft and other systems. The Post Office’s QC compared Horizon’s robustness to “systems that keep aircraft in the air, that run power stations and run banks”.

Banking systems are indeed robust but they do fail sometimes; and when they do, thousands of customers can be locked out of their accounts for days, as happened at Tesco Bank, RBS and TSB.

Power station IT tends to be designed, tested and implemented, in the UK at least, to defence safety-critical standards that impose a rigour not required of commercial systems such as Horizon.

With aircraft systems, however, it may be worth looking at how similar to, or different from, Horizon they are. As air crash investigations are usually exhaustive, we, the public, know what has gone wrong because reports are published.

We know, therefore, that some of the worst air crashes are caused by occurrences of the seemingly impossible.

Aircraft manufacturers could not, with any authority or credibility, tell the investigators of a series of air crashes that, because the complexity of the system cannot be fully understood, it must be assumed that the pilots were to blame on the grounds that the aircraft is demonstrably robust.

That millions of flights take place every year without incident, and planes have triple redundancy in their flight systems and fail-safe measures in place for critical components, would not be good reason to assume pilots must be to blame for crashes.
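A toy calculation shows why redundancy, on its own, is weak evidence. The figures below are invented for illustration – they are not drawn from any certification data – but they show how the multiplication rule for independent failures collapses when a single event can defeat every redundant system at once:

```python
# Toy figures, invented for illustration: why triple redundancy does not
# by itself make total failure a once-in-a-quadrillion event.

p_single = 1e-5  # assumed per-flight failure probability of one redundant system

# If the three systems failed independently, multiply the probabilities:
p_all_independent = p_single ** 3
print(f"{p_all_independent:.0e}")  # 1e-15 - effectively "impossible"

# But if one event (an uncontained engine failure, say) can damage all three
# at once, the relevant figure is the probability of that common cause:
p_common_cause = 1e-8  # assumed, and vastly larger than 1e-15

print(p_common_cause / p_all_independent)  # about 10,000,000 times more likely

# The multiplication rule assumes independence. Common-cause events - the
# "seemingly impossible" - are precisely the cases where it does not apply.
```

The United Airlines Flight 232 accident in the list below is exactly this pattern: three hydraulic systems, each individually reliable, all severed by one fan disk failure.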

And imagine telling air crash investigators that they could use probability theory to work out the likelihood of a serious fault causing a particular major incident.

The seemingly impossible occurs – a list

These are some of the most notorious failures of complex aircraft systems where the seemingly impossible happened …

  • Nobody thought it possible that new technology on one of the world’s safest aircraft, the 737 Max, could leave pilots fighting to lift the plane’s nose at the same time as complex systems were inexplicably keeping the nose pitching towards the ground. Probability theory would not have explained what went wrong or why – because such a sequence of events was not foreseen. After the first crash, pilots were blamed. After the second crash in March 2019, various countries grounded the plane. Modifications now being made are likely to save hundreds of lives in future. In the end, despite an initial assumption of pilot error, precise faults in complex systems were identified as the probable cause of both crashes.
  • Nobody thought it possible that a computer-controlled engine on a modern jet airliner, a Boeing 767, could go into reverse thrust at nearly 30,000 feet. Because this was considered a theoretical impossibility, pilots were not trained to mitigate the consequences. Compulsory improvements after the crash have, potentially, saved many lives by avoiding similar accidents. Probability theory would not have explained what went wrong or why. Initially the pilots were blamed, but eventually it was found that they could not have recovered from such an event at that altitude.
  • The sequence of events that led an A320, one of the world’s safest and most reliable aircraft, to crash into the side of a mountain had not been anticipated. The crash’s lead investigator described the accident as a “random” event. Nobody who designed the computer-controlled plane had anticipated that a confusing screen display, an easily-made input mistake, a little-known autopilot feature that compounded the problems and the absence of a computer-based ground proximity warning system could combine with poor pilot training to cause a disaster. The crash of Air Inter Flight 148 near Strasbourg airport in France led to industry-wide changes, including a redesigned screen display, more pilot training and more widespread use of onboard warning systems. It was thought the pilots had entered “3.3” into the autopilot believing this to be the angle of descent on their approach to the airport. But the same autopilot control was used to set the rate of descent. The autopilot, being in the “wrong” mode, took the plane on a disastrous 3,300 feet-per-minute descent instead of a more relaxed 3.3-degree angle of descent (see the sketch after this list). The crash was caused by an unanticipated sequence of events. “It is a fascinating lesson about the random dimension of accidents,” said the French lead investigator Jean Paries. “Half a second before or half a second later and we wouldn’t have had the accident.” Probability theory would not have helped identify the contributory factors or the “random” sequence of events.
  • Nobody thought that rain and hail could cause both engines on a 737 to flame out. The engines were certified to cope with water. But flame out they did, on a flight from Belize to New Orleans in 1988. Amazingly, the pilots glided the unpowered airliner onto a narrow grass levee next to a canal and everyone survived. Had the plane crashed, little wreckage been recovered and everyone on board died, the pilots might have been blamed because of an absence of evidence of technical malfunction: the engines showed signs only of mild hail damage.
  • It was always considered possible, even likely, that birds could be ingested into a computer-controlled jet engine. But it was not considered likely that birds could be ingested into the core of the engine. It was even less likely that birds could be ingested into the engine’s core and stop it from working. The idea of birds being ingested into the core of two engines and greatly reducing thrust in both of them at the same time was not even tested for, nor were pilots trained to cope with it, because it was thought impossible. But nobody had considered that migrating flocks of Canada geese would be in the vicinity of New York’s LaGuardia airport. The birds weigh up to 10 pounds; the plane’s engines were certified to cope with birds weighing up to four pounds. With a loss of power in both engines, the Airbus A320 glided safely onto the Hudson river, piloted by the gifted and now-famous pilot Chesley “Sully” Sullenberger. Probability theory would have been of no use in identifying what went wrong or why.
  • It was thought impossible that a modern airliner could lose all three of its separate hydraulic systems on one flight, but that is exactly what happened on United Airlines Flight 232. The tail-mounted number 2 engine on a DC-10 had an uncontained fan disk failure in flight, which damaged all three hydraulic systems and rendered the flight controls inoperable. Nobody had considered that a rupture could occur just below the tail engine, where all three hydraulic systems ran in close proximity. The exploding engine hurled fragments that ruptured all three lines, resulting in total loss of control of the elevators, ailerons, spoilers, horizontal stabilizer, rudder, flaps and slats. Probability theory could not have identified the cause.
  • At first, pilots were blamed for a series of 737 crashes in which a suspected component was tested by investigators but performed perfectly every time. After more than five years of investigations, hundreds of fatalities and thousands of tests on the component, it was discovered that in a very rare set of specific circumstances the component could not only jam but jam in a way that left the rudder in an extreme position on the opposite side to that expected. This was the equivalent of a car driver turning the steering wheel left and having it jam hard over to the right. The seemingly impossible had happened. No probability theory would have helped identify the fault.
  • Nobody had thought it possible that a Chinook helicopter tethered to the ground at Wilmington, Delaware, USA, during tests could be destroyed by an uncontrollably surging computer-controlled engine. It happened because an electrical lead had been unplugged to simulate an electrical failure. The engine software had not been programmed to cope with such an eventuality. It kept pumping fuel into the engine because the software misinterpreted the unplugged lead as evidence the engine was delivering insufficient power.
  • Nobody had considered the possibility of wasps contributing to the deaths of all 189 people on a 757 bound for Germany. The wasps were thought to have nested in a pitot tube which fed incorrect data to the cockpit instruments. As a result, pilots were told simultaneously that the plane was flying too fast (which can cause break-up of the airframe) and too slowly (which can cause a stall and send the plane plummeting to the ground). As such an eventuality was not considered possible, pilots were not trained to cope with the effects of a blocked pitot tube or with conflicting warnings that they were flying too fast and too slowly at the same time. Probability theory would not have helped identify the cause.
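The Air Inter item above turns on a classic mode confusion, and it is worth seeing how small the trap is. The sketch below is not the actual A320 autopilot logic – it is a minimal, assumed model – but it shows how one dialled value can command two very different descents depending on the active mode:

```python
# A minimal, assumed model of the mode confusion described above - not the
# actual A320 autopilot logic. The same dialled value means different things
# depending on which descent mode is active.

def commanded_descent(dial_value: float, mode: str) -> str:
    """Interpret the pilot's input according to the active autopilot mode."""
    if mode == "FPA":  # flight path angle mode: the value is degrees of descent
        return f"descend on a {dial_value} degree path (a gentle approach)"
    if mode == "VS":   # vertical speed mode: the value scales to feet per minute
        return f"descend at {dial_value * 1000:.0f} feet per minute (very steep)"
    raise ValueError(f"unknown mode: {mode}")

# The pilots intended a 3.3 degree approach ...
print(commanded_descent(3.3, "FPA"))  # descend on a 3.3 degree path

# ... but with the autopilot in the "wrong" mode, the same input commanded
# a 3,300 feet-per-minute plunge:
print(commanded_descent(3.3, "VS"))   # descend at 3300 feet per minute
```

Neither mode is faulty in isolation; the accident lived in the combination of display, mode and expectation – the kind of interaction that no per-component failure probability captures.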

So what? – Horizon is not an aircraft system

There are more similar incidents in which the seemingly impossible happened.

All of the aircraft had duplicate or triplicate critical components, methods of error detection and correction, contingency measures, built-in redundancy – and a great deal more in terms of rigorous real-world user testing, independent analyses and firm change control.

And still the aircraft or its complex systems failed. Probability theory and statistics would have solved none of the incidents.

Investigators identified a probable cause after each incident by having a full understanding of the systems and equipment involved, full disclosure of information, in most cases the print-outs from black boxes, and dogged independent investigations that sometimes involved years of tests of single components on multi-million dollar test rigs.

There has been no requirement to determine the exact cause of every major incident involving Horizon.

How, then, can anyone know for certain that Horizon was performing as expected when sub-postmasters were blamed for losses of tens of thousands of pounds – losses that turned out to be ruinous for them and their families, and on rare occasions led to suicide?

Can probability theory explain a single one of the Post Office’s major incidents?

The Post Office will continue to argue its Horizon system is robust. But the complex systems on more than a dozen planes that crashed, causing the loss of hundreds of lives, were also robust.

The planes crashed not because a one-in-a-million risk materialised but because of a series of events that designers had not considered possible. For this reason, there were no procedures for coping with the events.

We know about the random events and seemingly impossible causes of air crashes because they are among the most thoroughly investigated of all failures of complex systems. Lessons are required to be learnt.

But how does all this leave us on the question of whether Horizon did or did not cause the losses in question?

Perhaps the truth is best summed up in Nick Wallis’ comment that the expert answer to whether Horizon is responsible for causing discrepancies in branch accounts seems to be “possibly” or “possibly not”.

But does “possibly” or “possibly not” provide strong enough grounds for Post Office actions that have ruined hundreds of lives?

More to the point, we have with the Horizon system, as with air crashes, evidence of major incidents. Every accusation against a sub-postmaster who denies any knowledge of losses is a major incident. On this basis, there is evidence of hundreds of major Post Office incidents.

But working back from each major incident, there is no full understanding by experts of how exactly the systems worked.

As the Post Office’s expert put it, there are “levels of depth and complexity in the way Horizon actually works which the experts have not been able to plumb …”

There are no print-outs from accident black boxes. There are no single investigations that have taken years to establish the full truth, and no multi-million pound test rigs on which to assess all the technology in question.

In short, there are many more questions than answers. Is this a just basis for the Post Office’s 100% certainty that it was right to blame sub-postmasters for losses shown on Horizon?

With uncertainty over the exact cause(s) of each incident, was it just and right to require sub-postmasters to make good losses shown on Horizon?

Arguably, that is the same as, or similar to, holding pilots, whether dead or alive, responsible for plane crashes that could have been caused by a random sequence of events thought impossible.

Would you feel safer in a plane or running a village post office?

Nick Wallis’ postofficetrial blog

Karl Flinders has reported extensively on Horizon and the trials for Computer Weekly.

Tim McCormack’s “Problems with POL [Post Office Ltd]” blog.

Stephen Mason, barrister and associate research fellow at the Institute of Advanced Legal Studies in London, has written an excellent article (related to the Horizon dispute): “the use of the word robust to describe software code”.

4 responses to “Horizon IT trial has a focus on probability theory – but it’s seemingly impossible events that cause some of the worst failures of complex systems”

  1. Fantastic examples – thank you, Tony.

    I’ll keep this short as I’d only be repeating myself. All good managers are aware of the “black swan” phenomenon. They may not be able to solve such a problem immediately but at least they don’t freeze in shock when it arrives. Poor managers, however, barely cope with day-to-day systems, so when the black swan arrives they freeze in denial – “this can’t be happening, particularly to me!” – a version of the P.O. more or less claiming the infallibility of Horizon.
    I doubt we need poor old physics being dragged in to make a point in a case that has already demonstrated from the witness box that the most talented of problem solvers are not employed by the Post Office.
    Over the weekend, I hope to have the time to appreciate the case studies you have provided.

    In the meantime, thank you again.

    Kindest,
    Z


  2. Mark Hudson

    I think probability is the wrong theory for what is essentially a transactional system.

    https://en.wikipedia.org/wiki/Queueing_theory

    Sadly, Jim Gray, who’d have made an excellent expert witness, is dead.

    There are some people around who are pretty hot on transactional systems. Sadly they seem to have never worked for the PO …

