Using probability calculus to evaluate evidence for alternative hypotheses

In this tutorial I will try to show how the formal framework of hypothesis testing based on probability theory can separate subjective beliefs about the plausibility of alternative explanations, on which we can agree to differ, from the evaluation of the weight of evidence supporting each of these alternative explanations, on which it should be easier to reach a consensus. We can then begin to apply this framework to evaluate alternative explanations of current events, where some of these alternatives may invoke “fake news” or systematic deception. An interesting attempt to apply this framework systematically is the Rootclaim project, founded by Saar Wilf, an Israeli entrepreneur (and noted international poker player).

Although the mathematical basis for using evidence from observations to update the probability of a hypothesis was first set out by the 18th-century clergyman Thomas Bayes, the first practical use of this framework was for cryptanalysis by Alan Turing at Bletchley Park. His assistant Jack Good later elaborated it into a general approach to evaluating evidence and testing hypotheses. This approach has been standard practice in genetics since the 1950s, and has spread into many other fields of scientific research, especially astronomy. It underlies the revolution in machine learning and artificial intelligence that is beginning to transform our lives. The practical usefulness of the Bayes-Turing framework is not in question, but practical usefulness alone does not establish that it is the only logical way to evaluate evidence. That stronger claim rests on the work of the physicist Richard Cox, who showed that degrees of belief must obey the mathematical rules of probability theory if they are to satisfy simple requirements of logical consistency. Another physicist, Edwin Jaynes, drew together the approach developed by Turing and Good with Cox’s proof to develop a philosophical framework for using Bayesian inference to evaluate uncertain propositions. In this framework, Bayesian inference is just an extension of the ordinary rules of logic to manipulating uncertain propositions; any other way of evaluating evidence would violate rules of logical consistency. Too many names (not limited to Bayes, Turing, Good, Cox and Jaynes) are attached to the development of this framework to name it after all of them, so I’ll follow Jaynes and just call it probability calculus.

The objective of this tutorial is to show you how to evaluate evidence for yourself using simple back-of-the-envelope calculations based on probability calculus.

Some fundamental principles of probability calculus can be expressed without using mathematical language:-

For a light-hearted tutorial in how to apply these principles in everyday life, try this exercise.

To take the argument further, I need to explain some simple maths. If you already have a basic grounding in Bayesian inference, you can skip to the next section. Otherwise, you can work through the brief tutorial below, or try an online tutorial like this one.

Suppose you are comparing two alternative hypotheses, \(\textrm{H}_1\) and \(\textrm{H}_2\). Before you have seen the evidence, your degree of belief in which of these alternatives is correct can be represented as your prior odds. For instance, if you believe \(\textrm{H}_1\) and \(\textrm{H}_2\) are equally probable, your prior odds are 1 to 1, or even odds in everyday language. After you have seen the evidence, your prior odds are updated to become your posterior odds.

Bayes’ theorem specifies how evidence updates prior odds to posterior odds. The theorem can be stated in the form:-

\[ \left(\textrm{prior odds of H}_1 \textrm{ to H}_2 \right) \times \frac{\textrm{likelihood of H}_1} {\textrm{likelihood of H}_2} = \left(\textrm{posterior odds of H}_1 \textrm{ to H}_2 \right) \]

Examples

  1. You have two alternative hypotheses about a coin that is to be tossed: \(\textrm{H}_1\) that the coin is fair, and \(\textrm{H}_2\) that the coin is two-headed. In most situations your prior belief would be that \(\textrm{H}_1\) is far more probable than \(\textrm{H}_2\). Given the observation that the coin comes up heads when tossed once, the likelihood of a fair coin is 0.5 and the likelihood of a two-headed coin is 1. The likelihood ratio favouring a two-headed coin over a fair coin is 2. This won’t change your prior odds much. If, after the first ten tosses, the coin has come up heads every time, the likelihood ratio is \(2^{10} = 1024\), perhaps enough for you to suspect that someone has got hold of a two-headed coin.

  2. Hypothesis \(\textrm{H}_1\) is that all crows are black (as in eastern Scotland), and hypothesis \(\textrm{H}_2\) is that only 1 in 8 crows are black (as in Ireland, where most crows are grey). The first crow you observe is black. Given this single observation, the likelihood of \(\textrm{H}_1\) is 1, and the likelihood of \(\textrm{H}_2\) is 1/8. The likelihood ratio favouring \(\textrm{H}_1\) over \(\textrm{H}_2\) is 8. So if your prior odds were 2 to 1 in favour of \(\textrm{H}_1\), your posterior odds, after this first observation, will be 16 to 1. This posterior will be your prior when you next observe a crow. If this next crow is also black, the likelihood ratio contributed by this observation is again 8, and your posterior odds favouring \(\textrm{H}_1\) over \(\textrm{H}_2\) will be updated to (\(16 \times 8 = 128\)) to 1. Both examples are worked through in the short code sketch below.
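Here is that sketch: a minimal rendering of Bayes’ theorem in odds form in Python. The helper function `update_odds` is my own naming for this tutorial, not a standard library routine.

```python
# Minimal sketch of Bayes' theorem in odds form.

def update_odds(prior_odds, likelihood_h1, likelihood_h2):
    """Multiply the prior odds of H1 to H2 by the likelihood ratio."""
    return prior_odds * (likelihood_h1 / likelihood_h2)

# Example 1: two-headed coin (numerator) vs fair coin (denominator).
# One head observed: likelihood ratio is 1 / 0.5 = 2.
odds = update_odds(1.0, 1.0, 0.5)
print(odds)  # 2.0 -- even prior odds become 2 to 1

# Ten heads in a row: apply the update once per toss.
odds = 1.0
for _ in range(10):
    odds = update_odds(odds, 1.0, 0.5)
print(odds)  # 1024.0, i.e. 2**10

# Example 2: H1 (all crows black) vs H2 (1 in 8 black), prior odds 2 to 1.
odds = update_odds(2.0, 1.0, 1 / 8)   # first black crow -> 16 to 1
odds = update_odds(odds, 1.0, 1 / 8)  # second black crow -> 128 to 1
print(odds)  # 128.0
```

Note that the updating rule is the same at every step: the posterior odds after one observation become the prior odds for the next.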

Bayes’ theorem can be expressed in an alternative form by taking logarithms. If your maths course didn’t cover logarithms, don’t be put off. To keep things simple, we’ll work in logarithms to base 2. The logarithm (to base 2) of a number is defined as the power of 2 that equals the number: for instance, the logarithm of 8 is 3 (2 to the power of 3 equals 8), the logarithm of \(1/8\) is minus 3, and the logarithm of 1 is zero. Taking logarithms replaces multiplication and division by addition and subtraction, which is why, if you went through secondary school before the arrival of cheap electronic calculators, you were taught to use logarithms for calculations. But logarithms are not just an aid to calculation; they are fundamental to using maths to solve problems in the real world, especially problems that have to do with information.

The logarithm of the likelihood ratio is called the weight of evidence favouring \(\textrm{H}_1\) over \(\textrm{H}_2\). As taking logarithms replaces multiplying by adding, we can rewrite Bayes’ theorem as

\[ \textrm{prior weight} + \textrm{weight of evidence} = \textrm{posterior weight} \]

where the prior weight and posterior weight are respectively the logarithms of the prior odds and posterior odds. If we use logarithms to base 2, the units of measurement of weight are called bits (binary digits).

So we can rewrite the crow example (prior odds 2 to 1, likelihood ratio 8, posterior odds \(2 \times 8 = 16\)) as

prior weight = 1 bit (\(2^1 = 2\))

weight of evidence = 3 bits (\(2^3 = 8\))

posterior weight = 1 + 3 = 4 bits (\(2^4 = 16\))

One advantage of working with logarithms is that it gives us an intuitive feel for the accumulation of evidence: weights of evidence from independent observations can be added, just like physical weights. Thus in the coin-tossing example above, after one toss of the coin has come up heads the weight of evidence is one bit. After the first ten coin tosses have come up heads, the weight of evidence favouring a two-headed coin is 10 bits. As a rule of thumb, 1 bit of evidence can be interpreted as a hint, 2 to 3 bits as weak evidence, 5 to 6 bits as modest evidence, and anything more than that as strong evidence.
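The same arithmetic can be checked in a few lines. As before, `weight_of_evidence` is a hypothetical helper written for this tutorial:

```python
from math import log2

def weight_of_evidence(likelihood_h1, likelihood_h2):
    """Weight of evidence favouring H1 over H2, in bits."""
    return log2(likelihood_h1 / likelihood_h2)

# Coin example: each head contributes log2(1 / 0.5) = 1 bit favouring
# the two-headed coin; weights from independent tosses simply add.
per_head = weight_of_evidence(1.0, 0.5)
print(per_head, 10 * per_head)  # 1.0 bit per head, 10.0 bits after ten heads

# Crow example: prior weight + weight of evidence = posterior weight.
prior_weight = log2(2)                   # prior odds 2 to 1 -> 1 bit
evidence = weight_of_evidence(1.0, 1/8)  # one black crow -> 3 bits
print(prior_weight + evidence)           # 4.0 bits, i.e. odds of 2**4 = 16 to 1
```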

Hempel’s paradox

Within the framework of probability calculus we can resolve a problem first stated by the German philosopher Carl Gustav Hempel. What he called a paradox can be stated in the following form:

An observation that is consistent with a hypothesis is not necessarily evidence in favour of that hypothesis.

Good showed that this is not a paradox, but a corollary of Bayes’ theorem. To explain this, he constructed a simple example (I have changed the numbers to make it easier to work in logarithms to base 2). Suppose there are two Scottish islands denoted A and B. On island A, there are \(2^{15}\) birds of which \(2^{6}\) are crows and all these crows are black. On island B, there are \(2^{15}\) birds of which \(2^{12}\) are crows and \(2^9\) of these crows (that is, one eighth of all crows) are black. You wake up on one of these islands and the first bird that you observe is a black crow. Is this evidence that you are on island A, where all crows are black?

You can’t do inference without making assumptions. I’ll assume that on each island all birds, whatever their species or colour, have an equal chance of being seen first. The likelihood of island A, given this observation, is \(2^{6}/2^{15} = 2^{-9}\). The likelihood of island B is \(2^{9}/2^{15} = 2^{-6}\). The weight of evidence favouring island B over island A is \(\left[-6 - \left(-9\right)\right] = 3\) bits. So the observation of a black crow is evidence against the hypothesis that you are on island A, where all crows are black. When two hypotheses are compared, an observation that is consistent with a hypothesis can nevertheless be evidence against that hypothesis.
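A short sketch of the island calculation, under the stated assumption that every bird on an island is equally likely to be seen first:

```python
from math import log2

birds = 2 ** 15  # total birds on each island

# Probability that the first bird seen is a black crow:
p_island_a = 2 ** 6 / birds  # island A: all 2**6 crows are black -> 2**-9
p_island_b = 2 ** 9 / birds  # island B: 2**9 black crows -> 2**-6

print(log2(p_island_b / p_island_a))  # 3.0 bits favouring island B
```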

The converse also applies: an observation that is highly improbable given a hypothesis is not necessarily evidence against that hypothesis. As an example, we can evaluate the evidence for a hypothesis that most readers will consider an implausible conspiracy theory: that the Twin Towers of the World Trade Center were brought down not by the hijacked planes that crashed into them but by demolition charges placed in advance, with the objective of bringing about a “new Pearl Harbour” in the form of a catastrophic event that would provoke the US into asserting military dominance. We’ll call the two alternative hypotheses for the cause of the collapses (plane crashes, planned demolitions) \(\textrm{H}_1\) and \(\textrm{H}_2\) respectively. The proponents of this hypothesis attach great importance to the observation that a nearby smaller tower (Building 7) collapsed several hours after the Twin Towers, for reasons that are not obvious to non-experts. I have no expertise in structural engineering, but I’m prepared to go along with their assessment that the collapse of a nearby smaller tower has low probability given \(\textrm{H}_1\). However, I also assess that the probability of this observation given \(\textrm{H}_2\) is equally low: if the planners’ objective in destroying the Twin Towers was to create a catastrophic event, why would they have planned to demolish a nearby smaller tower several hours later, with the risk of giving away the whole operation? For the sake of argument, I’ll put a value of 0.05 on both these likelihoods.

Note that it doesn’t matter whether the observation is stated as “collapse of a nearby tower”, for which the likelihoods of \(\textrm{H}_1\) and \(\textrm{H}_2\) are both 0.05, or as “collapse of Building 7”, for which (if there were five such buildings, all equally unlikely to collapse) the likelihoods of \(\textrm{H}_1\) and \(\textrm{H}_2\) would both be 0.01. For inference, all that matters is the ratio of the likelihoods of \(\textrm{H}_1\) and \(\textrm{H}_2\) given this observation. If this ratio is 1, the weight of evidence favouring \(\textrm{H}_1\) over \(\textrm{H}_2\) is zero, as the calculation below makes explicit.
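Written out in the notation used above, with the likelihood values I have assumed for the sake of argument:

\[ \frac{\textrm{likelihood of H}_1}{\textrm{likelihood of H}_2} = \frac{0.05}{0.05} = \frac{0.01}{0.01} = 1, \qquad \textrm{weight of evidence} = \log_2 1 = 0 \textrm{ bits} \]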

The conditional probabilities in this example are my subjective judgements. I make no apology for this; the logic of probability calculus says that you can’t evaluate evidence without making these subjective judgements, that these subjective judgements must obey the rules of probability theory, and that any other way of evaluating evidence violates axioms of logical consistency. If your assessment of these conditional probabilities differs from mine, that’s not a problem, as long as your assessments are logically consistent and you can explain to others the judgements on which they are based. The general point, on which I think most readers will agree, is that although the collapse of a nearby smaller tower would not have been predicted from \(\textrm{H}_1\), it would not have been predicted from \(\textrm{H}_2\) either. The likelihood of a hypothesis given an observation measures how well the hypothesis would have predicted that observation.

We can see from this example that to evaluate the evidence favouring \(\textrm{H}_1\) over \(\textrm{H}_2\), you have to assess, for each hypothesis in turn, what you would expect to observe if that hypothesis were true. Like a detective solving a murder, you have to “speculate”, for each possible suspect, how the crime would have been carried out if that individual were the perpetrator. This requirement is imposed by the logic of probability calculus: complying with it inevitably requires you to speculate, but does not imply that you are a “conspiracy theorist”.

Evidence contributed by the non-occurrence of an expected event

To evaluate all relevant evidence, we must include the non-occurrence of events that would have been expected under at least one of the alternative hypotheses. This is the principle set out in “the curious incident of the dog in the night-time” in the Sherlock Holmes story Silver Blaze: Holmes noted that the observation that the dog did not bark had low probability given the hypothesis of an unrecognized intruder, but high probability given the hypothesis that the horse was taken by someone the dog knew.

How widely can probability calculus be applied to evaluate evidence?

The principle of evaluating how the data could have been generated under alternative hypotheses applies in many fields: for instance, medical diagnosis, historical investigation, and intelligence analysis. A manual on intelligence analysis sets out a procedure for “analysis of competing hypotheses” which “demands that analysts explicitly identify all the reasonable alternative hypotheses, then array the evidence against each hypothesis — rather than evaluating the plausibility of each hypothesis one at a time”. I am not trying to tell people who are expert in these professions that they don’t know how to evaluate evidence. However it can still be useful to work through the formal framework of probability calculus to identify when intuition is misleading. For instance, where two analysts evaluating the same observations disagree on the weight of evidence, working through the calculation will identify where their assumptions differ, and how the evaluation of evidence depends on these assumptions.

An interesting argument about the use of Bayesian evidence in court can be found in this judgement of the Appeal Court in 2010. In a murder trial, the forensic expert had given evidence that there was “moderate scientific support” for a match of the defendant’s shoes to the shoe marks at the crime scene, but had not disclosed that this opinion was based on calculating a likelihood ratio. The judges held that where likelihood ratios would have to be calculated from statistical data that were uncertain and incomplete, such calculations should not be used by experts to form the opinions that they presented to the court. However, the logic of probability calculus implies that you cannot evaluate the strength of evidence except as a likelihood ratio. Calculating this ratio makes explicit the assumptions that are used to assess the strength of evidence. In this case, the expert had used national data on shoe sales to assign the likelihood that the shoe marks were made by someone else, given that the marks were made by size 11 trainers. This conditional probability of size 11 trainers, given that the marks were made by someone else, should instead have been based on the frequency of size 11 trainers among people present at similar crime scenes. It was because the calculations were made available at the appeal that the judges were able to criticize the assumptions on which they were based and to overturn the conviction.
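To see how the choice of reference population changes the weight of evidence, here is a sketch with purely illustrative numbers; the probabilities below are invented for this example, not taken from the judgement:

```python
from math import log2

# P(marks made by size 11 trainers | marks made by the defendant's shoes):
p_given_defendant = 0.9   # invented value for illustration

# P(marks made by size 11 trainers | marks made by someone else), under two
# choices of reference population (both frequencies invented for illustration):
p_national_sales = 0.03   # frequency implied by national shoe-sales data
p_crime_scenes = 0.2      # frequency among people at similar crime scenes

print(log2(p_given_defendant / p_national_sales))  # ~4.9 bits: modest evidence
print(log2(p_given_defendant / p_crime_scenes))    # ~2.2 bits: only weak evidence
```

The point is not the particular numbers, but that the weight of evidence depends on which reference population supplies the denominator of the likelihood ratio, and that writing the calculation down is what makes this assumption open to challenge.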