## What is “statistically significant” is not necessarily significant

“Statistical significance” is “a mathematical machine for turning baloney into breakthroughs, and flukes into funding” – Robert Matthews.

A test of statistical significance produces a p-value: the probability of seeing observations at least as extreme as those obtained if the null hypothesis (that there is no real effect and the observations fall within the bounds of randomness) were true. It is often read, wrongly, as the probability of the null hypothesis itself. A low p-value indicates only that the observations would be unlikely under the null hypothesis, and on that basis the result is called “statistically significant” and taken to describe a real effect. Quite arbitrarily, it has become the custom to use 0.05 (5%) as the threshold p-value that separates “statistically significant” from not. Why 5% has become the “holy number” which separates acceptance for publication from rejection, or success from failure, is a little irrational. What “statistically significant” actually means is that “the observations may or may not be a real effect, but they would be unlikely to arise if chance alone were at work”.

Even when there is no real effect at all, a test at the 5% threshold will still declare the observations “statistically significant” about 1 time in 20. Moreover it is conveniently forgotten that statistical significance is called for only when we don’t know. In a coin toss there is certainty (100% probability) that the outcome will be heads, tails or “lands on its edge”. Assigning a probability to one of the only 3 possible outcomes can then be helpful, but it is a probability constrained within the 100% certainty of the 3 outcomes. If a million people take part in a lottery, then the 1:1,000,000 probability of a particular individual winning has significance because there is 100% certainty that one of them will win. But when conducting clinical tests for a new drug, there is often no certainty anywhere to provide a framework and a boundary within which to apply a probability.
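The 1-in-20 figure is a statement about experiments in which nothing real is going on, and it can be checked by simulation. The sketch below is only an illustration under stated assumptions: each “experiment” draws 30 values from a standard normal distribution (so the null hypothesis is true by construction) and applies a two-sided z-test with known standard deviation. The function names, sample size and seed are hypothetical choices, not taken from any study.

```python
import math
import random


def two_sided_p_value(sample, sigma=1.0):
    """p-value of a two-sided z-test of 'true mean is 0' with known sigma."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(trials=20000, n=30, alpha=0.05, seed=1):
    """Fraction of null experiments (no real effect) that reach p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Data generated with mean 0: the null hypothesis is true here.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if two_sided_p_value(sample) < alpha:
            hits += 1
    return hits / trials


rate = false_positive_rate()
print(f"share of null experiments with p < 0.05: {rate:.3f}")
```

The printed share should land close to 0.05. That number describes how often chance alone crosses the threshold; it does not say how likely it is that any particular “significant” result is a fluke.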

A new article in Aeon by David Colquhoun, Professor of Pharmacology at University College London and a Fellow of the Royal Society, addresses “The Problem with p-values”:

In 2005, the epidemiologist John Ioannidis at Stanford caused a storm when he wrote the paper ‘Why Most Published Research Findings Are False’, focusing on results in certain areas of biomedicine. He’s been vindicated by subsequent investigations. For example, a recent article found that repeating 100 different results in experimental psychology confirmed the original conclusions in only 38 per cent of cases. It’s probably at least as bad for brain-imaging studies and cognitive neuroscience. How can this happen?

The problem of how to distinguish a genuine observation from random chance is a very old one. It’s been debated for centuries by philosophers and, more fruitfully, by statisticians. It turns on the distinction between induction and deduction. Science is an exercise in inductive reasoning: we are making observations and trying to infer general rules from them. Induction can never be certain. In contrast, deductive reasoning is easier: you deduce what you would expect to observe if some general rule were true and then compare it with what you actually see. The problem is that, for a scientist, deductive arguments don’t directly answer the question that you want to ask.

What matters to a scientific observer is how often you’ll be wrong if you claim that an effect is real, rather than being merely random. That’s a question of induction, so it’s hard. In the early 20th century, it became the custom to avoid induction, by changing the question into one that used only deductive reasoning. In the 1920s, the statistician Ronald Fisher did this by advocating tests of statistical significance. These are wholly deductive and so sidestep the philosophical problems of induction.

Tests of statistical significance proceed by calculating the probability of making our observations (or the more extreme ones) if there were no real effect. This isn’t an assertion that there is no real effect, but rather a calculation of what would be expected if there were no real effect. The postulate that there is no real effect is called the null hypothesis, and the probability is called the p-value. Clearly the smaller the p-value, the less plausible the null hypothesis, so the more likely it is that there is, in fact, a real effect. All you have to do is to decide how small the p-value must be before you declare that you’ve made a discovery. But that turns out to be very difficult.

The problem is that the p-value gives the right answer to the wrong question. What we really want to know is not the probability of the observations given a hypothesis about the existence of a real effect, but rather the probability that there is a real effect – that the hypothesis is true – given the observations. And that is a problem of induction.

Confusion between these two quite different probabilities lies at the heart of why p-values are so often misinterpreted. It’s called the error of the transposed conditional. Even quite respectable sources will tell you that the p-value is the probability that your observations occurred by chance. And that is plain wrong. …….
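How far apart the two probabilities can be is easiest to see with some counting. The sketch below is a rough illustration, not a calculation from the article: it assumes that a fixed fraction of the hypotheses being tested are real (`prior_real`) and that a real effect is detected with a given probability (`power`). Both numbers, and the function name, are hypothetical inputs chosen for the example. It also conditions on “p < 0.05” rather than on one exact p-value.

```python
def false_positive_risk(prior_real, power, alpha=0.05):
    """Share of 'significant' results that come from experiments with no real effect.

    prior_real: assumed fraction of tested hypotheses where the effect is real
    power:      assumed chance that a real effect yields p < alpha
    alpha:      significance threshold, i.e. chance a null effect yields p < alpha
    """
    true_positives = prior_real * power
    false_positives = (1.0 - prior_real) * alpha
    return false_positives / (true_positives + false_positives)


# Illustrative assumptions only: 10% of tested effects are real, power is 80%.
risk = false_positive_risk(prior_real=0.10, power=0.80, alpha=0.05)
print(f"P(no real effect | p < 0.05) = {risk:.2f}")
```

Under these assumed inputs, 36% of the “significant” results come from experiments with no real effect, even though the threshold was 5%. Change the assumed prior and the answer moves a great deal, which is why the p-value alone cannot supply it.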

……. The problem of induction was solved, in principle, by the Reverend Thomas Bayes in the middle of the 18th century. He showed how to convert the probability of the observations given a hypothesis (the deductive problem) to what we actually want, the probability that the hypothesis is true given some observations (the inductive problem). But how to use his famous theorem in practice has been the subject of heated debate ever since. …….
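For reference, the conversion described above can be written out. The notation is an assumption made for this note: $H$ stands for the hypothesis that there is a real effect, $\neg H$ for the null hypothesis, and $D$ for the observations. None of these symbols appear in the article.

```latex
P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)},
\qquad
P(D) = P(D \mid H)\, P(H) + P(D \mid \neg H)\, P(\neg H)
```

The deductive side supplies $P(D \mid H)$ and $P(D \mid \neg H)$. The inductive answer $P(H \mid D)$ also needs the prior $P(H)$, and how to choose that prior is where the practical disagreement lies.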

……. For a start, it’s high time that we abandoned the well-worn term ‘statistically significant’. The cut-off of P < 0.05 that’s almost universal in biomedical sciences is entirely arbitrary – and, as we’ve seen, it’s quite inadequate as evidence for a real effect. Although it’s common to blame Fisher for the magic value of 0.05, in fact Fisher said, in 1926, that P = 0.05 was a ‘low standard of significance’ and that a scientific fact should be regarded as experimentally established only if repeating the experiment ‘rarely fails to give this level of significance’.

The ‘rarely fails’ bit, emphasised by Fisher 90 years ago, has been forgotten. A single experiment that gives P = 0.045 will get a ‘discovery’ published in the most glamorous journals. So it’s not fair to blame Fisher, but nonetheless there’s an uncomfortable amount of truth in what the physicist Robert Matthews at Aston University in Birmingham had to say in 1998: ‘The plain fact is that 70 years ago Ronald Fisher gave scientists a mathematical machine for turning baloney into breakthroughs, and flukes into funding. It is time to pull the plug.’ ………
