## Killing the P-Value Messenger

Submitted by robweiss on Tue, 06/25/2019 - 16:48

Does the argument about p-values miss the big problem? One might argue that blaming p-values for our misuses of p-values is akin to blaming the messenger for bad news.

How do we use p-values? Prior to publication, we use p-values as gatekeepers. A small p-value indicates a 'significant' and therefore 'real' result, while a large p-value indicates that a particular effect is absent or not worth mentioning. Researchers emphasize results with small p-values while ignoring or not mentioning results with large p-values; often, results with large p-values are omitted from a paper entirely. Having at least one small p-value is considered de rigueur before bothering to submit a paper, and a paper with no significant results will either be reworked until some p-value is small (torturing the data until it speaks) or laid to rest without being submitted (the file drawer problem).

It seems to me the problem isn't the p-value. The problem isn't even that a lot of people don't quite understand the underlying logic of what a p-value is. The problem is that p-values are used to determine which results are presented to the broader scientific community and which results are ignored. Yet when we read a result in an article in the scientific literature, we typically assume the canonical (but false) idea that the authors set out to test that specific hypothesis and then reported that specific result.

The truth is that there are quite a few results that authors could report in most studies. Authors choose the results that are actually reported in the paper. And this choosing is closely correlated with the p-value. This is commonly called selection bias.

To over-simplify slightly, there are two ways to get a significant p-value. One way is to have a strong underlying effect and a study with the power to identify it. The other is to have an unusually strong result that isn't warranted by the underlying truth. In the first case there is an actual effect; in the second, 'something unusual' occurred. Testing lots of hypotheses gives you lots of chances to get a significant result: with each test you may have identified something real, or you may have gotten lucky.
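The 'getting lucky' route is easy to make concrete with a small simulation. Under a true null hypothesis a p-value is uniformly distributed on [0, 1], so we can simulate a batch of tests where every null is true (a sketch; the test count and threshold here are arbitrary illustrative choices, not figures from the text):

```python
import random

random.seed(42)

N_TESTS = 100   # hypotheses tested; assume every null is true
ALPHA = 0.05

# Under a true null, the p-value is Uniform(0, 1), so we can
# simulate the tests directly rather than simulating data.
p_values = [random.random() for _ in range(N_TESTS)]

significant = [p for p in p_values if p < ALPHA]
print(f"{len(significant)} of {N_TESTS} true-null tests came out 'significant'")

# Chance of at least one false positive across all the tests:
prob_any = 1 - (1 - ALPHA) ** N_TESTS   # 1 - 0.95**100, about 0.994
print(f"P(at least one false positive) = {prob_any:.3f}")
```

With a hundred chances, a handful of 'significant' findings is the expected outcome even when nothing real is there.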

Unfortunately, when researchers get lucky, society gets unlucky. It's a situation where society's utility function and researchers' utility functions may be at odds with each other. Researchers, with few exceptions, do want to actually identify real effects. However, researchers operate in a milieu that rewards significant results. Journal publishing, especially, rewards splashy and unusual results more than steady, straightforward, routine results, or worse, non-results, meaning non-significant results. There are stories of real researchers getting wealthy by marketing modest (and probably lucky) but splashy findings into big businesses.

Scientists test a lot of hypotheses. Some of these tests are of direct interest, as when we assess a treatment's effect on a sample of subjects. Much of the testing is exploratory, as when we search for the demographic characteristics that lead to higher values of some outcome. In addition, some hypothesis tests are not straightforward, as when exploratory analysis is required to build a model. When we build a complicated model to identify the effects of a given hypothesis, we are subject to the selection bias that occurs when we add model terms that distinguish a treatment's effects and omit model terms that do not seem to distinguish treatment from control.

The problem is that, due to selection bias, our scientific literature is likely filled with results whose strength has been over-inflated. How over-inflated the typical result is depends on the underlying ground truth being explored in the literature. But when we have a large literature, many researchers looking for many different effects, and few people checking results (it's less prestigious, after all!), we're likely to have lots of candidate results that are not all correct but are all treated as real.
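A quick simulation shows how the significance filter inflates what the literature reports. The numbers below are made up for illustration: a modest true effect studied with a design that is underpowered for it, where only the studies clearing p < 0.05 get 'published':

```python
import random
import statistics

random.seed(0)

TRUE_EFFECT = 0.2   # modest true effect, in standard-deviation units
N = 50              # per-study sample size (underpowered for this effect)
N_STUDIES = 5000
Z_CRIT = 1.96       # two-sided 5% threshold

all_estimates = []  # every study's effect estimate
published = []      # only the estimates that cleared the filter

for _ in range(N_STUDIES):
    sample = [random.gauss(TRUE_EFFECT, 1.0) for _ in range(N)]
    est = statistics.fmean(sample)
    se = 1.0 / N ** 0.5            # sigma = 1 known, so SE = 1/sqrt(n)
    all_estimates.append(est)
    if abs(est / se) > Z_CRIT:     # the p < 0.05 gatekeeper
        published.append(est)

print(f"true effect:                {TRUE_EFFECT}")
print(f"mean over all studies:      {statistics.fmean(all_estimates):.3f}")
print(f"mean over 'published' only: {statistics.fmean(published):.3f}")
```

The unfiltered mean sits at the truth, while the mean of the 'published' estimates lands well above it: the filter only passes studies that happened to overshoot.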

There have been a few attempts at validation studies, where researchers have gone back and tried to replicate a number of past results in the literature. These validation studies don't validate all the past findings. Is this because the original studies identified false effects? Or because the follow-up study got unlucky and didn't manage to identify an actual effect? Perhaps a combination: the original study overestimated the effect size, and the validation study was underpowered due to the original overestimation.
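That last scenario can be made quantitative with a textbook normal-approximation power calculation. The effect sizes below are hypothetical, and the one-sample z-test sketch ignores the negligible wrong-sided rejection region:

```python
from statistics import NormalDist

nd = NormalDist()
ALPHA = 0.05
z_crit = nd.inv_cdf(1 - ALPHA / 2)   # 1.96 for a two-sided test

def sample_size(effect, power=0.8):
    """n for a one-sample z-test to detect `effect` (in SD units)."""
    z_beta = nd.inv_cdf(power)
    return ((z_crit + z_beta) / effect) ** 2

def power_at(effect, n):
    """Power of that test with n observations when the truth is `effect`."""
    return 1 - nd.cdf(z_crit - effect * n ** 0.5)

reported = 0.4   # inflated estimate from the original, selected study
true = 0.2      # actual underlying effect

n = sample_size(reported)   # replication sized for the reported effect
print(f"n planned for d={reported}: {n:.0f}")
print(f"actual power at d={true}: {power_at(true, n):.2f}")
```

Sized for the inflated estimate, the replication runs at roughly 29% power against the real effect, so a 'failed' replication is the most likely outcome even though the effect exists.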

So should we toss out p-values? As a Bayesian, I made my peace with p-values a while ago. Two-sided p-values are a bit weird, and possibly a touch creepy, from a Bayesian perspective. However, a one-sided p-value has a simple Bayesian interpretation that is actually useful. Suppose our estimate of a treatment effect is that treatment has a positive effect on the outcome. Then the Bayesian interpretation of the one-sided p-value (the side smaller than 0.5) is the probability, given the data and a suitably flat prior, that the treatment effect in this study is actually harmful. It's a useful summary of the results of an analysis. It's not a sufficient statistic, in either the English or the statistical sense of the word, for how we should interpret the particular result.
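Numerically, the claim is that the one-sided p-value and the posterior probability of a harmful effect coincide when the prior on the effect is flat and the estimate is roughly normal. A tiny check, with a made-up estimate and standard error:

```python
from statistics import NormalDist

nd = NormalDist()

theta_hat = 0.30   # estimated treatment effect (hypothetical numbers)
se = 0.15          # its standard error

z = theta_hat / se

# Frequentist one-sided p-value: P(Z > z) under the null of no effect.
one_sided_p = 1 - nd.cdf(z)

# Bayesian: with a flat prior, the posterior for the effect is
# N(theta_hat, se^2), so P(effect < 0 | data) = Phi(-theta_hat / se).
p_harmful = nd.cdf(-z)

print(f"one-sided p-value:           {one_sided_p:.4f}")
print(f"posterior P(effect harmful): {p_harmful:.4f}")
```

The two numbers are identical by the symmetry of the normal, which is exactly the interpretation in the text: the small one-sided p-value is the posterior probability that the sign of the effect is wrong.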

It seems to me that the problem is one of selection. We select for strong results. We pay attention to strong results. We really pay attention to unusual and weird results. The popular press picks up unusual results and magnifies their impact immensely, without evaluating whether the reported result is likely to be true. Thus we propagate ideas based on something quite distinct from their likelihood of being true.

So don't shoot the messenger, neither messenger RNA nor messenger pigeon. Oops, too late, we already shot that last one. Consider instead how we can solve the crisis of scientific selection.