Killing the P-Value Messenger

Does the argument about p-values miss the big problem? One might argue that blaming p-values for our mis-uses of p-values is akin to blaming the messenger for bad news.
How do we use p-values? Prior to publishing, we use p-values as gatekeepers. A small p-value indicates a 'significant' and therefore 'real' result, while a large p-value indicates that a particular effect is not there or not worth mentioning. Researchers emphasize their results that have small p-values, while ignoring or not mentioning results with large p-values. Often results with large p-values are omitted from a paper. Having at least one small p-value is considered de rigueur to bother submitting a paper, and a paper with no significant results will either be reworked until some p-value is small (torturing the data until the data speaks) or the paper will be laid to rest without being submitted (the file drawer problem).

It seems to me the problem isn't the p-value. The problem isn't even that a lot of people don't quite understand the underlying logic of what a p-value is. The problem is that p-values are used to determine which results are presented to the broader scientific community and which results are ignored. But when we read a result in an article in the scientific literature, we typically assume the canonical (but false) idea that the authors wanted to test that specific result or hypothesis, and then reported that specific result or hypothesis.

The truth is that there are quite a few results that authors could report in most studies. Authors choose the results that are actually reported in the paper. And this choosing is closely correlated with the p-value. This is commonly called selection bias.

To over-simplify slightly, there are two ways to get a significant p-value. One way is to have an underlying effect that is strong with a study that has the power to identify that strong effect. Another way is to have an unusually strong result that isn't warranted by the underlying truth. The first way is that there is an actual effect, and the second way is 'something unusual' occurred. Testing lots of hypotheses gets you lots of chances to get a significant result. With each test you may have identified something real, or you may have gotten lucky.
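To make the "lots of chances" point concrete, here is a minimal simulation sketch. It assumes every null hypothesis is true, that the tests are independent, and that each test statistic is a standard normal z; the function names and the 0.05 cutoff are my own illustrative choices. Under those assumptions, any significant result is pure luck, and the share of studies with at least one lucky result grows quickly with the number of tests.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used to turn z into a p-value


def two_sided_p(z):
    """Two-sided p-value for a z statistic."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def share_with_a_hit(n_tests, n_studies=2000, alpha=0.05, seed=1):
    """Fraction of simulated studies with at least one p < alpha.

    Every null hypothesis is true here, so every 'hit' is pure luck.
    """
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(n_studies):
        # Draw one z statistic per test under the null and convert to p-values.
        p_values = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n_tests)]
        if min(p_values) < alpha:
            studies_with_hit += 1
    return studies_with_hit / n_studies


def analytic_share(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(share_with_a_hit(k), 3), round(analytic_share(k), 3))
```

With 20 independent tests and no real effects at all, the analytic chance of at least one significant result is 1 - 0.95^20, a bit under two in three.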

Unfortunately, when researchers get lucky, society gets unlucky. It's a situation where society's utility function and researchers' utility functions may be at odds with each other. Researchers, with few exceptions, do want to actually identify real effects. However, researchers operate in a milieu that rewards significant results. In particular, journal publishing rewards splashy and unusual results more than steady, straightforward, routine results or, worse, non-results, meaning results that are not significant. There are stories (of real researchers) getting wealthy by marketing modest (and probably lucky) but splashy findings into big businesses.

Scientists test a lot of hypotheses. Some of these tests are due to direct interest, as when we assess a treatment's effect on a sample of subjects. Many are exploratory, as when we try to find the demographic characteristics that lead to higher values of some outcome. In addition, some hypothesis tests are not straightforward, as when exploratory analysis is required to build a model. When we build a complicated model to identify the effects of a given hypothesis, we are subject to the selection bias that occurs when we add model terms that distinguish a treatment's effects and omit model terms that do not seem to distinguish treatment from control.

The problem is that, due to selection bias, our scientific literature is likely filled with results whose strength has been over-inflated. Now how over-inflated the typical result is depends on the underlying ground truth that is being explored in the literature. But when we have a large literature, and many researchers looking for many different effects, and few people checking results (it's less prestigious after all!) we're likely to have lots of candidate results that are not all correct but are all treated as real.
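A small simulation can illustrate how selecting on significance inflates reported effects. This is only a sketch under assumptions I am making up for illustration: every study estimates the same small true effect (0.1) with the same standard error (0.1), estimates are normally distributed, and only estimates with |z| > 1.96 get "published".

```python
import random
from statistics import mean


def simulate_selection(true_effect=0.1, se=0.1, n_studies=20000,
                       z_crit=1.96, seed=2):
    """Compare all study estimates with the subset that reaches significance."""
    rng = random.Random(seed)
    # Each study's estimate is the true effect plus normal sampling noise.
    estimates = [rng.gauss(true_effect, se) for _ in range(n_studies)]
    # "Published" studies are those whose z statistic clears the cutoff.
    significant = [est for est in estimates if abs(est / se) > z_crit]
    return {
        "mean_all": mean(estimates),
        "mean_significant": mean(significant),
        "share_significant": len(significant) / n_studies,
    }


if __name__ == "__main__":
    for name, value in simulate_selection().items():
        print(name, round(value, 3))
```

Averaged over all studies, the estimate is close to the true effect. Averaged over only the significant studies, it is far larger, because a study with this little power can only reach significance by overshooting.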

There have been a few attempts at validation studies, where researchers have gone back and tried to duplicate a number of past results in the literature. These validation studies don't validate all the past findings. Is this due to the original studies identifying false effects? Or is it because the follow-up study got unlucky and didn't manage to identify an actual effect? Perhaps a combination: the original study overestimated the effect size, and the validation study was underpowered because its sample size was planned around the original overestimate.
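The combination scenario can be sketched with a simple power calculation. The setup is an assumption chosen for illustration: a two-sided one-sample z-test with unit standard deviation, a reported (inflated) standardized effect of 0.5, and a true effect of 0.25. The validation study picks its sample size to get 80% power against the reported effect.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_two_sided(effect, n, alpha=0.05):
    """Power of a two-sided one-sample z-test with unit standard deviation."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect * math.sqrt(n)  # mean of the z statistic under this effect
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def sample_size_for_power(effect, power=0.8, alpha=0.05):
    """Smallest n giving at least the requested power (normal approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / effect) ** 2)


if __name__ == "__main__":
    reported_effect, true_effect = 0.5, 0.25  # hypothetical values
    n = sample_size_for_power(reported_effect)
    print("planned n:", n)
    print("power if the reported effect were true:",
          round(power_two_sided(reported_effect, n), 3))
    print("power at the smaller true effect:",
          round(power_two_sided(true_effect, n), 3))
```

In this setup the validation study believes it has about 80% power, but against the smaller true effect its power is under one in three, so a failed replication is the most likely outcome even though the effect is real.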

So should we toss out p-values? As a Bayesian, I made my peace with p-values a while ago. Now, two-sided p-values are a bit weird and possibly a touch creepy from a Bayesian perspective. However, a one-sided p-value has a simple Bayesian interpretation that is actually useful. Suppose our estimate of a treatment effect is that treatment has a positive effect on the outcome. Then the Bayesian interpretation of a one-sided p-value (the side smaller than 0.5) is, at least approximately and under a flat prior, the probability, given the data, that the treatment effect in this study is actually harmful. It's a useful summary of the results of an analysis. It's not a sufficient statistic in either the English or statistical senses of the word with regard to how we should interpret the particular result.
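Here is a minimal sketch of that interpretation in the simplest case I can think of: a normally distributed estimate with known standard error. With a flat prior on the effect, the posterior is normal, centered at the estimate, and the posterior probability that the effect is negative matches the one-sided p-value. The numbers (estimate 1.2, standard error 0.6) and the optional normal prior are hypothetical, included only to show that an informative prior changes the answer.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def one_sided_p(estimate, se):
    """P(an estimate at least this large | true effect is zero)."""
    return 1.0 - STD_NORMAL.cdf(estimate / se)


def posterior_prob_harmful(estimate, se, prior_mean=0.0, prior_sd=None):
    """P(effect < 0 | data) for a normal likelihood with known se.

    prior_sd=None means a flat prior; otherwise the prior is
    Normal(prior_mean, prior_sd) and the usual conjugate update applies.
    """
    if prior_sd is None:
        post_mean, post_sd = estimate, se
    else:
        precision = 1.0 / se**2 + 1.0 / prior_sd**2
        post_mean = (estimate / se**2 + prior_mean / prior_sd**2) / precision
        post_sd = precision ** -0.5
    return NormalDist(post_mean, post_sd).cdf(0.0)


if __name__ == "__main__":
    est, se = 1.2, 0.6  # hypothetical study result
    print("one-sided p-value:       ", round(one_sided_p(est, se), 4))
    print("P(harmful), flat prior:  ", round(posterior_prob_harmful(est, se), 4))
    print("P(harmful), N(0, 0.5^2): ",
          round(posterior_prob_harmful(est, se, prior_sd=0.5), 4))
```

A skeptical prior centered at zero pulls the posterior toward zero and raises the probability of harm above the p-value, which is one reason the p-value alone is not the whole story.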

It seems to me that the problem is more of selection. We select for strong results. We pay attention to strong results. We really pay attention to unusual and weird results. The popular press picks up unusual results and magnifies their impact immensely, without evaluating whether the reported result is likely to be true or not. Thus we propagate ideas based on something distinct from their likelihood of being true.

So don't shoot the messenger, neither messenger RNA nor messenger pigeon. Oops, too late, we already shot that last one. Consider how we can solve the crisis of scientific selection.

Why Be Bayesian? Let Me Count the Ways

In answer to an old friend's question.

  1. Bayesians have more fun.
    1. Our conferences are in better places too.
  2. It's the model not the estimator.
  3. Life's too short to be a frequentist: In an infinite number of replications ...
  4. Software works better.
    1. Rather surprisingly, Bayesian software is a lot more general than frequentist software.
  5. Small sample inference comes standard with most Bayesian model fitting these days.
    1. But if you like your inference asymptotic, that's available, just not high on anyone's priority list.
    2. We can handle the no-data problem, all the way up to very large problems.
    3. Don't need a large enough sample to allow for a bootstrap.
  6. Hierarchical random effects models are better fit with Bayesian models and software.
    1. If a variance component is small, the natural Bayes model doesn't allow zero as an estimate, while the natural maximum likelihood algorithms do allow zero. If you get a zero estimate, then you're going to get poor estimates of standard errors of fixed effects. [More discussion omitted.]
    2. Can handle problems where there are more parameters than data.
  7. Logistic regression models fit better with Bayes
    1. If there's perfect separation on a particular variable, the maximum likelihood estimate of the coefficient is plus or minus infinity which isn't a good estimate.
    2. Bayesian modeling offers (doesn't guarantee it, there's no insurance against stupidity) the opportunity to do the estimation correctly.
    3. Same thing if you're trying to estimate a very tiny (or very large) probability. Suppose you observe 20 out of 20 successes on something that you know doesn't have 100% successes.
    4. To rephrase a bit: In small samples or with rare events, Bayesian estimates shrink towards sensible point estimates (if your prior is sensible), thus avoiding the large variance of point estimates.
  8. Variance bias trade-off is working in your favor.
  9. Frequentists keep reinventing Bayesian methods
    1. Shrinkage estimates
    2. Empirical Bayes
    3. Lasso
    4. Penalized likelihood
    5. Ridge regression
    6. James-Stein estimators
    7. Regularization
    8. Pitman estimation
    9. Integrated likelihood
    10. In other words, it's just not possible to analyze complex data structures without Bayesian ideas.
  10. Your answers are admissible if you're Bayesian but usually not if you're a frequentist.
    1. Admissibility means never having to say you're sorry.
    2. Alternatively, admissibility means that someone else can't prove that they can do a better job than you.
    3. And if you're a frequentist, someone is clogging our journals with proofs that the latest idiocy is admissible or not.
    4. Unless they are clogging it with yet more ways to estimate the smoothing parameter for a nonparametric estimator.
  11. Bayesian models are generalizations of classical models. That's what the prior buys you: more models
  12. Can handle discrete, categorical, ordered categorical, trees, densities, matrices, missing data and other odd parameter types.
  13. Data and parameters are treated on an equal playing field.
  14. I would argue that cross-validation works because it approximates Bayesian model selection tools.
  15. Bayesian Hypothesis Testing
    1. Treats the null and alternative hypotheses on equal terms
    2. Can handle two or more than two hypotheses
    3. Can handle hypotheses that are
      1. Disjoint
      2. Nested
      3. Overlapping but neither disjoint nor nested
    4. Gives you the probability the alternative hypothesis is true.
    5. Classical inference can only handle the nested null hypothesis problem.
    6. We're all probably misusing p-values anyway.
  16. Provides a language for talking about modeling and uncertainty that is missing in classical statistics.
    1. And thus provides a language for developing new models for new data sets or scientific problems.
    2. Provides a language for thinking about shrinkage estimators and why we want to use them and how to specify the shrinkage.
    3. Bayesian statistics permits discussion of the sampling density of the data given the unknown parameters.
    4. Unfortunately this is all that frequentist statistics allows you to talk about.
    5. Additionally: Bayesians can discuss the distribution of the data unconditional on the parameters.
    6. Bayesian statistics also allows you to discuss the distribution of the parameters.
    7. You may discuss the distribution of the parameters given the data. This is called the posterior, and is the conclusion of a Bayesian analysis.
    8. You can talk about problems that classical statistics can't handle: The probability of nuclear war for example.
  17. Novel computing tools -- but you can often use your old tools as well.
  18. Bayesian methods allow pooling of information from diverse data sources.
    1. Data can come from books, journal articles, older lab data, previous studies, people, experts, the horse's mouth, a rat's a**, or it may have been collected in the traditional form of data.
    2. It isn't automatic, but there is language to think about how to do this pooling.
  19. Less work.
    1. Bayesian inference is via laws of probability, not by some ad hoc procedure that you need to invent for every problem or validate every time you use it.
    2. Don't need to figure out an estimator.
    3. Once you have a model and data set, the conclusion is a computing problem, not a research problem.
    4. Don't need to prove a theorem to show that your posterior is sensible. It is sensible if your assumptions are sensible.
    5. Don't need to publish a bunch of papers to figure out sensible answers given a novel problem
    6. For example, estimating a series of means $\mu_1, \mu_2, \ldots$ that you know are ordered $\mu_j \le \mu_{j+1}$ is a computing problem in Bayesian inference, but was the source of numerous papers in the frequentist literature. Finding a (good) frequentist estimator and finding standard errors and confidence intervals took lots of papers to figure out.
  20. Yes, you can still use SAS.
    1. Or R or Stata.
  21. Can incorporate utility functions, if you have one.
  22. Odd bits of other information can be incorporated into the analysis, for example
    1. That a particular parameter, usually allowed to be positive or negative, must be positive.
    2. That a particular parameter is probably positive, but not guaranteed to be positive.
    3. That a given regression coefficient should be close to zero.
    4. That group one's mean is larger than group two's mean.
    5. That the data comes from a distribution that is not a Poisson, Binomial, Exponential or Normal. For example, the data may be better modeled by a t or gamma distribution.
    6. That a collection of parameters come from a distribution that is skewed, or has long tails.
    7. Bayesian nonparametrics can allow you to model an unknown density as a non-parametric mixture of normals (or other density). The uncertainty in estimating this distribution is incorporated in making inferences about group means and regression coefficients.
  23. Bayesian modeling is about the science.
    1. You can calculate the probability that your hypothesis is true.
    2. Bayesian modeling asks if this model describes the data, mother nature, the data generating process correctly, or sufficiently correctly.
    3. Classical inference is all about the statistician and the algorithm, not the science.
    4. In repeated samples, how often (or how accurately) does this algorithm/method/model/inference scheme give the right answer?
    5. Classical inference is more about the robustness (in repeated sampling) of the procedure. In that way, it provides robustness results for Bayesian methods.
  24. Bayesian methods have had notable successes, to wit:
    1. Covariate selection in regression problems
    2. Model selection
    3. Model mixing
    4. And mixture models
    5. Missing data
    6. Multi-level and hierarchical models
    7. Phylogeny
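
As a small illustration of item 7.3 in the list above (20 successes out of 20 on something that you know doesn't succeed 100% of the time), here is a sketch using a Beta prior on the success probability. The uniform Beta(1, 1) prior is an assumption chosen for simplicity, not a recommendation, and the function names are mine.

```python
import math


def mle_proportion(successes, trials):
    """Maximum likelihood estimate of a binomial success probability."""
    return successes / trials


def mle_standard_error(successes, trials):
    """Usual plug-in standard error; collapses to 0 when the MLE is 0 or 1."""
    p_hat = mle_proportion(successes, trials)
    return math.sqrt(p_hat * (1.0 - p_hat) / trials)


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta(a, b) prior plus binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + (trials - successes)


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    successes, trials = 20, 20
    a, b = beta_posterior(successes, trials)
    print("MLE:", mle_proportion(successes, trials))
    print("MLE standard error:", mle_standard_error(successes, trials))
    print("posterior mean:", round(beta_mean(a, b), 4))
```

The maximum likelihood estimate is exactly 1 with a plug-in standard error of 0, which claims a certainty nobody believes. The posterior mean under the uniform prior is 21/22, shrunk slightly away from the boundary.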

The bottom line: More tools. Faster progress.


