Time to Update the P-Value Dichotomy to a Trichotomy

In executing a classical hypothesis test, a small $p$-value allows us to reject the null hypothesis and declare that the alternative hypothesis is true.

This classical decision requires a leap of faith: if the $p$-value is small, either something unusual occurred or the null hypothesis must be false.

These days we should add a third possibility: that we searched over several models and methods until we found a small $p$-value. We need to update the $p$-value oath of decision making to state: either something unusual happened, or we searched until we found a small $p$-value, or the null hypothesis is false.
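
To see how much searching alone can buy, here is a small simulation sketch of my own (the sample size and the number of candidate predictors are made up for illustration): generate data with no real signal, fit a separate simple regression for each of ten candidate predictors, and keep the smallest $p$-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, n_candidates, n_sims = 50, 10, 2000     # made-up sizes, purely illustrative
hits = 0
for _ in range(n_sims):
    y = rng.normal(size=n)                     # outcome with no real signal
    X = rng.normal(size=(n, n_candidates))     # candidate predictors, all irrelevant
    # The "search": fit a simple regression on each candidate, keep the smallest p-value
    pvals = [stats.linregress(X[:, j], y).pvalue for j in range(n_candidates)]
    if min(pvals) < 0.05:
        hits += 1

print(f"Chance that the best of 10 looks gives p < 0.05 under the null: {hits / n_sims:.2f}")
# Roughly 1 - 0.95**10, about 0.40 -- far from the nominal 0.05.
```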

Note that being Bayesian doesn't necessarily avoid this problem. Suppose we have a regression model $Y = X\beta + \mbox{error}$. Apologies for not defining notation, except that $\beta$ is a $p$-vector with elements $\beta_k$. One way to define a one-sided Bayesian $p$-value is as the posterior probability that $\beta_k$ is less than zero. If this probability $P(\beta_k \lt 0 | Y)$ is near 0 or near 1, then we declare "significance". Basically, the Bayesian $p$-value tells us how much certainty we have about the sign of $\beta_k$. The usual classical $p$-value is approximately twice the smaller of $P(\beta_k \lt 0 | Y)$ and $P(\beta_k \gt 0 | Y)$. How close the approximation is depends on the strength of the prior information relative to the information in the data (the observed Fisher information). The Bayesian $p$-value is subject to the same maximization by search over models as the classical $p$-value.
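
As a quick check on that approximation, here is a sketch of my own (the numbers are invented) for the flat-prior case, where the posterior for $\beta_k$ is approximately normal and centered at the usual estimate. In that case the classical two-sided $p$-value and twice the smaller tail probability agree exactly; an informative prior is what pulls them apart.

```python
from scipy import stats

# Invented numbers: with a flat prior and an approximately normal likelihood,
# the posterior for beta_k is roughly N(beta_hat, se^2).
beta_hat, se = 0.9, 0.5

p_neg = stats.norm.cdf(0, loc=beta_hat, scale=se)   # P(beta_k < 0 | Y)
p_pos = 1.0 - p_neg                                  # P(beta_k > 0 | Y)
bayes_two_sided = 2 * min(p_neg, p_pos)

z = beta_hat / se                                    # classical z statistic, se treated as known
p_classical = 2 * stats.norm.sf(abs(z))

print(round(bayes_two_sided, 4), round(p_classical, 4))  # identical here; an informative prior breaks the tie
```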

Bayesians have an alternative to merely searching over models, however. We can use a mixture model (George and McCulloch 1993, JASA; Kuo and Mallick 1998, Sankhyā B) that incorporates all the models we've searched over into a single model, and calculate the $p$-value from that.
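
For concreteness, one common form of such a mixture prior, in the spirit of George and McCulloch's stochastic search variable selection (the notation here is my own sketch, not a quotation from either paper), attaches a latent inclusion indicator to each coefficient:

$$\beta_k \mid \gamma_k \;\sim\; (1-\gamma_k)\,N(0,\tau^2) + \gamma_k\,N(0,c^2\tau^2), \qquad \gamma_k \sim \mathrm{Bernoulli}(\pi_k).$$

The posterior then averages over all $2^p$ submodels at once, so a quantity like $P(\beta_k \lt 0 | Y)$ already accounts for the model search rather than being the best value found by it.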

Why Be Bayesian? Let Me Count the Ways

In answer to an old friend's question.

  1. Bayesians have more fun.
    1. Our conferences are in better places too.
  2. It's the model, not the estimator.
  3. Life's too short to be a frequentist: In an infinite number of replications ...
  4. Software works better.
    1. Rather surprisingly, Bayesian software is a lot more general than frequentist software.
  5. Small sample inference comes standard with most Bayesian model fitting these days.
    1. But if you like your inference asymptotic, that's available, just not high on anyone's priority list.
    2. We can handle the no-data problem, all the way up to very large problems.
    3. Don't need a large enough sample to allow for a bootstrap.
  6. Hierarchical random effects models are better fit with Bayesian models and software.
    1. If a variance component is small, the natural Bayes model doesn't allow zero as an estimate, while the natural maximum likelihood algorithms do allow zero. If you get a zero estimate, then you're going to get poor estimates of standard errors of fixed effects. [More discussion omitted.]
    2. Can handle problems where there are more parameters than data.
  7. Logistic regression models fit better with Bayes
    1. If there's perfect separation on a particular variable, the maximum likelihood estimate of the coefficient is plus or minus infinity, which isn't a good estimate.
    2. Bayesian modeling offers (doesn't guarantee it, there's no insurance against stupidity) the opportunity to do the estimation correctly.
    3. Same thing if you're trying to estimate a very tiny (or very large) probability. Suppose you observe 20 successes out of 20 trials on something that you know doesn't have a 100% success rate (see the sketch after this list).
    4. To rephrase a bit: in small samples or with rare events, Bayesian estimates shrink towards sensible point estimates (if your prior is sensible), thus avoiding the large variance of classical point estimates.
  8. The variance-bias trade-off is working in your favor.
  9. Frequentists keep reinventing Bayesian methods
    1. Shrinkage estimates
    2. Empirical Bayes
    3. Lasso
    4. Penalized likelihood
    5. Ridge regression
    6. James-Stein estimators
    7. Regularization
    8. Pitman estimation
    9. Integrated likelihood
    10. In other words, it's just not possible to analyze complex data structures without Bayesian ideas.
  10. Your answers are admissible if you're Bayesian but usually not if you're a frequentist.
    1. Admissibility means never having to say you're sorry.
    2. Alternatively, admissibility means that someone else can't prove that they can do a better job than you.
    3. And if you're a frequentist, someone is clogging our journals with proofs that the latest idiocy is admissible or not.
    4. Unless they are clogging it with yet more ways to estimate the smoothing parameter for a nonparametric estimator.
  11. Bayesian models are generalizations of classical models. That's what the prior buys you: more models.
  12. Can handle discrete, categorical, ordered categorical, trees, densities, matrices, missing data and other odd parameter types.
  13. Data and parameters are treated on an equal footing.
  14. I would argue that cross-validation works because it approximates Bayesian model selection tools.
  15. Bayesian Hypothesis Testing
    1. Treats the null and alternative hypotheses on equal terms
    2. Can handle two or more than two hypotheses
    3. Can handle hypotheses that are
      1. Disjoint
      2. Nested
      3. Overlapping but neither disjoint nor nested
    4. Gives you the probability the alternative hypothesis is true.
    5. Classical inference can only handle the nested null hypothesis problem.
    6. We're all probably misusing p-values anyway.
  16. Provides a language for talking about modeling and uncertainty that is missing in classical statistics.
    1. And thus provides a language for developing new models for new data sets or scientific problems.
    2. Provides a language for thinking about shrinkage estimators and why we want to use them and how to specify the shrinkage.
    3. Bayesian statistics permits discussion of the sampling density of the data given the unknown parameters.
    4. Unfortunately this is all that frequentist statistics allows you to talk about.
    5. Additionally: Bayesians can discuss the distribution of the data unconditional on the parameters.
    6. Bayesian statistics also allows you to discuss the distribution of the parameters.
    7. You may discuss the distribution of the parameters given the data. This is called the posterior, and is the conclusion of a Bayesian analysis.
    8. You can talk about problems that classical statistics can't handle: The probability of nuclear war for example.
  17. Novel computing tools -- but you can often use your old tools as well.
  18. Bayesian methods allow pooling of information from diverse data sources.
    1. Data can come from books, journal articles, older lab data, previous studies, people, experts, the horse's mouth, rats a** or it may have been collected in the traditional form of data.
    2. It isn't automatic, but there is language to think about how to do this pooling.
  19. Less work.
    1. Bayesian inference is via laws of probability, not by some ad hoc procedure that you need to invent for every problem or validate every time you use it.
    2. Don't need to figure out an estimator.
    3. Once you have a model and data set, the conclusion is a computing problem, not a research problem.
    4. Don't need to prove a theorem to show that your posterior is sensible. It is sensible if your assumptions are sensible.
    5. Don't need to publish a bunch of papers to figure out sensible answers given a novel problem
    6. For example, estimating a series of means $\mu_1, \mu_2, \ldots$ that you know are ordered $\mu_j \le \mu_{j+1}$ is a computing problem in Bayesian inference, but was the source of numerous papers in the frequentist literature. Finding a (good) frequentist estimator and finding standard errors and confidence intervals took lots of papers to figure out.
  20. Yes, you can still use SAS.
    1. Or R or Stata.
  21. Can incorporate utility functions, if you have one.
  22. Odd bits of other information can be incorporated into the analysis, for example
    1. That a particular parameter, usually allowed to be positive or negative, must be positive.
    2. That a particular parameter is probably positive, but not guaranteed to be positive.
    3. That a given regression coefficient should be close to zero.
    4. That group one's mean is larger than group two's mean.
    5. That the data comes from a distribution that is not a Poisson, Binomial, Exponential or Normal. For example, the data may be better modeled by a t or a gamma.
    6. That a collection of parameters come from a distribution that is skewed, or has long tails.
    7. Bayesian nonparametrics can allow you to model an unknown density as a non-parametric mixture of normals (or other density). The uncertainty in estimating this distribution is incorporated in making inferences about group means and regression coefficients.
  23. Bayesian modeling is about the science.
    1. You can calculate the probability that your hypothesis is true.
    2. Bayesian modeling asks whether this model describes the data, mother nature, the data-generating process correctly, or at least sufficiently correctly.
    3. Classical inference is all about the statistician and the algorithm, not the science.
    4. In repeated samples, how often (or how accurately) does this algorithm/method/model/inference scheme give the right answer?
    5. Classical inference is more about the robustness (in repeated sampling) of the procedure. In that way, it provides robustness results for Bayesian methods.
  24. Bayesian methods have had notable successes, to wit:
    1. Covariate selection in regression problems
    2. Model selection
    3. Model mixing
    4. And mixture models
    5. Missing data
    6. Multi-level and hierarchical models
    7. Phylogeny
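
As a footnote to item 7.3 above (the 20-successes-out-of-20 example), here is a minimal sketch of the kind of calculation meant there; the uniform Beta(1, 1) prior is simply one defensible choice for illustration, not a recommendation.

```python
from scipy import stats

# 20 successes in 20 trials, but we know the true success rate is not 100%.
successes, trials = 20, 20

p_mle = successes / trials                      # maximum likelihood estimate: exactly 1.0

# Beta(1, 1) prior (uniform) + binomial likelihood -> Beta(21, 1) posterior
posterior = stats.beta(1 + successes, 1 + trials - successes)
p_post_mean = posterior.mean()                  # about 0.95, pulled off the boundary
ci = posterior.interval(0.95)                   # 95% credible interval, roughly (0.84, 0.999)

print(p_mle, round(p_post_mean, 3), tuple(round(x, 3) for x in ci))
```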

The bottom line: More tools. Faster progress.

Kathryn Chaloner 1954-2014

Prescript: Memory is fickle. It's been a while since these events, a while since I took her regression course and a while since I've read her papers. One thing I've found is that memories of the contents of particular papers evolve with time, and memory often doesn't line up with actual content. Sometimes my memory is associated with the conclusions I've drawn and the lessons I've learned, not the actual paper content.

Kathryn Chaloner taught the regression course I took in grad school at the University of Minnesota. There was one homework set I thought I had a perfectly good solution to. But I only got an A- on the problem set. Kathryn gave us her analysis of the problem, and, in explaining the grading, said there actually was a variable missing in the analysis. And unless you had that variable you couldn't get it right; or at least we should have concluded that something was missing. So she felt no one deserved an A. Well, we didn't have the variable in the data set, nor even a hint about it, and I was rather miffed over the whole thing. But as you can tell, her claim did stay with me. It's a good attitude for a researcher to have, that something is missing, and to always be thinking about what it might be. And sure enough, over the years, it's been a rare analysis that wasn't missing something or other. Perhaps something the investigator forgot or didn't know to tell you, and equally often there's information that was never collected that you would want or need for the analysis.

In grad school, I read as many papers by the faculty as I could manage or stand. This included Kathryn's papers. The thing is, I read possibly more of her papers and definitely a greater percentage of her output as compared to any other prof. In terms of percentage, probably by a factor of 2 or maybe even 3. Now partly this was because she was a young assistant prof, and it was actually possible to read a large fraction of her work. But also it was because I was interested in her work. I could read it; and the work mattered.

There was her definition of Bayes residual (Chaloner & Brant 1988; Chaloner 1991; Chaloner 1994). I've a doctoral student plotting posterior mean residuals as part of her dissertation right now and I'll be talking about them later this quarter in my Bayesian course.

I never quite got interested in optimal design, but I did think about it; as part of that I read Chaloner & Larntz (1989). I spent a lot of time with that 1989 paper.

For a heavily Bayesian department, Minnesota didn't do much elicitation that I noticed, or even much in the way of prior specification. Kathryn had the two papers with George Duncan (Chaloner & Duncan 1983; Chaloner & Duncan 1987), which I recall liking at the time, as well as the later Chaloner, Church, Louis & Matts (1993). I spend quite a bit of time in my Bayes class on specifying priors.

In class I exhort my students that if they're not using a proper, informative prior then they're leaving money on the table. Kathryn's 1987 Technometrics paper was certainly eye-opening. Here was a Bayesian playing a frequentist's game. Putting informative priors on parameters means that Bayes estimators will win by even more.

These papers all influenced me, and at a formative stage of my career. And they're all pretty much still with me, affecting my research and affecting what I teach. And I still think about them; all of them.

Here are the citations. Read 'em. I did. Read 'em and weep if you need to. But, give a cheer also. Cheer for Kathryn and cheer for her accomplishments. This is my cheer for Kathryn. I'll weep in a bit.

Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75(4), 651-659.

Chaloner, K. (1991). Bayesian residual analysis in the presence of censoring. Biometrika, 78(3), 637-644.

Chaloner, K. (1994). Residual analysis and outliers in Bayesian hierarchical models. Aspects of Uncertainty: A Tribute to D. V. Lindley, 149-157.

Chaloner, K., & Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21(2), 191-208.

Chaloner, K. M., & Duncan, G. T. (1983). Assessment of a beta prior distribution: PM elicitation. The Statistician, 174-180.

Chaloner, K., & Duncan, G. T. (1987). Some properties of the Dirichlet-multinomial distribution and its use in prior elicitation. Communications in Statistics-Theory and Methods, 16(2), 511-523.

Chaloner, K., Church, T., Louis, T. A., & Matts, J. P. (1993). Graphical elicitation of a prior distribution for a clinical trial. The Statistician, 341-353.

Chaloner, K. (1987). A Bayesian approach to the estimation of variance components for the unbalanced one-way random model. Technometrics, 29(3), 323-337.
