Why shouldn't I dichotomize my outcome variable?

My collaborators periodically want to dichotomize a continuous outcome, such as a depression scale, into a binary depressed/not depressed variable. Another popular one is Body Mass Index (BMI), which gets classified into obese/not obese. Every time this arises, I get to discourage them from dichotomizing and I get to explain why. Sometimes dichotomization is called "analyzing caseness". Here are some of the reasons why we shouldn't directly analyze caseness.

  1. Dichotomization is a bad idea because it throws away useful information. Dichotomizing takes a nice plump juicy continuous variable with lots of information and turns it into a scrawny binary 0-1 variable, a single bit per person, with very little information.
    1. This is like having your computer person take your hard drives with all that hard-won data on them, and having her throw out 3 out of every 4 hard drives. (She wouldn't do that really. So why would you?)
  2. Caseness treats people who are similar, but on different sides of the cut point, as very different. Consider IQ score, a continuous variable with (say) mean 100 and sd 15. Cutting the data at Y=100 divides people into above or below average.
    1. A person who has an IQ of 99 is treated as hugely different from a person with IQ of 101. 
    2. Conversely, a person with IQ 115 is treated as different from a person with IQ 85, but the difference between those two people is treated the same as the difference between the IQ 99 and IQ 101 pair.
    3. Two people, one who has IQ 102 and one who has IQ 201 are treated identically. Hmmmm.
  3. On average, dichotomizing a continuous variable will lead to larger standard errors, smaller effect sizes, less power, and missed effects. 
  4. Solution!  Do not throw the baby out with the bathwater. [Actually bathwater is a solution too.]
    1. We can analyze the continuous variable and after we are done with the analysis, if it is of interest, we can convert to caseness at the end of the analysis, and draw conclusions about the probability (or odds) of caseness in the treatment group versus control group, or among men versus among women. 
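
The loss of power in item 3 is easy to check with a small simulation. What follows is a rough sketch, not a definitive study: the function name `simulate_power`, the sample size of 50 per group, the half-sd treatment effect, and the cut point at the control mean are all assumptions I picked for illustration. It compares a large-sample z-test on the means against a two-proportion z-test on the dichotomized data, using the same simulated data sets for both.

```python
import math
import random
from statistics import NormalDist, mean, variance

def simulate_power(n_per_group=50, effect=0.5, cut=0.0,
                   n_sims=2000, alpha=0.05, seed=1):
    """Estimate power for the continuous analysis and for the dichotomized
    analysis (case = Y > cut) on identical simulated data sets.
    All default values are illustrative assumptions."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits_cont = 0
    hits_bin = 0
    for _ in range(n_sims):
        # Control ~ N(0, 1), treatment ~ N(effect, 1).
        ctl = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        tmt = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]

        # Continuous outcome: z-test on the difference in means.
        se = math.sqrt((variance(ctl) + variance(tmt)) / n_per_group)
        if abs(mean(tmt) - mean(ctl)) / se > z_crit:
            hits_cont += 1

        # Dichotomized outcome: two-proportion z-test with pooled proportion.
        p_c = sum(y > cut for y in ctl) / n_per_group
        p_t = sum(y > cut for y in tmt) / n_per_group
        p_pool = (p_c + p_t) / 2
        se_bin = math.sqrt(2 * p_pool * (1 - p_pool) / n_per_group)
        if se_bin > 0 and abs(p_t - p_c) / se_bin > z_crit:
            hits_bin += 1
    return hits_cont / n_sims, hits_bin / n_sims

power_cont, power_bin = simulate_power()
print(f"power, continuous outcome:   {power_cont:.2f}")
print(f"power, dichotomized outcome: {power_bin:.2f}")
```

Under these assumptions the dichotomized analysis should reject noticeably less often than the continuous one. Moving the cut point away from the middle of the distribution generally makes the gap worse.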

Given a point estimate \hat{\mu}_{tmt} for the mean of the treatment group, an estimated population standard deviation \hat{s}, and a cut point c above which a person counts as a case, we can calculate the probability of caseness in the treatment group as \Phi( (\hat{\mu}_{tmt} - c) / \hat{s} ), where \Phi(z) is the cumulative distribution function of the standard normal. This assumes the outcome is roughly normally distributed within each group. We can do the same computation in the control group using \hat{\mu}_{cntl}. To report on significance or not, I'd use the test that compares \mu_{tmt} to \mu_{cntl}, though that isn't quite the same thing. To get a standard error of the difference in probabilities, I'd probably run a simple simulation that incorporated the uncertainty in \hat{\mu}_{tmt}, \hat{\mu}_{cntl} and \hat{s} and also included any covariance between the tmt and cntl mean estimates. Actually, it is much easier to run a Bayes analysis and use the McMC (Marked-up chain Monet Carla) output to estimate the uncertainty in the differences in probabilities of caseness.
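
Here is one way the caseness calculation and the simple simulation might look. Treat it as a minimal sketch under stated assumptions, not the one true method. The assumptions: a case is anyone with outcome above the cut point c, the outcome is normal with a common standard deviation in both groups, and the two groups are independent so the covariance between the mean estimates is zero. The function names `prob_case` and `simulate_diff` and every number in the example (means 14 and 17, sd 6, cut point 16, 60 people per group) are made up for illustration.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def prob_case(mu_hat, s_hat, c):
    """P(Y > c) for Y ~ Normal(mu_hat, s_hat^2), i.e. Phi((mu_hat - c) / s_hat)."""
    return NormalDist().cdf((mu_hat - c) / s_hat)

def simulate_diff(mu_tmt, mu_cntl, s_hat, n_tmt, n_cntl, c,
                  n_draws=5000, seed=1):
    """Simulate the treatment-minus-control difference in caseness
    probabilities, propagating uncertainty in both means and in the sd.
    Assumes independent groups (zero covariance between the mean estimates)."""
    rng = random.Random(seed)
    df = n_tmt + n_cntl - 2
    diffs = []
    for _ in range(n_draws):
        # Plausible sd: s_hat * sqrt(df / chi2), with chi2 ~ chi-square(df).
        # A chi-square(df) draw is a gamma draw with shape df/2 and scale 2.
        chi2 = rng.gammavariate(df / 2, 2.0)
        s = s_hat * math.sqrt(df / chi2)
        # Plausible group means given that sd.
        m_t = rng.gauss(mu_tmt, s / math.sqrt(n_tmt))
        m_c = rng.gauss(mu_cntl, s / math.sqrt(n_cntl))
        diffs.append(prob_case(m_t, s, c) - prob_case(m_c, s, c))
    return mean(diffs), stdev(diffs)

# Hypothetical depression-scale example; all numbers are invented.
p_tmt = prob_case(mu_hat=14.0, s_hat=6.0, c=16.0)
p_cntl = prob_case(mu_hat=17.0, s_hat=6.0, c=16.0)
diff_mean, diff_se = simulate_diff(14.0, 17.0, 6.0, 60, 60, c=16.0)
print(f"P(case | tmt)  = {p_tmt:.3f}")
print(f"P(case | cntl) = {p_cntl:.3f}")
print(f"difference     = {p_tmt - p_cntl:.3f} (simulated se {diff_se:.3f})")
```

The draws here amount to a quick approximation to what a Bayes analysis with a flat prior would give you, so if you already have posterior samples of the means and sd, you can feed those through `prob_case` directly and skip the simulation step.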

I've co-authored an editorial in Medical Decision Making on the subject of dichotomization. At a minimum, Medical Decision Making will make you jump through extra hoops if you discretize continuous variables before you can publish. Remember: Don't drink and dichotomize! Though if you must dichotomize, choosy mothers choose DON'T.   

Reference
Dawson, Neal V. and Weiss, Robert (2012). Dichotomizing Continuous Variables in Statistical Analysis: A Practice to Avoid. Medical Decision Making 32, 225--226. DOI: 10.1177/0272989X12437605
