Statistics and Computer Science: For Whom the Hand Wrings, It Wrings for Me

Lately there has been much hand wringing about the future of statistics. My hands have also been wrung, and I expect to continue to wring them.

My own worries are several. These can be categorized as scientific concerns, naming rights, legacy/wasting effort, and "can't we all just get along?" concerns.

  • Scientific concerns
    • Suppose statistics disappears, and a new cadre of sheriffs comes in and takes over data analysis of scientific, public health, medical and public policy problems. My perception is the new guys don't actually understand data all that well. They seem to think algorithmically, not scientifically. I think (suspect / am worried about / fear) that the scientific/medical/public health/etc. enterprise will be badly damaged. Until, of course, such time as data scientists manage to reinvent all our statistical wheels and spread those wheels around.
    • Operations research virtually disappeared when statistics started flowering. Similarly, statistics could disappear, replaced by other names or disciplines. What I worry about is that the issues (inference, the distinction between samples and populations, uncertainty assessment, bias) that concern us will be forgotten and much damage done to the scientific enterprise until such time as these issues get reinvented. (Note: INFORMS, an organization of OR people, has around 10,000 members, so they're not in complete remission.)
    • Many of computer science's contributions are accompanied by a PR blitz that doesn't admit discussion and can be difficult to refute once sold to an untutored public: genetic algorithms (they mimic life itself!); unsupervised learning (finding hidden patterns without knowing about them in the first place!); artificial intelligence (it mimics how people think!); neural nets (it's just like the brain!). The problem is that many of the PR blitz-ers don't actually understand the words they use or the underlying issues. If you try to engage the PR-ers on the issues, they don't have a clue what you're talking about. This occurs elsewhere as well -- try talking to people who are well-drilled in Structural Equation Modeling (SEM), and, to a greater or lesser extent, in other methods (instrumental variables, two-stage least squares, GEE, Bayesians even). They'll tell you why it's great, but are often unable to actually address or discuss the issues that underlie the methodology and why it might or might not be preferable to some other model/methodology.
    • I think of myself as a scientist, a generalist. My specialty is "doing science" and "drawing inferences" as opposed to expertise in a specific field. My colleagues in public health and medicine need my expertise -- they lack what I have, as I lack what they have. Together we make a better paper, and do better research, than either of us would separately.
    • Engineering versus science. Many of the solutions I see promulgated out of computer science are engineering solutions. They (may) do a good job of prediction, but don't necessarily tell us anything about the underlying structure of the problem. In other words, they build an edifice, but don't necessarily create understanding about the world. I wonder what Newton would have said to some predictive model about apples and gravity that predicts really well but doesn't actually tell you about the important stuff like mass-1 times mass-2 divided by the squared distance between them. Engineering solutions to statistical problems created without understanding of the underlying science and issues will fail when the regime changes (bye bye dinosaurs, you lasted a long time) and the background evolves.
    • Computer science has been very successful. My perception is that they have been so successful and they've grown so large that they need or want to encompass new fields. Computer scientists have spent huge resources in developing data base technology, and what goes into data bases but data? Suddenly they've got all this data and they've decided to get into the business of analyzing data.
    • Data analysts could inadvertently own the keys to the kingdom (see the Google, Rise of). Hence the need to compete directly with statistics and statisticians. Or better yet, eliminate statistics and statisticians and take over the paradigm yourself.
  • Wasting effort
    • A whole new language has sprung up to talk about things that statisticians already have a language for. This is pretty annoying. Now either statisticians have to learn new words for old concepts or computer scientists get to learn statistics and translate if anyone wants to talk to anyone else. Or we go our separate ways and create separate literatures and duplicate each other's work. Wait, isn't that what's happening already?
    • Suppose statistics disappears in a title (pun intended) wave of machine learning, data science and unsupported vector machines, etc. I worry that my legacy and that of my students may disappear or be marginalized. Self-centered? You bet, but one of my issues. On the off chance I did something useful in my career (cue Julie Andrews, singing "Something Good"), I'd prefer it to be remembered and built upon, not reinvented with someone else's name on it (but, Stigler's law of eponymy).
  • Naming rights
    • I don't mind changing the name from statistics to data science. It's catchier. Statistics and statistician have that popular connotation of collecting lots of random numerical facts, not the actuality of producing information from raw data.
    • As a member of a biostatistics department, though, I wonder: do we change our name to biodata science? Or data bioscience? Perhaps the mouthful "Public Health Data Science", given that we're in a school of public health.
    • And if we're going to change the name of statistics to data science, shouldn't we at least see if someone else will pay us more for the name? We could call ourselves Google (data) scientists or perhaps Kellogg's Rice Krispies scientists, depending on who buys the naming rights.
    • I suppose it could happen that statistics gets absorbed into computer science -- there does seem to be a rapidly growing number of dataheads in computer science, all clamoring for a piece of the pie. I've certainly known colleagues who early on decided to switch their verbal allegiance from statistics to computer science in pursuit of a larger paycheck. I don't mind being absorbed by the group, particularly if, as is my perception, we get paid more in computer science. However, I don't want to be absorbed by the hive mind, and then to regress in our knowledge, practice and understanding.
  • Can't we all just get along?
    • People criticizing statistics often seem angry with statisticians as much as they are actually unhappy about statistics. Perhaps they discern a lack of respect from statisticians? Could be. I sense a lack of reverse respect -- for example someone claiming to be doing "data science" and claiming not to be doing statistics.
    • Inaccurate and narrow characterizations of statistics and what it is seem to be rampant, including from people claiming a statistics background!
    • There's a lot of ignorance about statistics out there. Seems like many computer scientists/data scientists/machine learners/etc. deny that they are doing statistics. Statistics-deniers. Then they list the things they do, most of which are mainstream statistics. I'm reminded of a quote attributed to Abraham Lincoln: "How many legs does a dog have if you call the tail a leg? Four. Calling a tail a leg doesn't make it a leg."
    • Counting, doing a census, taking a mean, survey sampling, smoothing, shrinkage, variable selection, data analysis, cross-validation, non-parametrics (including permutation type non-parametrics and modern Bayesian non-parametrics), classification, confidence interval construction, presenting statistical conclusions, and many more: these are solidly statistical enterprises that require statistical thinking and are statistics. If you would like to claim that constructing the data base that the data you collect go into is not statistics, that's okay by me. But if you don't know the end uses of the data, the statistics involved, it's easy to imagine making mistakes during the construction of the data base.
    • Science versus business. In science, we have to do our analyses in the time allotted, with the people we've got, and with the allotted resources. Oh wait, that's the problem in business too.
    • There has always been a dearth of statisticians. Industry has never had enough; the pharma industry by itself could have hired every PhD statistician produced in a year and still needed more. Now we have the big internet firms needing statistics and statisticians too. Academia has never had more than a small fraction of the needed statistical expertise. Because of this, other disciplines have developed their own training systems: psychometrics, econometrics, education are well known. However, these disciplines were never training so many people that they were doing any more than supplementing the meager supply of statisticians. In contrast, computer science has developed a rather large training system that gives every appearance of being able to compete head to head with statistics for mass or at least in paper production or Google counts.
  • Where do we go from here?
    • A diversity of skills is needed for modern research (my own interest), and for modern business applications.
    • I've met a few operations researchers, very successful in statistics. Two that come to mind are Luke Tierney (U of Iowa Statistics, formerly Minnesota Statistics, where I got my degree) and Chris Nachtsheim (U of Minnesota Operations and Management Science Department). Both are great statisticians, perhaps better than they might have been because their training was (I'm guessing here) non-standard for statistics. Truth be told, non-standard training has often been good for many statisticians, and I suspect it's been good for many scientists. I'm not against expanding, growing or diversifying statistics training.
    • In the other direction, I've been known to do some statistical computing. Should I call myself a computer scientist then? You got me. Certainly the tools I learned in my meager collection of computer science courses have been of use to me and influenced me.
    • Statisticians have contributed to function maximization (Levenberg-Marquardt, Nelder-Mead) and numerical integration (Hastings of Metropolis-Hastings). Mathematicians (Tukey, Savage), computer scientists (Jordan, Pearl), physicists (Jeffreys, Laplace, Jaynes), psychologists (Bentler), medical doctors (Robins, Heckerman) and people based in many other disciplines have contributed to statistics. Good for them, and the better for statistics.
    • If you're manipulating data for purposes of making an inference, it is statistics. Some claim that visualization isn't statistics. I am inclined to disagree, but not vociferously, preferring as I do multiple viewpoints on this and other issues. But typically the purpose of visualization is to look at the visualization and draw some conclusion about the data. Then in that sense, visualization isn't the be-all end-all, but a means along the way to a statistical conclusion, and therefore visualization ought to be considered part of statistics.
    • A critique often levied against statisticians is that we haven't contributed to data visualization. Rats. Gee, have to admit, I was doing something else at the time. Also, there are a lot of skills involved in taking huge amounts of numbers and compacting them into a visual display. Some of the knowledge and skills I know about include statistical skills (I have a few of those) involving data representation & summarization; computing & engineering skills involving the placement of many pixels on a display screen; and psychological knowledge involving aspects of how the human visual system processes color, images, lines and light, just to name three distinct disciplines that matter. It's hard to resist a shout out to Bill Cleveland (http://www.stat.purdue.edu/~wsc/) at this point. Plus there's the subject matter involved in developing visualizations. Recall some of the early graphics of John Tukey -- designed in an era where big massive line printers were king, a technology not in common use these days.
    • I'd rather collaborate than compete. Statisticians learn collaboration. The reason is that no one person knows all that can be understood about any one activity. My research groups usually have, at a minimum, an expert scientist, an expert statistician, a junior scientist who will execute a lot of the work, and a junior statistician. Often we'll have several scientists, people expert in different aspects of the science. Heck, on occasion we've had several expert statisticians contributing to analyzing some problems.
    • When the problems get complex and the data sets get huge, we need people expert in manipulating those large amounts of data and developing algorithms to get done what needs doing. There's plenty of room in the scientific (or business) enterprise for someone with new tools.

The title phrase and intro paragraph borrow from John Donne's Devotions Upon Emergent Occasions, via Hemingway and Wikipedia (http://en.wikipedia.org/wiki/For_Whom_the_Bell_Tolls#Title), and from Hadley Wickham's essay "Data science: how is it different to statistics?" in the September 2014 IMS Bulletin (http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-…)

A small selection of more or less related web posts: Larry Wasserman, (http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-s…); Karl Broman, (http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/); Thomas Spiedel, (http://magazine.amstat.org/wp-content/uploads/2009/11/October2014_AN.pdf; see page 19);
Norman Matloff, (http://magazine.amstat.org/blog/2014/11/01/statistics-losing-ground-to-…)
and many others.
