A Lot moRe Than Fifty Shades of Gray

R has 108 shades of grey with an 'e', and 116 shades of gray with an 'a'. Fully 34% of named colors are gray/grey of some kind.

So when can we expect R, the movie?

Blog appendix
> temp = colors()
> length(temp) #657
[1] 657
>
> temp2 = grep("grey",temp)
> length(temp2) #108
[1] 108
> #temp[temp2]
>
> temp3 = grep("gray",temp)
> length(temp3) #116
[1] 116
> (108+116)/657
[1] 0.3409437
> round((108+116)/657,2)
[1] 0.34
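
If you'd rather see the shades than just count them, here's a minimal sketch (my addition, not part of the session above) that draws the 116 "gray" swatches in one strip:

# Draw the "gray" shades as a strip of color swatches
grays = colors()[grep("gray", colors())]
image(seq_along(grays), 1, as.matrix(seq_along(grays)),
      col = grays, axes = FALSE, xlab = "", ylab = "")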


Statistics and Computer Science: For Whom the Hand Wrings, It Wrings for Me

Lately there has been much hand wringing about the future of statistics. My hands have also been wrung, and I expect to continue to wring them.

My own worries are several. These can be categorized as scientific concerns, naming rights, legacy/wasting effort, and "can't we all just get along?" concerns.

  • Scientific concerns
    • Suppose statistics disappears, and a new cadre of sheriffs comes in and takes over data analysis of scientific, public health, medical and public policy problems. My perception is that the new guys don't actually understand data all that well. They seem to think algorithmically, not scientifically. I think (suspect/am worried about/fear) that the scientific/medical/public health/etc. enterprise will be badly damaged. Until, of course, such time as data scientists manage to reinvent all our statistical wheels and spread those wheels around.
    • Operations research virtually disappeared when statistics started flowering. Similarly, statistics could disappear, replaced by other names or disciplines. What I worry about is that the issues that concern us (inference, the distinction between samples and populations, uncertainty assessment, bias) will be forgotten and much damage done to the scientific enterprise until such time as these issues get reinvented. (Note: INFORMS, an organization of OR people, has around 10,000 members, so they're not in complete remission.)
    • Many of computer science's contributions are accompanied by a PR blitz that doesn't admit discussion and can be difficult to refute once sold to an untutored public: genetic algorithms (they mimic life itself!); unsupervised learning (finding hidden patterns without knowing about them in the first place!); artificial intelligence (it mimics how people think!); neural nets (they're just like the brain!). The problem is that many of the PR blitzers don't actually understand the words they use or the underlying issues. If you try to engage the PR-ers on the issues, they don't have a clue what you're talking about. This occurs elsewhere as well -- try talking to people well-drilled in Structural Equation Modeling (SEM), and to a greater or lesser extent in other areas (instrumental variables, two-stage least squares, GEE, even Bayesians). They'll tell you why it's great, but they are often unable to actually address or discuss the issues that underlie the methodology and why it might or might not be preferable to some other model or methodology.
    • I think of myself as a scientist, a generalist. My specialty is "doing science" and "drawing inferences" as opposed to expertise in a specific field. My colleagues in public health and medicine need my expertise -- they lack what I have, as I lack what they have. Together we make a better paper, and do better research than either of us separately.
    • Engineering versus science. Many of the solutions I see promulgated out of computer science are engineering solutions. They (may) do a good job of prediction, but don't necessarily tell us anything about the underlying structure of the problem. In other words, they build an edifice, but don't necessarily create understanding about the world. I wonder what Newton would have said to some predictive model about apples and gravity that predicts really well but doesn't actually tell you about the important stuff, like mass-1 times mass-2 divided by the squared distance between them. Engineering solutions to statistical problems created without understanding of the underlying science and issues will fail when the regime changes (bye-bye dinosaurs, you lasted a long time) and the background evolves.
    • Computer science has been very successful. My perception is that they have been so successful and have grown so large that they need or want to encompass new fields. Computer scientists have spent huge resources developing database technology, and what goes into databases but data? Suddenly they've got all this data, and they've decided to get into the business of analyzing it.
    • Data analysts could inadvertently own the keys to the kingdom (see the Google, Rise of). Hence the need to compete directly with statistics and statisticians. Or better yet, eliminate statistics and statisticians and take over the paradigm yourself.
  • Wasting effort
    • A whole new language has sprung up to talk about things that statisticians already have a language for. This is pretty annoying. Now either statisticians have to learn new words for old concepts, or computer scientists get to learn statistics and translate, if anyone wants to talk to anyone else. Or we go our separate ways and create separate literatures and duplicate each other's work. Wait, isn't that what's happening already?
    • Suppose statistics disappears in a title (pun intended) wave of machine learning, data science, unsupported vector machines, etc. Then I worry that my legacy and that of my students may disappear or be marginalized. Self-centered? You bet, but it's one of my issues. On the off chance I did something useful in my career (cue Julie Andrews singing "Something Good"), I'd prefer it to be remembered and built upon, not reinvented with someone else's name on it (but see Stigler's law of eponymy).
  • Naming rights
    • I don't mind changing the name from statistics to data science. It's catchier. Statistics and statistician carry that popular connotation of collecting lots of random numerical facts, rather than the actuality of producing information from raw data.
    • As a member of a biostatistics department, though, do we change our name to biodata science? Or data bioscience? Perhaps the mouthful Public Health Data Science, given that we're in a school of public health.
    • And if we're going to change the name of statistics to data science, shouldn't we at least see if someone else will pay us more for the name? We could call ourselves Google (data) scientists or perhaps Kellogg's Rice Krispies scientists, depending on who buys the naming rights.
    • I suppose it could happen that statistics gets absorbed into computer science -- there does seem to be a rapidly growing number of dataheads in computer science, all clamoring for a piece of the pie. I've certainly known colleagues who early on decided to switch their verbal allegiance from statistics to computer science in pursuit of a larger paycheck. I don't mind being absorbed by the group, particularly if, as is my perception, we get paid more in computer science. However, I don't want to be absorbed by the hive mind and then regress in our knowledge, practice and understanding.
  • Can't we all just get along?
    • People criticizing statistics often seem as angry with statisticians as they are actually unhappy about statistics. Perhaps they discern a lack of respect from statisticians? Could be. I sense a lack of reverse respect as well -- for example, someone claiming to be doing "data science" while claiming not to be doing statistics.
    • Inaccurate and narrow characterizations of statistics and what it is seem to be rampant, including from people claiming a statistics background!
    • There's a lot of ignorance about statistics out there. Seems like many computer scientists/data scientists/machine learners/etc. deny that they are doing statistics. Statistics-deniers. Then they list the things they do, most of which are mainstream statistics. I'm reminded of a quote attributed to Abraham Lincoln: "How many legs does a dog have if you call the tail a leg? Four. Calling a tail a leg doesn't make it a leg."
    • Counting, doing a census, taking a mean, survey sampling, smoothing, shrinkage, variable selection, data analysis, cross-validation, non-parametrics (including permutation-type non-parametrics and modern Bayesian non-parametrics), classification, confidence interval construction, presenting statistical conclusions, and many more: these are solidly statistical enterprises that require statistical thinking and are statistics. Constructing the database that the data you collect go into -- if you would like to claim that's not statistics, that's okay by me. But if you don't know the end uses of the data, the statistics involved, it's easy to imagine making mistakes during the construction of the database.
    • Science versus business. In science, we have to do our analyses in the time allotted, with the people we've got, and with the allotted resources. Oh wait, that's the problem in business too.
    • There has always been a dearth of statisticians. Industry has never had enough; the pharma industry by itself could have hired every PhD statistician produced in a year and still needed more. Now we have the big internet firms needing statistics and statisticians too. Academia has never had more than a small fraction of the needed statistical expertise. Because of this, other disciplines have developed their own training systems: psychometrics, econometrics and education are well-known examples. However, these disciplines were never training so many people that they were doing any more than supplementing the meager supply of statisticians. In contrast, computer science has developed a rather large training system that gives every appearance of being able to compete head to head with statistics in mass, or at least in paper production or Google counts.
  • Where do we go from here?
    • A diversity of skills is needed for modern research (my own interest) and for modern business applications.
    • I've met a few operations researchers who are very successful in statistics. Two that come to mind are Luke Tierney (U of Iowa Statistics, formerly Minnesota Statistics, where I got my degree) and Chris Nachtsheim (U of Minnesota Operations and Management Science Department). Both are great statisticians, perhaps better than they might have been because their training was (I'm guessing here) non-standard for statistics. Truth be told, non-standard training has often been good for many statisticians, and I suspect it's been good for many scientists. I'm not against expanding, growing or diversifying statistics training.
    • In the other direction, I've been known to do some statistical computing. Should I call myself a computer scientist then? You got me. Certainly the tools I learned in my meager collection of computer science courses have been of use to me and influenced me.
    • Statisticians have contributed to function maximization (Levenberg-Marquardt, Nelder-Mead) and numerical integration (Hastings of Metropolis-Hastings). Mathematicians (Tukey, Savage), computer scientists (Jordan, Pearl), physicists (Jeffreys, Laplace, Jaynes), psychologists (Bentler), medical doctors (Robins, Heckerman) and people based in many other disciplines have contributed to statistics. Good for them, and the better for statistics.
    • If you're manipulating data for purposes of making an inference, it is statistics. Some claim that visualization isn't statistics. I am inclined to disagree, but not vociferously, preferring as I do multiple viewpoints on this and other issues. But typically the purpose of visualization is to look at the visualization and draw some conclusion about the data. In that sense, visualization isn't the be-all end-all, but a means along the way to a statistical conclusion, and therefore visualization ought to be considered part of statistics.
    • A critique often levied against statisticians is that we haven't contributed to data visualization. Rats. Gee, have to admit, I was doing something else at the time. Also, there are a lot of skills involved in taking huge amounts of numbers and compacting them into a visual display. Some of the knowledge and skills involved: statistical skills (I have a few of those) involving data representation and summarization; computing and engineering skills involving the placement of many pixels on a display screen; and psychological knowledge involving how the human visual system processes color, images, lines and light -- to name just three distinct disciplines that matter. It's hard to resist a shout-out to Bill Cleveland (http://www.stat.purdue.edu/~wsc/) at this point. Plus there's the subject matter involved in developing visualizations. Recall some of the early graphics of John Tukey -- designed in an era when big massive line printers were king, a technology not in common use these days.
    • I'd rather collaborate than compete. Statisticians learn collaboration. The reason is that no one person knows all that can be understood about any one activity. My research groups usually have, at a minimum, an expert scientist, an expert statistician, a junior scientist who will execute a lot of the work, and a junior statistician. Often we'll have several scientists, people expert in different aspects of the science. Heck, on occasion we've had several expert statisticians contributing to analyzing some problems.
    • When the problems get complex and the data sets get huge, we need people expert in manipulating those large amounts of data and in developing algorithms to get done what needs doing. There's plenty of room in the scientific (or business) enterprise for someone with new tools.

The title phrase and intro paragraph borrow from John Donne's Devotions Upon Emergent Occasions via Hemingway and Wikipedia (http://en.wikipedia.org/wiki/For_Whom_the_Bell_Tolls#Title) and Hadley Wickham's essay "Data science: how is it different to statistics?" in the September 2014 IMS Bulletin (http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-…).

A small selection of more or less related web posts: Larry Wasserman (http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-s…); Karl Broman (http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/); Thomas Speidel (http://magazine.amstat.org/wp-content/uploads/2009/11/October2014_AN.pdf, see page 19); Norman Matloff (http://magazine.amstat.org/blog/2014/11/01/statistics-losing-ground-to-…); and many others.

Kathryn Chaloner 1954-2014

Prescript: Memory is fickle. It's been a while since these events, a while since I took her regression course and a while since I've read her papers. One thing I've found is that memories of the contents of particular papers evolve with time, and memory often doesn't line up with actual content. Sometimes my memory is associated with the conclusions I've drawn and the lessons I've learned, not the actual paper content.

Kathryn Chaloner taught the regression course I took in grad school at the University of Minnesota. There was one homework set I thought I had a perfectly good solution to. But I only got an A- on the problem set. Kathryn gave us her analysis of the problem and, in explaining the grading, said there actually was a variable missing in the analysis. Unless you had that variable you couldn't get it right; or at least we should have concluded that something was missing. So she felt no one deserved an A. Well, we didn't have the variable in the data set, nor even a hint about it, and I was rather miffed over the whole thing. But as you can tell, her claim did stay with me. It's a good attitude for a researcher to have, that something is missing, and to always be thinking about what it might be. And sure enough, over the years, it's been a rare analysis that wasn't missing something or other. Perhaps something the investigator forgot or didn't know to tell you; equally often, there's information that was never collected that you would want or need for the analysis.

In grad school, I read as many papers by the faculty as I could manage or stand. This included Kathryn's papers. The thing is, I read possibly more of her papers and definitely a greater percentage of her output as compared to any other prof. In terms of percentage, probably by a factor of 2 or maybe even 3. Now partly this was because she was a young assistant prof, and it was actually possible to read a large fraction of her work. But also it was because I was interested in her work. I could read it; and the work mattered.

There was her definition of Bayes residual (Chaloner & Brant 1988; Chaloner 1991; Chaloner 1994). I've a doctoral student plotting posterior mean residuals as part of her dissertation right now and I'll be talking about them later this quarter in my Bayesian course.
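For readers who haven't met these, roughly (from my fickle memory of Chaloner & Brant 1988, so check the paper): in the linear model $y_i = x_i'\beta + \varepsilon_i$, treat the realized errors $\varepsilon_i$ as unknowns with their own posterior distribution. The posterior mean residual is then $E(\varepsilon_i \mid y) = y_i - x_i' E(\beta \mid y)$, and an observation gets flagged as an outlier when the posterior probability that $|\varepsilon_i|$ exceeds some multiple of the error standard deviation is large.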

I never quite got interested in optimal design, but I did think about it; as part of that I read Chaloner & Larntz (1989). I spent a lot of time with that 1989 paper.

For a heavily Bayesian department, Minnesota didn't do much elicitation that I noticed, or even much in the way of prior specification. Kathryn had the two papers with George Duncan (Chaloner & Duncan 1983; Chaloner & Duncan 1987), which I recall liking at the time, as well as the later Chaloner, Church, Louis & Matts (1993). I spend quite a bit of time in my Bayes class on specifying priors.

In class I exhort my students that if they're not using a proper, informative prior then they're leaving money on the table. Kathryn's 1987 Technometrics paper was certainly eye-opening. Here was a Bayesian playing a frequentist's game. Putting informative priors on parameters means that Bayes estimators will win by even more.
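
A toy sketch of that game (mine, not from the paper): when the informative prior is right, the posterior mean beats the usual frequentist estimator on mean squared error.

# Estimating normal means: posterior mean vs. MLE, prior N(0,1), error sd 2
set.seed(2)
mu = rnorm(1e4, mean = 0, sd = 1)    # true means drawn from the prior
y  = rnorm(1e4, mean = mu, sd = 2)   # one observation per mean
mle   = y                            # the frequentist estimate
bayes = y / (1 + 2^2)                # posterior mean shrinks y toward 0
mean((mle - mu)^2)                   # about 4
mean((bayes - mu)^2)                 # about 0.8 -- the money left on the table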

These papers all influenced me, and at a formative stage of my career. And they're all pretty much still with me, affecting my research and affecting what I teach. And I still think about them; all of them.

Here are the citations. Read 'em. I did. Read 'em and weep if you need to. But, give a cheer also. Cheer for Kathryn and cheer for her accomplishments. This is my cheer for Kathryn. I'll weep in a bit.

Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75(4), 651-659.

Chaloner, K. (1991). Bayesian residual analysis in the presence of censoring. Biometrika, 78(3), 637-644.

Chaloner, K. (1994). Residual analysis and outliers in Bayesian hierarchical models. In Aspects of Uncertainty: A Tribute to D. V. Lindley, 149-157.

Chaloner, K., & Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21(2), 191-208.

Chaloner, K. M., & Duncan, G. T. (1983). Assessment of a beta prior distribution: PM elicitation. The Statistician, 174-180.

Chaloner, K., & Duncan, G. T. (1987). Some properties of the Dirichlet-multinomial distribution and its use in prior elicitation. Communications in Statistics-Theory and Methods, 16(2), 511-523.

Chaloner, K., Church, T., Louis, T. A., & Matts, J. P. (1993). Graphical elicitation of a prior distribution for a clinical trial. The Statistician, 341-353.

Chaloner, K. (1987). A Bayesian approach to the estimation of variance components for the unbalanced one-way random model. Technometrics, 29(3), 323-337.

Favorite Feller-ism: The Persistence of Bad Luck

William Feller's book An Introduction to Probability Theory and Its Applications, Volume I is more commonly and affectionately known as Feller Volume One. It is on many statisticians' and mathematicians' deserted (desert?) island book lists. The deserted island book list is your list of 10 (or so) books that you take with you to a deserted island to keep you mentally fit. Because, as is well known, there is so much food, shelter and water on any deserted island that you will have plenty of time to read Feller Volume One and do the problems. I am assured there will be plenty of room to write solutions in the sand between low and high tide. I know this from watching reruns of Gilligan's Island and the high-tech version of Gilligan's Island, LOST.

There is also Feller Volume II, an excellent text on mathematical statistics. One of my favorite mathematical stories is from Volume II and involves the 'persistence of bad luck'. None of the following is mine; it is all a rephrasing of Feller, Volume II, starting on page 15.

Consider a waiting time $X$. I experience $X_0$ as my waiting time. What waiting time, you ask? Well, I usually think of this at Costco as I go to pick a checkout line to stand in. So $X_0$ is the waiting time until I get to the checker.

Of course, this waiting time is much too long. For a spot of cruel fun, I get my countably infinite friends $i=1, 2, \ldots$ to visit Costco and stand in the same interminable line and experience waiting times $X_i$. How long until one of my friends $i$ experiences a waiting time $X_i$ longer than my waiting time? 

We are looking for the first friend $i$ with $X_i > X_0$, where all previous friends $j=1, \ldots, i-1$ have $X_j < X_0$. This happens exactly when $X_i$ is the longest of $X_0, X_1, \ldots, X_i$ and $X_0$ is the second longest, an event with probability $1/(i(i+1))$. The index $N$ of the first friend to out-wait me therefore has an infinite mean: $E(N) = \sum_{i \ge 1} i \cdot \frac{1}{i(i+1)} = \sum_{i \ge 1} \frac{1}{i+1} = \infty$. That shows that I was indeed very unlucky in how long I had to wait until I finally got to the front of the checkout queue.

Yes, there are a few conditions: $X_0$ and the $X_i$ need to be iid continuous random variables. My friends need to be numbered (You do have friends, right? And you number your friends, don't you?). As most people don't like the tattoo option, I've had to come up with a database of friends and their numbers. Also, every friend needs an infinitely accurate stopwatch (available from Costco online!) and a Costco membership. You remember that quarter Costco had infinite profits? That was the quarter all my friends joined up.
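
If you'd rather simulate than sum the series, here's a quick sketch (mine, not Feller's):

# Simulate N: the index of the first friend whose waiting time beats mine
set.seed(1)
first.to.beat.me = function() {
  x0 = rexp(1)                      # my waiting time
  i = 1
  while (rexp(1) <= x0) i = i + 1   # friends line up until one out-waits me
  i
}
N = replicate(1e5, first.to.beat.me())
mean(N == 1)   # near 1/2, matching 1/(1*2)
mean(N == 2)   # near 1/6, matching 1/(2*3)
mean(N)        # big, and unstable from run to run: E(N) is infinite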

Big Data Without Statisticians: BD2K Symposium At UCLA

UCLA is having a big data conference on Thursday and Friday, March 27-28, 2014. The conference is organized by four computer science and genomic biology types. Speakers cluster [one of the rare appropriate uses of cluster analysis I know of] into three types of folks: big biologists [they must be big, they're doing big data] doing genomic stuff; computer scientists doing topic models; and a few math modelers who, as far as I know, don't usually look at data at all.

The conference [well, it's called a workshop] is idiosyncratic [unique] in that it has no statisticians involved. The four organizers hail from Radiological Sciences, Physiology, Microbiology and Computer Science. If I read and recall their backgrounds correctly, two come from comp sci and two from biology backgrounds; three are working in genomics of some sort and one in imaging. It seems probable that all are doing statistics, likely very complex statistics, and that's fine; there aren't enough statisticians around, there never have been. Scientists have always had to do their own data analysis.

It is a bit hard to imagine who might be interested in the entire conference other than me and perhaps a few friends of mine. The genomicists may be interested in other genomicists but not in the math modelers. The comp sci folks may be interested in the other topic modelers [topic models being a new class of perfectly good models], but will they be interested in genomics? And I haven't a clue what other talks the math modelers will want to hear.

My main interest as a statistician is perhaps to hear about the dynamic fringes of statistical application. The edges of statistical application are where a lot of the fun stuff is going to be happening. And to defend a bit of statistics/biostatistics turf. I certainly want to hear about the topic models and from the topic modelers. I want to hear from the math modelers in case they've stumbled into some data; I happen to know a bit about how to combine data and math models, and I care about the problem too. But I can't make progress without help. I'm not so interested in the genomics; I'm not not interested in the genomics, mind you, but the problem is too much data and not enough time to investigate it.

Another interest of mine is getting to know who the players are. The big data movement is going to get political shortly on campus and you can't tell the players without a scorecard. 

I imagine eventually the big data movement will devolve into a fight over resources. There is a ton of data out there and few statisticians. People will want resources to analyze their data, and will want those resources supplied by central campus, but won't want the resources controlled by stat or biostat departments. Minnesota is currently having a nice campus-wide initiative (or fight) about big data; apparently the powers that be didn't want statistics or biostatistics involved at all.

There are lots of misconceptions about statisticians and statistics. I'm expecting to hear some, and I'll try to keep track of what I hear at the conference. Some of these are sheer prejudice: that statisticians don't care about big data, or that we don't care about someone's specific type of data. This is partly a forest/tree problem: individual statisticians are busy and may not have time for someone else's data or problems. Another issue is way too much data and way too few statisticians. Statisticians code up lots of solutions to problems in canned software, but that will never handle the bleeding edge of science and statistical problems. So we all get to develop new methods and apply old methods in new ways to new sorts of data. That's why statistics is so much fun. Welcome to our world.

I'll see if I can stomach the entire conference or not. There are serious difficulties in communicating across disciplinary boundaries, and I've had my share of successes and failures, possibly more than most, given that biostatistics is naturally poly-disciplinary and that I have a tendency to be apian and flit from science to science. Often people don't understand their audience's background, and they don't communicate well because of that. That applies to me, and it applies to scientists I sometimes try to talk to or listen to. Even people who spend their careers on the boundaries can easily make mistakes: mistakes in communicating, mistakes in talking about tools that come from outside their immediate scientific background, mistakes in assessing what an audience needs to hear to follow the research. Thus it is quite possible that most speakers will be talking to their own close colleagues, and it will be of interest to see whether some, or indeed any, of the speakers are prepared to speak to the broad audience this conference seems designed to attract.