A Lot moRe Than Fifty Shades of Gray

R has 108 shades of grey with an 'e', and 116 shades of gray with an 'a'. Fully 34% of R's 657 named colors are gray/grey of some kind.

So when can we expect R, the movie?

# Blog appendix
> temp = colors()
> length(temp) #657
[1] 657
>
> temp2 = grep("grey",temp)
> length(temp2) #108
[1] 108
> #temp[temp2]
>
> temp3 = grep("gray",temp)
> length(temp3) #116
[1] 116
> (108+116)/657
[1] 0.3409437
> round((108+116)/657,2)
[1] 0.34
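> # One-line check (my addition): no color name contains both spellings,
> # so a single regular expression reproduces the combined count
> length(grep("gr[ae]y", temp))
[1] 224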

Statistics and Computer Science: For Whom the Hand Wrings, It Wrings for Me

Lately there has been much hand wringing about the future of statistics. My hands have also been wrung, and I expect to continue to wring them.

My own worries are several. These can be categorized as scientific concerns, naming rights, legacy/wasting effort, and "can't we all just get along?" concerns.

  • Scientific concerns
    • Suppose statistics disappears, and a new cadre of sheriffs comes in and takes over data analysis of scientific, public health, medical and public policy problems. My perception is the new guys don't actually understand data all that well. They seem to think algorithmically, not scientifically. I think (suspect/am worried about/fear) that the scientific/medical/public health/etc. enterprise will be badly damaged, until of course such time as data scientists manage to reinvent all our statistical wheels and spread those wheels around.
    • Operations research virtually disappeared when statistics started flowering. Similarly, statistics could disappear, replaced by other names or disciplines. What I worry about is that the issues that concern us (inference, the distinction between samples and populations, uncertainty assessment, bias) will be forgotten and much damage done to the scientific enterprise until such time as these issues get reinvented. (Note: INFORMS, an organization of OR people, has around 10,000 members, so they're not in complete remission.)
    • Many of computer science's contributions are accompanied by a PR blitz that doesn't admit discussion and can be difficult to refute once sold to an untutored public: genetic algorithms (they mimic life itself!); unsupervised learning (finding hidden patterns without knowing about them in the first place!); artificial intelligence (it mimics how people think!); neural nets (they're just like the brain!). The problem is that many of the PR blitz-ers don't actually understand the words they use or the underlying issues. If you try to engage the PR-ers on the issues, they don't have a clue what you're talking about. This occurs elsewhere as well -- try talking to people who are well-drilled in Structural Equation Modeling (SEM), and to a greater or lesser extent to devotees of instrumental variables, two-stage least squares, GEE, even Bayesians. They'll tell you why their method is great, but are often unable to actually address or discuss the issues that underlie the methodology and why it might or might not be preferable to some other model or methodology.
    • I think of myself as a scientist, a generalist. My specialty is "doing science" and "drawing inferences" as opposed to expertise in a specific field. My colleagues in public health and medicine need my expertise -- they lack what I have, as I lack what they have. Together we make a better paper, and do better research than either of us separately.
    • Engineering versus science. Many of the solutions I see promulgated out of computer science are engineering solutions. They (may) do a good job of prediction, but don't necessarily tell us anything about the underlying structure of the problem. In other words, they build an edifice, but don't necessarily create understanding about the world. I wonder what Newton would have said to some predictive model about apples and gravity that predicts really well but doesn't actually tell you about the important stuff like mass-1 times mass-2 divided by the squared distance between them. Engineering solutions to statistical problems created without understanding of the underlying science and issues will fail when the regime changes (bye bye dinosaurs, you lasted a long time) and the background evolves.
    • Computer science has been very successful. My perception is that they have been so successful and have grown so large that they need or want to encompass new fields. Computer scientists have spent huge resources developing database technology, and what goes into databases but data? Suddenly they've got all this data and they've decided to get into the business of analyzing data.
    • Data analysts could inadvertently own the keys to the kingdom (see the Google, Rise of). Hence the need to compete directly with statistics and statisticians. Or better yet, eliminate statistics and statisticians and take over the paradigm yourself.
  • Wasting effort
    • A whole new language has sprung up to talk about things that statisticians already have a language for. This is pretty annoying. Now either statisticians have to learn new words for old concepts or computer scientists get to learn statistics and translate, if anyone wants to talk to anyone else. Or we go our separate ways and create separate literatures and duplicate each other's work. Wait, isn't that what's happening already?
    • Suppose statistics disappears in a title (pun intended) wave of machine learning, data science, unsupported vector machines, etc. I worry that my legacy and that of my students may disappear or be marginalized. Self-centered? You bet, but it's one of my issues. On the off chance I did something useful in my career (cue Julie Andrews, singing "Something Good"), I'd prefer it to be remembered and built upon, not reinvented with someone else's name on it (but see Stigler's law of eponymy).
  • Naming rights
    • I don't mind changing the name from statistics to data science. It's catchier. Statistics and statisticians have that popular connotation of collecting lots of random numerical facts, not the actuality of producing information from raw data.
    • As a member of a biostatistics department though, do we change our name to biodata science? Or data bioscience? Perhaps the mouthful Public Health Data Science, given that we're in a school of public health.
    • And if we're going to change the name of statistics to data science, shouldn't we at least see if someone else will pay us more for the name? We could call ourselves Google (data) scientists or perhaps Kellogg's Rice Krispies scientists, depending on who buys the naming rights.
    • I suppose it could happen that statistics gets absorbed into computer science -- there does seem to be a rapidly growing number of dataheads in computer science, all clamoring for a piece of the pie. I've certainly known colleagues who early on decided to switch their verbal allegiance from statistics to computer science in pursuit of a larger paycheck. I don't mind being absorbed by the group, particularly if, as is my perception, we get paid more in computer science. However, I don't want to be absorbed by the hive mind and then to regress in our knowledge, practice and understanding.
  • Can't we all just get along?
    • People criticizing statistics often seem angry with statisticians as much as they are actually unhappy about statistics. Perhaps they discern a lack of respect from statisticians? Could be. I sense a lack of reverse respect -- for example someone claiming to be doing "data science" and claiming not to be doing statistics.
    • Inaccurate and narrow characterizations of statistics and what it is seem to be rampant, including from people claiming a statistics background!
    • There's a lot of ignorance about statistics out there. Seems like many computer scientists/data scientists/machine learners/etc. deny that they are doing statistics. Statistics-deniers. Then they list the things they do, most of which are mainstream statistics. I'm reminded of a quote attributed to Abraham Lincoln: "How many legs does a dog have if you call the tail a leg? Four. Calling a tail a leg doesn't make it a leg."
    • Counting, doing a census, taking a mean, survey sampling, smoothing, shrinkage, variable selection, data analysis, cross-validation, non-parametrics (including permutation-type non-parametrics and modern Bayesian non-parametrics), classification, confidence interval construction, presenting statistical conclusions, and many more: these are solidly statistical enterprises that require statistical thinking and are statistics. Constructing the database that the data you collect go into, if you would like to claim that's not statistics, that's okay by me. But if you don't know the end uses of the data, the statistics involved, it's easy to imagine making mistakes during the construction of the database.
    • Science versus business. In science, we have to do our analyses in the time allotted, with the people we've got, and with the allotted resources. Oh wait, that's the problem in business too.
    • There has always been a dearth of statisticians. Industry has never had enough; the pharma industry by itself could have hired every PhD statistician produced in a year and still needed more. Now we have the big internet firms needing statistics and statisticians too. Academia has never had more than a small fraction of the needed statistical expertise. Because of this, other disciplines have developed their own training systems: psychometrics, econometrics and education are well known examples. However, these disciplines were never training so many people that they were doing any more than supplementing the meager supply of statisticians. In contrast, computer science has developed a rather large training system that gives every appearance of being able to compete head to head with statistics in mass, or at least in paper production or Google counts.
  • Where do we go from here?
    • A diversity of skills is needed for modern research (my own interest), and for modern business applications.
    • I've met a few operations researchers who became very successful in statistics. Two that come to mind are Luke Tierney (U of Iowa Statistics, formerly Minnesota Statistics where I got my degree) and Chris Nachtsheim (U of Minnesota Operations and Management Science Department). Both are great statisticians, perhaps better than they might have been because their training was (I'm guessing here) non-standard for statistics. Truth be told, non-standard training has often been good for many statisticians, and I suspect it's been good for many scientists. I'm not against expanding, growing or diversifying statistics training.
    • In the other direction, I've been known to do some statistical computing. Should I call myself a computer scientist then? You got me. Certainly the tools I learned in my meager collection of computer science courses have been of use to me and influenced me.
    • Statisticians have contributed to function maximization (Levenberg-Marquardt, Nelder-Mead) and numerical integration (Hastings of Metropolis-Hastings). Mathematicians (Tukey, Savage), computer scientists (Jordan, Pearl), physicists (Jeffreys, Laplace, Jaynes), psychologists (Bentler), medical doctors (Robins, Heckerman) and people based in many other disciplines have contributed to statistics. Good for them, and the better for statistics.
    • If you're manipulating data for purposes of making an inference, it is statistics. Some claim that visualization isn't statistics. I am inclined to disagree, but not vociferously, preferring as I do multiple viewpoints on this and other issues. But typically the purpose of visualization is to look at the visualization and draw some conclusion about the data. Then in that sense, visualization isn't the be-all end-all, but a means along the way to a statistical conclusion, and therefore visualization ought to be considered part of statistics.
    • A critique often levied against statisticians is that we haven't contributed to data visualization. Rats. Gee, have to admit, I was doing something else at the time. Also, there are a lot of skills involved in taking huge amounts of numbers and compacting them into a visual display. Some of the knowledge and skills involved: statistical skills (I have a few of those) involving data representation and summarization; computing and engineering skills involving the placement of many pixels on a display screen; and psychological knowledge involving how the human visual system processes color, images, lines and light -- to name three distinct disciplines that matter. It's hard to resist a shout out to Bill Cleveland (http://www.stat.purdue.edu/~wsc/) at this point. Plus there's the subject matter involved in developing visualizations. Recall some of the early graphics of John Tukey -- designed in an era when big massive line printers were king, a technology not in common use these days.
    • I'd rather collaborate than compete. Statisticians learn collaboration. The reason is that no one person knows all that can be understood about any one activity. My research groups usually have, at a minimum, an expert scientist, an expert statistician, a junior scientist who will execute a lot of the work, and a junior statistician. Often we'll have several scientists, people expert in different aspects of the science. Heck, on occasion we've had several expert statisticians contributing to analyzing some problems.
    • When the problems get complex and the data sets get huge, we need people expert in manipulating those large amounts of data and in developing algorithms to get done what needs doing. There's plenty of room in the scientific (or business) enterprise for someone with new tools.

The title phrase and intro paragraph borrow from John Donne's Devotions Upon Emergent Occasions via Hemingway and Wikipedia (http://en.wikipedia.org/wiki/For_Whom_the_Bell_Tolls#Title) and Hadley Wickham's essay "Data science: how is it different to statistics?" in the September 2014 IMS Bulletin (http://bulletin.imstat.org/2014/09/data-science-how-is-it-different-to-…).

A small selection of more or less related web posts: Larry Wasserman (http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-s…); Karl Broman (http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/); Thomas Spiedel (http://magazine.amstat.org/wp-content/uploads/2009/11/October2014_AN.pdf, see page 19); Norman Matloff (http://magazine.amstat.org/blog/2014/11/01/statistics-losing-ground-to-…); and many others.

Kathryn Chaloner 1954-2014

Prescript: Memory is fickle. It's been a while since these events, a while since I took her regression course and a while since I've read her papers. One thing I've found is that memories of the contents of particular papers evolve with time, and memory often doesn't line up with actual content. Sometimes my memory is associated with the conclusions I've drawn and the lessons I've learned, not the actual paper content.

Kathryn Chaloner taught the regression course I took in grad school at the University of Minnesota. There was one homework set I thought I had a perfectly good solution to. But I only got an A- on the problem set. Kathryn gave us her analysis of the problem and, in explaining the grading, said there actually was a variable missing in the analysis. And unless you had that variable you couldn't get it right; or at least we should have concluded that something was missing. So she felt no one deserved an A. Well, we didn't have the variable in the data set, nor even a hint about it, and I was rather miffed over the whole thing. But as you can tell, her claim did stay with me. It's a good attitude for a researcher to have, that something is missing, and to always be thinking about what it might be. And sure enough, over the years, it's been a rare analysis that wasn't missing something or other. Perhaps something the investigator forgot or didn't know to tell you; equally often, there's information that was never collected that you would want or need for the analysis.

In grad school, I read as many papers by the faculty as I could manage or stand. This included Kathryn's papers. The thing is, I read possibly more of her papers, and definitely a greater percentage of her output, than of any other prof's -- in terms of percentage, probably by a factor of 2 or maybe even 3. Now partly this was because she was a young assistant prof, and it was actually possible to read a large fraction of her work. But also it was because I was interested in her work. I could read it; and the work mattered.

There was her definition of Bayes residual (Chaloner & Brant 1988; Chaloner 1991; Chaloner 1994). I've a doctoral student plotting posterior mean residuals as part of her dissertation right now and I'll be talking about them later this quarter in my Bayesian course.
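
As a minimal illustration (my toy sketch, not code from her papers): the posterior mean residual for observation $i$ is $y_i$ minus the posterior mean of its fitted value. In R, for a conjugate normal-means toy problem:

# Toy example: y_i ~ N(mu, 1) with conjugate prior mu ~ N(0, 100)
set.seed(1)
y <- rnorm(20, mean = 3)
n <- length(y)
post_var  <- 1 / (n + 1 / 100)        # posterior variance of mu
post_mean <- post_var * sum(y)        # posterior mean of mu
mu_draws  <- rnorm(5000, post_mean, sqrt(post_var))
r <- y - mean(mu_draws)               # posterior mean residuals, one per y_i
plot(r)                               # large |r_i| flags potential outliers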

I never quite got interested in optimal design, but I did think about it; as part of that I read Chaloner & Larntz (1989). I spent a lot of time with that 1989 paper.

For a heavily Bayesian department, Minnesota didn't do much elicitation that I noticed, or even much in the way of prior specification. Kathryn had the two papers with George Duncan (Chaloner & Duncan 1983; Chaloner & Duncan 1987), which I recall liking at the time, as well as the later Chaloner, Church, Louis & Matts (1993). I spend quite a bit of time in my Bayes class on specifying priors.

In class I exhort my students that if they're not using a proper, informative prior then they're leaving money on the table. Kathryn's 1987 Technometrics paper was certainly eye-opening. Here was a Bayesian playing a frequentist's game. Putting informative priors on parameters means that Bayes estimators will win by even more.
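
Here is a toy Monte Carlo check of the money-on-the-table claim (my sketch, in a simple normal-mean setting rather than the variance-components setting of her paper): when the informative prior is correct, the Bayes estimator beats the sample mean in frequentist mean squared error.

# Draw true means from the prior, then data; compare risks
set.seed(2)
reps <- 1e4; n <- 5
mu    <- rnorm(reps)                   # true means, drawn from the N(0, 1) prior
ybar  <- rnorm(reps, mu, 1 / sqrt(n))  # sample means of n N(mu, 1) observations
bayes <- ybar * n / (n + 1)            # posterior mean under the correct prior
mean((ybar  - mu)^2)                   # approx 1/n     = 0.200
mean((bayes - mu)^2)                   # approx 1/(n+1) = 0.167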

These papers all influenced me, and at a formative stage of my career. And they're all pretty much still with me, affecting my research and affecting what I teach. And I still think about them; all of them.

Here are the citations. Read 'em. I did. Read 'em and weep if you need to. But, give a cheer also. Cheer for Kathryn and cheer for her accomplishments. This is my cheer for Kathryn. I'll weep in a bit.

Chaloner, K., & Brant, R. (1988). A Bayesian approach to outlier detection and residual analysis. Biometrika, 75(4), 651-659.

Chaloner, K. (1991). Bayesian residual analysis in the presence of censoring. Biometrika, 78(3), 637-644.

Chaloner, K. (1994). Residual analysis and outliers in Bayesian hierarchical models. Aspects of Uncertainty: A Tribute to D. V. Lindley, 149-157.

Chaloner, K., & Larntz, K. (1989). Optimal Bayesian design applied to logistic regression experiments. Journal of Statistical Planning and Inference, 21(2), 191-208.

Chaloner, K. M., & Duncan, G. T. (1983). Assessment of a beta prior distribution: PM elicitation. The Statistician, 174-180.

Chaloner, K., & Duncan, G. T. (1987). Some properties of the Dirichlet-multinomial distribution and its use in prior elicitation. Communications in Statistics-Theory and Methods, 16(2), 511-523.

Chaloner, K., Church, T., Louis, T. A., & Matts, J. P. (1993). Graphical elicitation of a prior distribution for a clinical trial. The Statistician, 341-353.

Chaloner, K. (1987). A Bayesian approach to the estimation of variance components for the unbalanced one-way random model. Technometrics, 29(3), 323-337.

Favorite Feller-ism: The Persistence of Bad Luck

William Feller's book An Introduction to Probability Theory and Its Applications, Volume I is more commonly and affectionately known as Feller Volume One. It is on many statisticians' and mathematicians' deserted (desert?) island book lists. The deserted island book list is your list of 10 (or so) books that you take with you to a deserted island to keep you mentally fit. Because, as is well known, there is so much food, shelter and water on any deserted island, you will have plenty of time to read Feller Volume One and do the problems. I am assured there will be plenty of room to write solutions in the sand between low and high tide. I know this from watching reruns of Gilligan's Island and the high-tech version of Gilligan's Island, LOST.

There is also Feller Volume II, an excellent text on mathematical statistics. One of my favorite mathematical stories is from Volume II and involves the 'persistence of bad luck'. None of the following is mine; it is all rephrasing of Feller in Feller Volume II, starting on page 15.

Consider a waiting time $X$. I experience $X_0$ as my waiting time. What waiting time, you ask? Well, I usually think of this at Costco as I go to pick a checkout line to stand in. So $X_0$ is the waiting time until I get to the checker.

Of course, this waiting time is much too long. For a spot of cruel fun, I get my countably infinite friends $i=1, 2, \ldots$ to visit Costco and stand in the same interminable line and experience waiting times $X_i$. How long until one of my friends $i$ experiences a waiting time $X_i$ longer than my waiting time? 

We are looking for the first friend $i$ with $X_i > X_0$, where all previous friends $j=1, \ldots, i-1$ have $X_j < X_0$. That happens exactly when, among the $i+1$ waiting times so far, the longest is $X_i$ and the second longest is $X_0$, an event of probability $1/(i(i+1))$. The index $N$ of the first friend to out-wait me therefore has mean $E[N] = \sum_{i=1}^{\infty} i \cdot \frac{1}{i(i+1)} = \sum_{i=1}^{\infty} \frac{1}{i+1} = \infty$: an infinite mean! That shows that I was indeed very unlucky in how long I had to wait until I finally got to the front of the checkout queue.
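
A quick simulation (mine, not Feller's) makes the point; exponential waiting times stand in for any iid continuous distribution:

# Simulate N, the index of the first friend whose wait exceeds mine
set.seed(42)
first_longer <- function() {
  x0 <- rexp(1)                   # my waiting time
  i <- 0
  repeat {
    i <- i + 1
    if (rexp(1) > x0) return(i)   # friend i out-waits me
  }
}
N <- replicate(1e4, first_longer())
mean(N == 1)   # theory: 1/(1*2) = 0.500
mean(N == 2)   # theory: 1/(2*3) = 0.167
mean(N)        # sample mean is large and unstable; the true mean is infinite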

Yes, there are a few conditions: $X_0$ and the $X_i$ need to be iid continuous random variables. My friends need to be numbered (You do have friends, right? And you number your friends, don't you?). As most people don't like the tattoo option, I've had to come up with a database of friends and their numbers. Also, every friend needs to have an infinitely accurate stopwatch (available from Costco online!) and a Costco membership. You remember that quarter Costco had infinite profits? That was the year all my friends joined up.

Big Data Without Statisticians: BD2K Symposium At UCLA

UCLA is having a big data conference on Thursday and Friday, March 27-28, 2014. The conference is organized by four computer science and genomic biology types. Speakers cluster [one of the rare appropriate uses of cluster analysis I know of] into three types of folks: big biologists [they must be big, they're doing big data] doing big data, genomic stuff; computer scientists doing topic models; and a few math modelers who, as far as I know, don't usually look at data at all.

The conference [well, it's called a workshop] is idiosyncratic [unique] in that it has no statisticians involved. The four organizers hail from Radiological Sciences, Physiology, Microbiology and Computer Science. If I read and recall their backgrounds correctly, two come from Comp Sci backgrounds, two from Biology; three are working in genomics of some sort and one in imaging. It seems probable that all are doing statistics, likely very complex statistics, and that's fine; there aren't enough statisticians around, there never have been. Scientists have always had to do their own data analysis.

It is a bit hard to imagine who might be interested in the entire conference other than me and perhaps a few friends of mine. The genomicists may be interested in other genomicists but not in the math modelers. The comp sci folks may be interested in other topic modelers [a new class of perfectly good models] but will they be interested in genomics? And I haven't a clue what other talks the math modelers will want to hear.

My main interest as a statistician is perhaps to hear about the dynamic fringes of statistical application. The edges of statistical applications are where a lot of the fun stuff is going to be happening. And to defend a bit of statistics/biostatistics turf. I certainly want to hear about the topic models and from the topic modelers. I want to hear from the math modelers in case they've stumbled into some data; I happen to know a bit about how to combine data and math models, and I care about the problem too. But I can't make progress without help. I'm not so interested in the genomics; I'm not not interested in the genomics, mind you, but the problem is too much data, not enough time to investigate it.

Another interest of mine is getting to know who the players are. The big data movement is going to get political shortly on campus and you can't tell the players without a scorecard. 

I imagine eventually the big data movement will devolve into a fight over resources. There is a ton of data out there and few statisticians. People will want resources to analyze their data, and will want those resources supplied by central campus, but won't want the resources controlled by stat or biostat departments. Minnesota is currently having a nice campus-wide initiative (or fight) about big data; apparently the powers that be didn't want statistics or biostatistics involved at all.

There are lots of misconceptions about statisticians and statistics. I'm expecting to hear some, and I'll try to keep track of what I hear at the conference. Some of these are sheer prejudice: that statisticians don't care about big data or that we don't care about someone's specific type of data. This is partly a forest/tree problem: individual statisticians are busy and may not have time for someone else's data or problems. Another issue is way too much data and way too few statisticians. Statisticians code up lots of solutions to problems in canned software, but that will never handle the bleeding edge of science and statistical problems. So we all get to develop new methods and apply old methods in new ways to new sorts of data. That's why statistics is so much fun. Welcome to our world.

I'll see if I can stomach the entire conference or not. There are serious difficulties in communicating across disciplinary boundaries, and I've had my share of successes and failures, possibly more than most given that biostatistics is naturally poly-disciplinary and that I have a tendency to be apian and flit from science to science. Often people don't understand their audience's background, and they don't communicate well because of that. That applies to me, and it applies to the scientists I sometimes try to talk to or listen to. Even people who spend their careers on the boundaries can easily make mistakes: mistakes in communicating, mistakes in talking about the tools that come from outside their immediate scientific background, mistakes in assessing what an audience needs to hear to follow the research. Thus it is quite possible that most speakers will be talking to their own close colleagues, and it will be of interest to see whether some or indeed any of the speakers are prepared to speak to the broad audience this conference seems designed to attract.

Seymour Geisser and Me

A few stories from graduate school at the University of Minnesota School of Statistics where Seymour Geisser was head. I didn't have as many interactions with Seymour as some folks did, but his philosophy certainly influenced me. Perhaps I'll detail his influence on my research in another post. 

The most common Seymour story involves seminar. Each year he told the graduate students that attending seminar was obligatory, but not mandatory. Or was it mandatory but not obligatory? We interpreted this to mean we really ought to go, but if we could live with ourselves, we could skip. We were obliged to go, but not mandated to go. Me, I attended every seminar except one where I missed not one, but two buses and arrived so late that I would have greatly disturbed seminar, so I left rather than open the closed seminar room door at the front of the room. The next time I missed seminar, I was out of town on job interviews. 

Seymour sat in the front row at seminars, and would crane his head around to inspect the audience. We assumed he was checking on grad student attendance, though we didn't really know. I now do the same thing at seminars, looking around at the audience to see how/whether the graduate students are attending seminar in the numbers they are supposed to be. And yes, it's something I picked up from Seymour. 

Seymour was very enamored of algebra. In his multivariate class we were oft treated to his love of algebraic manipulation. Some proofs/calculations could get complex enough that Seymour would start recycling notation, defining a symbol $a$ to be one thing early in the calculation, then later redefining the symbol to be something different. Unfortunately, this offended my righteous sense of rightness. Towards the end of day one of a two-day proof (one day = one class lecture), I had a premonition we were to be treated to more redefining of notation. Somewhat imperiously, I requested that Seymour not reuse notation. Actually, it's possible I instructed him not to, or insisted that he not redefine notation. He looked startled (that's the closest I can get to describing his facial expression), and I can't recall his verbal response. But sure enough, at the beginning of the second day, $a$ got a new definition.

Seymour could be quite generous. I opted to work with Dennis Cook for my thesis. Originally we had agreed I would work on optimal design, specifically on optimal design in the presence of non-constant variance. I told plenty of people that that was my research area, but over a 9 month period I spent almost no time on the subject. It was clear I wasn't going to extract a dissertation out of optimal design and I began to look for another topic. Eventually Dennis and I discussed perhaps working on diagnostics, and I think I was particularly interested in Bayesian diagnostics. Well, Bayesian diagnostics was Seymour's purview and he had a grant on the subject. So I went over to meet with him. He showed me his grant, and even allowed me to make a copy to read over to see if there was something in there that was of interest to me. Grants often hold proprietary information on a professor's research program; sharing a grant isn't a given among researchers. 

I eventually settled into a dissertation on Bayesian diagnostics, though not something that derived from Seymour's grant. And eventually I had a final defense, and Seymour was on my committee. After presenting a nerve-wracking seminar in front of a large crowd, my dissertation committee and I retired to the library, where they grilled me. There was an interaction between my thesis topic and having Seymour on my committee. My thesis involved a set of measures of case influence along with a graphical influence tool. I knew how to assess influence on the posterior. But as is well known, Seymour was a predictivist; supposedly he did not even believe in parameters except as intermediaries to get to a prediction. And I did not know how to use my measures to assess predictive influence, influence on predictions. The committee members asked me questions in turn. When Seymour's turn came, he started asking me about my time in the consulting clinic. I was a bit startled, because I wasn't expecting this direction. It hadn't previously occurred to me that Seymour might know (and why would he care?) that I had been in the consulting clinic.

He asked: What are consulting clinic clients most concerned with? I responded with some cant about F-tests and p-values. He pushed: what else are they interested in? And I mentioned they sometimes want estimates and standard errors but not as often. And what else, he asked again. Design sometimes. Sample size. And again he queried. I stonewalled. We danced for half an hour. I answered truthfully, because no one coming in to the consulting clinic had ever once asked about predictions. I knew and everyone in the room knew what Seymour wanted. But because I didn't know how to extend my dissertation to predictions, I wasn't going to volunteer the idea of predictions and wasn't going to make it easy for him to ask the obvious and perfectly sensible question. And Seymour was equally stubborn and wanted me to mention predictions first. Eventually Dennis intervened with a gentle 'Seymour, we talked about this' comment. And the defense continued. 

These days I have a better understanding of what a dissertation defense is about and what the possible outcomes involve. But at the time I thought I was under real threat of flunking the final defense. 

The class of influence measures that I used in my dissertation involved a set of divergence measures between densities. Turns out someone else had investigated them previously, though not for purposes of influence diagnostics. Seymour knew of the paper (Csiszar, 1967). Once I had the reference from Seymour, it still took me a while to find the paper. Pre-internet, this counted as a pretty obscure reference, but Seymour knew it. Eventually I got my thesis published (Weiss and Cook 1992, Biometrika). And a few years after that, I figured out how to do predictive inference as well (Weiss 1996, JRSS-B). Seymour didn't get a citation in the 1992 paper, but I did send a couple citations his way in the 1996 paper. 

Short Review: How Children Succeed

How Children Succeed: Grit, Curiosity, and the Hidden Power of Character
by Paul Tough

Unless your definition of children includes college students, the second word of the title is a misnomer. This book is about how young people, from infants to college students, succeed. Perhaps a bit more subtly and more accurately, Tough's book describes how adults in various organizations, from NGOs to schools, organize teaching and mentorship to assist young people in developing the traits that may allow them to succeed in today's society. The book describes approaches to developing character in young people so that they can succeed against the odds stacked against them. It describes mentoring of young mothers to help them raise children who can succeed against a backdrop of poverty. It describes a middle school chess team from Brooklyn that succeeds in national competition against teams from much higher socio-economic-status schools. Tough describes several schools that have been built from the ground up to instill 'character' in the students, with the idea that the schools don't necessarily select a priori for kids who will be successful; rather, most of the schools are set in impoverished areas. One school was rather rich; there the problem was kids from families so rich that the kids were overprotected and never learned how to pick up the pieces after a failure. This part of the book harkens somewhat to Malcolm Gladwell's David and Goliath.

This book is something you want to read if you have a baby in the house. Spoiler alert: the answer is to hug and hold your baby. None of this letting your kid cry himself out in the crib so he learns how to comfort himself. Pick him up and hold him. My explanation: then the kid doesn't learn how to cry. They learn to comfort themselves by always being comfortable. Don't let the kid practice being out of control. Let your kid practice always being in control; that's how you raise a kid who can handle herself when she grows up.

But this book has also affected how I advise grad students. The problem for rich kids is overprotection. So rather than tell my grad students everything, it's important to let them figure things out for themselves. 

The book isn't perfect. But I'm glad it's not perfect. If it were perfect, it would be too late; everyone would know this stuff, and we wouldn't need to learn this stuff.

I highly recommend How Children Succeed to everyone. If this doesn't affect how you deal with other people, then you're not in a situation where you deal with other people ever. If you're that lone hermit, fine, don't read this. But everyone else should read this book today. 

Remembering Seymour Geisser

This is the text, minus the nice formatting, of an email from Dennis Cook (my thesis advisor and current director of the U of MN School of Statistics) and Wes Johnson (a U of MN alum, a good friend, a great colleague and a student of Seymour Geisser's) about the University of Minnesota School of Statistics' (my alma mater) activities in honoring Seymour Geisser. There is a request for donations at the end; I hope you don't mind. I'll post something about Seymour's influence on my research in another post.

Dear Colleagues,

This year marks the tenth anniversary of Seymour Geisser's death. Starting in the early 1970s, Seymour guided the then-nascent School of Statistics to one of the top statistics departments in the world, maintaining that standing throughout his tenure as director. He was a scholar in the best traditions of our discipline, a mentor to many of us, and our leader. We can still see Seymour jousting with seminar speakers to convince them that prediction should be the ultimate goal of their work. We will be taking steps over the next year to honor Seymour and his legacy.

The School's website will have a special page devoted to Seymour. We encourage you to visit that webpage and recall Seymour's impact on your professional life.

Several people have furnished accounts of Seymour's influence, which we have included on his webpage. Here are a few excerpts from those accounts:

As my thesis advisor, Seymour saw that I would make good progress without active supervision and that I was cantankerous enough to resist it, so he mostly left me alone and let his lessons seep in rather than try to pound them in. Only much later did I appreciate how much restraint and wisdom -- as well as shrewdness -- his advising style required. 
--James Hodges (PhD 1985)

Simply put, Seymour Geisser was a great scholar, educator, and human being. He was a very caring individual who made a deep impact on my personal life and my career in so many ways.
--Joseph Ibrahim (PhD 1988)

As department leader and in his classes, Prof. Geisser set an environment where we would be challenged to think broadly. Vigorous debate, being challenged to create and defend, and his care for all to grow and learn are his legacy that benefited me and so many others. 
--Dennis Jennings (PhD 1982)

Seymour combined a deep understanding of the foundations of statistics with a keen (perhaps ruthless) eye for what has meaning in practice.
--Robert McCulloch (PhD 1985)

If you have a special memory or tale of Seymour that you are willing to share, we would love to hear from you and to include it on his webpage. Please forward it to Doug Hawkins.

The Geisser lecturer for this year will be Joe Ibrahim, currently scheduled for October 30, 2014. Please stop by if you are in the neighborhood at that time.

We are making a special effort to bring the Seymour Geisser Fellowship to a sustaining level. The fellowship will be used to support a PhD student whose work best reflects Seymour's scholarship and heritage. We encourage you to consider honoring Seymour's memory by contributing to his fellowship. Contributions can be made online.

Wishing you all the best over the coming year,

Dennis
R. Dennis Cook
Director, School of Statistics
University of Minnesota


Wesley O. Johnson
Professor, Department of Statistics
University of California, Irvine
 

Short Review: The War of Art by Steven Pressfield

The War of Art: Winning the Inner Creative Battle
by Steven Pressfield

Pressfield is the author of several bestsellers. The War of Art is a 12-step self-help support group for procrastinators, a biological and psychological dissection of procrastination, and your own personal writer's sidekick in the war against procrastination, all in one short text. Pressfield calls the cause of procrastination resistance. Resistance is the voice inside your head, the one that tells you you'll never make it, you're going to fail. Resistance tells you that you NEED to watch the next episode in your TV show, NOW; that going shopping IMMEDIATELY is more important than sitting down to write your dissertation. Pressfield takes resistance apart, explains it in clear language and explains how to overcome it. The book is a quick read. Reading The War of Art won't satisfy resistance, and as soon as you finish the book, resistance is going to kick into high gear with long discussions of why it's important to, well, do whatever it is except get up and do what needs doing, powdermilk.

Resistance is that thing that makes us read tons of Andy Gelman blog posts instead of working on our next paper. Blogging could arguably also be a form of resistance. I prefer to think blog-keeping is my way of staying sane and cataloging a few of my semi-great thoughts for my future students to hear. Are you listening, future students? Hear hear! And more importantly, blogging is my way of practicing writing on a frequent enough basis to grease the mental writing wheels. 

The War of Art title harkens back to The Art of War by Sun Tzu, available for free on most fine digital reading platforms in multiple versions. I've yet to make it even partway through Sun Tzu's book, but I made it through Pressfield's book in a few bus trips in to work. 

Resistance is feudal. [I've always wanted to write that.] It holds you in fief, and demands you do anything but what is important. 

I'll keep this review short. Be done with your blog reading. NOW. STOP READING THIS BLOG! Go do something you aspire to. If something is holding you back, that something is resistance. Read The War of Art and get yourself on track. 

Short Review: Writing Tools: 50 Essential Strategies for Every Writer

This is the first of perhaps three short book reviews. 

Certain basics of writing I go over with almost every student: organization, content, paragraphs and sentences. Roy Peter Clark's Writing Tools: 50 Essential Strategies for Every Writer covers most of them. Clark is an entertaining writer and this is a highly enjoyable read. It's worth reading as literature, even if you aren't in the market for improving your writing. It is highly recommended for beginning professional writers. That includes beginning statisticians like my students and continuing statisticians like myself.

There are four major parts: Nuts and Bolts, Special Effects, Blueprints, and Useful Habits. Each part contains 10 to 16 short chapters, each presenting a different 'Tool'. As I read, my head kept nodding: up-down, yes-yes, uh-huh, yep, told so-and-so that at our last meeting, made those comments yesterday to such-and-such. Roy Peter Clark presents the tools more succinctly, colorfully and intelligently than I can. He's got more tools than I have, and his are ordered, cataloged and polished; I learned a lot!

One strength is that he presents his advice as tools, not as rules. Rules suggest hard-and-fast, take-no-prisoners strict laws. A tool is adaptable to many different situations. A tool helps; rules restrict.

Nuts and Bolts gives expert guidance on writing strong sentences. He spends less time explaining common mistakes and more on how to avoid the mistakes in the first place. This is material I spend a lot of time on with my students. Begin sentences with subjects and verbs (Tool 1); place strong words at the beginning and at the end (Tool 2); strong verbs create action, save words (Tool 3). Those are the title, subtitle, and part of the subtitle for the first three tools. Clark crystallizes rules I didn't realize I knew, and adds to the tools I have for writing and for advising students. Clark explains when to use passive voice -- virtually every student starts out writing too much passive voice. Some students were taught to use the passive in scientific writing (didn't that advice die decades ago?). Students write in passive voice when they are unsure of what they write; they try to distance themselves from what they have written.

Fear not the long sentence (Tool 7) is advice I usually can't use, but set the pace with sentence length (Tool 18), prefer the simple over the technical (Tool 11), and in short works, don't waste a syllable (Tool 37) are very appropriate for scientific writing. Prefer the simple over the technical has the subhead: use shorter words, sentences and paragraphs at points of complexity. This is great general advice, as well as great specific advice for any given sentence, as technical writing is almost always quite complex. Get the name of the dog (Tool 14) is about supplying informative details -- something that virtually no student understands until taught. That model you just presented: tell us what it does, how it works (data in, inferences out, but how?), and why it is needed (what's so great about it?)!

Other tools would never have occurred to me, but are quite valuable. Save string (Tool 44) talks about saving up little ideas, thoughts and data until you have enough for a paper. I've been doing that, sometimes for decades, but didn't have a name for the behavior or a way even to think about it. Some tools I should engage in but haven't: recruit your own support group (Tool 47) I should do more of, while turn procrastination into rehearsal (Tool 41) is my excuse for every delay. Some tools are better for fiction and newspaper writing, but they're fun to read and think about and may be utile even in scientific writing. Use dialogue as a form of action (Tool 26) talks about how the eye is drawn to short sentences with lots of white space -- advice I promptly used in advising someone preparing presentation slides.

Tune your voice (Tool 23) I took to heart as advice about advising my students: let my students find their own voice. Similarly, limit self-criticism in early drafts (Tool 48) is vital for getting the meat of a project on paper before tightening up language and organizing the content. Too much criticism too early and the creative brain shuts down -- which segues into my next review.

I strongly recommend Roy Peter Clark's Writing Tools: 50 Essential Strategies for Every Writer to every graduate student and to any professor who wishes to improve their writing.
