Happy New Year, It's Too Late

A couple of people wished me happy new year yesterday, Jan 16th.  But, you realize, the year is already 1/24th over?  From R, rounding by W:

> 16/365
[1] 0.0438
> 1/24
[1] 0.0417
> 15/365
[1] 0.0411

Somewhere between the 15th and the 16th we crossed the divide from less than 1/24th to more than 1/24th over. Today being the 17th, your year is 4.66% over.  Though you can wait until midnight to celebrate that particular milestone.  For your planning purposes, when January ends,

> 31/365
[1] 0.08493151
> 1/12
[1] 0.08333333
> 30/365
[1] 0.08219178

we will have passed the 1/12th point of the year.  
Whatever happened to two thousand and twelve?    
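
For planning other milestones, here is a minimal Python sketch of the same arithmetic. The function name `first_day_past` is made up for illustration, and it assumes a 365-day year as in the R output above.

```python
from fractions import Fraction
import math

def first_day_past(fraction, days_in_year=365):
    """Return the smallest day count d with d / days_in_year > fraction."""
    # floor(fraction * days) is the last day at or below the milestone,
    # so the next day is the first one strictly past it.
    return math.floor(fraction * days_in_year) + 1

print(first_day_past(Fraction(1, 24)))  # 16, i.e. Jan 16th
print(first_day_past(Fraction(1, 12)))  # 31, i.e. Jan 31st
```

Exact fractions avoid any floating point surprises when a milestone lands exactly on a day boundary.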

 


Clarity and Kindness

I'm editing a generally well-written, near-final draft of a biostatistics paper. Worth broadcasting are several writing problems that occur in almost all grad student writing. 

  • Don't denigrate your contributions. 
    • Original: A simple way to achieve this ... 
    • Edit: A way to achieve this ...
    • Comment: Be respectful of your contributions. Are you so close to your own solution you can't see how important it is? Perhaps you've forgotten how innovative your solution was, given how long you've been living with it. Modesty, either real or false, is not rewarded in academia. Besides, if you're really a scientist (and you are if you're a statistician), honesty is an important characteristic. Being honest about the importance of your work may not be easy, but it is important. Your work may be mathematically simple, but if you describe your idea as simple, readers will assume you meant that the idea is trivial.
  • Don't represent.
    • Original:  ... [statement of key idea] because this represents [key idea alternative] ...
    • Edit: ... [key idea] because this is [key idea alternative] ...
    • Comment: Represents is wishy-washy and could imply any of a number of relationships. Be firm. If A and B are the same thing, say A is B, not A represents B. 
  • Use the same language every time.
    • First Original: dispersion around $x$
    • Second Original: dispersion
    • Edit both times: dispersion around $x$ 
    • Comment: Apparently in the original text, there can be more than one dispersion.  Describing the dispersion as "around $x$" implies there are or could be other kinds of dispersions not around $x$. Thus the need to keep the modifier in repeated usage.  
  • Plot, don't show.
    • Original: Figure 2 shows ...
    • Edit: Figure 2 plots ...
    • Comment: Or: Figure 2 is ... . Figure 2 doesn't show anything if you're not well enough educated to understand the plot in the first place. Figure 2 contains the plot, but Figure 2 doesn't show anything. 

 

Why shouldn't I dichotomize my outcome variable?

My collaborators periodically want to dichotomize a continuous outcome such as a depression scale into a binary depressed/not depressed variable. Another popular one: Body Mass Index (BMI) classified into obese/not obese. Every time this arises, I get to discourage them from dichotomizing and I get to explain why. Sometimes dichotomization is called "analyzing caseness". Here are some of the reasons why we shouldn't directly analyze caseness. 

  1. Dichotomization is a bad idea because it throws away useful information. Dichotomizing variables takes a nice plump juicy continuous variable with lots of information and turns it into a scrawny binary 0-1 single-bit datum with very little information. 
    1. This is like having your computer person take your hard drives with all that hard won data on them, and having her throw out 3 out of every 4 hard drives. (She wouldn't do that really. So why would you?) 
  2. Caseness treats people who are similar, but on different sides of the cut point as very different. Consider IQ score, a continuous variable with (say) mean 100, sd of 15.  Cutting the data at Y=100 divides people into above or below average. 
    1. A person who has an IQ of 99 is treated as hugely different from a person with IQ of 101. 
    2. Conversely, a person with IQ 115 is treated as different from a person with IQ 85, but the difference between those two people is treated the same as the difference between the IQ 99 and IQ 101 pair. 
    3. Two people, one who has IQ 102 and one who has IQ 201, are treated identically. Hmmmm.
  3. On average, dichotomizing a continuous variable will lead to larger standard errors, smaller effect sizes, less power, and missed effects. 
  4. Solution!  Do not throw the baby out with the bathwater. [Actually bathwater is a solution too.]
    1. We can analyze the continuous variable and after we are done with the analysis, if it is of interest, we can convert to caseness at the end of the analysis, and draw conclusions about the probability (or odds) of caseness in the treatment group versus control group, or among men versus among women. 
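
The loss of power in item 3 can be seen in a small simulation. This is a minimal Python sketch, not an analysis from any real study: the sample size, effect size, cut point, number of replications, and the use of large-sample z-tests are all assumptions picked for illustration, and the function names are made up.

```python
import math
import random
from statistics import NormalDist

def mean_and_var(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v

def one_trial(rng, n, effect, cut):
    """Simulate one two-group study; return (continuous z, dichotomized z)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n)]
    # Continuous analysis: compare the two group means.
    m_c, v_c = mean_and_var(control)
    m_t, v_t = mean_and_var(treated)
    z_cont = (m_t - m_c) / math.sqrt(v_c / n + v_t / n)
    # Dichotomized analysis: compare the proportions above the cut point.
    p_c = sum(y > cut for y in control) / n
    p_t = sum(y > cut for y in treated) / n
    p_pool = (p_c + p_t) / 2
    se_prop = math.sqrt(2 * p_pool * (1 - p_pool) / n)
    z_dich = (p_t - p_c) / se_prop if se_prop > 0 else 0.0
    return z_cont, z_dich

def estimate_power(n=50, effect=0.5, cut=0.25, reps=1000, seed=2013):
    """Fraction of simulated studies in which each analysis rejects at the 5% level."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(0.975)
    hits_cont = hits_dich = 0
    for _ in range(reps):
        z_cont, z_dich = one_trial(rng, n, effect, cut)
        hits_cont += abs(z_cont) > crit
        hits_dich += abs(z_dich) > crit
    return hits_cont / reps, hits_dich / reps

power_cont, power_dich = estimate_power()
print(f"power, continuous outcome:   {power_cont:.2f}")
print(f"power, dichotomized outcome: {power_dich:.2f}")
```

Both analyses see exactly the same simulated people; the only difference is whether the outcome is cut at the cut point before testing.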

Given a point estimate $\hat{\mu}_{tmt}$ for the mean of the treatment group, and given an estimated population standard deviation $\hat{s}$, we can calculate the probability of caseness in the treatment group as $\Phi( (\hat{\mu}_{tmt} - c) / \hat{s} )$, where $c$ is the cut point (scores above $c$ count as cases, and the outcome is taken to be normally distributed) and $\Phi(z)$ is the cumulative distribution function of the standard normal. We can do the same computation in the control group. To report on significance or not, I'd use the test that compares $\mu_{tmt}$ to $\mu_{cntl}$, though that isn't quite the same thing. To get a standard error of the difference in probabilities, I'd probably run a simple simulation that incorporated the uncertainty in $\hat{\mu}_{tmt}$, $\hat{\mu}_{cntl}$ and $\hat{s}$ and also included any covariance between the tmt and cntl mean estimates. Actually, much easier to run a Bayes analysis and use the McMC (Marked-up chain Monet Carla) output to estimate the uncertainty in the differences in probabilities of caseness. 
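
The caseness calculation above might look like the following in Python. Treat it as a rough sketch under stated assumptions: the summary statistics are hypothetical numbers invented for illustration, the two mean estimates are taken to be independent (no covariance term), the pooled standard deviation gets the usual scaled chi-square draw, and the function names are mine.

```python
import math
import random
from statistics import NormalDist, stdev

def caseness_prob(mu_hat, s_hat, c):
    """P(Y > c) for Y ~ Normal(mu_hat, s_hat^2), i.e. Phi((mu_hat - c) / s_hat)."""
    return NormalDist().cdf((mu_hat - c) / s_hat)

# Hypothetical summary statistics (illustrative only, not from any study).
mu_tmt, mu_cntl = 14.0, 18.0   # estimated group means on some scale
s_hat = 8.0                    # estimated population standard deviation
c = 16.0                       # cut point: a score above c counts as a case
n_tmt = n_cntl = 60            # group sizes

p_tmt = caseness_prob(mu_tmt, s_hat, c)
p_cntl = caseness_prob(mu_cntl, s_hat, c)
diff = p_tmt - p_cntl

def simulate_diff_se(n_draws=5000, seed=1):
    """Simulation-based standard error of the difference in caseness probabilities."""
    rng = random.Random(seed)
    df = n_tmt + n_cntl - 2
    draws = []
    for _ in range(n_draws):
        # Plausible sd: s_hat * sqrt(df / chi-square(df)); chi-square(df) = Gamma(df/2, scale 2).
        chi2 = rng.gammavariate(df / 2, 2)
        s = s_hat * math.sqrt(df / chi2)
        # Plausible means around the estimates, drawn independently for each group.
        m_t = rng.gauss(mu_tmt, s / math.sqrt(n_tmt))
        m_c = rng.gauss(mu_cntl, s / math.sqrt(n_cntl))
        draws.append(caseness_prob(m_t, s, c) - caseness_prob(m_c, s, c))
    return stdev(draws)

se_diff = simulate_diff_se()
print(f"P(case | tmt)  = {p_tmt:.3f}")
print(f"P(case | cntl) = {p_cntl:.3f}")
print(f"difference     = {diff:.3f} (simulated SE {se_diff:.3f})")
```

If the model includes covariates, the two mean estimates will usually be correlated, and the independent draws for the means would need to be replaced by a joint draw.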

I've co-authored an editorial in Medical Decision Making on the subject of dichotomization. At a minimum, Medical Decision Making will make you jump through extra hoops if you discretize continuous variables before you can publish. Remember: Don't drink and dichotomize! Though if you must dichotomize, choosy mothers choose DON'T.   

Reference
Dawson, Neal V. and Weiss, Robert (2012). Dichotomizing Continuous Variables in Statistical Analysis: A Practice to Avoid. Medical Decision Making 32, 225--226. DOI: 10.1177/0272989X12437605

Guest Post: The Importance of Keeping Your CV/Resume Current

Guest post by Robin Jeffries, copied from the niece* blog NorCalBiostat.

My graduate advisor was adamant about me keeping my CV current. Every little consulting project, every award, presentation or co-authorship on a paper had to be on there. When I would share my joy at getting an award, or at having a conference presentation or poster accepted, his immediate first statement was “Is it on your CV yet?” Well, perhaps after a congratulations.

It’s such a simple thing to do but also a simple thing to keep putting off and then forget. Over the past few years I’ve gotten better at adding things on almost immediately, and it has paid off so many times. Right now I’m very casually looking at what my next career step will be. When I find something that I just can’t pass up I am always thankful that only minor changes and updates to my CV need to be made. Applying for jobs can be stressful enough. Keeping your CV up to date makes it one less thing to worry about. Save your energy for your cover letter.

And don’t be afraid to change the style on your resume now and again. Yes it can be a lot of work, but tastes change and what you thought was an amazing font may not look so good a few months later.

Same concept applies to blogs, but that will take me much longer to become a habit.

I concur.  

* Robin was my doctoral student. This is my blog. NorCalBiostat is her blog. The doctoral student of my doctoral student is my grand-student. Andy Gelman regularly refers to his blog's sister blog. Therefore she is my blog's sister, and her blog is my blog's niece blog. Does Ancestry.com have any documentation on this?

You Are What You Write

To my wonderful students: These paragraphs are a revision of advice recently given to a student writer. 

Writing is a craft we all must master. And we all will. You are young: enthusiasm and energy come through in your writing: keep that and add to it. Become a better writer by growing as a writer and as a person. It took me time to become a decent writer. I hope you become better than me, and in less time. Every paper, report, memo, email, text is opportunity: take it and become a better writer.

The best writing instruction ever: Good writing is bad writing rewritten. I got it from Stephen King. Where Stephen King got it, I don't know. I'm giving it to you. You tell your students.

Write your first draft. Now go back, edit and revise. Rinse and repeat. 

Academics must write with accuracy. Compulsively read and re-read each sentence you write and game the sentence: Pretend to be an intelligent reader, but ignorant of the background. Search for multiple interpretations. Many first drafts have multiple meanings. Then put your re-write hat on and fix the sentence. One sentence at a time. Ditto: Does this sentence say what you want it to say? Ditto: Does this sentence belong here? Or elsewhere in this paper? Or in a different paper altogether? 

Density. Good writing has high information content per word. Find the words that don't convey meaning and delete them. You can eliminate 10% of the words from a first draft. Now go back and eliminate another 10%. Have you said something before? Delete it.

Writing is work. Successful writing: a pleasure. 


As We Said Before in Other Words: Grad Student Writing Hints

Redundancy, duplication and reiteration are not desirable in technical writing. Assume your reader remembers everything previously written! 

  1. You write "As mentioned before", "As previously stated", "From section X.X we know that" in a biostat paper or thesis proposal. 
    1. Continue writing. 
      1. It is okay to continue writing, because it is okay for you to write incorrectly or sub-optimally. 
      2. We all write sub-optimally.
      3. Get the words out, finish the draft. 
    2. It is permitted that your draft may be flawed. 
      1. The words on the page are just a draft and are not important.
      2. The words on the page are not final yet. 
      3. Therefore it is okay to change the words. 
    3. It is not okay that the final version has major flaws.
      1. Bad writing will irritate your reader.
      2. Bad writing slows the reader. 
      3. Bad writing wastes the time of everyone who has to read your paper. 
      4. Fewer people will read your paper. You will have less impact. 
      5. If you are submitting a research paper to a journal, there will be serious consequences.
        1. At best, the referees will be irritated by the poor writing and make you revise it extensively. They may suggest rejecting the paper merely because of the writing. 
        2. At worst, the editor will reject the paper without even allowing referees to put their two cents in.
      6. Bad writing will be taken as a sign that
        1. You don't care about what you are writing;
        2. That you think the topic isn't important; or
        3. That you are a poor thinker.
      7. None of the previous items is true:
        1. You wouldn't be writing this if you didn't care,
        2. You wouldn't work on this topic if you didn't think it was important.
        3. You can't be that poor of a thinker; you got into graduate school, for goodness' sake!
    4. Back to the business at hand. The current text is flawed. Make it better. 
    5. The text says the same thing in at least two places.
      1. The text is allowed to make a statement one time. The text is allowed to make a statement one time.
      2. If a statement occurs twice, there's a good chance it occurs three times. Or more. Check. 
      3. Which copy to keep? First (contact)? Second (base)? Last (supper)?
        1. Is the statement needed at the first occurrence, or can it wait until the second? The best place to make a statement is the place where its consequences are utilized immediately which will leave hooks in the reader's memory to make recall easier. 
        2. If you leave the statement in place at the first occurrence, state it in a way the reader will recall at each spot without a restatement.
      4. Remove all but one copy of the statement.
      5. Now your task is to make sure the reader will understand the text with the removal. 
        1. Repair the text in each place where you have removed duplicate text.
        2. Smooth and shorten. 
        3. Likely you will need to rearrange text, lots of text.
        4. Can the two or three locations be combined?
  2. Often there are additional duplicates: text, phrases, definitions, statements.
    1. Scan through quickly for duplicates. 
    2. Identify an important phrase that you use several times. Use global search to identify every usage. At each use, does the text say the same (or similar) thing about or with it? 
    3. Go fix. 
    4. Reduce. Do not re-use. Do not re-cycle! 
  3. A related writing problem is identified by the phrase "in other words". "In other words" signifies that what we intended to write is important, but we're pretty sure the reader won't get it from the current set of words.
    1. There is fault here; the fault is in the initial phrasing.
    2. Thus we restate the point in other words, usually immediately in the next sentence.
    3. Solution: Typically the restatement is easier to follow so keep it and delete the first flawed version.
    4. Smooth your text.
    5. Have you omitted important information from the first statement? Is it really important? If yes, add it back. Tersely. But not the whole thing. 
  4. Take a break from writing if needed. 
    1. It's hard to edit immediately after writing until you've garnered some experience switching modes. 
    2. After 10 minutes of break, go back and edit to remove duplicates. 
      1. Until you are an experienced editor, going back and editing can be painful. But editing is as important as the initial writing. Even if you're not ready, go back and edit. 
      2. Get to it. 
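
The global search in the steps above can be roughed out in a few lines of Python. This is only a sketch: the marker list is an illustrative starting point, not a complete inventory, and the function names `flag_markers` and `count_phrase` are made up.

```python
import re

# Phrases that tend to announce a repeated statement (illustrative list only).
MARKERS = ["as mentioned before", "as previously stated", "in other words"]

def flag_markers(text, markers=MARKERS):
    """Return (line_number, marker) pairs for each flagged phrase in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        lowered = line.lower()
        for marker in markers:
            if marker in lowered:
                hits.append((lineno, marker))
    return hits

def count_phrase(text, phrase):
    """Count case-insensitive occurrences of an important phrase."""
    return len(re.findall(re.escape(phrase), text, flags=re.IGNORECASE))

draft = (
    "We define dispersion around x.\n"
    "As mentioned before, dispersion matters.\n"
    "In other words, Dispersion is key."
)
print(flag_markers(draft))
print(count_phrase(draft, "dispersion"))
```

The script only finds candidates. Deciding which copy to keep, and repairing the text around the removal, is still your job.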

Enjoy!

What is This Blog About?

Statistics, Statistical Analysis, Quantitative Thinking, Statistical Humor, Biostatistics, Teaching, Writing, Publishing, Statistical Analysis, Jokes, Numbers, Numeracy, Whatever I'm Interested In (WIII), Science, Public Health, Medicine, Research, Business of Science, Advice to Students, How to do all of that, How not to do all of that.  

Am applying to a stat blog aggregation site. One of the questions there is a blog description, and another is 'additional information'. As my blog is single, perhaps I could give them my blog's phone number? Looking for other blogs to hang out with.  


2013 is gone and 2014 is here. 2015 and 2016 await. Prime factorization version

2013 and 2014 as integers to factor are fairly interesting. Each has three prime factors with no repeats. Each has one one-digit prime factor, one two-digit prime factor under 20 and one two-digit prime factor over 30. It's a miracle!

2013 = 3 * 11 * 61

2014 = 2 * 19 * 53

and no factors in common. No repeated factors. 2015 follows completely in the footsteps of 2013 and 2014

2015 = 5 * 13 * 31

No repeated factors and no prime factors in common with 2013 or 2014!

It's not until we get to 2016 that the pattern is not just broken, but completely destroyed: 2 repeated factors and no factor over 7.

2016 = 2^5 * 3^2 * 7

At least the powers 5 and 2 are also prime. And there are exactly 3 distinct prime factors, and 3 is prime.  
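
If you would rather not factor by hand, trial division is plenty for four-digit years. A minimal Python sketch follows; the function names `prime_factors` and `has_pattern` are made up for illustration, and `has_pattern` encodes just the one-digit, under-20, over-30 pattern described above.

```python
def prime_factors(n):
    """Return the prime factorization of n as a dict {prime: multiplicity}."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        # Whatever is left after dividing out small primes is itself prime.
        factors[n] = factors.get(n, 0) + 1
    return factors

def has_pattern(year):
    """Three distinct primes, no repeats: one under 10, one between 10 and 20, one over 30."""
    factors = prime_factors(year)
    if len(factors) != 3 or any(m != 1 for m in factors.values()):
        return False
    small, middle, large = sorted(factors)
    return small < 10 and 10 < middle < 20 and large > 30

for year in (2013, 2014, 2015, 2016):
    print(year, prime_factors(year), has_pattern(year))
```

The same two functions are enough to attack the quiz questions with a loop over years.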

Today's quiz.

  • What was the last prime year?
  • What was the most recent year before 2013 that had the same properties of factorization as 2013, 2014 and 2015?  As this is ambiguous, try these versions of this question
    • Three prime factors, one under 10, one between 10 and 20, one over 30.
    • That, plus no prime factors in common with 2013, 2014 or 2015?
    • What is the next year that satisfies both these properties?
  • What was the last year with exactly two factors?  (Note: three distinct correct answers, give both.)

Remember: there are 3 kinds of statisticians, those who can count, and those who can't.  

Today's relatively hard bonus quiz.  All questions refer to years AD prior to 2014.

  • How many years with a prime number of prime factors where each prime factor's multiplicity is prime?
  • How many with one prime factor?
    • Two distinct prime factors?
    • Three distinct prime factors?
  • Repeat that last question, where each factor has multiplicity one.  

Meta bonus quiz question: What assumptions am I making that distinguish the quiz from the relatively hard bonus quiz?

Happy new year!
