· r-2

R: I write more in the last week of the month, or do I?

I've been writing on this blog for almost 7 years and have always believed that I write more frequently towards the end of a month. Now that I've got all the data I thought it'd be interesting to test that belief.

I started with a data frame containing each post and its publication date and added an extra column which works out how many weeks from the end of the month that post was written:


> df %>% sample_n(5)
                                                               title                date
946  Python: Equivalent to flatMap for flattening an array of arrays 2015-03-23 00:45:00
175                                         Ruby: Hash default value 2010-10-16 14:02:37
375               Java/Scala: Runtime.exec hanging/in 'pipe_w' state 2011-11-20 20:20:08
1319                            Coding Dojo #18: Groovy Bowling Game 2009-06-26 08:15:23
381                   Continuous Delivery: Removing manual scenarios 2011-12-05 23:13:34

calculate_start_of_week = function(week, year) {
  date <- ymd(paste(year, 1, 1, sep="-"))
  week(date) = week
  return(date)
}

tidy_df  = df %>% 
  mutate(year = year(date), 
         week = week(date),
         week_in_month = ceiling(day(date) / 7),
         max_week = max(week_in_month), 
         weeks_from_end = max_week - week_in_month,
         start_of_week = calculate_start_of_week(week, year))

> tidy_df %>% select(date, weeks_from_end, start_of_week) %>% sample_n(5)

                    date weeks_from_end start_of_week
1023 2008-08-08 21:16:02              3    2008-08-05
800  2014-01-31 06:51:06              0    2014-01-29
859  2014-08-14 10:24:52              3    2014-08-13
107  2010-07-10 22:49:52              3    2010-07-09
386  2011-12-20 23:57:51              2    2011-12-17

Next I want to get a count of how many posts were published in a given week. The following code does that transformation for us:


weeks_from_end_counts =  tidy_df %>%
  group_by(start_of_week, weeks_from_end) %>%
  summarise(count = n())

> weeks_from_end_counts
Source: local data frame [540 x 4]
Groups: start_of_week, weeks_from_end

   start_of_week weeks_from_end year count
1     2006-08-27              0 2006     1
2     2006-08-27              4 2006     3
3     2006-09-03              4 2006     1
4     2008-02-05              3 2008     2
5     2008-02-12              3 2008     2
6     2008-07-15              2 2008     1
7     2008-07-22              1 2008     1
8     2008-08-05              3 2008     8
9     2008-08-12              2 2008     5
10    2008-08-12              3 2008     9
..           ...            ...  ...   ...

We group by both 'start_of_week' and 'weeks_from_end' because we could have posts published in the same week but different month and we want to capture that difference. Now we can run a correlation on the data frame to see if there's any relationship between 'count' and 'weeks_from_end':


> cor(weeks_from_end_counts %>% ungroup() %>% select(weeks_from_end, count))
               weeks_from_end       count
weeks_from_end     1.00000000 -0.08253569
count             -0.08253569  1.00000000

This suggests there's a slight negative correlation between the two variables i.e. 'count' decreases as 'weeks_from_end' increases. Let's plug the data frame into a linear model to see how good 'weeks_from_end' is as a predictor of 'count':


> fit = lm(count ~ weeks_from_end, weeks_from_end_counts)

> summary(fit)

Call:
lm(formula = count ~ weeks_from_end, data = weeks_from_end_counts)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0000 -1.5758 -0.5758  1.1060  8.0000 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3.00000    0.13764  21.795   <2e-16 ***
weeks_from_end -0.10605    0.05521  -1.921   0.0553 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.698 on 538 degrees of freedom
Multiple R-squared:  0.006812,	Adjusted R-squared:  0.004966 
F-statistic:  3.69 on 1 and 538 DF,  p-value: 0.05527

We see a similar result here. The effect of 'weeks_from_end' is worth 0.1 posts per week with a p value of 0.0553 so it's on the border line of being significant.

We also have a very low 'R squared' value which suggests the 'weeks_from_end' isn't explaining much of the variation in the data which makes sense given that we didn't see much of a correlation.

If we charged on and wanted to predict the number of posts likely to be published in a given week we could run the predict function like this:


> predict(fit, data.frame(weeks_from_end=c(1,2,3,4,5)))
       1        2        3        4        5 
2.893952 2.787905 2.681859 2.575812 2.469766

Obviously it's a bit flawed since we could plug in any numeric value we want, even ones that don't make any sense, and it'd still come back with a prediction:


> predict(fit, data.frame(weeks_from_end=c(30 ,-10)))
        1         2 
-0.181394  4.060462 

I think we'd probably protect against that with a function wrapping our call to predict that doesn't allow 'weeks_from_end' to be greater than 5 or less than 0.

So far it looks like my belief is incorrect! I'm a bit dubious about my calculation of 'weeks_from_end' though - it's not completely capturing what I want since in some months the last week only contains a couple of days.

Next I'm going to explore whether it makes any difference if I calculate that value by counting the number of days back from the last day of the month rather than using week number.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket