· r-2

R: Conditionally updating rows of a data frame

In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:


library(zoo)
library(dplyr)

monthNumber = function(cohort, date) {
  cohortAsDate = as.yearmon(cohort)
  dateAsDate = as.yearmon(date)
    
  if(cohortAsDate > dateAsDate) {
    "NA"
  } else {
    paste(round((dateAsDate - cohortAsDate) * 12), sep="")
  }
}

cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber)

If we time this function using system.time we'll see that it's not very snappy:


system.time(cohortAttendance %>% 
  group_by(row_number()) %>% 
  mutate(monthNumber = monthNumber(cohort, date)) %>%
  filter(monthNumber != "NA") %>%
  filter(monthNumber != "0") %>% 
  mutate(monthNumber = as.numeric(monthNumber)) %>% 
  arrange(monthNumber))

   user  system elapsed 
  1.968   0.019   2.016 

The reason for the poor performance is that we process each row of the data table individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse which can process multiple rows at a time:


system.time(
cohortAttendance %>% 
  mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date), 
                              paste((round(as.yearmon(date) - as.yearmon(cohort))*12), sep=""), 
                              NA)))
   user  system elapsed 
  0.026   0.000   0.026 

Antonios suggested another approach which involves first setting every row to 'NA' and then selectively updating the appropriate rows. I ended up with the following code:


cohortAttendance$monthNumber = NA

cohortAttendance$monthNumber[as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date)] = paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")

Let's measure that:


system.time(paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep=""))
   user  system elapsed 
  0.013   0.000   0.013 

Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.

Note to future Mark: try to avoid grouping by row number - there's usually a better and faster solution!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket