R: Conditionally updating rows of a data frame
In a blog post I wrote a couple of days ago about cohort analysis I had to assign a monthNumber to each row in a data frame and started out with the following code:
library(zoo)
library(dplyr)
monthNumber = function(cohort, date) {
cohortAsDate = as.yearmon(cohort)
dateAsDate = as.yearmon(date)
if(cohortAsDate > dateAsDate) {
"NA"
} else {
paste(round((dateAsDate - cohortAsDate) * 12), sep="")
}
}
cohortAttendance %>%
group_by(row_number()) %>%
mutate(monthNumber = monthNumber(cohort, date)) %>%
filter(monthNumber != "NA") %>%
filter(monthNumber != "0") %>%
mutate(monthNumber = as.numeric(monthNumber)) %>%
arrange(monthNumber)
If we time this function using system.time we’ll see that it’s not very snappy:
system.time(cohortAttendance %>%
group_by(row_number()) %>%
mutate(monthNumber = monthNumber(cohort, date)) %>%
filter(monthNumber != "NA") %>%
filter(monthNumber != "0") %>%
mutate(monthNumber = as.numeric(monthNumber)) %>%
arrange(monthNumber))
user system elapsed
1.968 0.019 2.016
The reason for the poor performance is that we process each row of the data table individually due to the call to group_by on the second line. One way we can refactor the code is to use the ifelse which can process multiple rows at a time:
system.time(
cohortAttendance %>%
mutate(monthNumber = ifelse(as.yearmon(cohort) > as.yearmon(date),
paste((round(as.yearmon(date) - as.yearmon(cohort))*12), sep=""),
NA)))
user system elapsed
0.026 0.000 0.026
Antonios suggested another approach which involves first setting every row to 'NA' and then selectively updating the appropriate rows. I ended up with the following code:
cohortAttendance$monthNumber = NA
cohortAttendance$monthNumber[as.yearmon(cohortAttendance$cohort) > as.yearmon(cohortAttendance$date)] = paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep="")
Let’s measure that:
system.time(paste((round(as.yearmon(cohortAttendance$date) - as.yearmon(cohortAttendance$cohort))*12), sep=""))
user system elapsed
0.013 0.000 0.013
Both approaches are much quicker than my original version although this one seems to be marginally quicker than the ifelse approach.
Note to future Mark: try to avoid grouping by row number - there’s usually a better and faster solution!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.