R: dplyr - Update rows with earlier/previous rows values
Recently I had a data frame which contained a column which had mostly empty values:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA))
col1 col2
1 1 a
2 2 <NA>
3 3 <NA>
4 4 b
5 5 <NA>
I wanted to fill in the NA values with the last non NA value from that column. So I want the data frame to look like this:
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
I spent ages searching around before I came across the na.locf function in the zoo library which does the job:
library(zoo)
library(dplyr)
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA)) %>%
do(na.locf(.))
col1 col2
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
This will fill in the missing values for every column, so if we had a third column with missing values it would populate those too:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>%
do(na.locf(.))
col1 col2 col3
1 1 a A
2 2 a A
3 3 a B
4 4 b B
5 5 b B
If we only want to populate 'col2' and leave 'col3' as it is we can apply the function specifically to that column:
> data.frame(col1 = c(1,2,3,4,5), col2 = c("a", NA, NA , "b", NA), col3 = c("A", NA, "B", NA, NA)) %>%
mutate(col2 = na.locf(col2))
col1 col2 col3
1 1 a A
2 2 a <NA>
3 3 a B
4 4 b <NA>
5 5 b <NA>
It’s quite a neat function and certainly comes in helpful when cleaning up data sets which don’t tend to be as uniform as you’d hope!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.