R: Filter a data frame based on values in two columns
In the most recent assignment of the Computing for Data Analysis course we had to filter a data frame which contained N/A values in two columns to only return rows which had no N/A's.
I started with a data frame that looked like this:
> data <- read.csv("specdata/002.csv")
> # we'll just use a few rows to make it easier to see what's going on
> data[2494:2500,]
Date sulfate nitrate ID
2494 2007-10-30 3.25 0.902 2
2495 2007-10-31 NA NA 2
2496 2007-11-01 NA NA 2
2497 2007-11-02 6.56 1.270 2
2498 2007-11-03 NA NA 2
2499 2007-11-04 NA NA 2
2500 2007-11-05 6.10 0.772 2
We want to only return the rows which have a value in both the 'sulfate' and the 'nitrate' columns.
I initially tried to use the Filter function but wasn't very successful:
> smallData <- data[2494:2500,]
> Filter(function(x) !is.na(x$sulfate), smallData)
Error in x$sulfate : $ operator is invalid for atomic vectors
I'm not sure that Filter is designed to filter data frames - it seems more appropriate for lists or vectors - so I ended up filtering the data frame using what I think is called an extract operation:
> smallData[!is.na(smallData$sulfate) & !is.na(smallData$nitrate),]
Date sulfate nitrate ID
2494 2007-10-30 3.25 0.902 2
2497 2007-11-02 6.56 1.270 2
2500 2007-11-05 6.10 0.772 2
The code inside the square brackets returns a collection indicating whether or not we should return each row:
> !is.na(smallData$sulfate) & !is.na(smallData$nitrate)
[1] TRUE FALSE FALSE TRUE FALSE FALSE TRUE
which is equivalent to doing this:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),]
Date sulfate nitrate ID
2494 2007-10-30 3.25 0.902 2
2497 2007-11-02 6.56 1.270 2
2500 2007-11-05 6.10 0.772 2
We put a comma after the list of true/false values to indicate that we want to return all the columns otherwise we'd get this error:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)]
Error in `[.data.frame`(smallData, c(TRUE, FALSE, FALSE, TRUE, FALSE, :
undefined columns selected
We could filter the columns as well if we wanted to:
> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), c(1,2)]
Date sulfate
2494 2007-10-30 3.25
2497 2007-11-02 6.56
2500 2007-11-05 6.10
As is no doubt obvious, I don't know much R so if there's a better way to do anything please let me know.
The full code is on github if you're interested in seeing it in context.
About the author
Mark Needham is a Developer Relations Engineer for Neo4j, the world's leading graph database.