· r-2

R: Filter a data frame based on values in two columns

In the most recent assignment of the Computing for Data Analysis course we had to filter a data frame which contained N/A values in two columns to only return rows which had no N/A's.

I started with a data frame that looked like this:


> data <- read.csv("specdata/002.csv") 
> # we'll just use a few rows to make it easier to see what's going on
> data[2494:2500,]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2495 2007-10-31      NA      NA  2
2496 2007-11-01      NA      NA  2
2497 2007-11-02    6.56   1.270  2
2498 2007-11-03      NA      NA  2
2499 2007-11-04      NA      NA  2
2500 2007-11-05    6.10   0.772  2

We want to only return the rows which have a value in both the 'sulfate' and the 'nitrate' columns.

I initially tried to use the Filter function but wasn't very successful:


> smallData <- data[2494:2500,]
> Filter(function(x) !is.na(x$sulfate), smallData)
Error in x$sulfate : $ operator is invalid for atomic vectors

I'm not sure that Filter is designed to filter data frames - it seems more appropriate for lists or vectors - so I ended up filtering the data frame using what I think is called an extract operation:


> smallData[!is.na(smallData$sulfate) & !is.na(smallData$nitrate),]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2497 2007-11-02    6.56   1.270  2
2500 2007-11-05    6.10   0.772  2

The code inside the square brackets returns a collection indicating whether or not we should return each row:


> !is.na(smallData$sulfate) & !is.na(smallData$nitrate)
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE

which is equivalent to doing this:


> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE),]
           Date sulfate nitrate ID
2494 2007-10-30    3.25   0.902  2
2497 2007-11-02    6.56   1.270  2
2500 2007-11-05    6.10   0.772  2

We put a comma after the list of true/false values to indicate that we want to return all the columns otherwise we'd get this error:


> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE)]
Error in `[.data.frame`(smallData, c(TRUE, FALSE, FALSE, TRUE, FALSE,  : 
  undefined columns selected

We could filter the columns as well if we wanted to:


> smallData[c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE), c(1,2)]
           Date sulfate
2494 2007-10-30    3.25
2497 2007-11-02    6.56
2500 2007-11-05    6.10

As is no doubt obvious, I don't know much R so if there's a better way to do anything please let me know.

The full code is on github if you're interested in seeing it in context.

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket