R/dplyr: Extracting data frame column value for filtering with %in%
I’ve been playing around with dplyr over the weekend and wanted to extract the values from a data frame column to use in a later filtering step.
I had a data frame:
library(dplyr)
df = data.frame(userId = c(1,2,3,4,5), score = c(2,3,4,5,5))
And wanted to extract the userIds of those people who have a score greater than 3. I started with:
highScoringPeople = df %>% filter(score > 3) %>% select(userId)
> highScoringPeople
userId
1 3
2 4
3 5
And then filtered the data frame expecting to get back those 3 people:
> df %>% filter(userId %in% highScoringPeople)
[1] userId score
<0 rows> (or 0-length row.names)
No rows! I created vector with the numbers 3-5 to make sure that worked:
> df %>% filter(userId %in% c(3,4,5))
userId score
1 3 4
2 4 5
3 5 5
That works as expected so highScoringPeople obviously isn’t in the right format to facilitate an 'in lookup'. Let’s explore:
> str(c(3,4,5))
num [1:3] 3 4 5
> str(highScoringPeople)
'data.frame': 3 obs. of 1 variable:
$ userId: num 3 4 5
Now it’s even more obvious why it doesn’t work - highScoringPeople is still a data frame when we need it to be a vector/list.
One way to fix this is to extract the userIds using the $ syntax instead of the select function:
highScoringPeople = (df %>% filter(score > 3))$userId
> str(highScoringPeople)
num [1:3] 3 4 5
> df %>% filter(userId %in% highScoringPeople)
userId score
1 3 4
2 4 5
3 5 5
Or if we want to do the column selection using dplyr we can extract the values for the column like this:
highScoringPeople = (df %>% filter(score > 3) %>% select(userId))[[1]]
> str(highScoringPeople)
num [1:3] 3 4 5
Not so difficult after all.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.