R: dplyr - Error in (list: invalid subscript type 'double'
In my continued playing around with R I wanted to find the minimum value for a specified percentile given a data frame representing a cumulative distribution function (CDF).
e.g. imagine we have the following CDF represented in a data frame:
library(dplyr)
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
and we want to find the minimum value for the 0.05 percentile. We can use the filter function to do so:
> (df %>% filter(percentile > 0.05) %>% slice(1))$score
[1] 7
Things become more tricky if we want to return multiple percentiles in one go.
My first thought was to create a data frame with one row for each target percentile and then pull in the appropriate row from our original data frame:
targetPercentiles = c(0.05, 0.2)
percentilesDf = data.frame(targetPercentile = targetPercentiles)
> percentilesDf %>%
group_by(targetPercentile) %>%
mutate(x = (df %>% filter(percentile > targetPercentile) %>% slice(1))$score)
Error in (list(score = c(5, 7, 8, 10, 12, 20), percentile = c(0.05, 0.1, :
invalid subscript type 'double'
Unfortunately this didn’t quite work as I expected - Antonios pointed out that this is probably because we’re mixing up two pipelines and dplyr can’t figure out what we want to do.
Instead he suggested the following variant which uses the https://github.com/hadley/dplyr#do function
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
targetPercentiles = c(0.05, 0.2)
> data.frame(targetPercentile = targetPercentiles) %>%
group_by(targetPercentile) %>%
do(df) %>%
filter(percentile > targetPercentile) %>%
slice(1) %>%
select(targetPercentile, score)
Source: local data frame [2 x 2]
Groups: targetPercentile
targetPercentile score
1 0.05 7
2 0.20 12
We can then wrap this up in a function:
percentiles = function(df, targetPercentiles) {
# make sure the percentiles are in order
df = df %>% arrange(percentile)
data.frame(targetPercentile = targetPercentiles) %>%
group_by(targetPercentile) %>%
do(df) %>%
filter(percentile > targetPercentile) %>%
slice(1) %>%
select(targetPercentile, score)
}
which we call like this:
df = data.frame(score = c(5,7,8,10,12,20), percentile = c(0.05,0.1,0.15,0.20,0.25,0.5))
> percentiles(df, c(0.08, 0.10, 0.50, 0.80))
Source: local data frame [2 x 2]
Groups: targetPercentile
targetPercentile score
1 0.08 7
2 0.10 8
Note that we don’t actually get any rows back for 0.50 or 0.80 since we don’t have any entries greater than those percentiles. With a proper CDF we would though so the function does its job.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.