R: dplyr - Group by field dynamically ('regroup' is deprecated / no applicable method for 'as.lazy' applied to an object of class "list" )
A few months ago I wrote a blog explaining how to dynamically/programatically group a data frame by a field using dplyr but that approach has been deprecated in the latest version.
To recap, the original function looked like this:
library(dplyr)
groupBy = function(df, field) {
df %.% regroup(list(field)) %.% summarise(n = n())
}
And if we execute that with a sample data frame we’ll see the following:
> data = data.frame(
letter = sample(LETTERS, 50000, replace = TRUE),
number = sample (1:10, 50000, replace = TRUE)
)
> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
letter n
1 A 1951
2 B 1903
3 C 1954
4 D 1923
5 E 1886
Warning messages:
1: %.% is deprecated. Please use %>%
2: %.% is deprecated. Please use %>%
3: 'regroup' is deprecated.
Use 'group_by_' instead.
See help("Deprecated")
I replaced each of the deprecated operators and ended up with this function:
groupBy = function(df, field) {
df %>% group_by_(list(field)) %>% summarise(n = n())
}
Now if we run that:
> groupBy(data, 'letter') %>% head(5)
Error in UseMethod("as.lazy") :
no applicable method for 'as.lazy' applied to an object of class "list"
It turns out the 'group_by_' function doesn’t want to receive a list of fields so let’s remove the call to list:
groupBy = function(df, field) {
df %>% group_by_(field) %>% summarise(n = n())
}
And now if we run that:
> groupBy(data, 'letter') %>% head(5)
Source: local data frame [5 x 2]
letter n
1 A 1951
2 B 1903
3 C 1954
4 D 1923
5 E 1886
Good times! We get the correct result and no deprecation messages.
If we want to group by multiple fields we can just pass in the field names like so:
groupBy = function(df, field1, field2) {
df %>% group_by_(field1, field2) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
letter number n
1 A 1 200
2 A 2 218
3 A 3 205
4 A 4 176
5 A 5 203
Or with this simpler version:
groupBy = function(df, ...) {
df %>% group_by_(...) %>% summarise(n = n())
}
> groupBy(data, 'letter', 'number') %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
letter number n
1 A 1 200
2 A 2 218
3 A 3 205
4 A 4 176
5 A 5 203
I realised that we can actually just use the group_by itself and pass in the field names without quotes, something I couldn’t get to work in earlier versions:
groupBy = function(df, ...) {
df %>% group_by(...) %>% summarise(n = n())
}
> groupBy(data, letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
letter number n
1 A 1 200
2 A 2 218
3 A 3 205
4 A 4 176
5 A 5 203
We could even get a bit of pipelining going on if we fancied it:
> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
letter number n
1 A 1 200
2 A 2 218
3 A 3 205
4 A 4 176
5 A 5 203
And as of dplyr 0.3 we can simplify our groupBy function to make use of the new count function which combines group_by and summarise:
groupBy = function(df, ...) {
df %>% count(...)
}
> data %>% groupBy(letter, number) %>% head(5)
Source: local data frame [5 x 3]
Groups: letter
letter number n
1 A 1 200
2 A 2 218
3 A 3 205
4 A 4 176
5 A 5 203
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.