9 Nov 2014

R: dplyr - Ordering by count after multiple column group_by

I was recently trying to group a data frame by two columns and then sort by the count using dplyr but it wasn’t sorting in the way I expecting which was initially very confusing.

I started with this data frame:

R library(dplyr)

data = data.frame(
  letter = sample(LETTERS, 50000, replace = TRUE),
  number = sample (1:10, 50000, replace = TRUE)
  )

And I wanted to find out how many occurrences of each (letter, number) pair exist in the data set. I started with the following code:

R > data %>% count(letter, number, sort = TRUE)
Source: local data frame [260 x 3]
Groups: letter

   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

As you can see it’s only showing A’s which is interesting as I wouldn’t expect there to be a bias towards that letter. Let’s filter out the A’s:

R > data %>% filter(letter != "A") %>% count(letter, number, sort = TRUE)
Source: local data frame [250 x 3]
Groups: letter

   letter number   n
1       B      8 222
2       B      9 212
3       B      5 207
4       B      6 201
5       B     10 200
6       B      7 192
7       B      2 189
8       B      3 189
9       B      1 187
10      B      4 181
..    ...    ... ...

Now all we see are B’s and we can see that both (B,8) and (B,9) have a higher 'n' value than any of the A’s.

I put the code back into the more verbose form to see if it was the count function that behaved unexpectedly:

R > data %>% group_by(letter, number) %>% summarise(n = n()) %>% arrange(desc(n))
Source: local data frame [260 x 3]
Groups: letter

   letter number   n
1       A      4 205
2       A      9 201
3       A      3 197
4       A      1 195
5       A     10 191
6       A      2 189
7       A      8 184
8       A      7 183
9       A      5 181
10      A      6 173
..    ...    ... ...

Nope, still the same behaviour.

At this point I vaguely remembered there being a function called ungroup which I hadn’t used and wondered if now was the time.

R > data %>% group_by(letter, number) %>% summarise(n = n()) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]

   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...

Indeed it was and now we can go back to our original version of the code using count and handle the sorting afterwards:

R > data %>% count(letter, number) %>% ungroup() %>% arrange(desc(n))
Source: local data frame [260 x 3]

   letter number   n
1       L      2 236
2       V      1 231
3       Y      8 226
4       J      4 225
5       J     10 223
6       Q      7 223
7       B      8 222
8       O      9 222
9       Q     10 221
10      Z      9 221
..    ...    ... ...

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.