R: dplyr - squashing multiple rows per group into one
I spent a bit of the day working on my Wimbledon data set and the next thing I explored is all the people that have beaten Andy Murray in the tournament.
The following dplyr query gives us the names of those people and the year the match took place:
library(dplyr)
> main_matches %>% filter(loser == "Andy Murray") %>% select(winner, year)
winner year
1 Grigor Dimitrov 2014
2 Roger Federer 2012
3 Rafael Nadal 2011
4 Rafael Nadal 2010
5 Andy Roddick 2009
6 Rafael Nadal 2008
7 Marcos Baghdatis 2006
8 David Nalbandian 2005
As you can see, Rafael Nadal shows up multiple times. I wanted to get one row per player and list all the years in a single column.
This was my initial attempt:
> main_matches %>% filter(loser == "Andy Murray") %>%
group_by(winner) %>% summarise(years = paste(year))
Source: local data frame [6 x 2]
winner years
1 Andy Roddick 2009
2 David Nalbandian 2005
3 Grigor Dimitrov 2014
4 Marcos Baghdatis 2006
5 Rafael Nadal 2011
6 Roger Federer 2012
Unfortunately it just gives you the last matching row per group which isn’t quite what we want.. I realised my mistake while trying to pass a vector into paste and noticing that a vector came back when I’d expected a string:
> paste(c(2008,2009,2010))
[1] "2008" "2009" "2010"
The missing argument was 'collapse' - something I’d come across when using plyr last year:
> paste(c(2008,2009,2010), collapse=", ")
[1] "2008, 2009, 2010"
Now, if we apply that to our original function:
> main_matches %>% filter(loser == "Andy Murray") %>%
group_by(winner) %>% summarise(years = paste(year, collapse=", "))
Source: local data frame [6 x 2]
winner years
1 Andy Roddick 2009
2 David Nalbandian 2005
3 Grigor Dimitrov 2014
4 Marcos Baghdatis 2006
5 Rafael Nadal 2011, 2010, 2008
6 Roger Federer 2012
That’s exactly what we want. Let’s tidy that up a bit:
> main_matches %>% filter(loser == "Andy Murray") %>%
group_by(winner) %>% arrange(year) %>%
summarise(years = paste(year, collapse =","), times = length(year)) %>%
arrange(desc(times), years)
Source: local data frame [6 x 3]
winner years times
1 Rafael Nadal 2008,2010,2011 3
2 David Nalbandian 2005 1
3 Marcos Baghdatis 2006 1
4 Andy Roddick 2009 1
5 Roger Federer 2012 1
6 Grigor Dimitrov 2014 1
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.