
R: ggplot - Show discrete scale even with no value

As I mentioned in a previous blog post, I’ve been scraping data for the Wimbledon tennis tournament. Having collected the data for the last ten years, I wrote a query using dplyr to find out how far each player got in each year over that period.

I ended up with the following functions to filter my data frame of all the matches:

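library(dplyr)

# Returns the furthest round `player` reached in `main_matches`, or "Winner"
# if they won their last match. Assumes `round` sorts from earliest to latest
# round (for example, a factor with its levels in that order).
round_reached = function(player, main_matches) {
  furthest_match = main_matches %>%
    filter(winner == player | loser == player) %>%
    arrange(desc(round)) %>%
    head(1)

  # With no matches, furthest_match has zero rows and so does the result.
  return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}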

# Builds one row per year with the furthest round `name` reached that year.
player_performance = function(name, matches) {
  player = data.frame()
  for(y in 2005:2014) {
    round = round_reached(name, filter(matches, year == y))
    if(length(round) == 1) {
      player = rbind(player, data.frame(year = y, round = round))
    } else {
      # A zero-length result means the player had no matches that year.
      player = rbind(player, data.frame(year = y, round = "Did not enter"))
    }
  }
  return(player)
}

When we call that function we see the following output:

> player_performance("Andy Murray", main_matches)
   year          round
1  2005    Round of 32
2  2006    Round of 16
3  2007  Did not enter
4  2008 Quarter-Finals
5  2009    Semi-Finals
6  2010    Semi-Finals
7  2011    Semi-Finals
8  2012         Finals
9  2013         Winner
10 2014 Quarter-Finals
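The 'Did not enter' row for 2007 comes from the length(round) == 1 check: when a player has no matches in a year, the filtered data frame has zero rows and ifelse returns a zero-length result. The snippet below is a minimal base R sketch of that edge case; the data frames and the names "Player A" and "Player B" are made up for illustration and are not part of the scraped data.

```r
# Hypothetical year in which the player has no matches: a zero-row data frame.
no_matches <- data.frame(winner = character(0), loser = character(0),
                         round = character(0), stringsAsFactors = FALSE)

# ifelse() returns a result as long as its test, so zero rows give length 0.
skipped <- ifelse(no_matches$winner == "Player A", "Winner",
                  as.character(no_matches$round))
length(skipped)  # 0, so the else branch records "Did not enter"

# Hypothetical year in which the player's last match was a lost final.
lost_final <- data.frame(winner = "Player B", loser = "Player A",
                         round = "Finals", stringsAsFactors = FALSE)

reached <- ifelse(lost_final$winner == "Player A", "Winner",
                  as.character(lost_final$round))
reached  # "Finals"
```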

I wanted to create a chart showing Murray’s progress over the years, with the round reached on the y axis and the year on the x axis. To do this I had to make sure the 'round' column was treated as a factor, with its levels listed in tournament order:

df = player_performance("Andy Murray", main_matches)

rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16",
           "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
df$round = factor(df$round, levels = rounds)

> df$round
 [1] Round of 32    Round of 16    Did not enter  Quarter-Finals Semi-Finals    Semi-Finals    Semi-Finals
 [8] Finals         Winner         Quarter-Finals
Levels: Did not enter Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals Winner
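To see why the explicit levels matter, here is a small base R sketch. The values in reached are made up for illustration; only the rounds vector matches the one above. It shows that a factor built without levels falls back to sorted order, while one built with levels keeps tournament order and retains the rounds that never occur.

```r
# Hypothetical results for illustration; not taken from the scraped data.
reached <- c("Semi-Finals", "Round of 32", "Winner", "Finals")

rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Without `levels`, the levels are the sorted unique values,
# so the tournament order is lost.
alphabetical <- factor(reached)
levels(alphabetical)

# With `levels`, tournament order is kept and rounds with no data
# stay on as unused levels.
with_levels <- factor(reached, levels = rounds)
levels(with_levels)

# The integer codes follow the level order.
as.integer(with_levels)

# table() counts every level, including the unused ones (count 0).
table(with_levels)
```

Those unused levels are exactly what a discrete ggplot scale leaves out unless told otherwise.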

Now that we’ve got that we can plot his progress:

library(ggplot2)

ggplot(aes(x = year, y = round, group=1), data = df) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks=df$year) +
    scale_y_discrete(breaks = rounds)

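This is a good start, but the y axis has lost the rounds that have no corresponding data point (here 'Round of 128' and 'Round of 64'), because ggplot drops unused factor levels from a discrete scale by default. I’d like to keep them so it’s easier to compare the performance of different players.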

It turns out that all we need to do is pass 'drop = FALSE' to scale_y_discrete and it will work exactly as we want:

ggplot(aes(x = year, y = round, group=1), data = df) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks=df$year) +
    scale_y_discrete(breaks = rounds, drop = FALSE)
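If you don’t have the scraped data to hand, the effect can be reproduced with a few made-up rows. This is a minimal sketch: the data frame below is hypothetical, and the names base_plot, p_default and p_all are my own. It builds the same chart twice, once with the default scale and once with drop = FALSE.

```r
library(ggplot2)

rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical results for illustration; not the scraped Wimbledon data.
df <- data.frame(
  year  = 2011:2014,
  round = factor(c("Round of 16", "Semi-Finals", "Finals", "Semi-Finals"),
                 levels = rounds)
)

base_plot <- ggplot(df, aes(x = year, y = round, group = 1)) +
  geom_point() +
  geom_line()

# Default (drop = TRUE): only the three rounds present in the data
# get a position on the y axis.
p_default <- base_plot + scale_y_discrete()

# drop = FALSE: every level of the factor gets a position,
# whether or not any row uses it.
p_all <- base_plot + scale_y_discrete(drop = FALSE)
```

Printing p_default and p_all one after the other shows the difference: the second keeps all nine rounds on the y axis.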

Neat. Now let’s have a look at the performances of some of the other top players:

# Note: relies on the global `rounds` vector defined earlier.
draw_chart = function(player, main_matches){
  df = player_performance(player, main_matches)
  df$round = factor(df$round, levels = rounds)

  ggplot(aes(x = year, y = round, group=1), data = df) +
    geom_point() +
    geom_line() +
    scale_x_continuous(breaks=df$year) +
    scale_y_discrete(breaks = rounds, drop=FALSE) +
    ggtitle(player) +
    theme(axis.text.x=element_text(angle=90, hjust=1))
}

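library(gridExtra)

# Named after the players rather than a, b, c, d, so that base R's c() is not masked.
murray = draw_chart("Andy Murray", main_matches)
djokovic = draw_chart("Novak Djokovic", main_matches)
nadal = draw_chart("Rafael Nadal", main_matches)
federer = draw_chart("Roger Federer", main_matches)

grid.arrange(murray, djokovic, nadal, federer, ncol=2)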
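A different way to get a similar grid, which is not what I used above, is to put all the players in one data frame and let facet_wrap split the panels. The sketch below uses hypothetical results for two made-up players ("Player A" and "Player B"), and the results data frame and its player column are assumptions of the example.

```r
library(ggplot2)

rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical results for two made-up players; not the scraped data.
results <- data.frame(
  player = rep(c("Player A", "Player B"), each = 3),
  year   = rep(2012:2014, times = 2),
  round  = c("Round of 16", "Semi-Finals", "Winner",
             "Did not enter", "Round of 32", "Quarter-Finals")
)
results$round <- factor(results$round, levels = rounds)

# group = player draws one line per player; facet_wrap() gives each
# player a panel, and by default the panels share one y scale.
p <- ggplot(results, aes(x = year, y = round, group = player)) +
  geom_point() +
  geom_line() +
  scale_x_continuous(breaks = 2012:2014) +
  scale_y_discrete(drop = FALSE) +
  facet_wrap(~ player, ncol = 2)
```

Calling print(p) draws the panels. Because the facets share a y scale, drop = FALSE only has to be set once, and the rounds should line up across players without needing gridExtra.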

And that’s all for now!
