R: Wimbledon - How do the seeds get on?
Continuing on with the Wimbledon data set I’ve been playing with I wanted to do some exploration on how the seeded players have fared over the years.
Taking the last 10 years worth of data there have always had 32 seeds and with the following function we can feed in a seeding and get back the round they would be expected to reach:
expected_round = function(seeding) {
if(seeding == 1) {
return("Winner")
} else if(seeding == 2) {
return("Finals")
} else if(seeding <= 4) {
return("Semi-Finals")
} else if(seeding <= 8) {
return("Quarter-Finals")
} else if(seeding <= 16) {
return("Round of 16")
} else {
return("Round of 32")
}
}
> expected_round(1)
[1] "Winner"
> expected_round(4)
[1] "Semi-Finals"
We can then have a look at each of the Wimbledon tournaments and work out how far they actually got.
round_reached = function(player, main_matches) {
furthest_match = main_matches %>%
filter(winner == player | loser == player) %>%
arrange(desc(round)) %>%
head(1)
return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}
seeds = function(matches_to_consider) {
winners = matches_to_consider %>% filter(!is.na(winner_seeding)) %>%
select(name = winner, seeding = winner_seeding) %>% distinct()
losers = matches_to_consider %>% filter( !is.na(loser_seeding)) %>%
select(name = loser, seeding = loser_seeding) %>% distinct()
return(rbind(winners, losers) %>% distinct() %>% mutate(name = as.character(name)))
}
Let’s have a look how the seeds got on last year:
matches_to_consider = main_matches %>% filter(year == 2014)
result = seeds(matches_to_consider) %>% group_by(name) %>%
mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>%
ungroup() %>% arrange(seeding)
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
result$round = factor(result$round, levels = rounds, ordered = TRUE)
result$expected = factor(result$expected, levels = rounds, ordered = TRUE)
> result %>% head(10)
Source: local data frame [10 x 4]
name seeding expected round
1 Novak Djokovic 1 Winner Winner
2 Rafael Nadal 2 Finals Round of 16
3 Andy Murray 3 Semi-Finals Quarter-Finals
4 Roger Federer 4 Semi-Finals Finals
5 Stan Wawrinka 5 Quarter-Finals Quarter-Finals
6 Tomas Berdych 6 Quarter-Finals Round of 32
7 David Ferrer 7 Quarter-Finals Round of 64
8 Milos Raonic 8 Quarter-Finals Semi-Finals
9 John Isner 9 Round of 16 Round of 32
10 Kei Nishikori 10 Round of 16 Round of 16
We’ll wrap all of that code into the following function:
expectations = function(y, matches) {
matches_to_consider = matches %>% filter(year == y)
result = seeds(matches_to_consider) %>% group_by(name) %>%
mutate(expected = expected_round(seeding), round = round_reached(name, matches_to_consider)) %>%
ungroup() %>% arrange(seeding)
result$round = factor(result$round, levels = rounds, ordered = TRUE)
result$expected = factor(result$expected, levels = rounds, ordered = TRUE)
return(result)
}
Next, instead of showing the round names it’d be cool to come up with numerical value indicating how well the player did:
-
-1 would mean they lost in the round before their seeding suggested e.g. seed 2 loses in Semi Final
-
2 would mean they got 2 rounds further than they should have e.g. Seed 7 reaches the Final
The unclass function comes to our rescue here:
# expectations plot
years = 2005:2014
exp = data.frame()
for(y in years) {
differences = (expectations(y, main_matches) %>%
mutate(expected_n = unclass(expected),
round_n = unclass(round),
difference = round_n - expected_n))$difference %>% as.numeric()
exp = rbind(exp, data.frame(year = rep(y, length(differences)), difference = differences))
}
> exp %>% sample_n(10)
Source: local data frame [10 x 6]
name seeding expected_n round_n difference year
1 Tomas Berdych 6 6 5 -1 2011
2 Tomas Berdych 7 6 6 0 2013
3 Rafael Nadal 2 8 5 -3 2014
4 Fabio Fognini 16 5 4 -1 2014
5 Robin Soderling 13 5 5 0 2009
6 Jurgen Melzer 16 5 5 0 2010
7 Nicolas Almagro 19 4 2 -2 2010
8 Stan Wawrinka 14 5 3 -2 2011
9 David Ferrer 7 6 5 -1 2011
10 Mikhail Youzhny 14 5 5 0 2007
We can then group by the 'difference' column to see how seeds are getting on as a whole:
> exp %>% count(difference)
Source: local data frame [9 x 2]
difference n
1 -5 2
2 -4 7
3 -3 24
4 -2 70
5 -1 66
6 0 85
7 1 43
8 2 17
9 3 4
library(ggplot2)
ggplot(aes(x = difference, y = n), data = exp %>% count(difference)) +
geom_bar(stat = "identity") +
scale_x_continuous(limits=c(min(potential), max(potential) + 1))
So from this visualisation we can see that the most common outcome for a seed is that they reach the round they were expected to reach. There are still a decent number of seeds who do 1 or 2 rounds worse than expected as well though.
Antonios suggested doing some analysis of how the seeds fared on a year by year basis - we’ll start by looking at what % of them exactly achieved their seeding:
exp$correct_pred = 0
exp$correct_pred[dt$difference==0] = 1
exp %>% group_by(year) %>%
summarise(MeanDiff = mean(difference),
PrcCorrect = mean(correct_pred),
N=n())
Source: local data frame [10 x 4]
year MeanDiff PrcCorrect N
1 2005 -0.6562500 0.2187500 32
2 2006 -0.8125000 0.2812500 32
3 2007 -0.4838710 0.4193548 31
4 2008 -0.9677419 0.2580645 31
5 2009 -0.3750000 0.2500000 32
6 2010 -0.7187500 0.4375000 32
7 2011 -0.7187500 0.0937500 32
8 2012 -0.7500000 0.2812500 32
9 2013 -0.9375000 0.2500000 32
10 2014 -0.7187500 0.1875000 32
Some years are better than others - we can use a chisq test to see whether there are any significant differences between the years:
tbl = table(exp$year, exp$correct_pred)
tbl
> chisq.test(tbl)
Pearson's Chi-squared test
data: tbl
X-squared = 14.9146, df = 9, p-value = 0.09331
This looks for at least one statistically significant different between the years, although it doesn’t look like there are any. We can also try doing a comparison of each year against all the others:
> pairwise.prop.test(tbl)
Pairwise comparisons using Pairwise comparison of proportions
data: tbl
2005 2006 2007 2008 2009 2010 2011 2012 2013
2006 1.00 - - - - - - - -
2007 1.00 1.00 - - - - - - -
2008 1.00 1.00 1.00 - - - - - -
2009 1.00 1.00 1.00 1.00 - - - - -
2010 1.00 1.00 1.00 1.00 1.00 - - - -
2011 1.00 1.00 0.33 1.00 1.00 0.21 - - -
2012 1.00 1.00 1.00 1.00 1.00 1.00 1.00 - -
2013 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 -
2014 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
P value adjustment method: holm
2007/2011 and 2010/2011 show the biggest differences but they’re still not significant. Since we have so few data items in each bucket there has to be a really massive difference for it to be significant.
The data I used in this post is available on this gist if you want to look into it and come up with your own analysis.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.