25 Jun 2015

R: Scraping Wimbledon draw data

Given Wimbledon starts next week I wanted to find a data set to explore before it gets underway. Having searched around and failed to find one I had to resort to scraping the ATP World Tour’s event page which displays the matches in an easy to access format.

We’ll be using the Wimbledon 2013 draw since Andy Murray won that year! This is what the page looks like:

Each match is in its own row of a table and each column has a class attribute which makes it really easy to scrape. We’ll be using R’s rvest again. I wrote the following script which grabs the player names, seedings and score of the match and stores everything in a data frame:

library(rvest)
library(dplyr)
library(stringr)

s = html_session("http://www.atpworldtour.com/en/scores/archive/wimbledon/540/2013/results")
rows = s %>% html_nodes("div#scoresResultsContent tr")

matches = data.frame()
for(row in rows) {
  players = row %>% html_nodes("td.day-table-name a")
  seedings = row %>% html_nodes("td.day-table-seed")
  score = row %>% html_node("td.day-table-score a")

  if(!is.null(score)) {
    player1 = players[1] %>% html_text() %>% str_trim()
    seeding1 = ifelse(!is.na(seedings[1]), seedings[1] %>% html_node("span") %>% html_text() %>% str_trim(), NA)

    player2 = players[2] %>% html_text() %>% str_trim()
    seeding2 = ifelse(!is.na(seedings[2]), seedings[2] %>% html_node("span") %>% html_text() %>% str_trim(), NA)

    matches = rbind(data.frame(winner = player1,
                               winner_seeding = seeding1,
                               loser = player2,
                               loser_seeding = seeding2,
                               score = score %>% html_text() %>% str_trim(),
                               round = round), matches)

  } else {
    round = row %>% html_node("th") %>% html_text()
  }
}

This is what the data frame looks like:

> matches %>% sample_n(10)
               winner winner_seeding                       loser loser_seeding            score                round
61      Wayne Odesnik            (4)                Thiago Alves          <NA>            61 64 1st Round Qualifying
4     Danai Udomchoke           <NA>            Marton Fucsovics          <NA>       61 57 1210 1st Round Qualifying
233    Jerzy Janowicz           (24)                Lukasz Kubot          <NA>         75 64 64       Quarter-Finals
90       Malek Jaziri           <NA>             Illya Marchenko           (9)        674 75 64 2nd Round Qualifying
222      David Ferrer            (4)         Alexandr Dolgopolov          (26) 676 762 26 61 62          Round of 32
54  Michal Przysiezny           (11)                 Dusan Lojda          <NA>         26 63 62 1st Round Qualifying
52           Go Soeda           (13)               Nikola Mektic          <NA>            62 60 1st Round Qualifying
42    Ruben Bemelmans           (23) Jonathan Dasnieres de Veigy          <NA>            63 64 1st Round Qualifying
31        Mirza Basic           <NA>              Tsung-Hua Yang          <NA>     674 33 (RET) 1st Round Qualifying
179     Jurgen Melzer           <NA>              Julian Reister           (Q)    36 762 765 62          Round of 64

It also contains qualifying matches which I’m not so interested in. Let’s strip those out:

main_matches = matches %>% filter(!grepl("Qualifying", round)) %>% mutate(year = 2013)

We’ll also put a column in for 'year' so that we can handle the draws for multiple years later on.

Next I wanted to clean up the data a bit. I’d like to be able to do some queries based on the seedings of the players but at the moment that column contains numeric brackets in values as well as some other values which indicate whether a player is a qualifier, lucky loser or wildcard entry.

I started by adding a column to store this extra information:

main_matches$winner_type = NA
main_matches$winner_type[main_matches$winner_seeding == "(WC)"] = "wildcard"
main_matches$winner_type[main_matches$winner_seeding == "(Q)"] = "qualifier"
main_matches$winner_type[main_matches$winner_seeding == "(LL)"] = "lucky loser"

main_matches$loser_type = NA
main_matches$loser_type[main_matches$loser_seeding == "(WC)"] = "wildcard"
main_matches$loser_type[main_matches$loser_seeding == "(Q)"] = "qualifier"
main_matches$loser_type[main_matches$loser_seeding == "(LL)"] = "lucky loser"

And then I cleaned up the existing column:

tidy_seeding = function(seeding) {
  no_brackets = gsub("\\(|\\)", "", seeding)
  return(gsub("WC|Q|L", NA, no_brackets))
}

main_matches = main_matches %>%
  mutate(winner_seeding = as.numeric(tidy_seeding(winner_seeding)),
         loser_seeding = as.numeric(tidy_seeding(loser_seeding)))

Now we can write a query against the data frame to find out when the underdog won i.e. a player with no seeding beat a player with a seeding or a lower seeded player beat a higher seeded one:

> main_matches %>%  filter((winner_seeding > loser_seeding) | (is.na(winner_seeding) & !is.na(loser_seeding)))
                  winner winner_seeding                 loser loser_seeding                  score          round year
1          Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62   Round of 128 2013
2          Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63   Round of 128 2013
3        Feliciano Lopez             NA          Gilles Simon            19             62 64 7611   Round of 128 2013
4             Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET)   Round of 128 2013
5         Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765   Round of 128 2013
6         Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63   Round of 128 2013
7           Steve Darcis             NA          Rafael Nadal             5             764 768 64   Round of 128 2013
8      Fernando Verdasco             NA      Julien Benneteau            31             761 764 64    Round of 64 2013
9           Grega Zemlja             NA       Grigor Dimitrov            29       36 764 36 64 119    Round of 64 2013
10      Adrian Mannarino             NA            John Isner            18               11 (RET)    Round of 64 2013
11         Igor Sijsling             NA          Milos Raonic            17              75 64 764    Round of 64 2013
12     Kenny De Schepper             NA           Marin Cilic            10                  (W/O)    Round of 64 2013
13        Ernests Gulbis             NA    Jo-Wilfried Tsonga             6         36 63 63 (RET)    Round of 64 2013
14     Sergiy Stakhovsky             NA         Roger Federer             3         675 765 75 765    Round of 64 2013
15          Lukasz Kubot             NA          Benoit Paire            25               61 63 64    Round of 32 2013
16     Kenny De Schepper             NA           Juan Monaco            22              64 768 64    Round of 32 2013
17        Jerzy Janowicz             24       Nicolas Almagro            15              766 63 64    Round of 32 2013
18         Andreas Seppi             23         Kei Nishikori            12        36 62 674 61 64    Round of 32 2013
19         Bernard Tomic             NA       Richard Gasquet             9          767 57 75 765    Round of 32 2013
20 Juan Martin Del Potro              8          David Ferrer             4              62 64 765 Quarter-Finals 2013
21           Andy Murray              2        Novak Djokovic             1               64 75 64         Finals 2013

There are actually very few times when a lower seeded player beat a higher seeded one but there are quite a few instances of non seeds beating seeds. We’ve got 21 occurrences of underdogs winning out of a total of 127 matches.

Let’s filter that set of rows and see which seeds lost in the first round:

> main_matches %>%  filter(round == "Round of 128" & !is.na(loser_seeding))
           winner winner_seeding                 loser loser_seeding                  score        round year
1   Jurgen Melzer             NA         Fabio Fognini            30           675 75 63 62 Round of 128 2013
2   Bernard Tomic             NA           Sam Querrey            21       766 763 36 26 63 Round of 128 2013
3 Feliciano Lopez             NA          Gilles Simon            19             62 64 7611 Round of 128 2013
4      Ivan Dodig             NA Philipp Kohlschreiber            16 46 676 763 63 21 (RET) Round of 128 2013
5  Viktor Troicki             NA      Janko Tipsarevic            14              63 64 765 Round of 128 2013
6  Lleyton Hewitt             NA         Stan Wawrinka            11               64 75 63 Round of 128 2013
7    Steve Darcis             NA          Rafael Nadal             5             764 768 64 Round of 128 2013

Rafael Nadal is the most prominent but Stan Wawrinka also lost in the first round that year which I’d forgotten about! Next let’s make the 'round' column an ordered factor one so that we can sort matches by round:

main_matches$round = factor(main_matches$round, levels =  c("Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals"))

> main_matches$round
...
Levels: Round of 128 Round of 64 Round of 32 Round of 16 Quarter-Finals Semi-Finals Finals

We can now really easily work out which unseeded players went the furthest in the tournament:

> main_matches %>% filter(is.na(loser_seeding)) %>% arrange(desc(round)) %>% head(5)
             winner winner_seeding             loser loser_seeding           score          round year
1    Jerzy Janowicz             24      Lukasz Kubot            NA        75 64 64 Quarter-Finals 2013
2       Andy Murray              2 Fernando Verdasco            NA  46 36 61 64 75 Quarter-Finals 2013
3 Fernando Verdasco             NA Kenny De Schepper            NA        64 64 64    Round of 16 2013
4      Lukasz Kubot             NA  Adrian Mannarino            NA  46 63 36 63 64    Round of 16 2013
5    Jerzy Janowicz             24     Jurgen Melzer            NA 36 761 64 46 64    Round of 16 2013

Next up I thought it’d be cool to write a function which showed which round each player exited in:

round_reached = function(player, main_matches) {
  furthest_match = main_matches %>%
    filter(winner == player | loser == player) %>%
    arrange(desc(round)) %>%
    head(1)

    return(ifelse(furthest_match$winner == player, "Winner", as.character(furthest_match$round)))
}

Our function isn’t vectorisable - it only works if we pass in a single player at a time so we’ll have to group the data frame by player before calling it. Let’s check it works by seeing how far Andy Murray and Rafael Nadal got:

> round_reached("Rafael Nadal", main_matches)
[1] "Round of 128"
> round_reached("Andy Murray", main_matches)
[1] "Winner"

Great. What about if we try it against each of the top 8 seeds?

> rbind(main_matches %>% filter(winner_seeding %in% 1:8) %>% mutate(name = winner, seeding = winner_seeding),
        main_matches %>% filter(loser_seeding %in% 1:8) %>% mutate(name = loser, seeding = loser_seeding)) %>%
    select(name, seeding) %>%
    distinct() %>%
    arrange(seeding) %>%
    group_by(name) %>%
    mutate(round_reached = round_reached(name, main_matches))
Source: local data frame [8 x 3]
Groups: name

                   name seeding  round_reached
1        Novak Djokovic       1         Finals
2           Andy Murray       2         Winner
3         Roger Federer       3    Round of 64
4          David Ferrer       4 Quarter-Finals
5          Rafael Nadal       5   Round of 128
6    Jo-Wilfried Tsonga       6    Round of 64
7         Tomas Berdych       7 Quarter-Finals
8 Juan Martin Del Potro       8    Semi-Finals

Neat. Next up I want to do a comparison between the round they reached and the round you’d expect them to get to given their seeding but that’s for the weekend!

I’ve put a CSV file containing all the data in this gist in case you want to play with it. I’m planning to scrape a few more years worth of data before Monday and add in some extra fields as well but in case I don’t get around to it the full script in this blog post is included in the gist as well so feel free to tweak it if tennis is your thing.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.