R: Calculating the difference between ordered factor variables
In my continued exploration of Wimbledon data I wanted to work out whether a player had done as well as their seeding suggested they should.
To do that I needed to calculate the difference between the round they reached and the round they were expected to reach. A 'round' in the dataset is an ordered factor variable.
These are all the possible values:
rounds = c("Did not enter", "Round of 128", "Round of 64", "Round of 32", "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
And if we want to convert a couple of strings into ordered factors with these levels we would do it like this:
round = factor("Finals", levels = rounds, ordered = TRUE)
expected = factor("Winner", levels = rounds, ordered = TRUE)
> round
[1] Finals
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
> expected
[1] Winner
9 Levels: Did not enter < Round of 128 < Round of 64 < Round of 32 < Round of 16 < Quarter-Finals < ... < Winner
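As a quick aside, here is a minimal sketch of why we can't just subtract one ordered factor from the other. Comparison operators work on ordered factors, but arithmetic operators return NA with a warning. The variable names `fell_short` and `direct_diff` are mine, purely for illustration.

```r
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
round <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# Comparisons respect the order of the levels
fell_short <- round < expected
print(fell_short)

# Arithmetic is not defined for ordered factors: R warns and returns NA.
# suppressWarnings() is only here to keep the output quiet.
direct_diff <- suppressWarnings(round - expected)
print(direct_diff)
```

So the ordering tells us whether a player under-performed, but not by how many rounds.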
In this case the difference between the actual round and the expected round should be -1: the player was expected to win the tournament but lost in the final. We can calculate that difference by calling the unclass function on each variable, which strips the factor class and leaves the underlying integer level codes:
> unclass(round) - unclass(expected)
[1] -1
attr(,"levels")
[1] "Did not enter" "Round of 128" "Round of 64" "Round of 32" "Round of 16" "Quarter-Finals"
[7] "Semi-Finals" "Finals" "Winner"
The result still carries the levels attribute from the factor variables, so to get rid of that we can cast it to a numeric value:
> as.numeric(unclass(round) - unclass(expected))
[1] -1
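An alternative that should give the same answer is to convert each factor to its integer codes first with as.integer, which drops the levels attribute up front. This is a sketch of that variation rather than the approach used above, and `diff_codes` is just an illustrative name.

```r
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")
round <- factor("Finals", levels = rounds, ordered = TRUE)
expected <- factor("Winner", levels = rounds, ordered = TRUE)

# as.integer() on a factor returns the level codes as a plain integer vector
diff_codes <- as.integer(round) - as.integer(expected)
print(diff_codes)

# No leftover attributes on the result
print(attributes(diff_codes))
```

Both versions rely on the two factors sharing the same levels in the same order, otherwise the codes are not comparable.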
And that’s it! We can now go and apply this calculation to all seeds to see how they got on.
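For what it's worth, here is a rough sketch of how that might look across several players at once. The `seeds` data frame, its column names, and the players in it are made up for illustration and are not from the Wimbledon dataset.

```r
rounds <- c("Did not enter", "Round of 128", "Round of 64", "Round of 32",
            "Round of 16", "Quarter-Finals", "Semi-Finals", "Finals", "Winner")

# Hypothetical data: player names and results are invented for this example
seeds <- data.frame(
  player   = c("Player A", "Player B", "Player C"),
  round    = c("Finals", "Round of 16", "Winner"),
  expected = c("Winner", "Quarter-Finals", "Semi-Finals"),
  stringsAsFactors = FALSE
)

# Convert both columns to ordered factors that share the same levels
seeds$round    <- factor(seeds$round, levels = rounds, ordered = TRUE)
seeds$expected <- factor(seeds$expected, levels = rounds, ordered = TRUE)

# The same calculation works element-wise on whole columns
seeds$difference <- as.numeric(unclass(seeds$round) - unclass(seeds$expected))
print(seeds)
```

A negative difference means the player went out earlier than their seeding suggested, and a positive one means they went further.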
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.