Neo4j: BBC football live text fouls graph
I recently came across the Partially Derivative podcast and in episode 17 they describe how Kirk Goldsberry scraped a bunch of data about shots in basketball matches then ran some analysis on that data.
It got me thinking that we might be able to do something similar for football matches and although event based data for football matches only comes from Opta, the BBC does expose some of them in live text feeds.
We’ll start with the Champions League match between Barcelona and Bayern Munich from last Tuesday.
Our first task is to extract the events that happened in the match along with the players involved. After we’ve got that we’ll generate a Neo4j graph and see if we can find some interesting insights.
I find the feedback cycle with this type of work is dramatically improved if we have the source data available locally so the first step was to get the BBC web page downloaded:
$ wget http://www.bbc.co.uk/sport/0/football/32683310
Next we need to write a scraper which will extract all the events. We want to get an array containing one entry for each event, where the following is an example of an event:
HTML-wise it looks like this:
I do most of my scraping work in Python so I used the Beautiful Soup library with the soupselect wrapper to get the data into CSV format ready to import into Neo4j.
It was mostly a straight forward job of finding the appropriate CSS tag and pulling out the values although the way fouls are described in the page is a bit strange - sometimes the person fouled comes first row and the fouler comes on the next line and sometimes vice versa.
Luckily the two parts of the foul can be joined together by matching the time which made life easier.
The full code for the scrapper is on github if you want to play with it.
This is what the resulting CSV file looks like:
$ head -n 10 data/events.csv
matchId,foulId,freeKickId,time,foulLocation,fouledPlayer,fouledPlayerTeam,foulingPlayer,foulingPlayerTeam
32683310,3,2,90:00 +0:40,in the defensive half.,Xabi Alonso,FC Bayern München,Pedro,Barcelona
32683310,9,8,84:38,on the right wing.,Rafinha,FC Bayern München,Pedro,Barcelona
32683310,12,13,83:17,in the attacking half.,Lionel Messi,Barcelona,Sebastian Rode,FC Bayern München
32683310,15,14,82:43,in the defensive half.,Sebastian Rode,FC Bayern München,Neymar,Barcelona
32683310,17,18,80:41,in the attacking half.,Pedro,Barcelona,Xabi Alonso,FC Bayern München
32683310,22,23,76:31,in the defensive half.,Neymar,Barcelona,Rafinha,FC Bayern München
32683310,25,26,75:03,in the attacking half.,Lionel Messi,Barcelona,Xabi Alonso,FC Bayern München
32683310,31,30,69:37,in the attacking half.,Bastian Schweinsteiger,FC Bayern München,Dani Alves,Barcelona
32683310,36,35,63:27,in the attacking half.,Robert Lewandowski,FC Bayern München,Ivan Rakitic,Barcelona
Now it’s time to create a graph. We’ll aim to massage the data into this model:
Next we need to write some Cypher code to get the CSV data into the graph. The full script is here, a sample of which is below:
// match
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Match {id: row.matchId});
// teams
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.foulingPlayerTeam});
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (:Team {name: row.fouledPlayerTeam});
// players
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MERGE (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
ON CREATE SET player.name = row.foulingPlayer;
// appearances
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (match:Match {id: row.matchId})
MATCH (player:Player {id: row.foulingPlayer + "_" + row.foulingPlayerTeam})
MATCH (team:Team {name: row.foulingPlayerTeam})
MERGE (appearance:Appearance {id: player.id + " in " + row.matchId})
MERGE (player)-[:MADE_APPEARANCE]->(appearance)
MERGE (appearance)-[:IN_MATCH]->(match)
MERGE (appearance)-[:FOR_TEAM]->(team);
// fouls
LOAD CSV WITH HEADERS FROM "file:///Users/markhneedham/projects/neo4j-bbc/data/events.csv" AS row
MATCH (foulingPlayer:Player {id:row.foulingPlayer + "_" + row.foulingPlayerTeam })
MATCH (fouledPlayer:Player {id:row.fouledPlayer + "_" + row.fouledPlayerTeam })
MATCH (match:Match {id: row.matchId})
MERGE (foul:Foul {eventId: row.foulId})
ON CREATE SET foul.time = row.time, foul.location = row.foulLocation
MERGE (foul)<-[:COMMITTED_FOUL]-(foulingPlayer)
MERGE (foul)-[:COMMITTED_AGAINST]->(fouledPlayer)
MERGE (foul)-[:COMMITTED_IN_MATCH]->(match);
We’ll use neo4j-shell to execute the script:
$ ./neo4j-community-2.2.1/bin/neo4j-shell --file import.cql
Now that we’ve got the data into Neo4j we need to come up with some questions to ask of it. I came up with the following but perhaps you can think of some others!
-
Where do the fouls happen on the pitch?
-
Who made the most fouls?
-
Who was fouled the most?
-
Who fouled who the most?
-
Which team fouled the most?
-
Who’s the worst fouler in each team?
-
Who’s the most fouled in each team?
Where do the fouls happen?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)
RETURN foul.location AS location, COUNT(*) as fouls
ORDER BY fouls DESC;
+----------------------------------+
| location | fouls |
+----------------------------------+
| "in the defensive half." | 12 |
| "in the attacking half." | 12 |
| "on the right wing." | 3 |
| "on the left wing." | 3 |
+----------------------------------+
4 rows
Who fouls the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
+------------------------------+
| fouler | fouls |
+------------------------------+
| "Rafinha" | 4 |
| "Pedro" | 3 |
| "Medhi Benatia" | 3 |
| "Dani Alves" | 3 |
| "Xabi Alonso" | 3 |
| "Javier Mascherano" | 2 |
| "Thiago Alcántara" | 2 |
| "Robert Lewandowski" | 2 |
| "Sebastian Rode" | 1 |
| "Sergio Busquets" | 1 |
+------------------------------+
10 rows
Who was fouled the most?
// who was fouled the most
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled)
RETURN fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
+----------------------------------+
| fouled | fouls |
+----------------------------------+
| "Robert Lewandowski" | 4 |
| "Lionel Messi" | 4 |
| "Neymar" | 3 |
| "Pedro" | 2 |
| "Xabi Alonso" | 2 |
| "Andrés Iniesta" | 2 |
| "Rafinha" | 2 |
| "Bastian Schweinsteiger" | 2 |
| "Sebastian Rode" | 1 |
| "Sergio Busquets" | 1 |
+----------------------------------+
10 rows
Who fouled who the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]->(fouled),
(foul)<-[:COMMITTED_FOUL]-(fouler)
RETURN fouler.name AS fouler, fouled.name AS fouled, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
+--------------------------------------------------------+
| fouler | fouled | fouls |
+--------------------------------------------------------+
| "Javier Mascherano" | "Robert Lewandowski" | 2 |
| "Dani Alves" | "Bastian Schweinsteiger" | 2 |
| "Xabi Alonso" | "Lionel Messi" | 2 |
| "Rafinha" | "Neymar" | 2 |
| "Rafinha" | "Andrés Iniesta" | 2 |
| "Dani Alves" | "Xabi Alonso" | 1 |
| "Thiago Alcántara" | "Javier Mascherano" | 1 |
| "Pedro" | "Juan Bernat" | 1 |
| "Medhi Benatia" | "Pedro" | 1 |
| "Neymar" | "Sebastian Rode" | 1 |
+--------------------------------------------------------+
10 rows
Which team fouled the most?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
(fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
(app)-[:FOR_TEAM]->(team)
RETURN team.name, COUNT(*) as fouls
ORDER BY fouls DESC
LIMIT 10;
+-----------------------------+
| team.name | fouls |
+-----------------------------+
| "FC Bayern München" | 18 |
| "Barcelona" | 12 |
+-----------------------------+
2 rows
Worst fouler for each team?
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)<-[:COMMITTED_FOUL]-(fouler),
(fouler)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
(app)-[:FOR_TEAM]->(team)
WITH team, fouler, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouler:fouler, fouls:fouls})[0] AS topFouler
RETURN team.name, topFouler.fouler.name, topFouler.fouls;
+---------------------------------------------------------------+
| team.name | topFouler.fouler.name | topFouler.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Rafinha" | 4 |
| "Barcelona" | "Pedro" | 3 |
+---------------------------------------------------------------+
2 rows
Most fouled against for each team
match (match:Match)<-[:COMMITTED_IN_MATCH]-(foul)-[:COMMITTED_AGAINST]-(fouled),
(fouled)-[:MADE_APPEARANCE]-(app)-[:IN_MATCH]-(match),
(app)-[:FOR_TEAM]->(team)
WITH team, fouled, COUNT(*) AS fouls
ORDER BY team.name, fouls DESC
WITH team, COLLECT({fouled:fouled, fouls:fouls})[0] AS topFouled
RETURN team.name, topFouled.fouled.name, topFouled.fouls;
+---------------------------------------------------------------+
| team.name | topFouled.fouled.name | topFouled.fouls |
+---------------------------------------------------------------+
| "FC Bayern München" | "Robert Lewandowski" | 4 |
| "Barcelona" | "Lionel Messi" | 4 |
+---------------------------------------------------------------+
2 rows
So Bayern fouled a bit more than Barca, the main forwards for each team (Messi/Lewandowski) were the most fouled players on the pitch and the fouling was mostly in the middle of the pitch.
I expect this graph will become much more interesting to query with more matches and with the other event types as well but I haven’t got those scraped yet. The code is on githubif you want to play around with it and perhaps get the other events into the graph.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.