neo4j/cypher: Getting rid of an optional match
I was looking back over some of the queries I wrote for my football data set and I came across one I’d written to work out how many goals players scored in matches that were televised.
The data model looks like this:
My initial query to work out the top 10 scorers in televised games was as follows:
MATCH (player:Player)
WITH player
MATCH player-[:played|subbed_on]->stats-[:in]->game-[t?:on_tv]->channel
WITH COLLECT({goals: stats.goals, type: TYPE(t)}) AS games, player
RETURN player.name,
REDUCE(goals = 0, h IN [g IN games WHERE g.type IS NULL] | goals + h.goals) AS nonTvGoals,
REDUCE(goals = 0, h IN [g IN games WHERE g.type <> NULL] | goals + h.goals) AS tvGoals,
REDUCE(goals = 0, h in games | goals + h.goals) AS totalGoals
ORDER BY tvGoals DESC
LIMIT 10
If we evaluate the query we see that Robin Van Persie scored most goals in televised games:
==> +--------------------------------------------------------+
==> | player.name | nonTvGoals | tvGoals | totalGoals |
==> +--------------------------------------------------------+
==> | "Robin Van Persie" | 11 | 15 | 26 |
==> | "Gareth Bale" | 8 | 13 | 21 |
==> | "Luis Suárez" | 12 | 11 | 23 |
==> | "Theo Walcott" | 5 | 9 | 14 |
==> | "Demba Ba" | 7 | 8 | 15 |
==> | "Edin Dzeko" | 7 | 7 | 14 |
==> | "Santi Cazorla" | 5 | 7 | 12 |
==> | "Juan Mata" | 6 | 6 | 12 |
==> | "Steven Gerrard" | 3 | 6 | 9 |
==> | "Carlos Tevez" | 5 | 6 | 11 |
==> +--------------------------------------------------------+
==> 10 rows
==> 520 ms
The query takes 1/2 second to evaluate even after running it a few times and since using optional patterns often leads to slow queries I thought I should try and remove the optional on_tv relationship.
I introduced it so that I could determine which matches were on TV, which then allowed me to aggregate goals scored in other games as well.
Wes helped me do this although his first suggestion was to replace the REDUCE with a SUM to simplify the query. The CASE statement makes this easier:
MATCH (player:Player)
WITH player
MATCH player-[:played|subbed_on]->stats-[:in]->game-[t?:on_tv]->channel
WITH {goals: stats.goals, type: TYPE(t)} AS game, player
RETURN player.name,
SUM(CASE WHEN game.type IS NULL THEN game.goals ELSE 0 END) as nonTvGoals,
SUM(CASE WHEN game.type <> NULL THEN game.goals ELSE 0 END) as tvGoals,
SUM(game.goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10
Our next step is to split up the finding of stats nodes (which represent a player’s performance in a match) from the finding of the TV channel that the match may have been shown on.
MATCH (player:Player)-[:played|subbed_on]->stats
WITH player, stats
MATCH stats-[:in]->game-[t?:on_tv]->channel
WITH {goals: stats.goals, type: TYPE(t)} AS game, player
RETURN player.name,
SUM(CASE WHEN game.type IS NULL THEN game.goals ELSE 0 END) as nonTvGoals,
SUM(CASE WHEN game.type <> NULL THEN game.goals ELSE 0 END) as tvGoals,
SUM(game.goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10
Finally we can make the on_tv relationship explicit and change the way we sum the goals slightly:
MATCH (player:Player)-[:played|subbed_on]->stats
WITH stats.goals AS goals, player, stats-[:in]->()-[:on_tv]-() as onTv
RETURN player.name,
SUM(CASE WHEN onTv = FALSE THEN goals ELSE 0 END) as nonTvGoals,
SUM(CASE WHEN onTv = TRUE THEN goals ELSE 0 END) as tvGoals,
SUM(goals) as allGoals
ORDER BY tvGoals DESC
LIMIT 10
If we run that a few times the query execution time has now halved:
==> +------------------------------------------------------+
==> | player.name | nonTvGoals | tvGoals | allGoals |
==> +------------------------------------------------------+
==> | "Robin Van Persie" | 11 | 15 | 26 |
==> | "Gareth Bale" | 8 | 13 | 21 |
==> | "Luis Suárez" | 12 | 11 | 23 |
==> | "Theo Walcott" | 5 | 9 | 14 |
==> | "Demba Ba" | 7 | 8 | 15 |
==> | "Santi Cazorla" | 5 | 7 | 12 |
==> | "Edin Dzeko" | 7 | 7 | 14 |
==> | "Carlos Tevez" | 5 | 6 | 11 |
==> | "Juan Mata" | 6 | 6 | 12 |
==> | "Steven Gerrard" | 3 | 6 | 9 |
==> +------------------------------------------------------+
==> 10 rows
==> 219 ms
Thanks Wes for showing me how to do this!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.