Neo4j 2.1: Passing around node ids vs UNWIND
When Neo4j 2.1 is released we’ll have the UNWIND clause which makes working with collections of things easier.
In my blog post about creating adjacency matrices we wanted to show how many people were members of the first 5 meetup groups ordered alphabetically and then check how many were members of each of the other groups.
Without the UNWIND clause we’d have to do this:
MATCH (g:Group)
WITH g
ORDER BY g.name
LIMIT 5
WITH COLLECT(id(g)) AS groups
MATCH (g1) WHERE id(g1) IN groups
MATCH (g2) WHERE id(g2) IN groups
OPTIONAL MATCH path = (g1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(g2)
RETURN g1.name, g2.name, CASE WHEN path is null THEN 0 ELSE COUNT(path) END AS overlap
Here we get the first 5 groups, put their IDs into a collection and then create a cartesian product of groups by doing back to back MATCH’s with a node id lookup.
If instead of passing around node ids in 'groups' we pass around nodes and then used those in the MATCH step we’d end up doing a full node scan which becomes very slow as the store grows.
e.g. this version would be very slow
MATCH (g:Group)
WITH g
ORDER BY g.name
LIMIT 5
WITH COLLECT(g) AS groups
MATCH (g1) WHERE g1 IN groups
MATCH (g2) WHERE g2 IN groups
OPTIONAL MATCH path = (g1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(g2)
RETURN g1.name, g2.name, CASE WHEN path is null THEN 0 ELSE COUNT(path) END AS overlap
This is the output from the original query:
+-------------------------------------------------------------------------------------------------------------+
| g1.name | g2.name | overlap |
+-------------------------------------------------------------------------------------------------------------+
| "Big Data Developers in London" | "Big Data / Data Science / Data Analytics Jobs" | 17 |
| "Big Data Jobs in London" | "Big Data London" | 190 |
| "Big Data London" | "Big Data Developers in London" | 244 |
| "Cassandra London" | "Big Data / Data Science / Data Analytics Jobs" | 16 |
| "Big Data Jobs in London" | "Big Data Developers in London" | 52 |
| "Cassandra London" | "Cassandra London" | 0 |
| "Big Data London" | "Big Data / Data Science / Data Analytics Jobs" | 36 |
| "Big Data London" | "Cassandra London" | 422 |
| "Big Data Jobs in London" | "Big Data Jobs in London" | 0 |
| "Big Data / Data Science / Data Analytics Jobs" | "Big Data / Data Science / Data Analytics Jobs" | 0 |
| "Big Data Jobs in London" | "Cassandra London" | 74 |
| "Big Data Developers in London" | "Big Data London" | 244 |
| "Cassandra London" | "Big Data Jobs in London" | 74 |
| "Cassandra London" | "Big Data London" | 422 |
| "Big Data / Data Science / Data Analytics Jobs" | "Big Data London" | 36 |
| "Big Data Jobs in London" | "Big Data / Data Science / Data Analytics Jobs" | 20 |
| "Big Data Developers in London" | "Big Data Jobs in London" | 52 |
| "Cassandra London" | "Big Data Developers in London" | 69 |
| "Big Data / Data Science / Data Analytics Jobs" | "Big Data Jobs in London" | 20 |
| "Big Data Developers in London" | "Big Data Developers in London" | 0 |
| "Big Data Developers in London" | "Cassandra London" | 69 |
| "Big Data / Data Science / Data Analytics Jobs" | "Big Data Developers in London" | 17 |
| "Big Data London" | "Big Data Jobs in London" | 190 |
| "Big Data / Data Science / Data Analytics Jobs" | "Cassandra London" | 16 |
| "Big Data London" | "Big Data London" | 0 |
+-------------------------------------------------------------------------------------------------------------+
25 rows
If we use UNWIND we don’t need to pass around node ids anymore, instead we can collect up the nodes into a collection and then explode them out into a cartesian product:
MATCH (g:Group)
WITH g
ORDER BY g.name
LIMIT 5
WITH COLLECT(g) AS groups
UNWIND groups AS g1
UNWIND groups AS g2
OPTIONAL MATCH path = (g1)<-[:MEMBER_OF]-()-[:MEMBER_OF]->(g2)
RETURN g1.name, g2.name, CASE WHEN path is null THEN 0 ELSE COUNT(path) END AS overlap
There’s not significantly less code but I think the intent of the query is a bit clearer using UNWIND.
I’m looking forward to seeing the innovative uses of UNWIND people come up with once 2.1 is GA.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.