Neo4j: Cross database querying with Neo4j Fabric
A couple of weeks ago I wrote a QuickGraph blog post about the Australian Open, in which I showed how to use Neo4j 4.0’s multi database feature.
In that post we focused on queries that could be run on one database, but the 4.0 release also contains another feature for doing cross database querying - Neo4j Fabric - and we’re going to learn how to use that in this post.
What is Fabric?
To quote from the source:
Fabric, introduced in Neo4j 4.0, is a way to store and retrieve data in multiple databases, whether they are on the same Neo4j DBMS or in multiple DBMSs, using a single Cypher query.
https://neo4j.com/docs/operations-manual/current/fabric/introduction/
That sounds like what we have. It goes on to say:
In practical terms, Fabric provides the infrastructure and tooling for:
Data Federation: the ability to access data available in distributed sources in the form of disjointed graphs.
Data Sharding: the ability to access data available in distributed sources in the form of a common graph partitioned on multiple databases.
Data Federation sounds closer to what we have in our Australian Open Graph. The nodes representing the tournaments are the same, but the player, matches, and all other relationships are completely different.
A fabric database can be setup on a Neo4j instance of its own, or alongside other databases.
The only requirement is that it must be on a Neo4j instance where dbms.mode=SINGLE
.
Setting up Neo4j
We’re going to use the following Docker Compose configuration in this blog post:
version: '3.7'
services:
neo4j:
image: neo4j:4.0.0-enterprise
container_name: "quickgraph-aus-open"
volumes:
- ./plugins:/plugins
- ./data:/data
- ./import:/var/lib/neo4j/import
ports:
- "7474:7474"
- "7687:7687"
environment:
- "NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
- "NEO4J_AUTH=neo4j/neo"
- NEO4J_fabric_database_name=fabric
- NEO4J_fabric_graph_0_name=mens
- NEO4J_fabric_graph_0_uri=neo4j://localhost:7687
- NEO4J_fabric_graph_0_database=mens
- NEO4J_fabric_graph_1_name=womens
- NEO4J_fabric_graph_1_uri=neo4j://localhost:7687
- NEO4J_fabric_graph_1_database=womens
The Fabric specific config begins with the line NEO4J_fabric_database_name
, which gives a name for our fabric database.
We’ll use that later on.
We also provide details of the graphs that we’d like to query via Fabric:
-
NEO4J_fabric_graph_<n>_name
specifies the name that we’ll use when writing our Fabric queries. -
NEO4J_fabric_graph_<n>_database
specifies the name of the Neo4j database against which we’d like to execute queries.
In our case, both these values are the same.
Querying across multiple databases
Let’s write some queries!
Who didn’t drop a set in the whole tournament?
If we wanted to find out which players didn’t drop a set in the whole tournament, we could write the following query:
:use mens;
MATCH (winner:Player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)
MATCH (winner)-[:WINNER]->(match)-[:IN_TOURNAMENT]->(t)
WITH winner, match, t
ORDER BY t.year
WITH winner, t,
collect([(match)<-[:IN_MATCH]-(set:Set)
WHERE (winner)-[:LOSER]->(set) | set
][0]) AS setDropped
where size(setDropped) = 0
RETURN winner, t;
winner | t |
---|---|
(:Player {name: "Roger Federer", id: "103819"}) |
(:Tournament {name: "Australian Open", year: 2007}) |
And we could do the same for the women’s tournament as well by switching database:
:use womens;
MATCH (winner:Player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)
MATCH (winner)-[:WINNER]->(match)-[:IN_TOURNAMENT]->(t)
WITH winner, match, t
ORDER BY t.year
WITH winner, t,
collect([(match)<-[:IN_MATCH]-(set:Set)
WHERE (winner)-[:LOSER]->(set) | set
][0]) AS setDropped
where size(setDropped) = 0
RETURN winner, t;
winner | t |
---|---|
(:Player {name: "Lindsay Davenport", id: "200128"}) |
(:Tournament {name: "Australian Open", year: 2000}) |
(:Player {name: "Maria Sharapova", id: "201345"}) |
(:Tournament {name: "Australian Open", year: 2008}) |
(:Player {name: "Serena Williams", id: "200033"}) |
(:Tournament {name: "Australian Open", year: 2017}) |
Fabric allows us to combine the results of these queries. Let’s switch to the Fabric database:
:use fabric;
Once we’ve done this we could write the following query to return all winners of the Australian Open who didn’t drop a set:
USE fabric.mens
MATCH (winner:Player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)
MATCH (winner)-[:WINNER]->(match)-[:IN_TOURNAMENT]->(t)
WITH winner, match, t
ORDER BY t.year
WITH winner, t,
collect([(match)<-[:IN_MATCH]-(set:Set)
WHERE (winner)-[:LOSER]->(set) | set
][0]) AS setDropped
where size(setDropped) = 0
RETURN winner.name AS winner, t.year AS year
UNION ALL
USE fabric.womens
MATCH (winner:Player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)
MATCH (winner)-[:WINNER]->(match)-[:IN_TOURNAMENT]->(t)
WITH winner, match, t
ORDER BY t.year
WITH winner, t,
collect([(match)<-[:IN_MATCH]-(set:Set)
WHERE (winner)-[:LOSER]->(set) | set
][0]) AS setDropped
where size(setDropped) = 0
RETURN winner.name AS winner, t.year AS year;
winner | year |
---|---|
"Lindsay Davenport" |
2000 |
"Maria Sharapova" |
2008 |
"Serena Williams" |
2017 |
"Roger Federer" |
2007 |
I’m not sure why the year hasn’t ordered properly here, perhaps I’ve made a mistake somewhere.
It’s also a bit annoying having to repeat the query twice though, so an alternative is to use Fabric’s <fabric database name>.graphIds
function in combinatin with the CALL sub query syntax.
That gives us the following:
WITH ["Men's", "Women's"] AS tournaments
UNWIND fabric.graphIds() AS graphId
CALL {
USE fabric.graph(graphId)
MATCH (winner:Player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)
MATCH (winner)-[:WINNER]->(match)-[:IN_TOURNAMENT]->(t)
WITH winner, match, t
ORDER BY t.year
WITH winner, t,
collect([(match)<-[:IN_MATCH]-(set:Set)
WHERE (winner)-[:LOSER]->(set) | set
][0]) AS setDropped
where size(setDropped) = 0
RETURN winner, t
}
RETURN tournaments[graphId] AS event, winner.name AS winner, t.year AS year
ORDER BY t.year
event | winner | year |
---|---|---|
"Women’s" |
"Lindsay Davenport" |
2000 |
"Men’s" |
"Roger Federer" |
2007 |
"Women’s" |
"Maria Sharapova" |
2008 |
"Women’s" |
"Serena Williams" |
2017 |
Sweet! The data’s sorted correctly and we only had to specify the main part of the query once.
How long did players wait from their first final defeat until their first win?
Let’s use Fabric to look at one more query.
In the initial blog post we wrote a query to find out how long it took from a player’s first final defeat until their first win. While writing a blog post about longest path queries I realised that the query was incorrect as it didn’t filter out players who had won the final before losing it, and it also returned each final a player had lost before finally winning, rather than just the first one.
The following query finds the longest wait for players from their first final defeat until their first win:
WITH ["Men's", "Women's"] AS tournaments
UNWIND fabric.graphIds() AS graphId
CALL {
USE fabric.graph(graphId)
MATCH (player)-[:LOSER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t)-[:NEXT_TOURNAMENT*]->(t2),
(player)-[:WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]->(t2)
// Exclude paths where the player has been in the final of an earlier tournament
WHERE not ((player)-[:LOSER|WINNER]->(:Match {round: "F"})-[:IN_TOURNAMENT]
->()-[:NEXT_TOURNAMENT*]->(t))
RETURN player, t, t2
}
RETURN player.name, t.year, t2.year, t2.year - t.year AS difference
ORDER BY difference DESC
And if we run that query, we’ll see the following results:
player.name | t.year | t2.year | difference |
---|---|---|---|
"Kim Clijsters" |
2004 |
2011 |
7 |
"Marat Safin" |
2002 |
2005 |
3 |
"Na Li" |
2011 |
2014 |
3 |
"Maria Sharapova" |
2007 |
2008 |
1 |
So Clijsters had the longest wait, although there are many other players who are still waiting!
That’s all for now, but if you want to learn more about Neo4j Fabric, we’ve written a developer guide showing how to use it against a retail dataset. And don’t forget the docs!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.