Neo4j: Exporting a subset of data from one database to another
As part of the preparation for another blog post, I wanted to export a subset of data from one Neo4j database to another one, which seemed like a blog post in its own right.
Setting up Neo4j
We’re going to use the following Docker Compose configuration in this blog post:
version: '3.7'
services:
neo4j:
image: neo4j:4.0.0-enterprise
container_name: "quickgraph-aus-open"
volumes:
- ./plugins:/plugins
- ./data:/data
- ./import:/var/lib/neo4j/import
ports:
- "7474:7474"
- "7687:7687"
environment:
- "NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
- "NEO4J_AUTH=neo4j/neo"
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_import_file_enabled=true
- NEO4J_apoc_export_file_enabled=true
- NEO4JLABS_PLUGINS=["apoc"]
Once we’ve created that file we need to open a terminal session where that file lives and then run the following command to spin up Neo4j:
docker-compose up
If we run that command, we’ll see the following (truncated) output:
Started quickgraph-aus-open ... done
Attaching to quickgraph-aus-open
quickgraph-aus-open | Changed password for user 'neo4j'.
quickgraph-aus-open | Fetching versions.json for Plugin 'apoc' from https://neo4j-contrib.github.io/neo4j-apoc-procedures/versions.json
quickgraph-aus-open | Installing Plugin 'apoc' from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.0.0.0/apoc-4.0.0.0-all.jar to /plugins/apoc.jar
quickgraph-aus-open | Applying default values for plugin apoc to neo4j.conf
quickgraph-aus-open | Directories in use:
quickgraph-aus-open | home: /var/lib/neo4j
quickgraph-aus-open | config: /var/lib/neo4j/conf
quickgraph-aus-open | logs: /logs
quickgraph-aus-open | plugins: /plugins
quickgraph-aus-open | import: /import
quickgraph-aus-open | data: /var/lib/neo4j/data
quickgraph-aus-open | certificates: /var/lib/neo4j/certificates
quickgraph-aus-open | run: /var/lib/neo4j/run
quickgraph-aus-open | Starting Neo4j.
quickgraph-aus-open | 2020-01-21 22:24:29.976+0000 INFO ======== Neo4j 4.0.0 ========
quickgraph-aus-open | 2020-01-21 22:24:29.982+0000 INFO Starting...
quickgraph-aus-open | 2020-01-21 22:24:35.656+0000 INFO Called db.clearQueryCaches(): Query cache already empty.
quickgraph-aus-open | 2020-01-21 22:24:40.765+0000 INFO Sending metrics to CSV file at /var/lib/neo4j/metrics
quickgraph-aus-open | 2020-01-21 22:24:40.790+0000 INFO Bolt enabled on 0.0.0.0:7687.
quickgraph-aus-open | 2020-01-21 22:24:40.791+0000 INFO Started.
quickgraph-aus-open | 2020-01-21 22:24:40.879+0000 INFO Server thread metrics have been registered successfully
quickgraph-aus-open | 2020-01-21 22:24:41.723+0000 INFO Remote interface available at http://0.0.0.0:7474/
Once we see that last line we’re ready to roll.
Exporting data
In the Australian Open QuickGraph blog post we imported all the matches from the Australian Open tennis tournament. We can see a sample of the imported graph in the Neo4j Browser visualisation below:
Let’s quickly check how much data this graph contains. We’ll query the database via the Cypher Shell command, which we can launch by running the following:
$ docker exec -it quickgraph-aus-open cypher-shell -u neo4j -p neo -d womens
Connected to Neo4j 4.0.0 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@womens>
We’ll use APOC’s apoc.meta.stats
procedure to get back a summary of what’s in this database:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount
nodeCount | relCount | labels | relTypesCount |
---|---|---|---|
9013 |
27727 |
{Player: 571, Set: 5882, Tournament: 20, Match: 2540} |
{WINNER: 8383, IN_TOURNAMENT: 2540, NEXT_MATCH: 2520, NEXT_TOURNAMENT: 19, LOSER: 8383, IN_MATCH: 5882} |
There’s not much data in there, but we only want to export the Tournament
nodes, NEXT_TOURNAMENT
relationships, Match
nodes for the finals, and the Player
nodes that competed in those finals.
We’re going to export the data with a little help from APOC's export to JSON procedures.
P.S. 👇
Finally finished refactoring the #Neo4j #APOC export documentation - https://t.co/RXTjV9n5rV
— Mark Needham (@markhneedham) January 24, 2020
Let us know what's missing/what else can be improved!
When we’re using the export procedures, we need to specify some properties that enable this functionality. In our Dockerfile, we have the following lines:
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_export_file_enabled=true
These properties are the equivalent of adding the following values to the apoc.conf
file:
apoc.import.file.use_neo4j_config=true
apoc.export.file.enabled=true
-
apoc.export.file.enabled=true
enables writing export files to disk -
apoc.import.file.use_neo4j_config=true
writes those export files to theimport
directory
We’ll export the data in two parts using the apoc.export.json.query
procedure, writing:
-
one query to export the
Tournament
nodes and theNEXT_TOURNAMENT
relationships between them -
one query to export the
Match
nodes whereround: "F"
and the associatedPlayer
andTournament
nodes and any relationships
tournaments.json
WITH "MATCH path = (t1:Tournament)-[rel:NEXT_TOURNAMENT]->(t2)
RETURN t1, t2, rel" AS query
CALL apoc.export.json.query(query, "tournaments.json", {format: "plain"})
YIELD file
RETURN file
The contents of tournaments.json
, truncated for brevity, are shown below:
{"t1":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"t2":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"rel":{"id":"10266","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"256","labels":["Tournament"]},"end":{"id":"257","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"t2":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"rel":{"id":"10267","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"257","labels":["Tournament"]},"end":{"id":"258","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"t2":{"type":"node","id":"259","labels":["Tournament"],"properties":{"year":2003,"name":"Australian Open"}},"rel":{"id":"10268","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"258","labels":["Tournament"]},"end":{"id":"259","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"265","labels":["Tournament"],"properties":{"year":2009,"name":"Australian Open"}},"t2":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"rel":{"id":"10275","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"265","labels":["Tournament"]},"end":{"id":"266","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"t2":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"rel":{"id":"10276","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"266","labels":["Tournament"]},"end":{"id":"267","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"t2":{"type":"node","id":"268","labels":["Tournament"],"properties":{"year":2012,"name":"Australian Open"}},"rel":{"id":"10277","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"267","labels":["Tournament"]},"end":{"id":"268","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"t2":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"rel":{"id":"10284","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"274","labels":["Tournament"]},"end":{"id":"0","labels":["Tournament"]}}}
finalists.json
WITH "MATCH (t:Tournament)<-[tournRel:IN_TOURNAMENT]-(match:Match {round: 'F'})<-[winnerRel:WINNER]-(winner),
(loser)-[loserRel:LOSER]->(match)
RETURN t, tournRel, winner, winnerRel, loser, loserRel, match" AS query
CALL apoc.export.json.query(query, "finalists.json", {format: "plain"})
YIELD file
RETURN file
The contents of finalists.json
, truncated for brevity, are shown below:
{"t":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"tournRel":{"id":"380","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"191","labels":["Match"]},"end":{"id":"0","labels":["Tournament"]}},"winner":{"type":"node","id":"34","labels":["Player"],"properties":{"name":"Naomi Osaka","id":"211768"}},"winnerRel":{"id":"378","type":"relationship","label":"WINNER","start":{"id":"34","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"loser":{"type":"node","id":"66","labels":["Player"],"properties":{"name":"Petra Kvitova","id":"201520"}},"loserRel":{"id":"379","type":"relationship","label":"LOSER","start":{"id":"66","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"match":{"type":"node","id":"191","labels":["Match"],"properties":{"score":"7-6(2) 5-7 6-4","round":"F","id":"20190114_226"}}}
{"t":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"tournRel":{"id":"887","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"464","labels":["Match"]},"end":{"id":"256","labels":["Tournament"]}},"winner":{"type":"node","id":"401","labels":["Player"],"properties":{"name":"Lindsay Davenport","id":"200128"}},"winnerRel":{"id":"885","type":"relationship","label":"WINNER","start":{"id":"401","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"loser":{"type":"node","id":"276","labels":["Player"],"properties":{"name":"Martina Hingis","id":"200001"}},"loserRel":{"id":"886","type":"relationship","label":"LOSER","start":{"id":"276","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"match":{"type":"node","id":"464","labels":["Match"],"properties":{"score":"6-1 7-5","round":"F","id":"20000117_127"}}}
...
{"t":{"type":"node","id":"260","labels":["Tournament"],"properties":{"year":2004,"name":"Australian Open"}},"tournRel":{"id":"2405","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"1054","labels":["Match"]},"end":{"id":"260","labels":["Tournament"]}},"winner":{"type":"node","id":"278","labels":["Player"],"properties":{"name":"Justine Henin","id":"200003"}},"winnerRel":{"id":"2403","type":"relationship","label":"WINNER","start":{"id":"278","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"loser":{"type":"node","id":"547","labels":["Player"],"properties":{"name":"Kim Clijsters","id":"200079"}},"loserRel":{"id":"2404","type":"relationship","label":"LOSER","start":{"id":"547","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"match":{"type":"node","id":"1054","labels":["Match"],"properties":{"score":"6-3 4-6 6-3","round":"F","id":"20040119_127"}}}
...
{"t":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"tournRel":{"id":"7745","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"2984","labels":["Match"]},"end":{"id":"274","labels":["Tournament"]}},"winner":{"type":"node","id":"96","labels":["Player"],"properties":{"name":"Caroline Wozniacki","id":"201496"}},"winnerRel":{"id":"7743","type":"relationship","label":"WINNER","start":{"id":"96","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"loser":{"type":"node","id":"2","labels":["Player"],"properties":{"name":"Simona Halep","id":"201594"}},"loserRel":{"id":"7744","type":"relationship","label":"LOSER","start":{"id":"2","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"match":{"type":"node","id":"2984","labels":["Match"],"properties":{"score":"7-6(2) 3-6 6-4","round":"F","id":"20180115_701"}}}
Importing data
We’re now ready to import the data into a new database. Let’s first create a new database and start using it:
:use system;
CREATE DATABASE blog;
:use blog
Before we import any data, let’s create the constraints that we had on our original database:
CREATE CONSTRAINT ON (p:Player)
ASSERT p.id IS UNIQUE;
CREATE CONSTRAINT ON (m:Match)
ASSERT m.id IS UNIQUE;
CREATE CONSTRAINT ON (t:Tournament)
ASSERT (t.name, t.year) IS NODE KEY;
We can check that those have been created by running the following command:
CALL db.indexes();
id | name | state | populationPercent | uniqueness | type | entityType | labelsOrTypes | properties | provider |
---|---|---|---|---|---|---|---|---|---|
5 |
"constraint_989cb13a" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Tournament"] |
["name", "year"] |
"native-btree-1.0" |
3 |
"constraint_a016a763" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Match"] |
["id"] |
"native-btree-1.0" |
1 |
"constraint_cbe92269" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Player"] |
["id"] |
"native-btree-1.0" |
Alright, it’s time to import the data!
We’re going to use the apoc.load.json
procedure to do this, and we’ll start with tournaments.json
.
When we’re using the import procedures, we need to specify some properties that enable importing from local files. In our Dockerfile, we have the following lines:
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_import_file_enabled=true
These properties are the equivalent of adding the following values to the apoc.conf
file:
apoc.import.file.use_neo4j_config=true
apoc.import.file.enabled=true
-
apoc.import.file.enabled=true
enables reading files to disk -
apoc.import.file.use_neo4j_config=true
reads those files from theimport
directory
Let’s quickly run an exploratory query to remind ourselves the data that we’ll be working with:
CALL apoc.load.json("tournaments.json")
YIELD value
RETURN value.t1, value.t2, value.rel
LIMIT 1;
value.t1 | value.t2 | value.rel |
---|---|---|
{id: "256", type: "node", properties: {name: "Australian Open", year: 2000}, labels: ["Tournament"]} |
{id: "257", type: "node", properties: {name: "Australian Open", year: 2001}, labels: ["Tournament"]} |
{start: {id: "256", labels: ["Tournament"]}, end: {id: "257", labels: ["Tournament"]}, id: "10266", label: "NEXT_TOURNAMENT", type: "relationship"} |
where:
-
t1
andt2
representTournament
nodes -
rel
represents theNEXT_TOURNAMENT
relationship
We can recreate the original graph structure by running the following query:
CALL apoc.load.json("tournaments.json") YIELD value
CALL apoc.merge.node(value.t1.labels, value.t1.properties) YIELD node AS t1
CALL apoc.merge.node(value.t2.labels, value.t2.properties) YIELD node AS t2
CALL apoc.merge.relationship(t1, value.rel.label, {}, {}, t2, {}) YIELD rel
RETURN count(*);
We’re using the apoc.merge.node
and apoc.merge.relationship
procedures to create the nodes and relationships.
These procedures are the equivalent of Cypher’s MERGE
clause, and come in handy when we’re importing data from a dynamic source
e.g. a JSON file
count(*) |
---|
19 |
Everything looks good, but let’s quickly check the contents of our database by running the apoc.meta.stats
procedure that we ran earlier:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
nodeCount | relCount | labels | relTypesCount |
---|---|---|---|
20 |
19 |
{Tournament: 20} |
{NEXT_TOURNAMENT: 19} |
So we’ve got 20 nodes representing the 20 tournaments from 2000 to 2019 and 19 relationships between them. So far so good.
Time to import finalists.json
.
We’ll start with an exploratory query:
CALL apoc.load.json("finalists.json")
YIELD value
RETURN value.t, value.tournRel,
value.winner, value.winnerRel,
value.loser, value.loserRel,
value.match
LIMIT 1;
value.t | value.tournRel | value.winner | value.winnerRel | value.loser | value.loserRel | value.match |
---|---|---|---|---|---|---|
{id: "0", type: "node", properties: {name: "Australian Open", year: 2019}, labels: ["Tournament"]} |
{start: {id: "191", labels: ["Match"]}, end: {id: "0", labels: ["Tournament"]}, id: "380", label: "IN_TOURNAMENT", type: "relationship"} |
{id: "34", type: "node", properties: {name: "Naomi Osaka", id: "211768"}, labels: ["Player"]} |
{start: {id: "34", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "378", label: "WINNER", type: "relationship"} |
{id: "66", type: "node", properties: {name: "Petra Kvitova", id: "201520"}, labels: ["Player"]} |
{start: {id: "66", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "379", label: "LOSER", type: "relationship"} |
{id: "191", type: "node", properties: {score: "7-6(2) 5-7 6-4", id: "20190114_226", round: "F"}, labels: ["Match"]} |
where:
-
t
represents theTournament
node -
winner
andloser
representPlayer
nodes -
match
represents theMatch
node -
winnerRel
represents theWINNER
relationship between aPlayer
and aMatch
-
loserRel
represents theLOSER
relationship between aPlayer
and aMatch
-
tournRel
represents theIN_TOURNAMENT
relationship between aMatch
and aTournament
Let’s now run the following query to import our matches:
CALL apoc.load.json("finalists.json") yield value
CALL apoc.merge.node(value.winner.labels, value.winner.properties) YIELD node AS winner
CALL apoc.merge.node(value.loser.labels, value.loser.properties) YIELD node AS loser
CALL apoc.merge.node(value.match.labels, value.match.properties) YIELD node AS match
CALL apoc.merge.node(value.t.labels, value.t.properties) YIELD node AS tournament
CALL apoc.merge.relationship(winner, value.winnerRel.label, {}, {}, match, {}) YIELD rel AS winnerRel
CALL apoc.merge.relationship(loser, value.loserRel.label, {}, {}, match, {}) YIELD rel AS loserRel
CALL apoc.merge.relationship(match, value.tournRel.label, {}, {}, tournament, {}) YIELD rel AS tournRel
return count(*);
count(*) |
---|
20 |
Let’s do one last call to apoc.meta.stats
:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
nodeCount | relCount | labels | relTypesCount |
---|---|---|---|
59 |
79 |
{Player: 19, Tournament: 20, Match: 20} |
{WINNER: 20, LOSER: 20, IN_TOURNAMENT: 20, NEXT_TOURNAMENT: 19} |
And finally, let’s run one of the queries from the original QuickGraph post to check that everything’s in order:
MATCH (winner:Player)-[:WINNER]->(match:Match {round: "F"})<-[:LOSER]-(loser),
(match)-[:IN_TOURNAMENT]->(tournament)
RETURN tournament.year AS year, winner.name AS winner,
loser.name AS loser, match.score AS score
ORDER BY tournament.year
year | winner | loser | score |
---|---|---|---|
2000 |
"Lindsay Davenport" |
"Martina Hingis" |
"6-1 7-5" |
2001 |
"Jennifer Capriati" |
"Martina Hingis" |
"6-4 6-3" |
2002 |
"Jennifer Capriati" |
"Martina Hingis" |
"4-6 7-6(7) 6-2" |
2003 |
"Serena Williams" |
"Venus Williams" |
"7-6(4) 3-6 6-4" |
2004 |
"Justine Henin" |
"Kim Clijsters" |
"6-3 4-6 6-3" |
2005 |
"Serena Williams" |
"Lindsay Davenport" |
"2-6 6-3 6-0" |
2006 |
"Amelie Mauresmo" |
"Justine Henin" |
"6-1 2-0 RET" |
2007 |
"Serena Williams" |
"Maria Sharapova" |
"6-1 6-2" |
2008 |
"Maria Sharapova" |
"Ana Ivanovic" |
"7-5 6-3" |
2009 |
"Serena Williams" |
"Dinara Safina" |
"6-0 6-3" |
2010 |
"Serena Williams" |
"Justine Henin" |
"6-4 3-6 6-2" |
2011 |
"Kim Clijsters" |
"Na Li" |
"3-6 6-3 6-3" |
2012 |
"Victoria Azarenka" |
"Maria Sharapova" |
"6-3 6-0" |
2013 |
"Victoria Azarenka" |
"Na Li" |
"4-6 6-4 6-3" |
2014 |
"Na Li" |
"Dominika Cibulkova" |
"7-6(3) 6-0" |
2015 |
"Serena Williams" |
"Maria Sharapova" |
"6-3 7-6(5)" |
2016 |
"Angelique Kerber" |
"Serena Williams" |
"6-4 3-6 6-4" |
2017 |
"Serena Williams" |
"Venus Williams" |
"6-4 6-4" |
2018 |
"Caroline Wozniacki" |
"Simona Halep" |
"7-6(2) 3-6 6-4" |
2019 |
"Naomi Osaka" |
"Petra Kvitova" |
"7-6(2) 5-7 6-4" |
Same results as before, which I think classifies as a #win!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.