Neo4j: Exporting a subset of data from one database to another
As part of the preparation for another blog post, I wanted to export a subset of data from one Neo4j database to another one, which seemed like a blog post in its own right.
Setting up Neo4j
We’re going to use the following Docker Compose configuration in this blog post:
version: '3.7'
services:
neo4j:
image: neo4j:4.0.0-enterprise
container_name: "quickgraph-aus-open"
volumes:
- ./plugins:/plugins
- ./data:/data
- ./import:/var/lib/neo4j/import
ports:
- "7474:7474"
- "7687:7687"
environment:
- "NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
- "NEO4J_AUTH=neo4j/neo"
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_import_file_enabled=true
- NEO4J_apoc_export_file_enabled=true
- NEO4JLABS_PLUGINS=["apoc"]
Once we’ve created that file we need to open a terminal session where that file lives and then run the following command to spin up Neo4j:
docker-compose up
If we run that command, we’ll see the following (truncated) output:
Started quickgraph-aus-open ... done
Attaching to quickgraph-aus-open
quickgraph-aus-open | Changed password for user 'neo4j'.
quickgraph-aus-open | Fetching versions.json for Plugin 'apoc' from https://neo4j-contrib.github.io/neo4j-apoc-procedures/versions.json
quickgraph-aus-open | Installing Plugin 'apoc' from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.0.0.0/apoc-4.0.0.0-all.jar to /plugins/apoc.jar
quickgraph-aus-open | Applying default values for plugin apoc to neo4j.conf
quickgraph-aus-open | Directories in use:
quickgraph-aus-open | home: /var/lib/neo4j
quickgraph-aus-open | config: /var/lib/neo4j/conf
quickgraph-aus-open | logs: /logs
quickgraph-aus-open | plugins: /plugins
quickgraph-aus-open | import: /import
quickgraph-aus-open | data: /var/lib/neo4j/data
quickgraph-aus-open | certificates: /var/lib/neo4j/certificates
quickgraph-aus-open | run: /var/lib/neo4j/run
quickgraph-aus-open | Starting Neo4j.
quickgraph-aus-open | 2020-01-21 22:24:29.976+0000 INFO ======== Neo4j 4.0.0 ========
quickgraph-aus-open | 2020-01-21 22:24:29.982+0000 INFO Starting...
quickgraph-aus-open | 2020-01-21 22:24:35.656+0000 INFO Called db.clearQueryCaches(): Query cache already empty.
quickgraph-aus-open | 2020-01-21 22:24:40.765+0000 INFO Sending metrics to CSV file at /var/lib/neo4j/metrics
quickgraph-aus-open | 2020-01-21 22:24:40.790+0000 INFO Bolt enabled on 0.0.0.0:7687.
quickgraph-aus-open | 2020-01-21 22:24:40.791+0000 INFO Started.
quickgraph-aus-open | 2020-01-21 22:24:40.879+0000 INFO Server thread metrics have been registered successfully
quickgraph-aus-open | 2020-01-21 22:24:41.723+0000 INFO Remote interface available at http://0.0.0.0:7474/
Once we see that last line we’re ready to roll.
Exporting data
In the Australian Open QuickGraph blog post we imported all the matches from the Australian Open tennis tournament. We can see a sample of the imported graph in the Neo4j Browser visualisation below:
Let’s quickly check how much data this graph contains. We’ll query the database via the Cypher Shell command, which we can launch by running the following:
$ docker exec -it quickgraph-aus-open cypher-shell -u neo4j -p neo -d womens
Connected to Neo4j 4.0.0 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@womens>
We’ll use APOC’s apoc.meta.stats procedure to get back a summary of what’s in this database:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount
| nodeCount | relCount | labels | relTypesCount |
|---|---|---|---|
9013 |
27727 |
{Player: 571, Set: 5882, Tournament: 20, Match: 2540} |
{WINNER: 8383, IN_TOURNAMENT: 2540, NEXT_MATCH: 2520, NEXT_TOURNAMENT: 19, LOSER: 8383, IN_MATCH: 5882} |
There’s not much data in there, but we only want to export the Tournament nodes, NEXT_TOURNAMENT relationships, Match nodes for the finals, and the Player nodes that competed in those finals.
We’re going to export the data with a little help from APOC's export to JSON procedures.
P.S. 👇
Finally finished refactoring the #Neo4j #APOC export documentation - https://t.co/RXTjV9n5rV
— Mark Needham (@markhneedham) January 24, 2020
Let us know what's missing/what else can be improved!
When we’re using the export procedures, we need to specify some properties that enable this functionality. In our Dockerfile, we have the following lines:
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_export_file_enabled=true
These properties are the equivalent of adding the following values to the apoc.conf file:
apoc.import.file.use_neo4j_config=true
apoc.export.file.enabled=true
-
apoc.export.file.enabled=trueenables writing export files to disk -
apoc.import.file.use_neo4j_config=truewrites those export files to theimportdirectory
We’ll export the data in two parts using the apoc.export.json.query procedure, writing:
-
one query to export the
Tournamentnodes and theNEXT_TOURNAMENTrelationships between them -
one query to export the
Matchnodes whereround: "F"and the associatedPlayerandTournamentnodes and any relationships
tournaments.jsonWITH "MATCH path = (t1:Tournament)-[rel:NEXT_TOURNAMENT]->(t2)
RETURN t1, t2, rel" AS query
CALL apoc.export.json.query(query, "tournaments.json", {format: "plain"})
YIELD file
RETURN file
The contents of tournaments.json, truncated for brevity, are shown below:
{"t1":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"t2":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"rel":{"id":"10266","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"256","labels":["Tournament"]},"end":{"id":"257","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"t2":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"rel":{"id":"10267","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"257","labels":["Tournament"]},"end":{"id":"258","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"t2":{"type":"node","id":"259","labels":["Tournament"],"properties":{"year":2003,"name":"Australian Open"}},"rel":{"id":"10268","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"258","labels":["Tournament"]},"end":{"id":"259","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"265","labels":["Tournament"],"properties":{"year":2009,"name":"Australian Open"}},"t2":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"rel":{"id":"10275","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"265","labels":["Tournament"]},"end":{"id":"266","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"t2":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"rel":{"id":"10276","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"266","labels":["Tournament"]},"end":{"id":"267","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"t2":{"type":"node","id":"268","labels":["Tournament"],"properties":{"year":2012,"name":"Australian Open"}},"rel":{"id":"10277","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"267","labels":["Tournament"]},"end":{"id":"268","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"t2":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"rel":{"id":"10284","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"274","labels":["Tournament"]},"end":{"id":"0","labels":["Tournament"]}}}
finalists.jsonWITH "MATCH (t:Tournament)<-[tournRel:IN_TOURNAMENT]-(match:Match {round: 'F'})<-[winnerRel:WINNER]-(winner),
(loser)-[loserRel:LOSER]->(match)
RETURN t, tournRel, winner, winnerRel, loser, loserRel, match" AS query
CALL apoc.export.json.query(query, "finalists.json", {format: "plain"})
YIELD file
RETURN file
The contents of finalists.json, truncated for brevity, are shown below:
{"t":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"tournRel":{"id":"380","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"191","labels":["Match"]},"end":{"id":"0","labels":["Tournament"]}},"winner":{"type":"node","id":"34","labels":["Player"],"properties":{"name":"Naomi Osaka","id":"211768"}},"winnerRel":{"id":"378","type":"relationship","label":"WINNER","start":{"id":"34","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"loser":{"type":"node","id":"66","labels":["Player"],"properties":{"name":"Petra Kvitova","id":"201520"}},"loserRel":{"id":"379","type":"relationship","label":"LOSER","start":{"id":"66","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"match":{"type":"node","id":"191","labels":["Match"],"properties":{"score":"7-6(2) 5-7 6-4","round":"F","id":"20190114_226"}}}
{"t":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"tournRel":{"id":"887","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"464","labels":["Match"]},"end":{"id":"256","labels":["Tournament"]}},"winner":{"type":"node","id":"401","labels":["Player"],"properties":{"name":"Lindsay Davenport","id":"200128"}},"winnerRel":{"id":"885","type":"relationship","label":"WINNER","start":{"id":"401","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"loser":{"type":"node","id":"276","labels":["Player"],"properties":{"name":"Martina Hingis","id":"200001"}},"loserRel":{"id":"886","type":"relationship","label":"LOSER","start":{"id":"276","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"match":{"type":"node","id":"464","labels":["Match"],"properties":{"score":"6-1 7-5","round":"F","id":"20000117_127"}}}
...
{"t":{"type":"node","id":"260","labels":["Tournament"],"properties":{"year":2004,"name":"Australian Open"}},"tournRel":{"id":"2405","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"1054","labels":["Match"]},"end":{"id":"260","labels":["Tournament"]}},"winner":{"type":"node","id":"278","labels":["Player"],"properties":{"name":"Justine Henin","id":"200003"}},"winnerRel":{"id":"2403","type":"relationship","label":"WINNER","start":{"id":"278","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"loser":{"type":"node","id":"547","labels":["Player"],"properties":{"name":"Kim Clijsters","id":"200079"}},"loserRel":{"id":"2404","type":"relationship","label":"LOSER","start":{"id":"547","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"match":{"type":"node","id":"1054","labels":["Match"],"properties":{"score":"6-3 4-6 6-3","round":"F","id":"20040119_127"}}}
...
{"t":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"tournRel":{"id":"7745","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"2984","labels":["Match"]},"end":{"id":"274","labels":["Tournament"]}},"winner":{"type":"node","id":"96","labels":["Player"],"properties":{"name":"Caroline Wozniacki","id":"201496"}},"winnerRel":{"id":"7743","type":"relationship","label":"WINNER","start":{"id":"96","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"loser":{"type":"node","id":"2","labels":["Player"],"properties":{"name":"Simona Halep","id":"201594"}},"loserRel":{"id":"7744","type":"relationship","label":"LOSER","start":{"id":"2","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"match":{"type":"node","id":"2984","labels":["Match"],"properties":{"score":"7-6(2) 3-6 6-4","round":"F","id":"20180115_701"}}}
Importing data
We’re now ready to import the data into a new database. Let’s first create a new database and start using it:
:use system;
CREATE DATABASE blog;
:use blog
Before we import any data, let’s create the constraints that we had on our original database:
CREATE CONSTRAINT ON (p:Player)
ASSERT p.id IS UNIQUE;
CREATE CONSTRAINT ON (m:Match)
ASSERT m.id IS UNIQUE;
CREATE CONSTRAINT ON (t:Tournament)
ASSERT (t.name, t.year) IS NODE KEY;
We can check that those have been created by running the following command:
CALL db.indexes();
| id | name | state | populationPercent | uniqueness | type | entityType | labelsOrTypes | properties | provider |
|---|---|---|---|---|---|---|---|---|---|
5 |
"constraint_989cb13a" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Tournament"] |
["name", "year"] |
"native-btree-1.0" |
3 |
"constraint_a016a763" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Match"] |
["id"] |
"native-btree-1.0" |
1 |
"constraint_cbe92269" |
"ONLINE" |
100.0 |
"UNIQUE" |
"BTREE" |
"NODE" |
["Player"] |
["id"] |
"native-btree-1.0" |
Alright, it’s time to import the data!
We’re going to use the apoc.load.json procedure to do this, and we’ll start with tournaments.json.
When we’re using the import procedures, we need to specify some properties that enable importing from local files. In our Dockerfile, we have the following lines:
- NEO4J_apoc_import_file_use__neo4j__config=true
- NEO4J_apoc_import_file_enabled=true
These properties are the equivalent of adding the following values to the apoc.conf file:
apoc.import.file.use_neo4j_config=true
apoc.import.file.enabled=true
-
apoc.import.file.enabled=trueenables reading files to disk -
apoc.import.file.use_neo4j_config=truereads those files from theimportdirectory
Let’s quickly run an exploratory query to remind ourselves the data that we’ll be working with:
CALL apoc.load.json("tournaments.json")
YIELD value
RETURN value.t1, value.t2, value.rel
LIMIT 1;
| value.t1 | value.t2 | value.rel |
|---|---|---|
{id: "256", type: "node", properties: {name: "Australian Open", year: 2000}, labels: ["Tournament"]} |
{id: "257", type: "node", properties: {name: "Australian Open", year: 2001}, labels: ["Tournament"]} |
{start: {id: "256", labels: ["Tournament"]}, end: {id: "257", labels: ["Tournament"]}, id: "10266", label: "NEXT_TOURNAMENT", type: "relationship"} |
where:
-
t1andt2representTournamentnodes -
relrepresents theNEXT_TOURNAMENTrelationship
We can recreate the original graph structure by running the following query:
CALL apoc.load.json("tournaments.json") YIELD value
CALL apoc.merge.node(value.t1.labels, value.t1.properties) YIELD node AS t1
CALL apoc.merge.node(value.t2.labels, value.t2.properties) YIELD node AS t2
CALL apoc.merge.relationship(t1, value.rel.label, {}, {}, t2, {}) YIELD rel
RETURN count(*);
We’re using the apoc.merge.node and apoc.merge.relationship procedures to create the nodes and relationships.
These procedures are the equivalent of Cypher’s MERGE clause, and come in handy when we’re importing data from a dynamic source
e.g. a JSON file
| count(*) |
|---|
19 |
Everything looks good, but let’s quickly check the contents of our database by running the apoc.meta.stats procedure that we ran earlier:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
| nodeCount | relCount | labels | relTypesCount |
|---|---|---|---|
20 |
19 |
{Tournament: 20} |
{NEXT_TOURNAMENT: 19} |
So we’ve got 20 nodes representing the 20 tournaments from 2000 to 2019 and 19 relationships between them. So far so good.
Time to import finalists.json.
We’ll start with an exploratory query:
CALL apoc.load.json("finalists.json")
YIELD value
RETURN value.t, value.tournRel,
value.winner, value.winnerRel,
value.loser, value.loserRel,
value.match
LIMIT 1;
| value.t | value.tournRel | value.winner | value.winnerRel | value.loser | value.loserRel | value.match |
|---|---|---|---|---|---|---|
{id: "0", type: "node", properties: {name: "Australian Open", year: 2019}, labels: ["Tournament"]} |
{start: {id: "191", labels: ["Match"]}, end: {id: "0", labels: ["Tournament"]}, id: "380", label: "IN_TOURNAMENT", type: "relationship"} |
{id: "34", type: "node", properties: {name: "Naomi Osaka", id: "211768"}, labels: ["Player"]} |
{start: {id: "34", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "378", label: "WINNER", type: "relationship"} |
{id: "66", type: "node", properties: {name: "Petra Kvitova", id: "201520"}, labels: ["Player"]} |
{start: {id: "66", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "379", label: "LOSER", type: "relationship"} |
{id: "191", type: "node", properties: {score: "7-6(2) 5-7 6-4", id: "20190114_226", round: "F"}, labels: ["Match"]} |
where:
-
trepresents theTournamentnode -
winnerandloserrepresentPlayernodes -
matchrepresents theMatchnode -
winnerRelrepresents theWINNERrelationship between aPlayerand aMatch -
loserRelrepresents theLOSERrelationship between aPlayerand aMatch -
tournRelrepresents theIN_TOURNAMENTrelationship between aMatchand aTournament
Let’s now run the following query to import our matches:
CALL apoc.load.json("finalists.json") yield value
CALL apoc.merge.node(value.winner.labels, value.winner.properties) YIELD node AS winner
CALL apoc.merge.node(value.loser.labels, value.loser.properties) YIELD node AS loser
CALL apoc.merge.node(value.match.labels, value.match.properties) YIELD node AS match
CALL apoc.merge.node(value.t.labels, value.t.properties) YIELD node AS tournament
CALL apoc.merge.relationship(winner, value.winnerRel.label, {}, {}, match, {}) YIELD rel AS winnerRel
CALL apoc.merge.relationship(loser, value.loserRel.label, {}, {}, match, {}) YIELD rel AS loserRel
CALL apoc.merge.relationship(match, value.tournRel.label, {}, {}, tournament, {}) YIELD rel AS tournRel
return count(*);
| count(*) |
|---|
20 |
Let’s do one last call to apoc.meta.stats:
CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
| nodeCount | relCount | labels | relTypesCount |
|---|---|---|---|
59 |
79 |
{Player: 19, Tournament: 20, Match: 20} |
{WINNER: 20, LOSER: 20, IN_TOURNAMENT: 20, NEXT_TOURNAMENT: 19} |
And finally, let’s run one of the queries from the original QuickGraph post to check that everything’s in order:
MATCH (winner:Player)-[:WINNER]->(match:Match {round: "F"})<-[:LOSER]-(loser),
(match)-[:IN_TOURNAMENT]->(tournament)
RETURN tournament.year AS year, winner.name AS winner,
loser.name AS loser, match.score AS score
ORDER BY tournament.year
| year | winner | loser | score |
|---|---|---|---|
2000 |
"Lindsay Davenport" |
"Martina Hingis" |
"6-1 7-5" |
2001 |
"Jennifer Capriati" |
"Martina Hingis" |
"6-4 6-3" |
2002 |
"Jennifer Capriati" |
"Martina Hingis" |
"4-6 7-6(7) 6-2" |
2003 |
"Serena Williams" |
"Venus Williams" |
"7-6(4) 3-6 6-4" |
2004 |
"Justine Henin" |
"Kim Clijsters" |
"6-3 4-6 6-3" |
2005 |
"Serena Williams" |
"Lindsay Davenport" |
"2-6 6-3 6-0" |
2006 |
"Amelie Mauresmo" |
"Justine Henin" |
"6-1 2-0 RET" |
2007 |
"Serena Williams" |
"Maria Sharapova" |
"6-1 6-2" |
2008 |
"Maria Sharapova" |
"Ana Ivanovic" |
"7-5 6-3" |
2009 |
"Serena Williams" |
"Dinara Safina" |
"6-0 6-3" |
2010 |
"Serena Williams" |
"Justine Henin" |
"6-4 3-6 6-2" |
2011 |
"Kim Clijsters" |
"Na Li" |
"3-6 6-3 6-3" |
2012 |
"Victoria Azarenka" |
"Maria Sharapova" |
"6-3 6-0" |
2013 |
"Victoria Azarenka" |
"Na Li" |
"4-6 6-4 6-3" |
2014 |
"Na Li" |
"Dominika Cibulkova" |
"7-6(3) 6-0" |
2015 |
"Serena Williams" |
"Maria Sharapova" |
"6-3 7-6(5)" |
2016 |
"Angelique Kerber" |
"Serena Williams" |
"6-4 3-6 6-4" |
2017 |
"Serena Williams" |
"Venus Williams" |
"6-4 6-4" |
2018 |
"Caroline Wozniacki" |
"Simona Halep" |
"7-6(2) 3-6 6-4" |
2019 |
"Naomi Osaka" |
"Petra Kvitova" |
"7-6(2) 5-7 6-4" |
Same results as before, which I think classifies as a #win!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.