· neo4j apoc

Neo4j: Exporting a subset of data from one database to another

As part of the preparation for another blog post, I wanted to export a subset of data from one Neo4j database to another one, which seemed like a blog post in its own right.

apoc export json
Figure 1. Exporting data using APOC’s Export JSON

Setting up Neo4j

We’re going to use the following Docker Compose configuration in this blog post:

Dockerfile
version: '3.7'

services:
  neo4j:
    image: neo4j:4.0.0-enterprise
    container_name: "quickgraph-aus-open"
    volumes:
      - ./plugins:/plugins
      - ./data:/data
      - ./import:/var/lib/neo4j/import
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - "NEO4J_ACCEPT_LICENSE_AGREEMENT=yes"
      - "NEO4J_AUTH=neo4j/neo"
      - NEO4J_apoc_import_file_use__neo4j__config=true
      - NEO4J_apoc_import_file_enabled=true
      - NEO4J_apoc_export_file_enabled=true
      - NEO4JLABS_PLUGINS=["apoc"]

Once we’ve created that file we need to open a terminal session where that file lives and then run the following command to spin up Neo4j:

docker-compose up

If we run that command, we’ll see the following (truncated) output:

Started quickgraph-aus-open ... done
Attaching to quickgraph-aus-open
quickgraph-aus-open | Changed password for user 'neo4j'.
quickgraph-aus-open | Fetching versions.json for Plugin 'apoc' from https://neo4j-contrib.github.io/neo4j-apoc-procedures/versions.json
quickgraph-aus-open | Installing Plugin 'apoc' from https://github.com/neo4j-contrib/neo4j-apoc-procedures/releases/download/4.0.0.0/apoc-4.0.0.0-all.jar to /plugins/apoc.jar
quickgraph-aus-open | Applying default values for plugin apoc to neo4j.conf
quickgraph-aus-open | Directories in use:
quickgraph-aus-open |   home:         /var/lib/neo4j
quickgraph-aus-open |   config:       /var/lib/neo4j/conf
quickgraph-aus-open |   logs:         /logs
quickgraph-aus-open |   plugins:      /plugins
quickgraph-aus-open |   import:       /import
quickgraph-aus-open |   data:         /var/lib/neo4j/data
quickgraph-aus-open |   certificates: /var/lib/neo4j/certificates
quickgraph-aus-open |   run:          /var/lib/neo4j/run
quickgraph-aus-open | Starting Neo4j.
quickgraph-aus-open | 2020-01-21 22:24:29.976+0000 INFO  ======== Neo4j 4.0.0 ========
quickgraph-aus-open | 2020-01-21 22:24:29.982+0000 INFO  Starting...
quickgraph-aus-open | 2020-01-21 22:24:35.656+0000 INFO  Called db.clearQueryCaches(): Query cache already empty.
quickgraph-aus-open | 2020-01-21 22:24:40.765+0000 INFO  Sending metrics to CSV file at /var/lib/neo4j/metrics
quickgraph-aus-open | 2020-01-21 22:24:40.790+0000 INFO  Bolt enabled on 0.0.0.0:7687.
quickgraph-aus-open | 2020-01-21 22:24:40.791+0000 INFO  Started.
quickgraph-aus-open | 2020-01-21 22:24:40.879+0000 INFO  Server thread metrics have been registered successfully
quickgraph-aus-open | 2020-01-21 22:24:41.723+0000 INFO  Remote interface available at http://0.0.0.0:7474/

Once we see that last line we’re ready to roll.

Exporting data

In the Australian Open QuickGraph blog post we imported all the matches from the Australian Open tennis tournament. We can see a sample of the imported graph in the Neo4j Browser visualisation below:

aus open preview
Figure 2. Sample of the Australian Open Graph

Let’s quickly check how much data this graph contains. We’ll query the database via the Cypher Shell command, which we can launch by running the following:

$ docker exec -it quickgraph-aus-open cypher-shell -u neo4j -p neo -d womens

Connected to Neo4j 4.0.0 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@womens>

We’ll use APOC’s apoc.meta.stats procedure to get back a summary of what’s in this database:

CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount
Table 1. apoc.meta.stats
nodeCount relCount labels relTypesCount

9013

27727

{Player: 571, Set: 5882, Tournament: 20, Match: 2540}

{WINNER: 8383, IN_TOURNAMENT: 2540, NEXT_MATCH: 2520, NEXT_TOURNAMENT: 19, LOSER: 8383, IN_MATCH: 5882}

There’s not much data in there, but we only want to export the Tournament nodes, NEXT_TOURNAMENT relationships, Match nodes for the finals, and the Player nodes that competed in those finals. We’re going to export the data with a little help from APOC's export to JSON procedures.

P.S. 👇

When we’re using the export procedures, we need to specify some properties that enable this functionality. In our Dockerfile, we have the following lines:

  - NEO4J_apoc_import_file_use__neo4j__config=true
  - NEO4J_apoc_export_file_enabled=true

These properties are the equivalent of adding the following values to the apoc.conf file:

apoc.import.file.use_neo4j_config=true
apoc.export.file.enabled=true
  • apoc.export.file.enabled=true enables writing export files to disk

  • apoc.import.file.use_neo4j_config=true writes those export files to the import directory

We’ll export the data in two parts using the apoc.export.json.query procedure, writing:

  • one query to export the Tournament nodes and the NEXT_TOURNAMENT relationships between them

  • one query to export the Match nodes where round: "F" and the associated Player and Tournament nodes and any relationships

The following query exports the tournaments to the file tournaments.json
WITH "MATCH path = (t1:Tournament)-[rel:NEXT_TOURNAMENT]->(t2)
      RETURN t1, t2, rel" AS query
CALL apoc.export.json.query(query, "tournaments.json", {format: "plain"})
YIELD file
RETURN file

The contents of tournaments.json, truncated for brevity, are shown below:

import/tournaments.json
{"t1":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"t2":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"rel":{"id":"10266","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"256","labels":["Tournament"]},"end":{"id":"257","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"257","labels":["Tournament"],"properties":{"year":2001,"name":"Australian Open"}},"t2":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"rel":{"id":"10267","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"257","labels":["Tournament"]},"end":{"id":"258","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"258","labels":["Tournament"],"properties":{"year":2002,"name":"Australian Open"}},"t2":{"type":"node","id":"259","labels":["Tournament"],"properties":{"year":2003,"name":"Australian Open"}},"rel":{"id":"10268","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"258","labels":["Tournament"]},"end":{"id":"259","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"265","labels":["Tournament"],"properties":{"year":2009,"name":"Australian Open"}},"t2":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"rel":{"id":"10275","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"265","labels":["Tournament"]},"end":{"id":"266","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"266","labels":["Tournament"],"properties":{"year":2010,"name":"Australian Open"}},"t2":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"rel":{"id":"10276","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"266","labels":["Tournament"]},"end":{"id":"267","labels":["Tournament"]}}}
{"t1":{"type":"node","id":"267","labels":["Tournament"],"properties":{"year":2011,"name":"Australian Open"}},"t2":{"type":"node","id":"268","labels":["Tournament"],"properties":{"year":2012,"name":"Australian Open"}},"rel":{"id":"10277","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"267","labels":["Tournament"]},"end":{"id":"268","labels":["Tournament"]}}}
...
{"t1":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"t2":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"rel":{"id":"10284","type":"relationship","label":"NEXT_TOURNAMENT","start":{"id":"274","labels":["Tournament"]},"end":{"id":"0","labels":["Tournament"]}}}
The following query exports the final matches and surrounding nodes and relationships to the file finalists.json
WITH "MATCH (t:Tournament)<-[tournRel:IN_TOURNAMENT]-(match:Match {round: 'F'})<-[winnerRel:WINNER]-(winner),
            (loser)-[loserRel:LOSER]->(match)
      RETURN t, tournRel, winner, winnerRel, loser, loserRel, match" AS query
CALL apoc.export.json.query(query, "finalists.json", {format: "plain"})
YIELD file
RETURN file

The contents of finalists.json, truncated for brevity, are shown below:

import/finalists.json
{"t":{"type":"node","id":"0","labels":["Tournament"],"properties":{"year":2019,"name":"Australian Open"}},"tournRel":{"id":"380","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"191","labels":["Match"]},"end":{"id":"0","labels":["Tournament"]}},"winner":{"type":"node","id":"34","labels":["Player"],"properties":{"name":"Naomi Osaka","id":"211768"}},"winnerRel":{"id":"378","type":"relationship","label":"WINNER","start":{"id":"34","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"loser":{"type":"node","id":"66","labels":["Player"],"properties":{"name":"Petra Kvitova","id":"201520"}},"loserRel":{"id":"379","type":"relationship","label":"LOSER","start":{"id":"66","labels":["Player"]},"end":{"id":"191","labels":["Match"]}},"match":{"type":"node","id":"191","labels":["Match"],"properties":{"score":"7-6(2) 5-7 6-4","round":"F","id":"20190114_226"}}}
{"t":{"type":"node","id":"256","labels":["Tournament"],"properties":{"year":2000,"name":"Australian Open"}},"tournRel":{"id":"887","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"464","labels":["Match"]},"end":{"id":"256","labels":["Tournament"]}},"winner":{"type":"node","id":"401","labels":["Player"],"properties":{"name":"Lindsay Davenport","id":"200128"}},"winnerRel":{"id":"885","type":"relationship","label":"WINNER","start":{"id":"401","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"loser":{"type":"node","id":"276","labels":["Player"],"properties":{"name":"Martina Hingis","id":"200001"}},"loserRel":{"id":"886","type":"relationship","label":"LOSER","start":{"id":"276","labels":["Player"]},"end":{"id":"464","labels":["Match"]}},"match":{"type":"node","id":"464","labels":["Match"],"properties":{"score":"6-1 7-5","round":"F","id":"20000117_127"}}}
...
{"t":{"type":"node","id":"260","labels":["Tournament"],"properties":{"year":2004,"name":"Australian Open"}},"tournRel":{"id":"2405","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"1054","labels":["Match"]},"end":{"id":"260","labels":["Tournament"]}},"winner":{"type":"node","id":"278","labels":["Player"],"properties":{"name":"Justine Henin","id":"200003"}},"winnerRel":{"id":"2403","type":"relationship","label":"WINNER","start":{"id":"278","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"loser":{"type":"node","id":"547","labels":["Player"],"properties":{"name":"Kim Clijsters","id":"200079"}},"loserRel":{"id":"2404","type":"relationship","label":"LOSER","start":{"id":"547","labels":["Player"]},"end":{"id":"1054","labels":["Match"]}},"match":{"type":"node","id":"1054","labels":["Match"],"properties":{"score":"6-3 4-6 6-3","round":"F","id":"20040119_127"}}}
...
{"t":{"type":"node","id":"274","labels":["Tournament"],"properties":{"year":2018,"name":"Australian Open"}},"tournRel":{"id":"7745","type":"relationship","label":"IN_TOURNAMENT","start":{"id":"2984","labels":["Match"]},"end":{"id":"274","labels":["Tournament"]}},"winner":{"type":"node","id":"96","labels":["Player"],"properties":{"name":"Caroline Wozniacki","id":"201496"}},"winnerRel":{"id":"7743","type":"relationship","label":"WINNER","start":{"id":"96","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"loser":{"type":"node","id":"2","labels":["Player"],"properties":{"name":"Simona Halep","id":"201594"}},"loserRel":{"id":"7744","type":"relationship","label":"LOSER","start":{"id":"2","labels":["Player"]},"end":{"id":"2984","labels":["Match"]}},"match":{"type":"node","id":"2984","labels":["Match"],"properties":{"score":"7-6(2) 3-6 6-4","round":"F","id":"20180115_701"}}}

Importing data

We’re now ready to import the data into a new database. Let’s first create a new database and start using it:

:use system;
CREATE DATABASE blog;
:use blog

Before we import any data, let’s create the constraints that we had on our original database:

CREATE CONSTRAINT ON (p:Player)
ASSERT p.id IS UNIQUE;

CREATE CONSTRAINT ON (m:Match)
ASSERT m.id IS UNIQUE;

CREATE CONSTRAINT ON (t:Tournament)
ASSERT (t.name, t.year) IS NODE KEY;

We can check that those have been created by running the following command:

CALL db.indexes();
Table 2. CALL db.indexes()
id name state populationPercent uniqueness type entityType labelsOrTypes properties provider

5

"constraint_989cb13a"

"ONLINE"

100.0

"UNIQUE"

"BTREE"

"NODE"

["Tournament"]

["name", "year"]

"native-btree-1.0"

3

"constraint_a016a763"

"ONLINE"

100.0

"UNIQUE"

"BTREE"

"NODE"

["Match"]

["id"]

"native-btree-1.0"

1

"constraint_cbe92269"

"ONLINE"

100.0

"UNIQUE"

"BTREE"

"NODE"

["Player"]

["id"]

"native-btree-1.0"

Alright, it’s time to import the data! We’re going to use the apoc.load.json procedure to do this, and we’ll start with tournaments.json.

When we’re using the import procedures, we need to specify some properties that enable importing from local files. In our Dockerfile, we have the following lines:

  - NEO4J_apoc_import_file_use__neo4j__config=true
  - NEO4J_apoc_import_file_enabled=true

These properties are the equivalent of adding the following values to the apoc.conf file:

apoc.import.file.use_neo4j_config=true
apoc.import.file.enabled=true
  • apoc.import.file.enabled=true enables reading files to disk

  • apoc.import.file.use_neo4j_config=true reads those files from the import directory

Let’s quickly run an exploratory query to remind ourselves the data that we’ll be working with:

CALL apoc.load.json("tournaments.json")
YIELD value
RETURN value.t1, value.t2, value.rel
LIMIT 1;
Table 3. Exploring tournaments.json
value.t1 value.t2 value.rel

{id: "256", type: "node", properties: {name: "Australian Open", year: 2000}, labels: ["Tournament"]}

{id: "257", type: "node", properties: {name: "Australian Open", year: 2001}, labels: ["Tournament"]}

{start: {id: "256", labels: ["Tournament"]}, end: {id: "257", labels: ["Tournament"]}, id: "10266", label: "NEXT_TOURNAMENT", type: "relationship"}

where:

  • t1 and t2 represent Tournament nodes

  • rel represents the NEXT_TOURNAMENT relationship

We can recreate the original graph structure by running the following query:

CALL apoc.load.json("tournaments.json") YIELD value
CALL apoc.merge.node(value.t1.labels, value.t1.properties) YIELD node AS t1
CALL apoc.merge.node(value.t2.labels, value.t2.properties) YIELD node AS t2
CALL apoc.merge.relationship(t1, value.rel.label, {}, {}, t2, {}) YIELD rel
RETURN count(*);

We’re using the apoc.merge.node and apoc.merge.relationship procedures to create the nodes and relationships. These procedures are the equivalent of Cypher’s MERGE clause, and come in handy when we’re importing data from a dynamic source e.g. a JSON file

Table 4. Results
count(*)

19

Everything looks good, but let’s quickly check the contents of our database by running the apoc.meta.stats procedure that we ran earlier:

CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
Table 5. apoc.meta.stats
nodeCount relCount labels relTypesCount

20

19

{Tournament: 20}

{NEXT_TOURNAMENT: 19}

So we’ve got 20 nodes representing the 20 tournaments from 2000 to 2019 and 19 relationships between them. So far so good.

Time to import finalists.json. We’ll start with an exploratory query:

CALL apoc.load.json("finalists.json")
YIELD value
RETURN value.t, value.tournRel,
       value.winner, value.winnerRel,
       value.loser, value.loserRel,
       value.match
LIMIT 1;
Table 6. Exploring finalists.json
value.t value.tournRel value.winner value.winnerRel value.loser value.loserRel value.match

{id: "0", type: "node", properties: {name: "Australian Open", year: 2019}, labels: ["Tournament"]}

{start: {id: "191", labels: ["Match"]}, end: {id: "0", labels: ["Tournament"]}, id: "380", label: "IN_TOURNAMENT", type: "relationship"}

{id: "34", type: "node", properties: {name: "Naomi Osaka", id: "211768"}, labels: ["Player"]}

{start: {id: "34", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "378", label: "WINNER", type: "relationship"}

{id: "66", type: "node", properties: {name: "Petra Kvitova", id: "201520"}, labels: ["Player"]}

{start: {id: "66", labels: ["Player"]}, end: {id: "191", labels: ["Match"]}, id: "379", label: "LOSER", type: "relationship"}

{id: "191", type: "node", properties: {score: "7-6(2) 5-7 6-4", id: "20190114_226", round: "F"}, labels: ["Match"]}

where:

  • t represents the Tournament node

  • winner and loser represent Player nodes

  • match represents the Match node

  • winnerRel represents the WINNER relationship between a Player and a Match

  • loserRel represents the LOSER relationship between a Player and a Match

  • tournRel represents the IN_TOURNAMENT relationship between a Match and a Tournament

Let’s now run the following query to import our matches:

CALL apoc.load.json("finalists.json") yield value
CALL apoc.merge.node(value.winner.labels, value.winner.properties) YIELD node AS winner
CALL apoc.merge.node(value.loser.labels, value.loser.properties) YIELD node AS loser
CALL apoc.merge.node(value.match.labels, value.match.properties) YIELD node AS match
CALL apoc.merge.node(value.t.labels, value.t.properties) YIELD node AS tournament
CALL apoc.merge.relationship(winner, value.winnerRel.label, {}, {}, match, {}) YIELD rel AS winnerRel
CALL apoc.merge.relationship(loser, value.loserRel.label, {}, {}, match, {}) YIELD rel AS loserRel
CALL apoc.merge.relationship(match, value.tournRel.label, {}, {}, tournament, {}) YIELD rel AS tournRel
return count(*);
Table 7. Results
count(*)

20

Let’s do one last call to apoc.meta.stats:

CALL apoc.meta.stats()
YIELD nodeCount, relCount, labels, relTypesCount
RETURN nodeCount, relCount, labels, relTypesCount;
Table 8. apoc.meta.stats
nodeCount relCount labels relTypesCount

59

79

{Player: 19, Tournament: 20, Match: 20}

{WINNER: 20, LOSER: 20, IN_TOURNAMENT: 20, NEXT_TOURNAMENT: 19}

And finally, let’s run one of the queries from the original QuickGraph post to check that everything’s in order:

MATCH (winner:Player)-[:WINNER]->(match:Match {round: "F"})<-[:LOSER]-(loser),
      (match)-[:IN_TOURNAMENT]->(tournament)
RETURN tournament.year AS year, winner.name AS winner,
       loser.name AS loser, match.score AS score
ORDER BY tournament.year
Table 9. Results
year winner loser score

2000

"Lindsay Davenport"

"Martina Hingis"

"6-1 7-5"

2001

"Jennifer Capriati"

"Martina Hingis"

"6-4 6-3"

2002

"Jennifer Capriati"

"Martina Hingis"

"4-6 7-6(7) 6-2"

2003

"Serena Williams"

"Venus Williams"

"7-6(4) 3-6 6-4"

2004

"Justine Henin"

"Kim Clijsters"

"6-3 4-6 6-3"

2005

"Serena Williams"

"Lindsay Davenport"

"2-6 6-3 6-0"

2006

"Amelie Mauresmo"

"Justine Henin"

"6-1 2-0 RET"

2007

"Serena Williams"

"Maria Sharapova"

"6-1 6-2"

2008

"Maria Sharapova"

"Ana Ivanovic"

"7-5 6-3"

2009

"Serena Williams"

"Dinara Safina"

"6-0 6-3"

2010

"Serena Williams"

"Justine Henin"

"6-4 3-6 6-2"

2011

"Kim Clijsters"

"Na Li"

"3-6 6-3 6-3"

2012

"Victoria Azarenka"

"Maria Sharapova"

"6-3 6-0"

2013

"Victoria Azarenka"

"Na Li"

"4-6 6-4 6-3"

2014

"Na Li"

"Dominika Cibulkova"

"7-6(3) 6-0"

2015

"Serena Williams"

"Maria Sharapova"

"6-3 7-6(5)"

2016

"Angelique Kerber"

"Serena Williams"

"6-4 3-6 6-4"

2017

"Serena Williams"

"Venus Williams"

"6-4 6-4"

2018

"Caroline Wozniacki"

"Simona Halep"

"7-6(2) 3-6 6-4"

2019

"Naomi Osaka"

"Petra Kvitova"

"7-6(2) 5-7 6-4"

Same results as before, which I think classifies as a #win!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket