Neo4j: Cypher - Getting the hang of MERGE
I’ve been trying to get the hang of cypher’s MERGE function and started out by writing a small file to import some people with random properties using the java-faker library.
public class Merge {
private static Label PERSON = DynamicLabel.label("Person");
public static void main(String[] args) throws IOException {
File dbFile = new File("/tmp/test-db");
FileUtils.deleteRecursively(dbFile);
Faker faker = new Faker();
Random random = new Random();
GraphDatabaseService db = new GraphDatabaseFactory().newEmbeddedDatabase(dbFile.getPath());
Transaction tx = db.beginTx();
for (int i = 0; i < 100000; i++) {
Node person = db.createNode(PERSON);
person.setProperty("name", faker.name());
person.setProperty("firstName", faker.firstName());
person.setProperty("lastName", faker.lastName());
person.setProperty("country", faker.country());
person.setProperty("age", random.nextInt(50));
}
tx.success();
tx.close();
}
}
We can write the following query to get back a sample of the people that have been imported:
$ MATCH (p:Person) RETURN p LIMIT 5;
==> +------------------------------------------------------------------------------------------------------------------+
==> | p |
==> +------------------------------------------------------------------------------------------------------------------+
==> | Node[1344]{name:"Benton Swaniawski",firstName:"Rossie",lastName:"Ankunding",country:"Guadeloupe",age:30} |
==> | Node[1345]{name:"Dagmar Bartell",firstName:"Ashlynn",lastName:"Watsica",country:"French Guiana",age:35} |
==> | Node[1346]{name:"Ms. Missouri Gaylord",firstName:"Muriel",lastName:"Streich",country:"Chile",age:43} |
==> | Node[1347]{name:"Melvina Heathcote",firstName:"Geovanni",lastName:"Marks",country:"United Arab Emirates",age:33} |
==> | Node[1348]{name:"Brendan Schaefer",firstName:"Dayne",lastName:"Haley",country:"Tokelau",age:24} |
==> +------------------------------------------------------------------------------------------------------------------+
We can use the MERGE function to ensure that a node with specific properties exists so we might write something like the following:
MERGE (p:Person {name: "Benton Swaniawski",
firstName:"Rossie",
lastName:"Ankunding",
country:"Guadeloupe",
age:30})
RETURN p
If we have a look at the PROFILE output of the query we’d see something like the following:
UpdateGraph(commands=["
MergeNodeAction(
p,
Map(firstName(1) -> Literal(Rossie), country(3) -> Literal(Guadeloupe),
name(0) -> Literal(Benton Swaniawski),
lastName(2) -> Literal(Ankunding),
age(4) -> Literal(30)),
List(Person(0)),
ArrayBuffer(Property(p,lastName(2)) == Literal(Ankunding),
Property(p,name(0)) == Literal(Benton Swaniawski),
Property(p,age(4)) == Literal(30),
Property(p,country(3)) == Literal(Guadeloupe),
Property(p,firstName(1)) == Literal(Rossie)),
List(LabelAction(p,LabelSetOp,List(Person(0))),
PropertySetAction(Property(p,name),Literal(Benton Swaniawski)),
PropertySetAction(Property(p,country),Literal(Guadeloupe)),
PropertySetAction(Property(p,age),Literal(30)),
PropertySetAction(Property(p,lastName),Literal(Ankunding)),
PropertySetAction(Property(p,firstName),
Literal(Rossie))),
List(),
Some(PlainMergeNodeProducer(<function2>)))"], _rows=1, _db_hits=100219)
The bit that stands out is that there were 100,219 db_hits which slows down the query considerably.
If we want to use MERGE then we need to make sure that we have an index or constraint on one of the properties e.g.
$ CREATE INDEX ON :Person(name);
==> +-------------------+
==> | No data returned. |
==> +-------------------+
==> Indexes added: 1
If we look at the profile of that we’ll see that the number of db_hits has reduced as it’s now using the index to do part of the lookup that the MERGE requires:
UpdateGraph(commands=["
MergeNodeAction(
p,
Map(firstName(1) -> Literal(Rossie), country(3) -> Literal(Guadeloupe),
name(0) -> Literal(Benton Swaniawski),
...
Some(PlainMergeNodeProducer(<function2>)))"], _rows=1, _db_hits=4)
We can go one step further by only including the property that’s acting as our 'key' (i.e. name) in the first part of the statement and setting the other properties only if necessary:
MERGE (p:Person {name: "Benton Swaniawski"})
ON CREATE SET p.firstName="Rossie",
p.lastName="Ankunding",
p.country="Guadeloupe",
p.age=30
RETURN p
If we profile that query we can see that things have improved:
UpdateGraph(commands=["MergeNodeAction(
p,
Map(name(0) -> Literal(Benton Swaniawski)),
List(Person(0)),ArrayBuffer(),
List(LabelAction(p,LabelSetOp,List(Person(0))),
PropertySetAction(Property(p,name),Literal(Benton Swaniawski)),
PropertySetAction(Property(p,firstName),Literal(Rossie)),
PropertySetAction(Property(p,lastName),Literal(Ankunding)),
PropertySetAction(Property(p,country),Literal(Guadeloupe)),
PropertySetAction(Property(p,age),Literal(30))),
List(),
Some(PlainMergeNodeProducer(<function2>)))"], _rows=1, _db_hits=0)
In some cases we might want to update a property every time that 'key' gets matched in a MERGE statement which we could do like so:
MERGE (p:Person {name: "Benton Swaniawski"})
ON MATCH SET p.times = COALESCE(p.times, 0) + 1
RETURN p
You can also use MERGE to create relationships but for now I just wanted to explore how it should be used in the context of nodes which I think I’ve now figured out.
Always happy to take tips on how to do things better though!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.