30 Jul 2015

Neo4j: Cypher - Removing consecutive duplicates

When writing Cypher queries I sometimes find myself wanting to remove consecutive duplicates in collections that I’ve joined together.

e.g we might start with the following query where 1 and 7 appear consecutively:

RETURN [1,1,2,3,4,5,6,7,7,8] AS values

==> +-----------------------+
==> | values                |
==> +-----------------------+
==> | [1,1,2,3,4,5,6,7,7,8] |
==> +-----------------------+
==> 1 row

We want to end up with [1,2,3,4,5,6,7,8]. We can start by exploding our array and putting consecutive elements next to each other:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
RETURN idx, idx+1, values[idx], values[idx+1]

==> +-------------------------------------------+
==> | idx | idx+1 | values[idx] | values[idx+1] |
==> +-------------------------------------------+
==> | 0   | 1     | 1           | 1             |
==> | 1   | 2     | 1           | 2             |
==> | 2   | 3     | 2           | 3             |
==> | 3   | 4     | 3           | 4             |
==> | 4   | 5     | 4           | 5             |
==> | 5   | 6     | 5           | 6             |
==> | 6   | 7     | 6           | 7             |
==> | 7   | 8     | 7           | 7             |
==> | 8   | 9     | 7           | 8             |
==> +-------------------------------------------+
==> 9 rows

Next we can filter out rows which have the same values since that means they have consecutive duplicates:

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN a,b

==> +-------+
==> | a | b |
==> +-------+
==> | 1 | 2 |
==> | 2 | 3 |
==> | 3 | 4 |
==> | 4 | 5 |
==> | 5 | 6 |
==> | 6 | 7 |
==> | 7 | 8 |
==> +-------+
==> 7 rows

Now we need to join the collection back together again. Most of the values we want are in field 'b' but we also need to grab the first value from field 'a':

WITH [1,1,2,3,4,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates

==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

What about if we have more than 2 duplicates in a row?

WITH [1,1,1,2,3,4,5,5,6,7,7,8] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates

==> +-------------------+
==> | noDuplicates      |
==> +-------------------+
==> | [1,2,3,4,5,6,7,8] |
==> +-------------------+
==> 1 row

Still happy, good times! Of course if we have a non consecutive duplicate that wouldn’t be removed:

WITH [1,1,1,2,3,4,5,5,6,7,7,8,1] AS values
UNWIND RANGE(0, LENGTH(values) - 2) AS idx
WITH values[idx] AS a, values[idx+1] AS b
WHERE a <> b
RETURN COLLECT(a)[0] + COLLECT(b) AS noDuplicates

==> +---------------------+
==> | noDuplicates        |
==> +---------------------+
==> | [1,2,3,4,5,6,7,8,1] |
==> +---------------------+
==> 1 row

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.