10 Jul 2014

Neo4j: LOAD CSV - Processing hidden arrays in your CSV documents

I was recently asked how to process an 'array' of values inside a column in a CSV file using Neo4j’s LOAD CSV tool and although I initially thought this wouldn’t be possible as every cell is treated as a String, Michael showed me a way of working around this which I thought was pretty neat.

Let’s say we have a CSV file representing people and their friends. It might look like this:

name,friends
"Mark","Michael,Peter"
"Michael","Peter,Kenny"
"Kenny","Anders,Michael"

And what we want is to have the following nodes:

Mark
Michael
Peter
Kenny
Anders

And the following friends relationships:

Mark -> Michael
Mark -> Peter
Michael -> Peter
Michael -> Kenny
Kenny -> Anders
Kenny -> Michael

We’ll start by loading the CSV file and returning each row:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row;
+------------------------------------------------+
| row                                            |
+------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   |
| {name -> "Michael", friends -> "Peter,Kenny"}  |
| {name -> "Kenny", friends -> "Anders,Michael"} |
+------------------------------------------------+
3 rows

As expected the 'friends' column is being treated as a String which means we can use the split function to get an array of people that we want to be friends with:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row RETURN row, split(row.friends, ",") AS friends;
+-----------------------------------------------------------------------+
| row                                            | friends              |
+-----------------------------------------------------------------------+
| {name -> "Mark", friends -> "Michael,Peter"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "Peter,Kenny"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "Anders,Michael"} | ["Anders","Michael"] |
+-----------------------------------------------------------------------+
3 rows

Now that we’ve got them as an array we can use UNWIND to get pairs of friends that we want to create:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row
  WITH row, split(row.friends, ",") AS friends
  UNWIND friends AS friend
  RETURN row.name, friend;
+-----------------------+
| row.name  | friend    |
+-----------------------+
| "Mark"    | "Michael" |
| "Mark"    | "Peter"   |
| "Michael" | "Peter"   |
| "Michael" | "Kenny"   |
| "Kenny"   | "Anders"  |
| "Kenny"   | "Michael" |
+-----------------------+
6 rows

And now we’ll introduce some MERGE statements to create the appropriate nodes and relationships:

$ load csv with headers from "file:/Users/markneedham/Desktop/friends.csv" AS row
  WITH row, split(row.friends, ",") AS friends
  UNWIND friends AS friend
  MERGE (p1:Person {name: row.name})
  MERGE (p2:Person {name: friend})
  MERGE (p1)-[:FRIENDS_WITH]->(p2);
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5
373 ms

And now if we query the database to get back all the nodes + relationships...

$ match (p1:Person)-[r]->(p2) RETURN p1,r, p2;
+------------------------------------------------------------------------+
| p1                      | r                  | p2                      |
+------------------------------------------------------------------------+
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[0]{} | Node[1]{name:"Michael"} |
| Node[0]{name:"Mark"}    | :FRIENDS_WITH[1]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[2]{} | Node[2]{name:"Peter"}   |
| Node[1]{name:"Michael"} | :FRIENDS_WITH[3]{} | Node[3]{name:"Kenny"}   |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[4]{} | Node[4]{name:"Anders"}  |
| Node[3]{name:"Kenny"}   | :FRIENDS_WITH[5]{} | Node[1]{name:"Michael"} |
+------------------------------------------------------------------------+
6 rows

...you’ll see that we have everything.

If instead of a comma separated list of people we have a literal array in the cell...

name,friends
"Mark", "[Michael,Peter]"
"Michael", "[Peter,Kenny]"
"Kenny", "[Anders,Michael]"

...we’d need to tweak the part of the query which extracts our friends to strip off the first and last characters:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row
  RETURN row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends;
+-------------------------------------------------------------------------+
| row                                              | friends              |
+-------------------------------------------------------------------------+
| {name -> "Mark", friends -> "[Michael,Peter]"}   | ["Michael","Peter"]  |
| {name -> "Michael", friends -> "[Peter,Kenny]"}  | ["Peter","Kenny"]    |
| {name -> "Kenny", friends -> "[Anders,Michael]"} | ["Anders","Michael"] |
+-------------------------------------------------------------------------+
3 rows

And then if we put the whole query together we end up with this:

$ load csv with headers from "file:/Users/markneedham/Desktop/friendsa.csv" AS row
  WITH row, split(substring(row.friends, 1, length(row.friends) -2), ",") AS friends
  UNWIND friends AS friend
  MERGE (p1:Person {name: row.name})
  MERGE (p2:Person {name: friend})
  MERGE (p1)-[:FRIENDS_WITH]->(p2);;
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 5
Relationships created: 6
Properties set: 5
Labels added: 5

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.