Neo4j: LOAD CSV - The sneaky null character
I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.
The CSV file looked like this:
$ cat foo.csv
foo,bar,baz
1,2,3
I wrote the following LOAD CSV query to extract some of the fields and compare others:
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"
==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> | <null> | "2" | false |
==> +--------------------------------------+
==> 1 row
I had expect to see a "1" in the first column and a 'true' in the third column, neither of which happened.
I initially didn’t have a text editor with hexcode mode available so I tried checking the length of the entry in the 'bar' field:
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | <null> | "2" | false | 2 |
==> +---------------------------------------------------------+
==> 1 row
The length of that value is 2 when we’d expect it to be 1 given it’s a single character.
I tried trimming the field to see if that made any difference...
load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, trim(line.bar), trim(line.bar) = "2", length(line.bar)
==> +---------------------------------------------------------------------+
==> | line.foo | trim(line.bar) | trim(line.bar) = "2" | length(line.bar) |
==> +---------------------------------------------------------------------+
==> | <null> | "2" | true | 2 |
==> +---------------------------------------------------------------------+
==> 1 row
...and it did! I thought there was probably a trailing whitespace character after the "2" which trim had removed and that 'foo' column in the header row had the same issue.
I was able to see that this was the case by extracting the JSON dump of the query via the Neo4j browser:
{
"table":{
"_response":{
"columns":[
"line"
],
"data":[
{
"row":[
{
"foo\u0000":"1\u0000",
"bar":"2\u0000",
"baz":"3"
}
],
"graph":{
"nodes":[
],
"relationships":[
]
}
}
],
...
}
It turns out there were null characters scattered around the file so I needed to pre process the file to get rid of them:
$ tr < foo.csv -d '\000' > bar.csv
Now if we process bar.csv it’s a much smoother process:
load csv with headers from "file:/Users/markneedham/Downloads/bar.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)
==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | "1" | "2" | true | 1 |
==> +---------------------------------------------------------+
==> 1 row
Note to self: don’t expect data to be clean, inspect it first!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.