18 Oct 2014

Neo4j: LOAD CSV - The sneaky null character

I spent some time earlier in the week trying to import a CSV file extracted from Hadoop into Neo4j using Cypher’s LOAD CSV command and initially struggled due to some rogue characters.

The CSV file looked like this:

$ cat foo.csv
foo,bar,baz
1,2,3

I wrote the following LOAD CSV query to extract some of the fields and compare others:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2"

==> +--------------------------------------+
==> | line.foo | line.bar | line.bar = "2" |
==> +--------------------------------------+
==> | <null>   | "2"     | false          |
==> +--------------------------------------+
==> 1 row

I had expect to see a "1" in the first column and a 'true' in the third column, neither of which happened.

I initially didn’t have a text editor with hexcode mode available so I tried checking the length of the entry in the 'bar' field:

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)

==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | <null>   | "2"     | false          | 2                |
==> +---------------------------------------------------------+
==> 1 row

The length of that value is 2 when we’d expect it to be 1 given it’s a single character.

I tried trimming the field to see if that made any difference...

load csv with headers from "file:/Users/markneedham/Downloads/foo.csv" AS line
RETURN line.foo, trim(line.bar), trim(line.bar) = "2", length(line.bar)

==> +---------------------------------------------------------------------+
==> | line.foo | trim(line.bar) | trim(line.bar) = "2" | length(line.bar) |
==> +---------------------------------------------------------------------+
==> | <null>   | "2"            | true                 | 2                |
==> +---------------------------------------------------------------------+
==> 1 row

...and it did! I thought there was probably a trailing whitespace character after the "2" which trim had removed and that 'foo' column in the header row had the same issue.

I was able to see that this was the case by extracting the JSON dump of the query via the Neo4j browser:

{
   "table":{
      "_response":{
         "columns":[
            "line"
         ],
         "data":[
            {
               "row":[
                  {
                     "foo\u0000":"1\u0000",
                     "bar":"2\u0000",
                     "baz":"3"
                  }
               ],
               "graph":{
                  "nodes":[

                  ],
                  "relationships":[

                  ]
               }
            }
         ],
      ...
}

It turns out there were null characters scattered around the file so I needed to pre process the file to get rid of them:

$ tr < foo.csv -d '\000' > bar.csv

Now if we process bar.csv it’s a much smoother process:

load csv with headers from "file:/Users/markneedham/Downloads/bar.csv" AS line
RETURN line.foo, line.bar, line.bar = "2", length(line.bar)

==> +---------------------------------------------------------+
==> | line.foo | line.bar | line.bar = "2" | length(line.bar) |
==> +---------------------------------------------------------+
==> | "1"      | "2"      | true           | 1                |
==> +---------------------------------------------------------+
==> 1 row

Note to self: don’t expect data to be clean, inspect it first!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.