A rogue "\357\273\277" (UTF-8 byte order mark)
We’ve been loading some data into Neo4j from a CSV file, creating one node per row and using the value in the first column as the index lookup key for the node.
Unfortunately the index lookup failed for the first row, even though it worked for every other row.
By coincidence we started saving each row into a hash map and were then able to see what was going wrong:
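require 'rubygems'
require 'fastercsv'
# Read the pipe-delimited file into an array of rows
things = FasterCSV.read("things.csv", :col_sep => "|")
saved_things = {}
things.each do |row|
  # Key each row by its first column, the value used for the index lookup
  saved_things[row[0]] = row[1]
end
p saved_things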
This is what we saw when we ran the script:
{"\357\273\2771"=>"Thing1", "2"=>"Thing2"}
A bit of googling suggests that "\357\273\277" represents a UTF-8 byte order mark which apparently isn’t actually needed anyway:
The Unicode Standard does permit the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 so in UTF-8 the BOM serves only to identify a text stream or file as UTF-8.
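To make the failure concrete, here is a minimal sketch of why the first row's lookup missed. It assumes a modern Ruby with UTF-8 string literals (rather than the Ruby 1.8 setup used above) and uses a hand-built hash as a stand-in for the data read from things.csv:

```ruby
# The UTF-8 BOM is the code point U+FEFF, encoded as the bytes EF BB BF.
BOM = "\uFEFF"

# Hypothetical data mirroring the hash printed by the script above.
saved_things = { "#{BOM}1" => "Thing1", "2" => "Thing2" }

p saved_things["1"]        # the lookup we expected to work => nil
p saved_things["#{BOM}1"]  # the key that was actually stored => "Thing1"

# The first three bytes of the first key are the BOM itself.
p saved_things.keys.first.bytes.first(3).map { |b| b.to_s(16) }  # => ["ef", "bb", "bf"]
```

Only the first key is affected because the BOM appears once, at the very start of the file.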
We’re not converting the CSV file back into any other format, so the following awk command can be used to clean it up:
awk '{if(NR==1)sub(/^\xef\xbb\xbf/,"");print}' things.csv > things.nobom.csv
If we use the hexdump tool we can see that the BOM has been removed:
Before:
$ hexdump things.csv
0000000 ef bb bf 31 7c 50 72 69 6d 61 72 69 65 73 0d 0a
...
After:
$ hexdump things.nobom.csv
0000000 31 7c 50 72 69 6d 61 72 69 65 73 0d 0a 31 30 7c
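The same cleanup could probably be done in Ruby instead of awk. The following is a rough sketch, assuming a Ruby version with String#b and String#byteslice (2.0 or later); strip_utf8_bom is a hypothetical helper name, not something from the original script:

```ruby
# The three bytes that make up a UTF-8 byte order mark.
UTF8_BOM = "\xEF\xBB\xBF".b

# Hypothetical helper: returns the given data without a leading UTF-8 BOM.
def strip_utf8_bom(data)
  bytes = data.b  # binary copy so the comparison is byte-for-byte
  bytes.start_with?(UTF8_BOM) ? bytes.byteslice(3..-1) : bytes
end

# Sample input shaped like the first two rows of the CSV file.
cleaned = strip_utf8_bom("\xEF\xBB\xBF1|Thing1\r\n2|Thing2\r\n")
p cleaned.force_encoding("UTF-8")  # => "1|Thing1\r\n2|Thing2\r\n"
```

To apply it to the file, something like File.binwrite("things.nobom.csv", strip_utf8_bom(File.binread("things.csv"))) should work. Ruby 1.9.2 and later can also skip the BOM at read time by opening the file with the mode "r:bom|utf-8".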
I was initially curious why Ruby and hexdump were printing different values, but it’s just a case of Ruby showing the octal version of the BOM whereas hexdump shows the hexadecimal version. The values translate like so:
Octal | Hexadecimal | Decimal
357 | EF | 239
273 | BB | 187
277 | BF | 191
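The translation above can be reproduced with a short sketch using Ruby's format, printing each BOM byte in octal, hexadecimal and decimal:

```ruby
# The three bytes of the UTF-8 byte order mark.
bom_bytes = "\xEF\xBB\xBF".bytes

# Print each byte as octal | hexadecimal | decimal.
bom_bytes.each do |byte|
  puts format("%o | %X | %d", byte, byte, byte)
end
# 357 | EF | 239
# 273 | BB | 187
# 277 | BF | 191
```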
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.