19 Aug 2015

Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)

I’ve previously written a couple of blog posts showing how to strip out the byte order mark (BOM) from CSV files to make loading them into Neo4j easier and today I came across another way to clean up the file using tail.

The BOM is 3 bytes long at the beginning of the file so if we know that a file contains it then we can strip out those first 3 bytes tail like this:

$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv

real	0m31.945s
user	0m31.370s
sys	0m0.518s

The -c command is described thus;

-c number
             The location is number bytes.

So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.

Although using tail is quite simple, it took 30 seconds to process a 300MB CSV file which might actually be slower than opening the file with a Hex editor and manually deleting the bytes!

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.