Unix: Stripping first n bytes in a file / Byte Order Mark (BOM)
I’ve previously written a couple of blog posts showing how to strip the byte order mark (BOM) from CSV files to make loading them into Neo4j easier. Today I came across another way to clean up the file, using tail.
The UTF-8 BOM is 3 bytes long (0xEF 0xBB 0xBF) and sits at the very beginning of the file, so if we know that a file contains it, we can strip out those first 3 bytes using tail like this:
$ time tail -c +4 Casualty7904.csv > Casualty7904_stripped.csv
real 0m31.945s
user 0m31.370s
sys 0m0.518s
The -c option is described thus:
-c number
The location is number bytes.
So in this case we start reading at byte 4 (i.e. skipping the first 3 bytes) and then direct the output into a new file.
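The tail command above assumes that the file really does start with a BOM; if it doesn't, we'd silently chop off the first 3 bytes of real data. As a rough sketch, we could check the first 3 bytes before stripping anything. The strip_bom function name is made up for illustration, and it assumes a head that supports -c (GNU and BSD versions do):

```shell
# Hypothetical helper: strip a UTF-8 BOM only if the file actually starts with one.
# Usage (file names taken from the example above):
#   strip_bom Casualty7904.csv Casualty7904_stripped.csv
strip_bom() {
  src="$1"
  dest="$2"
  # Dump the first 3 bytes as hex and squash the whitespace,
  # which gives "efbbbf" when a UTF-8 BOM is present.
  first_bytes=$(head -c 3 "$src" | od -An -tx1 | tr -d ' \n')
  if [ "$first_bytes" = "efbbbf" ]; then
    # BOM found: start copying at byte 4.
    tail -c +4 "$src" > "$dest"
  else
    # No BOM: copy the file unchanged.
    cp "$src" "$dest"
  fi
}
```

This only looks for the UTF-8 BOM; files in other encodings (UTF-16, for example) use different marker bytes and would be copied across untouched.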
Although using tail is quite simple, it took just over 30 seconds to process a 300MB CSV file, which might actually be slower than opening the file with a hex editor and manually deleting the bytes!
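Since the title promises the first n bytes in general, here's another approach that might be worth trying: let dd swallow the leading bytes and have cat copy the rest. This is only a sketch; skip_first_bytes is a made-up name, n has to be at least 1, and I haven't timed it against the file above, so whether it beats tail will depend on your system.

```shell
# Hypothetical helper: copy a file, skipping its first n bytes (n >= 1).
# Usage (file names taken from the example above):
#   skip_first_bytes 3 Casualty7904.csv Casualty7904_stripped.csv
skip_first_bytes() {
  n="$1"
  src="$2"
  dest="$3"
  {
    # dd reads one block of n bytes from the shared input and throws it away...
    dd bs="$n" count=1 of=/dev/null 2>/dev/null
    # ...then cat copies whatever is left from the same file position.
    cat
  } < "$src" > "$dest"
}
```

Both commands read from the same open file, so cat picks up exactly where dd stopped. That relies on the input being a regular file rather than a pipe, where a single read may return fewer than n bytes.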
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.