Neo4j: LOAD CSV - java.io.InputStreamReader there's a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn't supported.
I recently came across the last.fm dataset via Ben Frederickson's blog and thought it'd be an interesting one to load into Neo4j and explore.
I started with a simple query to parse the CSV file and count the number of rows:
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv" AS row FIELDTERMINATOR "\t" return COUNT(*) At java.io.InputStreamReader@4d307fda:6484 there's a field starting with a quote and whereas it ends that quote there seems to be character in that field after that ending quote. That isn't supported. This is what I read: 'weird al"'
This blows up because (as the message says) we've got a field which uses double quotes but then has other characters either side of the quotes.
A quick search through the file reveals one of the troublesome lines:
$ grep "\"weird" lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv | head -n 1 0015371426d2cbef354b2f680340de38d0ebd2f0 7746d775-9550-4360-b8d5-c37bd448ce01 "weird al" yankovic 4099
I ran a file containing only that line through CSV Lint to see what it thought and indeed it is invalid:
Let's clean up our file to use single quotes instead of double quotes and try the query again:
$ tr "\"" "'" < lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv > lastfm-dataset-360K/clean.tsv
LOAD CSV FROM "file:///Users/markneedham/projects/neo4j-recommendations/lastfm-dataset-360K/clean.tsv" as row FIELDTERMINATOR "\t" return COUNT(*) 17559530
And we're back in business! Interestingly Python's CSV reader chooses to strip out the double quotes rather than throw an exception:
import csv with open("smallWeird.tsv", "r") as file: reader = csv.reader(file, delimiter="\t") for row in reader: print row
$ python explore.py ['0015371426d2cbef354b2f680340de38d0ebd2f0', '7746d775-9550-4360-b8d5-c37bd448ce01', 'weird al yankovic', '4099']
I prefer LOAD CSV's approach but it's an interesting trade off I hadn't considred before.