Data Science: Don't filter data prematurely
Last year I wrote a post describing how I’d gone about getting data for my ThoughtWorks graph, and in retrospect one mistake in my approach was that I filtered the data too early.
My workflow looked like this (sketched below):

- Scrape internal application using web driver and save useful data to JSON files
- Parse JSON files and load nodes/relationships into neo4j
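To make the problem concrete, here’s a rough sketch of what that first step looked like. The URL, id range, and CSS selectors are all hypothetical - the point is that the filtering is baked into the scraper, so every change of mind about which fields matter means another multi-hour scrape:

```python
# Sketch of the original step 1 - URL, ids, and selectors are hypothetical.
import json

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()

records = []
for person_id in range(1, 500):  # hypothetical range of pages to visit
    driver.get(f"http://internal-app/people/{person_id}")
    # The premature filtering: only the fields I guessed were useful survive;
    # everything else on the page is thrown away.
    records.append({
        "name": driver.find_element(By.CSS_SELECTOR, ".name").text,
        "role": driver.find_element(By.CSS_SELECTOR, ".role").text,
    })

driver.quit()

with open("people.json", "w") as f:
    json.dump(records, f)
```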
The problem with the first step is that I was trying to determine up front which data was useful, and as a result I ended up running the scraping application multiple times when I realised I didn’t have all the data I wanted.
Since it took a couple of hours to run each time this was tremendously frustrating, but it took me a while to realise how flawed my approach was.
For some reason I kept tweaking the scraper just to get a little bit more data each time!
It wasn’t until Ashok and I were doing some similar work and had to extract data from an existing database that I realised the filtering didn’t need to be done so early in the process.
We weren’t sure exactly what data we needed, so on this occasion we extracted everything around the area we were working in and figured out how to actually use it at a later stage.
Given that it’s relatively cheap to store the data, I think this approach makes sense more often than not - we can always delete the data if we realise it’s not useful to us at a later stage.
It especially makes sense if it’s difficult to get more data, either because doing so is time consuming or because we need someone else to give us access and their time is constrained.
If I could rework that workflow it’d now be split into three steps (sketched below):
- Scrape internal application using web driver and save pages as HTML documents
- Parse HTML documents and save useful data to JSON files
- Parse JSON files and load nodes/relationships into neo4j
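Here’s a minimal sketch of how those three steps might look, again with hypothetical URLs, selectors, labels, and credentials. The slow scrape in step 1 keeps the raw HTML, so steps 2 and 3 can be re-run in seconds whenever I change my mind about which fields are useful:

```python
import json
import pathlib

from bs4 import BeautifulSoup
from neo4j import GraphDatabase
from selenium import webdriver

# Step 1: scrape once and keep everything, deferring the "what's useful?" decision.
pathlib.Path("html").mkdir(exist_ok=True)
driver = webdriver.Firefox()
for person_id in range(1, 500):  # hypothetical range of pages to visit
    driver.get(f"http://internal-app/people/{person_id}")
    pathlib.Path(f"html/{person_id}.html").write_text(driver.page_source)
driver.quit()

# Step 2: parse the saved HTML into JSON - cheap to re-run for more fields.
records = []
for page in sorted(pathlib.Path("html").glob("*.html")):
    soup = BeautifulSoup(page.read_text(), "html.parser")
    records.append({
        "name": soup.select_one(".name").get_text(strip=True),
        "role": soup.select_one(".role").get_text(strip=True),
    })
pathlib.Path("people.json").write_text(json.dumps(records))

# Step 3: load nodes into neo4j (connection details are placeholders).
with GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password")) as db:
    with db.session() as session:
        for record in json.loads(pathlib.Path("people.json").read_text()):
            session.run(
                "MERGE (p:Person {name: $name}) SET p.role = $role",
                name=record["name"], role=record["role"],
            )
```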
I think my experiences tie in reasonably closely with those I heard about at Strata Conf London, but of course I may well be wrong, so if anyone has other points of view I’d love to hear them.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.