28 Jun 2014

Data Science: Mo' Data Mo' Problems

Over the last couple of years I’ve worked on several proof-of-concept style Neo4j projects, and on a lot of them people have wanted to work with their entire data set, which I don’t think makes sense so early on.

In the early parts of a project we’re trying to prove out our approach rather than prove we can handle big data - something that Ashok taught me a couple of years ago on a project we worked on together.

In a Neo4j project that means coming up with an effective way to model and query our data. If we lose track of this, it’s very easy to get sucked into working on the big data problem.

This could mean optimising our import scripts to deal with huge amounts of data or working out how to handle different aspects of the data (e.g. variability in shape or encoding) that only seem to reveal themselves at scale.

These are certainly problems that we need to solve, but in my experience they end up taking much more time than expected and therefore aren’t the best problems to tackle when time is limited. Early on we want to create some momentum and keep the feedback cycle fast.
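One way to keep that feedback cycle fast is to prototype against a random sample of the data rather than the whole thing. The following is only a minimal sketch, assuming the source data is a CSV export with a header row; the `sample_rows` function and the `id,name` columns are made up for illustration and aren't from any particular project. It uses reservoir sampling, so the full file is read once and never held in memory.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=42):
    """Return the header and a uniform random sample of up to k data rows.

    Reservoir sampling: a single pass over the input, keeping at most
    k rows in memory at any time.
    """
    reader = csv.reader(lines)
    header = next(reader)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return header, reservoir


# Hypothetical stand-in for a large export file.
full_export = io.StringIO(
    "id,name\n" + "".join(f"{n},person-{n}\n" for n in range(10_000))
)
header, subset = sample_rows(full_export, k=100)
print(header, len(subset))
```

The sampled rows can then be written out and fed into whatever import script we're using, which keeps the modelling and querying work quick to iterate on. A uniform sample of rows won't necessarily preserve relationships between records, so for graph data we might prefer to pick a connected slice instead.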

We probably want to tackle the data size problem as part of the implementation/production stage of the project, to use Michael Nygard’s terminology.

At this stage we’ll have some confidence that our approach makes sense and then we can put aside the time to set things up properly.

I’m sure there are some types of projects where this approach doesn’t make sense, so I’d love to hear about them in the comments so I can spot them in future.

About the author

I'm currently working on short-form content at ClickHouse. I publish short 5-minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.