9 Dec 2012

Data Science: Discovery work

Aaron Erickson recently wrote a blog post in which he talks through some of the problems he’s seen with big data initiatives, where organisations end up buying a product and expecting it to magically produce results.

[…] corporate IT departments are suddenly are looking at their long running “Business Intelligence” initiatives and wondering why they are not seeing the same kinds of return on investment. They are thinking… if only we tweaked that “BI” initiative and somehow mix in some “Big Data”, maybe we could become the next Amazon.

He goes on to suggest that a more 'agile' approach might be more beneficial: a small team drives the work from a business problem in a short discovery exercise. We can then build on top of that if we’re seeing good results.

A few months ago Ashok and I were doing this type of work for one of our clients, and afterwards we tried to summarise how it differed from a normal project.

Hacker Mentality

Since the code we’re writing is almost certainly going to be thrown away, it doesn’t make sense to spend a lot of time making it beautiful. It just needs to work.

We didn’t spend any time setting up a continuous integration server or a centralised source control repository since there were only two of us. These things make sense when you have a bigger team and more time, but for this type of work they feel like overkill.

Most of the code we wrote was in Ruby because that was the language in which we could hack together something useful in the least amount of time, but I’m sure others could go just as fast in other languages. We did, however, end up moving some of the code to Java later on, once we realised the performance gains we’d get from doing so.

2 or 3 hour 'iterations'

As I mentioned in a previous post, we took the approach of finding questions that we wanted answers to and then spending a few hours working on those before talking to our client again.

Since we don’t really know what the outcome of our discovery work is going to be, we want to be able to change direction quickly and not go down too many rabbit holes.

1 or 2 weeks in total

We don’t have any data to prove this, but it seems like you’d need a week or two to iterate through enough ideas to have a reasonable chance of coming up with something useful.

It took us 4 days before we zoomed in on something that was useful to the client and allowed them to learn something that they didn’t previously know.

If we do find something worth pursuing then we’d want to bake that work into the normal project backlog and treat it the same as any other piece of work, driven by priority and so on.

Small team

You could argue that small teams are beneficial all the time, but it’s especially the case here if we want to keep the feedback cycle tight and the communication overhead low.

Our thinking was that 2 or 3 people would probably be sufficient: 2 developers and perhaps 1 person with a UX background to help with any visualisation work.

If the domain was particularly complex then that 3rd person could be someone with experience in that area who could help derive useful questions to answer.

About the author

I'm currently working on short-form content at ClickHouse. I publish short 5-minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.