
Micro Services Style Data Work Flow

Having worked on a few data-related applications over the last ten months or so, Ashok and I were recently discussing some of the things that we've learnt.

One of the things he pointed out is that it's very helpful to separate the different stages of a data work flow into their own applications/scripts.

I decided to try out this idea with some football data that I'm currently trying to model, and I ended up with the following stages:

[Diagram: data work flow stages]

The stages do the following:

It's reasonably similar to micro services, except that instead of using HTTP as the protocol between the different parts, we use text files as the interface between the different scripts.

In fact, it's more like a variation on Unix pipelining, as described in The Art of Unix Programming, except that we store the results of each stage of the pipeline instead of piping them directly into the next one.
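
To make that concrete, here's a minimal sketch of what one of those stages might look like. The file names and the shape of the match data are invented for illustration; the point is just that each stage reads one text file and writes another, so any stage can be re-run on its own.

    # extract_match_data.rb - a hypothetical middle stage of the pipeline.
    # It reads the previous stage's output from one text file and writes
    # its own results to another text file for the next stage to pick up.
    require 'json'

    input_file  = "crawled_matches.json"   # assumed output of the crawling stage
    output_file = "extracted_matches.tsv"  # assumed input for the import stage

    matches = JSON.parse(File.read(input_file))

    File.open(output_file, "w") do |file|
      matches.each do |match|
        file.puts [match["home_team"], match["away_team"], match["score"]].join("\t")
      end
    end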

If following the Unix way isn't enough of a reason to split up the problem like this, there are a few other reasons why this approach is useful:

That third advantage became clear to me on Saturday when I realised that waiting 3 minutes for the import stage to run each time was becoming quite frustrating.

All node/relationship creation was happening via the REST interface from a Ruby script since that was the easiest way to get started.
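
As a rough idea of what that can look like, here's a sketch that assumes the graph database is Neo4j (which the REST interface and batch importer suggest) and uses the neography gem to talk to it; the node properties are made up:

    require 'neography'

    neo = Neography::Rest.new("http://localhost:7474")

    # Each create_* call is a separate HTTP request, which is convenient
    # to get started with but slow once there are lots of matches.
    home  = neo.create_node("name" => "Manchester United")
    away  = neo.create_node("name" => "Chelsea")
    match = neo.create_node("date" => "2012-09-02", "score" => "3-2")

    neo.create_relationship("home_team", match, home)
    neo.create_relationship("away_team", match, away)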

I was planning to plug in some Java code using the batch importer to speed things up, until Ashok pointed me to a CSV-driven batch importer which seemed like it might be even better.

That batch importer takes CSV files of nodes and edges as its input, so I needed to add another stage to the work flow if I wanted to use it:

[Diagram: data work flow with the new 'Extract to CSV' stage]
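
If I've understood the CSV-driven importer correctly, it reads a tab-separated file of nodes and another of relationships, with the relationships referring to nodes by their position in the nodes file. The layout below is my best guess at the format rather than something taken from its documentation:

    nodes.csv (tab-separated)

    name                 type
    Manchester United    team
    Chelsea              team
    2012-09-02           match

    rels.csv (tab-separated)

    start    end    type
    3        1      home_team
    3        2      away_team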

I spent a few hours working on the 'Extract to CSV' stage and then replaced the initial 'Import' script with a call to the batch importer.
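
A stripped-down sketch of that kind of 'Extract to CSV' stage might look like this; the input file, property names and 1-based node positions are assumptions that would need checking against the importer's documentation:

    # extract_to_csv.rb - turn the extracted match data into the node and
    # relationship files for the batch importer.
    team_ids = {}
    nodes    = []
    rels     = []

    add_node = lambda do |name, type|
      nodes << [name, type]
      nodes.length                     # assuming the importer numbers nodes by line position
    end

    File.readlines("extracted_matches.tsv").each do |line|
      home, away, score = line.chomp.split("\t")

      home_id  = team_ids[home] ||= add_node.call(home, "team")
      away_id  = team_ids[away] ||= add_node.call(away, "team")
      match_id = add_node.call(score, "match")

      rels << [match_id, home_id, "home_team"]
      rels << [match_id, away_id, "away_team"]
    end

    File.open("nodes.csv", "w") do |f|
      f.puts ["name", "type"].join("\t")
      nodes.each { |node| f.puts node.join("\t") }
    end

    File.open("rels.csv", "w") do |f|
      f.puts ["start", "end", "type"].join("\t")
      rels.each { |rel| f.puts rel.join("\t") }
    end

The importer is then run from the command line against those two files and a target database directory; I believe the invocation is roughly 'java -jar batch-import.jar graph.db nodes.csv rels.csv', but the exact form depends on the version you're using.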

It now takes 1.3 seconds to go through the last two stages instead of 3 minutes for the old import stage.

Since all I added was another script that took a text file as input and created text files as output, it was really easy to make this change to the work flow.

I'm not sure how well this scales if you're dealing with massive amounts of data, but you can always split the data up into multiple files if the size becomes unmanageable.
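
For example, a stage's output file can be broken into fixed-size chunks that the next stage processes one at a time. A crude sketch, with the file name and chunk size picked arbitrarily:

    # split_stage_output.rb - split one stage's output into smaller files
    # so the next stage can work through them one at a time.
    lines = File.readlines("extracted_matches.tsv")

    lines.each_slice(100_000).with_index do |chunk, index|
      File.write("extracted_matches_#{index + 1}.tsv", chunk.join)
    end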
