Treat servers as cattle: Spin them up, tear them down
A while ago I wrote a post about treating servers as cattle, not as pets, in which I described the approach to managing virtual machines at uSwitch whereby we frequently spin up new ones and tear down the existing ones.
I’ve previously worked on teams where we talked about this mentality but ended up not doing it because it was difficult, usually for one of two reasons:
- Slow spin up - this might be due to the cloud provider’s infrastructure, doing too much on spin up, or I’m sure a variety of other reasons.
- Manual steps involved in spin up - the process isn’t 100% automated so we have to do some manual tweaks. Once the machine is finally working we don’t want to have to go through that again.
Martin Fowler wrote a post a couple of years ago where he said the following:
> One of my favorite soundbites is: if it hurts, do it more often. It has the happy property of seeming nonsensical on the surface, but yielding some valuable meaning when you dig deeper.
I think it applies in this context too, and I’ve noticed that the more frequently we tear down and spin up nodes, the easier it becomes to do so.
Part of this is because there’s been less time for changes to accumulate in package repositories, but we’re also more inclined to optimise things we have to do frequently, so the whole process gets faster as well.
For example, in one of our sets of machines we need to give one machine a specific tag so that when the application is deployed it sets up a bunch of cron jobs to run each evening.
Initially this was done manually and we were quite reluctant to ever tear down that machine, but we’ve now got it all automated and it’s not a big deal anymore - it can be cattle just like the rest of them!
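For what it’s worth, a step like that is fairly easy to script. The sketch below shows one way it might look, assuming AWS EC2 and boto3 - the post doesn’t say which cloud or tooling we used, and the `role` and `cron-runner` tag names are made up for illustration.

```python
# A minimal sketch of automating the "one machine gets the cron tag" step,
# assuming AWS EC2 and boto3. The tag names here are hypothetical.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

def ensure_single_cron_runner(role: str) -> None:
    """Make sure one running instance with the given role carries the
    hypothetical 'cron-runner=true' tag that the deploy script looks for."""
    reservations = ec2.describe_instances(
        Filters=[
            {"Name": "tag:role", "Values": [role]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]
    if not instances:
        raise RuntimeError(f"no running instances found for role {role!r}")

    # If a machine is already tagged there's nothing to do.
    for instance in instances:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        if tags.get("cron-runner") == "true":
            return

    # Otherwise pick one (the oldest, arbitrarily) and tag it.
    chosen = min(instances, key=lambda i: i["LaunchTime"])
    ec2.create_tags(
        Resources=[chosen["InstanceId"]],
        Tags=[{"Key": "cron-runner", "Value": "true"}],
    )

if __name__ == "__main__":
    ensure_single_cron_runner("web")
```

Running something like this as part of provisioning means no single machine is special - whichever node ends up with the tag gets the cron jobs on the next deploy.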
One neat rule of thumb Phil taught me is that whenever we make major changes to our infrastructure we should spin up some new machines to check that the process still actually works.
If we don’t do this, then when we actually need to spin up a new node because of a traffic spike or a corrupted machine, it’s not going to work and we’ll have to fix things in a much more stressful context.
For example, we recently moved some repositories around on GitHub and, although it’s a fairly simple change, spinning up new nodes helped us see all the places where we’d failed to make the corresponding update.
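That kind of check can also be partly automated. Here’s a minimal sketch of a post-provisioning smoke test, assuming the freshly spun up node exposes an HTTP health endpoint - the hostname, port and path are hypothetical rather than anything from our setup.

```python
# A minimal sketch of the "spin one up and check it still works" habit,
# assuming a new node exposes an HTTP health endpoint. Hostname, port and
# path are hypothetical.
import sys
import time
import urllib.error
import urllib.request

def wait_until_healthy(host: str, port: int = 80, path: str = "/health",
                       timeout_secs: int = 600, poll_secs: int = 15) -> bool:
    """Poll the new node's health endpoint until it answers 200 or we give up."""
    url = f"http://{host}:{port}{path}"
    deadline = time.time() + timeout_secs
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    print(f"{url} is healthy")
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet, keep polling
        time.sleep(poll_secs)
    print(f"{url} never became healthy within {timeout_secs}s")
    return False

if __name__ == "__main__":
    # e.g. python smoke_test.py new-node.example.internal
    sys.exit(0 if wait_until_healthy(sys.argv[1]) else 1)
```

Wiring a check like this into the spin up script turns “does a new node actually work?” into something we find out immediately rather than during an incident.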
While I appreciate that this approach is more time consuming in the short term, I’d argue that if we automate away as much of the pain as possible it will probably be beneficial in the long run.
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.