Treating servers as cattle, not as pets
Although I didn’t go to Dev Ops Days London earlier in the year I was following the hash tag on twitter and one of my favourites things that I read was the following:
“Treating servers as cattle, not as pets” #DevOpsDays
I think this is particularly applicable now that a lot of the time we’re using virtualised production environments via AWS, Rackspace or
At uSwitch we use AWS and over the last week Sid and I spent some time investigating a memory leak by running our applications against two different versions of Ruby.
One of them was from the Brightbox repository and the other was custom built but they had annoyingly different puppet configurations so we decided to treat them as separate machine types.
We spun up one of the custom built Ruby nodes and put it in the load balancer alongside 11 of the other node types and left it for the day serving traffic.
The next day we had look at the New Relic memory consumption for both node types and it was clear that the custom built one’s memory usage was climbing much more slowly than the other one.
Instead of trying to work out how to change the Ruby version of the 11 existing nodes we realised it would probably be quicker to just spin up 11 new ones with the custom built Ruby and swap them with the existing ones.
This was pretty much as easy as removing the existing nodes from the load balancer and putting the new ones in although we do have one 'special' machine which runs some background jobs.
We needed to make sure there weren’t any jobs on its queue that hadn’t been processed and then make sure that we tagged one of the new machines so that they could take over that role.
One thing that made it particularly easy for us to do this is that spin up of new VMs is extremely quick and completely automated including the installation and start up of applications.
The only manual step we have is to put the new nodes into the load balancer which I think works ok as a manual step because it gives us a chance to quickly scan the box and check everything spun up correctly.
We install all packages/configuration on nodes using puppet headless which makes spin up easier than if you use server/client mode where you have to coordinate node registration with the master on spin up.
I do like this philosophy to machines and although I’m sure it doesn’t apply to all situations we’re almost at the point where if something breaks on a node we might as well spin up a new one while we’re investigating and see which finishes first!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.