Multi-AZ - the only defensible choice

04/08/2014

Quite a lot gets made online whenever AWS has an outage. Even minor disruptions in a single availability zone get reported on Twitter as if they portend the apocalypse. When an AZ has problems it can certainly be annoying, but it should not have any actual impact on your application. At Craftsy, we went multi-AZ almost from the beginning (almost!) and have been pretty resilient to AWS outages ever since.

Amazon structures their cloud offering around two concepts: Regions and Availability Zones. A Region is a general location (N. Virginia, Oregon, Ireland...). A Region contains Availability Zones, and each Availability Zone comprises one or more physical datacenters. Certain resources in AWS are constrained to the Availability Zone they were born in (EC2 instances and EBS volumes are two examples). AWS is engineered to contain network or facilities problems within the AZ where they originate.

[Diagram: Az-template]
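If you want to poke at this from code, here's a minimal sketch using boto3 - the region name is just an example, point it wherever you actually run:

    import boto3

    # List the Availability Zones your account can see in one region.
    # The region name here is an example; substitute your own.
    ec2 = boto3.client("ec2", region_name="us-east-1")

    for zone in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(zone["ZoneName"], zone["State"])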

In practice you should expect as certainly as you expect the sun to rise:

  • an AZ will have problems, and those problems (be they EC2, EBS, network, anything) will affect you
  • a problem in one AZ will have ripple effects in other AZs as customers flock to launch resources in the unaffected AZs
  • an entire region will go down. Sooner or later a comet will hit Northern Virginia.

The beginning is a very delicate time

Whether you are just playing with AWS to learn, building your startup, or planning a migration, the choices you make first are the choices you will live with. It is easy to throw up an instance, put some code on it, and call it a website. This is not a good choice.

[Diagram: Starters]

What happens when the AZ you have resources in has problems? You're down. Why? Because while hosting in AWS (or any cloud offering) is more resilient than a single server in a datacenter, the cloud is not some magic bullet that removes entropy from the system. Things will break, and that's gonna be OK.

The above architecture, while quick, is not a good life choice. Start with something more like this:

[Diagram: Table stakes]

This is a resilient architecture:

  • Two app servers, with an Elastic Load Balancer sitting on top to distribute traffic - if either fails, or there is an AZ issue in 1a or 1b, we're not worried (there's a sketch of wiring this up after the list)
  • A master/slave replication setup is in place, allowing us to fail over quickly if the master datastore in 1a goes down
  • We have assets in all three AZs, which gives us options in the case of multi-AZ issues.
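Here's a rough sketch of standing up that load balancer with boto3 - the load balancer name, zone names, and instance IDs are all placeholders, not anything from our actual setup:

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")  # classic ELB API

    # A load balancer that spans two Availability Zones.
    # (In a VPC you would pass Subnets instead of AvailabilityZones.)
    elb.create_load_balancer(
        LoadBalancerName="web-elb",
        Listeners=[{
            "Protocol": "HTTP",
            "LoadBalancerPort": 80,
            "InstanceProtocol": "HTTP",
            "InstancePort": 80,
        }],
        AvailabilityZones=["us-east-1a", "us-east-1b"],
    )

    # Register one app server from each AZ behind it.
    elb.register_instances_with_load_balancer(
        LoadBalancerName="web-elb",
        Instances=[{"InstanceId": "i-0aaa1111"}, {"InstanceId": "i-0bbb2222"}],
    )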

Whether you're using a conventional SQL datastore or a NoSQL one, the idea is the same. Setting up replication is well documented for most datastores. If you don't want to care, check out Amazon's RDS - it's reliable, easy, and not much more expensive than running on a regular instance.
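If you do go the RDS route, multi-AZ is a single flag at creation time. A minimal sketch - the engine, instance class, names, and credentials below are placeholders:

    import boto3

    rds = boto3.client("rds", region_name="us-east-1")

    # MultiAZ=True has RDS keep a synchronous standby in another AZ and
    # handle failover for you. Every identifier here is a placeholder.
    rds.create_db_instance(
        DBInstanceIdentifier="app-db",
        DBInstanceClass="db.m1.large",
        Engine="mysql",
        AllocatedStorage=100,
        MasterUsername="admin",
        MasterUserPassword="change-me",
        MultiAZ=True,
    )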

It is pretty easy to set up an ELB. You'll have to choose between keeping state with session-based stickiness or some other way, which admittedly does make life a little harder. In for a penny, in for a pound.
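If you do lean on the ELB for stickiness, it's just a cookie policy attached to the listener. A sketch, with made-up names and an arbitrary expiration:

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    # An ELB-managed cookie stickiness policy, attached to the port 80 listener.
    elb.create_lb_cookie_stickiness_policy(
        LoadBalancerName="web-elb",
        PolicyName="web-sticky",
        CookieExpirationPeriod=3600,  # seconds; pick whatever suits your sessions
    )
    elb.set_load_balancer_policies_of_listener(
        LoadBalancerName="web-elb",
        LoadBalancerPort=80,
        PolicyNames=["web-sticky"],
    )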

The Security Groups (SG) are a Good. Idea. While they don't add any specific AZ failure tolerance, setting up multi-tiered security now will pay dividends down the road. Once in place, you can scale as many new instances as you need in the app tier without having to touch the SG.
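The tiering amounts to letting the data tier accept traffic only from the app-tier group rather than from individual hosts. A sketch - group names, the VPC ID, and the MySQL port are all assumptions:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # One group per tier. The VPC ID is a placeholder.
    app_sg = ec2.create_security_group(
        GroupName="app-tier", Description="app servers", VpcId="vpc-12345678")
    db_sg = ec2.create_security_group(
        GroupName="db-tier", Description="datastores", VpcId="vpc-12345678")

    # The DB tier only takes MySQL traffic from members of the app-tier group,
    # so new app instances get access just by launching into that group.
    ec2.authorize_security_group_ingress(
        GroupId=db_sg["GroupId"],
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 3306,
            "ToPort": 3306,
            "UserIdGroupPairs": [{"GroupId": app_sg["GroupId"]}],
        }],
    )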

Oh god, everything costs money 

A separate argument is occasionally made here that designing for resiliency is expensive because you have to have multiple instances. Folks hem and haw, squirm and whine, and determine that because things cost money, you shouldn't do them. 

If you work with someone like that, quit. Or fire them. You cannot operate a business at $0 operational costs, so don't pretend to. Make the right decisions, and spend wisely. Besides, we're not talking about big bucks here. 

So, you can probably guess that the multi-AZ approach outlined above is more expensive than a single box. It is, in fact, four times as expensive! Let's pretend you're using m1.large instance types. One box is $175.68/month:

[Figure: Hobo]

4 hosts, $702.72:

[Figure: Not-hobo]
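For reference, the arithmetic behind those numbers - the hourly rate and hours-per-month are my assumptions for illustration, so check current pricing for your region:

    # Back-of-the-envelope cost check for on-demand m1.large instances.
    # The $0.24/hr rate and 732-hour month are assumptions, not quotes.
    hourly_rate = 0.24
    hours_per_month = 732

    one_box = hourly_rate * hours_per_month
    four_boxes = 4 * one_box

    print(f"1 host:  ${one_box:.2f}/month")     # ~$175.68
    print(f"4 hosts: ${four_boxes:.2f}/month")  # ~$702.72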

Now, a couple thoughts here. $700/mo is quite a lot if you're bootstrapping alone, I get that. However, you can structure this so you get quite a bit of extra value from the infrastructure to offset the additional cost.

Use the replicated db slave as a reporting host, a nightly rollup box, or an additional read head for your app. Chances are good that in the early early days your coworkers are hammering the DB asking how many tens of dollars you've made. Why not give them somewhere to do that without risking production?

With multiple app servers, you have a built-in safety valve for your code release. Take one of them out of rotation, drop your code there. Test. Put it back. Push to the other. Test. Repeat. 
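In boto3 terms, "out of rotation" is just deregistering the instance from the ELB before you push and registering it again afterwards. A sketch - the names and IDs are placeholders, and deploy() stands in for however you actually ship code:

    import boto3

    elb = boto3.client("elb", region_name="us-east-1")

    def deploy(instance_id):
        # Placeholder for your real release mechanism (rsync, Capistrano, etc.)
        pass

    for instance_id in ["i-0aaa1111", "i-0bbb2222"]:
        # Pull the box out of rotation so the ELB stops sending it traffic.
        elb.deregister_instances_from_load_balancer(
            LoadBalancerName="web-elb",
            Instances=[{"InstanceId": instance_id}])

        deploy(instance_id)  # push and test the new code on the idle box

        # Put it back; the ELB resumes traffic once health checks pass.
        elb.register_instances_with_load_balancer(
            LoadBalancerName="web-elb",
            Instances=[{"InstanceId": instance_id}])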

In a real pinch, you can probably run your DB on your app server, or vice versa to trim the cost down a bit. Certainly, in the early days, we ran on a couple of m1.large's and a lot of hope. 

But really, the point here is simple - the value you derive from doing things the right way is resiliency to an outage. At whatever price point, and at whatever cost, you have to deploy resources in multiple availability zones.

Here at Craftsy, we're an AWS shop, but the ideas I've put together apply to any hosting relationship you have. Rackspace, Google, and Azure all have similar concepts. Even the little VPS outfits usually have more than one datacenter, and standard co-lo providers will offer multiple facilities. Whatever your hosting choice:

You must walk the earth, utterly certain that whatever provider you choose, there's gonna be trouble.

You must walk the earth, utterly confident that you have planned for this eventuality and when trouble comes, you're not gonna care.

 
