Auto Scaling and Rolling Updates on AWS

Why auto scaling?

Auto Scaling Groups are really useful and deliver on several of the promises of the cloud:

  • High availability by automatically replacing unhealthy instances
  • Elasticity by throwing in more instances when more horsepower is required

These are the two major selling points of Auto Scaling Groups and that is pretty good stuff, but there is more that they can do for you. Together with CloudFormation we can automate our deployment and even perform seamless rolling updates. This opens new doors! Need to deploy a new version of your application or maybe apply security patches? Just replace the instances. Amazon even patches the standard AMIs for you. Is an instance misbehaving? Just terminate it and the Auto Scaling Group will start up a new one for you. Treat your servers as cattle rather than pets. Sounds good? Let’s look at how this is done.

Building blocks

The key component here is the Auto Scaling Group containing the EC2 instances. The instances sit in two different Availability Zones with an Elastic Load Balancer on top distributing the traffic between them. We use CloudFormation to create a blueprint of our stack. We have an S3 bucket containing the blueprint, our bootstrap script and the application we want to deploy onto the servers. We use one of Amazon's Windows AMIs and put our configuration and application on it during first boot.

diagram

Let’s look at some concepts that we need to understand.

Auto Scaling Group

The Auto Scaling Group, ASG, is the key component in our example. It contains information about how to scale your application, what to launch the instances from and how to perform updates.

Launch Configuration

The ASG launches new instances from a Launch Configuration. It contains information about which AMI to use, instance type, which Security Group to put them in and so on. Basically, the same information you need to provide when launching an instance manually.

Elastic Load Balancing

The ELB distributes traffic between the instances in the ASG. It also performs health checks and reports back to the ASG if it finds an unhealthy instance.

CloudFormation

CloudFormation allows you to build blueprints of your infrastructure in JSON format. You create stacks from your templates, but you can also change your infrastructure by updating the template and performing a stack update. We will trigger a stack update to roll out configuration changes to our servers and deploy new versions of our applications.

Bootstrapping

Bootstrapping can be accomplished by passing User Data to the instances. We will pass a very simple PowerShell script that just goes to our S3 bucket, downloads our central bootstrap script, runs it and passes a message back to CloudFormation that it has completed. Any configuration we want to make to our instances is kept in our central bootstrap script. If we want to make any changes to our servers we just update the script, trigger a stack update and CloudFormation will take care of the rest for us.

Setting this up

IAM Role

First of all, we will create an IAM role that our instances will be assigned to. This role will give the instances read access to our S3 bucket. I start by creating a policy which I'm calling S3-Read-Bootstrap. It has the following statement and, as you may notice, my bucket is called "cristian-bootstrap".

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::cristian-bootstrap/*"
        }
    ]
}

I then create a role called “web-server-role” which I attach the policy to.

iam-role
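
If you prefer to script this part, the same thing can be done with the AWS CLI. Below is a minimal sketch, assuming the policy document above is saved as s3-read-bootstrap.json, that a standard EC2 trust policy (allowing ec2.amazonaws.com to assume the role) is saved as ec2-trust.json, and that 123456789012 is a placeholder for your account ID:

# Policy allowing read access to the bootstrap bucket (the document shown above)
aws iam create-policy --policy-name S3-Read-Bootstrap \
    --policy-document file://s3-read-bootstrap.json

# Role that EC2 instances can assume
aws iam create-role --role-name web-server-role \
    --assume-role-policy-document file://ec2-trust.json

# Attach the policy to the role (replace the account ID with your own)
aws iam attach-role-policy --role-name web-server-role \
    --policy-arn arn:aws:iam::123456789012:policy/S3-Read-Bootstrap

# Instance profile with the same name, so the launch configuration can reference it
aws iam create-instance-profile --instance-profile-name web-server-role
aws iam add-role-to-instance-profile --instance-profile-name web-server-role \
    --role-name web-server-role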

S3 bucket

I have a bucket called cristian-bootstrap and it contains only three files.

  • bootstrap.ps1 – my central bootstrap script, which installs IIS and our application
  • index.html – a simple web page that will act as our "application"
  • web-stack.json – our CloudFormation template containing the blueprint
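
If you want to script the upload, a simple AWS CLI sketch could look like this (assuming the three files are in your current directory and the bucket does not exist yet):

# Create the bucket and upload the three files
aws s3 mb s3://cristian-bootstrap --region eu-west-1
aws s3 cp bootstrap.ps1 s3://cristian-bootstrap/bootstrap.ps1
aws s3 cp index.html s3://cristian-bootstrap/index.html
aws s3 cp web-stack.json s3://cristian-bootstrap/web-stack.json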

CloudFormation template

This is where the meat is. Here we “program” our infrastructure rather than clicking around in the console. If you haven’t used CloudFormation, I strongly recommend that you familiarize yourself with it. There are template snippets which are a good starting point. Head over to the template reference if you need details about specific resource types. You may also want to have a look at a JSON tutorial.

The file contains a section for parameters and one with resources. We are defining one parameter, a launch configuration, a load balancer and an auto scaling group.

Parameters

Here we specify which parameters we need to create our stack. It could be any input you want to pass to your template. In our case we take just one parameter, and that is the build number. We will use this parameter to determine whether there is a new version of our application to deploy. Let's say there is nothing to change on our infrastructure but we still want to deploy a new version of our application; this parameter is how we tell CloudFormation that. The parameter is called BuildNumber, of type String, and if nothing is specified it will default to "0.1".

{
  "Parameters": {
    "BuildNumber": {
      "Type": "String",
      "Default": "0.1",
      "Description": "Number of the build being deployed"
    }
  },

Next are the resources. Let’s have a look at them one by one.

Launch Configuration

First we are saying that this resource is called LaunchConfig and that it is of type Launch Configuration. We then specify properties like AMI, instance type, key pair, security group and IAM role (remember the one we created in the beginning?). We also have the user data, which may look a little bit funny but makes more sense once we see past the formatting.

"LaunchConfig": {
  "Type": "AWS::AutoScaling::LaunchConfiguration",
  "Properties": {
    "ImageId": "ami-9ebb39ed",
    "InstanceType": "t2.micro",
    "KeyName": "cristian-ew-key",
    "IamInstanceProfile": "web-server-role",
    "SecurityGroups": [ "sg-60ef3904" ],
    "UserData": {
      "Fn::Base64": {
        "Fn::Join": [
          "",
          [
            "<powershell>\n",
            "New-Item -Path C:\\Bootstrap -ItemType directory\n",
            "Copy-S3Object -BucketName cristian-bootstrap -Key bootstrap.ps1 -LocalFile C:\\Bootstrap\\bootstrap.ps1\n",
            "C:\\Bootstrap\\bootstrap.ps1\n",
            "cfn-signal.exe -e 0 --stack ", { "Ref" : "AWS::StackName" }, " --resource AutoScalingGroup --region ", { "Ref" : "AWS::Region" }, "\n",
            "# ", { "Ref": "BuildNumber" }, "\n",
            "</powershell>"
          ]
        ]
      }
    }
  }
},

Below you see what the User Data will look like. First we are telling it that this is a PowerShell script. We start by creating a directory called C:\Bootstrap. The next line copies bootstrap.ps1, our central bootstrap script, from our S3 bucket to the newly created folder. We then run the bootstrap script we just downloaded. Once it completes, we send a signal to CloudFormation letting it know. More about this signal later.

<powershell>
New-Item -Path C:\Bootstrap -ItemType directory
Copy-S3Object -BucketName cristian-bootstrap -Key bootstrap.ps1 -LocalFile C:\Bootstrap\bootstrap.ps1
C:\Bootstrap\bootstrap.ps1
cfn-signal.exe -e 0 --stack MyWindowsStack --resource AutoScalingGroup --region eu-west-1
# 0.1
</powershell>

Elastic Load Balancer

Next comes our ELB configuration. It tells CloudFormation to create a load balancer and specifies the standard things you need whenever you create one: which security group, which subnets, that it is internet facing, what port to listen on, what port to talk to the instances on and what health check to run. I would normally use SSL offload in the ELB, but in this case we are skipping that for simplicity.

"LoadBalancer": {
  "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
  "Properties": {
    "SecurityGroups": [ "sg-60ef3904" ],
    "Scheme" : "internet-facing",
    "Subnets": [ "subnet-dd2eb784", "subnet-63f7a906" ],
    "Listeners": [
      {
        "InstancePort": "80",
        "InstanceProtocol": "HTTP",
        "LoadBalancerPort": "80",
        "Protocol": "HTTP"
      }
    ],
    "HealthCheck": {
      "Target": "HTTP:80/index.html",
      "Timeout": "5",
      "Interval": "30",
      "UnhealthyThreshold": "2",
      "HealthyThreshold": "2"
    }
  }
},

Auto Scaling Group

Lastly we create the ASG. We tell it which AZs to use, which launch configuration to launch instances from, and the minimum and maximum number of instances we want running. We also tell it that instances should register with the load balancer we created above. The HealthCheckType and HealthCheckGracePeriod are important properties. The default health check type is EC2, meaning that the ASG will only look at EC2 health to determine instance health: basically, is the server up? That is not good enough, as we want to make sure that our application is actually responding as expected, so we tell the ASG to use the ELB health check instead. The grace period is how long the ASG should give an instance to start up before checking its health. We are bootstrapping the machines so we need to give them a little bit of time. This value is important. If it is too short, the ASG will believe instances are broken even before they have completed the bootstrap process. I once set this far too low, went for lunch and forgot about it. The next day I discovered my ASG had been bringing up hundreds of instances, believing they were broken and throwing them away. Remember that the minimum charge for an instance is one hour, right?

The creation policy tells CloudFormation that it should wait for the instances to signal completion, i.e. wait for our bootstrap process to complete. That is what cfn-signal does in our User Data. Since we want 2 instances before we consider ourselves up and running, we set the count to "2". We also tell it to wait for up to 30 minutes for these signals to come.

The update policy says that we want to use rolling updates. If we make a change to our instances we want them replaced one by one, not continuing with the next until an OK signal has been received.

"AutoScalingGroup": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "Properties": {
    "AvailabilityZones": [ "eu-west-1a", "eu-west-1b" ],
    "LaunchConfigurationName": { "Ref": "LaunchConfig" },
    "MinSize": "2",
    "MaxSize": "4",
    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": "1800",
    "LoadBalancerNames": [ { "Ref": "LoadBalancer" } ]
  },
  "UpdatePolicy": {
    "AutoScalingRollingUpdate": {
      "MinInstancesInService": "1",
      "MaxBatchSize": "1",
      "WaitOnResourceSignals": "true",
      "PauseTime": "PT30M"
    }
  },
  "CreationPolicy": {
    "ResourceSignal": {
      "Count": "2",
      "Timeout": "PT30M"
    }
  }
}

Our application

For simplicity, our application is represented by a single HTML file. The important thing here is to understand that we deploy our application from our bootstrap script and that our application package sits in our S3 bucket. In a real-world scenario this probably means that you have your build system dropping the application packages into an S3 bucket. You could then specify a parameter in your CloudFormation template to tell it which application version to install. You could also tell it which environment this is (dev/test/prod) and have your bootstrap script pull down environment-specific configuration files. Anyway, in this example it's just a file called index.html.
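
As a rough sketch of how that could look, a build system might push each build under a versioned key, and the bootstrap script could then download the key matching the BuildNumber parameter. The layout below is purely hypothetical:

# Hypothetical layout: one prefix per build number, uploaded by the build system
aws s3 cp index.html s3://cristian-bootstrap/builds/2/index.html
# bootstrap.ps1 would then copy s3://cristian-bootstrap/builds/<BuildNumber>/ to the web root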

Create your stack

Let’s see this in action. Go into CloudFormation in the AWS Console and choose to create a new stack. Specify the URL to your CloudFormation template and click next.

select-template

You will be asked to give your stack a name and specify any parameters that you request in your CloudFormation template. Remember that we had a build number in ours? I'm calling my stack "MyWebStack" and just leaving the default build number.

details
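
The console works fine, but the same stack can also be created from the AWS CLI. A rough equivalent, assuming the template sits in the bucket we created earlier, would be:

aws cloudformation create-stack --stack-name MyWebStack \
    --template-url https://s3-eu-west-1.amazonaws.com/cristian-bootstrap/web-stack.json \
    --parameters ParameterKey=BuildNumber,ParameterValue=0.1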

On the next screen you can give your resources tags. Tags are really useful for finding things quickly, but they also allow you to group costs by tag. This way you can see how much a specific application is costing or how much your different teams spend on infrastructure. You can also set permissions based on tags. You could give developers access to dev environments only, or Windows admins to Windows servers only. Anyway, we are not using tags in this example, so just skip to the next page, review and create your stack.

It will take a few minutes, 7 in this example, so stretch your legs or something. Below is an example of what the log looks like. As you can see, we are receiving success signals from our instances. These are the signals we send once the bootstrap completes.

cf-log

If you want to know which resources it created you can click the Resources tab.

resources

We will of course also find the resources in the AWS console. Here is our Auto Scaling Group.

ASG

The Launch Configuration…

LC

The Load Balancer. Here is where we see the DNS name that our ELB was assigned. I would recommend using Route 53 to have prettier DNS names.

ELB

And finally our EC2 instances.

instances

If we point our browser to the ELB address we should see our application.

app-v1

Yes, it’s working. Let’s see what we can do with this.

Recover from failures

To simulate a failure, I’m jumping on to one of the machines and stopping the web server.

stopiis

It doesn’t take long before the ASG reacts by terminating my failed instance and firing up a new one.

pending

A few minutes later we are back where we want to be, with 2 operational machines.

allgreen

Updating your stack

Since we enabled rolling updates in our ASG, we can seamlessly replace instances. If we for example want to switch from t2.micro to t2.small, we just need to update the CloudFormation template and perform a stack update, and it will all be taken care of for us. But what about new versions of our application, or configuration changes to the servers that we make using our central bootstrap script? That's where the build number parameter comes in. If you didn't notice, we insert the build number into the User Data as a script comment. It doesn't do anything, but if we update the build number, CloudFormation will see a change to the launch configuration and perform a rolling update. That's how we will be deploying!

userdata

Let's pretend that we have a new build of our application in the S3 bucket waiting to be deployed. I'll kick off a stack update. Go into CloudFormation, only this time choose Update Stack instead and specify the same CloudFormation template. Next you will see the screen where you can specify the parameters. I'm updating my build number to "2" and then I kick off the update.

details2
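
For reference, the same update from the AWS CLI would look roughly like this, reusing the template the stack was created from:

aws cloudformation update-stack --stack-name MyWebStack \
    --use-previous-template \
    --parameters ParameterKey=BuildNumber,ParameterValue=2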

You will see one instance at a time being replaced.

shuttingdown

A while later your servers will have been replaced with fresh new ones.

updated

Let’s test our application. I’m pointing my browser to the ELB address and we see that our application has been updated.

v2

Considerations

There are a few things to consider when working with this set up.

You need to handle sessions outside of the servers. If you for example have users logging in to your application, those sessions need to be stored outside of the servers; DynamoDB could be a good choice. Otherwise your users will be logged out whenever they hit another server. Any uploaded files need to be saved outside the servers as well, for example in S3. Since your servers will be replaced quite frequently, you should also push the logs out, for example to CloudWatch. Think "my server can be replaced any minute: what does this mean for my application?".

Small instances might take a long time to bootstrap, depending on what you do during your bootstrap process. You may reach a point where you need to start baking custom AMIs to speed things up. You don't need to go all the way with complete AMIs; you could bake the basic stuff into the AMI and leave the application deployment to the bootstrap. Going from T2 to other instance types, for example M4, can speed things up significantly, but at a higher cost. It's all a trade-off in the end.

In a Windows world it is quite common to join your servers to an Active Directory. This will slow down your bootstrap process even more. You may also have things you can only do after the mandatory reboot when joining a domain, for example specifying domain accounts for your application pools. This means that you must be able to continue your bootstrap process even after a reboot. I will cover that in an upcoming article.

T2 instances use CPU credits, meaning that they can only burst CPU for a certain amount of time before they are simply given less CPU time. If you are using scale-out policies based on CPU this could potentially be a problem. You can read more about CPU credits here. The nice thing about T2 instances is that they are so inexpensive, but in the end you get what you pay for.

Regional high availability on AWS

Introduction

AWS has excellent infrastructure for building highly available and scalable applications. The infrastructure consists of Regions, which are physical locations around the world from which the services are provided, currently 12 of them. Within Regions there are multiple Availability Zones (AZs), which are clusters of datacenters connected by a high-speed, low-latency network. Read more about Regions and Availability Zones over at the AWS web site.

The first step in building highly available applications is to deploy them into multiple Availability Zones. EC2 instances can be deployed to different AZs within a Region, and Elastic Load Balancers can be used to distribute traffic between them. An ELB can even detect broken instances and stop directing traffic to them. To take this a step further you can use Auto Scaling Groups, which can span multiple AZs and scale out your application, or replace broken instances, all automatically. RDS can also be deployed across multiple AZs with the click of a button or two. Or was it three? I don't remember.

Multi-AZ deployments will give you highly available applications and cover the majority of cases, but what if you need to take this to the next level and be able to fail over your application to a different region? Let's look at how this can be accomplished; it's actually quite easy.

The building blocks

In our example we will set up WordPress on EC2 instances in two different regions. One region will function as primary and the other one will take over in case of failure. The database will be hosted on RDS running MySQL, which supports cross-region read replicas: basically an asynchronous replication of your primary RDS instance to another region. Route 53 will be used to point the traffic to your primary region, check the application health and fail over to the secondary region in case a failure is detected. Here is a simple diagram of the solution.

regional-ha

 

Route 53

Route 53 sits on top and points the DNS name example.cristian-contreras.me to the Ireland Region in this example. It will also poll the application to check its health. In case a failure is detected, it will automatically switch the traffic to the Frankfurt Region.

Elastic IP

An Elastic IP will be assigned to each EC2 instance. This gives us static public IPs as opposed to dynamic public IPs, which are the default. We want the EC2 instances to keep the same IP even after a shutdown.

EC2 instances

The EC2 instances are the web servers in this case and will have WordPress installed on them, simply because it's easy to install and illustrates the example well. You would normally deploy at least two EC2 instances behind an ELB, one in each AZ, but I'm sticking with one for simplicity.

RDS

RDS hosts the WordPress database on MySQL, with a read replica in the Frankfurt region. Normally you would probably do a multi-AZ deployment for a production workload, but again, this example is about regional high availability.

Setting this up

Setting this up consists of a few steps: setting up the database, setting up the web servers, assigning Elastic IPs and configuring Route 53. Let's get started.

Setting up the database

Launch the AWS Console, point it to your primary region, go into RDS and launch an instance. Make sure to choose MySQL, as this supports the cross-region read replicas we will use to get the database to another region.

Below you see how I'm configuring my instance. I'm using a db.t2.micro instance, no multi-AZ deployment and just 5 GB of disk. The DB Instance Identifier is the name you want to give your instance. I'm using "wordpress-ireland", as that quickly tells me that this is my WordPress instance in Ireland. The Master Username and Password are the credentials to your database instance. Take a note of these values as you will need them later!

rds-settings

Next are the advanced settings. I'm deploying this to my default VPC, not publicly accessible, and putting it in my default VPC security group. In a real-world scenario you may want a dedicated security group for the RDS instance and only allow traffic on port 3306 from the security group containing your EC2 instances. Make sure to also specify a database name; I'm using "wordpressdb".

rds-adv-settings

Next, click Launch DB Instance to get things going. This will take a few minutes, so go grab a coffee or just twiddle your thumbs. Once your instance is ready for use, its status will say available. It will also give you the endpoint address for connecting to your instance. Make sure to take a note, as you will need this later.

dbinstance-status
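
If you would rather script this step, a rough AWS CLI equivalent of the settings above could look like this (the password is a placeholder, and the region and security group should be adjusted to your environment):

aws rds create-db-instance \
    --region eu-west-1 \
    --db-instance-identifier wordpress-ireland \
    --db-instance-class db.t2.micro \
    --engine mysql \
    --allocated-storage 5 \
    --db-name wordpressdb \
    --master-username wordpress \
    --master-user-password 'your_password_here' \
    --no-multi-az \
    --no-publicly-accessible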

To create your read replica, go into Instance Actions and choose, well, you guessed it, Create Read Replica.

Here, I give it the name "wordpress-frankfurt" and choose Frankfurt as the Destination Region. Again, no public IP and a db.t2.micro. Click the button to create the read replica.

db-replica-settings

Now switch the console over to your secondary Region, in my case Frankfurt, and check the status of your RDS instance. After a few minutes it should become available and give you the endpoint address of your read replica. Take a note!

replica-status
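
The CLI version of the cross-region read replica would look something like this. Note that for a cross-region replica the source instance is referenced by its ARN; the account ID below is a placeholder:

aws rds create-db-instance-read-replica \
    --region eu-central-1 \
    --db-instance-identifier wordpress-frankfurt \
    --db-instance-class db.t2.micro \
    --no-publicly-accessible \
    --source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:wordpress-ireland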

Now we have the database part running with replication to our secondary region.

Set up the web server

Now that we have the database up and running, let's install WordPress. I'm using Amazon Linux and have used the following guide. Below is the short version of the procedure.

Go to your primary Region and launch an Amazon Linux EC2 instance. I'm using a t2.micro. Make sure you deploy it to a public subnet so it can be accessed from the internet. Also make sure the security group you use allows HTTP traffic from the internet. Log in and run the following commands.

Install Apache web server

Let’s start by installing Apache.

sudo yum install -y httpd php php-mysqlnd

Make sure that the web server starts automatically in the future, and start it manually this time.

sudo chkconfig httpd on
sudo service httpd start

Add a www group and make yourself a member of it to get write access to the Apache document root.

sudo groupadd www
sudo usermod -a -G www ec2-user

Now log out and log in again so the group membership takes effect, then set file permissions on the Apache document root.

sudo chown -R root:www /var/www
sudo chmod 2775 /var/www
find /var/www -type d -exec sudo chmod 2775 {} \;
find /var/www -type f -exec sudo chmod 0664 {} \;

Install WordPress

Now that we have Apache up and running, let's get on with the WordPress installation. Download WordPress:

wget https://wordpress.org/latest.tar.gz
tar -xzf latest.tar.gz

Create your WordPress configuration file:

cd wordpress/
cp wp-config-sample.php wp-config.php

Edit your newly created configuration file:

nano wp-config.php

Find the lines below and replace the values with your own. You did take notes in the previous section, right? Here is what my file looks like, apart from the password, which I'm keeping to myself. Note that DB_HOST should point to your RDS endpoint in the same region as your EC2 instance.

define('DB_NAME', 'wordpressdb');
define('DB_USER', 'wordpress');
define('DB_PASSWORD', 'password_here');
define('DB_HOST', 'wordpress-ireland.c7zg7vsdpqnc.eu-west-1.rds.amazonaws.com:3306');

WordPress uses some salt and key values to provide a layer of encryption to the browser cookies that WordPress users store on their local machines. Get your own by visiting the WordPress API that randomly generates them for you. Replace the examples in the configuration file with your own. Here is an example:

define('AUTH_KEY', 'o2XCCOwAd)|e}-Qu7E#09qjgw>U|a d|OszfpJRR7w*6V^W=_EF6n$1_DMB28jiz');
define('SECURE_AUTH_KEY', 'u% <{-&_&7StJ=|,2XRNSv4&84IM&nS.l3|q]!J~C^zyQRW?hFUn^hTSdez8?y+%');
define('LOGGED_IN_KEY', 'Nuopj*?pb-=RqHJ35PvqpVB.eoO1:0FxvS xI70L}13y.bDooofB65>o 4vJt|?b');
define('NONCE_KEY', 'W;9--%,ULc(c9g~h+g&|_QtS%g[y|5{_(t|ED:8~e_Gzi!Lz `D_ew|,|,R8w=f-');
define('AUTH_SALT', 'XS-4fOEo],i#`<*qn%xmcf]$ );r+[o)-`75OU[@q@.#fI+2-zb(.m5{LcE*Dr(;');
define('SECURE_AUTH_SALT', '2|^-W{Za]BmBj/^/;-$#Mg81wS|m#s+HpTQ9#fJ+`7.))@g;<G<s2O>fe0F2Mngj');
define('LOGGED_IN_SALT', '>r]-W1Gl|uV9y+DkbC-!:f9mnnU3mr mS CoReKkA+:1L[3CV^-rl]$5ZVk1L1=q');
define('NONCE_SALT', 'm&yP/tYKHk}jxr$]r@Dpj_kEalfn>D&e#%YSy2#-Z=.h$|}9+}|Qk8!6L-RiUKN3');

Now, move the WordPress files to the Apache document root. If this step does not work, you probably forgot to log out and back in after adding yourself to the www group.

mv * /var/www/html/

WordPress permalinks need to use Apache .htaccess files to work properly, but this is not enabled by default on Amazon Linux.

sudo vim /etc/httpd/conf/httpd.conf

Find the following section in your file and change AllowOverride None to AllowOverride All.

<Directory "/var/www/html">
    # some stuff here...
    AllowOverride All
    # other stuff here...
</Directory>

Some of the available features in WordPress require write access to the Apache document root (such as uploading media through the Administration screens).

sudo usermod -a -G www apache
sudo chown -R apache /var/www
sudo chgrp -R www /var/www
find /var/www -type d -exec sudo chmod 2775 {} \;
find /var/www -type f -exec sudo chmod 0664 {} \;

Let's give Apache a restart to pick up the new group and permissions.

sudo service httpd restart

Assign Elastic IPs

Give the server an Elastic IP. This is a static public IP that won't change if we shut down the server. We need this.

elastic-ip

You should now see your Elastic IP.

elastic-ip-status

Under Actions, click Associate Address and associate it with your WordPress instance.

elastic-ip-assign

That's it for the EIP! Oh, and by the way, take a note of the IP.
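
Scripted with the AWS CLI, allocating and associating the address would look roughly like this; the instance ID and allocation ID below are placeholders:

# Allocate a new Elastic IP in the VPC
aws ec2 allocate-address --domain vpc

# Associate it with the WordPress instance, using the AllocationId returned above
aws ec2 associate-address --instance-id i-0123456789abcdef0 \
    --allocation-id eipalloc-0123456789abcdef0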

Now repeat!

Don't start WordPress just yet. The first time it runs, it will look at the URL you are using and configure itself with that. Since we haven't created our DNS record yet, we don't want that to happen now. But we do want a WordPress server in the secondary region, so go back and repeat the sections "Set up the web server" and "Assign Elastic IPs". This time, point your AWS Console to the secondary region and remember to point to the RDS endpoint in your secondary region in the WordPress configuration file.

Configure Route 53

Now we have almost everything up and running. Let’s set up Route 53 to point traffic to the application.

I have a hosted zone in Route 53. If you don’t have one, you can register one quite easily. Here is my zone.

hosted-zone

Health check

First of all, let’s create a health check. It is used by Route 53 to determine if your application is healthy. If not, it will redirect the traffic to your secondary region.

Click Health Checks in the Route 53 console and then Create Health Check. I gave mine the name wordpress. Specify the Elastic IP you assigned to the EC2 instance in your primary region and enter the hostname you will be using. I'm not creating an alert in the next step, so just go ahead and create the health check.

health-check

Within a few minutes it should go green.

health-check-status
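
If you prefer to create the health check from the CLI, a sketch could look like this. The IP is a placeholder for the Elastic IP in your primary region, and the caller reference just needs to be a unique string:

aws route53 create-health-check \
    --caller-reference wordpress-primary-1 \
    --health-check-config IPAddress=52.16.0.10,Port=80,Type=HTTP,ResourcePath=/,FullyQualifiedDomainName=example.cristian-contreras.me,RequestInterval=30,FailureThreshold=3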

Primary DNS record

Now let's go into our hosted zone and create a record for our primary region. Specify the name of the record, "example" in this example. I'm setting the TTL to 60 seconds to prevent DNS servers from caching the record any longer than that. Remember, we are failing over to the secondary region by updating DNS. Set the Routing Policy to Failover, the Failover Record Type to Primary and specify which health check to use.

dns-pri

Secondary DNS record

Now we need to create our secondary record. This will point to the IP that we want to failover to in case the health check fails.

Similar to last time, only this time specify the Elastic IP in your secondary region and set the Failover Record Type to Secondary. No health check necessary.

dns-sec
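
Both failover records can also be created in one go from the CLI. Below is a sketch where the hosted zone ID, the health check ID and both Elastic IPs are placeholders:

cat > failover-records.json <<'EOF'
{
  "Changes": [
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.cristian-contreras.me",
        "Type": "A",
        "SetIdentifier": "primary",
        "Failover": "PRIMARY",
        "TTL": 60,
        "HealthCheckId": "11111111-2222-3333-4444-555555555555",
        "ResourceRecords": [ { "Value": "52.16.0.10" } ]
      }
    },
    {
      "Action": "CREATE",
      "ResourceRecordSet": {
        "Name": "example.cristian-contreras.me",
        "Type": "A",
        "SetIdentifier": "secondary",
        "Failover": "SECONDARY",
        "TTL": 60,
        "ResourceRecords": [ { "Value": "52.29.0.10" } ]
      }
    }
  ]
}
EOF

aws route53 change-resource-record-sets --hosted-zone-id Z1EXAMPLE \
    --change-batch file://failover-records.json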

That’s it!

Testing the magic

Now that we have everything set up, we are ready to test. Let's go to our WordPress URL. In my case it's http://example.cristian-contreras.me. As this is the first time, it takes you to the WordPress installation guide.

wp1

The next page asks you for a Site Title, Username, password, etc. Complete the form and click Install WordPress to have the blog installed.

wp2

I post a test message and the blog looks surprisingly similar to this one. Our WordPress site is working!

wp3

Let’s break things

If this works as expected, we should be able to stop the EC2 instance in our primary region and see a failover take place. Let’s see if that happens!

First of all, let's have a look at where DNS is pointing. Looks right: that's my EIP in the primary region.

ns-pre
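
If you want to check this from the command line, a quick lookup with dig (assuming it is installed) would do:

dig +short example.cristian-contreras.me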

Now we brutally stop the EC2 instance in our primary region.

stop

Our health check reacts within a few minutes.

hc-fail

DNS now points to our secondary region.

ns-post

Testing the site and it is working! Remember, our EC2 instance in the primary region is stopped so this is actually working.

wp-sec

You can browse the site, no problem at all, but if we try to post a comment we get the following error. Remember, we are on a read replica in the secondary region, so this is expected.

save-error

So we have now failed over to our secondary region, all automatically, apart from us deliberately breaking our application of course. This is pretty good even if our database is read-only, and in a major disaster scenario like this it might even cover our needs. If not, let's fail over the database as well.

Promoting a read replica

Promoting a read replica is really easy. Just highlight it, click Instance Actions and choose Promote Read Replica from the menu and confirm the action on the following screen.

promote
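
For completeness, the same promotion from the AWS CLI:

aws rds promote-read-replica --region eu-central-1 \
    --db-instance-identifier wordpress-frankfurt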

This will take a few minutes and the instance will even have to reboot.

reboot

Once the reboot completes I can post a comment and we are fully operational again.

comment

The big gotcha

There is one important thing to be aware of. Since we promoted our read replica, it is now a standalone instance. This means that we have two WordPress RDS instances living their separate lives, one in each region. If we brought our web server back up in the primary region, Route 53 would point us there and we would be talking to the primary region again. Any changes made in the secondary region would be lost. This is not a problem, just something to be aware of, and in other use cases it may even be desired behaviour. I'm sure there are clever ways of failing back, but I would probably just terminate my original RDS instance in the primary region, create a read replica from the secondary region back to the primary and fail back in a planned way during off hours.

Also, WordPress stores uploaded media content locally on the server (I believe?), meaning that pictures we upload would never make it to our other region. Our applications should put uploaded files in S3, which does support cross-region replication. There are actually WordPress plugins that do just that. Maybe a topic for an upcoming post.

Conclusion

So we have successfully failed over to our secondary region. If we are OK with a read-only database, failover happens completely automatically. Even failback will occur once Route 53 sees that our application is healthy again in the primary region. It gets a little bit more complicated if we want to fail over the database, but this is the nature of asynchronous replication. It requires some manual steps and takes a few minutes to promote the read replica. You can of course automate things; AWS has excellent APIs. But is this acceptable or even useful? I say absolutely yes for both scenarios. Remember: running EC2 and RDS in multi-AZ deployments protects you against datacenter-wide outages, and does it pretty well. This is the next level of protection, for when not only a datacenter breaks down but a whole region. This is the kind of protection that most of us, back in the day, would never have had the resources to put in place, and such a disaster would have put us in a really bad situation. With that perspective, this is amazing stuff!