Login to AWS using your Active Directory

Introduction

Many of us have an Active Directory, and you might want to be able to log in to the AWS Console using your Active Directory accounts. This can be done quite easily with AWS' Active Directory Connector. It has all the bits and pieces you need to get started and it's quite inexpensive – around $40 per month at the time of writing for the small version, which supports up to 500 users.

Building blocks

I'm assuming you already have an Active Directory in place. In this case I've set up an AD on EC2. The AD Connector will then act as a proxy to your AD. It also provides a login page that will authenticate against your AD. You will map AD groups (or users) to IAM roles.

Active Directory

I have created an Active Directory called cristian-contreras.local. For the purpose of this demo it sits on a single Domain Controller on an EC2 instance. It’s pretty much blank apart from the following:

  • User ‘Cristian’ which is my demo user that we will login to the AWS console with.
  • A group called ‘EC2ReadOnly’ that the user above is a member of and is what we will map to an IAM role.
  • User called ‘DSConnectorService’ which is the user that the AD Connector service will use to connect to the AD. It does not require any special permissions, just a plain user.
ad-userscomputers
The only AD objects needed for this demo.

AWS AD Connector

The AD Connector is an AWS service that is probably best described by AWS’ own words:

AD Connector is a directory gateway with which you can redirect directory requests to your on-premises Microsoft Active Directory without caching any information in the cloud

It also makes it possible to map IAM roles to Active Directory groups (or users) and can even set up a login page for your AD users. Check out the documentation for more details about the service.

Setting it up

I’m assuming you already have an Active Directory. Let’s start with setting things up there. First of all, I’ve created a user for myself called ‘Cristian’ that is member of a group called ‘EC2ReadOnly’. This is the user that I will use to login to the AWS Console and the group is what will give it access.

We then need a user that the AD Connector can use to talk to the AD. This is a plain user, no special permissions required.

users

We now have everything on the AD side. Before we create the connector, we must make sure that it will be able to talk to our DCs. The AD Connector will have IP addresses in your VPC. I've created the following Security Group that my Domain Controller is a member of.

sg
Apart from RDP, this is what AWS states that the connector needs access to.

Now we can create the AD Connector. I’m using the small version which supports up to 500 users. If you have a larger AD you can choose the large one which supports up to 5000 users.

ds-conf-1

You can choose where in the network it should sit. I’m just using the default VPC for this demo.

ds-conf-2

After a few minutes it will be created. Now you can create the login page. Go into your AD Connector and create an access URL first. You can then enable the AWS Management Console.

Apps.png

You can then create the mapping between your Active Directory groups (or users) and IAM roles. This is how mine looks: AD group EC2ReadOnly maps to IAM role EC2ReadOnly.

role-assigment


Putting it to work

Go to the URL you chose, mine is called https://criconmun.awsapps.com/console, and you will get a login prompt. I’m logging in with my user called ‘Cristian’.

console-login

You will then be logged in to the AWS Console with the permissions of the role you specified in the AD to IAM role mapping. Note that it says EC2ReadOnly/cristian as the logged in user in the upper right hand corner.

console

Conclusion

AWS has made it really simple to set this up using the AD Connector. It didn't take me more than 30 minutes in total. It is also quite affordable: a small connector supporting up to 500 users costs around $40 per month. I think this is quite a reasonable solution for people that only occasionally log in to the AWS Console, perhaps with limited permissions.

Controlling EC2 instances with a button

The Minecraft button

I have a Minecraft server on AWS, but use it quite seldom. It doesn't make sense to have it turned on constantly as you pay by the hour. On the other hand, I don't want to have to log in to the AWS Console, and sometimes the kids want to play Minecraft when I'm not home. That's where the idea of the Minecraft button came in – basically having a physical button to turn the server on and off.

The bits and pieces

Quite simple actually:

  • A Raspberry Pi
  • Green, yellow and red LEDs to indicate instance status
  • On/off switch
  • A Python script that checks the status of the server and lights the corresponding LED
  • Another Python script that checks the button and sends API calls to AWS when pressed
  • AWS CLI and Boto3 for communication with AWS
20161106_132712
I made holes in the Raspberry Pi case and stuck everything there

Raspberry Pi

The Raspberry Pi runs Raspbian and is a generation 1 model B, simply because that was what I had laying around. My gen 1 also has a roomier case than the gen 2 and 3 I have, which came in handy for fitting the cables, etc.

Electronic components

This is my shopping list. It links to a Swedish site but should give you an idea anyway.

Apart from that, I use a 100 ohm resistor so as not to overload the LEDs.

The circuit looks like this.

curcuit

Everything is fitted inside the Raspberry Pi case. Then it's just a matter of plugging the cables into their corresponding pins. Check out the documentation on pin numbering.

20161106_131712

AWS access

First of all, we need to create a user and assign it permissions.

Go to the AWS Console and navigate to IAM. Create a user, assign it the AmazonEC2FullAccess policy and create an access key that the Raspberry Pi will use to talk to AWS.

Install the AWS CLI which will allow us to configure credentials for accessing AWS.

sudo apt-get install awscli

Then run the following command and type in the access key you created and the region where your EC2 instance is running.

aws configure

You also need to install Boto3, which is AWS' SDK for Python. Just run:

pip install boto3
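
To verify that the credentials and region were picked up correctly, a couple of lines of Python in the interpreter is enough. A minimal sketch:

import boto3

# List the instances in the configured region to confirm that Boto3 can reach AWS.
ec2 = boto3.resource('ec2')
for instance in ec2.instances.all():
    print('{} is {}'.format(instance.id, instance.state['Name']))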

The control script

The control script handles the button and begins by defining two functions:

  • start_server – call AWS using Boto3 to start the server
  • stop_server – call AWS using Boto3 to stop the server

Then it just runs an endless loop that checks every half second whether the button is pressed. If it is, and the server is running, it calls the stop_server function, and the other way around. If the server is neither running nor stopped, for example in the middle of starting or stopping, it does nothing.

#!/usr/bin/python
import boto3
import RPi.GPIO as GPIO
from datetime import datetime
from time import sleep


def start_server(server_id):
    print('{} - Starting server {}'.format(datetime.now(), server_id))
    instance = ec2.Instance(server_id)
    instance.start()


def stop_server(server_id):
    print('{} - Stopping server {}'.format(datetime.now(), server_id))
    instance = ec2.Instance(server_id)
    instance.stop()


button_pin = 17
server_id = 'i-dae58150'

GPIO.setmode(GPIO.BCM)
GPIO.setup(button_pin, GPIO.IN, pull_up_down=GPIO.PUD_DOWN)

ec2 = boto3.resource('ec2')

try:
    while True:
        if GPIO.input(button_pin):
            instance = ec2.Instance(server_id)
            if instance.state['Name'] == 'stopped':
                start_server(server_id)
            elif instance.state['Name'] == 'running':
                stop_server(server_id)
            else:
                # In the middle of starting or stopping - do nothing
                print('{} - server is in unknown state, skipping'.format(datetime.now()))
                sleep(1)
        sleep(0.5)
finally:
    GPIO.cleanup()

The status script

This script handles the LEDs. It makes a call to AWS, using Boto3, to check the status of the instance and updates the LEDs accordingly. This is done in an endless loop with a 1 second sleep for each iteration.

#!/usr/bin/python
import boto3
import RPi.GPIO as GPIO
from time import sleep

instance_id = 'i-dae58150'
red_pin = 27
green_pin = 22
yellow_pin = 10

GPIO.setmode(GPIO.BCM)
GPIO.setup(red_pin, GPIO.OUT)
GPIO.setup(green_pin, GPIO.OUT)
GPIO.setup(yellow_pin, GPIO.OUT)

ec2 = boto3.resource('ec2')

while True:
    instance = ec2.Instance(instance_id)
    state = instance.state['Name']
    if state == 'running':
        GPIO.output(red_pin, GPIO.LOW)
        GPIO.output(yellow_pin, GPIO.LOW)
        GPIO.output(green_pin, GPIO.HIGH)
    elif state == 'stopped':
        GPIO.output(red_pin, GPIO.HIGH)
        GPIO.output(yellow_pin, GPIO.LOW)
        GPIO.output(green_pin, GPIO.LOW)
    else:
        GPIO.output(red_pin, GPIO.LOW)
        GPIO.output(yellow_pin, GPIO.HIGH)
        GPIO.output(green_pin, GPIO.LOW)
    sleep(1)

Starting the scripts at boot

To make the scripts run automatically when you start the Pi, add the following lines to your crontab (run crontab -e)

@reboot nohup /home/pi/MinecraftButton/server_control.py &
@reboot nohup /home/pi/MinecraftButton/server_status.py &


Understanding T2 CPU credits

Introduction

The t2 instance family provides inexpensive instances with a moderate performance level and the ability to burst. This is accomplished using CPU credits, which determine how much CPU you can use and for how long. This way, AWS can cram more instances onto the same hardware and offer them at a lower price. How much CPU and for how long is determined by the instance type you are using. Depending on your workload, t2 instances can work very well and be cost effective. But under sustained high load they can perform really badly, as CPU utilization will be pushed down once you eat up your credits. Let's look at how this mechanism works.

What are CPU credits?

It's actually quite simple – a CPU credit gives you the right to run a vCPU at 100% for one minute. So whenever your CPU utilization goes high, you start eating CPU credits. You will also continuously earn new credits. The level at which you start burning CPU credits is called the base performance level. If you use up all your credits, the CPU utilization will be pushed back down to the base performance level. Instances can only accumulate a certain amount of CPU credits and they will also start with some credits when launched. So to summarize:

  • The level at which an instance starts burning credits is called the base performance level. Above it, CPU credits are consumed.
  • You slowly but surely earn new credits.
  • An instance launches with an initial amount of credits and can only buffer up a certain amount.

Below is AWS' table for the different instance types.

cpucredits

Let's look at a t2.small as an example. It will burn credits when above 20% CPU. It earns 12 CPU credits per hour, meaning it can run at 100% for 12 minutes every hour. If idle for long enough, it can accumulate a maximum of 288 CPU credits. It will also launch with 30 credits, giving it a little bit of room to boot up and come alive properly.
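
As a rough sketch of the arithmetic, here is how you could estimate how long a full credit balance lasts for a t2.small under a given average load (the 60% utilization figure is just an assumption):

# Rough estimate of how long a t2.small can sustain a given CPU load.
# Numbers from the table above; the average utilization is an assumption.
base_level = 20                       # percent, base performance level for a t2.small
earn_rate = base_level * 60 / 100.0   # 12 credits earned per hour
balance = 288                         # start from a full credit balance
avg_utilization = 60                  # percent, assumed average load

# 1 credit = 1 vCPU at 100% for 1 minute, so an hour at avg_utilization%
# costs avg_utilization * 60 / 100 credits.
burn_rate = avg_utilization * 60 / 100.0

net_per_hour = burn_rate - earn_rate
if net_per_hour > 0:
    hours = balance / net_per_hour
    print('Burning {:.0f} credits/hour net, a full balance lasts ~{:.1f} hours'.format(net_per_hour, hours))
else:
    print('Earning credits faster than burning them, no throttling')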

How does this work in real life?

You can actually see your CPU credits in the AWS Console. Highlight your instance, click the Monitoring tab and scroll down to CPU Credit Balance or CPU Credit Usage. You can also create a CloudWatch Dashboard to see this. Below is a real-world example from my Minecraft server. The blue line is my CPU credit balance, and the yellow line is the CPU utilization. As you can see, I start with a high balance. I then play some Minecraft, causing my CPU utilization to go up and my credits to go down. When I've finished playing, the credits start to build up again, giving me some headroom for the next game. As long as I don't drop to 0 credits, my server runs perfectly well.

cpu-buffers
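
If you prefer pulling these numbers yourself rather than looking at the console graphs, something along these lines should work with Boto3 (the instance id is my Minecraft server and just a placeholder for your own):

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Average CPUCreditBalance for a t2 instance over the last 24 hours.
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/EC2',
    MetricName='CPUCreditBalance',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-dae58150'}],  # placeholder instance id
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=['Average'])

for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print('{} - {:.1f} credits'.format(point['Timestamp'], point['Average']))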

Here is another example where I launch a new t2.micro instance. The instance is idle to begin with, so it slowly but surely builds up more credits. I then kick off a process on the machine that maxes out the CPU. It starts burning credits, and once all credits are burned the CPU is forced down to 10%, which is the base performance level for this instance type.

cpu-cred-test

Conclusion

In order to use t2 instances successfully you should know your CPU utilization. Use the built-in graphs to keep track of this. Your average utilization should not go above the base performance level for longer periods of time. The opposite is also true – if you have instances from other instance families that are constantly below a t2 base performance level, consider a t2 instance. As a comparison, a t2.large with Windows costs $0.142 per hour and an m4.large costs $0.244 per hour, and they are comparable apart from the CPU credits on the t2. Changing instance type is just a reboot and a right click away.

Serverless backups of your EC2 volumes using Lambda

Introduction

EBS volumes are designed for high durability. According to AWS, you should expect 1-2 failures during one year if you have 1,000 volumes running. So even if they are reliable, they are not unbreakable. Data should not be kept on EBS volumes – it should be stored on S3, RDS or similar. Logs should be shipped out from your instances to CloudWatch Logs or a logging platform you might be using. Regardless of this, we might end up in situations where we wish we could restore an EBS volume. It could be an update script that went wrong, accidental deletion or a volume actually breaking. This can be accomplished using snapshots. They are not application aware, so they will not get your data out of your application in a consistent manner, but for typical application servers that do not contain your data they can do the job well. In this article we will look at how you can use AWS Lambda to create snapshots of your volumes automatically.

Since I wrote the article, the function has been improved. Check out the project on GitHub.

Building blocks

The example is quite simple – we use Lambda to create a scheduled job that backs up your EBS volumes.

Lambda

At the heart of this example we have Lambda. It gives you the possibility to run serverless code. You don't need to care about any servers – just upload your code and don't worry about the rest. It also has the benefit of being very cost effective: you pay in 100 ms intervals for the time your code is running. You can currently use Java, Node.js and Python. I'm using a simple Python example to take snapshots of our EC2 volumes.

CloudWatch Events

The backup job will be kicked off using a schedule event in CloudWatch Events. The events can be configured to run at specific intervals, for example every hour, or at given times. I'm using the latter, in the middle of the night when traffic is low.

Volume tags

Tags can be used to put your own metadata on your AWS resources. We will use a tag to define which volumes we want to snapshot with this job. The tag I'm using is BackupEnable = True. So whenever new instances are created, they can be included in the backup job just by tagging their volumes.
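
Tagging can of course also be scripted. A minimal Boto3 sketch, with a placeholder volume id:

import boto3

ec2 = boto3.client('ec2')

# Opt a volume in to the backup job by giving it the BackupEnable = True tag.
ec2.create_tags(
    Resources=['vol-0123456789abcdef0'],            # placeholder volume id
    Tags=[{'Key': 'BackupEnable', 'Value': 'True'}])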

Python

The code is quite simple – it enumerates all the volumes, checks if they have the tag BackupEnable = True, and if so takes a snapshot of it. Lambda has built-in support for Boto 3, the AWS SDK for Python, which we use to interact with AWS.

Identity and access management

Lambda needs the appropriate permissions to do the work. This is accomplished using an IAM role that Lambda can assume. The role will have a policy attached to it that gives it the right permissions to do its things.

Setting it up

Identity and Access Management

Let's start by creating an IAM policy that will give our backup job the permissions it needs. Go into IAM in the Management Console, navigate to Policies and click 'Create Policy'. On the next page, choose 'Create Your Own Policy'. Give your policy a name and paste the following policy document into it. It contains three sections. The first gives our Lambda job permission to log output to CloudWatch Logs. The next gives permission to list and describe the volumes, for example which volumes exist and whether they have our BackupEnable = True tag. The third section gives permission to take the actual snapshots.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        },
        {
            "Effect": "Allow",
            "Action": "ec2:Describe*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:CreateSnapshot",
                "ec2:ModifySnapshotAttribute",
                "ec2:ResetSnapshotAttribute"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

Next go into Roles and click ‘Create New Role’. Give your role a name. I call mine ‘backup-ec2-volumes’. On the next page, you have to choose role type. This will be a Lambda job so we choose AWS Lambda. Next you have to attach a policy to the role. Choose the policy we created in the previous step. Mine is called ‘ec2-volume-snapshot’.

Your role should look like this with the policy attached to it.

iam-role

Lambda should be listed under trusted Entities. This gives the Lambda service permissions to assume the role.

iam-role-trust

Now that we have created a role, we can go ahead and create the actual Lambda job.

Lambda

Go to Lambda in the Management Console and choose 'Create a Lambda function'. You will be presented with a series of blueprints as a starting point; click 'Skip' and you will be taken to the configuration options on the next page.

Start by giving your function a name and choose Python 2.7 as your runtime. I call mine ‘backup-ec2-volumes’.

lambda-conf-1

Paste the Python code below into the code window. It first defines a function called snap_volume that takes a snapshot of the volume passed to it as volume_id. The next function is the one that Lambda will call, lambda_handler. Here we put our main code. It lists all volumes, iterates through them one by one and checks if they have a tag called BackupEnable = True. If so, it calls the snap_volume function and passes along the volume_id. That's it. The try-except is there to catch volumes that have no tags set at all. The print statements will end up as log rows – this is our way of logging what the job is doing.

from __future__ import print_function

from datetime import datetime, date
import boto3

print("Starting job at {}".format(datetime.now()))

def snap_volume(volume_id):
    ec2 = boto3.resource('ec2')
    volume = ec2.Volume(volume_id)
    snap_description = 'Snapshot of {0} {1}'.format(volume_id, date.today())
    volume.create_snapshot(DryRun=False, Description=snap_description)
    print('Snapshot taken of {} at {}'.format(volume_id, datetime.now()))
    return

def lambda_handler(event, context):
    client = boto3.client('ec2')
    response = client.describe_volumes()
    print('Volumes listed at {}'.format(datetime.now()))
    for volume in response['Volumes']:
        try:
            for tag in volume['Tags']:
                if tag['Key'] == 'BackupEnable' and tag['Value'] == 'True':
                    snap_volume(volume['VolumeId'])
        except KeyError:
            print('No tag found on volume {}'.format(volume['VolumeId']))

Make sure you choose the role we created previously to give the Lambda job the right permissions. I'm happy with the rest of the settings so I just leave the defaults. Well, you probably want to increase the timeout to 5 seconds – I discovered that it takes around 2-3 seconds for the AWS API to return the list of volumes, so the job might take more than 3 seconds at times. Review on the next page and go ahead and create the function.

lambda-conf-3

We will be taken to a page showing us our Lambda function. We still need to tell it when to trigger. We will use a CloudWatch Events Schedule for this. Click 'Event sources' and then 'Add event source'. You get a list of available event sources; choose CloudWatch Events – Schedule. Give your rule a name, mine is called 'daily-volume-snapshots', and give it a Schedule expression. The schedules are in UTC time. I'm running mine at 3:00 am every day. See the AWS docs for more info about the cron syntax. Go ahead and create the schedule.

cw-schedule
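
If you prefer setting up the schedule from code instead of the console, the sketch below should do roughly the same thing with Boto3. The function ARN and account id are placeholders for your own values.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

function_arn = 'arn:aws:lambda:eu-west-1:123456789012:function:backup-ec2-volumes'  # placeholder ARN

# Rule that fires at 03:00 UTC every day.
rule = events.put_rule(
    Name='daily-volume-snapshots',
    ScheduleExpression='cron(0 3 * * ? *)',
    State='ENABLED')

# Allow CloudWatch Events to invoke the function.
lambda_client.add_permission(
    FunctionName='backup-ec2-volumes',
    StatementId='daily-volume-snapshots-event',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'])

# Point the rule at the Lambda function.
events.put_targets(
    Rule='daily-volume-snapshots',
    Targets=[{'Id': 'backup-ec2-volumes', 'Arn': function_arn}])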

Putting it to work

Now we have all the bits and pieces set up. Let's go ahead and test our job. First we need to tag the volumes that we want to back up.

Tag volume

We have the job set up. Now we need to tell it which volumes to back up. We do this by tagging the volume with BackupEnable = True as shown below. That's it.

vol-tag

Test the job

In order to test the job, just go into the Lambda function in the Management Console and click the test button. You will get log output similar to the one below. It shows the execution time and the log output. Logs will also be sent to CloudWatch Logs.

log-output
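
You can also run the test from code. A small Boto3 sketch that invokes the function and prints the log tail:

import base64
import boto3

client = boto3.client('lambda')

response = client.invoke(
    FunctionName='backup-ec2-volumes',
    InvocationType='RequestResponse',
    LogType='Tail')                      # return the last part of the execution log

# The log tail comes back base64 encoded.
print(base64.b64decode(response['LogResult']).decode('utf-8'))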

You should see your snapshot if you head over to EC2 and Snapshots.

snap-example

Conclusion

It is quite simple to set up a Lambda job to create snapshots of your EBS volumes. Snapshots are incremental, so only the changes since the last snapshot are saved. This gives you the possibility to take snapshots quite frequently without them becoming too large. They are not application aware, so they are not suitable for volumes with frequently updated data unless you stop your application before taking the snapshot. They can be considered crash consistent – the same result you would get if you simply pulled the plug on the server. Anything cached in memory is lost, but for many workloads this is no problem at all.

Auto Scaling and Rolling Updates on AWS

Why auto scaling?

Auto Scaling Groups are really useful and deliver on several of the promises of the cloud:

  • High availability by automatically replacing unhealthy instances
  • Elasticity by throwing in more instances when more horsepower is required

These are the two major selling points of Auto Scaling Groups and that is pretty good stuff, but there is more that they can do for you. Together with CloudFormation we can automate our deployment and even perform seamless rolling updates. This opens new doors! Need to deploy a new version of your application or maybe apply security patches? Just replace the instances. Amazon even patches the standard AMIs for you. Is an instance misbehaving? Just terminate it and the Auto Scaling Group will start up a new one for you. Treat your servers as cattle rather than pets. Sounds good? Let’s look at how this is done.

Building blocks

The key component here is the Auto Scaling Group containing the EC2 instances. They sit in two different Availability Zones with an Elastic Load Balancer on top distributing the traffic between them. We use CloudFormation to create a blueprint of our stack. We have an S3 bucket containing the blueprint, our bootstrap script and the application we want to deploy onto the servers. We use an Amazon Windows AMI and put our configuration and application on it during first boot.

diagram

Let’s look at some concepts that we need to understand.

Auto Scaling Group

The Auto Scaling Group, ASG, is the key component in our example. It contains information about how to scale your application, what to launch the instances from and how to perform updates.

Launch Configuration

The ASG launches new instances from a Launch Configuration. It contains information about which AMI to use, instance type, which Security Group to put them in and so on. Basically, the same information you need to provide when launching an instance manually.

Elastic Load Balancing

The ELB distributes traffic between the instances in the ASG. It also performs health checks and reports back to the ASG if it finds an unhealthy instance.

CloudFormation

CloudFormation allows you to build blueprints of your infrastructure in JSON format. You create stacks from your templates, but you can also change your infrastructure by updating the template and performing a stack update. We will trigger a stack update to roll out configuration changes to our servers and deploy new versions of our applications.

Bootstrapping

Bootstrapping can be accomplished by passing User Data to the instances. We will pass a very simple PowerShell script that just goes to our S3 bucket, downloads our central bootstrap script, runs it and passes a message back to CloudFormation that it has completed. Any configuration we want to make to our instances is kept in our central bootstrap script. If we want to make any changes to our servers we just update the script, trigger a stack update and CloudFormation will take care of the rest for us.

Setting this up

IAM Role

First of all, we will create an IAM role that our instances will be assigned to. This role will give the instances read access to our S3 bucket. I'm starting by creating a policy which I'm calling S3-Read-Bootstrap. It has the following statement and, as you may notice, my bucket is called "cristian-bootstrap".

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:Get*",
                "s3:List*"
            ],
            "Resource": "arn:aws:s3:::cristian-bootstrap/*"
        }
    ]
}

I then create a role called “web-server-role” which I attach the policy to.

iam-role

S3 bucket

I have a bucket called cristian-bootstrap and it only contains 3 files.

  • bootstrap.ps1 – my central bootstrap script which installs IIS and our application
  • index.html – simple web page that will act as our “application”
  • web-stack.json – our Cloudformation template containing the blueprint

CloudFormation template

This is where the meat is. Here we “program” our infrastructure rather than clicking around in the console. If you haven’t used CloudFormation, I strongly recommend that you familiarize yourself with it. There are template snippets which are a good starting point. Head over to the template reference if you need details about specific resource types. You may also want to have a look at a JSON tutorial.

The file contains a section for parameters and one with resources. We are defining one parameter, a launch configuration, a load balancer and an auto scaling group.

Parameters

Here we specify which parameters we need to create our stack. It could be any input you want to pass to your template. In our case we are just taking one parameter, and that is the build version. We are going to use this parameter to tell CloudFormation when there is a new version of our application to deploy. Let's say there is nothing to change on our infrastructure, but we still want to deploy a new version of our application – this is how we tell CloudFormation that. The parameter is called BuildNumber, of type String, and if nothing is specified it will default to "0.1".

{
  "Parameters": {
    "BuildNumber": {
      "Type": "String",
      "Default": "0.1",
      "Description": "Number of the build being deployed"
    }
  },

Next are the resources. Let’s have a look at them one by one.

Launch Configuration

First we are saying that this resource is called LaunchConfig and that it is of type Launch Configuration. We are then specifying the properties, like AMI, instance type, key pair, security group and IAM role (remember the one we created in the beginning?). We also have the user data, which may look a little funny but makes more sense once we see past the formatting.

"LaunchConfig": {
  "Type": "AWS::AutoScaling::LaunchConfiguration",
  "Properties": {
    "ImageId": "ami-9ebb39ed",
    "InstanceType": "t2.micro",
    "KeyName": "cristian-ew-key",
    "IamInstanceProfile": "web-server-role",
    "SecurityGroups": [ "sg-60ef3904" ],
    "UserData": {
      "Fn::Base64": {
        "Fn::Join": [
          "",
          [
            "<powershell>\n",
            "New-Item -Path C:\\Bootstrap -ItemType directory\n",
            "Copy-S3Object -BucketName cristian-bootstrap -Key bootstrap.ps1 -LocalFile C:\\Bootstrap\\bootstrap.ps1\n",
            "C:\\Bootstrap\\bootstrap.ps1\n",
            "cfn-signal.exe -e 0 --stack ", { "Ref" : "AWS::StackName" }, " --resource AutoScalingGroup --region ", { "Ref" : "AWS::Region" }, "\n",
            "# ", { "Ref": "BuildNumber" }, "\n",
            "</powershell>"
          ]
        ]
      }
    }
  }
},

Below you see what the User Data will look like. First we are telling it that this is a PowerShell script. We start by creating a directory called C:\Bootstrap. The next line copies bootstrap.ps1, our central bootstrap script, from our S3 bucket to the newly created folder. We then run the bootstrap script we just downloaded. Once it completes, we send a signal to CloudFormation letting it know. More about this signal later.

<powershell>
New-Item -Path C:\Bootstrap -ItemType directory
Copy-S3Object -BucketName cristian-bootstrap -Key bootstrap.ps1 -LocalFile C:\Bootstrap\bootstrap.ps1
C:\Bootstrap\bootstrap.ps1
cfn-signal.exe -e 0 --stack MyWindowsStack --resource AutoScalingGroup --region eu-west-1
# 0.1
</powershell>

Elastic Load Balancer

Next comes our ELB configuration. It tells CloudFormation to create a load balancer and specifies the standard stuff you need whenever you create a load balancer: which security group and subnets to use, that it is internet facing, what port to listen on, what port to talk to the instances on and what health check to run. I would normally use SSL offload on the ELB, but in this case we are skipping it for simplicity.

"LoadBalancer": {
  "Type": "AWS::ElasticLoadBalancing::LoadBalancer",
  "Properties": {
    "SecurityGroups": [ "sg-60ef3904" ],
    "Scheme" : "internet-facing",
    "Subnets": [ "subnet-dd2eb784", "subnet-63f7a906" ],
    "Listeners": [
      {
        "InstancePort": "80",
        "InstanceProtocol": "HTTP",
        "LoadBalancerPort": "80",
        "Protocol": "HTTP"
      }
    ],
    "HealthCheck": {
      "Target": "HTTP:80/index.html",
      "Timeout": "5",
      "Interval": "30",
      "UnhealthyThreshold": "2",
      "HealthyThreshold": "2"
    }
  }
},

Auto Scaling Group

Lastly we create the ASG. We tell it which AZs to use, which launch configuration to launch instances from, and the minimum and maximum number of instances we want running. We also tell it that instances should register with the load balancer we created above. The HealthCheckType and HealthCheckGracePeriod are important properties. The default health check type is EC2, meaning that the ASG will only look at the EC2 health to determine an instance's health – basically, is the server up? That is not good enough, as we want to make sure that our application is actually responding as expected, so we tell the ASG to use the ELB health check instead. The grace period is how long the ASG should give an instance to start up before checking its health. We are bootstrapping the machines, so we need to give them a little bit of time. This value is important. If it is too short, the ASG will believe instances are broken before they have even completed the bootstrap process. I once set this far too low, went for lunch and forgot about it. The next day I discovered my ASG had been bringing up hundreds of instances, believing they were broken and throwing them away. Remember that the minimum charge for an instance is 1 hour, right?

The creation policy tells CloudFormation to wait for the instances to signal that the bootstrap process has completed. That is what the cfn-signal call in our User Data does. Since we want 2 instances before we consider ourselves up and running, we set the count to "2". We also tell it to wait up to 30 minutes for these signals to arrive.

The update policy says that we want to use rolling updates. If we make a change to our instances, we want them replaced one by one, not continuing with the next until an OK signal has been received.

"AutoScalingGroup": {
  "Type": "AWS::AutoScaling::AutoScalingGroup",
  "Properties": {
    "AvailabilityZones": [ "eu-west-1a", "eu-west-1b" ],
    "LaunchConfigurationName": { "Ref": "LaunchConfig" },
    "MinSize": "2",
    "MaxSize": "4",
    "HealthCheckType": "ELB",
    "HealthCheckGracePeriod": "1800",
    "LoadBalancerNames": [ { "Ref": "LoadBalancer" } ]
  },
  "UpdatePolicy": {
    "AutoScalingRollingUpdate": {
      "MinInstancesInService": "1",
      "MaxBatchSize": "1",
      "WaitOnResourceSignals": "true",
      "PauseTime": "PT30M"
    }
  },
  "CreationPolicy": {
    "ResourceSignal": {
      "Count": "2",
      "Timeout": "PT30M"
    }
  }
}

Our application

For simplicity, our application is represented by a single HTML file. The important thing here is to understand that we are deploying our application from our bootstrap script and that our application package is sitting in our S3 bucket. In a real-world scenario this probably means that you have your build system dropping the application packages into an S3 bucket. You could then specify a parameter in your CloudFormation file to tell it which application to install. You could also tell it which environment this is (dev/test/prod) and have your bootstrap script pull down environment-specific configuration files. Anyway, in this example it's just a file called index.html.

Create your stack

Let’s see this in action. Go into CloudFormation in the AWS Console and choose to create a new stack. Specify the URL to your CloudFormation template and click next.

select-template

You will be asked to give your stack a name and specify any parameters that you are requesting in your CloudFormation template. Remember that we had a build number in ours? I'm calling my stack "MyWebStack" and just leaving the default build number.

details

On the next screen you can give your resources tags. Tags can be really useful for finding stuff quickly, but they also allow you to group costs by tag. This way you can see how much a specific application costs or how much your different teams spend on infrastructure. You can also set permissions based on tags. You could give developers access to dev environments only, or Windows admins to Windows servers only. Anyway, we are not using tags in this example so just skip to the next page, review and create your stack.

It will take a few minutes, 7 in this example, so stretch your legs or something. Below is an example of what the log looks like. As you can see, we are receiving success signals from our instances. These are the ones we send once the bootstrap completes.

cf-log

If you want to know which resources it created you can click the Resources tab.

resources

We will of course also find the resources in the AWS console. Here is our Auto Scaling Group.

ASG

The Launch Configuration…

LC

The Load Balancer. Here is where we see the DNS name that our ELB was assigned. I would recommend using Route 53 to have prettier DNS names.

ELB

And finally our EC2 instances.

instances

If we point our browser to the ELB address we should see our application.

app-v1

Yes, it’s working. Let’s see what we can do with this.

Recover from failures

To simulate a failure, I’m jumping on to one of the machines and stopping the web server.

stopiis

It doesn’t take long before the ASG reacts by terminating my failed instance and firing up a new one.

pending

A few minutes later we are back where we want to be, with 2 operational machines.

allgreen

Updating your stack

Since we enabled rolling updates in our ASG, we can seamlessly replace instances. If we, for example, want to switch from t2.micro to t2.small, we just need to update the CloudFormation template and do a stack update, and it will all be taken care of for us. But what about new versions of our application, or configuration changes to the servers that we make using our central bootstrap script? That's where the build number parameter comes in. If you didn't notice, we insert the build number into the User Data as a script comment. It doesn't do anything, but if we update the build number, CloudFormation will see a change to the launch configuration and do a rolling update. That's how we will be deploying!

userdata
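
If you drive deployments from a build pipeline, you would typically trigger the same stack update from code rather than the console. A Boto3 sketch, where the template URL is a placeholder pointing at the web-stack.json in my bucket:

import boto3

cloudformation = boto3.client('cloudformation')

# Bump the BuildNumber parameter to trigger a rolling update of the instances.
cloudformation.update_stack(
    StackName='MyWebStack',
    TemplateURL='https://s3-eu-west-1.amazonaws.com/cristian-bootstrap/web-stack.json',  # placeholder URL
    Parameters=[{'ParameterKey': 'BuildNumber', 'ParameterValue': '2'}])

# Block until the update has finished (or failed).
waiter = cloudformation.get_waiter('stack_update_complete')
waiter.wait(StackName='MyWebStack')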

Let's pretend that we have a new build of our application in the S3 bucket waiting to be deployed. I'll kick off a stack update. Go into CloudFormation, only this time choose Update Stack instead and specify the same CloudFormation template. Next you will see the screen where you can specify the parameters. I'm updating my build number to "2" and then kicking off the update.

details2

You will see one instance at a time being replaced.

shuttingdown

A while later your servers will have been replaced with new fresh ones.

updated

Let’s test our application. I’m pointing my browser to the ELB address and we see that our application has been updated.

v2

Considerations

There are a few things to consider when working with this set up.

You need to handle any sessions outside of the servers. If you, for example, have users logging in to your application, those sessions need to be handled outside of the servers – DynamoDB could be a good choice. Otherwise your users will be logged out whenever they hit another server. Any files uploaded need to be saved outside the servers, for example in S3. Since your servers will be replaced quite frequently, you should also push out the logs, for example to CloudWatch. Think "my server can be replaced any minute – what does this mean for my application?".

Small instances might take a long time to bootstrap, depending on what you do during your bootstrap process. You may reach a point where you need to start baking custom AMIs to speed things up. You don't need to go all the way with complete AMIs – you could bake the basic stuff into the AMI and leave the application deployment to the bootstrap. Going from T2 to other instance types, for example M4, can speed things up significantly, but at a higher cost. It's all a trade-off in the end.

In a Windows world it is quite common to join your servers to an Active Directory. This will slow down your bootstrap process even more. You may also have things you can only do after the mandatory reboot when joining a domain, for example specifying domain accounts for your application pools. This means that you must be able to continue your bootstrap process even after a reboot. I will cover that in an upcoming article.

T2 instances use CPU credits, meaning they can only peak the CPU for a certain amount of time before they are simply given less CPU time. If you are using scale-out policies based on CPU, this could potentially be a problem. You can read more about CPU credits here. The nice thing about T2 instances is that they are so inexpensive, but in the end you get what you pay for.

Regional high availability on AWS

Introduction

AWS has excellent infrastructure for building highly available and scalable applications. The infrastructure consists of Regions, which are physical locations around the world from which the services are provided, currently 12 of them. Within a region, there are multiple Availability Zones (AZs), which are clusters of datacenters connected by a high speed, low latency network. Read more about Regions and Availability Zones over at the AWS web site.

The first step in building highly available applications is to deploy them into multiple Availability Zones. EC2 instances can be deployed to different AZs within a Region, and Elastic Load Balancers can be used to distribute traffic between them. The ELB can even discover broken instances and stop directing traffic to them. To take this a step further, you can use Auto Scaling Groups, which can span multiple AZs, and have them scale out your application or replace broken instances – all automatically. RDS can also be deployed multi-AZ with the click of a button or two. Or was it three? I don't remember.

Multi-AZ will give you highly available applications and will cover the majority of cases, but what if you need to take this to the next level and be able to fail over your application to a different region? Let's look at how this can be accomplished – it's actually quite easy.

The building blocks

In our example we will set up WordPress on EC2 instances in two different regions. One region will function as primary and the other one will take over in case of failure. The database will be hosted on RDS running MySQL, which supports cross-region read replicas – basically an asynchronous replication of your primary RDS instance to another region. Route 53 will be used to point the traffic to your primary region, check the application health and fail over to the secondary region if a failure is detected. Here is a simple diagram of the solution.

regional-ha


Route 53

Route 53 sits on top and points the DNS name example.cristian-contreras.me to the Ireland Region in this example. It will also poll the application to check its health. In case a failure is detected, it will automatically switch the traffic to the Frankfurt Region.

Elastic IP

An Elastic IP will be assigned to each EC2 instance. This gives static public IPs as opposed to the dynamic public IPs which are the default. We want the EC2 instances to keep the same IP even after a shutdown.

EC2 instances

The EC2 instances are the web servers in this case and will have WordPress installed on them, simply because it's easy to install and illustrates the example well. You would normally deploy at least two EC2 instances behind an ELB, one in each AZ, but I'm sticking with one for simplicity.

RDS

RDS hosts the WordPress database on MySQL with a read replica to the Frankfurt region. Normally you would probably do a multi-AZ deployment for a production workload but again, this example is about regional high availability.

Setting this up

Setting this up consists of a few steps: setting up the database, setting up the web servers, assigning Elastic IPs and configuring Route 53. Let's get started.

Setting up the database

Launch the AWS Console and point it to your primary region, go into RDS and launch an instance. Make sure to choose MySQL as this supports cross-region read replicas which we will use to get the database to another region.

Below you see how I'm configuring my instance. I'm using a db.t2.micro instance, no multi-AZ deployment and just 5 GB of disk. The DB Instance Identifier is the name you want to give your instance. I'm using "wordpress-ireland" as that quickly tells me this is my WordPress instance in Ireland. The Master Username and Password are the credentials for your database instance. Take a note of these values as you will need them later!

rds-settings

Next are the advanced settings. I'm deploying this to my default VPC, not publicly accessible, and putting it in my default VPC security group. In a real-world scenario you may want a dedicated security group for the RDS instance and only allow traffic on port 3306 from the security group containing your EC2 instances. Make sure to also specify a database name – I'm using "wordpressdb".

rds-adv-settings

Next, click Launch DB Instance to get things going. This will take a few minutes. Go grab a coffee or just twiddle your thumbs. Once your instance is ready for use, its status will say available. It will also give you the endpoint address for connecting to your instance. Make sure to take a note of it as you will need it later.

dbinstance-status

To create your read replica, let’s go into Instance Actions and choose, well you guessed it, Create Read Replica.

Here I give it the name "wordpress-frankfurt" and choose Frankfurt as the Destination Region. Again no public IP and a db.t2.micro. Click the button to create the read replica.

db-replica-settings
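
For reference, the same cross-region read replica can be created with Boto3 against the destination region. A sketch, where the account id in the source ARN is a placeholder:

import boto3

# Run this against the destination region, Frankfurt in my case.
rds = boto3.client('rds', region_name='eu-central-1')

rds.create_db_instance_read_replica(
    DBInstanceIdentifier='wordpress-frankfurt',
    # Cross-region replicas take the full ARN of the source instance.
    SourceDBInstanceIdentifier='arn:aws:rds:eu-west-1:123456789012:db:wordpress-ireland',  # placeholder account id
    DBInstanceClass='db.t2.micro',
    SourceRegion='eu-west-1')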

Now switch the console over to your secondary Region, in my case Frankfurt, and check the status of your RDS instance. After a few minutes it should become available and give you the endpoint address of your read replica. Take a note!

replica-status

Now we have the database part running with replication to our secondary region.

Set up the web server

Now that we have the database up and running, let's install WordPress. I'm using Amazon Linux and have used the following guide. Below is the short version of the procedure.

Go to your primary Region and launch an Amazon Linux EC2 instance. I'm using a t2.micro. Make sure you deploy it to a public subnet so it can be accessed from the internet. Also, make sure the security group you use allows HTTP traffic from the internet. Log in and run the following commands.

Install Apache web server

Let’s start by installing Apache.

sudo yum install -y httpd php php-mysqlnd

Make sure the web server starts automatically at boot, and start it manually this time.

sudo chkconfig httpd on
sudo service httpd start

Add a www group and make yourself member of it to have write access to the Apache document root.

sudo groupadd www
sudo usermod -a -G www ec2-user

Now log out and log in again for the group membership to take effect, then set file permissions on the Apache document root.

sudo chown -R root:www /var/www
sudo chmod 2775 /var/www
find /var/www -type d -exec sudo chmod 2775 {} \;
find /var/www -type f -exec sudo chmod 0664 {} \;

Install WordPress

Now that we have Apache up and running, let's get on with the WordPress installation. Download WordPress:

wget https://wordpress.org/latest.tar.gz
tar -xzf latest.tar.gz

Create your WordPress configuration file:

cd wordpress/
cp wp-config-sample.php wp-config.php

Edit your newly created configuration file:

nano wp-config.php

Find the lines below and replace the values with your own. You did take notes in the previous section, right? Here is what my file looks like, apart from the password, which I'm keeping to myself. Note that DB_HOST should point to the RDS endpoint in the same region as your EC2 instance.

define('DB_NAME', 'wordpressdb');
define('DB_USER', 'wordpress');
define('DB_PASSWORD', 'password_here');
define('DB_HOST', 'wordpress-ireland.c7zg7vsdpqnc.eu-west-1.rds.amazonaws.com:3306');

WordPress uses some salt and key values to provide a layer of encryption to the browser cookies that WordPress users store on their local machines. Get your own by visiting the WordPress API that randomly generates them for you. Replace the examples in the configuration file with your own. Here is an example:

define('AUTH_KEY', 'o2XCCOwAd)|e}-Qu7E#09qjgw>U|a d|OszfpJRR7w*6V^W=_EF6n$1_DMB28jiz');
define('SECURE_AUTH_KEY', 'u% <{-&_&7StJ=|,2XRNSv4&84IM&nS.l3|q]!J~C^zyQRW?hFUn^hTSdez8?y+%');
define('LOGGED_IN_KEY', 'Nuopj*?pb-=RqHJ35PvqpVB.eoO1:0FxvS xI70L}13y.bDooofB65>o 4vJt|?b');
define('NONCE_KEY', 'W;9--%,ULc(c9g~h+g&|_QtS%g[y|5{_(t|ED:8~e_Gzi!Lz `D_ew|,|,R8w=f-');
define('AUTH_SALT', 'XS-4fOEo],i#`<*qn%xmcf]$ );r+[o)-`75OU[@q@.#fI+2-zb(.m5{LcE*Dr(;');
define('SECURE_AUTH_SALT', '2|^-W{Za]BmBj/^/;-$#Mg81wS|m#s+HpTQ9#fJ+`7.))@g;<G<s2O>fe0F2Mngj');
define('LOGGED_IN_SALT', '>r]-W1Gl|uV9y+DkbC-!:f9mnnU3mr mS CoReKkA+:1L[3CV^-rl]$5ZVk1L1=q');
define('NONCE_SALT', 'm&yP/tYKHk}jxr$]r@Dpj_kEalfn>D&e#%YSy2#-Z=.h$|}9+}|Qk8!6L-RiUKN3');

Now, move the WordPress files to the Apache document root. If this step does not work, you probably forgot to log out and log back in after adding yourself to the www group.

mv * /var/www/html/

WordPress permalinks need to use Apache .htaccess files to work properly, but this is not enabled by default on Amazon Linux.

sudo vim /etc/httpd/conf/httpd.conf

Find the following section in the file and change AllowOverride to All.

<Directory "/var/www/html">
    # some stuff here...
    AllowOverride All
    # other stuff here...
</Directory>

Some of the available features in WordPress require write access to the Apache document root (such as uploading media through the Administration screens).

sudo usermod -a -G www apache
sudo chown -R apache /var/www
sudo chgrp -R www /var/www
find /var/www -type d -exec sudo chmod 2775 {} \;
find /var/www -type f -exec sudo chmod 0664 {} \;

Let's give Apache a restart to pick up the new group and permissions.

sudo service httpd restart

Assign Elastic IPs

Give the server an Elastic IP. This is a public static IP that won't change if we shut down the server. We need this.

elastic-ip

You should now see your Elastic IP.

elastic-ip-status

Under Actions, click Associate Address and associate it with your WordPress instance.

elastic-ip-assign

That's it for the EIP! Oh, by the way, take a note of the IP.
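
The same thing can be done with Boto3 if you prefer. A sketch with a placeholder instance id:

import boto3

ec2 = boto3.client('ec2')

# Allocate a new Elastic IP in the VPC and attach it to the WordPress instance.
allocation = ec2.allocate_address(Domain='vpc')
print('Allocated {}'.format(allocation['PublicIp']))

ec2.associate_address(
    InstanceId='i-0123456789abcdef0',          # placeholder instance id
    AllocationId=allocation['AllocationId'])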

Now repeat!

Don't start WordPress just yet. The first time it runs, it will look at the URL you are using and configure itself with that. Since we haven't created our DNS record yet, we don't want that to happen now. But we do want a WordPress server in the secondary region, so go back and repeat the sections "Set up the web server" and "Assign Elastic IPs". Only this time, point your AWS Console to the secondary region and remember to point the WordPress configuration file at the RDS endpoint in your secondary region.

Configure Route 53

Now we have almost everything up and running. Let’s set up Route 53 to point traffic to the application.

I have a hosted zone in Route 53. If you don’t have one, you can register one quite easily. Here is my zone.

hosted-zone

Health check

First of all, let’s create a health check. It is used by Route 53 to determine if your application is healthy. If not, it will redirect the traffic to your secondary region.

Click Health checks in the Route 53 console and then Create health check. I gave mine the name wordpress. Specify the Elastic IP you assigned to the EC2 instance in your primary region and enter the hostname you will be using. I'm not creating an alert in the next step, just go ahead and create the health check.

health-check

Within a few minutes it should go green.

health-check-status
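
For the record, the equivalent health check can be created through the API as well. A Boto3 sketch where the IP address is a placeholder for your own Elastic IP:

import boto3
import uuid

route53 = boto3.client('route53')

response = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),          # must be unique per request
    HealthCheckConfig={
        'IPAddress': '52.0.0.1',                # placeholder: the Elastic IP in the primary region
        'Port': 80,
        'Type': 'HTTP',
        'ResourcePath': '/',
        'FullyQualifiedDomainName': 'example.cristian-contreras.me',
        'RequestInterval': 30,
        'FailureThreshold': 3})

print('Health check id: {}'.format(response['HealthCheck']['Id']))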

Primary DNS record

Now let's go into our hosted zone and create a record for our primary region. Specify the name of the record, 'example' in this example. I'm setting the TTL to 60 seconds to prevent DNS servers from caching the record any longer than that. Remember, we fail over to the secondary region by updating DNS. Set the Routing Policy to Failover, the Failover Record Type to Primary and specify which health check to use.

dns-pri

Secondary DNS record

Now we need to create our secondary record. This will point to the IP that we want to failover to in case the health check fails.

Similar to last time, only now specify the Elastic IP in your secondary region and set Failover Record Type to Secondary. No health check necessary.

dns-sec
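
Both failover records can also be created through the API. A Boto3 sketch, with placeholder hosted zone id, health check id and Elastic IPs:

import boto3

route53 = boto3.client('route53')

def failover_record(ip, role, health_check_id=None):
    # Build one failover record set; role is 'PRIMARY' or 'SECONDARY'.
    record = {
        'Name': 'example.cristian-contreras.me.',
        'Type': 'A',
        'SetIdentifier': 'wordpress-{}'.format(role.lower()),
        'Failover': role,
        'TTL': 60,
        'ResourceRecords': [{'Value': ip}]}
    if health_check_id:
        record['HealthCheckId'] = health_check_id
    return {'Action': 'UPSERT', 'ResourceRecordSet': record}

route53.change_resource_record_sets(
    HostedZoneId='Z1XXXXXXXXXXXX',       # placeholder hosted zone id
    ChangeBatch={'Changes': [
        failover_record('52.0.0.1', 'PRIMARY', 'abcd1234-placeholder-health-check-id'),
        failover_record('52.0.0.2', 'SECONDARY')]})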

That’s it!

Testing the magic

Now that we have everything set up, we are ready to test. Let's go to our WordPress URL. In my case it's http://example.cristian-contreras.me. As this is the first time, it will take you to the WordPress installation guide.

wp1

Next page asks you for Site Title, Username, password, etc. Complete the form and click Install WordPress to have the blog installed.

wp2

I post a test message and the blog looks surprisingly similar to this one. Our WordPress site is working!

wp3

Let’s break things

If this works as expected, we should be able to stop the EC2 instance in our primary region and see a failover take place. Let’s see if that happens!

First of all, let's have a look at where DNS is pointing. Looks right – that's my EIP in the primary region.

ns-pre

Now we brutally stop the EC2 instance in our primary region.

stop

Our health check reacts within a few minutes.

hc-fail

DNS now points to our secondary region.

ns-post

Testing the site, and it is working! Remember, our EC2 instance in the primary region is stopped, so this is really being served from the secondary region.

wp-sec

You can browse the site, no problem at all, but if we try to post a comment we get the following error. Remember, we are on a read replica in the secondary region, so this is expected.

save-error

So we have failed over to our secondary region, all automatically, apart from us deliberately breaking our application of course. This is pretty good even if our database is read-only, and it might even cover our needs. In a major disaster scenario like this, maybe we are OK with a read-only site. If not, let's fail over the database as well.

Promoting a read replica

Promoting a read replica is really easy. Just highlight it, click Instance Actions and choose Promote Read Replica from the menu and confirm the action on the following screen.

promote
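
Promotion can also be done through the API. A one-liner with Boto3, run against the secondary region:

import boto3

# Run against the secondary region, where the replica lives.
rds = boto3.client('rds', region_name='eu-central-1')

# Detach the replica from its source and make it a standalone, writable instance.
rds.promote_read_replica(DBInstanceIdentifier='wordpress-frankfurt')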

This will take a few minutes and the instance will even have to reboot.

reboot

Once the reboot completes I can post a comment and we are fully operational again.

comment

The big gotcha

There is one important thing to be aware of. Since we promoted our read replica, it is now a standalone instance. This means that we have two WordPress RDS instances living their separate lives, one in each region. If we were to bring our web server back up in the primary region, Route 53 would point us there and we would be talking to the primary region again. Any changes made in the secondary region would be lost. This is not necessarily a problem, just something to be aware of – and in other use cases it might even be the desired behaviour. I'm sure there are clever ways of failing back, but I would probably just terminate my original RDS instance in the primary region, create a read replica from the secondary region back to the primary and fail back in a planned way during off hours.

Also, WordPress stores uploaded media content locally on the server (I believe?), meaning that pictures we upload would never make it to the other region. Our applications should put uploaded files in S3, which does support cross-region replication. There are actually WordPress plugins that do just that. Maybe a topic for an upcoming post.

Conclusion

So we have successfully failed over to our secondary region. If we are OK with a read-only database, failover happens completely automatically. Even failback will occur once Route 53 sees that our application is healthy again in the primary region. It gets a little more complicated if we want to fail over the database, but that is the nature of asynchronous replication: it requires some manual steps and takes a few minutes to promote the read replica. You can of course automate things – AWS has excellent APIs. But is this acceptable or even useful? I say absolutely yes for both scenarios. Remember, running EC2 and RDS in multi-AZ deployments protects you against datacenter-wide outages and does it pretty well. This is the next level of protection, for when not only a datacenter breaks down but a whole region. This is the kind of protection that most of us, back in the day, would never have had the resources to put in place, and a disaster like that would have put us in a really bad situation. With that perspective, this is amazing stuff!