Setting up BuildBot with Windows Slave on AWS EC2

Not long ago I set up a BuildBot instance for wxHaskell and thought about sharing my experience with configuring all the elements of the system. Johan Tibell’s commentary to the State of Haskell, 2011 survey, and his call for Windows bots to test Hackage libraries, prompted me to finally put my notes together into tutorial of a sort. There is nothing Haskell-specific, so if you are looking to automate Windows builds of apps written in any other language this guide should provide all the info you need.

The Setup

BuildBot is a build server designed specifically for automating builds of software that targets multiple platforms. The central server (master) sends build requests to multiple slaves, each of which can be configured differently. The server receives the build results from slaves, aggregates them and displays via a web interface. While the server should be accessible at all times, the slaves are only required when there is something to build.

In order for the build to be repeatable and to avoid random errors introduced by unrelated configuration changes, it would be nice for the slaves to be dedicated machines with a known initial configurations. This sounds like an ideal setting for using virual servers created from a disk image specifically for the build. Thanks to services like Amazon EC2 it is relatively easy and inexpensive to set up such a box.

Helpfully, BuildBot supports so called “latent slaves”, which are started by the server when there is a build to run, and are stopped when there is no more work to do. Furthermore, there is an implementation of a latent slave that works with EC2.

What we will need

For the BuildBot master, we’ll need a server with shell access, which is permanently online and where you can open a port to a non-HTTP traffic (this is required for accepting slave connections). I’ve used a WebFaction shared hosting with a dedicated IP (additional 5$ per month), which is required for the aforementioned port.

We’ll set up the slave on Amazon EC2. I’m not going to cover how to set up an account there as the process is straightforward and there are numerous resources on how to do this. Hosting of the disk image (AMI) will set us back a couple of dollars per month; on top of that we’ll need to pay for the time our slave is running, which for typical builds that take less than an hour and one build every day running on a micro instance will add another couple of dollars.

The Master

All of the steps below happen on the always-online server. They assume that python on our server runs Python version 2.7.

Virtualenv

Download the tarball from [http://pypi.python.org/pypi/virtualenv/] and untar to a temporary directory. In there run python setup.py instal --dry-run. If it complains about missing directories, create them. Continue running with –dry-run until it passe, then run without –dry-run. Once done, we should end up with virtualenv script in ~/bin. We can then get rid of the tarball and temporary directory.

BuildBot

Follow the BuildBot tutorial. I installed BuildBot in ~/opt/buildbot instead of suggested ~/tmp/buildbot, other than that I followed it to the letter.

Opening the BuildBot website to the outside world depends on how our server is set up. For WebFaction it involved creating a custom web application via the control panel. This web application got assigned a port number, which then needed to be set in BuildBot’s config by editing ~/opt/buildbot/sandbox/master/master.cfg and changing the http_port in c['status'].append(html.WebStatus(http_port=8010, authz=authz_cfg)) to the assigned port.

Finally, we might want to set up a cron job to automatically start the master should it die for any reason; here’s the script I used:

#!/bin/bash

# starts the buildbot master if it's not running already

HOME=/home/mmakowski

ps -ef | grep 'buildbot start master' | grep -vq grep
RUNNING=$?
if [ $RUNNING -ne 0 ]
then
    cd $HOME/opt/buildbot
    source sandbox/bin/activate
    cd sandbox
    buildbot start master
fi

Then crontab -e and add entry:

0       *       *       *       *       /home/mmakowski/bin/buildbot-start-master-if-not-running &

which will ensure that the script will run once every hour.

The Slave

If we have a Windows box handy we might want to test the slave setup on it before playing with AWS instance, which is billed by the hour.

Log on to AWS Management Console and set up a Windows instance using one of the Windows AMIs provided by Amazon((I’ll assume we used a 64-bit OS in the instructions that follow; if not, 32 bit versions of relevant software should be used instead of the ones mentioned)) (see EC2 Gettings Started guide for step-by-step instructions). After a couple of minutes we will be able to retrieve the administrator password and connect using Remote Desktop Connection (if on Windows) or rdesktop (on Linux or Mac).

Once logged on, we need to change the administrator password to something we’ll remember and install the following software:

On top of that we will need to install any other software required to build the project we’re setting the slave up for (VCS client, compiler, libraries etc.)

Once it is done, we can follow the slave part of the BuildBot tutorial to make sure that the slave can communicate with the master and can handle build requests.

Note that so far the slave we set up is a regular (non-latent) one, i.e. master will assume that it is always online and available to handle build requests. We’ll address this shortly; for now it’s important to make sure that the builds work as expected on our newly set up server.

Making it Latent

The Slave

After we are happy that the master can invoke builds, one final piece of slave configuration is ensuring that the slave process starts up automatically after the server is brought on line; on Windows this is done using services. Making slave run as a service will require a piece of custom Python code. I’ve used the script by Ira Pfeifer, modified slightly to work with recent version of BuildBot. Here it is, for your convenience; make sure to update the environment variables set in the script to contain everything your build needs:

""" buildbot_slave_service.py

Original Author: Ira Pfeifer
Email: ipfeifer -dot- tech -at- gmail
"""

import sys
import os
import subprocess
import win32serviceutil
import win32service
import win32event
import win32api
import time

from buildslave.scripts import runner
from buildslave.scripts.startup import start

# change paths as appropriate
slavepath = "c:\\buildbot\\slave-wxhaskell"
homepath = "\\"
os.environ['HOMEDRIVE'] = "C:"
os.environ['HOMEPATH'] = homepath
os.environ['PATH'] = r"C:\Python27;C:\Python27\Scripts;C:\Program Files (x86)\Haskell\bin;C:\Program Files (x86)\Haskell Platform\2011.2.0.1\lib\extralibs\bin;C:\Program Files (x86)\Haskell Platform\2011.2.0.1\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;c:\program files (x86)\wx-config;c:\program files (x86)\Darcs;C:\Users\Administrator\AppData\Roaming\cabal\bin"
os.environ['WXWIN'] = r"C:\wxWidgets2.8"
os.environ['WXCFG'] = r"gcc_dll\mswu"


class BuildSlaveService(win32serviceutil.ServiceFramework):
    _svc_name_ = "BuildBot_Slave_wxHaskell"
    _svc_display_name_ = "BuildBot Slave wxHaskell"
    _svc_description_ = "Buildbot slave based in " + slavepath

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        self.stop_event = win32event.CreateEvent(None, 0, 0, None)

    def SvcDoRun(self):
        # The service starts a subprocess that will run the actual buildbot
        # so that it can be stopped by simply killing off the subprocess.
        self.child = subprocess.Popen(["python", __file__, "start"])
        isAlive = True
        while isAlive:
            time.sleep(10)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        handle = win32api.OpenProcess(1,0,self.child.pid)
        # returns exit code - wrap w/ error handling?
        win32api.TerminateProcess(handle, 0)
        win32api.CloseHandle(handle)
        isAlive = False
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32.SetEvent(self.hWaitStop)

        SvcShutdown = SvcStop

def start_buildslave():
    config = runner.Options()
    config.parseOptions(['start', slavepath])
    so = config.subOptions
    start(so)

if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == "start":
        start_buildslave()
    else:
        win32serviceutil.HandleCommandLine(BuildSlaveService)

Put this somewhere on the server and run buildbot_slave_service.py install. Then go to Services application and change the startup to run automatically (delayed start), as Administrator.

After this is done we are ready to take a snapshot of the servers hard drive, so that we can start clones later on. Once the AMI appears as available in AMIs section of the EC2 console we will terminate the instance we’ve just configured.

The Master

We’ll need one additional Python library to enable BuildBot master to control EC2 instances: boto. On the master server, cd to ~/opt/buildbot, activate the sandbox (. sandbox/bin/activate) and run easy_install boto.

It’s now time to tell master to treat our slave as latent. The setup is described in the BuildBot docs, and it boils down to updating c['slaves'] like this:

####### BUILDSLAVES

# The 'slaves' list defines the set of recognized buildslaves. Each element is
# a BuildSlave object, specifying a unique slave name and password.  The same
# slave name and password must be configured on the slave.
from buildbot.buildslave import BuildSlave
from buildbot.ec2buildslave import EC2LatentBuildSlave
c['slaves'] = [EC2LatentBuildSlave('<slave name>', '<slave password>', 't1.micro',
                                   region=u'eu-west-1',
                                   ami='ami-12345678',
                                   identifier='<key identifier>',
                                   secret_identifier='<secret key>')]

Where:

  • <slave name> and <slave password> are the same as the ones we used so far
  • t1.micro is the type of EC2 instance; change it to a bigger one if that’s what you require for your build (check the prices before doing that though)
  • region should be the region where you set up your AMI (can be seen in the AWS console)
  • ami-12345678 is the id of the AMI we have created from the slave instance
  • <key identifier> and <secret key> can be generated and obtained from the Security Credentials section of your AWS account.

Finally, in BuildBot 0.8.4 there is an issue with EC2 instances not being terminated. If left unfixed it can lead to each build’s disk image being left active indefinitely and charging to our account! The bug should be fixed in BuildBot 0.8.5; in the mean time we can patch it by hand, by replacing instance.stop() with instance.terminate() in ~/opt/buildbot/sandbox/lib/python2.7/site-packages/buildbot-0.8.4p2-py2.7.egg/buildbot/ec2buildslave.py.

Once this is done, a buildbot restart master should provide us with a BuildBot process running with the final set up and starting up EC2 slave when required.

Fixing the Slave

If it turns out that our slave configuration needs an update, here is what we do:

  • in AWS console, create a new instance using our slave AMI
  • connect to the instance using Remote Desktop Connection/rdesktop and make the necessary config changes
  • create a new AMI from the reconfigured instance and then termiante the instance
  • update the master config with the new AMI id

Debugging the Master

While the source code of BuildBot might be a bit difficult to follow for someone not versed in Twisted, I found the manhole interface invaluable in tracking down what turned out to be a trivial config issue.

Gotchas

Two things I got bitten by and which took a little bit to investigate and fix are:

  • AWS instance reboot on startup. The Windows box is rebooted by AWS immediately after being started; apparently this is required for some TCP/IP-related reconfiguration to take effect. As a result it is crucial that our BuildBot slave service is set up with delayed start. Otherwise the service will start and report to the master; it can even start running the build, when suddenly the server will reboot. After the reboot the service will come back up and try to connect again, but the master will reject this connection as unsolicited.
  • The instance.stop() issue mentioned earlier. Strangely, I would swear that my setup had been working for about a week without any patches, with instances being terminated; the “orphaned” stopped instances seemed to have started appearing suddenly and without any changes in either the slave or the server code or config. I can only assume I must have been seeing things, since instance.stop() has been there all the time and I’d never expect it to terminate an instance.

Hopefully you won’t come across any more nasty surprises. Happy building!