The Scenario

You work on a software team and someone decides that a continuous build would be a good idea. Well, they are right! Someone should give that person a raise. The only problem is, who is going to build this thing? Software developers typically just want to write code, not build information systems. The IT department typically doesn’t know enough about the software development and testing process to configure such a beast. That’s where this article helps. It is a real-world case study of the build system at my workplace, and it will hopefully ease your way into the wonderful world of continuous builds with Hudson.

The Evolution of our Build Server

When we started with continuous builds, I built a homegrown server (best bang for the buck at the time) to handle the work. That was about six years ago. Our build job consisted of compiling the Java code and creating a WAR file for our testers to deploy. It ran in a few minutes and everything was fine; life was good. The testers no longer needed to build the code manually on their workstations. The build server would email the committers to tell them whether the build passed. Since at the time this was just a compile, it passed 99.9% of the time.

This was before we had a suite of JUnit tests to run. We had a JUnit test here and there, but nothing resembling a full suite. Our boss at the time (Mike Cohn) set out to change that. We started in on a test-first methodology, created a top-level test suite, and of course added it to the build. Our build time was starting to increase. And surprisingly (or not so surprisingly), the percentage of successful builds started to fall.
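
For reference, wiring a top-level suite into an Ant-based build like ours is not much more than pointing the junit task at the suite class. Something along these lines is a reasonable sketch (the suite class name, paths, and target name here are illustrative assumptions, not our actual build file):

<target name="test" depends="compile">
  <junit printsummary="yes" haltonfailure="no" fork="yes">
    <classpath>
      <pathelement location="build/classes"/>
      <pathelement location="build/test-classes"/>
    </classpath>
    <!-- XML results are what CruiseControl (and later Hudson) read for reporting -->
    <formatter type="xml"/>
    <test name="com.example.AllTests" todir="build/test-reports"/>
  </junit>
</target>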

Our whiz-bang tester, Lisa Crispin, was using a technology from a company called Canoo. It’s an open-source project named WebTest, a functional web-app testing tool. It got to the point where Lisa wanted the WebTest tests integrated into the build. I was able to add them and write a little plug-in for our build system at the time (CruiseControl) to capture the results. This was the beginning of the end of that homegrown server. The build time was up to around 10 minutes, and that was with a watered-down list of WebTests to run. We created a nightly build to run the full set of tests at night, as that now took almost an hour. Our build server also crashed. We were feeling some pain for the first time.
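
If you have not seen WebTest before, a test is essentially an Ant-based script that drives the web app and checks the responses. Roughly like this; the element names are from memory of the WebTest docs of that era, and the URLs and text are invented, so treat it as a sketch rather than one of our actual tests:

<webtest name="login smoke test">
  <!-- which server the steps run against; later in this story it becomes localhost -->
  <config host="test-server" port="8080" protocol="http" basepath="ourapp"/>
  <steps>
    <invoke url="login.jsp" description="open the login page"/>
    <verifyText text="Please sign in"/>
  </steps>
</webtest>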

Build Servers Two and Three

Again I built up a homegrown server, this time using two AMD 2600MP CPUs (I know this because we still use it, albeit in a different configuration). We call this server “Build 1”. Our regular continuous build times started to creep up as we added more and more JUnit tests, and we hit the ten-minute mark. We set an internal goal to keep the “build” down to 7 minutes; however, to accomplish this we had to buy another server, which we call “Build 2”. It got the continuous build down to 7 minutes, and we used “Build 1” for the nightly build, which at this point was not actually run nightly; it just ran whenever it could after it saw a check-in. The “nightly” took about an hour to run.

Failure After Failure

We beat the crap out of these servers. During the workday these boxes run non-stop, and what typically fails are the drives. Builds are both I/O and CPU intensive; the servers get extremely hot and components start to fail. Each time they failed, we felt it. At this point we rely on the build so much that not only do we have to recover the hardware, our code starts to degrade as well: regressions are not caught quickly, so it takes even more time to find the bugs because they are “stacked” and it’s not clear which check-in caused the failure. These failures usually caused our release to slip, as time is critical when you are doing 2-week iterations. However, this was something we had to live with, as money was allocated to other functions in the company and not so much to build servers.

Migration to Hudson

One thing I could never get to work in CruiseControl was the “distributed” build. This seemed to be a critical piece if we wanted to get the build time down. After our build started to get out of hand again in terms of time, I started looking at other build systems. I had played with Hudson a year earlier and thought it was the way to go even then; I just didn’t have the time to convert our build to it. Most things I do have to seem transparent to everyone else or they will not want me doing them, as time is again critical in a 2-week sprint.

Every six months or so we get what we call a “refactoring” or engineering sprint, where we spend time upgrading tools and libraries and refactoring really bad code that has been bugging us. Leading up to this sprint I installed Hudson on my Linux workstation and started to play with our build and Hudson. We also had a new coding project, with its own source tree to deploy, that needed its own build but had no home. I sent out some eye-candy screenshots and it didn’t take long before the team was sold. This would be an official project for the engineering sprint.

During the engineering sprint we successfully converted Build 1 and Build 2 into Hudson build slaves and left my Linux box as the Hudson master until we purchased a new master, a Dell PE 2850. This was the first real server ever in our build farm.

Hudson is too good

Now that we had Hudson and everyone liked it, requests started coming in. First was Lisa Crispin: she wanted the Fitnesse tests fully integrated. We had a somewhat cheesy way of kicking off the Fitnesse tests, but no good way to get the results. She polled her community and got hold of a plug-in that someone had written to present the Fitnesse results in Hudson. At the same time we upgraded Fitnesse, and the upgrade changed the format of the test results (damn you, Bob!). We had to use someone else’s XSL stylesheet to convert the results back to the previous release’s format, and then life was good again. Now that the Fitnesse tests were integrated into the build, the system started to get overloaded yet again.
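
The conversion itself was just an XSL transform dropped into the build before Hudson read the results. Hypothetically it looks something like the target below; the file names and stylesheet name are placeholders, not the actual stylesheet we borrowed:

<target name="convert-fitnesse-results">
  <!-- rewrite the new-format results into the old format the Hudson plug-in understands -->
  <xslt in="results/fitnesse-results.xml"
        out="results/fitnesse-results-legacy.xml"
        style="fitnesse-new-to-old.xsl"/>
</target>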

The Obvious Way to Configure a Hudson Build System

Our tests did what you might expect and had to use external servers outside the control of Hudson. For example, for the Canoo WebTest tests to run, they had to run against a web app server that had the latest code deployed. We used a single web app server and a single suite of WebTest scripts, and this would take hours to run. The Fitnesse tests were a similar story, with a single external Fitnesse server outside the control of the Hudson system. Parallelizing the tests became necessary to keep the test-results feedback loop short. However, this presented a problem: the tests were never meant to run in parallel. Refactoring the tests to work in this fashion didn’t sound like a good idea. They worked as-is right now, rewriting tests tends to invalidate them, and it was simply going to be a lot of work. Problem two was that we would need more backend external servers to run the web app and Fitnesse.

The solution that seemed to work the best, which I like to call the “Obvious Way to Configure a Hudson Build System, but we don’t do it that way for some strange reason,” was to create a generic slave that had all of the software needed to run any test without hitting an external server. This means that instead of a build slave running a WebTest against an external server, it would start up its own web application server (at the time, Tomcat) and run the tests against localhost. This was brilliant and it worked well. Creating a new build slave was simple; the key was just to have all of the tools installed the same way, with the same usernames and passwords.
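
In practice each slave’s job boils down to: deploy the freshly built WAR into the slave’s own Tomcat, wait for it to come up, point the tests at localhost, then shut Tomcat down. Here is a rough sketch of that orchestration in Ant; the paths, ports, and target names are illustrative assumptions, not our actual scripts:

<target name="run-webtests-locally" depends="build-war">
  <!-- deploy the just-built WAR into this slave's private Tomcat -->
  <copy file="dist/ourapp.war" todir="${tomcat.home}/webapps"/>
  <exec executable="${tomcat.home}/bin/catalina.sh">
    <arg value="start"/>
  </exec>

  <!-- wait until the app answers on localhost before starting the tests -->
  <waitfor maxwait="3" maxwaitunit="minute" checkevery="5" checkeveryunit="second">
    <http url="http://localhost:8080/ourapp/"/>
  </waitfor>

  <!-- run the WebTest suite against this machine only -->
  <ant antfile="webtests.xml" target="run-all">
    <property name="webtest.host" value="localhost"/>
  </ant>

  <exec executable="${tomcat.home}/bin/catalina.sh">
    <arg value="stop"/>
  </exec>
</target>

Because every slave is built the same way, this target runs unchanged on any of them, which is exactly what makes cloning jobs and slaves so cheap.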

Victims of our own success

This method worked too well. The testers started splitting up their test suites and creating new build jobs left and right, mainly because it was so simple. They didn’t have to deal with hostname issues and whatnot. They simply created a new job from an existing Hudson job and changed whatever they needed in order to run a different set of tests. When a job ran, it did not care what else was running, because it ran in its own separate environment.

The build environment at this time consisted of the Dell PE 2850 as the Hudson master, with Build 1, Build 2, and my old Linux workstation as build slaves. That workstation, called Mojito, was working so hard now that its big fans would spin up to full speed and make a ruckus by my desk. I would have the entire team coming after me with pitchforks, as this thing was loud. I eventually moved it to the server room and got a new Linux workstation (a Dell Precision). Anyway, the problem now was that there weren’t enough build slaves to handle the load, and builds were getting queued up. I sent a screenshot of Hudson’s home page to my boss to show the problem. At that moment four builds were running (we had four machines in the build farm) and seven jobs were queued up waiting for a place to run. Our feedback loop was long.

If you checked in some code to fix a WebTest before lunch, you would be lucky to find out whether it fixed the problem by the time you went home. If it was already after lunch, it would be the next day before you knew. Something had to happen.

Birth of the Virtualized Hudson Build System

We have been using server virtualization around the office for about three years now. We even had some virtualized servers in our production environment. This technology is great and works as advertised. The idea came about to buy a single 8-core machine and split it into eight virtual build slaves. On paper this seemed like a perfect solution to our problem, so it was a surprise to me that we just couldn’t get the money approved to do it. Eight-core servers (two CPUs with four cores each) are now the norm and are pretty dang cheap right now, especially considering the cost of having highly paid engineers wait for a build. These boxes literally cost about $3,000. However, the request always seemed to get pushed to the back burner. Until it happened again.

Here We Go Again

By this time our main compile build was generating 738 MB (yes, seven hundred megabytes) of data. This build ran in isolation on the master server, as moving that much data across the wire from a slave back to the master would have added to the build time, which was now at 15 minutes. On Aug 2, 2010, the Dell PE 2850, also known as HUDSON (the master), started to crash. Lisa sent an email to the team at 8 p.m. that said “Hudson just start freaking out”. Our main Linux guy (Steve Kives) sent a response that said “The server is seriously ill” and included the following log information:

Aug 2 19:57:19 hudson syslogd: /var/log/messages: Read-only file system
Aug 2 19:57:19 hudson kernel: megaraid: aborting-216341995 cmd=2a
Aug 2 19:57:19 hudson kernel: megaraid abort: 216341995:19[255:128], fw owner
Aug 2 19:57:21 hudson kernel: megaraid mbox: critical hardware error!
Aug 2 19:57:21 hudson kernel: megaraid: hw error, cannot reset
Aug 2 19:57:21 hudson kernel: megaraid: hw error, cannot reset
Aug 2 19:57:21 hudson kernel: sd 0:2:0:0: timing out command, waited 360s
Aug 2 19:57:24 hudson kernel: journal commit I/O error
Aug 2 19:57:24 hudson kernel: journal commit I/O error
Aug 2 19:57:24 hudson kernel: sd 0:2:0:0: rejecting I/O to offline device
Aug 2 19:57:24 hudson kernel: EXT3-fs error (device dm-0): ext3_find_entry: reading directory #15958056 offset 0

I read the emails and knew what had happened. We had just lost the disks. What the hell, I thought. This thing is hardware RAID 5, but it was no use. In the morning Steve tried to restart the box, but no worky. The controller (a Dell PERC 4) just started to re-initialize the drives. We had now officially lost our entire configuration (Hudson is hard to back up because it keeps configuration and data in the same directory).
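
For anyone who ends up in the same spot: the configuration you really need lives in a handful of XML files under the Hudson home directory, so it is worth archiving just those and skipping the workspaces and build records. A minimal sketch, assuming the standard HUDSON_HOME layout (the target name and backup path are made up):

<target name="backup-hudson-config">
  <!-- archive the global settings plus each job's config.xml;
       workspaces and build history are deliberately left out -->
  <tar destfile="backups/hudson-config.tar.gz" compression="gzip">
    <tarfileset dir="${hudson.home}">
      <include name="*.xml"/>
      <include name="jobs/*/config.xml"/>
    </tarfileset>
  </tar>
</target>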

We had an old Dell PE 850 lying in the rack powered off, and I decided to rebuild on that while the rest of the team was sharpening the pitchforks. It took about a day to get just the compile build working again. And since this was a slower machine, the build time went up to 17 minutes, but at least the pitchforks got put away.

Time to Implement Something New

It took a long time to rebuild everything, and at the same time we had some major software architecture changes that made it hard to determine whether a build was failing because of a new Hudson configuration issue or because of our code changes.

The good news is that this failure prompted management not only to approve our original request but also to approve a new Hudson master to replace the failed box. After some debate and a lot of planning, we decided to make everything virtual, even the master. This was to guard against another hardware failure, which we knew was going to happen sometime in the future. The next time a host crashes, the VMs that were on the crashed box can migrate over to a running box. If we do this right, we should no longer have any downtime due to hardware failures.

The Dawn of a New Generation

Before I could commit 100% to the virtualization path I needed some numbers; I needed performance data to back up this decision. Recall that our pre-crash Hudson server could do the compile in 15 minutes; the post-crash old server could do it in 17. But what overhead did virtualization add? I needed to know. And without further ado, here it is.

Server                                                      Time
HUDSON (pre-crash)                                          15 minutes
HUDSON (post-crash)                                         17 minutes
New 8-core server (non-virtualized)                         10 minutes
New 8-core server (virtualized)                             12 minutes
New 8-core server (virtualized, iSCSI SAN for VM storage)   13 minutes

The holy grail of virtualization (in my mind) is to have the VMs able to move from server to server without being stopped. To do this you need some sort of shared storage between the virtualization hosts, which is where the last number comes from: a virtualization host running its VMs on an iSCSI SAN. With that in mind, 13 minutes is an awesome result considering what we are gaining; it is roughly a 30% penalty over bare metal on the same hardware (13 minutes versus 10), and still faster than either of the old servers. I think the overhead of virtualization is well worth it. We will be able to further decrease our build time by parallelizing the builds even more, and adding capacity is pretty simple as well; we just have to add more virtualization hosts.

Conclusion

We didn’t make our 7-minute build time goal, and I’m not sure we will ever see a time that short again. We probably could have if we hadn’t virtualized any of the build servers, but that is a price I think we are willing to pay for a more reliable build system. Overall our builds will be faster, as our queue should not be that deep anymore. This solution is very effective at getting every single ounce of capacity out of a server (the bosses will like that). Even though we didn’t spend a lot on this system, and it’s therefore not the biggest, baddest set of servers on the block, it’s what we have for now and it works well. And as long as our Chuck Norris plug-in is happy, we are happy.