At Huddle we take the availability of our service very seriously. Users of Huddle rely on our site to be able to do their work and collaborate with their team on a daily basis and any outage, however brief, could disrupt their working day.

We also have a global user base, which means that there is no such thing as “out of hours” for the Huddle service; users around the world expect access to documents or project information at any time of the day or night. For this reason we monitor the availability of our service very closely using two different external monitoring tools, and include scheduled maintenance as downtime in our availability statistics. This is a key difference between providing an enterprise tool and a casual consumer service – if Twitter is down, you can still carry on working, and some would argue you might even get more work done. Here’s an example of Huddle’s uptime:

Not everyone agrees with this approach though. Our friends over at 37 Signals wrote a blog post on this topic a few years back saying that shooting for super-high levels of availability was unnecessary. “I think our uptime is more like 98% or 99%. Guess what, we’re still here!”, writes David. While I agree that it is important to assess the criticality of your web site, I happen to disagree with his implicit assessment of the criticality of sites like Basecamp or Huddle.net.

So to the numbers. While our SLA offers Enterprise customers 99.5% availability, we consistently exceed this, and our availability over the last 3 months has hit 99.99%. That’s four nines – an achievement not to be sniffed at. The only downtime was a 15-minute scheduled (Saturday night) database change.

How do we know? Well, as I mentioned, we have two different external monitoring tools watching our site: Pingdom and Gomez. These perform slightly different functions for us.

Pingdom generates an HTTP request every minute from one of a number of locations around the world, hitting a page that tests all layers of the application stack. Pingdom then checks for a successful response (matching a string that appears on the resulting page). This is our primary uptime measure and alerting mechanism. If Pingdom records our site as down, it sends an SMS to key people at Huddle, who jump to fix the problem no matter what time of day or night it is.

Gomez, on the other hand, tests the site every two hours rather than every minute, but logs on to the site and walks through a typical transaction much as a real user would, recording response times for every single page element as it goes. This is less useful for alerting us to any downtime (the site could be down for an hour and Gomez would not notice), but it does give us performance trends over time and allows us to drill down to get more details on pages or days that appear slow.
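
To make the Pingdom-style check and the uptime arithmetic concrete, here is a minimal Python sketch. It is not how Pingdom itself is implemented; the marker string, the function names and the 90-day approximation of "3 months" are all assumptions made for illustration.

```python
from urllib.request import urlopen

# Hypothetical marker that a health page might print once it has
# exercised every layer of the stack (web, application, database).
EXPECTED_MARKER = "huddle-stack-ok"


def response_is_healthy(status: int, body: str, marker: str = EXPECTED_MARKER) -> bool:
    """A check passes only on HTTP 200 AND the expected string in the page."""
    return status == 200 and marker in body


def check_once(url: str, timeout: float = 10.0) -> bool:
    """Fetch the health page once; any network or HTTP error counts as down."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return response_is_healthy(resp.status, body)
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False


def availability_percent(total_minutes: int, down_minutes: int) -> float:
    """Share of the period, in percent, during which the site was up."""
    return 100.0 * (total_minutes - down_minutes) / total_minutes


def downtime_budget_minutes(total_minutes: int, target_percent: float) -> float:
    """Minutes of downtime allowed in the period at a given availability target."""
    return total_minutes * (100.0 - target_percent) / 100.0


if __name__ == "__main__":
    quarter = 90 * 24 * 60  # assumption: "3 months" taken as 90 days
    print(round(availability_percent(quarter, 15), 2))  # one 15-minute outage
    print(downtime_budget_minutes(quarter, 99.5))       # budget under the SLA
```

Under the 90-day assumption, a single 15-minute outage works out at roughly 99.99% when rounded to two decimal places, while a 99.5% SLA would allow 648 minutes of downtime over the same period. Checking for a string in the page, rather than just a 200 status, is what lets a single request catch failures in any layer behind the web server.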

Huddle.net has not always recorded four nines of uptime – we have put in a lot of effort over the last year to get to this point (and there’s more to come to keep us there). Things we have done include:

  • Eliminate single points of failure in the infrastructure. As part of the infrastructure refresh and performance testing we did in preparation for the launch of our LinkedIn application, we made sure we had suitable redundancy at every layer.
  • Profile resource utilisation (CPU, memory etc.) under load and eliminate issues. For example we use Lucene for our searches, and in early 2008 we had problems with Lucene memory usage, which caused an outage. Not only is this now resolved, but profiling ahead of release ensures the problem will not recur.
  • Change control. Nothing to do with the underlying technology, but probably the best way to ensure high availability of your application. As I have mentioned previously, the single thing most likely to break a web site is a new software release. So we test our releases thoroughly before putting them live. And when we are happy a release is of suitably high quality, we test it again just to make sure. We keep a record of the changes we make, test the release once it is live, and our change log allows us to roll back if there are any problems.
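
As a rough illustration of the change-control point, the sketch below shows one way a change log that supports rollback could look in Python. It is not Huddle's actual tooling; the `Change` and `ChangeLog` names and their methods are hypothetical, and a real system would persist the log rather than keep it in memory.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List, Optional


@dataclass
class Change:
    """One production change, recorded together with the way to undo it."""
    description: str
    apply: Callable[[], object]
    revert: Callable[[], object]
    applied_at: Optional[datetime] = None


class ChangeLog:
    """In-memory log of applied changes (hypothetical; illustration only)."""

    def __init__(self) -> None:
        self._applied: List[Change] = []

    def apply(self, change: Change) -> None:
        """Run the change, then record it with a UTC timestamp."""
        change.apply()
        change.applied_at = datetime.now(timezone.utc)
        self._applied.append(change)

    def roll_back(self, count: int = 1) -> List[str]:
        """Undo the most recent `count` changes, newest first."""
        reverted: List[str] = []
        for _ in range(min(count, len(self._applied))):
            change = self._applied.pop()
            change.revert()
            reverted.append(change.description)
        return reverted

    def history(self) -> List[str]:
        """Descriptions of the changes currently applied, oldest first."""
        return [c.description for c in self._applied]


if __name__ == "__main__":
    config = {"pool_size": 10}
    log = ChangeLog()
    log.apply(Change("raise pool size",
                     lambda: config.update(pool_size=20),
                     lambda: config.update(pool_size=10)))
    print(log.history(), config)
    print(log.roll_back(), config)
```

The design choice worth noting is that every change is recorded with its own undo step at the moment it is made, so rolling back never relies on someone remembering what was done.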

In fact, on reflection, if you want to improve your website’s uptime, my advice would be to start with the last one of those – put some controls around changes to your production environment, and keep a log of everything you do so you can reverse it if you later find it broke your site. It’s not exciting, it’s not cool, and it’s not very startup or Web 2.0, but really, this is the single best thing you can do to improve your site’s uptime.


