
True enterprise content management reliability in the cloud and what went wrong at Box yesterday

Posted on 23 May, 2012 in Huddle news

Yesterday morning, Box.net customers woke up to something unplanned and potentially very costly: beginning at 5:51am EST, those trying to access their files discovered that they were inaccessible. The message that greeted them was: “Your Box account is temporarily down – but you shouldn’t be – because we’ll be up and running soon.” Not what you want to see from your enterprise content management tool. The outage lasted for three hours, and during that time no information was available to Box’s customers about the nature of the outage or when access to files would resume.

“Pardon the Pause” may be cute, but it won’t resonate with global enterprises working across multiple timezones, whose operations don’t pause and whose workers need access to the information in their enterprise content management system to get their jobs done around the clock.

Nearly six hours after Box was notified of the outage, Box Support said at 11:45am EST, “Our engineering team pushed several updates to the site yesterday which appear to have caused these issues. These changes have now been reverted and you should now be able to access the site again.”

Box’s service outage isn’t the first of its kind; we discussed Google’s service outage that occurred just before the launch of Google Drive.

Just to be clear: there is no such thing as 100% uptime. Nothing is perfect and no system is flawless—and that’s why it’s important for cloud software vendors to be very transparent about uptime, provide money-back uptime guarantees to customers (SLAs) and be informative and responsive during downtime events.

Businesses don’t ‘pause’ and neither should your enterprise content management system.

Today’s IT departments are responsible for supporting two unique sets of customers with different sets of needs: internal employees and external customers, partners, and suppliers. But these constituencies are more similar than you think: Just as your internal employees increasingly expect to perform their jobs anytime and anywhere, your external stakeholders share the same expectations in their ability to purchase, receive support, or access your data and systems. Forrester refers to this concept as the extended enterprise because a business function is rarely, if ever, a self-contained workflow within the infrastructure confines of the company. There is no easy button when it comes to running always-on, always-available services; a blend of a mature and stable process, people, and, of course, technologies are required. For companies that have matured their approach to high availability and disaster recovery to the point where they are one and the same — a concept that Forrester refers to as business technology resiliency — it has taken years of refining policies, adapting responses to downtime, and securing the appropriate levels of investment.

—Rachel Dines, Analyst, Forrester Research

Huddle believes that our customers should have 24/7 access to their content. And we put our money where our mouth is. Huddle’s Service Level Agreement (SLA) guarantees uptime of 99.9% over any three-month period, but we regularly overachieve.
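For context, a 99.9% uptime guarantee over a three-month window allows only a couple of hours of downtime per quarter. A quick back-of-the-envelope calculation (illustrative only; real SLA terms depend on how the measurement window and exclusions are defined):

```python
# How much downtime does a 99.9% SLA allow over a three-month window?
sla = 0.999                      # guaranteed uptime fraction
window_minutes = 90 * 24 * 60    # ~three months, in minutes

allowed_downtime = (1 - sla) * window_minutes
print(f"Allowed downtime per quarter: {allowed_downtime:.0f} minutes "
      f"(~{allowed_downtime / 60:.1f} hours)")
# -> roughly 130 minutes, or about 2.2 hours
```

By that yardstick, a single three-hour outage like yesterday’s would already breach a 99.9% quarterly SLA.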

Our only pauses are planned maintenance, scheduled for times when the fewest customers are expected to be working, and our customers are notified in advance. No surprises.

You can check out our real-time uptime report for the last 90 days here: http://uptime.awaremonitoring.com/uptime/huddle.

Uptime is extremely important for business-critical cloud applications and Huddle employs two separate external monitoring systems to track and record availability and response time from various locations around the globe. We have a 24×7 team available to respond in the unlikely event of a serious application issue.
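As an illustration of what those external checks involve, here is a minimal sketch of an availability and response-time probe. The URL is a placeholder, and this is not how Pingdom, Aware Monitoring or Huddle’s own monitors are actually implemented; it simply shows the principle of checking the site regularly from the outside.

```python
# Minimal external uptime probe: fetch the site once, record status and response time.
import time
import urllib.request

TARGET = "https://example.com/health"   # placeholder health-check URL
TIMEOUT_SECONDS = 10

def probe(url: str) -> dict:
    """Fetch the URL and report whether it responded, and how quickly."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as resp:
            status = resp.status
    except Exception:
        status = None                   # timeouts and network errors count as "down"
    elapsed = time.monotonic() - start
    return {"url": url, "status": status, "response_time_s": round(elapsed, 3),
            "up": status == 200}

if __name__ == "__main__":
    print(probe(TARGET))
    # A real monitoring service runs a check like this every minute from several
    # regions, stores the history, and pages the on-call team when checks fail.
```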

Huddle’s CTO Jonathan Howell offers some insight into what might have happened at Box yesterday:

“My suspicion is that Box didn’t suffer from infrastructure failure, but a bad software release. They put out a new release on Monday night, then headed home. Sadly it then turned out that the release had a major issue that meant that most of their customers could not access their files. What’s worse, timezones acted against them, as the rest of the world struggled with this problem before the engineering or customer support teams in San Francisco were awake and aware of the problem.

Once Box’s staff were awake and up to date on the situation they rolled back the release.

You could say: this can happen to any enterprise content management system. It happened to Dropbox, where a bad software release left the door wide open, allowing users to log in without a password for four hours.

Here’s what we do at Huddle to prevent this from happening:

Automated testing

We use Test Driven Development as an approach to writing software, so all the code we write is automatically tested at a low level every time we build it. That alone is not sufficient to give us confidence to release, though. In addition, all Huddle functionality has automated integration tests, and the web user interface has another set of automated tests that drive the browser to check it behaves correctly for all test cases. These tests are not written by the same people who write the code, but by our highly skilled QA and Test Automation Engineers, who determine all the possible test cases before development begins. A release must pass all of these tests before we consider it a candidate to go live.
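To make the layering concrete, here is a minimal sketch of the two kinds of automated test described above: a low-level unit test written alongside the code, and a UI test that drives a real browser. The function names, URL and selectors are hypothetical, and Selenium is used purely as an example of a browser-driving tool, not necessarily the tooling in use here.

```python
# Sketch of the two automated test layers. All names and URLs are hypothetical.

# Layer 1: a low-level unit test, written test-first alongside the code.
def document_title_is_valid(title: str) -> bool:
    """Example production rule: titles must be non-empty and at most 255 characters."""
    return 0 < len(title.strip()) <= 255

def test_document_title_is_valid():
    assert document_title_is_valid("Q2 board pack")
    assert not document_title_is_valid("")            # empty titles rejected
    assert not document_title_is_valid("x" * 300)     # overly long titles rejected

# Layer 2: an automated UI test that drives a real browser against an agreed test case,
# written by QA / Test Automation engineers rather than the developers of the feature.
def test_login_page_renders():
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    try:
        driver.get("https://test.example.com/login")      # placeholder test environment URL
        assert driver.find_element(By.NAME, "username")    # login form fields are present
        assert driver.find_element(By.NAME, "password")
    finally:
        driver.quit()
```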

Regression and release testing

Once we have a candidate release, we deploy it to our test environment in exactly the same way that we will deploy it to production, testing not only that the software is good, but that our process for releasing it will go smoothly too. All our releases go out without interruption to the service, so testing the process is key. Once in test, we regression test the release over a number of days before deploying to production, and test again once the release is live.
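One way to picture this is a single smoke-test suite that can be pointed at whichever environment a release has just landed in, so the same checks run after deploying to test, after deploying to production, and once the release is live. A rough, hypothetical sketch; the environment names, URLs and pages checked are placeholders:

```python
# Minimal smoke-test runner, usable against the test or production environment.
import sys
import urllib.request

ENVIRONMENTS = {
    "test": "https://test.example.com",
    "production": "https://www.example.com",
}

SMOKE_PATHS = ["/", "/login", "/status"]   # hypothetical critical pages

def smoke_test(base_url: str) -> bool:
    """Request each critical page and report PASS/FAIL; return overall success."""
    ok = True
    for path in SMOKE_PATHS:
        url = base_url + path
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                passed = resp.status == 200
        except Exception:
            passed = False
        print(f"{'PASS' if passed else 'FAIL'}  {url}")
        ok = ok and passed
    return ok

if __name__ == "__main__":
    env = sys.argv[1] if len(sys.argv) > 1 else "test"
    sys.exit(0 if smoke_test(ENVIRONMENTS[env]) else 1)
```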

Monitoring

Running a service that is up more than 99.99% of the time requires sophisticated monitoring to alert us to the first signs of any trouble. This includes external monitoring from Pingdom and Aware Monitoring that checks the site every minute, a raft of internal monitors, and, importantly, alerts that trigger if we see a rise in the number of errors being logged by our servers. All of these monitors alert our operations team instantly, as well as automatically escalating issues via SMS and phone using Pager Duty. We are on hand 24/7 to respond to any problems our users are seeing, without waiting for them to contact us.
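An alert on a rise in logged errors can be as simple as comparing the latest minute’s error count against a recent baseline. A toy sketch; the thresholds are arbitrary and the paging function is a placeholder, not the actual Pager Duty integration:

```python
# Toy error-rate alarm: page the on-call engineer when errors spike above baseline.
from collections import deque

WINDOW = 5            # number of one-minute buckets used for the baseline
SPIKE_FACTOR = 3.0    # alert if the latest minute has 3x the baseline error count
MIN_ERRORS = 20       # ignore spikes below this absolute count

recent_counts = deque(maxlen=WINDOW)

def check_error_rate(errors_this_minute: int) -> bool:
    """Return True if this minute's error count looks like an incident."""
    baseline = sum(recent_counts) / len(recent_counts) if recent_counts else 0
    recent_counts.append(errors_this_minute)
    return (errors_this_minute >= MIN_ERRORS and
            errors_this_minute > SPIKE_FACTOR * max(baseline, 1))

def page_on_call(message: str) -> None:
    # Placeholder: a real system escalates via SMS and phone (e.g. through Pager Duty).
    print("PAGING ON-CALL:", message)

# Example: a quiet baseline, then a sudden spike triggers a page.
for count in [2, 3, 1, 4, 2, 150]:
    if check_error_rate(count):
        page_on_call(f"{count} errors logged in the last minute")
```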

Release timing

Architecting Huddle to allow releases without downtime (we have had just one “scheduled maintenance” in the last two years) means that we can release during the working day – and early in the working day – to give us the maximum chance of catching any issues that have slipped through our many rounds of testing. We certainly don’t clock off once the release is out – we hang around to check that all is well. As a global business, there is no such thing as “out of hours” for us.
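One common way to achieve releases without downtime is to deploy the new version alongside the old one and only switch traffic over once it passes health checks, keeping the old version warm for a fast rollback. The following is a generic blue/green style sketch, not a description of the actual release machinery; every function here is a placeholder standing in for real deployment and load-balancer plumbing.

```python
# Rough sketch of a blue/green style zero-downtime release. All functions are
# placeholders standing in for real deployment and load-balancer operations.

def deploy_to(pool: str, version: str) -> None:
    print(f"Deploying {version} to the {pool} pool...")

def health_check(pool: str) -> bool:
    print(f"Running health checks against the {pool} pool...")
    return True   # placeholder: would run smoke tests against the idle pool

def switch_traffic_to(pool: str) -> None:
    print(f"Pointing the load balancer at the {pool} pool...")

def release(version: str, idle_pool: str = "green") -> bool:
    deploy_to(idle_pool, version)        # the live pool keeps serving users meanwhile
    if not health_check(idle_pool):
        print("Health checks failed; traffic never moved, so users saw no impact.")
        return False
    switch_traffic_to(idle_pool)         # near-instant cut-over, no downtime
    print("Release live; previous pool kept warm for a fast rollback.")
    return True

release("example-release-candidate")
```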

Of course this doesn’t mean an issue like the one that affected Box this week could never happen to us – but we work very hard to make sure it is very unlikely.”

 

1 Comment

  1. Andrew

    May 24, 2012 at 6:28 pm

    Interesting. I was expecting you guys to slam Box with this blog post, instead it’s very well done and informative.

    Good job.
