It was an ordinary Tuesday, the last day of February 2017. It was getting late in the day; I was online doing my monthly management accounts and was just finishing the data entry for a very long invoice when refreshing the expenses tab in my browser showed this delightful picture.
Woah, I thought, someone has updated the package and not done their QA properly. I waited 5 minutes and then the inevitable 500 error arrived when I tried to save my invoice. Ho hum, all the data was lost so at 6pm precisely the only way forward was to pour a gin & tonic and investigate.
I then learned that Strava was down, so I couldn't see who had done some sneaky Tuesday cycling. So were Docker's Registry Hub, Trello, Travis CI, GitHub and GitLab, Quora, Medium, Signal, Slack, Imgur, Twitch.tv, Razer, anything with images in S3, Adobe's cloud, Zendesk, Heroku, Nest and at least two clients' media repositories.
Fifteen minutes later we learned that Amazon's S3 storage in its East Coast data centre had gone down. Companies that had a mixed cloud strategy, or those that followed the rules and used multiple AWS (Amazon Web Services) data centres to replicate their data and compute loads, showed signs of slowing down, but they did not crash.
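The strategy those survivors followed can be sketched in a few lines: keep copies of the data in more than one region, and if the primary region fails, read from a replica. The sketch below is a minimal illustration in Python; the region names and the `fetch_with_failover` / `RegionDown` names are my own assumptions for the example, not any real AWS API.

```python
# Illustrative multi-region failover sketch -- NOT a real AWS API.
# The idea: try the primary region first, then fall back to replicas.

class RegionDown(Exception):
    """Raised when a region cannot serve the request."""

def fetch_with_failover(regions, fetch):
    """Try each region in order; return the first successful result.

    regions: list of region names, primary first.
    fetch:   callable(region) -> data, raising RegionDown on failure.
    """
    last_error = None
    for region in regions:
        try:
            return fetch(region)
        except RegionDown as err:
            last_error = err  # remember the failure and try the next replica
    raise last_error or RegionDown("no regions configured")

# Example: the primary region is down (as on 28th Feb 2017), but a
# replica in a second region still serves the object.
store = {"us-west-2": "invoice-data"}

def fetch(region):
    if region not in store:
        raise RegionDown(region)
    return store[region]

print(fetch_with_failover(["us-east-1", "us-west-2"], fetch))
# prints "invoice-data"
```

The design point is simply that the fallback path must already be wired in before the outage: replication after the fact is no help once the primary region has stopped answering.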
To Amazon's credit, they found the problem and fixed it incredibly fast. One can only imagine the disaster-movie scenes in the data centre, with virtual klaxons wailing and, for some unknown Hollywood reason, the front panels exploding from overloaded servers. Has anyone actually seen the "Exploding front panel bolts" option on the HP Server Configuration sheet?
Last week's Hollywood Professional Association Tech Retreat in Indian Wells saw many of the big thinkers in the industry discuss the future and the impact of the cloud. Some viewpoints focussed on the new opportunities that arise from being free of owning and operating an in-house data centre. Others focussed on how a geographically diverse Disaster Recovery setup could be implemented across multiple data centres, giving protection should the main broadcast / cable / satellite / origin server disappear. Others still talked about what could be done if the network to the cloud were compromised.
The underlying impression from everyone was that the cloud won't go away.
If cloud resilience is part of your day-to-day business, then it is well worth reviewing the tweets and articles about the impact of the outage on Tuesday 28th Feb 2017. If a consumer can't change channels on their TV because their smart remote control has been unable to authenticate over WiFi for 30 minutes, and their heating suddenly refuses to come on because the smart controller cannot get a response from the database telling it what temperature to set, then maybe the fact that your cloud Business Continuity Solution is down will go unnoticed, because your viewers are sitting in the dark, huddled in a blanket, reading books by bike light.
The cloud is undoubtedly one of the most promising ways of deploying compute technology so far devised, but unless each and every application has a plan for both security and resilience, the pitfalls of placing unlimited trust in any cloud vendor's uptime could be both large and unpredictable.
When you're wandering around the halls of NAB 2017, I guarantee that this will be a good topic of conversation for the software vendors, especially if you start the conversation with "What would happen to your application if....."
Until next month - Enjoy!