This week, Amazon Web Services (AWS) storage services in one of its regions experienced issues for the better part of a day. The problems affected a large number of organizations in the northeast, such as Adobe, MailChimp, Quora, Slack, Trello, Zendesk, and even the SEC, whose services were either running slowly or down altogether.
The Cloud isn’t infallible, and outages do happen from time to time, though for the most part cloud services are quite reliable. There was a five-hour AWS outage in 2015, and Microsoft Azure experienced an outage in February that lasted about five hours. The major impact of this week’s disruption of AWS S3 lasted about four hours, with most services returning to normal operation in that window, and all issues were completely resolved after twelve hours. The reliability isn’t in question here. What’s important is that if your business is offline for half a day, that’s a big deal to your customers. Too often, companies are lulled into believing that because they are now in the Cloud, they are immune to any type of downtime or catastrophic failure. This is not true. Regardless of environment, Cloud or legacy data center, things can happen, and a solid disaster recovery plan is vital to a company’s success.
Here are some things you can do to protect yourself.
Assess the Risk
Many companies use AWS as a data backup for their on-premises hosted services. Others present a SaaS-based service to clients that is hosted entirely on S3. For the former, there is a very real risk that some transactions that occurred during the outage are never backed up, because the sync is incomplete when service is restored. These gaps can be difficult to detect. The latter group faces the more ominous challenge of a disruption of service to customers and a resulting loss of revenue. Proper risk assessment will determine where and how protective systems are put into place, and how those activities are budgeted.
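One way to make those sync gaps visible is to reconcile the local record of transactions against what actually reached the backup once service is restored. The sketch below is only an illustration: the function name find_sync_gaps and the idea of holding both sides as in-memory dictionaries keyed by transaction ID are assumptions, not part of any AWS API, and a real system would read these from its own database and its backup store.

```python
import hashlib
from typing import Dict, List


def checksum(payload: bytes) -> str:
    """Return a SHA-256 hex digest used to compare two copies of a record."""
    return hashlib.sha256(payload).hexdigest()


def find_sync_gaps(local: Dict[str, bytes], backup: Dict[str, bytes]) -> Dict[str, List[str]]:
    """Compare local transactions with backed-up copies (hypothetical helper).

    Returns the IDs that never reached the backup ("missing") and the IDs
    whose backed-up content differs from the local copy ("mismatched").
    """
    missing = sorted(tx_id for tx_id in local if tx_id not in backup)
    mismatched = sorted(
        tx_id
        for tx_id in local
        if tx_id in backup and checksum(local[tx_id]) != checksum(backup[tx_id])
    )
    return {"missing": missing, "mismatched": mismatched}
```

Running a check like this after every outage, and as part of routine operations, turns a silent gap into a list of transactions that can be backed up again.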
Have a Plan
If continually servicing your clients is important (and why wouldn’t it be?), then a solid disaster recovery plan is essential. Always plan and test for a disaster, because no hardware, or land mass for that matter, is immune to failure.
As we mention in our Disaster Recovery case study, a solid DR process is vital to a company’s success.
The plan must have buy-in from upper management and must be rehearsed on a quarterly basis.
Build in Redundancy
AWS provides ‘availability zones’, which are really disaster recovery zones. Availability zones are distinct locations within a region that are engineered to be isolated from failures in other availability zones.
AWS will automatically replicate data to these zones so that if one data center fails, a copy of the data is still available elsewhere. However, it’s key that any application is aware of these zones and is designed accordingly.
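What "aware of these zones" can mean in practice is that the application does not depend on a single location. The following is a minimal sketch of that idea, not an AWS API: call_with_failover, AllZonesFailed and the zone names are hypothetical, and the request argument stands in for whatever call your application makes to a service in a given zone.

```python
from typing import Callable, Dict, Iterable, TypeVar

T = TypeVar("T")


class AllZonesFailed(Exception):
    """Raised when every zone in the list failed to answer."""


def call_with_failover(zones: Iterable[str], request: Callable[[str], T]) -> T:
    """Try the request against each zone in order (hypothetical helper).

    Returns the first successful result. If a zone raises, the error is
    recorded and the next zone is tried.
    """
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return request(zone)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors[zone] = exc
    raise AllZonesFailed(f"no zone answered: {errors}")
```

The same pattern works with a list of regions instead of zones, which matters when, as in this week's incident, the problem affects a service across an entire region.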
Test your plan each quarter. That way, when a disaster strikes, the company can execute the plan smoothly, with minimal downtime for its customers.
Review the Plan
At a minimum, the complete plan should be reviewed once a year, starting from a re-assessment of risk and how that might have changed.
The main lesson learned from this incident is that no system is infallible. As of this writing, it does not appear that AWS was the victim of any sort of cyber-attack. It just went down. It is the responsibility of each company to consider the worst case and put into place those methods, assets and procedures that will protect the business when the impossible occurs.