01 October 2014 Posted by Anonymous

On September 24th, 2014, Amazon Web Services began notifying customers of a mandatory update that would require roughly 10% of EC2 instances globally to reboot. The maintenance window was scheduled from September 26th through the 30th, leaving no choice and little time to prepare. The company later acknowledged that the update was required to fix a problem with the Xen hypervisor.

Two days after AWS, Rackspace released a similar notification to customers – also with the intent of updating the flawed Xen hypervisor. The Rackspace notification went out late on a Friday evening, well after most regular IT staff had left work for the week. Eventually, Rackspace issued a formal apology for the way it handled the reboot.

IBM was even later to the party, notifying customers of its SoftLayer cloud today (seven days after AWS) that they would have only 12 hours to prepare for their own reboot.

Is There a Problem?

Negative feedback about what has amounted to a forced reboot has been widespread. Some critics are adamant that AWS and Rackspace are at fault, and believe customers shouldn’t be forced to reboot if they don’t want to. But isn’t it true that on-premises IT shops also have to address security concerns sometimes? The critics also insist that customers should be given more time to prepare for a reboot. But don’t most IT organizations already push out mandatory patches on short notice?

On the other end of the spectrum, a different group of critics says that cloud customers are to blame for their own setbacks – that they should already know scheduled maintenance is a part of public IaaS clouds. But what about the many cloud SLAs that promote 99.99% or even 100% uptime? Are service providers misleading their customers? The same critics also assert that customers should be responsible for designing highly available, resilient applications before deploying to the cloud. But what about the providers who claim their clouds are enterprise-ready and that you can run mission-critical applications with the same reliability as an on-premises environment? Are those providers stretching the truth?

The fact is there are many problems with forced cloud reboots, especially on such a massive scale. And, while it isn’t helpful to simply point fingers, both cloud providers and cloud customers have a role to play in resolving these issues. Let’s consider what both of these groups can do.

Recommendations for Cloud Customers

Cloud customers must understand that downtime in public clouds happens for a variety of reasons, including hardware failures, operational mistakes, hypervisor patching, planned maintenance, hardware upgrades, and more. So, remember: statements about “guaranteed uptime” in cloud SLAs don’t account for scheduled maintenance. They also don’t mean that the cloud services will actually meet the stated availability objectives.

Scheduled maintenance and other service interruptions will happen, and you may not have much, if any, time to prepare. So it is important to be ready before downtime actually occurs. Cloud customers should start preparing by putting the right applications in the cloud. If unexpected downtime is unacceptable for your application, you probably need a high-availability (HA) configuration. If you can build an HA environment in your public cloud, great. If not, the application should not be placed there. New applications should be designed from the start to tolerate service disruptions.
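To make "designed to tolerate service disruptions" a little more concrete, here is a minimal sketch in Python, using only the standard library, of a client that fails over to a second copy of a service when the first is unreachable. The endpoint URLs and timeout are placeholders for illustration, not anything a particular provider prescribes.

import urllib.request

# Hypothetical endpoints: the same service deployed in two availability zones.
ENDPOINTS = [
    "https://app-zone-a.example.com/health",
    "https://app-zone-b.example.com/health",
]

def fetch_with_failover(urls=ENDPOINTS, timeout=3):
    """Try each endpoint in turn; return the first successful response body."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                if resp.getcode() == 200:
                    return resp.read()
        except OSError as exc:
            last_error = exc  # endpoint down or rebooting; try the next one
    raise RuntimeError(f"All endpoints failed; last error: {last_error}")

if __name__ == "__main__":
    print(fetch_with_failover())

The point of the pattern is simply that no single zone or instance is a hard dependency; a forced reboot of one copy becomes a retry rather than an outage.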

Even with the right applications in the cloud, you still need to be prepared for inevitable downtime. Regularly test and verify that your applications can overcome service interruptions. They must restart after an interruption without data loss and without excessive impact on end-user experience, revenue, or other business objectives.
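One lightweight way to test this is a recurring "reboot drill": reboot a disposable test instance, then measure how long the application takes to report healthy again. The sketch below assumes you have such an instance, a script or API call of your own that reboots it, and a health-check URL; all three are placeholders, not part of any provider's tooling.

import subprocess
import time
import urllib.request

HEALTH_URL = "https://test-app.example.com/health"   # placeholder health endpoint
RECOVERY_DEADLINE = 300                               # seconds allowed for recovery

def reboot_test_instance():
    # Placeholder: substitute your provider's CLI or API call for rebooting
    # a dedicated, non-production test instance.
    subprocess.run(["./reboot-test-instance.sh"], check=True)

def wait_until_healthy(url=HEALTH_URL, deadline=RECOVERY_DEADLINE):
    """Poll the health check until it passes or the recovery deadline expires."""
    start = time.time()
    while time.time() - start < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.getcode() == 200:
                    return time.time() - start  # seconds taken to recover
        except OSError:
            pass  # instance still rebooting or application not yet up
        time.sleep(10)
    raise RuntimeError("Application did not recover within the deadline")

if __name__ == "__main__":
    reboot_test_instance()
    print(f"Recovered in {wait_until_healthy():.0f} seconds")

Running a drill like this on a schedule turns "we think the app survives a reboot" into a measured recovery time you can compare against your business objectives.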

Recommendations for Service Providers

When it comes to guaranteeing uptime, promise what you can deliver, not what you think customers want to hear. This means not using SLAs as marketing pieces. It also means communicating to customers up front that planned and unplanned maintenance will cause service interruptions from time to time. Every provider should publish specific best practices for customers to follow when downtime is required.

Don’t try to convince customers, or prospective customers, that any application can run in your cloud. Not every application is suited for a public cloud environment, and some applications simply run best on-premises.

Cloud providers should also focus on innovation to both minimize downtime and grant customers more flexibility in choosing their window of downtime. Consider live migration with additional automation to rapidly move instances to patched systems – a feature that cloud providers like Google and VMware already offer their customers. Look for ways to enable customers to postpone maintenance such as hypervisor patches without impacting other customers.
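As a rough illustration of the second idea, the sketch below shows provider-side scheduling logic that live-migrates instances where possible and otherwise honors a customer-chosen reboot window that falls before the patch deadline. The data structures and actions are hypothetical and only meant to show the logic, not any provider's actual API.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

PATCH_DEADLINE = datetime(2014, 9, 30, 23, 59)   # example end of the maintenance window

@dataclass
class Instance:
    instance_id: str
    preferred_window: Optional[datetime]   # customer-chosen start time, if any
    supports_live_migration: bool

def schedule_maintenance(instances):
    """Assign each instance a maintenance action and start time."""
    plan = []
    for inst in instances:
        if inst.supports_live_migration:
            # Move the workload to an already-patched host; no customer-visible reboot.
            plan.append((inst.instance_id, "live-migrate", datetime.now()))
        elif inst.preferred_window and inst.preferred_window <= PATCH_DEADLINE:
            # Honor the customer's chosen window when it falls before the deadline.
            plan.append((inst.instance_id, "reboot", inst.preferred_window))
        else:
            # No preference expressed: fall back to a provider-chosen default slot.
            plan.append((inst.instance_id, "reboot", PATCH_DEADLINE - timedelta(hours=12)))
    return plan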

The Cloud Isn’t Perfect, and Neither Is On-Premises IT

Some amount of downtime should be expected in every IT environment. However, maintenance processes are different in public clouds. With on-premises IT, operations teams can decide which security patches to apply, as well as when to apply them. In a public cloud, customers have little or no choice about whether and when to accept security patches.

To reduce the impact of downtime, cloud customers should run the right applications in public clouds, and use proven operational processes to handle forced reboots, unscheduled downtime and other service interruptions. Service providers should not only work on better, timelier communications – they should also work on innovating new ways to further reduce downtime.

Over time, the problems associated with forced reboots in public clouds are likely to be reduced. Until then, it will take more effort from both customers and providers.
