Microsoft Azure Service Degradation - RESOLVED

Microsoft Azure experienced a service degradation affecting certain Channeltivity customers. 


Thu, 30 Apr at 3:57 PM EDT: 

Channeltivity is currently experiencing a connectivity issue for certain customers. We're investigating and will post further updates shortly.


Update 4:02 PM: 

There is a reported service interruption in Microsoft Azure Service Bus that is causing Channeltivity downtime.


Update 4:08 PM: 

We are working with Microsoft to resolve the issue as quickly as possible.


Update 4:50 PM:

Update posted 9 minutes ago to the Azure status dashboard:

Starting at 16:20 UTC on 30 Apr 2015 a subset of Service Bus customers in North Central US, and Visual Studio Online customers in multiple regions, may be experiencing intermittent timeouts or errors. We have started deploying a mitigation and are actively monitoring service status.


Update 5:38 PM:

Microsoft has deployed a partial mitigation and is working on a full mitigation. 


Update 6:42 PM:

Microsoft has deployed a partial mitigation that has reduced the observed error rate. They are now deploying a full mitigation to restore normal service availability.


Update 6:51 PM:

We're still waiting on confirmation from Microsoft that the issue has been fully mitigated, but some affected customer instances are coming back online. 


Update 7:27 PM:

Microsoft reports that the mitigation effort is still incomplete.


Update 7:41 PM:

Microsoft is in the process of deploying a full mitigation.


Update 9:37 PM:

Microsoft deployed a further mitigation that has reduced error rates and is examining options to restore full service availability.


Update 11:36 PM:

Microsoft is still examining options to restore full service availability.


Update 12:09 AM EDT, Friday, May 1st:

All customer portals are back online. We're waiting for the all-clear from Microsoft before marking this post as resolved.


Update 9:10 AM:

Although Microsoft still hasn't given the green light, all customer portals have been stable for the last 9 hours and we've resumed normal operations. We are continuing to monitor the situation and will post any updates as they come in.


Update 3:07 PM:

Microsoft confirms that the issue has been completely mitigated as of 8:15 AM EDT.


Update 12:22 AM EDT, Saturday, May 2nd:

The issue has returned. We are working with Microsoft to restore full service availability.


Update 6:47 AM:

All customer portals are back online. We're waiting for the all-clear from Microsoft before marking this post as resolved.


Update 5:05 PM:

Service availability has been reliable and we are marking this post as resolved. We are waiting for a full root cause analysis from Microsoft and will update this post as we hear more.


Update 12:11 PM EDT, Wednesday, May 6th:

We are waiting for a full root cause analysis from Microsoft and will update this post as we find out more.



Final Update, Tuesday, May 12th:

Microsoft published the full RCA:


From 16:20 to 18:15 UTC on 30 Apr 2015, a subset of Service Bus customers experienced issues due to an underlying Azure SQL Database issue (9736560) in North Central US. After the Azure SQL Database issue was mitigated at 18:15 UTC, Service Bus customers may have continued to experience intermittent timeouts or errors because Service Bus Frontend machines were experiencing high CPU usage. This high CPU usage was caused by a Service Bus client bug present in SDK versions 2.3, 2.4 and 2.5, which caused Service Bus clients with certain usage patterns to reconnect to Service Bus constantly, in particular through repeated SSL connection attempts.
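
For readers wondering why constant reconnects can overload a frontend, and what well-behaved retry logic generally looks like, here is a minimal sketch of capped exponential backoff with jitter. It is an illustration only: it is not the Service Bus SDK, not Microsoft's fix, and the names backoff_delays and connect_with_backoff, along with all parameter values, are hypothetical.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=8, rng=None):
    """Yield one randomized delay per attempt (capped exponential backoff).

    All parameter values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The upper bound doubles each attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # "Full jitter": pick a random delay in [0, ceiling] so that many
        # clients do not all retry at the same instant.
        yield rng.uniform(0.0, ceiling)


def connect_with_backoff(connect, sleep, base=1.0, cap=60.0, attempts=8, rng=None):
    """Call connect() until it succeeds, sleeping between failed attempts.

    connect and sleep are caller-supplied callables (hypothetical hooks);
    in real use, sleep would typically be time.sleep.
    """
    last_error = None
    for delay in backoff_delays(base, cap, attempts, rng):
        try:
            return connect()
        except ConnectionError as exc:
            last_error = exc
            sleep(delay)  # wait before the next attempt instead of retrying at once
    raise ConnectionError("gave up after %d attempts" % attempts) from last_error
```

A client that retries immediately and indefinitely, by contrast, pays the full cost of a new connection handshake on every attempt, which is the kind of load the RCA describes.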


Because of the complexity of this issue, the Azure team detected and mitigated it through the following combination of measures:

1. Identified clients creating high traffic and deployed a service fix to reduce the traffic load from clients using SDK versions 2.3, 2.4 and 2.5.

2. Scaled out Service Bus Frontends to handle the excessive request volume.

3. Optimized request processing.


At 12:15 PM on May 1st, 2015, the Azure team confirmed that full functionality of Service Bus was restored.