As of 2:34 pm EST, November 6, 2017, Channeltivity is experiencing downtime caused by a disruption of Microsoft Azure Cloud Services. We are working with Microsoft to determine the cause of the issue and will post updates as we find out more.
Update 3:03 pm EST:
Microsoft engineers are investigating an Azure service disruption at their North Central US data center.
Update 3:20 pm EST:
Microsoft has identified a network infrastructure issue causing the downtime and is actively investigating.
Update 4:04 pm EST:
Microsoft engineers are currently investigating previous updates and deployments to the region along with other possible network level issues, and are taking additional steps to mitigate impact.
Update 4:10 pm EST:
Channeltivity services came back online just now. We're monitoring our services and will post updates until we receive final resolution from Microsoft and mark this issue as resolved.
Update 4:45 pm EST:
Microsoft reports the network issue as mitigated. We will post the root cause analysis in the next few days as it becomes available.
Update November 9:
All Clear from Microsoft. Here's the official report with the root cause analysis:
Summary of impact: Between 19:33 UTC and 21:07 UTC on 6 Nov 2017, a subset of customers in North Central US experienced degraded performance, network drops, or timeouts when accessing Azure resources hosted in this region. A subset of customers based in the region may have experienced issues accessing non-regional resources, such as Azure Active Directory and the Azure portal. In addition, a subset of customers using App Service / Web Apps in South Central US would have experienced similar issues due to a bug discovered in how Web Apps infrastructure components utilize North Central US resources for geo-redundancy purposes. During a planned capacity addition maintenance window in the North Central US region, a tag was introduced to aggregate routes for a subset of the North Central region. This tagging caused the affected prefixes to be withdrawn from regions outside of North Central US. Automated alerting detected the drop in availability and engineering teams correlated the issue with the capacity addition maintenance; however, automated rollback did not mitigate the issue. The rollback steps were manually validated, which fixed the route advertisement and restored connectivity between the North Central US facilities.
Root cause and mitigation: A capacity augmentation workflow in a datacenter in the North Central US region contained an error that resulted in too many routing announcements being assigned a Border Gateway Protocol (BGP) community tag that should have only been used for routes that stay within the North Central US region. When aggregate IP prefixes intended for distribution to the Microsoft core network were incorrectly tagged for local distribution only, it interrupted communication between this datacenter and all other Azure regions. The addition of new capacity to a region has been the cause of incidents in the past, so both automated systems and manual checks are used to verify that there is no impact. The tool used in this turn-up did attempt to roll back, but due to a second error the rollback workflow also applied the incorrectly tagged prefix policy, so the rollback did not resolve the problem. Because the rollback did not resolve the incident, engineers lost time looking for other potential causes before determining the true root cause and fully rolling back the capacity addition work. This increased the time required to mitigate the issue.
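For readers less familiar with BGP, here is a minimal Python sketch of the failure mode the report describes: a community tag meant only for intra-region routes gets applied to the aggregate prefixes that should reach the rest of the network, and an automated rollback that reuses the same faulty policy logic re-applies the bad tag instead of removing it. All community values, prefixes, and function names below are hypothetical placeholders for illustration only, not Microsoft's actual configuration or tooling.

# A minimal sketch, assuming hypothetical community values and RFC 5737
# documentation prefixes; this does not reflect Microsoft's real tooling.

LOCAL_ONLY = "65000:100"  # assumed community meaning "keep within North Central US"
GLOBAL = "65000:200"      # assumed community meaning "advertise to the core network"


def generate_policy(prefixes, buggy=False):
    """Tag each aggregate prefix with a BGP community.

    The assumed bug: the workflow tags every prefix LOCAL_ONLY instead of
    only the intra-region aggregates, so nothing is exported to the core.
    """
    tag = LOCAL_ONLY if buggy else GLOBAL
    return {prefix: tag for prefix in prefixes}


def advertise_to_core(policy):
    """Other regions only learn prefixes tagged for global distribution."""
    return [prefix for prefix, tag in policy.items() if tag == GLOBAL]


# Documentation prefixes (RFC 5737), purely illustrative.
aggregates = ["192.0.2.0/24", "198.51.100.0/24"]

# 1. Capacity addition applies the mis-tagged policy: no prefixes are
#    exported, so the region disappears from the rest of the network.
broken = generate_policy(aggregates, buggy=True)
print(advertise_to_core(broken))       # [] -> unreachable from other regions

# 2. Automated rollback rebuilds the policy with the same faulty logic,
#    re-applying the bad tag instead of undoing it.
rolled_back = generate_policy(aggregates, buggy=True)
print(advertise_to_core(rolled_back))  # still []

# 3. Manual intervention regenerates the policy with correct tagging,
#    restoring the route advertisements.
fixed = generate_policy(aggregates, buggy=False)
print(advertise_to_core(fixed))        # ['192.0.2.0/24', '198.51.100.0/24']

The only point of the sketch is the propagation rule: routes carrying the local-only community are filtered before they reach other regions, so a single mis-applied tag is enough to make an entire region unreachable from outside, and a rollback that shares the buggy tagging logic cannot fix it.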
Zach Smith