Affected
Operational from 9:11 AM to 4:51 PM
- Update
Postmortem: 8 December Cloud Incident
On the morning of 8 December 2023, the VRA Cloud infrastructure experienced downtime because our cloud provider had issues with its storage solution. At first this only impacted basic communication with the Cloud dashboard, but later on it could also have resulted in out-of-date Visual Player content in the Output Players. We escalated the incident to Major Outage because users were unable to access the Cloud at some moments during the incident. The incident did not affect any running applications (Core/Output Player/Audio Manager). Internal communication between clients (for Audio triggers, Output Playout, etc.) was not interrupted, but external control from the Cloud suffered a partial outage.
Unfortunately, the root cause of the issue was external, at our cloud provider. Our apologies for the inconvenience caused by the resulting downtime of the VRA Cloud. We are working on a faster failover switch for major outages like this one, which will shorten the time the Cloud is inaccessible.
Summary & Timeline
Our monitoring began alerting on increasing API response times and a decreasing success rate at 10:59 CET. While we investigated, the main Cloud API, responsible for the communication between the Cloud dashboard and local applications, started showing signs of degraded performance, with an unstable database as the root cause.
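For context, the alert condition itself is conceptually simple. Below is a minimal sketch of such a probe; the endpoint URL and the thresholds are illustrative assumptions, not our actual monitoring stack:

```python
import time
import requests

API_URL = "https://api.visualradio.cloud/health"  # hypothetical endpoint
P95_LATENCY_THRESHOLD_S = 2.0   # illustrative alert threshold
SUCCESS_RATE_THRESHOLD = 0.99   # illustrative alert threshold

def probe(samples: int = 20) -> tuple[float, float]:
    """Probe the API and return (approximate p95 latency, success rate)."""
    latencies, successes = [], 0
    for _ in range(samples):
        start = time.monotonic()
        try:
            resp = requests.get(API_URL, timeout=5)
            successes += resp.status_code == 200
        except requests.RequestException:
            pass  # a timeout or connection error counts as a failure
        latencies.append(time.monotonic() - start)
    latencies.sort()
    return latencies[int(0.95 * (len(latencies) - 1))], successes / samples

if __name__ == "__main__":
    p95, rate = probe()
    if p95 > P95_LATENCY_THRESHOLD_S or rate < SUCCESS_RATE_THRESHOLD:
        # In production this condition would page the on-call engineer.
        print(f"ALERT: p95={p95:.2f}s, success_rate={rate:.0%}")
```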
During the degradation of the Cloud API, the database became completely unavailable at 11:24 CET, around the time our cloud provider declared a major incident on the storage side of its cloud infrastructure in region AMS-1. This storage incident caused multiple components of the Cloud to fail with problems accessing storage. We immediately escalated the incident, which means our backup solution was prepared for failover.
At 12:30 CET our backup was ready for a complete failover. Because this is a very invasive measure, we waited for the provider's most recent updates, which initially suggested the situation in AMS-1 was recovering. When we got confirmation at 12:50 that recovery of the outage in AMS-1 would take extra time, we immediately triggered the failover scenario, which made the backup infrastructure active. After an initial load-balancing issue, the failover infrastructure became generally available at 13:20 CET.
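At its core, triggering the failover scenario means repointing DNS at the backup infrastructure. A minimal sketch of that switch, assuming a generic DNS provider with an HTTP API; the API endpoint, record name, and IP address below are all hypothetical:

```python
import requests

DNS_API = "https://dns.example-provider.net/v1"  # hypothetical provider API
API_TOKEN = "REDACTED"                           # provider credential
RECORD_NAME = "api.visualradio.cloud"            # record being switched
FAILOVER_IP = "203.0.113.10"                     # documentation-range example IP

def switch_dns_to_failover() -> None:
    """Point the Cloud API hostname at the failover load balancer."""
    # A short TTL makes clients pick up the change (and the later
    # fail-back) quickly.
    resp = requests.put(
        f"{DNS_API}/records/{RECORD_NAME}",
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        json={"type": "A", "value": FAILOVER_IP, "ttl": 60},
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    switch_dns_to_failover()
```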
The main infrastructure became partially stable again at 14:00 CET. While we monitored the situation on our backup infrastructure, we prepared to fail back to normal as soon as possible (to decrease the size of the sync back). All tests and monitoring on our regular NL-AMS-1 infrastructure went green again at 14:08 CET, which initiated a direct recovery of the complete Cloud service in the following minutes.
Throughout the remainder of the day, we kept closely monitoring the infrastructure; there were no further service interruptions after the complete recovery at 14:08.
Response
During major incidents like this, we immediately take action by preparing and activating our failover infrastructure. Because a short outage and quick recovery at the cloud provider seemed most likely, the failover scenario was activated at a later point in time.
When we finally activated the scenario, DNS configuration issues, combined with an internal networking issue that prevented the load balancer from routing traffic correctly, resulted in a slower failover.
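A pre-flight check along the following lines would catch both classes of problem before the public switch. This is a sketch under assumptions: the failover hostname, expected IP, and health path are hypothetical:

```python
import socket
import requests

FAILOVER_HOST = "failover.visualradio.cloud"  # hypothetical failover hostname
EXPECTED_IPS = {"203.0.113.10"}               # documentation-range example IP
HEALTH_PATH = "/health"                       # hypothetical health endpoint

def dns_resolves_to_failover() -> bool:
    """Verify that the failover hostname resolves to the expected address."""
    try:
        _, _, ips = socket.gethostbyname_ex(FAILOVER_HOST)
    except socket.gaierror:
        return False
    return bool(set(ips) & EXPECTED_IPS)

def load_balancer_routes() -> bool:
    """Verify that the load balancer actually reaches a healthy backend."""
    try:
        resp = requests.get(f"https://{FAILOVER_HOST}{HEALTH_PATH}", timeout=5)
    except requests.RequestException:
        return False
    return resp.status_code == 200

if __name__ == "__main__":
    if dns_resolves_to_failover() and load_balancer_routes():
        print("Failover path verified: DNS and load balancer OK")
    else:
        print("Failover path NOT ready: hold the public DNS switch")
```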
Recommendations
We'll continue to test our failover scenarios bi-monthly, and we have already mitigated the issues with our load-balancing setup and DNS this afternoon.
In the future we want to be able to recover service even faster in case of a major outage, by creating more flexible ways to route traffic. We've also seen slow recovery of Studio Bus connections to the Cloud, which should be improved with resilient connection recovery, as sketched below.
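As an illustration, resilient recovery for those connections would look roughly like exponential backoff with full jitter. The connection callable and the delay limits here are assumptions, not the actual Studio Bus client code:

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

BASE_DELAY_S = 1.0  # assumed initial retry delay
MAX_DELAY_S = 60.0  # assumed cap on the retry interval

def connect_with_backoff(connect: Callable[[], T], max_attempts: int = 10) -> T:
    """Retry `connect` with exponential backoff and full jitter.

    Full jitter spreads reconnect attempts out over time, so many clients
    don't all hammer the Cloud at the same instant when it comes back.
    """
    for attempt in range(max_attempts - 1):
        try:
            return connect()
        except OSError:
            delay = min(MAX_DELAY_S, BASE_DELAY_S * 2 ** attempt)
            time.sleep(random.uniform(0, delay))
    return connect()  # final attempt: let any exception propagate
```

A client would then call connect_with_backoff(open_studio_bus) with its real connection routine; open_studio_bus is a hypothetical name for illustration.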
status.visualradio.cloud will always be up to date with the current status and incident reports. Subscribe with your email address to receive status updates for VRA.
- Resolved
We are currently scaling back to our original production infrastructure. Some connections will be discarded in this process and will be recovered shortly.
- Update
We currently see positive test results on our main infrastructure. We'll give an update when we start downscaling our backup infra.
- Update
The backup infrastructure is completely rolled out and available. We'll keep monitoring the infrastructure for any degradation.
- Update
The backup infrastructure is in place and will become functional as the incremental DNS updates are processed.
- Monitoring
The backup scenario is in place; we are currently monitoring the stability of the backup.
- Identified
Our cloud provider is having issues with increased latency on certain storage components. This is currently impacting all communication to VRA Cloud. We are activating our backup scenario ASAP and will provide updates here.
- Investigating
We're currently experiencing issues with this service. We're working on it and will update you as soon as possible.