visualradio.cloud - Operational
api.visualradio.cloud - Operational
live.visualradio.cloud - Operational
Core - Operational
Audio Manager - Operational
Output Player - Operational
Media Server - Operational
Notice history
Jul 2025
- Postmortem
Postmortem Cloud Incident - Tuesday, July 1, 2025
Summary
On Tuesday, July 1, 2025, at 16:45 UTC+2, VRA Cloud experienced a major service disruption due to an infrastructure outage at our cloud provider, Scaleway. The incident was classified as a Major Outage due to complete loss of access to the Cloud dashboard and Live Playout control functionality. While all running applications (Core/Output Player/Audio Manager) continued to operate normally, and internal client communication remained uninterrupted, external control from the Cloud was completely unavailable.
The root cause was a cooling system failure at our cloud provider's colocation datacenter, which resulted in the loss of an entire Availability Zone. Recovery was further complicated by an inaccessible backup component that extended the restoration timeline beyond our standard recovery procedures.
Full service restoration was achieved at 00:17 UTC+2 on July 2, 2025, representing a total outage duration of approximately 7 hours and 32 minutes.
Incident Timeline
16:45 UTC+2 - Initial Detection
Monitoring systems detected elevated error rates across infrastructure components
Users began experiencing difficulties accessing the Cloud and controlling VRA applications
Investigation initiated
17:00 UTC+2 - Escalation to P1 Incident
Activated Priority 1 incident response procedures
Cloud provider (Scaleway) communicated cooling system issues in Amsterdam datacenter
Initiated recovery procedures to alternate availability zone
Reference: https://status.scaleway.com/incidents/1vz4xfgy2gcl
17:08 UTC+2 - Initial Recovery Attempts
Services began recovering in alternate infrastructure region
Cloud provider confirmed instabilities related to extreme weather conditions
Estimated recovery time: 15 minutes
Activated secondary recovery procedure (Procedure 2) in alternate European location as contingency
17:45 UTC+2 - Recovery Complications
Cloud provider confirmed weather-related infrastructure problems and initiated preventive service shutdowns
Primary recovery procedure (Procedure 1) blocked due to outage characteristics
Proceeded with alternative recovery efforts (Procedure 2), a process that is tested bi-weekly
18:36 UTC+2 - Strategy Pivot
Abandoned Procedure 1 due to severity of Amsterdam region situation
Focused resources on Procedure 2 recovery efforts
Initiated comprehensive testing to verify full infrastructure recovery
19:35 UTC+2 - Extended Recovery Phase
Procedure 2 restoration taking longer than expected due to infrastructure complexity
Cloud provider monitoring temperature improvements for service restoration
20:00 UTC+2 - Data Integrity Issues Identified
Analysis revealed database inconsistencies in recovered infrastructure
Data integrity errors traced back to the initial outage, which prevented proper restoration
22:53 UTC+2 - Continued Recovery Efforts
Unable to complete data recovery from temperature-affected infrastructure region
Cloud provider confirmed clearance for affected region recovery
Monitoring situation for stable service restoration
00:17 UTC+2 - Full Service Restoration
Confirmed complete recovery of VRA Cloud infrastructure
All services operating normally
No server application restarts required
Users advised to refresh Cloud browser sessions for latest state
Impact Assessment
Services Affected:
VRA Cloud dashboard access
Live Playout control functionality
Visual Player content updates (potential impact)
Services Maintained:
All running applications (Core/Output Player/Audio Manager)
Internal client communication
Audio triggers and Output Playout internal operations
Duration: 7 hours 32 minutes (July 1, 16:45 - July 2, 00:17 UTC+2)
Root Cause Analysis
The incident originated from a cooling system failure at our cloud provider's colocation datacenter, resulting in the complete loss of an Availability Zone. This infrastructure failure was beyond our direct control and affected multiple services within the facility.
Recovery was significantly prolonged due to a critical gap in our backup procedures. Following infrastructure performance changes implemented in January 2025, our Procedure 2 recovery process encountered unexpected obstacles when attempting to access backup data from our High Availability storage location. The complete loss of storage access in the Amsterdam availability region prevented successful execution of our standard recovery protocols.
Response and Recovery Actions
Our incident response team immediately implemented our established failover procedures:
Parallel Recovery Approach: Executed both Procedure 1 (10-minute recovery target) and Procedure 2 (60-80 minute recovery target) simultaneously (see the sketch after this list)
Resource Allocation: Dedicated efforts to each recovery track to maximize restoration speed
Continuous Monitoring: Maintained real-time assessment of both recovery efforts and cloud provider status
Communication: Provided regular updates on recovery progress and service status
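As a rough, non-authoritative sketch of the parallel approach referenced above, the Python snippet below runs two recovery tracks concurrently and proceeds with whichever completes first. The functions procedure_1 and procedure_2 are hypothetical placeholders for the actual recovery tooling, which this postmortem does not describe in detail.

```python
# Minimal sketch of running two recovery procedures in parallel and proceeding
# with whichever completes first. procedure_1 and procedure_2 are hypothetical
# placeholders, not VRA's actual recovery tooling. Requires Python 3.9+.
from concurrent.futures import ThreadPoolExecutor, as_completed


def procedure_1() -> None:
    """Fast in-provider failover (placeholder, ~10-minute target)."""
    raise NotImplementedError("replace with the real Procedure 1 steps")


def procedure_2() -> None:
    """Restore in an alternate location (placeholder, 60-80 minute target)."""
    raise NotImplementedError("replace with the real Procedure 2 steps")


def run_parallel_recovery() -> str:
    """Return the name of the first recovery track that completes without error."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(procedure_1): "Procedure 1",
        pool.submit(procedure_2): "Procedure 2",
    }
    failures = []
    try:
        for future in as_completed(futures):
            name = futures[future]
            if future.exception() is None:
                return name  # first successful track wins
            # A blocked track (as Procedure 1 was here) simply drops out;
            # the other track keeps running.
            failures.append((name, future.exception()))
        raise RuntimeError(f"both recovery tracks failed: {failures}")
    finally:
        # Running threads cannot be interrupted; just avoid blocking on the
        # slower track once a winner (or total failure) is known.
        pool.shutdown(wait=False, cancel_futures=True)
```

This mirrors what happened in practice: Procedure 1 became blocked and dropped out, while Procedure 2 was carried through to completion.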
Immediate Remediation
Completed Actions:
July 2, 12:00 UTC+2: Resolved critical backup procedure blocking issues
Enhanced High Availability backup data storage redundancy (see the sketch after this list)
Verified recovery procedures for complete datacenter loss scenarios
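The sketch below illustrates one way the redundancy item above could look in practice: mirroring the latest backup object into a second, geographically separate region so it remains reachable even if the primary region is lost entirely. It assumes S3-compatible object storage; the endpoints, bucket, and object key are placeholders, not VRA's actual configuration.

```python
# Hypothetical sketch of mirroring the latest backup object to a second,
# geographically separate region. Endpoints, bucket, and key are placeholders
# and assume S3-compatible object storage; this is not VRA's actual setup.
import io

import boto3

PRIMARY = boto3.client("s3", endpoint_url="https://s3.region-a.example.cloud")
SECONDARY = boto3.client("s3", endpoint_url="https://s3.region-b.example.cloud")
BUCKET = "vra-cloud-backups"           # placeholder bucket name
KEY = "latest/backup-manifest.json"    # placeholder object key


def mirror_backup(key: str = KEY) -> None:
    """Copy one backup object from the primary region to the secondary region."""
    # Server-side copy only works within a single endpoint, so stream the
    # object through this host: download from primary, upload to secondary.
    buffer = io.BytesIO()
    PRIMARY.download_fileobj(BUCKET, key, buffer)
    buffer.seek(0)
    SECONDARY.upload_fileobj(buffer, BUCKET, key)


if __name__ == "__main__":
    mirror_backup()
```

Streaming the object through the host doing the copy keeps the example provider-agnostic; a provider-native cross-region replication feature would be preferable where available.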
Long-term Improvements
Process Enhancements:
Strengthened backup redundancy across multiple geographic regions
Enhanced monitoring and alerting for backup system accessibility
Improved failover testing to include extreme failure scenarios
Testing Protocol (continuing existing efforts):
Continued bi-weekly failover scenario testing
Extended test scenarios to include complete availability zone loss
Regular validation of backup data accessibility across all storage locations
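As an illustration of the last item in the list above, a hypothetical accessibility probe might look like the following; the regions, endpoints, bucket, and key are again placeholder values under the same S3-compatible storage assumption, not VRA's real setup.

```python
# Hypothetical cross-region backup accessibility probe. Regions, endpoints,
# bucket, and key are placeholders and assume S3-compatible object storage;
# this is not VRA's actual configuration.
import sys

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

BACKUP_LOCATIONS = [
    {"name": "region-a", "endpoint": "https://s3.region-a.example.cloud"},
    {"name": "region-b", "endpoint": "https://s3.region-b.example.cloud"},
]
BUCKET = "vra-cloud-backups"           # placeholder bucket name
KEY = "latest/backup-manifest.json"    # placeholder object key


def backup_reachable(location: dict) -> bool:
    """Return True if the latest backup manifest is reachable in this location."""
    client = boto3.client(
        "s3",
        endpoint_url=location["endpoint"],
        config=Config(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}),
    )
    try:
        client.head_object(Bucket=BUCKET, Key=KEY)
        return True
    except (BotoCoreError, ClientError) as exc:
        print(f"[{location['name']}] backup not accessible: {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    results = {loc["name"]: backup_reachable(loc) for loc in BACKUP_LOCATIONS}
    # Exit non-zero if any location cannot serve the backup, so a scheduler
    # or monitoring job can alert before a real recovery depends on it.
    sys.exit(0 if all(results.values()) else 1)
```

Run on a schedule, a non-zero exit can feed the existing alerting so an unreachable backup location is noticed long before a recovery depends on it.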
Lessons Learned
Infrastructure Dependencies: Single points of failure in cloud provider infrastructure can significantly impact service availability
Backup Strategy: Backup accessibility must be tested under various failure scenarios, not just standard conditions
Recovery Procedures: Multiple recovery strategies should be maintained and regularly tested for different failure modes
Our Apology and Next Steps
We know this 7+ hour outage seriously disrupted your work, and we're genuinely sorry. We recognize the significant impact it had on your operations and apologize for the disruption to your Visual Radio production workflows.
We've already fixed the backup issue that made recovery take so much longer than it should have, and we're continuing our regular testing to catch problems before they become outages like this one.
If you have any questions about what happened or concerns about your setup, just reach out to us directly.
- Resolved
We've confirmed full recovery of the VRA Cloud infrastructure - all services are now operating normally. No server applications need to be restarted, but refreshing the Cloud in your browser ensures you have the latest state. We sincerely apologize for this severe outage and understand how it disrupted your Visual Radio production workflows.
We're implementing additional monitoring and failover procedures within the next 24 hours to prevent similar incidents related to our cloud provider reliability.
A detailed postmortem including root cause analysis and our prevention measures will be shared by Friday this week.
- Update
Due to the complexity of the outage, we are still unable to complete data recovery from the temperature-affected infrastructure region, which significantly slows down recovery in the alternative location.
Our cloud provider has just confirmed the green light for recovery of the affected region. We'll be monitoring the situation closely to bring the Cloud back as soon as possible in a stable manner.
- Monitoring
We're still waiting for our restore Procedure 2 to complete fully; it is taking longer than expected due to the size of the infrastructure.
In the meantime, our cloud provider is monitoring the temperature improvement and will restore services as soon as all indicators allow.
- Update
Due to the severity of the situation in the Amsterdam region, we're abandoning Procedure 1 in order to recover services as quickly as possible.
Our parallel efforts with Procedure 2 are reaching the point of a working recovery; we'll proceed with several tests to confirm full recovery of the VRA Cloud infrastructure.
Next update will follow in 60 minutes on status.visualradio.cloud.
- Update
Our cloud provider has officially confirmed the abnormal weather problems and is preventively shutting down services. We're continuously monitoring the recovery of the infrastructure in a nearby backup region. Due to the stagnation of this process, we've ramped up recovery Procedure 2, as mentioned earlier.
- Update
We are currently seeing services successfully recover in another infrastructure region. Our cloud provider has confirmed instabilities due to the severe heat (weather related). The remainder of the recovery will take approximately 15 minutes.
In parallel to the internal region infrastructure migration, we are activating a secondary recovery procedure in another Europe-based infrastructure location, to prevent any delay should Procedure 1 fail.
Updates will follow every 30 minutes on status.visualradio.cloud.
- Identified
We are currently activating P1 scenarios to recover critical parts of the infrastructure at another location as soon as possible.
Jun 2025
- Resolved
We have successfully verified the restoration of the upstream provider services. VRA services should be working as expected.
If you still experience problems accessing VRA Cloud – don't hesitate to contact support@visualradioassist.live.
- Monitoring
We have been notified of instabilities at our upstream provider for the VRA Cloud dashboard and video delivery. The upstream provider has identified the issue and is monitoring the recovery of the situation.
Access to video media and some pages of visualradio.cloud may be degraded or unavailable at the moment.
Running VRA installations are not affected and we will continue monitoring the situation.
May 2025
No notices reported this month