visualradio.cloud - Operational
api.visualradio.cloud - Operational
live.visualradio.cloud - Operational
Core - Operational
Audio Manager - Operational
Output Player - Operational
Media Server - Operational
Notice history
Jul 2025
- Postmortem
Postmortem Cloud Incident - Tuesday, July 1, 2025
Summary
On Tuesday, July 1, 2025, at 16:45 UTC+2, VRA Cloud experienced a major service disruption due to an infrastructure outage at our cloud provider, Scaleway. The incident was classified as a Major Outage due to complete loss of access to the Cloud dashboard and Live Playout control functionality. While all running applications (Core/Output Player/Audio Manager) continued to operate normally, and internal client communication remained uninterrupted, external control from the Cloud was completely unavailable.
The root cause was a cooling system failure at our cloud provider's colocation datacenter, which resulted in the loss of an entire Availability Zone. Recovery was further complicated by an inaccessible backup component that extended the restoration timeline beyond our standard recovery procedures.
Full service restoration was achieved at 00:17 UTC+2 on July 2, 2025, representing a total outage duration of approximately 7 hours and 32 minutes.
Incident Timeline
16:45 UTC+2 - Initial Detection
Monitoring systems detected elevated error rates across infrastructure components
Users began experiencing difficulties accessing the Cloud and controlling VRA applications
Investigation initiated
17:00 UTC+2 - Escalation to P1 Incident
Activated Priority 1 incident response procedures
Cloud provider (Scaleway) communicated cooling system issues in Amsterdam datacenter
Initiated recovery procedures to alternate availability zone
Reference: https://status.scaleway.com/incidents/1vz4xfgy2gcl
17:08 UTC+2 - Initial Recovery Attempts
Services began recovering in alternate infrastructure region
Cloud provider confirmed instabilities related to extreme weather conditions
Estimated recovery time: 15 minutes
Activated secondary recovery procedure (Procedure 2) in alternate European location as contingency
17:45 UTC+2 - Recovery Complications
Cloud provider confirmed weather-related infrastructure problems and initiated preventive service shutdowns
Primary recovery procedure (Procedure 1) blocked due to outage characteristics
Proceeded with alternative recovery efforts (Procedure 2), a process that is tested bi-weekly
18:36 UTC+2 - Strategy Pivot
Abandoned Procedure 1 due to severity of Amsterdam region situation
Focused resources on Procedure 2 recovery efforts
Initiated comprehensive testing to verify full infrastructure recovery
19:35 UTC+2 - Extended Recovery Phase
Procedure 2 restoration taking longer than expected due to infrastructure complexity
Cloud provider monitoring temperature improvements for service restoration
20:00 UTC+2 - Data Integrity Issues Identified
Analysis revealed database inconsistencies in recovered infrastructure
Data integrity errors traced back to the initial outage, which prevented proper restoration
22:53 UTC+2 - Continued Recovery Efforts
Unable to complete data recovery from temperature-affected infrastructure region
Cloud provider confirmed clearance for affected region recovery
Monitoring situation for stable service restoration
00:17 UTC+2 - Full Service Restoration
Confirmed complete recovery of VRA Cloud infrastructure
All services operating normally
No server application restarts required
Users advised to refresh Cloud browser sessions for latest state
Impact Assessment
Services Affected:
VRA Cloud dashboard access
Live Playout control functionality
Visual Player content updates (potential impact)
Services Maintained:
All running applications (Core/Output Player/Audio Manager)
Internal client communication
Audio triggers and Output Playout internal operations
Duration: 7 hours 32 minutes (July 1, 16:45 - July 2, 00:17 UTC+2)
Root Cause Analysis
The incident originated from a cooling system failure at our cloud provider's colocation datacenter, resulting in the complete loss of an Availability Zone. This infrastructure failure was beyond our direct control and affected multiple services within the facility.
Recovery was significantly prolonged due to a critical gap in our backup procedures. Following infrastructure performance changes implemented in January 2025, our Procedure 2 recovery process encountered unexpected obstacles when attempting to access backup data from our High Availability storage location. The complete loss of storage access in the Amsterdam availability region prevented successful execution of our standard recovery protocols.
Response and Recovery Actions
Our incident response team immediately implemented our established failover procedures:
Parallel Recovery Approach: Executed both Procedure 1 (10-minute recovery target) and Procedure 2 (60-80 minute recovery target) simultaneously (see the sketch after this list)
Resource Allocation: Dedicated efforts to each recovery track to maximize restoration speed
Continuous Monitoring: Maintained real-time assessment of both recovery efforts and cloud provider status
Communication: Provided regular updates on recovery progress and service status
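As a rough, non-authoritative sketch of the parallel approach referenced above, the Python snippet below runs two recovery tracks concurrently and proceeds with whichever completes first. The functions procedure_1 and procedure_2 are hypothetical placeholders for the actual recovery tooling, which this postmortem does not describe in detail.

```python
# Minimal sketch of running two recovery procedures in parallel and proceeding
# with whichever completes first. procedure_1 and procedure_2 are hypothetical
# placeholders, not VRA's actual recovery tooling. Requires Python 3.9+.
from concurrent.futures import ThreadPoolExecutor, as_completed


def procedure_1() -> None:
    """Fast in-provider failover (placeholder, ~10-minute target)."""
    raise NotImplementedError("replace with the real Procedure 1 steps")


def procedure_2() -> None:
    """Restore in an alternate location (placeholder, 60-80 minute target)."""
    raise NotImplementedError("replace with the real Procedure 2 steps")


def run_parallel_recovery() -> str:
    """Return the name of the first recovery track that completes without error."""
    pool = ThreadPoolExecutor(max_workers=2)
    futures = {
        pool.submit(procedure_1): "Procedure 1",
        pool.submit(procedure_2): "Procedure 2",
    }
    failures = []
    try:
        for future in as_completed(futures):
            name = futures[future]
            if future.exception() is None:
                return name  # first successful track wins
            # A blocked track (as Procedure 1 was here) simply drops out;
            # the other track keeps running.
            failures.append((name, future.exception()))
        raise RuntimeError(f"both recovery tracks failed: {failures}")
    finally:
        # Running threads cannot be interrupted; just avoid blocking on the
        # slower track once a winner (or total failure) is known.
        pool.shutdown(wait=False, cancel_futures=True)
```

This mirrors what happened in practice: Procedure 1 became blocked and dropped out, while Procedure 2 was carried through to completion.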
Immediate Remediation
Completed Actions:
July 2, 12:00 UTC+2: Resolved critical backup procedure blocking issues
Enhanced High Availability backup data storage redundancy (see the sketch after this list)
Verified recovery procedures for complete datacenter loss scenarios
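The sketch below illustrates one way the redundancy item above could look in practice: mirroring the latest backup object into a second, geographically separate region so it remains reachable even if the primary region is lost entirely. It assumes S3-compatible object storage; the endpoints, bucket, and object key are placeholders, not VRA's actual configuration.

```python
# Hypothetical sketch of mirroring the latest backup object to a second,
# geographically separate region. Endpoints, bucket, and key are placeholders
# and assume S3-compatible object storage; this is not VRA's actual setup.
import io

import boto3

PRIMARY = boto3.client("s3", endpoint_url="https://s3.region-a.example.cloud")
SECONDARY = boto3.client("s3", endpoint_url="https://s3.region-b.example.cloud")
BUCKET = "vra-cloud-backups"           # placeholder bucket name
KEY = "latest/backup-manifest.json"    # placeholder object key


def mirror_backup(key: str = KEY) -> None:
    """Copy one backup object from the primary region to the secondary region."""
    # Server-side copy only works within a single endpoint, so stream the
    # object through this host: download from primary, upload to secondary.
    buffer = io.BytesIO()
    PRIMARY.download_fileobj(BUCKET, key, buffer)
    buffer.seek(0)
    SECONDARY.upload_fileobj(buffer, BUCKET, key)


if __name__ == "__main__":
    mirror_backup()
```

Streaming the object through the host doing the copy keeps the example provider-agnostic; a provider-native cross-region replication feature would be preferable where available.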
Long-term Improvements
Process Enhancements:
Strengthened backup redundancy across multiple geographic regions
Enhanced monitoring and alerting for backup system accessibility
Improved failover testing to include extreme failure scenarios
Testing Protocol (continuing existing efforts):
Continued bi-weekly failover scenario testing
Extended test scenarios to include complete availability zone loss
Regular validation of backup data accessibility across all storage locations
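As an illustration of the last item in the list above, a hypothetical accessibility probe might look like the following; the regions, endpoints, bucket, and key are again placeholder values under the same S3-compatible storage assumption, not VRA's real setup.

```python
# Hypothetical cross-region backup accessibility probe. Regions, endpoints,
# bucket, and key are placeholders and assume S3-compatible object storage;
# this is not VRA's actual configuration.
import sys

import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

BACKUP_LOCATIONS = [
    {"name": "region-a", "endpoint": "https://s3.region-a.example.cloud"},
    {"name": "region-b", "endpoint": "https://s3.region-b.example.cloud"},
]
BUCKET = "vra-cloud-backups"           # placeholder bucket name
KEY = "latest/backup-manifest.json"    # placeholder object key


def backup_reachable(location: dict) -> bool:
    """Return True if the latest backup manifest is reachable in this location."""
    client = boto3.client(
        "s3",
        endpoint_url=location["endpoint"],
        config=Config(connect_timeout=5, read_timeout=10, retries={"max_attempts": 1}),
    )
    try:
        client.head_object(Bucket=BUCKET, Key=KEY)
        return True
    except (BotoCoreError, ClientError) as exc:
        print(f"[{location['name']}] backup not accessible: {exc}", file=sys.stderr)
        return False


if __name__ == "__main__":
    results = {loc["name"]: backup_reachable(loc) for loc in BACKUP_LOCATIONS}
    # Exit non-zero if any location cannot serve the backup, so a scheduler
    # or monitoring job can alert before a real recovery depends on it.
    sys.exit(0 if all(results.values()) else 1)
```

Run on a schedule, a non-zero exit can feed the existing alerting so an unreachable backup location is noticed long before a recovery depends on it.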
Lessons Learned
Infrastructure Dependencies: Single points of failure in cloud provider infrastructure can significantly impact service availability
Backup Strategy: Backup accessibility must be tested under various failure scenarios, not just standard conditions
Recovery Procedures: Multiple recovery strategies should be maintained and regularly tested for different failure modes
Our Apology and Next Steps
We know this 7+ hour outage seriously disrupted your work, and we're genuinely sorry. We recognize the significant impact it had on your operations and apologize for the disruption to your Visual Radio production workflows.
We've already fixed the backup issue that made recovery take so much longer than it should have, and we're continuing our regular testing to catch problems before they become outages like this one.
If you have any questions about what happened or concerns about your setup, just reach out to us directly.
- Resolved
We've confirmed full recovery of the VRA Cloud infrastructure - all services are now operating normally. No server applications need to be restarted, but refreshing the Cloud in your browser ensures you have the latest state. We sincerely apologize for this severe outage and understand how it disrupted your Visual Radio production workflows.
We're implementing additional monitoring and failover procedures within the next 24 hours to prevent similar incidents related to our cloud provider reliability.
A detailed postmortem including root cause analysis and our prevention measures will be shared by Friday this week.
- Update
Due to the complexity of the outage, we are still unable to complete data recovery from the temperature-affected infrastructure region, which significantly slows down recovery in the alternative location.
Our cloud provider has just confirmed the green light for recovery of the affected region. We'll be monitoring the situation closely to bring the Cloud back as soon as possible in a stable manner.
- Monitoring
We're still waiting for our restore Procedure 2 to complete fully; it is taking longer than expected due to the size of the infrastructure.
In the meantime, our cloud provider is monitoring the temperature improvement and will restore services as soon as all indicators allow.
- Update
Due to the severity of the situation in the Amsterdam region, we're abandoning Procedure 1 in order to recover services as quickly as possible.
Our parallel efforts with Procedure 2 are reaching the point of a working recovery; we'll proceed with several tests to confirm full recovery of the VRA Cloud infrastructure.
Next update will follow in 60 minutes on status.visualradio.cloud.
- Update
Our cloud provider has officially confirmed the abnormal weather problems and is preventively shutting down services. We're continuously monitoring the recovery of the infrastructure in a nearby backup region. Due to the stagnation of this process, we've ramped up recovery Procedure 2, as mentioned earlier.
- Update
We are currently seeing services successfully recover in another infrastructure region. Our cloud provider has confirmed instabilities due to the severe heat (weather related). The remainder of the recovery will take approximately 15 minutes.
In parallel to the internal region infrastructure migration, we are activating a secondary recovery procedure in another Europe-based infrastructure location, to prevent any delay should Procedure 1 fail.
Updates will follow every 30 minutes on status.visualradio.cloud.
- Identified
We are currently activating P1 scenarios to recover critical parts of the infrastructure at another location as soon as possible.
Jun 2025
- Resolved
We have successfully verified the restoration of the upstream provider services. VRA services should be working as expected.
If you still experience problems accessing VRA Cloud – don't hesitate to contact support@visualradioassist.live.
- Monitoring
We have been notified of instabilities at our upstream provider for the VRA Cloud dashboard and video delivery. The upstream provider has identified the issue and is monitoring the recovery of the situation.
Access to video media and some pages of visualradio.cloud may be degraded or unavailable at the moment.
Running VRA installations are not affected and we will continue monitoring the situation.
May 2025
No notices reported this month