Visual Radio Assist - Notice history

Experiencing partially degraded performance

visualradio.cloud - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 100.0%

api.visualradio.cloud - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 99.02%

live.visualradio.cloud - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 99.02%

Core - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 100.0%

Audio Manager - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 100.0%

Output Player - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 100.0%

Media Server - Operational

Uptime: May 2025 · 100.0%, Jun 2025 · 100.0%, Jul 2025 · 100.0%
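
For reference, a monthly uptime percentage translates directly into downtime: the 99.02% reported for July on api.visualradio.cloud and live.visualradio.cloud corresponds to roughly 7.3 hours of downtime over a 31-day month, broadly in line with the roughly 7.5-hour July 1 incident described below. A minimal sketch of that conversion (illustrative only; the exact measurement window used by the status page may differ):

```python
# Convert a monthly uptime percentage into downtime.
# Illustrative only; the exact measurement window used by the status page may differ.

def downtime_minutes(uptime_percent: float, days_in_month: int = 31) -> float:
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

if __name__ == "__main__":
    minutes = downtime_minutes(99.02)  # July figure for the api/live endpoints
    print(f"{minutes:.0f} minutes, about {minutes / 60:.1f} hours")  # ~437 min, ~7.3 h
```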

Cloudflare → R2 - Operational

Cloudflare → Images - Operational

Cloudflare → Stream - Operational

Block Storage - Operational

Primary AZ → nl-ams-1 - Operational

Cloudflare → Europe → Amsterdam, Netherlands (AMS) - Operational

Third Party: Cloudflare → Cloudflare Sites and Services → Image Resizing - Operational

Third Party: Cloudflare → Cloudflare Sites and Services → Infrastructure - Operational

Secondary AZ → nl-ams-2 - Operational

Secondary AZ → nl-ams-3 - Operational

Notice history

Jul 2025

Cloud inaccessible due to problems at infrastructure provider
  • Postmortem

    Postmortem Cloud Incident - Tuesday, July 1, 2025

    Summary

    On Tuesday, July 1, 2025, at 16:45 UTC+2, VRA Cloud experienced a major service disruption due to an infrastructure outage at our cloud provider, Scaleway. The incident was classified as a Major Outage due to complete loss of access to the Cloud dashboard and Live Playout control functionality. While all running applications (Core/Output Player/Audio Manager) continued to operate normally, and internal client communication remained uninterrupted, external control from the Cloud was completely unavailable.

    The root cause was a cooling system failure at our cloud provider's colocation datacenter, which resulted in the loss of an entire Availability Zone. Recovery was further complicated by an inaccessible backup component that extended the restoration timeline beyond our standard recovery procedures.

    Full service restoration was achieved at 00:17 UTC+2 on July 2, 2025, representing a total outage duration of approximately 7 hours and 32 minutes.

    Incident Timeline

    16:45 UTC+2 - Initial Detection

    • Monitoring systems detected elevated error rates across infrastructure components

    • Users began experiencing difficulties accessing the Cloud and controlling VRA applications

    • Investigation initiated

    17:00 UTC+2 - Escalation to P1 Incident

    • Activated Priority 1 incident response procedures

    • Cloud provider (Scaleway) communicated cooling system issues in Amsterdam datacenter

    • Initiated recovery procedures to alternate availability zone

    • Reference: https://status.scaleway.com/incidents/1vz4xfgy2gcl

    17:08 UTC+2 - Initial Recovery Attempts

    • Services began recovering in alternate infrastructure region

    • Cloud provider confirmed instabilities related to extreme weather conditions

    • Estimated recovery time: 15 minutes

    • Activated secondary recovery procedure (Procedure 2) in alternate European location as contingency

    17:45 UTC+2 - Recovery Complications

    • Cloud provider confirmed weather-related infrastructure problems and initiated preventive service shutdowns

    • Primary recovery procedure (Procedure 1) blocked due to outage characteristics

    • Proceeded with alternative recovery efforts (Procedure 2), a process we test on a bi-weekly basis

    18:36 UTC+2 - Strategy Pivot

    • Abandoned Procedure 1 due to severity of Amsterdam region situation

    • Focused resources on Procedure 2 recovery efforts

    • Initiated comprehensive testing to verify full infrastructure recovery

    19:35 UTC+2 - Extended Recovery Phase

    • Procedure 2 restoration taking longer than expected due to infrastructure complexity

    • Cloud provider monitoring temperature improvements for service restoration

    20:00 UTC+2 - Data Integrity Issues Identified

    • Analysis revealed database inconsistencies in recovered infrastructure

    • Data integrity errors traced to initial outage preventing proper restoration

    22:53 UTC+2 - Continued Recovery Efforts

    • Unable to complete data recovery from temperature-affected infrastructure region

    • Cloud provider confirmed clearance for affected region recovery

    • Monitoring situation for stable service restoration

    00:17 UTC+2 - Full Service Restoration

    • Confirmed complete recovery of VRA Cloud infrastructure

    • All services operating normally

    • No server application restarts required

    • Users advised to refresh Cloud browser sessions for latest state

    Impact Assessment

    Services Affected:

    • VRA Cloud dashboard access

    • Live Playout control functionality

    • Visual Player content updates (potential impact)

    Services Maintained:

    • All running applications (Core/Output Player/Audio Manager)

    • Internal client communication

    • Audio triggers and Output Playout internal operations

    Duration: 7 hours 32 minutes (16:45 - 00:17 UTC+2)

    Root Cause Analysis

    The incident originated from a cooling system failure at our cloud provider's colocation datacenter, resulting in the complete loss of an Availability Zone. This infrastructure failure was beyond our direct control and affected multiple services within the facility.

    Recovery was significantly prolonged due to a critical gap in our backup procedures. Following infrastructure performance changes implemented in January 2025, our Procedure 2 recovery process encountered unexpected obstacles when attempting to access backup data from our High Availability storage location. The complete loss of storage access to the Amsterdam availability region prevented successful execution of our standard recovery protocols.

    Response and Recovery Actions

    Our incident response team immediately implemented our established failover procedures:

    1. Parallel Recovery Approach: Executed both Procedure 1 (10-minute recovery target) and Procedure 2 (60-80 minute recovery target) simultaneously (see the sketch after this list)

    2. Resource Allocation: Dedicated efforts to each recovery track to maximize restoration speed

    3. Continuous Monitoring: Maintained real-time assessment of both recovery efforts and cloud provider status

    4. Communication: Provided regular updates on recovery progress and service status
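
    To illustrate the parallel recovery approach in point 1, the sketch below races two recovery tracks and commits to whichever finishes first. It is a minimal illustration under assumed names: run_procedure_1 and run_procedure_2 are hypothetical placeholders, not our actual recovery tooling.

    ```python
    import asyncio

    # Minimal sketch of racing two recovery procedures in parallel.
    # run_procedure_1/run_procedure_2 are hypothetical stand-ins for the real tracks.

    async def run_procedure_1() -> str:
        await asyncio.sleep(10 * 60)    # fast path: ~10-minute recovery target
        return "procedure-1"

    async def run_procedure_2() -> str:
        await asyncio.sleep(70 * 60)    # slow path: ~60-80-minute recovery target
        return "procedure-2"

    async def recover() -> str:
        tasks = {asyncio.create_task(run_procedure_1()),
                 asyncio.create_task(run_procedure_2())}
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
        for task in pending:            # abandon the slower track once one succeeds
            task.cancel()
        return done.pop().result()

    # asyncio.run(recover())  # returns the name of whichever track completed first
    ```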

    Immediate Remediation

    Completed Actions:

    • July 2, 12:00 UTC+2: Resolved critical backup procedure blocking issues

    • Enhanced High Availability backup data storage redundancy

    • Verified recovery procedures for complete datacenter loss scenarios

    Long-term Improvements

    Process Enhancements:

    • Strengthened backup redundancy across multiple geographic regions

    • Enhanced monitoring and alerting for backup system accessibility

    • Improved failover testing to include extreme failure scenarios

    Testing Protocol (continuing existing efforts):

    • Continued bi-weekly failover scenario testing

    • Extended test scenarios to include complete availability zone loss

    • Regular validation of backup data accessibility across all storage locations (see the sketch below)
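
    As a concrete example of the last point, a minimal sketch of a backup-accessibility check is shown below. The manifest URLs are hypothetical placeholders, not our actual storage locations.

    ```python
    import urllib.request

    # Hypothetical backup manifest locations (placeholders, not real endpoints).
    BACKUP_LOCATIONS = {
        "primary-ams": "https://backups.example.com/ams/manifest.json",
        "secondary-eu": "https://backups.example.com/eu/manifest.json",
    }

    def check_backup_accessibility(timeout: float = 10.0) -> dict:
        """Return, per location, whether the backup manifest answers an HTTP HEAD."""
        results = {}
        for name, url in BACKUP_LOCATIONS.items():
            request = urllib.request.Request(url, method="HEAD")
            try:
                with urllib.request.urlopen(request, timeout=timeout) as response:
                    results[name] = response.status == 200
            except OSError:
                results[name] = False   # unreachable counts as a failed check
        return results

    if __name__ == "__main__":
        for location, ok in check_backup_accessibility().items():
            print(f"{location}: {'accessible' if ok else 'NOT accessible'}")
    ```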

    Lessons Learned

    1. Infrastructure Dependencies: Single points of failure in cloud provider infrastructure can significantly impact service availability

    2. Backup Strategy: Backup accessibility must be tested under various failure scenarios, not just standard conditions

    3. Recovery Procedures: Multiple recovery strategies should be maintained and regularly tested for different failure modes

    Our Apology and Next Steps

    We know this 7+ hour outage seriously disrupted your work. We recognize the significant impact this incident had on your operations and sincerely apologize for the disruption to your Visual Radio production workflows.

    We've already fixed the backup issue that made this take so much longer than it should have, and we're continuing our regular testing to catch problems before they become outages like this one.

    If you have any questions about what happened or concerns about your setup, just reach out to us directly.

  • Resolved

    We've confirmed full recovery of the VRA Cloud infrastructure - all services are now operating normally. No Server Applications need to be restarted, but refreshing the Cloud in your browser ensures you have the latest state. We sincerely apologize for this severe outage and understand how it disrupted your Visual Radio production workflows.

    We're implementing additional monitoring and failover procedures within the next 24 hours to prevent similar incidents related to our cloud provider reliability.

    A detailed postmortem including root cause analysis and our prevention measures will be shared by Friday this week.

  • Update

    Due to the complexity of the outage, we are still not able to complete a full data recovery from the temperature-affected infrastructure region, which significantly slows down recovery in the alternative location.

    Our cloud provider has just confirmed the green light for recovery of the affected region; we'll be monitoring the situation closely to bring the Cloud back as soon as possible in a stable manner.

  • Monitoring

    We're still waiting on the green light for the full recovery under restore Procedure 2, which is taking longer than expected due to the size of the infrastructure.

    In the meantime, our cloud provider is monitoring the temperature improvements and will recover services as soon as all indicators allow.

  • Update

    Due to the severity of the situation in the Amsterdam region, we're abandoning Procedure 1 in order to recover services as quickly as possible.

    Our parallel efforts with Procedure 2 are reaching the point of a working recovery; we'll proceed with several tests to confirm full recovery of the VRA Cloud infrastructure.

    Next update will follow in 60 minutes on status.visualradio.cloud.

  • Update

    Our cloud provider has officially confirmed the abnormal weather problems and is preventively shutting down services. We're continuously monitoring the recovery of the infrastructure in a nearby backup region. Due to the stagnation of this process, we've ramped up recovery scenario 2, as mentioned earlier.

  • Update

    We are currently seeing services successfully recover in another infrastructure region. Our cloud provider has confirmed instabilities due to the severe heat (weather related). The remainder of the recovery will take approximately 15 minutes.

    In parallel with the internal region infrastructure migration, we are activating a secondary recovery procedure in another Europe-based infrastructure location to prevent any delay should Procedure 1 fail.

    Updates will follow every 30 minutes on status.visualradio.cloud.

  • Identified

    We are currently activating P1 scenarios to recover critical parts of the infrastructure at another location as soon as possible.

Jun 2025

Issues accessing VRA Cloud & video media
  • Resolved

    We have successfully verified the restoration of the upstream provider services. VRA services should be working as expected.

    If you still experience problems accessing VRA Cloud, don't hesitate to contact support@visualradioassist.live.

  • Monitoring

    We've been notified of instabilities with our upstream provider for the VRA Cloud dashboard and video media. The upstream provider has identified the issue and is monitoring the recovery of the situation.

    Access to video media and some pages of visualradio.cloud may be degraded or unavailable at the moment.

    Running VRA installations are not affected and we will continue monitoring the situation.

May 2025

No notices reported this month
