Instability in Appfarm Customer Solutions

Incident Report for Appfarm AS

Postmortem

Incident Summary

On March 8, 2025, some customers began experiencing issues with our platform following an unannounced change in Google Kubernetes Engine (GKE). Some solutions in non-production environments became unavailable, and scheduled tasks in production were disrupted. The change was a breaking one in Google's GKE infrastructure that prevented pods from authenticating correctly during startup, leaving them in continuous crash loops.

Impact

  • The incident was classified as critical due to the potential for widespread service disruption across customer workloads.
  • Some solutions experienced downtime in non-production environments.
  • Scheduled tasks in production environments were disrupted.

Root Cause

The root cause was traced to an unannounced update to GKE nodes in the europe-west1-c zone. The updated nodes no longer accepted the previous authentication header format, so pods failed to authenticate when accessing the Google Cloud metadata service during startup and entered an unrecoverable crash loop. In effect, the rollout was a breaking change: the new nodes supported only the latest version of Google's Node.js authentication library.
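For context, pods on GKE obtain credentials by requesting short-lived access tokens from the Google Cloud metadata server; when that exchange fails, a pod cannot start. A minimal sketch of the request involved (the endpoint and header are the documented GCE metadata API; the request is constructed here but not sent):

```python
import urllib.request

# Documented GCE metadata endpoint from which workloads fetch
# service account access tokens at startup.
METADATA_TOKEN_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)

def build_token_request() -> urllib.request.Request:
    # The metadata server rejects any request lacking this exact header,
    # which is why a header-format mismatch surfaces as a hard
    # authentication failure rather than a degraded response.
    return urllib.request.Request(
        METADATA_TOKEN_URL,
        headers={"Metadata-Flavor": "Google"},
    )

req = build_token_request()
print(req.full_url)
```

When this request fails, client libraries typically retry and then abort startup, which is what produced the crash loops described above.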

Detection and Resolution

On Saturday morning, our monitoring systems detected an elevated rate of pod restarts. Upon investigation, we traced the problem to the europe-west1-c zone. To mitigate it, we reconfigured our node pools to run in the other available zones within the europe-west1 region, avoiding the affected zone.
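The zone move amounts to restricting where a node pool may schedule its nodes. A hedged sketch of the kind of command involved, assuming a regional cluster; the pool, cluster, and replacement zone names below are placeholders, not our actual configuration:

```shell
# Illustrative only: names and zones are placeholders.
# Restrict the node pool to zones other than the affected europe-west1-c.
gcloud container node-pools update default-pool \
  --cluster=my-cluster \
  --region=europe-west1 \
  --node-locations=europe-west1-b,europe-west1-d
```

Existing nodes in the excluded zone are then drained and recreated in the remaining zones.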

During the following week, the issue began occurring in the other zones as the GKE update rolled out more widely. We therefore rolled out a patch that upgraded the relevant authentication library.

Throughout the incident, we collaborated with Google Cloud Support to identify the root cause and to prevent further rollout of the update causing the issue.

Preventive Measures

  • We are upgrading the Node.js Google authentication library across all services to prevent recurrence.
  • We have improved monitoring and alerting mechanisms to enable rapid detection of similar issues in the future.
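As an illustration of the second point, a crash-loop condition like this one can be caught by alerting on container restart counts. A hypothetical Prometheus-style rule, assuming kube-state-metrics is deployed; the threshold, window, and labels are illustrative, not our production values:

```yaml
# Hypothetical alert rule; values are illustrative only.
groups:
  - name: pod-stability
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is restarting repeatedly"
```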

We sincerely apologize for any inconvenience caused by this incident. We are committed to maintaining the stability and reliability of our platform and are taking steps to prevent similar issues in the future.

Posted Mar 26, 2025 - 13:54 CET

Resolved

This incident has been resolved.
Posted Mar 25, 2025 - 08:55 CET

Monitoring

The workaround has been successfully implemented, and all services are now restored. We continue to collaborate with the Google Cloud team to identify the root cause. We are also closely monitoring service stability to ensure the issue does not recur.
Posted Mar 09, 2025 - 15:48 CET

Update

A workaround has been identified and is being implemented. We anticipate full service restoration within the next few hours. We are continuing to collaborate with Google Cloud support to determine the root cause.
Posted Mar 09, 2025 - 14:59 CET

Update

We are actively engaged with Google Cloud support to diagnose the ongoing issue. Simultaneously, we are exploring potential workarounds to mitigate impact.

Current Impact:

- Production environments are operational but running at reduced capacity.
- Some development, test, and staging environments are currently unavailable.
- Create and other core services remain unaffected.

We will provide further updates as they become available.
Posted Mar 09, 2025 - 10:02 CET

Investigating

We are currently investigating this issue.
Posted Mar 08, 2025 - 15:05 CET
This incident affected: Customer Applications.