Instability in Appfarm Customer Solutions Prod Environments

Incident Report for Appfarm AS

Postmortem

Incident Summary

On September 2, 2025, several customer solutions experienced a service disruption. The issue began when a bug in our configuration handling system caused Kubernetes pods to fail to start correctly. This led to some customer solutions being unavailable and others experiencing reduced performance. The issue was resolved by manually correcting the configuration and deploying a fix.

Impact

  • Service Downtime: Several customer solutions were completely unavailable until the configuration was corrected.
  • Performance Degradation: Some solutions were unable to automatically scale to meet demand, which led to reduced performance and potential slowness for customers.

Root Cause

The core issue was a bug in the service that applies Kubernetes configurations. This bug generated and applied an invalid configuration with incorrect resource requests. As a result, the Kubernetes scheduler was unable to place new pods on the cluster, preventing them from starting and causing the service disruption.

Detection and Resolution

The problem was detected through a combination of failing pod scheduling events and direct reports from affected customers. Our team quickly identified the invalid configuration as the source of the problem. We manually corrected the configuration to immediately restore affected services. Following this manual intervention, we deployed a permanent fix to the configuration service. All services returned to normal operation.

Preventive Measures

To prevent this type of incident from happening again, we have taken the following steps:

  • Bug Fix: The specific bug in the configuration handling service has been patched and deployed.
  • Enhanced Validation: We are implementing additional safeguards and regression tests to automatically block invalid configurations from being applied to production environments in the future.
  • Improved Monitoring: We have enhanced our monitoring and alerting systems to provide earlier detection of pod scheduling failures and other related issues. This will help our team identify and respond to similar problems more quickly.

We sincerely apologize for the disruption and are confident that these measures will prevent a recurrence.

Posted Sep 04, 2025 - 09:56 CEST

Resolved

This incident has been resolved.
Posted Sep 02, 2025 - 18:27 CEST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 02, 2025 - 13:47 CEST

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 02, 2025 - 12:32 CEST

Update

We are continuing to investigate this issue.
Posted Sep 02, 2025 - 12:31 CEST

Investigating

Some customer solutions are expiriencing instability in their prod environments. We are currently investigating the issue.
Posted Sep 02, 2025 - 11:59 CEST
This incident affected: Customer Applications.