Incident Summary
On September 2, 2025, several customer solutions experienced a service disruption. The issue began when a bug in our configuration handling system caused Kubernetes pods to fail to start correctly. This led to some customer solutions being unavailable and others experiencing reduced performance. The issue was resolved by manually correcting the configuration and deploying a fix.
Impact
Root Cause
The core issue was a bug in the service that applies Kubernetes configurations. This bug generated and applied an invalid configuration with incorrect resource requests. As a result, the Kubernetes scheduler was unable to place new pods on the cluster, preventing them from starting and causing the service disruption.
Detection and Resolution
The problem was detected through a combination of failing pod scheduling events and direct reports from affected customers. Our team quickly identified the invalid configuration as the source of the problem. We manually corrected the configuration to immediately restore affected services. Following this manual intervention, we deployed a permanent fix to the configuration service. All services returned to normal operation.
Preventive Measures
To prevent this type of incident from happening again, we have taken the following steps:
We sincerely apologize for the disruption and are confident that these measures will prevent a recurrence.