Incident Summary
On March 8, 2025, some customers started experiencing issues with our platform due to an underlying change in Google Kubernetes Engine (GKE). Some solutions in non-production environments became unavailable, while scheduled tasks in production were disrupted. The issue was caused by a breaking change introduced by Google in their GKE infrastructure, which prevented pods from authenticating correctly during startup, causing continuous crash loops.
Impact
Root Cause
The root cause was traced to an unannounced update in GKE nodes within the europe-west1-c
zone. These new nodes no longer supported the previous authentication header format, causing authentication failures when pods attempted to access Google Cloud metadata services. When pods failed to authenticate, they entered an unrecoverable crash loop state. This rollout introduced a breaking change that only supported the latest version of their Node.js authentication library.
On Saturday morning, our monitoring systems detected a raised level of pod restarts. Upon investigation, we traced the problem specifically to the europe-west1-c
zone. To mitigate the problem, we reconfigured our node pools to operate in other available zones within the europe-west1
region, effectively avoiding the affected zone.
During the following week, the issue started occurring in the other zones. We therefore rolled out a patch that upgraded the relevant library.
Throughout the incident, we collaborated with Google Cloud Support to identify the root cause and to prevent further rollout of the update causing the issue.
Preventive Measures:
We sincerely apologize for any inconvenience caused by this incident. We are committed to maintaining the stability and reliability of our platform and are taking steps to prevent similar issues in the future.