Partial sandbox outage

Incident Report for Appfarm AS

Postmortem

Cause:

An automatic update from Google on the Kubernetes cluster restarted too many of the cluster's services within a short period. The restarting services demanded more resources than they had been allocated. When too many services restart simultaneously and demand excess resources, the cluster's spare capacity is exhausted and no resources remain available. Services then enter a crash/restart loop, making the solutions slow or inaccessible.
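The arithmetic behind this failure mode can be illustrated with a minimal sketch. All names and numbers below are hypothetical and chosen only for illustration; they are not measurements from the affected cluster. The sketch assumes each service uses its allocated CPU in steady state and a larger amount while starting up.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    allocated_cpu: float  # CPU reserved for the service in steady state (hypothetical)
    startup_cpu: float    # CPU the service tries to use while starting (hypothetical)


def spare_capacity(cluster_cpu, services, restarting):
    """Return the CPU left over when the services named in `restarting` start at once.

    A negative result means demand exceeds what the cluster can provide.
    """
    demand = 0.0
    for service in services:
        if service.name in restarting:
            demand += service.startup_cpu
        else:
            demand += service.allocated_cpu
    return cluster_cpu - demand


# Hypothetical cluster: 16 CPUs, 10 services allocated 1 CPU each,
# each wanting 2.5 CPUs while it starts up.
cluster_cpu = 16.0
services = [Service(f"svc-{i}", allocated_cpu=1.0, startup_cpu=2.5) for i in range(10)]

one_restart = spare_capacity(cluster_cpu, services, {"svc-0"})
all_restart = spare_capacity(cluster_cpu, services, {s.name for s in services})

print(one_restart)  # 4.5  -> a single restart fits within the spare capacity
print(all_restart)  # -9.0 -> simultaneous restarts exceed the cluster's capacity
```

In this toy model a single restart leaves room to spare, while restarting everything at once pushes demand well past capacity, which is the condition under which services begin to crash and restart repeatedly.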

Measures taken:

Future automatic updates from Google will be scheduled for more appropriate time slots so that these upgrades do not disrupt customers.
We will upgrade the configuration of the affected cluster so that updates are carried out more safely.
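The report does not state which configuration settings will change. As one illustration of the kind of safeguard that limits this failure mode, the sketch below estimates how many services could restart at the same time without demand exceeding cluster capacity. The function name, parameters, and numbers are hypothetical and assume every service has the same steady-state and startup CPU demand.

```python
def max_concurrent_restarts(cluster_cpu, n_services, allocated_cpu, startup_cpu):
    """Largest number of services that can restart at once without exceeding capacity.

    Assumes (hypothetically) that all services share the same steady-state
    allocation and the same, possibly higher, demand during startup.
    """
    # Capacity left over when every service runs at its steady-state allocation.
    headroom = cluster_cpu - n_services * allocated_cpu
    # Additional CPU a service needs while starting, beyond its allocation.
    extra_per_restart = startup_cpu - allocated_cpu

    if extra_per_restart <= 0:
        return n_services  # Startup costs nothing extra: all may restart together.
    if headroom < 0:
        return 0  # The cluster is already over capacity in steady state.
    return min(n_services, int(headroom // extra_per_restart))


# Hypothetical example: 16 CPUs, 10 services, 1 CPU allocated, 2.5 CPUs at startup.
print(max_concurrent_restarts(16.0, 10, 1.0, 2.5))  # 4
```

With these example numbers, restarting four services at a time keeps demand at or below capacity, whereas a fifth simultaneous restart would exceed it.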

Posted Jun 28, 2023 - 14:42 CEST

Resolved

On Thursday, June 22nd, an automatic upgrade of a customer cluster hosting sandbox environments caused downtime for the solutions hosted on that Kubernetes cluster.

Posted Jun 22, 2023 - 09:00 CEST