On Monday, April 17 at 09:20 CEST, some Appfarm customer solutions running on a shared database cluster began experiencing availability issues. Most of the affected solutions returned to an operational state at 11:20 CEST, and full recovery of all affected solutions occurred at approximately 12:55 CEST.
We have identified the root cause and are taking steps to prevent this from happening again.
After investigating the incident, we identified the root cause as a sudden increase in workload on the database cluster, which left it unable to process incoming traffic. All database clusters in Appfarm have autoscaling enabled, but in this case the cluster could not scale up quickly enough to absorb the spike in load.
The operations team shut down some of the affected solutions to lighten the load on the database. Even though the database had fully recovered by 11:20, these solutions were not fully restored until 12:55. The delay was a cascading effect of the outage: the service that configures and deploys solutions had accumulated a large backlog of messages, which had to be worked through before the solutions could be brought back online.
To prevent a recurrence, we have identified concrete measures that will be implemented. In the short term, we will prevent an uncontrolled buildup of messages to the deployment service, which will allow for faster recovery.
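To illustrate the general idea, here is a minimal sketch of a bounded message buffer with backpressure and staleness eviction. All names here (DeployMessage, BoundedQueue, the size and age limits) are hypothetical assumptions for illustration and do not reflect Appfarm's actual deployment service.

```typescript
// Sketch: a bounded queue that rejects new messages instead of
// growing without limit, and drops stale deploy requests that
// newer state has already superseded.
interface DeployMessage {
  solutionId: string;
  enqueuedAt: number; // epoch millis
}

class BoundedQueue {
  private items: DeployMessage[] = [];

  constructor(
    private maxSize: number,  // hard cap on buffered messages
    private maxAgeMs: number, // messages older than this are dropped
  ) {}

  /** Returns false (backpressure) when the queue is full. */
  enqueue(msg: DeployMessage): boolean {
    this.evictStale();
    if (this.items.length >= this.maxSize) return false;
    this.items.push(msg);
    return true;
  }

  dequeue(): DeployMessage | undefined {
    this.evictStale();
    return this.items.shift();
  }

  /** Stale deploy requests are no longer useful; discard them. */
  private evictStale(): void {
    const cutoff = Date.now() - this.maxAgeMs;
    this.items = this.items.filter((m) => m.enqueuedAt >= cutoff);
  }
}

// Usage: a producer that sees `false` should defer or coalesce
// duplicate deploy requests rather than let a backlog pile up.
const queue = new BoundedQueue(1000, 5 * 60_000);
const accepted = queue.enqueue({ solutionId: "sol-123", enqueuedAt: Date.now() });
if (!accepted) {
  console.warn("deployment queue full; deferring request");
}
```

The key design choice is that producers receive an explicit signal when the consumer falls behind, so a slow deployment service degrades gracefully instead of accumulating an unbounded backlog that delays recovery.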
Longer term, we are working with MongoDB engineers to further optimize our database architecture for multi-tenancy, and we are creating early-warning systems for detecting and acting on load anomalies across all solutions.
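As a rough sketch of what such an early-warning system can look like, the following flags load samples that deviate far from a rolling baseline using a z-score. The window size and threshold are illustrative assumptions, not our production configuration.

```typescript
// Sketch: flag a load sample as anomalous when it sits more than
// `zThreshold` standard deviations above the recent rolling mean.
class LoadAnomalyDetector {
  private samples: number[] = [];

  constructor(
    private windowSize: number, // recent samples kept as the baseline
    private zThreshold: number, // deviations counted as anomalous
  ) {}

  /** Feed one load sample (e.g. queries/sec); returns true if anomalous. */
  observe(value: number): boolean {
    if (this.samples.length >= this.windowSize) {
      const mean =
        this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
      const variance =
        this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) /
        this.samples.length;
      const std = Math.sqrt(variance);
      const isAnomaly = std > 0 && (value - mean) / std > this.zThreshold;
      this.samples.shift();
      this.samples.push(value);
      return isAnomaly;
    }
    this.samples.push(value); // still warming up the baseline
    return false;
  }
}

// Usage: alert when current load spikes far above the recent baseline,
// giving operators time to act before autoscaling falls behind.
const detector = new LoadAnomalyDetector(5, 3);
for (const qps of [100, 105, 98, 102, 101, 400]) {
  if (detector.observe(qps)) {
    console.warn(`load anomaly detected: ${qps} qps`);
  }
}
```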
We are sorry for any inconvenience this may have caused.
Sincerely,
Ole Borgersen, CTO