On Monday, April 17 at 09:20 CEST, some Appfarm customer solutions running on a shared database cluster began experiencing availability issues. Most of the affected solutions returned to an operational state at 11:20 CEST, and full recovery of all affected solutions occurred at approximately 12:55 CEST.
We have identified the root cause and are taking steps to prevent this from happening again.
After investigating the incident, we identified the root cause as a sudden increase in workload on the database cluster, which left it unable to process incoming traffic. All database clusters in Appfarm have autoscaling enabled, but in this case the cluster could not scale up quickly enough to absorb the spike in load.
The operations team shut down some of the affected solutions to lighten the load on the database. Even though the database had fully recovered by 11:20, these solutions were not fully restored until 12:55. The delay was a cascading effect of the outage: the service that configures and deploys solutions had accumulated a large backlog of messages, which had to be worked through before the solutions could be brought back online.
To prevent a recurrence, we have identified concrete measures that will be implemented. In the short term, we will prevent an uncontrolled buildup of messages to the deployment service, which will allow for faster recovery.
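To illustrate the general idea, here is a minimal sketch of a bounded message buffer with backpressure and staleness eviction. All names here (DeployMessage, BoundedQueue, the size and age limits) are hypothetical assumptions for illustration and do not reflect Appfarm's actual deployment service.

```typescript
// Sketch: a bounded queue that rejects new messages instead of
// growing without limit, and drops stale deploy requests that
// newer state has already superseded.
interface DeployMessage {
  solutionId: string;
  enqueuedAt: number; // epoch millis
}

class BoundedQueue {
  private items: DeployMessage[] = [];

  constructor(
    private maxSize: number,  // hard cap on buffered messages
    private maxAgeMs: number, // messages older than this are dropped
  ) {}

  /** Returns false (backpressure) when the queue is full. */
  enqueue(msg: DeployMessage): boolean {
    this.evictStale();
    if (this.items.length >= this.maxSize) return false;
    this.items.push(msg);
    return true;
  }

  dequeue(): DeployMessage | undefined {
    this.evictStale();
    return this.items.shift();
  }

  /** Stale deploy requests are no longer useful; discard them. */
  private evictStale(): void {
    const cutoff = Date.now() - this.maxAgeMs;
    this.items = this.items.filter((m) => m.enqueuedAt >= cutoff);
  }
}

// Usage: a producer that sees `false` should defer or coalesce
// duplicate deploy requests rather than let a backlog pile up.
const queue = new BoundedQueue(1000, 5 * 60_000);
const accepted = queue.enqueue({ solutionId: "sol-123", enqueuedAt: Date.now() });
if (!accepted) {
  console.warn("deployment queue full; deferring request");
}
```

The key design choice is that producers receive an explicit signal when the consumer falls behind, so a slow deployment service degrades gracefully instead of accumulating an unbounded backlog that delays recovery.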
Longer term, we are working with MongoDB engineers to further optimize our database architecture for multi-tenancy, and we are creating early-warning systems for detecting and acting on load anomalies across all solutions.
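As a rough sketch of what such an early-warning system can look like, the following flags load samples that deviate far from a rolling baseline using a z-score. The window size and threshold are illustrative assumptions, not our production configuration.

```typescript
// Sketch: flag a load sample as anomalous when it sits more than
// `zThreshold` standard deviations above the recent rolling mean.
class LoadAnomalyDetector {
  private samples: number[] = [];

  constructor(
    private windowSize: number, // recent samples kept as the baseline
    private zThreshold: number, // deviations counted as anomalous
  ) {}

  /** Feed one load sample (e.g. queries/sec); returns true if anomalous. */
  observe(value: number): boolean {
    if (this.samples.length >= this.windowSize) {
      const mean =
        this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
      const variance =
        this.samples.reduce((a, b) => a + (b - mean) ** 2, 0) /
        this.samples.length;
      const std = Math.sqrt(variance);
      const isAnomaly = std > 0 && (value - mean) / std > this.zThreshold;
      this.samples.shift();
      this.samples.push(value);
      return isAnomaly;
    }
    this.samples.push(value); // still warming up the baseline
    return false;
  }
}

// Usage: alert when current load spikes far above the recent baseline,
// giving operators time to act before autoscaling falls behind.
const detector = new LoadAnomalyDetector(5, 3);
for (const qps of [100, 105, 98, 102, 101, 400]) {
  if (detector.observe(qps)) {
    console.warn(`load anomaly detected: ${qps} qps`);
  }
}
```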
We are sorry for any inconvenience this may have caused.
Sincerely,
Ole Borgersen, CTO