(All times UTC+1)
On Saturday Nov 13 at 10:00 PM, the central database, which hosts the data of all nodes, was upgraded to a new version. This is an automated process that runs regularly and unattended. During such an upgrade, services are failed over to another instance to guarantee continuity.
The failover to the new instance caused a connectivity problem between the nodes and the database: nodes could still read data but could no longer write it. This root cause has been identified and backlogged as a critical bug.
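For illustration, a minimal post-failover write probe like the sketch below could catch such a read-only state early. It assumes a SQL database reached through Go's database/sql (the postmortem does not name the engine), and the health_probe table is hypothetical.

```go
package sketch

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// checkWritable verifies that the instance we failed over to accepts
// writes, not just reads, by performing a write and rolling it back.
// The context deadline bounds the call, so a stuck write surfaces as
// an error instead of hanging forever.
func checkWritable(ctx context.Context, db *sql.DB) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()

	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return fmt.Errorf("begin: %w", err)
	}
	defer tx.Rollback() // never persist the probe row

	// health_probe is a hypothetical single-purpose table for this check.
	if _, err := tx.ExecContext(ctx,
		"INSERT INTO health_probe (checked_at) VALUES (now())"); err != nil {
		return fmt.Errorf("write probe failed: %w", err)
	}
	return nil
}
```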
Whenever a node tried to write to the database, the write operation got stuck. As a consequence, the HTTP request behind the write operation also got stuck, building up HTTP request queues on the reverse proxy infrastructure.
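A minimal sketch of this failure mode and one mitigation, assuming Go-style HTTP handlers in front of a SQL database (the handler, table, and query are hypothetical): without a per-request deadline, a stuck write blocks the handler indefinitely and the proxy queue grows; with one, the request fails fast instead.

```go
package sketch

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// writeHandler wraps a database write in a per-request deadline. Without
// the deadline, a stuck write blocks the handler goroutine indefinitely,
// the reverse proxy keeps the client connection open, and requests pile up.
func writeHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Bound the write so a stuck database turns into a fast 503
		// instead of a growing request queue on the proxy.
		ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
		defer cancel()

		if _, err := db.ExecContext(ctx,
			"UPDATE nodes SET synced_at = now() WHERE id = $1",
			r.URL.Query().Get("id")); err != nil {
			http.Error(w, "write unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	}
}
```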
The monitoring, which uses the same proxy infrastructure to check the health of the nodes, also got stuck, leading to a failure of the monitoring infrastructure itself.
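One way to avoid this coupling is to probe nodes out of band, bypassing the shared proxy. The sketch below assumes each node exposes a /healthz endpoint on a dedicated port (both the port and the path are hypothetical).

```go
package sketch

import (
	"fmt"
	"net/http"
	"time"
)

// probeNode checks a node directly on a dedicated health port, bypassing
// the shared reverse proxy, so the monitor stays functional even when the
// proxy is saturated with stuck requests.
func probeNode(host string) error {
	client := &http.Client{Timeout: 2 * time.Second} // strict probe timeout
	resp, err := client.Get(fmt.Sprintf("http://%s:9100/healthz", host))
	if err != nil {
		return err // unreachable node or timed-out probe
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s unhealthy: %s", host, resp.Status)
	}
	return nil
}
```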
The impact was large and affected the whole infrastructure.
Customers began reporting stuck syncs and availability problems by email from Sunday 3:00 AM, and the reports amplified during the day. It was not possible to tell from the individual incident notifications that this was an infrastructure-wide incident. Early Monday morning, support tickets started to be raised.
All individual support tickets have been answered. A status page was opened at 8:00 AM CET.
To solve the problem, all nodes needed to be restarted, which was scheduled and took around 8 hours to complete. During the restart, additional infrastructure problems were detected and resolved, leading to additional restarts.
There are multiple root causes.
If you have any questions about this postmortem, please raise a request on our support portal (here).