Services disrupted
Incident Report for idalko
Postmortem

(All times UTC+1)

Lead-up

On Saturday Nov 13, 10:00 PM, the central database, which hosts the data of all the nodes, was upgraded to a new version. This is an automated process that is performed regularly and unattended. During such an upgrade, services are failed over to another instance to guarantee continuity.

Fault

The failover to the new instance caused a connectivity problem between the nodes and the database: nodes could still read data, but could no longer write it. This root cause has been identified and backlogged as a critical bug.

Whenever a node tried to write to the database, the write operation got stuck. As a consequence, the HTTP request associated with the write operation also got stuck, building up HTTP request queues on the reverse proxy infrastructure.

The monitoring, which uses the same proxy infrastructure to check the health of the nodes, also got stuck, leading to a failure of the monitoring infrastructure itself.
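The clogging mechanism described above can be sketched as follows. This is a minimal illustration, not the actual Exalate code: it assumes a hypothetical request handler that bounds each database write with a timeout, so a stalled write returns an error instead of holding the HTTP connection open indefinitely.

```python
import concurrent.futures
import time

# Hypothetical stand-in for a database write that stalls when the
# database accepts reads but no longer completes writes (as in this
# incident). The sleep is shortened for the sketch.
def stalled_db_write(payload):
    time.sleep(1)

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_write_request(payload, timeout=0.2):
    """Fail fast instead of holding the HTTP connection open.

    Returns an HTTP-style status code: 200 when the write completes in
    time, 503 when it does not, so the reverse proxy can release the
    connection instead of queueing it behind a stalled write.
    """
    future = _pool.submit(stalled_db_write, payload)
    try:
        future.result(timeout=timeout)
        return 200
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; a running worker thread may linger
        return 503
```

With a bound like this, write failures surface as fast 5xx responses that the proxy can shed, rather than as ever-growing request queues that eventually take the monitoring path down with them.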

Impact

The impact was large and affected the whole infrastructure:

  • The nodes could no longer write transactions, missing out on sync events and change events.
  • The reverse proxy infrastructure got stuck, affecting the monitoring capabilities and the automatic escalation.
  • Because the escalation failed to work, the support engineers on call were not notified.

Detection

Customers started to report stuck syncs and availability problems from Sunday 3:00 AM onwards (by email), and these reports amplified during the day. From the individual incident notifications, it was not possible to recognize that an infrastructure-wide incident was in progress. Early Monday morning, support tickets started to be raised.

Response

All individual support tickets were answered, and a status page was opened at 8:00 AM CET.

Recovery

To solve the problem, all the nodes needed to be restarted, which was scheduled and took around 8 hours to complete. During the restart, additional infrastructure problems were detected and resolved, leading to additional restarts.

Root cause

There are multiple root causes:

  • The product was not able to survive flaky database connections.
  • The escalation path was not robust enough to ensure that the team was aware of the impact and severity of the problem.

Lessons Learned

  • Whenever there are infrastructure upgrades, the support team needs to be aware of them and validate that everything is still working correctly.
  • A big portion of our customer base is not aware of the status page.

Corrective Actions

  • Fix the product bug and ensure that failing database actions cannot clog up the HTTP request processing engine.
  • Make the monitoring paths more robust, and start monitoring the monitoring services.
  • Ensure an on-call service during infrastructure component upgrades.
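As an illustration of the "monitor the monitoring" idea, here is a minimal watchdog sketch. All names are hypothetical and the heartbeat transport (a file, a database row, a push gateway) is left abstract; the key point is that the watchdog runs outside the reverse proxy path that failed during this incident.

```python
import time

# How long the monitoring service may stay silent before the watchdog
# escalates. Illustrative value, not an actual Exalate setting.
HEARTBEAT_MAX_AGE = 120  # seconds

def monitoring_is_healthy(last_heartbeat, now=None):
    """Return True if the monitoring service checked in recently.

    last_heartbeat is the epoch timestamp the monitoring service wrote
    on its most recent successful run, via an out-of-band channel.
    """
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= HEARTBEAT_MAX_AGE

def watchdog_tick(last_heartbeat, escalate, now=None):
    """Escalate (e.g. page the on-call engineer) when monitoring is silent."""
    if not monitoring_is_healthy(last_heartbeat, now):
        escalate("monitoring heartbeat stale")
        return False
    return True
```

Because the watchdog only reads a timestamp and never traverses the proxies it is watching, a stuck proxy or monitoring service still results in a page, which is exactly the gap that left the on-call engineers unnotified here.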

If you have any questions about this postmortem, please raise a request on our support portal (here).

Posted Nov 19, 2021 - 17:24 CET

Resolved
This incident has been resolved.
Posted Nov 15, 2021 - 22:20 CET
Monitoring
All exalate nodes are back online. We will continue to monitor the situation to make sure everything is ok.
Posted Nov 15, 2021 - 20:44 CET
Update
Internal connectivity has been restored, exalate nodes are being restarted individually to take into account the new configuration.

We'll keep you posted on the progress.
Posted Nov 15, 2021 - 18:59 CET
Update
This issue is being handled at the highest priority: internal connectivity issues are degrading the functionality of a number of exalate nodes.

We'll increase our reporting on this channel to keep everyone informed
Posted Nov 15, 2021 - 16:18 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 15, 2021 - 12:53 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 15, 2021 - 09:06 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 15, 2021 - 08:21 CET
Investigating
We are currently investigating this issue.
Posted Nov 15, 2021 - 08:03 CET
This incident affected: Zendesk (Exalate Console), GitHub (Exalate for GitHub), Exalate Cloud (connect.exalate.net (mapper), Hosting platform, connect.exalate.cloud), Service Now (Exalate for ServiceNow in Exalate Cloud), Jira Cloud (Synchronisation node), and Azure DevOps (Exalate for Azure DevOps).