Services disrupted
Incident Report for idalko
Postmortem

(All times UTC+1)

Lead-up

On Saturday Nov 13, 10:00 PM, the central database, which hosts the data of all the nodes, was upgraded to a new version. This is an automated process that is performed regularly and unattended. During such an upgrade, services are failed over to another instance to guarantee continuity.

Fault

The failover to the new instance caused a connectivity problem between the nodes and the database: nodes could still read data, but could no longer write it. This root cause has been identified and backlogged as a critical bug.

Whenever a node tried to write to the database, the write operation got stuck. As a consequence, the HTTP request associated with the write operation also got stuck, building up HTTP request queues on the reverse proxy infrastructure.

The monitoring, which uses the same proxy infrastructure to check the health of the nodes, also got stuck, leading to a failure of the monitoring infrastructure itself.
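The clogging mechanism described above can be sketched as follows. This is a minimal illustration, not the actual Exalate code: it assumes a hypothetical request handler that bounds each database write with a timeout, so a stalled write returns an error instead of holding the HTTP connection open indefinitely.

```python
import concurrent.futures
import time

# Hypothetical stand-in for a database write that stalls when the
# database accepts reads but no longer completes writes (as in this
# incident). The sleep is shortened for the sketch.
def stalled_db_write(payload):
    time.sleep(1)

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def handle_write_request(payload, timeout=0.2):
    """Fail fast instead of holding the HTTP connection open.

    Returns an HTTP-style status code: 200 when the write completes in
    time, 503 when it does not, so the reverse proxy can release the
    connection instead of queueing it behind a stalled write.
    """
    future = _pool.submit(stalled_db_write, payload)
    try:
        future.result(timeout=timeout)
        return 200
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort; a running worker thread may linger
        return 503
```

With a bound like this, write failures surface as fast 5xx responses that the proxy can shed, rather than as ever-growing request queues that eventually take the monitoring path down with them.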

Impact

The impact was large and affected the whole infrastructure:

  • The nodes could no longer write transactions, missing out on sync events and change events.
  • The reverse proxy infrastructure got stuck, affecting the monitoring capabilities and the automatic escalation.
  • Because the escalation failed to work, the support engineers on call were not notified.

Detection

Customers started to report stuck syncs and availability problems from Sunday 3:00 AM onwards (by email), and these reports amplified during the day. From the individual incident notifications, it was not possible to recognize that an infrastructure-wide incident was in progress. Early Monday morning, support tickets started to be raised.

Response

All individual support tickets were answered, and a status page was opened at 8:00 AM CET.

Recovery

To solve the problem, all the nodes needed to be restarted, which was scheduled and took around 8 hours to complete. During the restart, additional infrastructure problems were detected and resolved, leading to additional restarts.

Root cause

There are multiple root causes:

  • The product was not able to survive flaky database connections.
  • The escalation path was not robust enough to ensure that the team was aware of the impact and severity of the problem.

Lessons Learned

  • Whenever there are infrastructure upgrades, the support team needs to be aware of them and validate that everything is still working correctly.
  • A big portion of our customer base is not aware of the status page.

Corrective Actions

  • Fix the product bug and ensure that failing database actions cannot clog up the HTTP request processing engine.
  • Make the monitoring paths more robust, and start monitoring the monitoring services.
  • Ensure an on-call service during infrastructure component upgrades.
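As an illustration of the "monitor the monitoring" idea, here is a minimal watchdog sketch. All names are hypothetical and the heartbeat transport (a file, a database row, a push gateway) is left abstract; the key point is that the watchdog runs outside the reverse proxy path that failed during this incident.

```python
import time

# How long the monitoring service may stay silent before the watchdog
# escalates. Illustrative value, not an actual Exalate setting.
HEARTBEAT_MAX_AGE = 120  # seconds

def monitoring_is_healthy(last_heartbeat, now=None):
    """Return True if the monitoring service checked in recently.

    last_heartbeat is the epoch timestamp the monitoring service wrote
    on its most recent successful run, via an out-of-band channel.
    """
    now = time.time() if now is None else now
    return (now - last_heartbeat) <= HEARTBEAT_MAX_AGE

def watchdog_tick(last_heartbeat, escalate, now=None):
    """Escalate (e.g. page the on-call engineer) when monitoring is silent."""
    if not monitoring_is_healthy(last_heartbeat, now):
        escalate("monitoring heartbeat stale")
        return False
    return True
```

Because the watchdog only reads a timestamp and never traverses the proxies it is watching, a stuck proxy or monitoring service still results in a page, which is exactly the gap that left the on-call engineers unnotified here.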

If you have any questions about this postmortem, please raise a request on our support portal (here).

Posted Nov 19, 2021 - 17:24 CET

Resolved
This incident has been resolved.
Posted Nov 15, 2021 - 22:20 CET
Monitoring
All exalate nodes are back online. We will continue to monitor the situation to make sure everything is ok.
Posted Nov 15, 2021 - 20:44 CET
Update
Internal connectivity has been restored, exalate nodes are being restarted individually to take into account the new configuration.

We'll keep you posted on the progress.
Posted Nov 15, 2021 - 18:59 CET
Update
This issue is being handled at the highest priority: internal connectivity issues are degrading the functionality of a number of exalate nodes.

We'll increase our reporting on this channel to keep everyone informed
Posted Nov 15, 2021 - 16:18 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 15, 2021 - 12:53 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 15, 2021 - 09:06 CET
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 15, 2021 - 08:21 CET
Investigating
We are currently investigating this issue.
Posted Nov 15, 2021 - 08:03 CET
This incident affected: Zendesk (Exalate Console), GitHub (Exalate for GitHub), Exalate Cloud (connect.exalate.net (mapper), Hosting platform, connect.exalate.cloud), Service Now (Exalate for ServiceNow in Exalate Cloud), Jira Cloud (Synchronisation node), and Azure DevOps (Exalate for Azure DevOps).