Temporary outage for Exalate nodes
Incident Report for Exalate
Postmortem

(All times CEST, UTC+2)

Lead-up

On Tuesday, May 23rd at 9:21 AM, a series of database restart events occurred, leading to the outage of all nodes in the system.

Fault

The restarts caused a connectivity problem between the nodes and the database: nodes could no longer read or write data. The underlying product issue has been identified and backlogged as a critical bug.

Whenever a node tried to read from or write to the database, the operation got stuck. As a consequence, the HTTP request associated with that read/write operation also got stuck, building up request queues on the reverse proxy infrastructure.
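
As an illustration of this failure mode (a minimal, hypothetical Java sketch, not Exalate's actual implementation; the class, table, and column names below are assumptions), a bounded timeout on database calls is one common way to keep a stalled database from tying up HTTP worker threads:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Hypothetical illustration only; this is not Exalate's actual code.
    // A bounded query timeout makes a stalled database call fail fast, so the
    // HTTP worker thread is released instead of hanging and letting request
    // queues build up on the reverse proxy.
    public class SyncEventReader {

        private static final int QUERY_TIMEOUT_SECONDS = 5;

        public int countPendingSyncEvents(Connection connection) throws SQLException {
            // Table and column names are made up for this sketch.
            String sql = "SELECT COUNT(*) FROM sync_event WHERE processed = FALSE";
            try (PreparedStatement statement = connection.prepareStatement(sql)) {
                // Standard JDBC: the driver aborts the statement if it runs longer
                // than the given number of seconds and throws
                // java.sql.SQLTimeoutException instead of blocking forever.
                statement.setQueryTimeout(QUERY_TIMEOUT_SECONDS);
                try (ResultSet resultSet = statement.executeQuery()) {
                    resultSet.next();
                    return resultSet.getInt(1);
                }
            }
        }
    }

With such a timeout in place, a failing database action turns into a fast error response instead of an ever-growing request queue.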

Impact

The impact was large and affected the whole infrastructure:

  • The nodes could no longer read or write transactions, missing sync events and change events
  • The reverse proxy infrastructure got stuck, affecting monitoring capabilities and automatic escalation
  • Because escalation failed, the on-call support engineers were not notified

Detection

May 23rd at 9:21 AM:

  • The database unexpectedly restarted, causing all nodes to go down simultaneously.

May 23rd at 10:35 AM:

  • A critical ticket was raised with Google Cloud to report the issue and seek assistance.

May 23rd at 11:00 AM:

  • A call was initiated with the Google Cloud team to provide real-time updates and discuss the ongoing issue.

May 23rd at 11:25 AM:

  • To address the situation, a decision was made to double the allocated resources, increasing the capacity of the affected system.

May 23rd at 12:02 PM:

  • Google Cloud confirmed that the root cause of the database restarts was over-usage of allocated resources during peak load.

May 23rd at 12:19 PM:

  • After increasing resources and implementing the necessary adjustments, all nodes were brought back online, resolving the outage.

Response

All individual support tickets were answered. A status page was opened on May 23rd at 10:09 AM.

Recovery

To resolve the problem, the allocated resources were increased and all necessary adjustments were made.

Root cause

There are multiple root causes.

  • The product did not survive a flaky database connection: stalled reads and writes blocked HTTP request processing.
  • The automated resource scaling protocol did not cover this peak-load over-usage scenario.

Lessons Learned

  • Resource monitoring: implement robust resource monitoring and alerting mechanisms to proactively identify potential over-usage scenarios
  • Status page awareness: a large portion of our customer base is not aware of the status page

Corrective Actions

  • Fix the product bug and ensure that failing database actions cannot clog up the HTTP request processing engine
  • Make the monitoring paths more robust, and start to monitor the monitoring services
  • Adjust the automated resource scaling protocol

If you have any questions about this postmortem, please raise a request on our support portal.

Posted May 26, 2023 - 10:52 CEST

Resolved
RCA and Postmortem to follow
Posted May 26, 2023 - 08:18 CEST
Update
During the outage, some sync events may have been missed. To recover the missed sync events, please follow these steps:

1. Create a search query that combines the existing trigger query (if available) with an update-date condition between "2023-05-23 9:21" and the time of the outage resolution, "2023-05-23 12:23" (see the example below this list).
2. Find how many issues were affected.
3. Split the results into batches of 100 (500 for EPSO customers) by bounding the query on issue key numbers or on ticket creation dates.
4. For every batch, create a disabled trigger.
5. Run Bulk Exalate on one of the triggers; as soon as the sync queue is empty, repeat this step for the other disabled triggers from step 4.
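
For example (hypothetical values; this assumes a Jira-based node whose existing trigger query is written in JQL and reads project = ABC), the filter from step 1 could look like:

    project = ABC AND updated >= "2023-05-23 09:21" AND updated <= "2023-05-23 12:23"

For trackers other than Jira, express the same update-date bounds in that tracker's query language.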
Posted May 23, 2023 - 12:37 CEST
Monitoring
The fix has been deployed and the infrastructure has been restarted. We are monitoring the infrastructure.
Posted May 23, 2023 - 12:23 CEST
Update
The database has been restarted.
Posted May 23, 2023 - 11:02 CEST
Update
A critical issue has been raised with Google Cloud.
Posted May 23, 2023 - 10:37 CEST
Identified
Exalate infrastructure experienced a DB restart. The reason for the restart is under investigation.
We are applying a fix on all Exalate instances.
All Exalate nodes will be down during this process.
Posted May 23, 2023 - 10:09 CEST
This incident affected: Exalate Cloud (connect.exalate.net (mapper), connect.exalate.cloud).