(All times UTC+1)
Lead-up
On Tuesday, May 23rd at 9:21 AM, a series of database restarts occurred, taking down all nodes in the system.
Fault
The database restarts caused a connectivity problem between the nodes and the database: nodes could no longer read or write data. The underlying bug has been identified and backlogged as critical.
Whenever a node tried to read from or write to the database, the operation hung. The HTTP request driving that read or write then hung as well, and request queues built up on the reverse proxy infrastructure.
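For illustration, the sketch below shows the kind of safeguard the corrective actions aim at: every database call carries a deadline, so a hung connection fails fast instead of holding the HTTP request open and backing up the proxy queues. This is a minimal sketch, not our production code; the Go code, the Postgres driver, the transactions table, and the /read endpoint are all assumptions made for the example.

    package main

    import (
        "context"
        "database/sql"
        "log"
        "net/http"
        "time"

        _ "github.com/lib/pq" // illustrative driver choice
    )

    var db *sql.DB

    func readHandler(w http.ResponseWriter, r *http.Request) {
        // Cut the database read off after 2 seconds; a stuck call then fails
        // fast instead of piling requests up on the reverse proxy.
        ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
        defer cancel()

        var payload string
        err := db.QueryRowContext(ctx,
            "SELECT payload FROM transactions WHERE id = $1",
            r.URL.Query().Get("id")).Scan(&payload)
        if err != nil {
            // Timeouts and connection errors land here; a 503 lets the proxy
            // shed load rather than queue the request indefinitely.
            http.Error(w, "database unavailable", http.StatusServiceUnavailable)
            return
        }
        w.Write([]byte(payload))
    }

    func main() {
        var err error
        db, err = sql.Open("postgres", "postgres://app:secret@localhost/app?sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        http.HandleFunc("/read", readHandler)
        log.Fatal(http.ListenAndServe(":8080", nil))
    }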
Impact
The impact was large and affected the whole infrastructure:
- Nodes could no longer read or write transactions and therefore missed sync events and change events
- The reverse proxy infrastructure got stuck, which broke monitoring and automatic escalation
- Because escalation failed, the on-call support engineers were not notified
Detection
May 23rd at 9:21 AM:
- The database unexpectedly restarted, causing all nodes to go down simultaneously.
May 23rd at 10:35 AM:
- A critical ticket was raised with Google Cloud to report the issue and seek assistance.
May 23rd at 11:00 AM:
- A call was initiated with the Google Cloud team to provide real-time updates and discuss the ongoing issue.
May 23rd at 11:25 AM:
- The decision was made to double the allocated resources, increasing the capacity of the affected system.
May 23rd at 12:02 PM:
- Google Cloud confirmed that the root cause of the database restarts was over-usage of the allocated resources during peak load.
May 23rd at 12:19 PM:
- After the resources were increased and the necessary adjustments were applied, all nodes were brought back online, resolving the outage.
Response
All individual support tickets were answered. A status page was opened on May 23rd at 10:09 AM.
Recovery
To resolve the problem, the allocated resources were increased and the necessary adjustments were applied.
Root cause
There were multiple root causes.
- The product does not survive flaky database connections: a hung database operation blocks the HTTP request processing engine (see the sketch after this list)
- The automated resource scaling protocol did not cover the peak-load over-usage that triggered the restarts
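For illustration, a minimal sketch of the first point, with an assumed query and backoff values rather than the product's real ones: each database attempt gets its own short deadline, transient failures are retried with exponential backoff, and the call eventually returns an error instead of hanging.

    package dbretry

    import (
        "context"
        "database/sql"
        "fmt"
        "time"
    )

    // queryWithRetry survives short connection drops: it retries a bounded
    // number of times with exponential backoff instead of hanging forever.
    func queryWithRetry(ctx context.Context, db *sql.DB, id string) (string, error) {
        backoff := 100 * time.Millisecond
        var lastErr error
        for attempt := 0; attempt < 4; attempt++ {
            // Each attempt gets its own short deadline.
            attemptCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
            var payload string
            err := db.QueryRowContext(attemptCtx,
                "SELECT payload FROM transactions WHERE id = $1", id).Scan(&payload)
            cancel()
            if err == nil {
                return payload, nil
            }
            lastErr = err
            // Wait before the next attempt, doubling the delay each time,
            // but respect cancellation of the caller's context.
            select {
            case <-time.After(backoff):
                backoff *= 2
            case <-ctx.Done():
                return "", ctx.Err()
            }
        }
        return "", fmt.Errorf("query failed after retries: %w", lastErr)
    }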
Lessons Learned
- Resource monitoring: implement robust resource monitoring and alerting so that over-usage is detected before it causes restarts (a minimal alerting sketch follows this list)
- A big portion of our customer base is not aware of the status page
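For illustration, a minimal alerting sketch; the metric source, the 80% threshold, and the paging hook are placeholders rather than values from this incident.

    package resourcewatch

    import (
        "log"
        "time"
    )

    // MetricSource and Alerter are assumed interfaces; in practice they would
    // wrap the monitoring stack (for example Cloud Monitoring) and the paging
    // provider.
    type MetricSource interface {
        DatabaseCPUUtilisation() (float64, error) // fraction of allocated CPU, 0.0-1.0
    }

    type Alerter interface {
        Page(msg string) error
    }

    // Watch polls database resource usage and pages well before the
    // allocation is exhausted, leaving time to scale up before a restart.
    func Watch(src MetricSource, alert Alerter) {
        ticker := time.NewTicker(time.Minute)
        defer ticker.Stop()
        for range ticker.C {
            cpu, err := src.DatabaseCPUUtilisation()
            if err != nil {
                log.Printf("metric read failed: %v", err)
                continue
            }
            if cpu > 0.8 {
                if err := alert.Page("database CPU above 80% of allocated resources"); err != nil {
                    log.Printf("paging failed: %v", err)
                }
            }
        }
    }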
Corrective Actions
- Fix the product bug and ensure that failing database operations cannot clog up the HTTP request processing engine
- Make the monitoring paths more robust and start monitoring the monitoring services themselves (a watchdog sketch follows this list)
- Adjust the automated resource scaling protocol to cover peak-load over-usage like the one behind this outage
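For illustration, a minimal watchdog sketch for the second action: the health URL and the paging hook are placeholders, and the watchdog is assumed to run outside the affected infrastructure so it stays reachable when the reverse proxy is not. It probes the monitoring service from the outside and pages when the probes keep failing.

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    // Placeholder URL; the watchdog should run outside the affected
    // infrastructure so it keeps working when the reverse proxy does not.
    const monitoringHealthURL = "https://monitoring.example.internal/healthz"

    // pageOnCall stands in for an escalation path that does not depend on the
    // monitoring stack being watched.
    func pageOnCall(msg string) {
        log.Printf("PAGE: %s", msg)
    }

    func main() {
        client := &http.Client{Timeout: 5 * time.Second}
        failures := 0
        for {
            resp, err := client.Get(monitoringHealthURL)
            if err != nil || resp.StatusCode != http.StatusOK {
                failures++
            } else {
                failures = 0
            }
            if resp != nil {
                resp.Body.Close()
            }
            // Two consecutive failed probes: the path that should have
            // escalated this incident is itself down, so page directly.
            if failures >= 2 {
                pageOnCall("monitoring health check failing; escalation path may be down")
                failures = 0
            }
            time.Sleep(30 * time.Second)
        }
    }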
If you have any questions about this postmortem, please raise a request on our support portal (here).