On 2024-07-04 Dualog had an incident that affected mail flow to and from ships. Some ships experienced delays in their mail flow, and the mail service was marked as "Degraded Service".
Timeline (in UTC):
09:45 - Incident started
10:00 - Infrastructure team is alerted and incident is logged on status page
10:15 - Infrastructure team starts investigation
10:45 - Customers start reporting instability in services
10:57 - 1st notification sent to all customers
11:21 - 2nd notification sent to all customers. The issue is identified and the team starts working on a resolution
12:00 - 3rd notification sent to all customers (status update only, no change)
12:38 - 4th notification sent to all customers. Traffic resumes and email starts to flow to and from ships
13:28 - 5th and final notification sent to all customers. Traffic is stable. Incident closed.
Root Cause:
Our main database cluster reported sudden spikes in CPU usage to 100%, which caused new mail connections to be refused. The incident was mitigated by limiting certain database queries, after which the surrounding systems gradually recovered as the mail queue was processed.
Resolution and Next Steps:
We have engaged our main database vendor, who is investigating how we can optimize a set of database queries so that they cannot coincide in the future and create the same database load.
In the meantime, we continue to limit those queries to ensure we do not run into the same incident again.
We are also looking at implementing broader load balancing across the servers connecting to the database cluster, so that recovery from incidents like this is quicker and smoother.
Once our database expert has concluded the analysis and the database queries have been optimized, we will publish a deeper root cause analysis and report on the permanent solution.