On 2024-07-04 Dualog had an incident that affected mail flow to and from ships. Some ships experienced delays in their mail flow, and the mail service was marked as "Degraded Service".
Timeline (in UTC):
09:45 - Incident started
10:00 - Infrastructure team is alerted and incident is logged on status page
10:15 - Infrastructure team starts investigation
10:45 - Customers start reporting instability in services
10:57 - 1st notification sent to all customers
11:21 - 2nd notification sent to all customers. The issue is identified and the team starts working on a resolution
12:00 - 3rd notification sent to all customers (status update only, no change)
12:38 - 4th notification sent to all customers. Traffic resumes and email starts to flow to and from ships
13:28 - 5th and final notification sent to all customers. Traffic is stable. Incident closed.
Root Cause:
Our main database cluster reported sudden spikes in CPU usage to 100%, which caused new mail connections to be refused. The incident was mitigated by limiting certain database queries, after which the surrounding systems gradually recovered as the mail queue was processed.
Resolution and Next Steps:
We have engaged our main database vendor, who is investigating how we can optimize a set of database queries so that they cannot coincide in the future and create the same database load.
In the meantime, we continue to limit those queries to ensure we do not run into the same incident again.
We are also looking at implementing broader load balancing across the servers connecting to the database cluster, so that recovery from incidents like this is quicker and smoother.
Once our database expert has concluded the analysis and the database queries have been optimized, we will publish a deeper root cause analysis and report on the permanent solution.