Cluster A Email Issue
Incident Report for OpenSRS
Postmortem

Incident Date: November 16, 2021
Incident Number: PR-2567

On November 16, 2021, at 6:14 AM ET, Tucows’ hosted email platform experienced a service interruption, causing inbound email delays and intermittently impacting webmail in Prod A.

At 3:00 PM ET, the engineering team increased spam resources to better process the growing backlog of inbound emails and restarted the AAA nodes to alleviate the high load and stabilize the hosted email environment.

The root cause is still under investigation; however, the engineering team was able to identify a malfunctioning process in the legacy Hosted Email platform that placed the Authentication nodes under high load. As a result, the Authentication nodes impacted the Spam filtering and Inbound mail nodes, forcing email to queue up and reducing email delivery volume.

The Tucows engineering team will continue investigating the root cause. In addition, Tucows has increased capacity on the Hosted Email Spam filtering and Authentication nodes to address the load concerns as a preventative measure.

Tucows is committed to completing the hosted email migration into the new cloud before the end of the year to maintain a scalable and stable hosted email service.

Thank you,

Tucows Engineering Team

Posted Nov 18, 2021 - 23:13 UTC

Resolved
The engineering team has successfully completed maintenance on the IMF nodes to stabilize services. All services have returned to operational status. We will provide a detailed postmortem once a full investigation is completed. Services were fully restored at 20:00 UTC, and monitoring continued to ensure stability.

Incident Start Time: 11-16-2021 11:14:00
Incident End Time: 11-16-2021 20:00:00
Total Duration: 8 hours and 46 minutes
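
The duration above can be cross-checked against the timestamps in this report. The following is a minimal sketch, assuming the start and end times are in UTC and that "ET" on November 16, 2021 means Eastern Standard Time (UTC-5); the names `EST` and `format_duration` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the incident start/end timestamps are UTC, and "ET" on
# November 16, 2021 is Eastern Standard Time, a fixed UTC-5 offset.
EST = timezone(timedelta(hours=-5), "EST")

start_utc = datetime(2021, 11, 16, 11, 14, tzinfo=timezone.utc)
end_utc = datetime(2021, 11, 16, 20, 0, tzinfo=timezone.utc)


def format_duration(delta):
    """Render a timedelta as whole hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours and {minutes} minutes"


duration_text = format_duration(end_utc - start_utc)
# Convert the UTC start time to Eastern time for comparison with the postmortem.
start_et_text = start_utc.astimezone(EST).strftime("%I:%M %p").lstrip("0")

print(duration_text)   # 8 hours and 46 minutes
print(start_et_text)   # 6:14 AM
```

Under these assumptions, the 11:14 start corresponds to the 6:14 AM ET start given in the postmortem, and 20:00 corresponds to 3:00 PM ET.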
Posted Nov 17, 2021 - 01:34 UTC
Update
The engineering team has performed multiple maintenance operations to further stabilize the environment.

Monitoring will continue.

Further updates to come.
Posted Nov 16, 2021 - 23:21 UTC
Update
Services are stabilized but we are continuing to monitor to ensure complete functionality.

We will provide updates as we receive them.
Posted Nov 16, 2021 - 21:15 UTC
Monitoring
A fix has been implemented by our engineering team and emails are now being received in a timely manner. We will monitor the results.
Posted Nov 16, 2021 - 20:33 UTC
Investigating
Engineering continues to investigate the cause of the high load. Backend processes are starting to stabilize, though there is still some noticeable delay. We will provide updates based on progress.
Posted Nov 16, 2021 - 18:59 UTC
Update
The engineering teams are adding resources to alleviate high load in multiple backend services, and we are continuing to work to restore full functionality. We're engaging additional teams to confirm there are no network issues.
Posted Nov 16, 2021 - 16:48 UTC
Update
The engineering operations team continues to work to bring service back up. We will continue to provide updates based on progress. Inbound delays are still expected.
Posted Nov 16, 2021 - 16:00 UTC
Update
Engineers have identified issues with login once again. We're continuing to investigate the cause and are working to restore services. Email delays are expected to continue for the time being.
Posted Nov 16, 2021 - 15:14 UTC
Update
Our engineers are continuing work to bring services back online. Some impacted users may start to see mail flow through and logins succeed. We will provide updates based on progress.
Posted Nov 16, 2021 - 14:29 UTC
Identified
Our team is still working on this issue. Customers may still see some delays with inbound emails.
Posted Nov 16, 2021 - 13:45 UTC
Investigating
We are aware of an issue with inbound and outbound emails on Cluster A.

Users may see the following error message: "4.7.1 Service unavailable - try again later"
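
For senders wondering how to treat this response: under the SMTP enhanced status code convention, a code whose first digit is 4 signals a transient failure, so mail is normally queued and retried rather than bounced. The following is a minimal sketch of that classification, not part of the Tucows platform; the function names `classify` and `should_retry` are hypothetical.

```python
import re

# Enhanced status codes have the form class.subject.detail, for example 4.7.1.
# The class digit is 2 (success), 4 (transient failure) or 5 (permanent failure).
ENHANCED_CODE = re.compile(r"\b([245])\.(\d{1,3})\.(\d{1,3})\b")

CLASS_NAMES = {"2": "success", "4": "transient", "5": "permanent"}


def classify(reply_text):
    """Return the failure class of the first enhanced status code found."""
    match = ENHANCED_CODE.search(reply_text)
    if match is None:
        return "unknown"
    return CLASS_NAMES[match.group(1)]


def should_retry(reply_text):
    """Only transient failures are worth retrying later."""
    return classify(reply_text) == "transient"


reply = "4.7.1 Service unavailable - try again later"
print(classify(reply), should_retry(reply))
```

A sending server that follows this convention would keep the message in its queue and try again later, which matches the delayed (rather than lost) inbound mail described in this incident.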

Our engineering team has been engaged and is investigating this issue at this time.
Posted Nov 16, 2021 - 11:25 UTC
This incident affected: Hosted Email (Cluster A).