Cluster A sending issues
Incident Report for OpenSRS
Postmortem

https://www.opensrsstatus.com/incidents/qs680w65lv82

Incident Date: October 18, 2021 Incident Number: PR-2465

On October 18, 2021, at 10:00 AM ET, Tucows’ hosted email platform experienced a service interruption that impacted outbound mail delivery for Prod A.

The service interruption was caused by a hardware failure that degraded the systems supporting Hosted Email services.

After engaging remote hands, the engineering team brought the impacted hardware back online and restored the data from backup to stabilize the hosted email environment.

At 1:49 PM ET, all outbound email queues began processing successfully after the services were remediated and restored.

Tucows is committed to continuing the hosted email migration to the new cloud to maintain a scalable and stable hosted email environment.

Tucows will implement and improve monitoring to detect hardware failures promptly and reduce mean time to recovery.

Thank you,
Tucows Engineering Team

Posted Oct 22, 2021 - 13:59 UTC

Resolved
We are no longer seeing errors with sending mail. Users should now be able to send mail without issue. We appreciate your patience and understanding during this outage.

Incident Start Time: 10-18-2021 14:00:00 UTC
Incident End Time: 10-18-2021 17:49:00 UTC
Total Duration: 3 hours and 49 minutes
Posted Oct 18, 2021 - 18:04 UTC
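
As a cross-check of the times above, here is a minimal sketch that recomputes the total duration from the reported UTC start and end times and converts them to Eastern time. It assumes "ET" on this date means Eastern Daylight Time (UTC-4). The helper name `parse_utc` is illustrative only and not part of any Tucows tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "ET" on October 18, 2021 is Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4))

def parse_utc(stamp: str) -> datetime:
    """Parse the report's MM-DD-YYYY HH:MM:SS timestamps as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, "%m-%d-%Y %H:%M:%S").replace(tzinfo=timezone.utc)

start = parse_utc("10-18-2021 14:00:00")
end = parse_utc("10-18-2021 17:49:00")

# Break the elapsed time into whole hours and minutes.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60
print(f"Total Duration: {hours} hours and {minutes} minutes")

# Convert both timestamps to Eastern time for comparison with the postmortem.
start_et = start.astimezone(EDT)
end_et = end.astimezone(EDT)
print(start_et.strftime("%I:%M %p"), "to", end_et.strftime("%I:%M %p"), "ET")
```

Under that assumption, the UTC window agrees with the 10:00 AM ET start and 1:49 PM ET recovery given in the postmortem.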
Update
Our engineering teams have completed syncing operations and are now moving resources onto the relevant hardware. We will provide another update within the next 30 minutes.

Thank you for your patience.
Posted Oct 18, 2021 - 17:35 UTC
Monitoring
The faulty hardware has been replaced and we are in a recovery state. All mail queues are now processing and are expected to recover within about an hour.
Posted Oct 18, 2021 - 16:31 UTC
Update
Our engineering team has identified a hardware issue as the cause and is working to resolve it.
Posted Oct 18, 2021 - 15:32 UTC
Investigating
We are experiencing a degradation in service for Hosted Email customers on Cluster A. Users may experience issues sending mail, with the error "SMTP Error (454) Authentication failed". Our engineering team has been engaged and is currently investigating the issue.
Posted Oct 18, 2021 - 14:22 UTC
This incident affected: Hosted Email (Cluster A).