Cluster A Email Service Issue
Incident Report for OpenSRS
Postmortem

Incident Date: October 28, 2021
Incident Number: PR-2521 

On October 28, 2021, at 11:27 PM ET, Tucows’ hosted email platform experienced a service interruption affecting POP, IMAP, and Webmail for Prod A.

The service interruption was caused by a kernel bug on the affected network storage device, which placed the system under high load.

At 11:42 PM ET, the engineering team restarted the affected system to restore service.

On October 29, 2021, at 12:25 AM ET, another service interruption was observed. It lasted 21 minutes and was caused by the same kernel bug.

At 12:46 AM ET, the engineering team performed a restart of the affected network storage devices to stabilize the hosted email environment.

Tucows is investigating the cause and developing a plan to roll out a permanent solution to the identified kernel bug.

Tucows remains committed to continuing the hosted email migration to the new cloud in order to maintain a scalable and stable hosted email environment.

Thank you,

Tucows Engineering Team

Posted Nov 03, 2021 - 08:00 UTC

Resolved
This incident has been resolved, and all services have been restored. We thank you for your patience while we worked to fix this issue.

Incident Start Time: 10-29-2021 03:27:00 UTC
Incident End Time: 10-29-2021 04:50:00 UTC
Total Duration: 1 hour, 23 minutes
Posted Oct 29, 2021 - 05:38 UTC
Monitoring
We have deployed a fix and are monitoring the situation.
Posted Oct 29, 2021 - 04:06 UTC
Investigating
We are experiencing degradation of service. Users may experience the following error:

DATABASE ERROR: CONNECTION FAILED!
Unable to connect to the database!
Please contact your server-administrator.

Our engineering team has been engaged and is working to resolve this issue.
Posted Oct 29, 2021 - 03:44 UTC
This incident affected: Hosted Email (Cluster A).