Cluster A - IMAP/POP/Webmail
Incident Report for OpenSRS
Postmortem

Incident Date: July 23, 2020
Incident Number: PR-1216

On July 23, 2020, at 7:30 pm ET, the Tucows Hosted Email platform experienced a service interruption that made IMAP, POP, and Webmail inaccessible on Cluster A.

The interruption occurred during recovery from an earlier incident (PR-1203), when the snapshot deletion process destabilized the shared storage because of corrupt snapshot data.

At 8:42 pm ET, the Engineering team identified inconsistencies in the file system and corrected the issue by bypassing the degraded area. 

Tucows will revise its system design, architecture, and services so that the Hosted Email application no longer uses snapshots.

 

Thank you,

Tucows Engineering Team

Posted Jul 28, 2020 - 15:30 UTC

Resolved
The engineering team has restored and enabled all authentication services. Users can now access email via IMAP, POP, and Webmail on Cluster A.

Incident Start Time: 07-23-2020 23:30:00 UTC
Incident End Time: 07-24-2020 00:42:00 UTC
Total Duration: 1 hour and 12 minutes
Posted Jul 24, 2020 - 01:06 UTC
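
The reported duration and the Eastern-time figures quoted in the postmortem can be cross-checked against the UTC start and end times above. The following is a minimal sketch, not part of the original report; it assumes "ET" means Eastern Daylight Time (UTC-4), which applies in July, and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Timestamps exactly as given in this report (MM-DD-YYYY HH:MM:SS, UTC).
FMT = "%m-%d-%Y %H:%M:%S"
start = datetime.strptime("07-23-2020 23:30:00", FMT).replace(tzinfo=timezone.utc)
end = datetime.strptime("07-24-2020 00:42:00", FMT).replace(tzinfo=timezone.utc)

# Total duration, split into whole hours and minutes.
duration = end - start
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

# Assumption: "ET" in the postmortem is Eastern Daylight Time (UTC-4).
EDT = timezone(timedelta(hours=-4))
start_et = start.astimezone(EDT)
end_et = end.astimezone(EDT)

print(f"Duration: {hours} h {minutes} min")
print(f"Start (ET): {start_et:%H:%M}, End (ET): {end_et:%H:%M}")
```

Under that assumption, the UTC window corresponds to the 7:30 pm and 8:42 pm ET times stated in the postmortem.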
Update
Our engineering team continues to investigate the high load. The root cause is still under investigation.
Posted Jul 24, 2020 - 00:38 UTC
Update
Users on Cluster A are unable to access email through Webmail or to fetch email via IMAP/POP. The engineering team has been engaged and is currently investigating.
Posted Jul 24, 2020 - 00:12 UTC
Investigating
Degradation of service
Posted Jul 24, 2020 - 00:05 UTC
This incident affected: Hosted Email (Cluster A, Webmail).