Cluster A: Email issues

Incident Report for OpenSRS

Postmortem

Incident Date: July 15, 2021

Incident Number: PR-2178

On July 15, 2021, at 6:06 AM ET, Tucows’ hosted email platform experienced a service interruption, causing email delays and login issues in Prod A.

The service interruption was due to a file system error that caused a high load on the network storage pair.

Tucows’ Engineering team increased the severity of the incident when we observed the external impact.

At 9:20 AM ET, the engineering team disabled the IMAP nodes to alleviate the high load causing login failures for a subset of users.

At 11:30 AM ET, the engineering team successfully stabilized the storage devices and began restoring services in a controlled manner.

At 2:50 PM ET, the engineering team restored all the services after stabilizing the hosted email environment.

The Tucows Engineering team has successfully upgraded the core software on the affected systems to prevent this incident from happening again.

Thank you,

Tucows Engineering Team

Posted Jul 26, 2021 - 19:32 UTC

Resolved

The engineering teams identified issues with our storage devices and have resolved the problems that caused webmail logins to fail. Our engineers will be running naturalizing jobs throughout the night to ensure we no longer experience high load. Cluster A is now in a healthy state, and we will be closing this incident.

Incident Start Time: 07-15-2021 10:06:00 UTC
Incident End Time: 07-15-2021 18:50:00 UTC
Total Duration: 8 hours, 44 minutes
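
The start and end times above correspond to the 6:06 AM ET and 2:50 PM ET times given in the postmortem. The following is a minimal sketch of that conversion and of the duration arithmetic, assuming "ET" in July means Eastern Daylight Time (UTC-4). The variable names are illustrative only and are not part of any Tucows tooling.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "ET" on July 15 is Eastern Daylight Time, i.e. UTC-4.
EDT = timezone(timedelta(hours=-4), "EDT")

# Times as stated in the postmortem (Eastern Time).
start_et = datetime(2021, 7, 15, 6, 6, tzinfo=EDT)    # 6:06 AM ET
end_et = datetime(2021, 7, 15, 14, 50, tzinfo=EDT)    # 2:50 PM ET

# Convert to UTC to match the timestamps used in the status updates.
start_utc = start_et.astimezone(timezone.utc)
end_utc = end_et.astimezone(timezone.utc)

# Total duration, split into whole hours and minutes.
duration = end_utc - start_utc
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print("Start (UTC):", start_utc.strftime("%m-%d-%Y %H:%M:%S"))
print("End (UTC):  ", end_utc.strftime("%m-%d-%Y %H:%M:%S"))
print(f"Total duration: {hours} hours, {minutes} minutes")
```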

Posted Jul 15, 2021 - 20:00 UTC

Update

Our engineering team continues to troubleshoot the issue. Some users will still be temporarily affected and may need to log back in and reauthenticate if they encounter a problem.

Posted Jul 15, 2021 - 18:32 UTC

Update

Our engineering teams are running a naturalizing process on the backend, which will disconnect some users temporarily. We expect users to be able to log back in afterward. We are continuing to monitor the state of our network devices, and we appreciate your continued patience as we work to bring services back online.

Posted Jul 15, 2021 - 17:25 UTC

Monitoring

Our engineering team continues to monitor the load on the affected storage device.

Posted Jul 15, 2021 - 16:19 UTC

Update

Our engineers have identified that during planned, routine maintenance at 7:32 UTC this morning, a network system error caused high load on one of our storage devices. Our engineers successfully alleviated this load and are bringing services online in a controlled manner to resolve the issue.

Posted Jul 15, 2021 - 15:19 UTC

Identified

The engineering team has identified the issue within the Hosted Email system and is currently working to resolve it.

Posted Jul 15, 2021 - 12:22 UTC

Update

Our team is still investigating this issue, and we will provide updates when they are available.

Thank you for being patient while we continue to look into this issue for you.

Posted Jul 15, 2021 - 11:39 UTC

Update

We are aware that this issue is still ongoing and are continuing to investigate. We will post updates when they are available.

Posted Jul 15, 2021 - 10:41 UTC

Investigating

We are currently experiencing a connection issue for a small set of email accounts on Cluster A. We are investigating and will provide updates as they become available. Thank you for your patience.

Posted Jul 15, 2021 - 08:43 UTC

This incident affected: Hosted Email (Cluster A).