Cluster A - Webmail/IMAP/POP node issue
Incident Report for OpenSRS
Postmortem

Incident Date: March 23, 2020
Incident Number: PR-1022

On March 23, 2020, at 9:00 AM ET, the Tucows HostedEmail platform experienced service interruptions affecting inbound/outbound mail delivery and mailbox accessibility in prod A.

The service interruption was caused by high load on a network storage device.

The operations team performed multiple maintenance operations to minimize the impact and migrated mailboxes to a different network storage device. The migration was spread over a couple of maintenance windows because of the high volume of users residing on the affected network storage device.

On March 26, 2020, at 7:30 PM ET, the operations team restored all services and stabilized the load in the email environment.

Preventive measures: As part of the ongoing stabilization efforts in Cluster A, Tucows will continue to provision additional capacity and to distribute load to prevent further client impact.

Thank you,

Tucows Operations

Posted Apr 07, 2020 - 15:51 UTC

Resolved
This incident has been resolved.
Posted Mar 27, 2020 - 02:06 UTC
Update
We have now unblocked all users on Cluster A, so they can again access and use mail services.

There are still remaining accounts to be migrated to the new mailstore, and we will post updates on this incident to keep you informed as we work to complete the migration.

To be kept up to date, please subscribe either to the System Status page or this specific case.
Posted Mar 27, 2020 - 01:31 UTC
Update
The operations team has already re-enabled roughly 60% of the subset of users (approx. 2k) that was under maintenance and is progressing towards enabling the remaining users.
Posted Mar 26, 2020 - 23:18 UTC
Update
The affected mailstore is now enabled again, with the exception of the small subset of users (approx. 2k) placed in maintenance for additional investigation. All other users on Cluster A should be accessible without issue.
Posted Mar 26, 2020 - 20:13 UTC
Update
The previously disabled mailstores have been re-enabled. Our Operations department has narrowed down the potential issue in the last affected mailstore to a subset of users. Currently, we are bringing the last affected mailstore online with those users disabled so we can monitor performance.

Updates will be provided when possible
Posted Mar 26, 2020 - 19:44 UTC
Update
While re-enabling the affected mailstore, we discovered an issue with it. To mitigate this, our Operations department will be disabling some other mailstores. They will be brought back up one at a time so we can monitor performance. Approx. 70k users will be impacted.

Updates will be provided as soon as possible
Posted Mar 26, 2020 - 19:01 UTC
Update
Since our last update, we have re-enabled all mailstores in our system except one. Some users on the affected mailstore will be offline until we have completed the migration. We will provide an update as soon as the migration is done.
Posted Mar 26, 2020 - 18:33 UTC
Update
Most of the affected mailstores have been re-enabled, while the operations team continues to investigate one more to identify and rectify the issue on it.

Updates will be provided as soon as possible
Posted Mar 26, 2020 - 12:05 UTC
Update
After re-enabling our mailstores to resolve this ongoing incident, we encountered a different issue and needed to take the mailstores offline again in order to resolve it. We will likely be bringing them back up shortly once we have completed testing. The affected users will be unable to log in to Webmail, IMAP, or POP until the mailstores have been re-established.
Posted Mar 26, 2020 - 11:10 UTC
Update
The mailstores have been enabled with 12521 users migrated. The CR and migration are stopped for today. At this time service is fully online, and we will be monitoring for any impact during business hours EST, when performance is currently expected to be better than on previous days.
Posted Mar 26, 2020 - 10:00 UTC
Update
The emergency maintenance on Cluster A will be targeted at specific users. Those targeted users will be unable to log in to their accounts via POP/IMAP/Webmail or an email client during the maintenance window. This will affect approx. 6.6% of users on Cluster A.

Start time: Mar 25, 2020 at 10:00 PM UTC

End time: Mar 26, 2020 at 10:00 AM UTC
Posted Mar 25, 2020 - 20:58 UTC
Update
We will be performing emergency maintenance for targeted users. Cluster A users will be unable to log in to their accounts via POP/IMAP/Webmail or an email client during the maintenance window.

Start time: Mar 25, 2020 at 10:00 PM UTC

End time: Mar 26, 2020 at 10:00 AM UTC
Posted Mar 25, 2020 - 19:54 UTC
Update
The operations team will be rotating mail flow by enabling and disabling different smtpin nodes in turn to ensure mail is delivered to all users.

Updates will be provided as soon as possible
Posted Mar 25, 2020 - 17:25 UTC
Update
Our Operations team has enabled IMAP/POP/Webmail. Email deliveries will resume shortly.

Updates will be provided as soon as possible
Posted Mar 25, 2020 - 15:18 UTC
Update
We will again be stopping access for 70k users and restoring it over the next hour.

Updates will be provided as soon as possible
Posted Mar 25, 2020 - 14:24 UTC
Update
We are seeing high load again. Users may notice slowness or intermittent issues when logging in.

Updates will be provided as soon as possible
Posted Mar 25, 2020 - 14:03 UTC
Update
Our operations team has ended the maintenance on Cluster A by enabling all of the mailstores.

A small number of accounts will still not have webmail or email client access as we prepare for another batch migration to be done during a lull period.
Posted Mar 25, 2020 - 07:35 UTC
Update
The maintenance is nearing completion. Currently 4 of the 5 mailstores have been enabled again and we are awaiting storage load to drop before the 5th mailstore is enabled.

At this time, only users on the still-disabled mailstore are expected to be completely unable to use hosted email. Users on the other 4 mailstores, which have been enabled, will still see degraded performance.

We will continue to post updates as they become available. We thank you for your patience.
Posted Mar 25, 2020 - 07:06 UTC
Update
Our operations team is working on an emergency migration process to help restore access.
A handful of users will be unable to log in to their accounts via webmail or their email clients.
Posted Mar 24, 2020 - 23:56 UTC
Update
The operations team is still observing a high load on the NAS. They are monitoring it to ensure it stabilizes before they resume email deliveries.

Updates will be provided as soon as possible
Posted Mar 24, 2020 - 20:26 UTC
Update
Our operations team has turned on all 5 mailstores and is currently observing the performance. Once the NAS stabilizes, mail deliveries will be resumed one mailstore at a time.

Updates will be provided as soon as possible
Posted Mar 24, 2020 - 18:55 UTC
Update
Our operations team has turned on 4 out of 5 mailstores and is currently observing the performance of the NAS. We have around 56k users online. Once the load on the NAS settles down, the remaining mailstore will be turned on as well.
Posted Mar 24, 2020 - 17:36 UTC
Update
The operations team has turned off all mailstores on the affected NAS and disabled mail services for ~70k users to stabilize the network storage device. The mailstore migration processes are still running. Once they finish successfully, the operations team will bring mailstores online one at a time and observe the performance of the NAS.

Updates will be provided as soon as possible
Posted Mar 24, 2020 - 15:36 UTC
Update
The problem has not been fully resolved after last night's maintenance. The operations team is taking a subset of users offline to stabilize the hardware. The plan is to take down around 15k users out of the 70k at a time until we notice improvement. Users on that mailstore, including Tucows users, may experience problems accessing their mailboxes. For now, email is still being delivered to users' inboxes, even if they cannot log in at this time.
Posted Mar 24, 2020 - 13:37 UTC
Update
Our Ops dept has stopped the migration for now. 6396 users have been migrated so far, and Ops is planning the next step for today. Webmail, IMAP, and POP are degraded.
Posted Mar 24, 2020 - 12:54 UTC
Update
The migration appears to be progressing well. A further estimate on the completion time has not yet been made. We will update again once we have this information.
Posted Mar 24, 2020 - 09:20 UTC
Update
The migration continues. So far 5100 users have been migrated during the maintenance, and the mailstores will be enabled again in approx. 1 hour. We will continue to provide updates as they become available.
Posted Mar 24, 2020 - 06:01 UTC
Update
The migration appears to be progressing well. A further estimate on the completion time has not yet been made. We will update again once we have this information.
Posted Mar 24, 2020 - 03:26 UTC
Update
We are currently performing emergency maintenance on the cluster to expedite the migration of the affected users.
During the maintenance, users may be unable to access our email services. We will post updates as we receive them.
Posted Mar 24, 2020 - 01:37 UTC
Update
Our operations team is still working on the mailstore migration. We will provide updates as we receive them.
Posted Mar 23, 2020 - 23:23 UTC
Update
Our operations team is diligently working on migrating the affected users to newly created mailstores.
We appreciate your continued patience.
Posted Mar 23, 2020 - 21:32 UTC
Update
We are currently performing emergency maintenance to bring service back online.
Posted Mar 23, 2020 - 20:46 UTC
Update
There is no change in the status of the service impact; however, the failover is almost complete.

Some POP3 users may also experience some trouble connecting.

The next update will be within the next 60 minutes.
Posted Mar 23, 2020 - 19:40 UTC
Update
Our investigation continues into this issue.

Webmail is still down and IMAP is still under heavy load.

Next Update: Within 30 minutes
Posted Mar 23, 2020 - 18:08 UTC
Update
Our Ops dept is still investigating the issue. Webmail is still down and IMAP is still under heavy load.

Next update in 30 minutes.
Posted Mar 23, 2020 - 17:22 UTC
Update
We are continuing to investigate this issue. Webmail is still down completely and many users are still unable to log in at this time. Currently, our Ops dept is taking down mailstores one at a time to try to mitigate the overall impact. At this time, 5 mailstores have been taken down, and inbound SMTP delivery for these users is being queued locally for later delivery.

Next update within 30 mins
Posted Mar 23, 2020 - 16:38 UTC
Update
Webmail is now down completely and many users are still unable to log in at this time. We have escalated this issue and our operations team is still working to mitigate the impact.

Updates will be provided as soon as possible
Posted Mar 23, 2020 - 15:53 UTC
Identified
Our operations team has identified a disk issue as the main cause of this problem and is currently working to mitigate the impact.

Client Impact: Some users will not be able to log in at this time.

Updates will be provided as soon as possible
Posted Mar 23, 2020 - 15:18 UTC
Investigating
We are currently experiencing an issue where some of our IMAP nodes have maxed out, which can result in some users experiencing connection issues.

Client Impact: Some users will not be able to log in at this time.

Updates will be provided as soon as possible
Posted Mar 23, 2020 - 14:52 UTC
This incident affected: Hosted Email (Cluster A, Webmail).