Cluster B - Webmail / IMAP / POP
Incident Report for OpenSRS
Postmortem

Incident Dates: August 2 and August 5, 2019

Tucows' email platform, prod B, experienced a sequence of isolated incidents between August 2 and 8, 2019. These incidents caused service unavailability on the email platform in the prod B datacenter.

On August 2, 2019, at 08:51, a slow disk and RAID controller card on one of the paired storage devices (d-nfs07b) failed; as designed, the system failed over to its secondary standby storage. Remote hands replaced the failed RAID controller card on August 3, 2019, at 10:00.

Tucows Engineers continued migrating mail accounts over the weekend to prevent client impact.

On August 5, 2019, a storage device (d-nfs07b) experienced latency issues that affected around 2% of prod B customers. The degradation of d-nfs07b exhausted the resources on its paired replica, d-nfs07a. Due to a software bug, it was difficult to identify the bad disk and fail over from it following our standard operating procedures.

To stabilize the environment, Tucows' engineers started to migrate mailboxes onto other available clusters to reduce load and improve customers' experience. However, the degraded performance of the cluster (d-nfs A and B) hindered the migration efforts.

On August 7, 2019, at 16:30, Tucows engineers released select held messages from the queue, and they were delivered within 20 minutes. All messages were released at 20:00, after the Tucows abuse team completed its scan of the mail queues.

On August 8, 2019, the continued migration of mailboxes allowed us to stabilize the environment, reduce the load on the affected cluster, and restore the customer experience.

The Tucows team continued the stabilization work over the weekend, which caused minor service interruptions while enabling services and further migrating mailboxes off the affected cluster.

Preventive measures:

  • Tucows to improve external communication
  • Tucows is committed to migrating the remaining mail stores onto other available clusters without client impact within the next 4 weeks.
  • Tucows is provisioning additional hardware in the prod B datacenter to distribute the user base further and lessen the impact of any similar future incidents.
  • Tucows is working with the vendor to resolve the identified software bug
  • Tucows is introducing additional processes, automation and monitoring to identify the failed disk and improve time to recover.

Furthermore, Tucows has been planning the migration of the prod B email platform, which runs in the Ashburn datacenter, to a new datacenter. The plan includes:

  • Introducing new technologies to improve high availability and scalability.
  • Migrating Prod B customers to new cloud infrastructure that provides us with redundant and resilient storage.
Posted Aug 13, 2019 - 20:01 UTC

Resolved
This incident has been resolved.
Posted Aug 08, 2019 - 18:29 UTC
Monitoring
We are pleased to inform you that all email services are back to normal on Cluster B. We will continue to monitor the services. Thank you very much for your patience.
Posted Aug 08, 2019 - 15:23 UTC
Update
This maintenance period has now completed, and we can confirm that another portion of mailboxes is now performing as expected. We are assessing the mailboxes that are still experiencing issues sending and receiving email or an unresponsive webmail interface.

As we've put mail service for these accounts back online, users should expect a mail delay for a time as the queued incoming emails are delivered to mailboxes on Cluster B.

Our operations teams are actively working to resolve the situation as best we can. We will provide additional updates as they become available.
Posted Aug 08, 2019 - 09:07 UTC
Update
The maintenance period has started. During this time, a small subset of customers on Cluster B will not be able to retrieve their email. Customers may receive errors for both Webmail and POP/IMAP.

Maintenance Start Time: 08-08-2019 at 02:00:00 UTC
Maintenance End Time: 08-08-2019 at 09:00:00 UTC

During the maintenance period, a small subset of users will not be able to retrieve their email. Emails that are sent to these users will be held in our system and will not appear in their mailboxes until we have completed the maintenance.

We will update the status once this has completed.
Posted Aug 08, 2019 - 02:07 UTC
Update
Thank you for your continued patience in this matter. We have scheduled an emergency maintenance period for 08-08-2019, beginning at 02:00:00 UTC. This will help alleviate the ongoing IMAP/POP connection timeouts and Webmail interface loading issues.

Maintenance Start Time: 08-08-2019 at 02:00:00 UTC
Maintenance End Time: 08-08-2019 at 09:00:00 UTC

During the maintenance period, a small subset of users will not be able to retrieve their email. Emails that are sent to these users will be held in our system and will not appear in their mailboxes until we have completed the maintenance.

Users may see errors as seen below:

Webmail: A page that continuously loads
IMAP/POP: Connection time out errors

We apologize for any inconvenience this may cause.
Posted Aug 07, 2019 - 23:32 UTC
Update
SMTP services should be operational.

Some users accessing mail storage via:
IMAP/POP - will experience delays fetching mail
Webmail - may experience infinite load screens or error codes and messages of varying types

Our operations team continues to work on system changes that should improve mail services over the affected protocols and interface.
Posted Aug 07, 2019 - 19:19 UTC
Update
We are noticing email sending issues for some customers on Cluster B.

Users may come across "SMTP error: Connection to server failed" messages while attempting to send emails.
Our Team is actively investigating and we will provide more updates as they are available.
Posted Aug 07, 2019 - 15:54 UTC
Update
The Cluster B issue is continuing for some accounts, with the following effects:

Webmail: A page that continuously loads, or error messages and codes of varying type
IMAP/POP: Connection time out errors

There are fewer accounts experiencing this issue as a result of the scheduled maintenance. As more information is provided we will post the next update.

Thank you for your continued patience.
Posted Aug 07, 2019 - 11:10 UTC
Update
The maintenance period has now completed, and we are assessing the mailboxes that may still experience issues retrieving email and using the webmail interface. There may be a mail delay for a time as the queued incoming emails are delivered to mailboxes on Cluster B.

Our operations teams are actively working to resolve the situation as best we can. We will provide additional updates as they become available.
Posted Aug 07, 2019 - 09:14 UTC
Update
The maintenance period has started. During this time, a small subset of customers on Cluster B will not be able to retrieve their email. Customers may receive errors for both Webmail and POP/IMAP.

Webmail: A page that continuously loads
IMAP/POP: Connection time out errors

Emails that are sent to these users will be held in our system and not seen within their mailbox until we have completed the maintenance. We will provide another status update once this period has completed.

Maintenance Start Time: 08-07-2019 at 02:00:00 UTC
Maintenance End Time: 08-07-2019 at 09:00:00 UTC
Posted Aug 07, 2019 - 02:03 UTC
Update
Our operations team will be performing emergency maintenance tonight to assist in the resolution of the on-going Cluster B inbound mail delays and IMAP/Webmail/POP Connection timeout errors.

The start time and end times are below:
Maintenance Start Time: 08-07-2019 at 02:00:00 UTC
Maintenance End Time: 08-07-2019 at 09:00:00 UTC

A small subset of users on Cluster B will not be able to retrieve their email during the maintenance window. They may see errors such as:

Webmail: A page that continuously loads
IMAP/POP: Connection time out errors

Emails that are sent to these users will be held in our system and not seen within their mailbox until we have completed the maintenance.
We apologize for any inconvenience this may cause. We will update once the maintenance has completed.
Posted Aug 06, 2019 - 21:46 UTC
Update
The impact has not changed since the last status page posting. Currently, 5% of users on Cluster B will face IMAP/POP "Connection Timeout" errors, while webmail users will experience sluggishness and continuous loading upon login.

Our operations teams are actively working to resolve the situation as best we can. We will provide additional updates as they become available.
Posted Aug 06, 2019 - 20:18 UTC
Update
Thank you for your continued patience on this matter. OpenSRS experienced two separate degradations. These incidents were related to two different storage devices within our infrastructure, with the first starting on August 2nd, 2019 at 13:44 UTC and the second at approximately 12:00 UTC on Monday.

Our systems have recovered from the first incident, which was related to storage device degradation, and we are working through the second, which is affecting inbound mail delivery.

Currently, 5% of users on Cluster B are affected by issues such as delayed inbound mail delivery, meaning customers may not see emails in their mailboxes for a period of time, along with intermittent webmail sluggishness and timeouts.

To help alleviate the on-going problem, we will be performing emergency maintenance tonight and we will be providing additional details later today.
Posted Aug 06, 2019 - 18:23 UTC
Update
We have completed our emergency maintenance window:
Maintenance Start Time: 08-06-2019 at 04:00 UTC
Maintenance End Time: 08-06-2019 at 08:00 UTC
This maintenance completed without issue.
Emails destined to each user that were queued in our system will now be delivered to each mailbox as required.
We will continue to monitor before considering this resolved.
Posted Aug 06, 2019 - 08:57 UTC
Update
In an effort to alleviate the on-going situation with Cluster B, we have scheduled emergency maintenance on 2019-08-06. This will take place at 04:00 UTC for a duration of 4 hours.
During the maintenance window, users will not be able to retrieve email and will see intermittent connections to their mailbox. Emails destined to each user will be queued in our system and will only be delivered once the maintenance window has completed.

We apologize for any inconvenience this may cause, and will report back once the maintenance window has completed.
Posted Aug 06, 2019 - 03:53 UTC
Update
An issue with one of our storage devices was first reported on August 2nd, 2019 13:44 UTC which resulted in email accessibility issues via IMAP/POP3, and by extension webmail, for end-users.

Customers will notice a ~45-minute delay for inbound mail delivery (Cluster B), in addition to the small subset of users facing intermittent webmail timeout errors.

As additional information is received we will post an update. We are working hard to reduce the impact on those affected.

Thank you for your continued patience.
Posted Aug 05, 2019 - 20:43 UTC
Update
Thank you for your continued patience in this matter. Customers may notice inbound mail delays on Cluster B, and a small subset of customers may see degraded performance on the IMAP/POP and Webmail interfaces. Our Operations team continues to work on this matter.

Once we have additional details around this matter we will report back.

Next Update: 20:00 UTC
Posted Aug 05, 2019 - 14:38 UTC
Update
Our team is still working on resolving this issue.

We will continue to provide further updates as they become available.
Thank you for your patience.
Posted Aug 05, 2019 - 02:29 UTC
Update
Our Operations team is continuing to work on a resolution for our cluster issues.

We will provide more information as it becomes available.
Posted Aug 04, 2019 - 17:12 UTC
Update
We are currently still working on resolving this matter.

We will continue to work on the issue and report back as soon as we get any new updates.

Next Update: Aug 4th, 2019 | 5 PM UTC
Posted Aug 03, 2019 - 13:04 UTC
Update
Our team is still working on resolving this issue.

We will continue to provide further updates as they become available.
Thank you for your patience.
Posted Aug 03, 2019 - 09:03 UTC
Update
Our team is still working on resolving this issue.

We will continue to provide further updates as they become available.
Thank you for your patience.
Posted Aug 03, 2019 - 04:56 UTC
Update
Our team is still working on resolving this issue.

We will continue to provide further updates as they become available.
Thank you for your patience.
Posted Aug 03, 2019 - 01:48 UTC
Update
As we continue to work on the problem, customers will see inbound/outbound mail delays along with intermittent Webmail/POP/IMAP connection problems.

As more information is provided we will post the next update.
Posted Aug 02, 2019 - 18:58 UTC
Update
Our operations team is continuing to work on a resolution. We have alleviated the issue for some of the affected users on Cluster B.
We will continue to work on the issue and report back as soon as we get any new updates.
Posted Aug 02, 2019 - 15:07 UTC
Identified
The affected services are Webmail, POP, and IMAP. The small subset of affected customers may also notice a small delay in inbound mail. Our Operations teams are actively working to resolve the problem.

Once we have additional information we will report back.
Posted Aug 02, 2019 - 13:55 UTC
Investigating
We are investigating an issue preventing a small subset of users on Cluster B from logging into Webmail. IMAP and POP services are also affected. Our operations team has been engaged.

We will provide an update once we have additional information.
Posted Aug 02, 2019 - 13:44 UTC
This incident affected: Hosted Email (Cluster B, Webmail).