Updates related to Incident #18894
Email Cluster A is Online
At this time all mailboxes on Cluster A are accessible via IMAP, POP, SMTP and Webmail.
While waiting for the ZFS kernel patch from our hardware vendor, we developed a workaround based on our knowledge of what the patch was being designed to correct. We feel that this workaround has addressed the issue we have been experiencing.
The workaround was put in place about an hour ago and since then Reseller mail queues have been clearing and the service has been returning to normal.
We will be watching the system extremely closely overnight and, at the same time, taking additional actions to reduce the likelihood of further issues as we return to peak volume in the morning.
Incident Report
Executive Summary
Between June 20th at 07:30 and June 21st at 23:21 (UTC), we experienced multiple mail events that resulted in the intermittent loss of connectivity to IMAP/POP3, SMTP and Webmail. Inbound and outbound mail delivery was delayed for approximately 25% of the mailboxes provisioned on Cluster A.
We were able to fully recover from the event by implementing a workaround for a previously unknown bug in the ZFS filesystem, working closely with senior engineers from Oracle/Sun. This particular bug only manifests itself under certain conditions. In the case of this event, a combination of storage capacity in use (over 50%), a large number of small files and other factors artificially maxed out the ZFS write capacity, which resulted in extremely high disk write latency and intermittent mail delivery delays.
A full solution in the form of a software patch has been provided by Oracle/Sun and will be implemented at a later date through our change control process after we've had a chance to test it in our QA environment.
The full update that details our efforts toward issue resolution can be found here: Incident #18894.
This update is related to Incident #18894
Email Cluster A is Degraded
We continue to experience degraded service on Cluster A of our Email Service.
The hardware vendor of the affected storage system - Oracle/Sun - believes they have identified a previously unknown bug in the ZFS kernel that is causing locking contention on our system during peak volumes. Their engineers are currently working on a patch and we hope to have this in place in the next few hours.
We’re hopeful this patch will resolve the issue, but until it has been developed, QAed and installed in our production environment, and the service has been brought back online, we will not know for sure.
We will provide our next update in two to three hours.
This update is related to Incident #18894
Email Cluster A is Degraded
We continue to experience degraded service on Cluster A of our Email Service.
Affected users will experience slow or no access to POP, IMAP, SMTP and Webmail services for a period of time followed by full access.
We’ve reduced the number of mailboxes affected to less than 20%. Within the next few hours most resellers will see all of their customers’ mailboxes online the majority of the time. Intermittent inability to access mail may, however, remain an issue until we have fully resolved the root cause.
Most of our efforts are now being directed towards determining if a Solaris kernel error is causing the issue. We’re working with Oracle/Sun on this. As a precaution we’ve also given a second team the task of investigating other alternatives, including ones previously ruled out. We don’t think we’ll find anything from this but feel it prudent to double check everything at this point.
We’re very sorry for the hassle this is causing you and your customers.
We will provide our next update in roughly two hours.
This update is related to Incident #18894
Email Cluster A is Degraded
Here's our latest update related to the Cluster A mail storage issue.
With the help of Oracle/Sun engineers, we're exploring the possibility of the operating system kernel as the source of the issue. Once again, we're very sorry for the inconvenience this issue has caused you and your customers.
We expect the next update will be provided in roughly two hours.
This update is related to Incident #18894
Email Cluster A is Degraded
We're continuing to investigate the mail issue affecting Cluster A.
Yesterday we engaged our vendor, Oracle/Sun, to assist us. Earlier this morning, Oracle escalated the issue to their senior engineers, who are continuing to work with us to further diagnose the root cause affecting Cluster A's mail storage.
We expect to have more details around their investigation within the next couple of hours at which time we'll provide further updates.
This update is related to Incident #18894
Email Cluster A is Degraded
We are currently experiencing degraded service on approximately 25% of mailboxes on Cluster A of our Email Service. The majority of these mailboxes belong to Tucows itself and to customers of Hover. Other resellers on Cluster A are also affected but to a lesser extent.
Affected users will intermittently experience slow or no access to POP, IMAP, SMTP and Webmail services, followed by periods of full access.
We have identified the problem as relating to mail-stores located on one particular hardware configuration within this cluster. We have, however, been unable to identify the root cause.
Everyone who can possibly help resolve this is working on finding a fix and we will do our best to give you more details as they become available.
We’re very sorry for the hassle this is causing you and your customers.
We will provide updates at roughly two-hour intervals until the issue is resolved.
BACKGROUND
We first saw this issue Monday morning (EDT), a time when we typically see peak load on the system. Mornings, as Europe and then North America wake up, are the daily peaks, and the first day of the week is the weekly peak.
We investigated several different scenarios and ruled each out:
- Attack (DDoS, Spam Flood)
- Network Issues (packet loss, physical connections etc.)
- Hardware/Code Issues
Once we narrowed it down to an issue with the Storage system we checked:
- All storage system hardware
- Network connections to the storage system
- CPU usage
- Available bandwidth
- File system corruption
- RAID or parity errors
- Cache hit/miss rate
...etc.
About eight hours after the first occurrence we identified a misconfigured script that we believed could be the root cause. Due to human error, the script (intended to take snapshots daily at 4:00, 8:00, 12:00, 16:00 and 20:00 hours) had been configured to run at 4, 8, 12, 16 and 20 minutes past the hour, every hour. We noticed a correlation between this script running and the periods of unavailability. We corrected the script and restarted the mail-stores one at a time. The problem did not reappear.
After monitoring for a while, we returned Cluster A’s status to Online. We hoped we had found the root cause but could not say so definitively: by then we had left the window of peak traffic, so we could not be certain the problem would not reappear under peak load.
Today, as Europe got into the office, the issue presented itself again. It appears that the misconfigured script may not have been the root cause.
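To illustrate the scale of that scheduling mistake, here is a minimal sketch. It assumes a cron-style scheduler with separate minute and hour fields. The actual script and scheduler are not shown in this report, so the names `daily_runs` and `SNAPSHOT_SLOTS` and the cron expressions in the comments are hypothetical.

```python
from itertools import product


def daily_runs(minutes, hours):
    """Return the sorted (hour, minute) pairs at which a job fires in one day."""
    return sorted(product(hours, minutes))


# The five values from the incident description (hypothetical constant name).
SNAPSHOT_SLOTS = [4, 8, 12, 16, 20]

# Intended: values in the hour field, minute fixed at 0.
# Assumed cron form: "0 4,8,12,16,20 * * *"
intended = daily_runs(minutes=[0], hours=SNAPSHOT_SLOTS)

# Misconfigured: the same values in the minute field, firing every hour.
# Assumed cron form: "4,8,12,16,20 * * * *"
misconfigured = daily_runs(minutes=SNAPSHOT_SLOTS, hours=range(24))

print(len(intended))       # 5 snapshots per day
print(len(misconfigured))  # 120 snapshots per day
```

Under these assumptions, the misplaced values turn 5 snapshots a day into 120, which would be consistent with the correlation we saw between the script running and the periods of unavailability.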
We are now working on further analysis of the situation with the hope of identifying and correcting the root cause of the issue.
This update is related to Incident #18894
Email Cluster A is Degraded
We've been monitoring Email Cluster A closely since yesterday's intermittent IMAP/POP3, SMTP and webmail login failure event, and it appears that, as of this morning, the symptoms have returned. We're investigating and will provide updates as soon as we have them.
This update is related to Incident #18894
Email Cluster A is Online
About 60 minutes ago, we brought the last of the affected mailboxes online.
We've since been monitoring the service closely and all tests have been positive. We believe we have found and corrected the root cause of this event.
We're going to continue to monitor the service and will provide a full incident report within the next 24 hours.
This update is related to Incident #18894
Email Cluster A is Degraded
We continue to bring more mailboxes online and are closely monitoring the service. We'll provide another update within the hour.
This update is related to Incident #18894
Email Cluster A is Degraded
We've identified the cause of the event to be related to storage and we've made progress in bringing this event to full resolution.
Earlier today, this event started with approximately 25% of the mailboxes on Cluster A affected by IMAP, POP3, SMTP and webmail login errors.
About thirty minutes ago, we implemented a change that has further reduced the number of affected mailboxes to less than 10%. We're closely monitoring the service and once the platform has stabilized, we hope to bring the remaining 10% of mailboxes back online.
This update is related to Incident #18894

