Email Cluster A Status Archive
Email Cluster A is Degraded
Resellers on OpenSRS Email Service Cluster 'A' may notice delays when provisioning new email accounts. Our operations team is aware of the issue and is investigating the cause. We'll keep you posted. Existing users on Cluster 'A' are unaffected and mail continues to flow in and out as usual.
This update is related to Incident 19899
Email Cluster A is Online
The maintenance has completed.
Email Cluster A is In Maintenance
OpenSRS Email Cluster A has a three-hour maintenance window beginning now.
During this three-hour maintenance window, we'll be performing firmware updates on storage components that are used by a very specific subset of users on Email Cluster A.
We do not expect that this maintenance will have any impact on your email users, but we're letting you know about the maintenance out of an abundance of caution.
Email Cluster A is Online
We have been watching Cluster A closely for about an hour now and it looks like there are no further signs of any intermittent slowness. As a result, we're going to set the status back to 'Online' at this time. We'll keep monitoring things as usual.
This update is related to Incident 19120
Email Cluster A is Degraded
At this time (and for the past 20 minutes or so) it appears that we're no longer experiencing the intermittent slowness issue with Cluster A. Mail is flowing normally and your users should be able to log in as usual via POP, IMAP and webmail.
That said, we're going to leave the status as 'Degraded' for the next little bit while we continue to closely monitor the service.
This update is related to Incident 19120
Email Cluster A is Degraded
We’re currently experiencing an issue with Email Cluster A. Users may experience intermittent slowness when attempting to log in via POP, IMAP or webmail. We’re investigating and we’ll let you know more shortly.
This update is related to Incident 19120
Email Cluster A is Online
At this time all mailboxes on Cluster A are accessible via IMAP, POP, SMTP and Webmail.
While waiting for the ZFS kernel patch from our hardware vendor, we developed a workaround based on our knowledge of what the patch was being designed to correct. We feel that this workaround has addressed the issue we have been experiencing.
The workaround was put in place about an hour ago and since then Reseller mail queues have been clearing and the service has been returning to normal.
We will be watching the system extremely closely overnight and at the same time taking additional actions to reduce the likelihood of further issues as we return to peak volume in the morning.
Incident Report
Executive Summary
From June 20th at 07:30 until June 21st at 23:21 (UTC), we experienced multiple mail events that resulted in intermittent loss of connectivity to IMAP/POP3, SMTP and Webmail. Inbound and outbound mail delivery was delayed for approximately 25% of the mailboxes provisioned on Cluster A.
We fully recovered from the event by implementing a workaround for a previously unknown bug in the ZFS filesystem, working closely with senior engineers from Oracle/Sun. This bug manifests only under certain conditions. In this event, a combination of storage capacity in use (over 50%), a large number of small files and other factors artificially maxed out the ZFS write capacity, which resulted in extremely high disk write latency and intermittent mail delivery delays.
A full solution in the form of a software patch has been provided by Oracle/Sun and will be implemented at a later date through our change control process after we've had a chance to test it in our QA environment.
The full update that details our efforts toward issue resolution can be found here: Incident #18894.
This update is related to Incident #18894
Email Cluster A is Degraded
We continue to experience degraded service on Cluster A of our Email Service.
The hardware vendor of the affected storage system, Oracle/Sun, believes they have identified a previously unknown bug in the ZFS kernel that is causing locking contention on our system during peak volumes. Their engineers are currently working on a patch and we hope to have this in place in the next few hours.
We’re hopeful this patch will resolve the issue, but we will not know for sure until it has been developed, QAed and installed in our production environment, and the service has been brought back online.
We will provide our next update in two to three hours.
This update is related to Incident #18894
Email Cluster A is Degraded
We continue to experience degraded service on Cluster A of our Email Service.
Affected users will experience slow or no access to POP, IMAP, SMTP and Webmail services for a period of time followed by full access.
We’ve reduced the share of mailboxes affected to under 20%. Within the next few hours most resellers will see all of their customers’ mailboxes online the majority of the time. Intermittent inability to access mail may, however, remain an issue until we have fully resolved the root cause.
Most of our efforts are now being directed towards determining if a Solaris kernel error is causing the issue. We’re working with Oracle/Sun on this. As a precaution we’ve also given a second team the task of investigating other alternatives, including ones previously ruled out. We don’t think we’ll find anything from this but feel it prudent to double check everything at this point.
We’re very sorry for the hassle this is causing you and your customers.
We will provide our next update in roughly two hours.
This update is related to Incident #18894
Email Cluster A is Degraded
Here's our latest update related to the Cluster A mail storage issue.
With the help of Oracle/Sun engineers, we're exploring the possibility of the operating system kernel as the source of the issue. Once again, we're very sorry for the inconvenience this issue has caused you and your customers.
We expect the next update will be provided in roughly two hours.
This update is related to Incident #18894

