Email Cluster A Status

Online

Updated Sunday, January 11th, 2009 at 12:01 PM ET
2009-01-11 at 17:01 UTC

All users now have full access to mail, and systems are performing normally. Undelivered spam is now being delivered, and we expect that queue to be cleared later today.

We have begun a thorough root cause analysis with our storage vendor NetApp and will provide you with an incident summary by the end of the business day on Monday.

Once again, we apologize to you and your customers for the inconvenience.

Update: Monday, January 12, 2009:
Incident Summary:

We strongly suspect a long-existing hardware defect caused this incident. Our technical team is continuing its analysis, and we will follow up with more details in the coming days. Our vendor is also conducting a root cause analysis in parallel.

NetApp released a shelf controller firmware upgrade on November 17, 2008. This release fixes 3 software bugs, a number of which are pertinent and critical to our environment. Two of these critical bugs have affected controller reliability in the past.

Our approach to all releases is to allow some time for vendors to release updates and patches. We followed this approach with the November 17 NetApp release.

We first rolled this shelf controller firmware upgrade to our Development QA environment in November and subsequently rolled the upgrade out to our pre-production environment in December. These devices are the exact same models as our production environment. We conducted this shelf controller upgrade on over a hundred devices without incident.

Our primary OpenSRS technicians working on these upgrades are NetApp certified specialists. Based on testing, the time we've been running NetApps, and our previous experience making 10 successful firmware upgrades on this type of component, we were confident in undertaking maintenance on Cluster A of our email service.

During our scheduled maintenance on Cluster A (January 10, 2009: 06:00 UTC - 09:00 UTC), the shelf controller firmware upgrade of our NetApp caused a failure of the controlling disk head in the storage pool. The controlling disk head lost access to a shelf and handed control over to the second disk head, which then triggered a rebuild of the disks on that shelf. This directly affected 3 mailstores.

For the first 2.5 hours, all Cluster A mailboxes were offline for testing and to alleviate stress on the one functioning disk head while the rebuild occurred. At approximately January 10, 2009: 12:30 UTC, after thorough testing and confirmation from NetApp, we reactivated the offline disk head, which restored access for 50% of customer mailboxes on the cluster as well as forward-only and filter-only accounts. These customers were able to access their mailboxes and send and receive mail.

The rebuild continued on the 3 affected mailstores. The rebuild process is sequential: to restore service, each of the affected mailstores must be rebuilt in turn. This meant that 50% of customer mailboxes remained offline for approximately 20 hours. We determined that keeping these mailboxes offline would let the rebuild complete more quickly, restore service efficiently, and maintain service for the unaffected 50% of mailboxes.

Our technical team closely monitored the rebuild process throughout the day. We also worked closely with NetApp to analyze the issue and determine possible ways to speed up the rebuild process without adversely affecting the system. The first mailstore was rebuilt on January 10: 16:55 UTC, the second on January 11: 04:05 UTC, and the third on January 11: 14:20 UTC.

Full access was provided to the remainder of customers by January 11, 2009: 05:00 UTC, once the second mailstore had finished rebuilding. The third mailstore rebuild was a single-disk rebuild, which could run in the background without affecting mailbox access. Inbound mail began flowing for most of the affected mailboxes. We determined that inbound mail for 10% of customers would need to be queued so as not to impede the final mailstore rebuild. These customers were able to access their mailboxes and send mail. Inbound mail was delayed by an additional 10 hours for these customers.

Inbound mail (ham) was delivered throughout the night, with queues flushed at January 11, 2009: 17:00 UTC. At this time, Cluster A was fully online.

Spam email delivery was enabled and continued for the remainder of the day.

Update: January 16, 2009

Dear Customers -

On behalf of all of the members of the OpenSRS team, please accept our sincere and deepest apologies for the service disruption on Cluster A this past weekend.

Many of you have asked, “How could we have let this happen again?” We were initially led to believe that we had a software problem. We have now determined that the string of service problems on Cluster A is related to a hardware problem inside one of our NetApp devices.

Below is a letter of explanation I received from Jeff Goldstein, General Manager at NetApp Canada.

We are not without fault in this situation. Network-attached storage is complex and we trusted our vendor to provide us with accurate advice related to our problems. In hindsight, we should have pressed earlier for replacement hardware.

Please rest assured that we are dedicated to providing a reliable email service and will be working tirelessly to restore your confidence in us. An incident report is available at OpenSRS Status.

Sincerely,
Elliot Noss,
President and CEO, Tucows

Dear Elliot Noss,

I am writing today regarding the recent outage that occurred this past weekend with Cluster A of the OpenSRS Email Service.

As you are aware, Cluster A of the OpenSRS Email Service has experienced a number of service degradations related to issues with our NetApp storage device. Our engineers here at NetApp worked closely with the technical operations and development teams at OpenSRS to troubleshoot and resolve these issues. In each of the cases, we believed a software fault was the cause.

The intermittent problem turned out to be due to the hardware shelf controller as well as firmware in one of our NetApp storage devices, which caused the issues on Cluster A.

We are deeply sorry for the inconvenience that resulted from these hardware and email service issues.

One of the promises we make to our customers is that our solutions provide highly available data management and in this case we let you down.

To begin to resolve this issue, we’re taking immediate action to replace the hardware and firmware in Cluster A at our expense. Our engineers will then test and evaluate the components involved to determine what specifically went wrong and apply those findings back into our own quality control teams.

Our two companies have been working together for the past nine years. We value our relationship and will work hard to restore your confidence in NetApp and our solutions.

Again, please accept our sincere apologies.

Regards,

Jeff Goldstein
Canadian General Manager
NetApp Canada
