Updates related to Incident 10941

Storefront is Online

Updated Monday, March 8th, 2010 at 4:09 PM ET
2010-03-08 at 20:09 UTC

Storefront services are online. Over the past hour, there have been no instances of intermittent timeouts for domain look-ups.

Incident Summary: (updated 22:29 UTC/17:29 ET)
Over the past week, we have been conducting a phased code roll-out for new OpenSRS look-up functionality. These changes will significantly improve the response times for all look-ups (including name suggestion calls).

The new functionality uses parallel streamed look-up elements (name suggestion, calls to individual Registries, etc.) and new levels of caching (in-memory caching of recent look-ups, plus data from zone files) to achieve the response time goals.
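As an illustration only, the combination of parallel look-up elements and in-memory caching described above could be sketched as follows. This is not the OpenSRS implementation: the names `LookupCache` and `lookup_all`, the backend functions, and the 30-second time-to-live are all assumptions made for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class LookupCache:
    """Hypothetical in-memory cache of recent look-ups with a time-to-live."""

    def __init__(self, ttl_seconds=30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > self._ttl:
            del self._entries[key]  # entry expired, forget it
            return None
        return value

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)


def lookup_all(domain, backends, cache, executor):
    """Query every backend for `domain` in parallel, reusing cached answers.

    `backends` maps a label (e.g. "registry", "suggest") to a callable that
    takes the domain name; both the labels and callables are assumptions.
    """
    results, pending = {}, {}
    for name, fn in backends.items():
        cached = cache.get((name, domain))
        if cached is not None:
            results[name] = cached
        else:
            # Not cached: start the call now so all misses run concurrently.
            pending[name] = executor.submit(fn, domain)
    for name, future in pending.items():
        value = future.result()
        cache.put((name, domain), value)
        results[name] = value
    return results
```

A repeated look-up for the same name within the time-to-live is answered from memory, so only the first request pays the cost of the slowest backend.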

We had been performing volume and stress tests in QA and development environments since October 2009, but we recognized that we could not completely reproduce production loads and request profiles. We have therefore been testing this functionality extensively in small pieces in Production since December 2009. Positive results encouraged us to roll these pieces out as integrated components starting last Tuesday, March 2nd.

We steadily increased the load on the new infrastructure through Thursday, March 4th, until roughly 80% of load was using the new functionality by Friday, March 5th. On that day we first experienced an issue where particular 'workers' (components that process different types of look-up commands) reached a maximum queue size that, due to volume, they could not clear. This resulted in some timeouts back to requesting clients. These workers were restarted and full service resumed within roughly an hour. Analysis then pointed to an issue with the system time being out of sync between components on different machines. This was corrected on all machines and no issues were detected for the remainder of Friday and over the weekend.

However, we experienced the same intermittent domain look-up timeout issue again this morning (Monday, March 8th). The development team was prepared for such an occurrence and gathered more granular log information before the workers were restarted again to restore service.

Analysis of the logs indicated that:

1 - We were seeing a request pattern and volume that we did not encounter in the QA and development environments.
2 - Our application load-balancing component was reacting to that pattern of requests by favouring certain types of look-ups for faster processing.
3 - Certain types of workers which had been configured to handle high-volume look-up transactions (mostly .COM and .NET look-ups) were also configured to handle the lower-volume (but higher-latency) look-ups that were being favoured by the load-balancing function. This eventually caused the queues for high-volume transactions to fill up and never drain.
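The queue behaviour described in these three points can be reproduced with a small tick-based simulation. This is a simplified sketch under assumed numbers, not a model of the production system: the function `simulate`, its parameters, and every value in the usage example are hypothetical.

```python
from collections import deque


def simulate(ticks, workers, fast_per_tick, slow_every, slow_cost,
             max_queue, favour_slow=True):
    """Return how many fast look-ups were rejected because the queue was full.

    Fast look-ups occupy a worker for 1 tick; slow look-ups occupy it for
    `slow_cost` ticks. One slow look-up arrives every `slow_every` ticks.
    """
    busy_until = [0] * workers  # tick at which each worker becomes free
    fast_q, slow_q = deque(), deque()
    rejected = 0
    for t in range(ticks):
        # Arrivals: the fast queue is bounded, the overflow is rejected.
        for _ in range(fast_per_tick):
            if len(fast_q) < max_queue:
                fast_q.append(t)
            else:
                rejected += 1
        if t % slow_every == 0:
            slow_q.append(t)
        # Dispatch: every free worker takes one job from the shared pool.
        for w in range(workers):
            if busy_until[w] > t:
                continue
            if slow_q and (favour_slow or not fast_q):
                slow_q.popleft()
                busy_until[w] = t + slow_cost
            elif fast_q:
                fast_q.popleft()
                busy_until[w] = t + 1
    return rejected


# Frequent slow look-ups that are favoured by dispatch starve the fast queue.
print(simulate(ticks=200, workers=4, fast_per_tick=3,
               slow_every=2, slow_cost=6, max_queue=10))
```

With these assumed values the slow look-ups alone need about three of the four workers, so the fast queue reaches its limit within a few ticks and requests start being rejected, which is the same shape as the timeouts seen by clients.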

We have now implemented two things to fix the problem:

1 - Added a large amount of worker capacity to the pools
2 - Re-configured the workers to separate the high-volume look-ups from all other types
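The second fix, separating high-volume look-ups from all other types, amounts to routing each request to a dedicated worker pool. The sketch below is one possible way to express that idea and is not the actual configuration: `LookupRouter`, `pool_name_for`, the pool names, and the pool sizes are hypothetical, and the .COM/.NET set is taken only from the "mostly .COM and .NET" remark above.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumption: high-volume traffic is identified purely by top-level domain.
HIGH_VOLUME_TLDS = {"com", "net"}


def pool_name_for(domain):
    """Pick the pool for a domain based on its top-level domain."""
    tld = domain.rsplit(".", 1)[-1].lower()
    return "high_volume" if tld in HIGH_VOLUME_TLDS else "other"


class LookupRouter:
    """Hypothetical router that keeps a separate worker pool per class."""

    def __init__(self, high_volume_workers, other_workers):
        self._pools = {
            "high_volume": ThreadPoolExecutor(max_workers=high_volume_workers),
            "other": ThreadPoolExecutor(max_workers=other_workers),
        }

    def submit(self, domain, handler):
        # A slow look-up can only occupy workers in its own pool.
        return self._pools[pool_name_for(domain)].submit(handler, domain)

    def shutdown(self):
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

Because the pools share no workers, a backlog of higher-latency look-ups in the "other" pool cannot delay .COM and .NET requests; the trade-off is that idle capacity in one pool is not available to the other.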

Since then we have seen no recurrence of the issues, and we believe that, having taken these steps, the service has returned to full stability. We are also re-evaluating our QA tests so that they look for this and similar issues in the future, to avoid a recurrence after new code is released.

This update is related to

Domain Service is Online

Updated Monday, March 8th, 2010 at 4:07 PM ET
2010-03-08 at 20:07 UTC

Domain services are online. Over the past hour, there have been no instances of intermittent timeouts for domain look-ups.

This update is related to

Storefront is Degraded

Updated Monday, March 8th, 2010 at 12:46 PM ET
2010-03-08 at 16:46 UTC

Storefront customers may experience intermittent service issues with domain look-ups. Each disruption lasts under 15 minutes. Our technical teams are monitoring, testing and investigating.

Update: (18:05 UTC/13:05 ET)
Our Network Operations Center team advises that instances of intermittent domain look-up timeouts have decreased significantly. All of our technical teams are working to test, monitor and address this issue.

Update: (19:01 UTC/14:01 ET)
Storefront domain look-ups are working well. We implemented a change to our systems and are closely monitoring the results. We are leaving our status message as "Degraded" while we continue to closely test and analyze services.

This update is related to

Domain Service is Degraded

Updated Monday, March 8th, 2010 at 12:45 PM ET
2010-03-08 at 16:45 UTC

Customers may experience intermittent service issues with domain look-ups. Each disruption lasts under 15 minutes. Our technical teams are monitoring, testing and investigating.

Update: (18:05 UTC/13:05 ET)
Our Network Operations Center team advises that instances of intermittent domain look-up timeouts have decreased significantly. All of our technical teams are working to test, monitor and address this issue.

Update: (19:01 UTC/14:01 ET)
Domain look-ups are working well. We implemented a change to our systems and are closely monitoring the results. We are leaving our status message as "Degraded" while we continue to closely test and analyze services.

This update is related to

Storefront is Online

Updated Monday, March 8th, 2010 at 11:29 AM ET
2010-03-08 at 15:29 UTC

Storefront services are online. Customers should no longer experience intermittent domain look-up issues.

We will provide an incident summary once it is available. Our technical teams continue to investigate.

This update is related to

Domain Service is Online

Updated Monday, March 8th, 2010 at 11:28 AM ET
2010-03-08 at 15:28 UTC

Domain services are online. Customers should no longer experience intermittent domain look-up issues.

We will provide an incident summary once it is available. Our technical teams continue to investigate.

This update is related to

Storefront is Degraded

Updated Monday, March 8th, 2010 at 10:53 AM ET
2010-03-08 at 14:53 UTC

We are investigating intermittent issues with domain look-ups. Storefront customers may experience timeouts. We will have more details soon.

Update: (15:00 UTC/10:00 ET)
Our technical teams are reviewing the logs. We continue to investigate.

Update: (15:13 UTC/10:13 ET)
Our technical teams are testing and seeing more successful results with look-ups. Services continue to be degraded while we work to analyze and address the symptoms.

This update is related to

Domain Service is Degraded

Updated Monday, March 8th, 2010 at 10:52 AM ET
2010-03-08 at 14:52 UTC

We are investigating intermittent issues with domain look-ups, including those made via the API. Resellers may experience timeouts. We will have more details soon.

Update: (15:00 UTC/10:00 ET)
Our technical teams are reviewing the logs. We continue to investigate.

Update: (15:13 UTC/10:13 ET)
Our technical teams are testing and seeing more successful results with look-ups. Services continue to be degraded while we work to analyze and address the symptoms.

This update is related to