Domain Service Status
Online
Domain services are online. For the past hour, there have been no intermittent timeouts for domain look-ups.
Incident Summary: (updated 22:29 UTC/17:29 ET)
Over the past week, we have been conducting a phased code roll-out for new OpenSRS look-up functionality. These changes will significantly improve the response times for all look-ups (including name suggestion calls).
The new functionality uses parallel streamed look-up elements (name suggestion, calls to individual Registries, etc.) and new levels of caching (in-memory caching of recent look-ups, plus data from zone files) to achieve the response time goals.
We had been performing volume and stress tests in QA and development environments since October 2009, but we recognized that we could not completely reproduce production loads and request profiles. Therefore, we have been extensively testing this functionality in small pieces in Production since December 2009. Positive results encouraged us to roll these out as integrated components starting last Tuesday, March 2nd.
We steadily increased the load on the new infrastructure through Thursday, March 4th, and by Friday, March 5th, roughly 80% of load was using the new functionality. On that day we first experienced an issue where particular ‘workers’ (components that process different types of look-up commands) reached a maximum queue size that, due to volume, they could not clear. This resulted in some timeouts back to requesting clients. These workers were restarted and full service resumed within roughly an hour. Analysis then pointed to an issue with the system time being out of sync between components on different machines. This was corrected on all machines and no issues were detected for the remainder of Friday and over the weekend.
However, we experienced the same intermittent domain look-up timeout issue again this morning (Monday March 8th). The development team was prepared for such an occurrence and gathered more granular log information before the workers were again restarted to restore service.
Analysis of the logs indicated that:
1 - We were seeing a request pattern and volume that we did not encounter in the QA and development environments.
2 - Our application load-balancing component was reacting to that pattern of requests by favouring certain types of look-ups for faster processing.
3 - Certain types of workers which had been configured to handle high-volume look-up transactions (mostly .COM and .NET look-ups) were also configured to handle the lower-volume (but higher-latency) look-ups that were being favoured by the load-balancing function. This eventually caused the queues for high-volume transactions to fill up without draining.
We have now implemented two things to fix the problem:
1 – Added a large amount of worker capacity to the pools
2 – Re-configured the workers to separate the high-volume look-ups from all other types
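The effect of separating the pools can be illustrated with a small queue simulation. This is a sketch under assumed numbers, not a model of the production system: the job timings, the pool sizes and the name `worst_fast_wait` are all hypothetical. Short "fast" jobs stand in for the high-volume look-ups and a burst of long "slow" jobs stands in for the higher-latency ones.

```python
import heapq


def worst_fast_wait(jobs, workers):
    """Longest queue wait seen by a 'fast' job in a FIFO pool.

    jobs: iterable of (arrival_time, duration, kind) tuples.
    workers: number of identical workers serving one shared queue.
    """
    free_at = [0.0] * workers          # time at which each worker is next free
    heapq.heapify(free_at)
    worst = 0.0
    for arrival, duration, kind in sorted(jobs):
        start = max(arrival, heapq.heappop(free_at))  # earliest free worker
        if kind == "fast":
            worst = max(worst, start - arrival)
        heapq.heappush(free_at, start + duration)
    return worst


# Hypothetical workload: one short job per time unit, plus a burst of long jobs.
fast_jobs = [(t, 2.0, "fast") for t in range(100)]
slow_jobs = [(10, 40.0, "slow") for _ in range(4)]

# One shared pool of 4 workers: the slow burst occupies every worker.
shared = worst_fast_wait(fast_jobs + slow_jobs, workers=4)

# Same 4 workers split into two pools: fast jobs never queue behind slow ones.
isolated = worst_fast_wait(fast_jobs, workers=2)

print(f"worst fast-job wait, shared pool:   {shared}")
print(f"worst fast-job wait, separate pool: {isolated}")
```

With the same total number of workers, separation protects the fast jobs at the cost of longer waits for the slow ones, which is one reason to pair it with extra capacity, as in step 1.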
Since then we have seen no recurrence of the issues and we believe that, having taken these steps, the service has returned to full stability. We are also re-evaluating our QA tests so that this and similar issues are caught before new code is released.
This update is related to Incident 10941

