API Services
Incident Report for OpenSRS
Postmortem

Our operations team has identified and fixed the root cause of an intermittent API performance issue that has affected some OpenSRS resellers. This occurred on October 1, 2017, between 6:45pm and 9:45pm EDT, and again on October 2, between 9:20am and 10:20am EDT.

Description of the issue:

One of the country-code registries we are connected to limits the number of availability checks we can make. Once our system reached this threshold, we no longer received responses from that registry, and the connections opened by additional requests remained open and accumulated. Our system is designed to close these connections; however, the rate of incoming connections outpaced the rate at which unresponsive connections were cancelled. Under normal circumstances and load, this is not reseller-impacting. However, in combination with an extremely high and atypical volume of domain activity, the issue was amplified and created a cascading effect that led to the intermittent unavailability of the API and reseller control panel.
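The accumulation described above can be illustrated with a small back-of-the-envelope model. This is only a sketch: the function name and every rate below are hypothetical values chosen for illustration, not figures measured during the incident.

```python
def open_connections(arrivals_per_sec, cancels_per_sec, seconds, start=0):
    """Count connections still open after `seconds`, stepping one second at a time.

    Each second, new unresponsive connections arrive and the cleanup process
    cancels up to `cancels_per_sec` of them. The backlog can never go negative.
    """
    backlog = start
    for _ in range(seconds):
        backlog = max(0, backlog + arrivals_per_sec - cancels_per_sec)
    return backlog


# Hypothetical normal load: cleanup keeps up, so nothing accumulates.
normal = open_connections(arrivals_per_sec=5, cancels_per_sec=10, seconds=60)

# Hypothetical spike: arrivals exceed cancellations by 40 per second,
# so the backlog grows steadily for as long as the spike lasts.
spike = open_connections(arrivals_per_sec=50, cancels_per_sec=10, seconds=60)
```

With these made-up rates the backlog stays at zero under normal load and reaches 2400 open connections after one minute of the spike, which is the kind of growth that can exhaust a connection pool.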

API and control panel performance returned to normal at around 10:30am EDT on October 2, 2017.

Description of the fix:

We have implemented a system-wide check that monitors all connections used for domain lookups and ensures that unresponsive connections are proactively cancelled after a short period of time. The associated API requests will also receive an appropriate response from our systems, which prevents API connections from accumulating in an unresponsive state. This code change was promoted to our production environment at 5pm EDT on October 2, 2017.
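For readers who want a concrete picture of this kind of safeguard, the following is a minimal sketch of a per-lookup deadline. It is not the code that was deployed: the function names, the response fields, and the timeout value are all assumptions made for illustration, and the registry calls are simulated stand-ins.

```python
import asyncio

# Hypothetical value; the report only says "a short period of time".
LOOKUP_TIMEOUT_SECONDS = 0.05


async def unresponsive_registry(domain):
    """Stand-in for a registry that has stopped answering (hypothetical)."""
    await asyncio.sleep(3600)
    return "available"


async def responsive_registry(domain):
    """Stand-in for a registry that answers promptly (hypothetical)."""
    return "available"


async def lookup_with_deadline(domain, check, timeout=LOOKUP_TIMEOUT_SECONDS):
    """Run one availability check; cancel it and still answer the caller if it stalls."""
    try:
        status = await asyncio.wait_for(check(domain), timeout)
        return {"domain": domain, "status": status}
    except asyncio.TimeoutError:
        # asyncio.wait_for has already cancelled the stalled check here,
        # so it no longer holds a connection open.
        return {"domain": domain, "status": "unknown", "error": "registry timeout"}


stalled = asyncio.run(lookup_with_deadline("example.test", unresponsive_registry))
answered = asyncio.run(lookup_with_deadline("example.test", responsive_registry))
```

The point of the design is that the caller always receives a reply, either the registry's answer or an explicit timeout error, so a silent registry cannot leave API requests waiting indefinitely.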

We consider this issue resolved, and we apologize for any inconvenience it may have caused.

This API performance issue is in no way related to the Network Connectivity Issue that we experienced on September 29, 2017. While both affected the availability of the API, the root causes of the two incidents are entirely different. We apologize for the unfortunate timing of these issues.

As always, please contact OpenSRS support for help or additional information.

Posted Oct 03, 2017 - 21:45 UTC

Resolved
We have successfully implemented the changes to our system to better deal with the variety of ways that registries respond (or do not respond) to our system requests, and this issue is now considered resolved. We apologize for the inconvenience, and again, thank you for your patience.
Posted Oct 02, 2017 - 22:14 UTC
Update
Our operations team has identified the root cause of the intermittent API performance issue that has affected some of our resellers. This occurred on October 1st, 2017, between 6:45pm and 9:45pm EDT, and again today, October 2nd, between 9:20am and 10:20am EDT. API and control panel performance returned to normal mid-morning today. We continue to monitor the situation and will respond immediately to any issues that may arise while our engineering team works to implement a permanent fix.

We have determined that one of the country-code registries we connect to for domain availability lookups limits the number of checks available to us. Once our system reached this threshold, we no longer received any response from that registry, and the connections opened by additional requests remained open and accumulated. Our system is designed to close these connections; however, the rate of incoming connections outpaced the rate at which unresponsive connections were cancelled.

Under normal circumstances and load, this is not reseller-impacting. In combination with an extremely high and atypical volume of domain activity, however, the issue was amplified and, for the first time, produced a cascading effect that led to the intermittent unavailability of the API.

We are now in the process of implementing changes to our system that will better deal with the variety of ways that registries respond (or do not respond) to our system requests.

We will provide a further detailed update through this communication channel as we progress through the issue.

Thank you for your patience.
Posted Oct 02, 2017 - 19:30 UTC
Monitoring
Our operations team has identified the source of the API performance issues. API and control panel performance has returned to normal. We are continuing to monitor the situation at this time.
Posted Oct 02, 2017 - 14:56 UTC
Identified
We are currently experiencing issues with OpenSRS API services. This is causing degraded performance when loading our control panels. Our operations team is working to resolve this issue as quickly as possible.
Posted Oct 02, 2017 - 13:40 UTC
This incident affected: APIs (OpenSRS API, OpenHRS API, Email API) and Control Panels (Reseller Control Panel, Classic RWI).