Connection issues (API, HRS, OpenSRS)

Incident Report for OpenSRS

Postmortem

Incident Date: March 12, 2022
Incident Number: PR-2929

On March 12, 2022 at 2:23 AM ET, Tucows’ Domains platform experienced a service interruption impacting the contact privacy service, domain lookups and Hover. Tucows engineers were engaged and began investigating the issue.

The service interruption was caused by an increase in the number of domain lookups, which placed the system under high load.

At 3:13 AM ET, Tucows’ engineering team increased the severity of the incident after observing an increase in the external impact.

At 3:30 AM ET, the engineering team recovered the services by restarting the affected node to reduce the high load and stabilize the environment.

Tucows will further audit the system to make it more resilient to high volumes of traffic.

Tucows will revise and improve monitoring for better visibility.

Thank you,

Tucows Engineering Team

Posted Mar 18, 2022 - 18:29 UTC

Resolved

All domain lookup services have now recovered after our Domains team identified and resolved the issue. All domain orders and lookups will now function normally in all control panels. The total downtime was 1 hour 7 minutes.

Issue start time: March 12 2022 7:23am UTC
End time: March 12 2022 8:30am UTC
Posted Mar 12, 2022 - 09:44 UTC

Update

We are continuing to investigate this issue.
Posted Mar 12, 2022 - 08:17 UTC

Update

We have identified the issue impacting domain updates and orders, which is affecting OpenSRS, API and HRS. Domain lookups and updates are failing with 'Fatal server error has occurred'. Our team is working to resolve the interruption as soon as possible. In the meantime, domain orders and lookups from the RCP, Storefront, API or HRS will fail.

Incident start time: March 12th 2022 7:55am UTC
Posted Mar 12, 2022 - 08:16 UTC

Update

We are currently experiencing an incident that is impacting Domains (API, HRS, OpenSRS). Resellers may experience connection issues with OpenSRS, HRS and API. Our engineering team is working on the issue and we will update you as soon as possible.

During this issue, there will be service impacts to cps, domain lookups (dls), gearman, and storefront.
Posted Mar 12, 2022 - 08:15 UTC

Investigating

We are experiencing degradation of service. We will provide an update shortly with additional information.
Posted Mar 12, 2022 - 07:56 UTC

This incident affected: APIs (OpenSRS API, OpenHRS API), Control Panels (Reseller Control Panel, Classic RWI, End User Control Panel, Storefront, Payment Gateway), and Domain Services (Core gTLDs, Core ccTLDs, Other gTLDs, Other ccTLDs, Premium Names, WHOIS).