Cluster A - Login and Sending/Receiving Issues
Incident Report for OpenSRS
Postmortem

Incident Summary

On May 1, 2023, at 5:00 PM EDT, our system alerted us to a surge in request volume on Cluster
A of our Hosted Email solution. Our Development Operations (DevOps) Team began
investigating and eventually engaged our Security Operations (SecOps) Team, who confirmed
that a large-scale Distributed Denial of Service (DDoS) attack was underway and moved to
manually block the offending IPs. We successfully mitigated the attack; however, the
unprecedented volume of requests overwhelmed our authentication service, which caused
failures to log in and to send and receive email. It also exposed a bug in our statistics file that
was consuming excessive memory resources, which further delayed our recovery.

By May 2, 2023, at 2:10 PM EDT, our DevOps Team had implemented fixes for the
authentication service issue and the statistics file bug. We then began to slowly ramp up
processing of our email service to full capacity. By May 3, 2023, at 8:00 AM EDT, the backlog of
pending inbound and outbound email was completely cleared.

Timeline of Events

May 1, 2023, 5:00 PM EDT - Our system alerted us to a surge in request volume on Cluster A of
our Hosted Email service. Our DevOps Team began investigating. Also around this time, our
Support Team began to see customer reports about service issues.

May 1, 2023, 5:03 PM EDT - Our Support Team posted the first status page update about the
incident, informing customers that we were experiencing service issues on Cluster A and
actively investigating the root cause.

May 1, 2023, 5:36 PM EDT - Our SecOps Team was engaged to further investigate, as the
number of requests to Cluster A was growing aggressively.

May 1, 2023, 5:47 PM EDT - Our SecOps Team confirmed it was a DDoS attack and began to
manually block the abusive IPs.

May 1, 2023, 6:00 PM EDT - The DDoS attack was successfully mitigated. However, our
DevOps Team was still seeing log-in issues and email request failures. They continued to
investigate.

May 2, 2023, 6:19 AM EDT - We pinpointed an issue with our authentication service. Our
DevOps Team began to explore potential solutions.

May 2, 2023, 1:45 PM EDT - We discovered that memory utilization was growing faster than
expected and identified a bug in our statistics process as the cause.

May 2, 2023, 2:10 PM EDT - DevOps promoted fixes for both the authentication service issue
and the stats file bug. We then began to slowly ramp up processing of our hosted email service
to full capacity. Users began to experience successful log-in attempts, and our service began to
process the backlog of pending inbound and outbound email requests.

May 2, 2023, 7:00 PM EDT - We officially marked the incident as closed. The backlog of
pending email requests was completely cleared by May 3, 2023, at 8:00 AM EDT.

Impact Analysis

The root cause of this service interruption was an authentication service failure caused by
unprecedentedly high traffic during a DDoS attack. Our recovery time was then further delayed
by a bug in our system wherein a statistics file was excessively writing to memory. The
authentication service is a critical component of the Hosted Email infrastructure; most other
services within Hosted Email run through the authentication service in order to maintain a
secure environment and access the metadata required to process requests. Consequently,
when it began to fail, the service impact was substantial.

Once the necessary fixes were deployed and Hosted Email was made fully operational, there
remained a backlog of pending email requests that had accumulated during the downtime. Our
system protects against email loss by creating a queue of inbound and outbound emails. During
general operation, this queue is incredibly small. However, during the event, a sizable backlog
was created, which took our service, once fully restored, 8 hours to clear. No data or emails
were lost. All backlogged email was time-stamped according to when it was delivered, as per
standard operating procedure.
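
The queue-and-ramp-up behavior described above can be illustrated with a minimal sketch. It is not our production code: the `Message` class, the `drain` function, and the rate parameters are all hypothetical names chosen for this example. It only shows the general idea that queued messages are held rather than dropped, are delivered in order as the per-tick rate is raised gradually, and are stamped with the time of actual delivery.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:
    msg_id: int
    queued_at: float                      # when the message entered the queue
    delivered_at: Optional[float] = None  # set only when actually delivered


def drain(queue, start_time, initial_rate, max_rate, step):
    """Deliver queued messages tick by tick, raising the rate gradually.

    All parameters are illustrative assumptions: one tick is one time unit,
    and the rate grows by `step` per tick until it reaches `max_rate`.
    """
    if initial_rate < 1:
        raise ValueError("initial_rate must be at least 1")
    delivered = []
    now = start_time
    rate = initial_rate
    while queue:
        # Deliver at most `rate` messages in this tick, oldest first.
        for _ in range(min(rate, len(queue))):
            msg = queue.popleft()
            msg.delivered_at = now  # stamped at delivery time, not queue time
            delivered.append(msg)
        now += 1.0
        rate = min(max_rate, rate + step)  # slow ramp-up toward full capacity
    return delivered
```

In this sketch, a backlog of ten messages drained with `initial_rate=1`, `max_rate=4`, and `step=1` is delivered over four ticks in batches of one, two, three, and four, with nothing discarded.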

Response and Mitigation

Our DevOps Team started investigating the surge in traffic on May 1, 2023, at 5:00 PM EDT, in
response to a system alert. The rate of requests to our system continued to increase, and on May
1, 2023, at 5:36 PM EDT, our DevOps Team engaged our SecOps Team to further investigate.
SecOps concluded it was a DDoS attack and, at 5:47 PM EDT on May 1, 2023,
started to block the IPs responsible for the spike in requests. By approximately 6 PM EDT, this action had
successfully mitigated the attack, and no further spikes in request volume occurred. Concurrently with this increase in
request volume, our Support Team saw an increasing number of reports of Webmail log-in
failures and failures to send and receive email.
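
For readers unfamiliar with how offending IPs are typically singled out during a request flood, the following is a generic sketch of a per-IP sliding-window counter. It does not describe the tooling our SecOps Team used, which was a manual process; the function name `find_abusive_ips`, the window length, and the threshold are assumptions made purely for illustration.

```python
from collections import defaultdict, deque


def find_abusive_ips(events, window_seconds, threshold):
    """Return the set of IPs that exceed `threshold` requests in any window.

    `events` is an iterable of (timestamp, ip) pairs sorted by timestamp.
    The window length and threshold are illustrative, not real settings.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    flagged = set()
    for ts, ip in events:
        stamps = recent[ip]
        stamps.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while stamps and stamps[0] <= ts - window_seconds:
            stamps.popleft()
        if len(stamps) > threshold:
            flagged.add(ip)
    return flagged
```

A list produced this way would still need human review before blocking, since legitimate high-volume clients can also exceed a naive threshold.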

When our Hosted Email service did not recover as expected following the DDoS attack, DevOps
began investigating the cause of the continued service interruption. By May 2, 2023, at 6:19 AM
EDT, they had pinpointed the issue with our authentication service, and at 1:45 PM EDT, they
discovered the bug in our stats process. On May 2, 2023, at 2:10 PM EDT, they put in place
fixes for both issues. To fix the authentication issue, authentication traffic was split between two
services instead of being directed through a single service. The stats bug was addressed by
correcting the problematic code.
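
As a rough illustration of what splitting authentication traffic between two services can look like, the sketch below routes each login deterministically by hashing the username. This is an assumption for the sake of example: the report does not state how traffic was divided, and the backend names `auth-a` and `auth-b` and the function `pick_backend` are hypothetical.

```python
import hashlib

# Hypothetical backend names; the real service names are not in this report.
AUTH_BACKENDS = ["auth-a", "auth-b"]


def pick_backend(username, backends=AUTH_BACKENDS):
    """Choose a backend for `username` using a stable hash.

    SHA-256 is used instead of the built-in hash() so the same user maps
    to the same backend across processes and restarts.
    """
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Hash-based routing keeps a given user on one backend, which can help with caching; a round-robin split would be an equally plausible choice when per-user affinity does not matter.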

Lessons Learned

The root cause of the outage was a failure of our authentication service to sufficiently scale to
accommodate the severe spike in request volume. Prior to this event, the authentication service
had been identified as a service that needed to be better optimized. This incident will expedite
the process of rebuilding this service as the limitations have been clearly demonstrated.

Conclusion

While this service interruption was precipitated by a DDoS attack, the root cause was the
inability of our authentication service to adequately scale. We’re confident in the steps we’re
taking to mitigate this specific issue. This incident had a significant impact on our resellers and
their customers, and we are committed to addressing your concerns and questions.

We value our customer relationships, many of which are decades long, and we want to continue
to nurture and build long-lasting partnerships.

If you have any questions or feedback, please contact our Customer Service Team.

Thank you,
Tucows Domains Team

Posted May 11, 2023 - 12:09 UTC

Resolved
This incident has been resolved.
Posted May 03, 2023 - 11:59 UTC
Monitoring
At approximately 5 PM ET (9 PM UTC), our system alerted us to a large-scale DDoS attack on Cluster A of our Hosted Email Service. We effectively mitigated the risks; however, the unprecedented size of the attack exposed some limitations in the scalability of our authentication service. Our engineering team moved quickly to identify and address the root problem.

As of the time of this update, normal service operations have been restored, all users are now able to log in without issue, and all new send requests are being processed as expected. The remaining backlog of inbound email requests may take up to 12 hours to clear.

Having addressed the root issue, we are marking this incident as closed. We sincerely apologize for the impact this service interruption has had. In the coming weeks, we will release an Incident Report detailing what happened, how we approached and resolved it, and how we will perform better in the future.
Posted May 02, 2023 - 23:17 UTC
Update
Our system is now actively processing backlogged email requests. Users may continue to experience intermittent login issues and delays in sending and receiving email over the next few hours. By 7 PM ET, we expect the majority of the backlog to be cleared and for our email service to be functioning normally.

We will provide another update by 7 PM ET.
Posted May 02, 2023 - 19:02 UTC
Update
Users are still experiencing login issues as well as delays and failures to send and receive email. We're continuing the recovery process to restore service.

We will provide another update by 3 PM ET.
Posted May 02, 2023 - 18:20 UTC
Update
Users are still experiencing login issues as well as delays and failures to send and receive email. We’re continuing the recovery process to restore service.

We will provide another update by 2 PM ET.
Posted May 02, 2023 - 17:00 UTC
Update
Users are still experiencing login issues as well as delays and failures to send and receive email. We’re continuing the recovery process to restore service.

We will provide another update by 1 PM ET.
Posted May 02, 2023 - 16:00 UTC
Update
On May 1, 2023, at approximately 5 PM ET, our system alerted us to an attempted large-scale DDoS attack on Cluster A of our Hosted Email Service. We safely mitigated this attack.

However, the large volume of requests caused an interruption to our authentication service. We are now gradually ramping up our email service in accordance with our recovery procedure. No data has been or will be lost.

We will provide another update by 12 PM ET.
Posted May 02, 2023 - 14:30 UTC
Update
Restoration efforts are underway. There will be no data or email loss. Both inbound and outbound emails will be delivered gradually upon full restoration of the service.

We will provide another update by 11 AM ET.
Posted May 02, 2023 - 13:11 UTC
Update
Our Email team continues their work towards a resolution.

We apologize again for the inconvenience while we try to get this issue resolved.
Posted May 02, 2023 - 12:20 UTC
Update
Our email team continues to work as quickly as possible to resolve the Cluster A issues. We are not able to offer an ETR at this time, but we will continue to keep you updated. We appreciate your patience.
Posted May 02, 2023 - 10:57 UTC
Update
Our teams are still working hard to resolve the Cluster A issues. We will continue to update you with any new information or resolution.
We appreciate your patience.
Posted May 02, 2023 - 09:46 UTC
Update
We appreciate your patience as our team continues to work to resolve email on Cluster A. We are working as fast as possible to restore services and will continue to update with any new information.
Posted May 02, 2023 - 08:26 UTC
Update
Our teams are still working hard to resolve the Cluster A issues.
We will provide an update once we know more.
Posted May 02, 2023 - 07:26 UTC
Update
Our teams are still working hard to resolve the Cluster A issues. We will continue to update you with any new information or resolution.
We appreciate your patience.
Posted May 02, 2023 - 06:37 UTC
Update
Our team continues to work to resolve the Cluster A email issues. We will continue to update you when we have more information or the issue has been resolved. We appreciate your patience.
Posted May 02, 2023 - 05:37 UTC
Update
Nothing new to report at the moment. Our team is still hard at work trying to solve the problem.

Next update: within 60 minutes
Posted May 02, 2023 - 04:04 UTC
Update
We are continuing to work on a fix for this issue.
Posted May 02, 2023 - 02:47 UTC
Update
Our engineering team continues to work on resolving the issue. We appreciate your patience and will provide updates as they become available.
Posted May 02, 2023 - 02:39 UTC
Update
We are still working on this and will update again when there is more progress.
Posted May 02, 2023 - 00:22 UTC
Update
We apologize for the inconvenience and are working to resolve the issue as quickly as possible.
Posted May 01, 2023 - 23:01 UTC
Identified
We think we've found the problem, and we're working on a fix. Please stand by.
Posted May 01, 2023 - 22:07 UTC
Update
This impacts IMAP and POP connections as well.
Posted May 01, 2023 - 21:10 UTC
Investigating
Clients will have difficulties logging into the Cluster A webmail, as well as sending and receiving email. Our engineering team has been engaged.

We will provide an update once we have additional information.
Posted May 01, 2023 - 21:04 UTC
This incident affected: Hosted Email (Cluster A, Webmail).