Incident Date: January 15, 2022
Incident Number: PR-2762
On January 15, 2022, at 9:00 AM ET, Tucows’ engineering team began planned maintenance work to migrate the Enom platform to a new cloud infrastructure. Due to the complexity of the cutover, the team encountered numerous issues that caused repeated delays. The maintenance window was extended multiple times to address issues with data replication, network routing, and DNS resolution, which impacted website accessibility and email delivery.
On January 15th, 2022, at 11:33 PM ET, the database replication was completed, and the team continued with the rest of the migration steps, which extended the maintenance window by six (6) hours beyond the scheduled time. Unfortunately, the maintenance exceeded this extended period, and it was further extended several times due to issues with network routing, port exhaustion, and DNS resolution.
On January 17th, 2022, at 2:00 AM ET, the maintenance was completed; however, a number of issues remained unresolved and new ones were identified. The engineering team continued to investigate the issues as they were being identified via internal and external reports.
On January 17th, 2022, at 10:30 AM ET, customers’ ability to make DNS updates was disabled to avoid interference with our efforts to restore our DNS service. At this time, it was also identified that resellers could not search for or view their list of domains under management.
On January 17th, 2022, at 1:00 PM ET, an interim solution was implemented to address the DNS resolution issues that had impacted customer websites and email service delivery. At this time, we also restored customers’ ability to view existing domains. Later in the day, at 11:35 PM ET, the team restored customers’ ability to make DNS updates.
On January 17th, 2022, at 5:00 PM ET, payment processing issues were identified with PayPal and credit cards. The credit card processing issue was resolved at 9:00 PM ET. The PayPal payment feature was disabled on January 17th at 8:45 PM ET, and on January 21, 2022, at 1:30 PM ET, we released a solution to re-enable it.
On January 18th, 2022, at 5:55 AM ET, we identified that Enom Control Panel login attempts were failing due to a missing configuration on one of the web servers. This issue was resolved at 8:30 AM ET.
On January 18th, 2022, at 10:00 AM ET, the engineering team identified issues with system notification emails, including password reset emails and Registrant Verification (RAA) emails. These emails were delayed because routing changes and new IP assignments caused recipient mail servers to defer them. To resolve this, the engineering team worked with global providers to add the new IPs to their allow lists at 12:00 PM ET.
On January 18th, 2022, at 3:30 PM ET, all Enom services were validated and the major incident was deemed resolved.
In an effort to make the data center migration more seamless from our customers’ perspective, we opted to complete the entire migration in a single maintenance period. At the time, we felt this was the right decision. Moving forward, we plan on taking a step-by-step approach to large-scale migrations, staggering the work into shorter maintenance sessions over a longer period of time.
We also introduced a new DNS provisioning system as part of this migration, rather than simply mirroring the existing setup on the new infrastructure. This increased the complexity of the migration. Going forward, we will reduce the number of moving parts by handling maintenance work separately from the introduction of new system components.
2. Enhancing our monitoring practices
We monitor all aspects of our systems and operations, but during this incident, we identified some gaps. We are addressing these immediately, and the end result will be better high-level visibility that will allow us to identify and respond to complex scenarios faster.
3. More thorough planning and better migration readiness
Before the migration, we conducted testing and identified possible failure scenarios. However, the major failure we encountered during the migration, which resulted in missing DNS records, was one that we had not foreseen. In the future, we will conduct more robust crisis planning. This will involve spending more time identifying possible failure scenarios in order to develop a more nuanced migration roadmap, one that will help us not only avoid issues but also respond faster to any that do arise. This will be supported by a more thorough peer-review process.
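As an illustration of how a missing-DNS-record scenario might be checked for, the following is a minimal sketch that compares a snapshot of DNS records taken before a migration with one taken afterward. It is not the tooling used in this migration; the function names, the record representation as (name, type, value) tuples, and the sample data are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical representation of a DNS record: (name, type, value).
Record = Tuple[str, str, str]


def normalize(record: Record) -> Record:
    """Normalize the owner name and type so equivalent records compare equal.

    The value is only stripped of surrounding whitespace, because some
    record values (for example TXT data) are case-sensitive.
    """
    name, rtype, value = record
    return (name.rstrip(".").lower(), rtype.upper(), value.strip())


def diff_records(
    before: Iterable[Record], after: Iterable[Record]
) -> Dict[str, List[Record]]:
    """Return records that disappeared and records that newly appeared."""
    old = {normalize(r) for r in before}
    new = {normalize(r) for r in after}
    return {
        "missing": sorted(old - new),      # present before, absent after
        "unexpected": sorted(new - old),   # absent before, present after
    }


if __name__ == "__main__":
    # Sample data using documentation-reserved names and addresses.
    before = [
        ("Example.com.", "a", "192.0.2.10"),
        ("example.com", "MX", "10 mail.example.com."),
        ("www.example.com", "CNAME", "example.com."),
    ]
    after = [
        ("example.com", "A", "192.0.2.10"),
        ("www.example.com", "CNAME", "example.com."),
    ]
    report = diff_records(before, after)
    for record in report["missing"]:
        print("MISSING:", record)
```

A comparison of this kind, run per zone immediately after cutover, would flag the MX record in the sample data as missing. How the snapshots are collected (zone export, database query, or live lookups) is left open here.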
4. Extend Communications
We know that many of our customers felt that our communications during this incident were insufficient. We will gather additional customer feedback and are committed to using alternative channels, outside of email, to communicate with customers. We will also validate customer contact information within the platform to ensure communications reach the intended customers.
This incident report does not conclude our investigations and planning to prevent future system outages and downtime. We let you down, and we let ourselves down in the process. We sincerely apologize to all those who were impacted. We will do better.
We would also like to acknowledge our community who, despite the downtime and increasing stress levels, continued to treat us with respect and work collaboratively with our teams. This is not just a business to us; we value our customer relationships, many of which span decades, and we want to continue to nurture and build long-lasting partnerships.
If you have any questions or feedback, please contact our customer service team at firstname.lastname@example.org
Tucows Engineering Team