Domain and DNS Management
Incident Report for Enom
Postmortem

Incident Report

Incident Date: January 15, 2022
Incident Number: PR-2762

Major Incident Summary:

On January 15, 2022, at 9:00 AM ET, Tucows’ engineering team began planned maintenance work to migrate the Enom platform to a new cloud infrastructure. Due to the complexity of the cutover, the team encountered many issues that caused repeated delays. The maintenance window was extended multiple times to address problems with data replication, network routing, and DNS resolution, which impacted website accessibility and email delivery.

Incident events breakdown and resolution steps:

On January 15th, 2022, at 11:33 PM ET, the database replication was completed, and the team continued with the rest of the migration steps, which extended the maintenance window by six (6) hours beyond the scheduled time. Unfortunately, the maintenance exceeded this extended period, and the window was extended several more times due to issues with network routing, port exhaustion, and DNS resolution.

On January 17th, 2022, at 2:00 AM ET, the maintenance was completed; however, a number of issues remained unresolved and new ones were identified. The engineering team continued to investigate the issues as they were being identified via internal and external reports.

On January 17th, 2022, at 10:30 AM ET, customers’ ability to make DNS updates was disabled to avoid interference with our efforts to restore our DNS service. At this time, it was also identified that resellers could not search for or view their list of domains under management.

On January 17th, 2022, at 1:00 PM ET, an interim solution was implemented to address the DNS resolution issues that had impacted customer websites and email service delivery. At this time, we also restored customers’ ability to view existing domains. Later in the day, at 11:35 PM ET, the team restored customers’ ability to make DNS updates.

On January 17th, 2022, at 5:00 PM ET, payment processing issues were identified with PayPal and credit cards. The credit card processing issue was resolved at 9:00 PM ET. The PayPal payment feature was disabled on January 17th at 8:45 PM ET, and on January 21, 2022, at 1:30 PM ET, we released a solution to re-enable it.

On January 18th, 2022, at 5:55 AM ET, we identified that Enom Control Panel login attempts were failing due to missing configuration on one of the web servers. This issue was resolved at 8:30 AM ET.

On January 18th, 2022, at 10:00 AM ET, the engineering team identified issues with system notification emails, including password reset emails and Registrant Verification (RAA) emails. These emails were delayed because routing changes and new IP assignments caused recipient mail servers to defer them. To address this, the engineering team worked with global providers to add the new IPs to their allow lists at 12:00 PM ET.

On January 18th, 2022, at 3:30 PM ET, all Enom services were validated and the major incident was deemed resolved.

Learnings and process changes to prevent and mitigate future issues:

  1. Simplifying our approach to complex maintenance work

In an effort to make the data center migration more seamless from our customers’ perspective, we opted to complete the entire migration in a single maintenance period. At the time, we felt this was the right decision. Moving forward, we plan on taking a step-by-step approach to large-scale migrations, staggering the work into shorter maintenance sessions over a longer period of time. 

We also introduced a new DNS provisioning system as part of this migration, rather than simply mirroring the existing setup on the new infrastructure. This increased the complexity of the migration. Going forward, we will reduce the number of moving parts by handling maintenance work separately from the introduction of new system components.

  2. Enhancing our monitoring practices

We monitor all aspects of our systems and operations, but during this incident, we identified some gaps. We are addressing these immediately, and the end result will be better high-level visibility that will allow us to identify and respond to complex scenarios faster.

  3. More thorough planning and better migration readiness

Pre-migration, we conducted testing and identified possible failure scenarios. During the migration, the major failure we encountered, which resulted in missing DNS records, was one that we had not foreseen. In the future, we will conduct more robust crisis planning. This will involve spending more time identifying possible failure scenarios to develop a more nuanced migration roadmap that will not only help us avoid issues but also respond faster to any that do arise. This will be supported by a more thorough peer-review process.
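
One way to catch missing DNS records of the kind described above is to compare a pre-migration snapshot of each zone against the records served after the cutover. The sketch below is illustrative only and does not represent the tooling used in this migration; the `Record` shape, the `diff_records` helper, and the sample data (a reserved example domain and a documentation IP range) are all hypothetical.

```python
from typing import NamedTuple


class Record(NamedTuple):
    """A simplified DNS record (hypothetical shape, for illustration only)."""
    name: str
    rtype: str
    value: str


def diff_records(before: set[Record], after: set[Record]) -> dict[str, list[Record]]:
    """Compare a pre-migration snapshot with the post-migration records."""
    return {
        # Records that existed before the migration but are gone afterwards.
        "missing": sorted(before - after),
        # Records that appeared after the migration but were not in the snapshot.
        "unexpected": sorted(after - before),
    }


# Hypothetical sample data: the MX record is lost during the migration.
before = {
    Record("example.com", "A", "192.0.2.10"),
    Record("example.com", "MX", "10 mail.example.com"),
}
after = {
    Record("example.com", "A", "192.0.2.10"),
}

report = diff_records(before, after)
for record in report["missing"]:
    print(f"missing after migration: {record.name} {record.rtype} {record.value}")
```

A check like this only reports differences; deciding whether a difference is an error (for example, an intentional IP change) would still need human review.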

  4. Extending our communications

We know that many of our customers felt that our communications during this incident were insufficient. We will gather additional customer feedback and are committed to using alternative channels, outside of email, to communicate with customers. We will also validate customer contact information within the platform to ensure communications reach the intended customers.

In summary

This incident report does not conclude our investigations and planning to prevent future system outages and downtime. We let you down, and ourselves in the process. We sincerely apologize to all those who were impacted. We will do better.

We would also like to acknowledge our community who, despite the downtime and increasing stress levels, continued to treat us with respect and work collaboratively with our teams. This is not just a business to us; we value our customer relationships, many of which span decades, and want to continue to nurture and build long-lasting partnerships.

If you have any questions or feedback, please contact our customer service team at help@enom.com 

Thank you,

Tucows Engineering Team

Posted Jan 22, 2022 - 12:31 PST

Resolved
System functionality has been fully restored and we are marking this outage as resolved. In the coming days, we will release an Incident Report with the findings of our internal investigation. We will share a link to this report on this status page.

The PayPal payment feature remains temporarily disabled and is now being managed as a separate issue. We will reach out to the small subset of impacted resellers once the feature is re-enabled.

Here is a summary of the issues that our engineering team has resolved:

* Websites not resolving due to missing or incorrect DNS records
* Email service interruptions caused by missing or incorrect MX records
* Inability to view, access and search on domains under management via Enom Control Panel
* Credit card payment processing issues
* Delays and failures in sending System Notification emails (this includes password reset emails and Registrant Verification/RAA emails)
* Enom Control Panel log-in issue


If you continue to have any questions, please contact support via phone, chat, or email at https://help.enom.com/
Posted Jan 18, 2022 - 12:30 PST
Update
Since our last update, the following issues have been resolved:

* Delays and failures in sending System Notification emails (this includes password reset emails and Registrant Verification/RAA emails)
* Enom Control Panel log-in issue

PayPal payment feature remains temporarily disabled, as our engineering team continues to investigate.

The next update will be provided no later than 4 PM EST (21:00 UTC).
Posted Jan 18, 2022 - 08:38 PST
Update
As of January 17, 2022, at 11:35 PM EST, DNS updates (including those for newly provisioned domains) are successfully processing and all websites are resolving. The Enom Control Panel log-in issue has been resolved and is being actively monitored.

Our engineers are currently addressing the following active issue, which impacts a small subset of customers:

* Delays and failures in sending System Notification emails (this includes password reset emails and Registrant Verification/RAA emails)

The PayPal payment feature also remains temporarily disabled. Our engineering team continues to implement a system-wide solution to resolve all issues.

The next update will be provided no later than 12 PM EST (17:00 UTC).
Posted Jan 18, 2022 - 06:57 PST
Update
Over the past 24 hours, our engineering team has resolved the following issues:

* Websites not resolving due to missing or incorrect DNS records
* Email service interruptions caused by missing or incorrect MX records
* Inability to view, access and search on domains under management via Enom Control Panel
* Credit card payment processing issues

Our engineering team is still working to re-enable the ability to make DNS updates via the API and Control Panel. Once re-enabled, new domain registrations that use our DNS will also see their DNS propagate and resolve without delay. We have also temporarily disabled the PayPal payment feature for resellers as we continue to investigate this issue.

The platform is stable, and we will continue to monitor its performance overnight. The next update will be provided on Tuesday, January 18, 2022, by 10 AM EST (15:00 UTC).
Posted Jan 17, 2022 - 20:05 PST
Update
Since engineering deployed a solution this afternoon, we have not seen any further DNS resolution issues. We have also resolved the credit card payment issue that impacted a small subset of resellers.

Our engineering team is currently working to re-enable the ability to make DNS updates via the API and Control Panel. Once re-enabled, new domain registrations that use our DNS will also see their DNS propagate and resolve without delay.

At this time, a small subset of resellers are still having issues funding their accounts using PayPal, and we are temporarily disabling this feature as we investigate further.

The next update will be provided no later than 12:00 AM EST (05:00 UTC).
Posted Jan 17, 2022 - 17:48 PST
Update
With all websites resolving, Enom’s engineering team is now addressing a handful of isolated issues impacting a small subset of resellers. These include:

* Select resellers are not able to refill their accounts using PayPal
* Enom is experiencing isolated credit card processing issues

Our engineering team continues to implement a system-wide solution to resolve all issues. At this time, the ability to update DNS records via the Enom Control Panel and API will remain disabled. Any DNS update requests have been backlogged and will be provisioned as soon as we’ve implemented our system-wide solution. For new domain registrations that use our DNS, there will be a delay in DNS provisioning until the “system-wide” solution is in place.

The next update will be provided no later than 9:00 PM EST (02:00 UTC).
Posted Jan 17, 2022 - 13:59 PST
Update
As of 1:45 PM EST (18:45 UTC), all websites are resolving. Our engineers continue to work on a system-wide solution, and once it is in place, updates to DNS records will once again process normally via the Enom Control Panel and API.

Any DNS update requests have been backlogged and will be provisioned as soon as we’ve implemented our system-wide solution.

The next update will be provided no later than 5:00 PM EST (22:00 UTC).
Posted Jan 17, 2022 - 10:45 PST
Update
As of 1 PM EST (18:00 UTC), our engineers have introduced an interim solution and continue to work on a system-wide solution, both of which will resolve the DNS resolution issue that certain end customers have been experiencing.

No action is required from customers. As of right now:

* Impacted customers should see their websites resolve imminently
* Any email service interruptions related to missing MX records should be resolved
* Resellers can now successfully generate the list of domains in their account using the Enom Control Panel

If you continue to experience resolution issues, we recommend clearing your browser history/cache.
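
To tell a stale browser cache apart from a hostname that genuinely does not resolve, a lookup can be run outside the browser. The following is a minimal sketch using Python's standard `socket.getaddrinfo`; the `resolves` helper and its injectable `lookup` parameter are hypothetical names introduced here for illustration, and the result reflects the local system resolver, which may keep its own cache.

```python
import socket
from typing import Callable


def resolves(hostname: str, lookup: Callable = socket.getaddrinfo) -> bool:
    """Return True if the hostname resolves to at least one address.

    `lookup` defaults to the system resolver and can be swapped out
    (for example in tests) with any callable taking (host, port).
    """
    try:
        return len(lookup(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when the name cannot be resolved.
        return False
```

Calling `resolves("example.com")`, with your own domain in place of the placeholder, returns `True` when the system resolver finds at least one address for it.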

At this time, the ability to update DNS records via the Enom Control Panel and API will remain disabled. Any DNS update requests have been backlogged and will be provisioned as soon as we’ve implemented our system-wide solution.

New domain registrations and renewals continue to process normally.

We will continue to update as these issues are resolved. The next update will be provided no later than 2:00 PM EST (19:00 UTC).
Posted Jan 17, 2022 - 10:02 PST
Update
Our engineering team continues to work to restore our systems. In the past hour we have:

* Begun work to implement our system-wide solution for impacted customers
* In parallel, developed a short-term fix for the DNS resolution issues that certain end customers have been experiencing

We will continue to update as these issues are resolved. The next update will be provided no later than 1:00 PM EST (18:00 UTC).
Posted Jan 17, 2022 - 09:00 PST
Update
Our engineering team continues to work to restore our systems. At present, we are seeing issues with our DNS and Reseller Control Panel. As of 10:50 AM EST (15:50 UTC) we can report the following:

Impacted end customers (registrants) are experiencing service interruptions:
* Some websites are not resolving due to missing or incorrect DNS records.
* Some customers may experience email service interruptions due to missing DNS MX records.


Please note: We’ve disabled DNS updates via the Enom Control Panel and API for all domains to ensure we don’t interfere with our ongoing efforts to restore DNS records for domains impacted by the DNS issue. New domain registrations and renewals are working as intended.

Resellers are experiencing the following service issues:
* Resellers will be unable to search for or view the list of domains in their account using the Enom Control Panel.

We will provide regular updates as we continue to resolve these issues. The next update will be provided by 12:00 PM EST (17:00 UTC).
Posted Jan 17, 2022 - 07:50 PST
Investigating
We are investigating reports of DNS resolution and management issues, as well as domain registration difficulties, following the maintenance. Our engineering team continues to work on these customer-facing issues and to verify that the affected services are operating correctly. We will provide updates as they become available.
We thank you again for your continued patience.
Posted Jan 17, 2022 - 05:01 PST
This incident affected: API and DNS Service.