Wed, Nov 13, 2024 8:00pm - Tue, Nov 19, 4:00pm EST
Description
On Nov 13 at 8:00am EST, we had a hardware failure that caused IMAP and Webmail services to fail on one of our mailstores.
The failure, combined with the normal inbound flow of requests, caused an unexpected increase in the number of in-flight requests, which degraded performance on the surviving members of the cluster. The net result was a degraded service level that, contrary to expectations, did not recover on its own after the failover.
As a result, a subset of users were unable to access their mailboxes during the ongoing maintenance.
The increase in in-flight requests as peak hours approached further compounded the issue and affected the load on other platform components. This disrupted normal access to IMAP and Webmail services.
Root Causes Found
The root cause was a hardware failure, compounded by increased workload demands on the surviving network elements.
Solution
To recover from the peak backlog, we had to throttle connections and re-enable service gradually so that the cluster could stabilize.
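For readers interested in the general technique, throttling connections and then enabling service in stages can be sketched as below. This is an illustration only, not our production code: the class name RampedConnectionLimiter, its method names, and the limit and step values are all hypothetical.

```python
import threading


class RampedConnectionLimiter:
    """Caps concurrent connections and raises the cap in small steps.

    Hypothetical sketch: names and numbers are illustrative assumptions.
    """

    def __init__(self, start_limit: int, max_limit: int, step: int) -> None:
        if not (0 < start_limit <= max_limit) or step <= 0:
            raise ValueError("expected 0 < start_limit <= max_limit and step > 0")
        self._limit = start_limit      # current cap on concurrent connections
        self._max_limit = max_limit    # cap once service is fully restored
        self._step = step              # how much each ramp-up raises the cap
        self._active = 0               # connections currently admitted
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Admit a connection if under the current cap; otherwise reject it."""
        with self._lock:
            if self._active >= self._limit:
                return False
            self._active += 1
            return True

    def release(self) -> None:
        """Record that an admitted connection has closed."""
        with self._lock:
            if self._active > 0:
                self._active -= 1

    def ramp_up(self) -> int:
        """Raise the cap by one step, never beyond max_limit."""
        with self._lock:
            self._limit = min(self._limit + self._step, self._max_limit)
            return self._limit
```

In a sketch like this, an operator or a health check would call ramp_up() only while the cluster stays healthy, so rejected clients retry later instead of adding to the backlog all at once.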
Post Mortem
To help prevent a recurrence of issues like this, we have modified the way in which users are assigned to different platform components. We have also begun work to transfer user mailboxes to other elements in order to better balance resource usage across the board.