Deskpro Cloud has two internal mechanisms for processing messages: a primary “fast” mechanism that 99% of emails use, and a slower secondary mechanism that is used when the primary won’t work. The two are nearly identical, but they are tuned for different use cases and have different limits. For example, very large messages or messages that require extra processing tend to go through the secondary mechanism.
On 2023-04-13, a temporary network problem caused failures in the primary mechanism, so many more messages than usual flowed through the secondary mechanism instead. As the secondary mechanism took on more messages, processing times increased, and our system began treating slow messages as failures due to time-outs. When a message times out, our system automatically re-queues it to run through the secondary mechanism. This sent even more messages through the secondary mechanism, and the backlog kept growing.
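The feedback loop described above can be illustrated with a toy model. The sketch below is not Deskpro's implementation: the function name, the tick-based timing, and every number are hypothetical, chosen only to show how re-queuing on time-out can turn a slowly growing backlog into a rapidly growing one.

```python
def simulate_backlog(arrivals, capacity, timeout_ticks, ticks, requeue_on_timeout=True):
    """Toy model of a queue whose slow messages are re-queued on time-out.

    arrivals       messages entering the queue per tick (hypothetical)
    capacity       messages the runners can process per tick (hypothetical)
    timeout_ticks  expected wait, in ticks, beyond which messages count as timed out
    Returns the backlog size recorded at the end of each tick.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals                  # new messages join the queue
        backlog -= min(backlog, capacity)    # runners drain what they can
        expected_wait = backlog / capacity   # rough wait for a newly queued message
        if requeue_on_timeout and expected_wait > timeout_ticks:
            # Assumption: this tick's arrivals are judged to have timed out
            # and are queued a second time, adding duplicate work.
            backlog += arrivals
        history.append(backlog)
    return history


if __name__ == "__main__":
    with_requeue = simulate_backlog(120, 100, timeout_ticks=2, ticks=20)
    without_requeue = simulate_backlog(120, 100, timeout_ticks=2, ticks=20,
                                       requeue_on_timeout=False)
    print("backlog with re-queue:   ", with_requeue[-1])
    print("backlog without re-queue:", without_requeue[-1])
```

In this model, a queue that is only slightly over capacity grows steadily on its own, but once the expected wait crosses the time-out threshold the re-queued duplicates make it grow several times faster.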
Once our team identified the issue, we increased throughput by scaling up the number of task runners used to process messages. We also tuned our runners to prevent the scenario where too many messages get re-queued on the secondary mechanism unnecessarily.
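One common way to bound this kind of re-queue amplification is to cap how often a message may be re-queued and to stop re-queuing while the backlog is already large. The sketch below shows that general technique only; it is not a description of the tuning actually applied, and `Message`, `should_requeue`, `max_requeues`, and `backlog_limit` are all hypothetical names and values.

```python
from dataclasses import dataclass


@dataclass
class Message:
    id: str
    requeue_count: int = 0  # how many times this message has already been re-queued


def should_requeue(message, backlog_size, max_requeues=2, backlog_limit=1000):
    """Decide whether a timed-out message may be re-queued (illustrative policy)."""
    if message.requeue_count >= max_requeues:
        # Per-message cap: stop retrying and surface the failure instead.
        return False
    if backlog_size >= backlog_limit:
        # Queue is already overloaded; re-queuing would only add to the backlog.
        return False
    return True
```

A message that fails either check would need to be handled another way, for example by leaving it in place to finish or by flagging it for later inspection.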
Going forward, we are working to make this part of our infrastructure more resilient. A faster and more robust mechanism has been under development for some time, and we plan to start rolling it out gradually in the coming weeks.