Delays in email processing
Incident Report for Deskpro
Postmortem

Deskpro Cloud has two methods for processing messages internally: a primary “fast” mechanism that 99% of emails use, and a slower secondary mechanism that is used when the primary won’t work. The two mechanisms are nearly identical, but they are tuned for different use cases and have different limits. For example, very large messages or messages that require extra processing tend to go through the secondary mechanism.

On 2023-04-12, a temporary network problem caused failures in the primary mechanism, which diverted many more messages to the secondary mechanism. As the secondary mechanism took on more messages, processing times increased, and our system began to treat slow messages as failures caused by timeouts. When a message times out, our system automatically re-queues it to run through the secondary mechanism. This sent even more messages through the secondary mechanism and caused the backlog to grow further.
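The feedback loop described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions, not a model of Deskpro's actual system: the function name `simulate_backlog`, its parameters, and all the numbers are hypothetical, and the queue is reduced to a single counter that is updated once per tick.

```python
def simulate_backlog(arrivals_per_tick, capacity_per_tick, timeout_ticks, ticks):
    """Return the backlog size after each tick for a single work queue.

    Hypothetical model: work that is expected to wait longer than
    `timeout_ticks` is presumed failed and re-queued, even though the
    original copy is still sitting in the queue.
    """
    backlog = 0
    history = []
    for _ in range(ticks):
        backlog += arrivals_per_tick
        # Estimated wait (in ticks) for work at the tail of the queue.
        wait = backlog / capacity_per_tick
        if wait > timeout_ticks:
            # Everything beyond what can be processed before the timeout
            # is treated as timed out and added to the queue again.
            overdue = backlog - capacity_per_tick * timeout_ticks
            backlog += overdue
        # Process up to one tick's worth of work.
        backlog = max(0, backlog - capacity_per_tick)
        history.append(backlog)
    return history


# Normal load: the queue drains every tick.
healthy = simulate_backlog(5, 10, 3, 20)

# Overload with timeout-driven re-queues versus the same overload without them.
with_requeue = simulate_backlog(12, 10, 3, 20)
without_requeue = simulate_backlog(12, 10, float("inf"), 20)

# Same overload, but with double the processing capacity.
scaled_up = simulate_backlog(12, 20, 3, 20)
```

In this model the overloaded queue grows by a constant amount per tick until the estimated wait crosses the timeout. From that point the re-queued copies add to the backlog and growth accelerates. Raising `capacity_per_tick` above the arrival rate keeps the backlog at zero.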

Once our team identified the issue, we increased throughput by scaling up the number of task runners used to process messages. We also tuned our runners to prevent too many messages from being re-queued on the secondary mechanism unnecessarily.
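The report does not say how the runners were tuned. One common way to limit unnecessary re-queues is to cap both the attempts per message and the share of the queue that retries may occupy. The sketch below shows that idea only; `RequeueGuard`, its fields, and its default values are hypothetical and are not Deskpro's configuration.

```python
from dataclasses import dataclass


@dataclass
class RequeueGuard:
    """Hypothetical policy that decides whether a timed-out message is re-queued."""

    max_attempts: int = 3          # assumed per-message attempt limit
    max_retry_share: float = 0.25  # assumed cap on the retry fraction of the queue

    def should_requeue(self, attempts, queued_total, queued_retries):
        # Stop retrying a message that has already used its attempts.
        if attempts >= self.max_attempts:
            return False
        # An empty queue cannot be overloaded by one more retry.
        if queued_total == 0:
            return True
        # Refuse new retries once they make up too much of the queue.
        return queued_retries / queued_total < self.max_retry_share


guard = RequeueGuard()
allowed = guard.should_requeue(attempts=0, queued_total=100, queued_retries=10)
blocked_by_share = guard.should_requeue(attempts=0, queued_total=100, queued_retries=30)
blocked_by_attempts = guard.should_requeue(attempts=3, queued_total=100, queued_retries=0)
```

A guard like this trades some delay for individual messages against protection of the queue as a whole: a message that is refused would have to be held and retried later rather than added to an already overloaded queue.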

Going forward, we are working to make this part of our infrastructure more resilient. A faster and more robust mechanism has been under development for some time, and we plan to start rolling it out gradually in the coming weeks.

Posted Apr 13, 2023 - 12:54 BST

Resolved
This incident has been resolved.
Posted Apr 13, 2023 - 12:52 BST
Monitoring
The backlog has been fully processed and new messages are now processing normally in real time.
Posted Apr 12, 2023 - 17:27 BST
Investigating
We are aware of some customers experiencing delays with email processing. We are investigating.
Posted Apr 12, 2023 - 16:00 BST
This incident affected: Incoming Email.