On February 24th, from 01:35 to 06:26 UTC and again from 21:10 to 22:25 UTC, PagerDuty experienced an incident in which a portion of email events was rejected. The largest impact occurred from 02:45 UTC to 05:15 UTC and from 20:20 UTC to 21:30 UTC.
The incident impacted only the US region and was caused by a storage issue affecting a subset of our mail servers. Email events routed to the unaffected mail servers were ingested normally, including emails from customers whose mail-sending servers were configured to retry failed deliveries.
As soon as the issue was identified, we resolved the storage issue, and email event processing was fully restored, concluding the incident. As an immediate action item, we reviewed and changed our mail server monitoring and configuration to prevent the issue from recurring. At this time, we have no further concerns about the availability and health of the mail servers.
As part of the incident investigation, we have identified the following action items to help prevent similar issues in the future: