Equiem One and Multiple Services Briefly Down

Incident Report for Equiem

Postmortem

Summary

On August 29, 2025, a number of Equiem services experienced intermittent outages over approximately 30 minutes, impacting several of our sites. The incident was caused by a database connection issue within our Segmentation service. We traced the root cause to a recently deployed code change in which a data-loading component was instantiated multiple times per request instead of once, which prevented it from batching database queries. This caused a surge in database connections that exceeded our system's capacity, leading to temporary service disruption. The system self-healed, and we have since prepared a permanent fix to prevent a recurrence.

Impact

  • Affected feature: The Segmentation service, which is used by most of our applications.
  • Scope: Multiple Equiem services and customer sites experienced brief periods of unavailability.

Root Cause

The outage was caused by a recently deployed code change in our Segmentation service. The change instantiated multiple DataLoaders, rather than reusing a single instance per request. The DataLoader is designed to batch multiple database requests into a single, more efficient query. By instantiating it multiple times per request, we created significantly more database connections, which quickly exhausted our database's connection limit. When the limit was reached, new connection requests were denied and all service containers terminated, resulting in the site outages. Our load balancer replaced the failed services and the system self-healed.
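For readers curious about the mechanics, the sketch below illustrates the batching behaviour described above. It does not use Equiem's actual code or the DataLoader library itself; `MiniLoader` and `fetchUsers` are hypothetical stand-ins, and batching is triggered by an explicit `flush()` rather than per event-loop tick, purely for clarity. The point it demonstrates: one shared loader turns several lookups into a single query, while a fresh loader per lookup issues one query (and one connection) per key.

```typescript
type BatchFn = (keys: string[]) => string[];

// A simplified batching loader in the spirit of the DataLoader pattern.
class MiniLoader {
  private pending: { key: string; resolve: (v: string) => void }[] = [];
  constructor(private batchFn: BatchFn, private stats: { queries: number }) {}

  // Queue a key; it is not fetched until flush() runs.
  load(key: string): Promise<string> {
    return new Promise((resolve) => this.pending.push({ key, resolve }));
  }

  // Resolve all queued keys with ONE backend call.
  flush(): void {
    if (this.pending.length === 0) return;
    this.stats.queries += 1; // one "database query" (and connection) per flush
    const results = this.batchFn(this.pending.map((p) => p.key));
    this.pending.forEach((p, i) => p.resolve(results[i]));
    this.pending = [];
  }
}

const fetchUsers: BatchFn = (keys) => keys.map((k) => `user:${k}`);

// Intended usage: one loader shared across a request => one batched query.
const sharedStats = { queries: 0 };
const shared = new MiniLoader(fetchUsers, sharedStats);
["1", "2", "3"].forEach((id) => shared.load(id));
shared.flush();

// The buggy pattern from this incident: a fresh loader per lookup => no
// batching, so every key costs its own query and database connection.
const perCallStats = { queries: 0 };
["1", "2", "3"].forEach((id) => {
  const loader = new MiniLoader(fetchUsers, perCallStats);
  loader.load(id);
  loader.flush();
});

console.log(sharedStats.queries, perCallStats.queries); // → 1 3
```

With only three keys the difference is small, but multiplied across every field of every request under load, the per-lookup pattern multiplies connection demand until the database's connection limit is hit.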

What We Did to Fix It

  • Temporary resolution: The system naturally recovered as the load balancer restarted failed service instances and the database connections were released.
  • Permanent fix: We prepared a code change to ensure the DataLoader is correctly instantiated once per request, restoring its intended batching functionality. This fix is now deployed.

What We’re Doing Next

  • Reviewing on-call processes: While we had monitoring in place that alerted on the issue in real time, we did not initially recognise the severity of the issue and were slower to respond to customer feedback than we would like. We are reviewing our on-call procedures to improve the speed of our response to critical incidents.

Closing Note

We sincerely apologise for any disruption this caused. If you have any questions or concerns, please don't hesitate to reach out to our support team at support@getequiem.com.

Posted Aug 29, 2025 - 07:17 UTC

Resolved

The fix has been deployed and we've run load tests to verify the results. This incident is now resolved.
Posted Aug 29, 2025 - 04:06 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Aug 29, 2025 - 03:42 UTC

Update

All systems are currently operational as the system self-healed. A fix is being prepared to avoid this issue happening again.
Posted Aug 29, 2025 - 02:36 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Aug 29, 2025 - 02:35 UTC

Investigating

At 10:56am AEST multiple Equiem services were unavailable for about 30 minutes. We are aware of an issue in a core service and are preparing a fix for it now. Until the fix is deployed, there is a chance of further issues if the system comes under load.
Posted Aug 29, 2025 - 02:34 UTC
This incident affected: Mobile (Android, iOS), Supporting Applications (Admin Panel, Marketplace, Equiem One), Web, and API.