At 10:15, we observed a significant increase in traffic to the rewards page, primarily triggered by a newsletter distribution. This surge exposed performance limitations in the new rewards platform architecture.
⸻
System Context
The previous rewards platform relied on a single component handling both frontend and backend logic.
With the new platform:
• The original core component remained largely unchanged
• A new secondary component was introduced
• Communication between components is handled via a Google Cloud PSC connection
• The new component is responsible for integrating with external systems, primarily GLOX
⸻
What Happened
Under increased load:
• Requests to GLOX began timing out after 15 seconds (the application timeout threshold)
• Initial investigation considered a network issue, but no supporting evidence was found
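The 15-second cutoff above is an application-level timeout, not a network one. A minimal sketch of how such a cutoff might be enforced around an outbound call, assuming a Python asyncio service (the platform's actual runtime is not stated, and `call_glox` is a hypothetical stand-in for the real GLOX integration):

```python
import asyncio

APP_TIMEOUT_S = 15  # the application timeout threshold from the incident

async def call_glox(payload: str, delay: float) -> str:
    # Hypothetical stand-in for the real GLOX call; `delay` simulates
    # a slow upstream under load.
    await asyncio.sleep(delay)
    return "ok"

async def handle_request(payload: str, timeout: float = APP_TIMEOUT_S) -> str:
    try:
        # The upstream is made twice as slow as the timeout so the demo
        # always hits the cutoff, mirroring what was seen under load.
        return await asyncio.wait_for(call_glox(payload, delay=timeout * 2),
                                      timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout"

# A short timeout here only so the demo runs quickly; production uses 15 s.
print(asyncio.run(handle_request("reward-lookup", timeout=0.05)))  # timeout
```

The key property is that the timeout fires in the caller, so a slow dependency surfaces as an application error even when the network itself is healthy, which matches why the network investigation found no supporting evidence.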
⸻
Root Cause Analysis (Current Understanding)
The secondary component was designed for high performance using asynchronous processing. For each request to GLOX, it performs several operations:
1. Validation of key material received via PSC (to verify request origin)
2. 3scale credential validation and refresh
3. Telemetry/metrics collection
All of these operations were implemented asynchronously.
Under high load, we believe:
• The accumulation of asynchronous tasks led to contention within the event loop
• This resulted in a degradation of throughput, effectively resembling an I/O loop deadlock scenario
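The contention mechanism described above can be demonstrated in miniature. This sketch (Python asyncio assumed; the real component's runtime and task sizes are not stated) floods the event loop with tasks that each do a small amount of blocking work, and measures how long an unrelated request is stalled:

```python
import asyncio
import time

async def busy_task() -> None:
    # Each "async" step does a small piece of blocking work (think key
    # validation or credential checks) that never yields to the event loop
    # while it runs.
    time.sleep(0.001)

async def main() -> float:
    # Flood the loop with tasks, as a traffic surge would.
    tasks = [asyncio.create_task(busy_task()) for _ in range(500)]
    start = time.monotonic()
    await asyncio.sleep(0)  # an unrelated request trying to make progress
    stalled_for = time.monotonic() - start
    await asyncio.gather(*tasks)
    return stalled_for

stall = asyncio.run(main())
print(f"single await stalled for ~{stall:.2f}s")  # at least ~0.5 s
```

Because the loop is single-threaded, the queued tasks must all run before the waiting request resumes; with enough of them, every request's latency grows until timeouts fire, which looks externally like an I/O loop deadlock.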
⸻
Mitigation
We implemented the following changes:
• Reduced pressure on the event loop by limiting asynchronous task consumers
• Moved key refresh operations to dedicated threads
These changes have improved system stability.
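Both mitigations can be sketched together, again assuming a Python asyncio service. A semaphore caps how many consumers run concurrently on the event loop, and blocking key-refresh work is pushed onto a dedicated thread pool; `refresh_key`, `MAX_CONSUMERS`, and the pool sizing are illustrative, not taken from the actual codebase:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_CONSUMERS = 8  # illustrative cap; the real limit is not stated

def refresh_key(key_id: str) -> str:
    # Blocking key-refresh work, now isolated on a dedicated thread so it
    # no longer competes with request handling on the event loop.
    return f"{key_id}-refreshed"

async def main() -> list:
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="key-refresh")
    slots = asyncio.Semaphore(MAX_CONSUMERS)  # bounds concurrent consumers

    async def handle(key_id: str) -> str:
        async with slots:  # wait for a slot instead of piling onto the loop
            return await loop.run_in_executor(pool, refresh_key, key_id)

    results = await asyncio.gather(*(handle(f"psc-key-{i}") for i in range(20)))
    pool.shutdown(wait=True)
    return results

results = asyncio.run(main())
print(results[0], len(results))  # psc-key-0-refreshed 20
```

The semaphore turns unbounded task accumulation into bounded waiting, and the executor keeps blocking work off the loop entirely, which together address both halves of the contention described under root cause.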
⸻
Why This Was Not Detected Earlier
During load testing:
• External systems (including GLOX) were mocked
• Testing focused on internal application performance only
As a result, the interaction between asynchronous processing and real external dependencies under load was not fully validated.
⸻
Current Status
• Overall system metrics have returned to normal levels
• However, a small number of requests are still timing out
This indicates:
• There may be additional contributing factors
• The incident likely resulted from a combination of architectural limitations and external dependencies
⸻
Next Steps
• Continue investigating residual timeouts
• Reproduce the issue in a test environment with the original codebase
• Perform end-to-end load testing that includes real external integrations
• Review the architecture for backpressure handling and async workload isolation
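One common form of the backpressure handling mentioned above is a bounded work queue that sheds load rather than accumulating tasks. A minimal sketch (Python asyncio assumed; `QUEUE_LIMIT` and the request names are illustrative):

```python
import asyncio

QUEUE_LIMIT = 100  # illustrative bound; real sizing would come from load tests

async def main() -> str:
    # A bounded queue rejects new work when full, instead of letting
    # tasks accumulate without limit on the event loop.
    work: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_LIMIT)

    def submit(request_id: str) -> str:
        try:
            work.put_nowait(request_id)
            return "accepted"
        except asyncio.QueueFull:
            return "rejected"  # fail fast rather than stall every request

    results = [submit(f"req-{i}") for i in range(QUEUE_LIMIT + 1)]
    return f"{results.count('accepted')} accepted, {results.count('rejected')} rejected"

print(asyncio.run(main()))  # 100 accepted, 1 rejected
```

Rejecting the overflow request quickly is usually preferable to the failure mode seen in this incident, where excess work degraded latency for all requests at once.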
Posted Mar 26, 2026 - 19:37 CET
Investigating
We are seeing a high number of timeouts causing long loading times.
Posted Mar 26, 2026 - 17:55 CET
This incident affected: Sunrise Moments (Backend, Infrastructure/WAF).