Long Loading Times

Incident Report for Wilmaa Tribe Status

Resolved

This incident has been resolved.
Posted Apr 07, 2026 - 09:37 CEST

Monitoring

Incident Summary

At 10:15, we observed a significant increase in traffic to the rewards page, primarily triggered by a newsletter distribution. This surge exposed performance limitations in the new rewards platform architecture.



System Context

The previous rewards platform relied on a single component handling both frontend and backend logic.

With the new platform:
• The original core component remained largely unchanged
• A new secondary component was introduced
• Communication between components is handled via a Google Cloud PSC connection
• The new component is responsible for integrating with external systems, primarily GLOX



What Happened

Under increased load:
• Requests to GLOX began timing out after 15 seconds (the application-level timeout threshold)
• Initial investigation considered a network issue, but no supporting evidence was found
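The timeout behaviour described above can be sketched as follows. Function names are hypothetical; only the 15-second application threshold comes from the incident, and the stand-in call simulates a fast GLOX response.

```python
import asyncio

GLOX_TIMEOUT_S = 15  # the application timeout threshold cited above

async def _send_to_glox(payload: dict) -> dict:
    # Stand-in for the real external call; under load the real call
    # exceeded the threshold and raised TimeoutError instead.
    await asyncio.sleep(0.1)
    return {"status": "ok"}

async def call_glox(payload: dict) -> dict:
    """Illustrative GLOX call wrapped in the application-level timeout."""
    return await asyncio.wait_for(_send_to_glox(payload), timeout=GLOX_TIMEOUT_S)
```

Once the external call routinely exceeds the threshold, every affected request surfaces as a timeout to the caller, which is what made this symptom visible before the underlying cause was.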



Root Cause Analysis (Current Understanding)

The secondary component was designed for high performance using asynchronous processing. For each request to GLOX, it performs several operations:
1. Validation of key material received via PSC (to verify request origin)
2. 3scale credential validation and refresh
3. Telemetry/metrics collection

All of these operations were implemented asynchronously.

Under high load, we believe:
• Asynchronous tasks accumulated faster than they could be processed, causing contention within the event loop
• Throughput degraded to the point of effectively resembling an I/O event-loop deadlock
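The starvation effect we believe occurred can be demonstrated in a minimal sketch (all names and timings are illustrative, not taken from the production code). Synchronous work inside a coroutine never yields, so a burst of such tasks delays every other task on the single-threaded event loop:

```python
import asyncio
import time

async def validate_key() -> None:
    # Stand-in for key/credential validation: synchronous work with no
    # await point, so it occupies the event loop for its full duration.
    time.sleep(0.01)

async def heartbeat(gaps: list) -> None:
    # Records how far apart a task that wants to run every 10 ms actually runs.
    loop = asyncio.get_running_loop()
    last = loop.time()
    for _ in range(5):
        await asyncio.sleep(0.01)
        now = loop.time()
        gaps.append(now - last)
        last = now

async def main() -> float:
    gaps: list = []
    hb = asyncio.create_task(heartbeat(gaps))
    # A burst of requests, each performing blocking validation on the loop:
    await asyncio.gather(*(validate_key() for _ in range(50)))
    await hb
    return max(gaps)  # far above the intended 10 ms interval under the burst
```

The heartbeat's worst observed gap grows with the size of the burst, which matches the observed pattern: the system does not crash, but throughput collapses as if the I/O loop were deadlocked.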



Mitigation

We implemented the following changes:
• Reduced pressure on the event loop by limiting asynchronous task consumers
• Moved key refresh operations to dedicated threads

These changes have improved system stability.
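A minimal sketch of the two mitigations, assuming hypothetical names and an illustrative concurrency cap (the production values differ):

```python
import asyncio

def refresh_key_blocking() -> bytes:
    # Stand-in for the key refresh work now performed off the event loop.
    return b"fresh-key"

async def serve(requests: list) -> list:
    # Cap concurrent consumers: bounded consumption instead of
    # unbounded task growth on the event loop.
    limit = asyncio.Semaphore(8)

    async def handle(payload: dict) -> dict:
        async with limit:
            # asyncio.to_thread (Python 3.9+) runs the blocking refresh
            # in a worker thread, keeping the event loop responsive.
            key = await asyncio.to_thread(refresh_key_blocking)
            return {"key": key, **payload}

    return await asyncio.gather(*(handle(r) for r in requests))
```

The semaphore bounds how much work is in flight at once, and the thread offload removes the blocking slice from the loop entirely; together these address both contributing factors identified above.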



Why This Was Not Detected Earlier

During load testing:
• External systems (including GLOX) were mocked
• Testing focused on internal application performance only

As a result, the interaction between asynchronous processing and real external dependencies under load was not fully validated.
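The gap can be illustrated with a hedged sketch (mock names are hypothetical): a mock that responds instantly never keeps requests in flight, so asynchronous backlog behaviour cannot surface, whereas a latency-injecting mock exercises task accumulation and the timeout path.

```python
import asyncio
import random

async def mock_glox_instant(payload: dict) -> dict:
    # How the external dependency behaved in our load tests:
    # it responds immediately, so requests never stay in flight
    # long enough to build up an asynchronous backlog.
    return {"status": "ok"}

async def mock_glox_with_latency(payload: dict) -> dict:
    # A latency-injecting mock keeps many requests in flight at once,
    # exercising task accumulation and the timeout path under load.
    await asyncio.sleep(random.uniform(0.05, 0.2))
    return {"status": "ok"}
```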



Current Status
• Overall system metrics have returned to normal levels
• However, a small number of requests are still timing out

This indicates:
• There may be additional contributing factors
• The incident likely resulted from a combination of architectural limitations and external dependencies



Next Steps
• Continue investigating residual timeouts
• Reproduce the issue in a test environment using the original codebase
• Perform end-to-end load testing including real external integrations
• Review architecture for backpressure handling and async workload isolation
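One common backpressure pattern under review is a bounded queue between producers and consumers; the sketch below is illustrative only (names and the queue bound are not from the production system). When the queue is full, `put()` suspends, so upstream naturally slows down instead of piling tasks onto the event loop.

```python
import asyncio

async def producer(queue: asyncio.Queue, items: list) -> None:
    for item in items:
        # put() suspends once the queue reaches maxsize, applying
        # backpressure to the producer instead of growing unbounded.
        await queue.put(item)
    await queue.put(None)  # sentinel: no more items

async def consumer(queue: asyncio.Queue, out: list) -> None:
    while (item := await queue.get()) is not None:
        out.append(item)

async def run_pipeline(items: list) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)  # illustrative bound
    out: list = []
    await asyncio.gather(producer(queue, items), consumer(queue, out))
    return out
```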
Posted Mar 26, 2026 - 19:37 CET

Investigating

We are observing a large number of timeouts, causing long loading times.
Posted Mar 26, 2026 - 17:55 CET
This incident affected: Sunrise Moments (Backend, Infrastructure/WAF).