At 10:15, we observed a significant increase in traffic to the rewards page, primarily triggered by a newsletter distribution. This surge exposed performance limitations in the new rewards platform architecture.
⸻
System Context
The previous rewards platform relied on a single component handling both frontend and backend logic.
With the new platform:
• The original core component remained largely unchanged
• A new secondary component was introduced
• Communication between components is handled via a Google Cloud PSC connection
• The new component is responsible for integrating with external systems, primarily GLOX
⸻
What Happened
Under increased load:
• Requests to GLOX began timing out after 15 seconds (the application timeout threshold)
• Initial investigation considered a network issue, but no supporting evidence was found
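The 15-second cutoff above is an application-level timeout, not a network one. A minimal sketch of how such a cutoff might be enforced around an outbound call, assuming a Python asyncio service (the platform's actual runtime is not stated, and `call_glox` is a hypothetical stand-in for the real GLOX integration):

```python
import asyncio

APP_TIMEOUT_S = 15  # the application timeout threshold from the incident

async def call_glox(payload: str, delay: float) -> str:
    # Hypothetical stand-in for the real GLOX call; `delay` simulates
    # a slow upstream under load.
    await asyncio.sleep(delay)
    return "ok"

async def handle_request(payload: str, timeout: float = APP_TIMEOUT_S) -> str:
    try:
        # The upstream is made twice as slow as the timeout so the demo
        # always hits the cutoff, mirroring what was seen under load.
        return await asyncio.wait_for(call_glox(payload, delay=timeout * 2),
                                      timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout"

# A short timeout here only so the demo runs quickly; production uses 15 s.
print(asyncio.run(handle_request("reward-lookup", timeout=0.05)))  # timeout
```

The key property is that the timeout fires in the caller, so a slow dependency surfaces as an application error even when the network itself is healthy, which matches why the network investigation found no supporting evidence.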
⸻
Root Cause Analysis (Current Understanding)
The secondary component was designed for high performance using asynchronous processing. For each request to GLOX, it performs several operations:
1. Validation of key material received via PSC (to verify request origin)
2. 3scale credential validation and refresh
3. Telemetry/metrics collection
All of these operations were implemented asynchronously.
Under high load, we believe:
• The accumulation of asynchronous tasks led to contention within the event loop
• This resulted in a degradation of throughput, effectively resembling an I/O loop deadlock scenario
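The contention mechanism described above can be demonstrated in miniature. This sketch (Python asyncio assumed; the real component's runtime and task sizes are not stated) floods the event loop with tasks that each do a small amount of blocking work, and measures how long an unrelated request is stalled:

```python
import asyncio
import time

async def busy_task() -> None:
    # Each "async" step does a small piece of blocking work (think key
    # validation or credential checks) that never yields to the event loop
    # while it runs.
    time.sleep(0.001)

async def main() -> float:
    # Flood the loop with tasks, as a traffic surge would.
    tasks = [asyncio.create_task(busy_task()) for _ in range(500)]
    start = time.monotonic()
    await asyncio.sleep(0)  # an unrelated request trying to make progress
    stalled_for = time.monotonic() - start
    await asyncio.gather(*tasks)
    return stalled_for

stall = asyncio.run(main())
print(f"single await stalled for ~{stall:.2f}s")  # at least ~0.5 s
```

Because the loop is single-threaded, the queued tasks must all run before the waiting request resumes; with enough of them, every request's latency grows until timeouts fire, which looks externally like an I/O loop deadlock.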
⸻
Mitigation
We implemented the following changes:
• Reduced pressure on the event loop by limiting asynchronous task consumers
• Moved key refresh operations to dedicated threads
These changes have improved system stability.
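Both mitigations can be sketched together, again assuming a Python asyncio service. A semaphore caps how many consumers run concurrently on the event loop, and blocking key-refresh work is pushed onto a dedicated thread pool; `refresh_key`, `MAX_CONSUMERS`, and the pool sizing are illustrative, not taken from the actual codebase:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

MAX_CONSUMERS = 8  # illustrative cap; the real limit is not stated

def refresh_key(key_id: str) -> str:
    # Blocking key-refresh work, now isolated on a dedicated thread so it
    # no longer competes with request handling on the event loop.
    return f"{key_id}-refreshed"

async def main() -> list:
    loop = asyncio.get_running_loop()
    pool = ThreadPoolExecutor(max_workers=2, thread_name_prefix="key-refresh")
    slots = asyncio.Semaphore(MAX_CONSUMERS)  # bounds concurrent consumers

    async def handle(key_id: str) -> str:
        async with slots:  # wait for a slot instead of piling onto the loop
            return await loop.run_in_executor(pool, refresh_key, key_id)

    results = await asyncio.gather(*(handle(f"psc-key-{i}") for i in range(20)))
    pool.shutdown(wait=True)
    return results

results = asyncio.run(main())
print(results[0], len(results))  # psc-key-0-refreshed 20
```

The semaphore turns unbounded task accumulation into bounded waiting, and the executor keeps blocking work off the loop entirely, which together address both halves of the contention described under root cause.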
⸻
Why This Was Not Detected Earlier
During load testing:
• External systems (including GLOX) were mocked
• Testing focused on internal application performance only
As a result, the interaction between asynchronous processing and real external dependencies under load was not fully validated.
⸻
Current Status
• Overall system metrics have returned to normal levels
• However, a small number of requests are still timing out
This indicates:
• There may be additional contributing factors
• The incident likely resulted from a combination of architectural limitations and external dependencies
⸻
Next Steps
• Continue investigating residual timeouts
• Reproduce the issue in a test environment with the original codebase
• Perform end-to-end load testing that includes real external integrations
• Review the architecture for backpressure handling and async workload isolation
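One common form of the backpressure handling mentioned above is a bounded work queue that sheds load rather than accumulating tasks. A minimal sketch (Python asyncio assumed; `QUEUE_LIMIT` and the request names are illustrative):

```python
import asyncio

QUEUE_LIMIT = 100  # illustrative bound; real sizing would come from load tests

async def main() -> str:
    # A bounded queue rejects new work when full, instead of letting
    # tasks accumulate without limit on the event loop.
    work: asyncio.Queue = asyncio.Queue(maxsize=QUEUE_LIMIT)

    def submit(request_id: str) -> str:
        try:
            work.put_nowait(request_id)
            return "accepted"
        except asyncio.QueueFull:
            return "rejected"  # fail fast rather than stall every request

    results = [submit(f"req-{i}") for i in range(QUEUE_LIMIT + 1)]
    return f"{results.count('accepted')} accepted, {results.count('rejected')} rejected"

print(asyncio.run(main()))  # 100 accepted, 1 rejected
```

Rejecting the overflow request quickly is usually preferable to the failure mode seen in this incident, where excess work degraded latency for all requests at once.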
Posted Mar 26, 2026 - 19:37 CET
Investigating
We are seeing a high number of timeouts causing long loading times.
Posted Mar 26, 2026 - 17:55 CET
This incident affected: Sunrise Moments (Backend, Infrastructure/WAF).