At around 20:00 UTC on June 23, an operational configuration change was deployed to the autoscaling configuration of Theta Lake's data processing pipeline. The change was intended for test and staging environments, but was incorrectly applied to production. The configuration change caused a key telemetry component that is used to drive dynamic scaling to fail. As a result, a lower static scaling level was applied. This lower scaling level is designed to keep data processing flowing, but isn't usually sufficient to keep up with the dynamic volume of data that we process.
Although the configuration change was applied globally, it only affected our Azure data centers, because AWS's auto scaling feature has more built in functionality that allows us to not rely on that component to drive scaling.
At 08:22 UTC on June 24, our L1 monitoring team reported that processing queues were higher than normal and the team engaged to triage. The issue was identified a few hours later, and at 12:00 UTC we started to roll out a fix.
We apologize for any inconvenience these data delays have caused. We are reviewing our change rollout procedures to prevent this from occurring in the future. We also apologize for the delay in reporting this issue on the status page. As the incident evolved, the escalation leads did not follow our published process for reporting issues on the status page. That process is also being reviewed and retraining will be done.