On December 10, 2024, at approximately 07:25 UTC, TrafficPeak experienced a service disruption that affected our customers' ability to upload, retrieve, or view logs. At the same time, we received a report of service degradation and connectivity issues at our Chennai data center (IN-MAA1).
Following an initial investigation, we ruled out a widespread network issue and determined the root cause to be a problem with the Block Storage service. Too many new hosts were added to the cluster simultaneously, which degraded the cluster until it became unusable. We also established that the TrafficPeak service disruption and the service degradation at the Chennai data center were both caused by this issue.
By approximately 10:30 UTC, we had mitigated the situation by temporarily disabling Ceph cluster rebalancing at IN-MAA1 and restarting all Object Storage Daemons (OSDs) in the affected data center, which restored service.
To prevent a recurrence, we have updated our internal documentation to limit host additions to one at a time and will re-enable rebalancing in the impacted cluster under controlled conditions. We are also evaluating further enhancements to improve system resilience and avoid similar incidents in the future.
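As an illustration only, the one-at-a-time rule described above could be enforced in automation along the following lines. This is a minimal sketch and not our internal tooling: the function names, the `add_host` and `is_healthy` callables, and the polling parameters are all hypothetical, and what counts as "healthy" for a given cluster is left to the caller.

```python
import time
from typing import Callable, Iterable, List


class ClusterNotHealthy(RuntimeError):
    """Raised when the cluster does not settle within the allowed polls."""


def wait_until_healthy(
    is_healthy: Callable[[], bool],
    poll_seconds: float,
    max_polls: int,
    sleep: Callable[[float], None],
) -> None:
    # Poll the (hypothetical) health check until it passes or we give up.
    for _ in range(max_polls):
        if is_healthy():
            return
        sleep(poll_seconds)
    raise ClusterNotHealthy(f"cluster not healthy after {max_polls} polls")


def add_hosts_one_at_a_time(
    hosts: Iterable[str],
    add_host: Callable[[str], None],
    is_healthy: Callable[[], bool],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Add hosts strictly in sequence, gating each addition on cluster health.

    `add_host` and `is_healthy` are placeholders supplied by the caller;
    this sketch makes no assumption about how either is implemented.
    """
    added: List[str] = []
    for host in hosts:
        # Never start an addition while the cluster is still settling.
        wait_until_healthy(is_healthy, poll_seconds, max_polls, sleep)
        add_host(host)
        added.append(host)
    # Confirm the cluster has settled after the final addition as well.
    wait_until_healthy(is_healthy, poll_seconds, max_polls, sleep)
    return added
```

The health gate runs before every addition, so a cluster that fails to settle stops the rollout with an exception instead of receiving more hosts.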
This summary provides an overview of our current understanding of the incident given the information available. Our investigation is ongoing and any information herein is subject to change.