Summary of Allas service outage
As part of our principle of transparency, we publish a summary of the outage of the data management system Allas, the restoration of operations, and the development work done to prevent future problems.
Allas service issues in November
CSC's Allas service suffered an outage between Monday 16.11.2020 and Friday 20.11.2020. In addition, Allas was in read-only mode between Friday 20.11.2020 and Monday 23.11.2020.
There was a second outage of the data management system Allas between Saturday 28.11 and Tuesday 1.12.
This was an exceptionally long downtime for a service that is designed to run without downtime, even during major changes such as moving the service between datacenters.
The data management system Allas is based on the Ceph storage software. On Thursday 12.11 a standard maintenance operation was done on Allas, a minor version upgrade of the Ceph storage software. This version had previously been deployed in our test systems without problems.
A software bug caused memory consumption on the servers to grow, and also caused an on-disk transaction log to grow. This started on Friday 13.11 in the afternoon. On Monday 16.11 around 00:30 the cluster ran out of memory and went into a degraded state. The customer-facing frontends were shut down on Monday to make sure there were no writes to the degraded cluster, and to allow the cluster to recover.
The fix was time-intensive: testing, automating, running, and verifying it took from Tuesday 17.11 until Friday 20.11 morning.
Around a week after the issues were fixed, we again saw memory growing and had to take the cluster down on Saturday 28.11 to make sure we would not end up in a state with long recovery times. We applied further mitigations, and the cluster was back up on Tuesday 1.12 in the morning.
To the best of our knowledge, the problems started after the version update. The bug becomes visible only in large-scale deployments with a specific configuration (only 1/3 of our Ceph installations were affected).
On its own, the bug introduced in the new version would not have caused a problem. However, a new customer workload had recently started, and it exposed another, mostly harmless, bug in the Allas API that increased resource consumption on the servers. Together these two bugs would have had only a minor effect on the cluster, but we believe the Allas API bug triggered a third, dormant bug, which led to the memory growth, amplified by the first bug.
In the first recovery we managed to add mitigations for one of the three bugs. The second downtime was caused by the dormant bug, whose effect was made worse by the heavy cluster recovery. We mitigated that bug during the weekend.
Due to the interplay of several smaller bugs and customer load, we cannot be completely certain of every interaction in the system that caused the downtimes. However, we have identified at least one of the triggering situations, and we are adding monitoring for it, in addition to the mitigations already added in Allas that help with the other bugs.
Why did it take so long to fix?
Many different factors contributed to this. The main issue was that the bug triggered a persistent problem condition in the system.
This meant the problem could not be solved by restarting services. Downgrading the software to a known-working version did not solve it either. Bugs like this are extremely rare.
The other factor was that the storage cluster tried to repair itself while in an unrecoverable state. This left the cluster unevenly populated and caused great differences between the slowest and fastest recovering parts.
Because this was a one-of-a-kind fix, with vastly different recovery times across the cluster, repair times were also very difficult to predict.
What is done to prevent this from happening again?
This outage seems to have been caused by an interaction of three bugs and a specific customer load. We have increased our monitoring, and we will be alerted if similar problems appear in the future.
We have also added mitigations in the cluster, which reduce the impact of similar bugs in the future.
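To give an idea of the kind of monitoring involved, the sketch below flags sustained memory growth of the sort seen here. This is a minimal illustration, not CSC's actual tooling; the function name, window size, and threshold are our own illustrative choices.

```python
def memory_growth_alert(samples, window=6, min_growth=0.10):
    """Flag sustained memory growth.

    samples: memory readings (e.g. bytes), oldest first.
    Returns True when the last `window` readings are monotonically
    non-decreasing AND total growth over the window exceeds
    `min_growth`, expressed as a fraction of the window's first reading.
    """
    if len(samples) < window:
        return False
    recent = samples[-window:]
    if recent[0] <= 0:
        return False
    monotonic = all(b >= a for a, b in zip(recent, recent[1:]))
    growth = (recent[-1] - recent[0]) / recent[0]
    return monotonic and growth > min_growth

# Flat usage does not alert; a steady upward trend does.
memory_growth_alert([100] * 10)                      # False
memory_growth_alert([100, 105, 111, 118, 126, 135])  # True (+35%)
```

Requiring both monotonic growth and a minimum total increase avoids alerting on normal fluctuation while still catching a slow leak early, before the cluster runs out of memory.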
Is data at risk in Allas?
No, no more than in other redundant storage services. Even in these very exceptional circumstances the risk to data integrity in Allas was minimal.
Data integrity is of prime importance in both the software and service design of Allas.
Even a redundant data storage service can never replace backups. For example, accidentally deleted objects can only be recovered from backups. We always recommend keeping backups of your important data, regardless of where the data is stored.
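As an illustration of the backup advice, a backup copy can be verified against its source by comparing checksums. This is a generic sketch, not part of Allas or CSC's tooling; the function names are hypothetical.

```python
import hashlib
from pathlib import Path

def manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def verify_backup(source_dir, backup_dir):
    """Return (missing, changed): files absent from the backup,
    and files whose content differs between source and backup."""
    src, bak = manifest(source_dir), manifest(backup_dir)
    missing = sorted(set(src) - set(bak))
    changed = sorted(k for k in src.keys() & bak.keys() if src[k] != bak[k])
    return missing, changed
```

Verifying checksums, rather than only file names or sizes, catches silent corruption in either copy, which is why backup tooling generally compares digests.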
How likely are similar outages in the Allas service?
Similar problems are unlikely to occur. Although it is impossible to predict all failure scenarios, we have added mitigations to reduce the impact of similar issues in the future. The Allas architecture has built-in redundancy on all levels, which protects against other identified problem scenarios.