The largest unplanned outage in years and how we survived it

A month ago CSC's high-performance computing services suffered the most severe unplanned outage in years. The outage was due to data corruption that occurred in the parallel filesystem that houses the work directory.

While this filesystem is intended to be a temporary storage space, we recognized that a significant quantity of data needed to be recovered, which took time. In total approximately 1.7 petabytes and 850 million files were recovered.

Below is a detailed report of the incident and the recovery efforts. At the end of the document we describe what lessons we learned and the steps we are planning to take to minimize the risk of a similar situation reoccurring. 

The conclusion is that the kind of problem we faced is likely quite rare, but it cannot be ruled out completely. The filesystem is designed for temporary storage of data and re-engineering it is not feasible at this point. However, there are several ways to mitigate the risks.

A summary of the main mitigation actions:

  • We will further clarify and raise user awareness of the role of the $WRKDIR as a temporary storage space.

  • Moving data from $WRKDIR to more permanent locations (e.g. the archive and users' local storage) will be made more frictionless.

  • The recovery actions have been documented and the disaster recovery plans updated so that a similar issue can be handled faster and better in the future.

  • The next procurement and long-term development plans will take into account the user need for reliable storage.

Some background

The main storage area for working data sets of CSC users is the /wrk directory ($WRKDIR). The directory is intended for temporary storage of results before staging them into a more permanent location such as the archive, IDA or possibly the home organization.

During the three years that the filesystem has been in operation, it has accumulated 1.7 petabytes of data in 850 million objects. Compared to our previous Lustre-based storage solutions and those of other sites, the filesystem has been exceptionally stable. Thus many users, including those inside CSC, have grown to trust it.

 

Previous generations of systems typically had a cleanup script that deleted old files. For example, in the Louhi system all files older than X weeks were subject to removal. In the current storage system the total capacity has been so large (> 3 PB) that we chose not to enforce automatic deletion, in order to make the system more user-friendly.

The filesystem is based on Lustre, an open source parallel filesystem that is primarily developed by Intel. The data is accessed via multiple servers (Object Storage Servers, OSS) which work in parallel to enable high bandwidth.

The metadata server (MDS) provides the structure of the filesystem and holds a map of where each file is located, as well as information on the owner, permissions, etc. In practice the metadata server has a filesystem with a dummy entry for each of the 850 million directories, files, etc. in the /wrk directory. The size of this metadata is about 240 GB (mostly empty files), residing on a 3 TB filesystem.

All of the object storage servers and metadata servers are connected via multiple InfiniBand links to three large disk systems by DDN where the actual storage is located.


Simplified diagram of the Lustre filesystem operation. Each file on the metadata filesystem (the MetaData Target, MDT) contains the layout of the associated data file, including the OST number and object identifier. Clients request the file layout from the MDS and then perform file I/O operations by communicating directly with the OSSs that manage the file data.
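
To make the division of labour concrete, here is a deliberately simplified toy model of the lookup path described in the caption. It is purely illustrative: the paths, OST indices and object identifiers are made up, and a real Lustre client does all of this inside the kernel.

```python
# Toy model: the MDT stores only each file's layout (which OST holds which
# object); the data itself lives on the object storage targets.
mdt = {
    # path                  -> layout: list of (OST index, object id)
    "/wrk/user1/results.dat": [(3, 0x1A2B), (7, 0x1A2C)],   # striped over two OSTs
}

osts = {
    (3, 0x1A2B): b"first stripe of the file ... ",
    (7, 0x1A2C): b"second stripe of the file",
}

def read_file(path: str) -> bytes:
    layout = mdt[path]                               # 1. ask the MDS for the layout
    return b"".join(osts[obj] for obj in layout)     # 2. read the stripes from the OSSs

print(read_file("/wrk/user1/results.dat"))
```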

What happened?

Before the outage

In January one of the storage controllers of the DDN system malfunctioned. While the system operated normally, we had to schedule its replacement fairly quickly: if the backup controller had failed before the replacement, it would have caused an unplanned break and probably posed a serious risk to the data. To ensure the safety of the replacement, we decided to do it during a maintenance break, scheduled for Tuesday, February 9th.

Tue, Feb 9th

The DDN controller replacement went quite smoothly and around 10 a.m. we were ready to bring the system back online. However, when restarting the Lustre filesystem, the metadata server reported anomalies in its filesystem and requested a filesystem check (fsck). Typically these are fairly routine operations, especially when a filesystem has been up for a long time, and any problems the check finds are fixed automatically with no impact.

In this case, however, the tool could not fix all the problems it identified: a faulty inode persisted. Trying to bring Lustre up resulted in a system crash (kernel panic), with this inode a very likely cause.

Wed, Feb 10th–Thu, Feb 11th

During the week numerous filesystem checks were run, but the problems persisted. This process was fairly time-consuming, as each check took several hours to complete and they could not be run in parallel.

Fri, Feb 12th–Sun, Feb 14th

By the end of the week it became obvious that fixing the problems on the existing filesystem was infeasible. Even if we had hunted down and fixed each problem, there might still have been hidden issues. Thus the solution was to wipe out the existing filesystem and build a new one.

At this time it was obvious that despite /wrk being designated as a temporary storage space, in practice many people had grown to trust it with some of their most important data. This was further demonstrated by the questionnaire we sent to users some days later about the criticality of their data under threat: over 60 users had extremely important data on the filesystem, representing months of work. Simply wiping out all this work was not an option.

 

To clean up the filesystem while preserving the metadata, we resorted to a so-called file-level backup: copying all 850 million files and directories to a secure location, wiping out and rebuilding the original filesystem, and copying the files back to the rebuilt filesystem. A more technical description of the process can be found in the Lustre documentation.
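
For the curious, the outline below sketches the general shape of such a file-level metadata backup and restore, loosely following the procedure in the Lustre manual. The device name, mount point, backup destination and tar/mkfs options are illustrative placeholders rather than the exact commands we ran, and the script only prints the steps.

```python
# Dry-run outline of a file-level MDT backup/restore (illustrative only).
MDT_DEV = "/dev/mapper/mdt0"     # hypothetical MDT block device
MDT_MNT = "/mnt/mdt"             # mount point for the ldiskfs backend
BACKUP  = "/backup/mdt.tgz"      # destination on fast scratch storage

steps = [
    # 1. Mount the MDT's backing (ldiskfs) filesystem directly and archive it,
    #    preserving the extended attributes that hold each file's layout.
    f"mount -t ldiskfs {MDT_DEV} {MDT_MNT}",
    f"tar -C {MDT_MNT} --xattrs --xattrs-include='trusted.*' --sparse -czf {BACKUP} .",
    f"umount {MDT_MNT}",
    # 2. Reformat the MDT, then restore the archived metadata onto it.
    f"mkfs.lustre --reformat --mdt --index=0 --fsname=work --mgsnode=<MGS NID> {MDT_DEV}",
    f"mount -t ldiskfs {MDT_DEV} {MDT_MNT}",
    f"tar -C {MDT_MNT} --xattrs --xattrs-include='trusted.*' -xzpf {BACKUP}",
    f"umount {MDT_MNT}",
]

for step in steps:
    print(step)   # print the outline instead of executing anything
```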

A good analogy is that you have a house that has developed cracks in its foundation. In order to fix it you dismantle the house, organize all the pieces on pallets, destroy the foundation, create a new one and then carefully craft the house back together again. This house just happened to have 850 million parts. Dealing with such a huge quantity of files is not easy as we soon found out.

 

The backup started off smoothly on Friday and it seemed like the copying of all the files would be complete by Monday. However, once we hit the first user with a large number of files (over 1 million), the process slowed down dramatically.


Progress of the metadata recovery process over the weekend. The slowdown can be clearly observed around 4 hours into the operation. (X-axis: minutes, Y-axis: bytes)

Estimating the total time needed became very difficult, but it was clear that the process could easily take weeks. We needed to figure out how to make the process faster, and quickly.

The main problem was not the filesystem bandwidth but rather the number of filesystem operations, such as opening each file and reading its attributes. These are measured in I/O operations per second (IOPS). Each of the 850 million files needed about 6 IOPS, and the disks we were using seemed to sustain only around 2000-3000 IOPS.
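
A rough back-of-envelope calculation with these figures shows why the original estimate looked like weeks and why the roughly 20,000 IOPS achieved later (see below) brought it down to days. The 2,500 IOPS midpoint is just an assumption for illustration.

```python
# Back-of-envelope estimate of the metadata copy time, using the approximate
# figures quoted above: ~850 million files and ~6 IOPS per file.
FILES = 850_000_000
IOPS_PER_FILE = 6

def copy_days(sustained_iops: float) -> float:
    """Days needed to process every file at the given sustained IOPS rate."""
    seconds = FILES * IOPS_PER_FILE / sustained_iops
    return seconds / 86_400

print(f"disk backend  (~2500 IOPS): {copy_days(2_500):4.1f} days")   # roughly 24 days
print(f"ramdisk pool (~20000 IOPS): {copy_days(20_000):4.1f} days")  # roughly 3 days
```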

Mon, Feb 15th–Thu, Feb 18th

On Monday we called together a number of our top HPC and storage specialists to work in parallel on ways to get more IOPS with 3 TB of capacity. We looked at everything from faster RAID arrays to running tar in parallel to repurposing Cray DataWarp SSD nodes. None of these options worked out: they either suffered from performance issues or proved infeasible to implement in a reasonable time.

One obvious solution would be to use a ramdisk, a virtual disk that actually resides in the memory of a node. The problem was that even our biggest system had only 1.5 TB of memory, while we needed at least 3 TB.

As a workaround we created ramdisks on a number of Taito cluster compute nodes, mounted them on a server via iSCSI over the high-speed InfiniBand network, and pooled them together to make a filesystem large enough for our needs.

Initially this was considered somewhat of a long shot, but it paid off: the approach clearly outperformed the other experiments and copied the most difficult large directories in hours instead of weeks. Combined with running multiple copies in parallel, we were able to achieve well over 20,000 IOPS.
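
To give a flavour of what such a pool involves, the dry-run outline below sketches one way to assemble it. The hostnames, device names, ramdisk sizes and the ext4 choice are all made up for illustration, and the iSCSI export on the donor nodes is only hinted at in a comment; this is not the script we actually used.

```python
# Dry-run outline of pooling compute-node ramdisks into one volume (illustrative).
NODES = ["node101", "node102", "node103"]   # hypothetical donor compute nodes

steps = []

for node in NODES:
    # On each donor node: create a large ramdisk (brd sizes are in KiB) and
    # export it as an iSCSI target over the InfiniBand (IPoIB) network.
    steps.append(f"ssh {node} 'modprobe brd rd_nr=1 rd_size=$((512*1024*1024))'")

for node in NODES:
    # On the recovery server: discover and log in to each exported ramdisk.
    steps.append(f"iscsiadm -m discovery -t sendtargets -p {node}")
    steps.append(f"iscsiadm -m node -p {node} --login")

steps += [
    # Pool the imported block devices (names here are made up) into a single
    # striped logical volume and put a filesystem on it.
    "pvcreate /dev/sdb /dev/sdc /dev/sdd",
    "vgcreate ramvg /dev/sdb /dev/sdc /dev/sdd",
    "lvcreate -i 3 -l 100%FREE -n rampool ramvg",
    "mkfs.ext4 /dev/ramvg/rampool",
    "mount /dev/ramvg/rampool /mnt/rampool",
]

for step in steps:
    print(step)   # print the outline only; nothing is executed
```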

Fri, Feb 19th

Once we had all the files copied, we reformatted the metadata filesystem and copied the files back. During the process we found some gaps in the directories that were recovered, partly because several teams were working in parallel and retrieving different parts of the filesystem. Thus we had to look through the files as well as we could. However, due to the huge number of files, it would have taken days to verify everything completely, so we opted to give the disclaimer that we could not exhaustively check the filesystem.

There were also several hundred HDF5 and NetCDF files which the extraction process handled poorly. In the end they were recreated individually on the new metadata filesystem. Unfortunately, a few of the files did not survive this transfer, for reasons that seem impossible to determine afterwards. However, these cases were identified and the affected users notified.

 

To summarize: With a clever workaround and teamwork it was possible to recover nearly all the data on the system in days rather than weeks.

 

Once we had the filesystem restored, we started to bring it online. Because the metadata filesystem was freshly rebuilt, the checks took much longer than in a usual Lustre startup with an existing metadata filesystem. In total it took about 10 hours, but after this we were able to bring Lustre up successfully.

Sat, Feb 20th–Sun, Feb 21st

While the system was up and running and things seemed OK, we wanted to minimize risks. Therefore we initially opened /wrk over the weekend as read-only to the users who had indicated that they had very high-priority files. Many of these users complied and copied their data to a safer location, as demonstrated by the very high network bandwidth out of the system.

Mon, Feb 22nd

Finally, on Monday we were able to open up the system fully to end users! Our Managing Director Kimmo Koski also issued a statement about the situation, and we personally contacted the few users who we knew had suffered data loss.

Mon, Feb 29th

We held a postmortem meeting where we went through the timeline and discussed what we can learn from the incident. A summary is provided below.

What are we going to do to prevent this from happening again?

The incident was a very educational event. Based on the experience gained, a similar incident should be recoverable in about 3 days of downtime if all goes well. The procedure is now familiar and we know how to set up a storage system that can provide the necessary performance. We also found a number of issues that we could improve to mitigate the risk and impact of a similar future event.

 


Below are some of the key findings and discussion points:

Improving support for Lustre?

Currently the Lustre filesystem is self-supported. It is possible to obtain commercial support from Intel; however, this requires using the Intel Enterprise Edition Lustre distribution, and migrating to it in mid-lifecycle would be disruptive and risky. (Edit June 21st 2016: Intel provides commercial support for non-Enterprise Edition Lustre these days and this should be investigated.)

It’s unlikely that we could have prevented the issue or fixed it very quickly even with commercial support; possibly we could have shaved a couple of days off the downtime. This case also made CSC’s own support stronger, as our specialists can now handle this type of crisis situation much better.

Thus migrating to the commercial Lustre version at this time is not feasible. However, when we do the next major upgrade of our storage infrastructure hardware, we will consider this option.

We will look at ways to reinforce the in-house Lustre expertise and make the collaboration between the filesystem and the underlying storage system specialists even tighter.

Monitoring for data corruption?

Checking for this type of data corruption is difficult while the filesystem is online. Taking the system offline routinely for testing would be too disruptive.

We will investigate whether we could monitor filesystem health by taking a snapshot of the metadata and running the filesystem check on that. However, we need to ensure that this does not trigger other stability or performance problems. Also, a snapshot could not be used to recover files or directories created after it was taken, so it has some tradeoffs.
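
As an idea sketch of what that could look like on the MDS host (the volume group and logical volume names are hypothetical, and whether snapshots are available at all depends on how the MDT is provisioned):

```python
# Sketch: check a copy-on-write snapshot of the MDT instead of the live device.
# Not a tested procedure; names and sizes are placeholders.
import subprocess

def run(*cmd: str) -> None:
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Take a snapshot of the (hypothetical) MDT logical volume.
run("lvcreate", "--snapshot", "--size", "100G", "--name", "mdt_snap", "/dev/vg_mdt/mdt0")
# 2. Run a read-only filesystem check against the snapshot (ldiskfs is ext4-based).
run("e2fsck", "-fn", "/dev/vg_mdt/mdt_snap")
# 3. Drop the snapshot once the check is done.
run("lvremove", "-f", "/dev/vg_mdt/mdt_snap")
```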

Should the filesystem policy and architecture be improved?

The /wrk filesystem, which is non-guaranteed storage, is used by many groups to store critical data. Essentially, the planned design and policy do not match the reality of actual use.

CSC has frequently reminded customers in user guides and training material that /wrk is not guaranteed and can be erased without notice. The incident demonstrates that this information should be even more visible and clear. The policy needs to be stated in very clear terms in the manuals, the MOTD (message of the day), training courses and the service-level agreement (SLA).

Also, the policy of the /proj filesystem is likely unclear to many users, and there is likely a lot of critical data there as well. Thus it should be included in the dissemination work.

We are working on an HPC strategy, and clarifying our reliability guarantees will be part of it.

In previous systems CSC used cleanup scripts to delete old files from the temporary work directories. Due to the large amount of storage in the current system, we have not enabled them yet. Enabling them would probably help enforce the policy; however, the lead time should be quite long and users should be informed well in advance. No decision on when to enable them has been made yet.
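
For illustration, the kind of policy being discussed is conceptually very simple. The sketch below only lists (and does not delete) files older than a given age; the root path and the 13-week threshold are just examples rather than a decided policy, and a real implementation on a filesystem this size would use parallel, policy-engine-based tooling instead of a serial walk.

```python
# List files under ROOT whose modification time is older than MAX_AGE_WEEKS.
# Dry run only: prints candidates instead of removing anything.
import os
import time

ROOT = "/wrk"            # example path
MAX_AGE_WEEKS = 13       # example threshold, not an actual policy
cutoff = time.time() - MAX_AGE_WEEKS * 7 * 86_400

for dirpath, _dirnames, filenames in os.walk(ROOT):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.lstat(path).st_mtime < cutoff:
                print(path)          # candidate for cleanup
        except OSError:
            pass                     # file vanished or is unreadable; skip it
```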

Having a larger and more accessible archive filesystem would make it easier for users to migrate data off the /wrk and /proj directories. Making the archive more accessible is already an ongoing project in the storage group, and increasing its size and upgrading the tape robot are also in the works.

One potential solution would be to employ a hierarchical storage management (HSM) system which migrates old files automatically to the archive filesystem. We will be testing the Robinhood Policy Engine. As an added bonus, this tool will also keep a backup of the metadata automatically. If this does not work as expected, we could also include HSM in the next storage procurement.

Other findings

  • Point-to-point data transfers between different systems were often relatively slow with our current tools compared to the maximum line rates available (~800Mbit/s vs. 10Gbit/s). We need to investigate and possibly implement better data transfer tools. Users can also benefit from this work.
     
  • Identifying programs and users which create huge numbers of files (>1 million) and providing instructions on how to avoid this, or at least clear instructions or scripts for cleaning them up. A smaller number of files would make recovery faster and filesystem errors less likely. The number of these heavy users is quite small.
     
  • Communications played a crucial role. We will add a dedicated communications person to the disaster recovery team.
     
  • Better tools for collaboration in a crisis are needed. Information was spread across multiple chat and other tools, and a common tool would have really helped coordinate the multiple threads of work and share important information.
    • We should agree on a common communication channel at the beginning of a crisis and have all the stakeholders there (including managers).
    • Having a way to give a quick backstory to people who join the team in the middle of an incident would be important.
       
  • CSC HPC specialists' work contracts cover office hours and the SLA is for office hours. However, in an exceptional situation like this, people may volunteer to work longer hours. This needs to be managed to avoid people becoming overloaded.
    • The disaster task force should agree on how many extra work hours per day are reasonable and possibly arrange shifts.
       
  • Commandeering a little-used common work space as a war room improved effectiveness in a phase where tight collaboration was required.
    • There should be a designated space that could be used like this in future incidents. It should have real desks (not a stand-up meeting room or the coffee area).
       
  • Improve monitoring: the percentage of free inodes (as well as capacity) should be part of capacity planning. We could also track the number of objects in the filesystem, which would be a good proxy for the restore time if we ever need to do this again (see the sketch after this list). These checks had already been added at the time of writing.
     
  • We could automate the in-memory filesystem we used for recovery. This would be useful for future recovery exercises and might also be useful for customers (e.g. grand challenge projects). Knowing that it could be started at any time would be a low-cost safety net compared to upgrading MDS servers or adding dedicated hardware. Preferably it would be integrated via Slurm, for example by adding feature=moreIOPS to the job.
     
  • Separating the /wrk filesystem into smaller chunks using the Distributed Namespace (DNE) feature could make a potential full-system failure less likely. However, this requires Lustre 2.8, and the upgrade needs to be planned at least 6 months in advance.
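
As a concrete example of the inode monitoring mentioned in the list above, here is a minimal sketch using only the Python standard library. The mount point and the alert threshold are just examples; a production check would feed these numbers into the regular monitoring system.

```python
# Report inode usage for a filesystem; the number of used inodes also serves
# as a rough proxy for the object count (and hence a possible restore time).
import os

def inode_usage(path: str) -> tuple[int, int, float]:
    """Return (total inodes, free inodes, percent used) for the filesystem at path."""
    st = os.statvfs(path)
    used_pct = 100.0 * (st.f_files - st.f_ffree) / st.f_files
    return st.f_files, st.f_ffree, used_pct

total, free, used_pct = inode_usage("/wrk")      # example mount point
print(f"inodes: total={total} free={free} used={used_pct:.1f}%")
if used_pct > 80.0:                              # example alert threshold
    print("WARNING: inode usage is high")
```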

Olli-Pekka Lehto

Olli-Pekka Lehto is a full-stack HPC geek, working in various roles involving supercomputers since 2002. These days he manages the Computing Platforms group at CSC. Follow him on Twitter: @ople
