In recent years, sensitive data has become one of the hottest of hot topics in Finnish scientific data management discussion, not least thanks to the European General Data Protection Regulation (GDPR). At the same time, for nearly five years now, CSC has provided the ePouta cloud platform for all sensitive data processing needs, with quite substantial computing and storage capacity. This virtual private IaaS cloud solution has been designed from the ground up to meet the national requirements for IT systems handling protection level III (ST III) data.
While ePouta has been successful in providing our institutional customers with a safe and robust platform for their sensitive data processing, it has lately become very clear that something more is desperately needed: something that is more easily adopted and accessed, something for individual researchers and research groups, and something more collaborative.
Here a problem arises: by definition, sensitive data contains information which should only be processed either with explicit consent or with legitimate permission, and there are certain rules for such processing. Probably the most notable of those rules, from a researcher’s perspective, are the requirements for data minimisation, pseudonymisation, encryption, safe processing and disposal of the data after use.
Data minimisation and pseudonymisation relate directly to dataset definition. Minimisation means that only the data that is absolutely needed should be processed. For example, if the dataset includes information about a person’s age but that information is not needed for the research, it should be removed from the dataset before processing.
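As a minimal sketch of minimisation, assume the dataset is a list of plain Python dictionaries. The field names and the `minimise` helper are made up for illustration and are not part of any CSC tool:

```python
def minimise(records, needed_fields):
    """Return copies of the records that contain only the fields the study needs."""
    return [
        {field: record[field] for field in needed_fields if field in record}
        for record in records
    ]


# Hypothetical records: age is collected but not needed for this study.
records = [
    {"subject": "A", "age": 54, "diagnosis": "I10"},
    {"subject": "B", "age": 37, "diagnosis": "E11"},
]

# Keep only what the research question requires; age is dropped.
minimal = minimise(records, ["subject", "diagnosis"])
```

The original records are left untouched, so the trimming step can be reviewed before the full dataset is disposed of.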
Pseudonymisation is a de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms.
Pseudonymisation differs from anonymisation in that pseudonymised data can be restored to its original state with the addition of information that allows individuals to be re-identified. Such re-identification codes must be kept separate from the pseudonymised data. Clearly then, these topics are something that the data owner or the researcher should take care of. The rest of the rules, however, are more technical matters, and are something CSC should help with. And this is exactly where our sensitive data services step in.
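The difference can be sketched in a few lines of Python. This is only an illustration under simple assumptions: records are dictionaries, and the `pseudonymise` and `reidentify` helpers and the `P-` prefix are hypothetical names. The key table it returns plays the role of the re-identification codes that must be stored apart from the data.

```python
import secrets


def pseudonymise(records, id_field):
    """Replace id_field with a random pseudonym; return (records, key_table)."""
    key_table = {}  # original identifier -> pseudonym; keep this apart from the data
    pseudonymised = []
    for record in records:
        original = record[id_field]
        if original not in key_table:
            key_table[original] = "P-" + secrets.token_hex(8)
        pseudonymised.append({**record, id_field: key_table[original]})
    return pseudonymised, key_table


def reidentify(records, id_field, key_table):
    """Restore the original identifiers; possible only with the key table."""
    reverse = {pseudonym: original for original, pseudonym in key_table.items()}
    return [{**record, id_field: reverse[record[id_field]]} for record in records]


# Hypothetical example data: the same person appears twice.
visits = [
    {"name": "Aino Virtanen", "result": 4.2},
    {"name": "Aino Virtanen", "result": 4.6},
    {"name": "Matti Korhonen", "result": 5.1},
]
pseudo_visits, key_table = pseudonymise(visits, "name")
restored = reidentify(pseudo_visits, "name", key_table)
```

If the key table were destroyed, the data could no longer be linked back to individuals through the pseudonyms alone. Whether it would then count as anonymised depends on what the remaining fields reveal.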
You know the rules and so do I
The centrepiece of sensitive data services is storage. The data should be stored in such a way that unauthorised access is virtually impossible, yet at the same time legitimate access is as easy as possible. Furthermore, the data should not disappear, become corrupted, or leak out while being stored and used. Data owners should be able to store their sensitive data easily and to share it with only those users they grant permissions to.
CSC’s Sensitive Data Archive service is designed to fulfil all the requirements mentioned above, and even a few more. Instead of providing just regular storage space, the new Sensitive Data Archive adds a service layer between the storage and the user applications. This service layer, called Data Access API, takes care of encryption and decryption of data on behalf of the user, which also relieves users of encryption key management tasks.
Furthermore, the Data Access API ensures that the secured data is visible and accessible only to those users who have been granted access to it by the data owner. The processing environment, the access mechanism and the sensitive data storage are all logically and physically separated from each other in order to ensure maximum security. This also makes the sensitive data platform flexible, since compute and storage do not depend on each other, while the glue between them still makes the whole seamless and transparent for the user.
Take my hand, we’re off to secure data land
So, how does it work for the user then? Let’s first assume that the dataset a user is interested in has already been stored in the Sensitive Data Archive. The data is safely stored and it is findable by its public metadata, but it is by no means accessible at this point: the user needs permission for the dataset she needs for her research. Instead of sending a traditional paper application to the dataset owner, she will apply through a web portal to a Resource Entitlement Management System, REMS, which will circulate the application to the data owner(s). Once the application has been accepted, a digital access token will be created, which is comparable to a passport and visa granting entry into a foreign country.
Now, when logging in to a sensitive data processing system, this digital access token will be transparently passed along with the login information to the compute system. The Sensitive Data Archive’s Data Access API will query the token and, based on the information in it, will present the dataset in a read-only mount point on the local file system. Even though the files look just like regular files on your file system, they are actually a virtual presentation of the actual files. Not a single file has been copied into the compute system, yet they are accessible like any regular file. When a file operation is performed on a dataset file, the Data Access API will fetch just the requested bits from the storage, decrypt them and hand them to the process requesting them, just like any other operating system call on any other file.
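The idea of fetching and decrypting only the requested range can be sketched as follows. Everything here is an assumption made for illustration: the `DataAccessSketch` class, the dictionary-shaped token and the SHA-256-based toy keystream are stand-ins, not the actual Data Access API or its encryption scheme, and the toy cipher must not be used to protect real data.

```python
import hashlib

BLOCK = 32  # SHA-256 digest size in bytes


def keystream(key, offset, length):
    """Toy counter-mode keystream covering bytes [offset, offset + length)."""
    if length <= 0:
        return b""
    first = offset // BLOCK
    last = (offset + length - 1) // BLOCK
    stream = b"".join(
        hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        for i in range(first, last + 1)
    )
    start = offset - first * BLOCK
    return stream[start:start + length]


def xor_range(key, data, offset):
    """Encrypt or decrypt data that sits at the given offset within a file."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, offset, len(data))))


class DataAccessSketch:
    """Hypothetical stand-in for the service layer between storage and user."""

    def __init__(self, key, archive):
        self._key = key          # the user never sees the key
        self._archive = archive  # dataset -> file name -> ciphertext

    def read(self, token, dataset, name, offset, length):
        if dataset not in token["datasets"]:
            raise PermissionError("token does not cover dataset " + dataset)
        # Fetch only the requested encrypted range, then decrypt just that range.
        chunk = self._archive[dataset][name][offset:offset + length]
        return xor_range(self._key, chunk, offset)


# Hypothetical usage: a 100-byte file stored encrypted in dataset "ds1".
key = b"demo-key"
plaintext = bytes(range(100))
api = DataAccessSketch(key, {"ds1": {"sample.bin": xor_range(key, plaintext, 0)}})
part = api.read({"datasets": {"ds1"}}, "ds1", "sample.bin", 40, 10)
```

Because the keystream depends only on the key and the byte position, any range can be decrypted without touching the rest of the file, which is what makes random access to an encrypted file cheap.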
One added benefit of using access tokens is that they have a validity period, and they can also be revoked by the data owner at any time. Once the token expires, the Data Access API will cut off access to the files; they simply disappear from the compute system in a puff of smoke. The validity period can just as easily be extended, too. Thus, the data owner retains full control over the data she has stored in the Sensitive Data Archive.
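A minimal sketch of that behaviour, assuming a token is nothing more than a small record with an expiry time and a revocation flag. The `AccessToken` class and `visible_files` function are hypothetical; the real token format is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AccessToken:
    """Hypothetical token: who may see which dataset, and until when."""
    user: str
    dataset: str
    expires_at: datetime
    revoked: bool = False

    def is_valid(self, now):
        return not self.revoked and now < self.expires_at


def visible_files(token, files, now):
    """Files shown in the mount point: everything while valid, nothing after."""
    return list(files) if token.is_valid(now) else []


# Hypothetical timeline: a token issued for 30 days.
issued = datetime(2019, 3, 1, tzinfo=timezone.utc)
token = AccessToken("researcher", "ds1", issued + timedelta(days=30))
files = ["a.vcf", "b.vcf"]

during = visible_files(token, files, issued + timedelta(days=10))
after = visible_files(token, files, issued + timedelta(days=31))

token.expires_at += timedelta(days=30)   # the data owner extends the period
extended = visible_files(token, files, issued + timedelta(days=31))

token.revoked = True                     # or withdraws access altogether
withdrawn = visible_files(token, files, issued + timedelta(days=31))
```

Since validity is checked at access time rather than at login, extending or revoking a token takes effect without the user having to do anything.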
For the data owner, the procedure for storing the data is, if possible, even simpler. You just need to define metadata for your dataset, enter it into REMS (either manually or automatically through an API) and then upload your data. The upload tool will encrypt the data and send it to the archive, which will re-encrypt the data so that it truly is secure. Even you, as the data owner and submitter, will not be able to read it back without first granting yourself permission and using the Data Access API on our sensitive data compute systems.
Something old, something new, something browser’ed
So far so good, but the problem has always been that ePouta is too inflexible for individuals and smaller research groups. The good news is that the Data Access API has been successfully demonstrated in ePouta and it will become a full-blown service later this year.
But the even better news is that, along with that, there will be a whole new service for ePouta: a remote desktop connection for individual users.
Each user, or group of users if that’s the case, will get their very own private virtual cloud resource with the Data Access API. And the best part is that it does not require any client software installation at the user’s end. A reasonably modern web browser is enough; even a smartphone’s browser is sufficient (I have tested it and it works, even on 4G, but really, it is close to useless on such a small screen with touch input only).
Are we there yet?
We haven’t really figured out yet how the project model will work, how users can install the software they need (it is ePouta without external connections, after all), or some other pretty important details of the service processes. Still, the technology is already there, and it is becoming mature and robust enough that we’re confident in saying that ePouta Remote Desktop will be a publicly available service later this year.
The end credits (which no one reads)
Early on, with much planning put into our sensitive data model, we realised that it is vital that we do not just develop a fancy new platform and then try to make everyone use it. Instead, we have tried to team up and collaborate with partners with similar ambitions, and have focused on making the service as flexible as possible and on using open standards as much as possible.
Developed in a joint effort with the Nordic e-Infrastructure Collaboration’s (NeIC) Tryggve project and the Centre for Genomic Regulation (CRG), the Data Access API is part of the Federated EGA concept, designed to provide general, distributed and secure storage for genomic data alongside the European Genome-Phenome Archive (EGA). But while genomic data has been the driving factor, the API is actually data type agnostic and works for any type of data, e.g. text, binary or video.
In our future dreams, anyone could install the Sensitive Data Archive and host their sensitive data themselves, but still make it available for access in ePouta Remote Desktop. This is something we’ve already tested with our Swedish partners, accessing two separate datasets stored in Finland and Sweden and using them in ePouta Remote Desktop with a mobile phone at Oslo Airport…
Image: Adobe Stock