Encrypting and compressing your data on the fly
When running a distributed system for data analysis, you take on a big responsibility for your users' data. While developing our Chipster system, we started to ask: how can we make that big responsibility smaller?
From that thinking emerged the idea of using data encryption. The usual solution is to use SSL to encrypt file transfers. However, SSL does not protect files once they have been transferred and are sitting on the file broker.
Data files must be in plain form while they are being processed, but processing takes up only short periods of time on the compute servers. So couldn't we just encrypt them the rest of the time? In our case, files are stored on a file server that is connected to the public internet, and processing is done on backend servers. Hence it makes even more sense to keep files encrypted while they are on the more attack-prone file server. The keys to decrypt the files would be stored on the user's side and transferred to the compute servers when needed, and the compute servers would delete the keys immediately after processing is done.
To achieve that, we have to encrypt data while it is uploaded and decrypt it while it is downloaded. With Java, this is easy to do on the fly. This solution also has the advantage that data is encrypted and decrypted only once on its way from the client to the file server and on to the compute server. With SSL, there is an encrypt and a decrypt step on each of the two hops.
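As a rough illustration of what "on the fly" can mean in Java, here is a minimal sketch built on the standard CipherOutputStream and CipherInputStream classes. The class name OnTheFlyCrypto, its helper methods, and the choice of AES in CTR mode with a 128-bit key are all assumptions made for this example, not a description of what Chipster actually does. Note that CTR mode gives confidentiality only, so a real deployment would probably also want integrity protection.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical helper class; names and cipher choice are assumptions.
class OnTheFlyCrypto {

    // Generate a fresh AES key that would stay on the user's side.
    static SecretKey newKey() throws GeneralSecurityException {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        return generator.generateKey();
    }

    // Random 16-byte IV; it must be unique for every file encrypted with the same key.
    static byte[] newIv() {
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    // Wrap the upload stream: bytes written here reach 'out' already encrypted.
    static OutputStream encrypting(OutputStream out, SecretKey key, byte[] iv)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherOutputStream(out, cipher);
    }

    // Wrap the download stream: bytes read here are decrypted as they arrive.
    static InputStream decrypting(InputStream in, SecretKey key, byte[] iv)
            throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
        return new CipherInputStream(in, cipher);
    }

    // Plain stream copy; the encryption work happens inside the wrapped streams.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newKey();
        byte[] iv = newIv();
        byte[] plain = "example data".getBytes(StandardCharsets.UTF_8);

        // Client side: encrypt while "uploading" to an in-memory stand-in for the file broker.
        ByteArrayOutputStream fileBroker = new ByteArrayOutputStream();
        try (OutputStream out = encrypting(fileBroker, key, iv)) {
            out.write(plain);
        }

        // Compute server side: decrypt while "downloading".
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (InputStream in = decrypting(
                new ByteArrayInputStream(fileBroker.toByteArray()), key, iv)) {
            copy(in, restored);
        }
        System.out.println("round trip ok: " + Arrays.equals(plain, restored.toByteArray()));
    }
}
```

In this sketch the file broker only ever sees the encrypted bytes. The key and IV travel separately from the client to the compute server, which matches the idea of keeping keys on the user's side.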
The on-the-fly encryption scheme aligns nicely with a distributed architecture. The CPU load falls on the clients and the compute servers, which is exactly where CPU load belongs. With SSL, most of the load would be on the central file broker(s), which is not optimal. As a bonus, you can throw in on-the-fly compression, so that the big piles of files sitting on our file broker are also compressed.
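Adding compression can be sketched by stacking one more stream on top of the cipher stream. The example below compresses first and encrypts second, because encrypted data looks random and no longer compresses. The class name CompressThenEncrypt, the method names upload and download, and the use of GZIP with AES in CTR mode are assumptions for illustration only.

```java
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.CipherOutputStream;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; names, GZIP and AES/CTR are assumptions.
class CompressThenEncrypt {

    static Cipher cipher(int mode, SecretKey key, byte[] iv) throws GeneralSecurityException {
        Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
        cipher.init(mode, key, new IvParameterSpec(iv));
        return cipher;
    }

    // Client side: plain bytes -> gzip -> AES -> file broker.
    static OutputStream upload(OutputStream fileBroker, SecretKey key, byte[] iv)
            throws GeneralSecurityException, IOException {
        OutputStream encrypted =
                new CipherOutputStream(fileBroker, cipher(Cipher.ENCRYPT_MODE, key, iv));
        return new GZIPOutputStream(encrypted);
    }

    // Compute server side: file broker -> AES -> gunzip -> plain bytes.
    static InputStream download(InputStream fileBroker, SecretKey key, byte[] iv)
            throws GeneralSecurityException, IOException {
        InputStream decrypted =
                new CipherInputStream(fileBroker, cipher(Cipher.DECRYPT_MODE, key, iv));
        return new GZIPInputStream(decrypted);
    }

    public static void main(String[] args) throws Exception {
        KeyGenerator generator = KeyGenerator.getInstance("AES");
        generator.init(128);
        SecretKey key = generator.generateKey();
        byte[] iv = new byte[16];
        new SecureRandom().nextBytes(iv);

        // Highly repetitive sample data, so the effect of compression is visible.
        byte[] plain = new byte[100_000];
        Arrays.fill(plain, (byte) 'A');

        ByteArrayOutputStream stored = new ByteArrayOutputStream();
        try (OutputStream out = upload(stored, key, iv)) {
            out.write(plain);
        }
        System.out.println("plain bytes:  " + plain.length);
        System.out.println("stored bytes: " + stored.size());

        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (InputStream in = download(new ByteArrayInputStream(stored.toByteArray()), key, iv)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                restored.write(buffer, 0, n);
            }
        }
        System.out.println("round trip ok: " + Arrays.equals(plain, restored.toByteArray()));
    }
}
```

Both the compression and the encryption work happen on whichever machine holds the wrapped stream, so in this arrangement the file broker just stores and serves opaque bytes.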
As data sizes are large in Chipster, the big question is: What is the performance hit compared to raw transfers and SSL? Is this idea going to fly in real life?
I will answer that question in the next blog post.