Blog Post - CSC Blog
Scientists have been using CSC supercomputers for some 30 years. Since the 1970s, CSC and its predecessor have hosted Finland’s fastest computers, starting with the Univac 1108 in 1971, followed by the VAX 8600 in 1985 and finally the first supercomputer, the Cray X-MP, which was taken into use in the autumn of 1989.
Taito and Sisu were originally installed in 2012, and their computing power was improved with a major update in 2014. It has been 5 years since then, which is nearing the standard retirement age for a supercomputer. Due to continuous improvements in the efficiency of processors and other components, the same computing power can be achieved with significantly smaller hardware, which also consumes just a fraction of the power. On the other hand, significantly more computing power and storage space can be achieved using the same amount of power.
In 2015, CSC began to prepare its next update. The first task was to determine what needs and visions scientists had. What kinds of resources and how much of them will be needed in the future? We engaged in dialogue, conducted user surveys, held workshops in just about every Finnish university and interviewed top scientists. The report showed that there was a need for new infrastructure, with data and its use playing a particularly important role.
Together with research and innovation actors, the Ministry of Education and Culture launched the Data and Computing 2021 development programme (DL2021). In the development programme, EUR 33 million in funding was granted for the procurement of a new computing and data management environment. In addition, the Finnish Government granted EUR 4 million from the supplementary budget for the development of artificial intelligence.
Supercomputer Puhti (2019). Photo: Mikael Kanerva, CSC.
The new hardware will serve six primary purposes:
1) Large-scale simulations: This group represents traditional high performance computing (HPC). These are utilized in physics and in various related fields. Challenging scientific questions are studied by massive computing, for example by high-precision simulations of nuclear fusion, climate change, and space weather.
2) Medium-scale simulations: This category covers a large part of the usage of the computing resources provided by CSC. These simulations span a wide range of disciplines, from biophysical studies of cell functions to material science and computational fluid dynamics. For this type of simulation, it is particularly important to enable workflows that allow a large number of simulations and provide efficient means to handle the resulting data. The created data requires efficient analysis methods utilizing data-intensive computing and artificial intelligence.
3) Data-intensive computing: This use case covers analysis and computing with big data based on extensive source material. The largest group of data-intensive computing users at CSC are currently the bioinformaticians. Other large user groups include language researchers and researchers of other digital humanities and social sciences.
4) Data-intensive computing using sensitive data: Research material often contains sensitive information that cannot be disclosed outside the research group and is governed by a number of regulations, including the Personal Data Act and, from May 2018 on, the EU General Data Protection Regulation. In addition to the needs of data-intensive research in general, managing sensitive data requires e.g. environments with elevated data security and tools for handling authentication and authorization. Some examples include biomedicine dealing with medical reports and humanities and social sciences utilizing information acquired from informants and registries.
5) Artificial intelligence: Machine learning methods are applied to many kinds of scientific challenges, and their use is rapidly expanding to various scientific disciplines, including life sciences, humanities and social sciences. Machine learning is typically applied to analysis and categorization of scientific data. Easy access to large datasets, like microscope images and data repositories, is crucial for the efficiency of the artificial intelligence workload.
6) Data streams: Many important scientific datasets consist of constantly updated data streams. Typical sources for these kinds of data streams include satellites with measuring instruments, weather radars, networks of sensors, stock exchange prices, and social media messages. Additionally, there are data streams emanating from the infrastructure and between its integrated components.
Supercomputers Puhti and Mahti
Two independent systems will provide computing power for CSC in the future: Puhti and Mahti.
Puhti is a supercomputer, which is intended to support many of the above-mentioned purposes. It offers 664 nodes for medium-sized simulations with plenty of memory (192 GB or 384 GB) and 40 cores, which represent the latest generation of Intel Xeon processor architecture. These nodes are combined with an efficient Infiniband HDR interconnect network, which allows for the simultaneous use of multiple nodes. Some quantum chemistry applications benefit a great deal from fast local drives, which are found in 40 nodes. The same nodes can be used for data-intensive applications, in addition to which the supercomputer has 18 large-memory nodes that contain up to 1.5 TB of memory.
One of the hottest topics right now is artificial intelligence. In science, its use is constantly increasing both in data processing and as part of simulations. For this purpose, Puhti has an accelerated partition, Puhti-AI, which contains 80 GPU nodes, each of which has four Nvidia Volta V100 GPUs. These nodes are very tightly interconnected, thus allowing simulations and artificial intelligence work using multiple nodes to get as much out of the GPUs as possible. The majority of current machine learning workloads use only one GPU, but the trend is toward larger learning tasks. The new hardware makes it possible to use multiple nodes at the same time. The new Intel processors (Cascade Lake) also include new Vector Neural Network Instructions (VNNI), which accelerate inference workloads by as much as a factor of 10. The supercomputer work disc is 4.8 PB.
In the procurement of Puhti, CSC and the Finnish Meteorological Institute (FMI) collaborated to extend Puhti with a dedicated research cluster for the FMI. This 240-node partition is fully funded by the FMI and is logically separated from the main Puhti system, while the hardware is fully integrated. In total, this means that the joint machine has 1002 nodes.
Mahti is being installed in the Kajaani Datacenter in the same room where Sisu was. Unlike Puhti, Mahti is fully liquid cooled. In terms of datacenter technology, the new supercomputer is a major improvement over Sisu. Mahti's liquid cooling system uses warm water (just under 40 degrees), as opposed to Sisu, which required cooled water. As a result, Mahti can be cooled more affordably and efficiently. Mahti is a purebred supercomputer containing almost 180 000 CPU cores in 1404 nodes. Each node has two next-generation AMD 64-core processors (EPYC 7H12) running at 2.6 GHz, making the theoretical peak performance of the whole system 7.5 Pflops. This version of the AMD EPYC processor is the fastest CPU currently available, and will give Finnish science a unique competitive advantage. There is 256 GB of memory per node, so even large-scale simulations requiring a large amount of memory can be run effectively. The supercomputer work disc is over 8 PB.
Puhti
- 682 nodes, with two 20-core Intel Xeon Gold 6230 processors, running at 2.1 GHz
- Theoretical computing power 1.8 Pflops
- 192 GB - 1.5 TB memory per node
- High-speed 100 Gbps Infiniband HDR interconnect network between nodes
- 4.8 PB Lustre parallel storage system
Puhti-AI
- 80 nodes, each with two Intel Xeon Gold 6230 processors and four Nvidia Volta V100 GPUs
- Theoretical computing power 2.7 Pflops
- 3.2 TB of fast local storage in the nodes
- High-speed 200 Gbps Infiniband HDR interconnect network between nodes
Mahti
- 1404 nodes with two 64-core AMD EPYC processors (Rome) running at 2.6 GHz
- Theoretical computing power 7.5 Pflops
- 256 GB of memory per node
- High-speed 200 Gbps Infiniband HDR interconnect network between nodes
- 8.7 PB Lustre parallel storage system
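The theoretical CPU figures in the specification lists above can be roughly reproduced as nodes × cores per node × clock rate × floating-point operations per cycle. The sketch below assumes 32 double-precision operations per cycle per core for the Intel processors in Puhti and 16 for the AMD processors in Mahti. These per-cycle values are assumptions that do not come from this post, and the function name is purely illustrative.

```python
def peak_pflops(nodes, cores_per_node, clock_ghz, flops_per_cycle):
    """Theoretical peak in Pflops: cores x clock rate x operations per cycle."""
    flops = nodes * cores_per_node * clock_ghz * 1e9 * flops_per_cycle
    return flops / 1e15


# Puhti CPU nodes: 682 nodes, two 20-core processors each, 2.1 GHz.
# Assumption (not from the post): 32 double-precision operations per cycle per core.
puhti = peak_pflops(nodes=682, cores_per_node=40, clock_ghz=2.1, flops_per_cycle=32)

# Mahti: 1404 nodes, two 64-core processors each, 2.6 GHz.
# Assumption (not from the post): 16 double-precision operations per cycle per core.
mahti = peak_pflops(nodes=1404, cores_per_node=128, clock_ghz=2.6, flops_per_cycle=16)

print(f"Puhti: {puhti:.1f} Pflops")
print(f"Mahti: {mahti:.1f} Pflops")
```

Under these assumptions the results round to 1.8 and 7.5 Pflops, in line with the listed figures. The same multiplication also gives Mahti's core count: 1404 × 128 = 179 712, or "almost 180 000".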
Allas data management solution
Growth in the volume of data and the need for different approaches to sharing it also pose new challenges for data management. A file system based on a conventional directory hierarchy does not fully meet future needs where, for example, the scalability of storage systems and the sharing and re-use of data are concerned.
Allas is CSC's new data management solution, which is based on object storage technology. The 12 PB system offers new possibilities for data management, analysis and sharing. Data is stored in the system as objects, which for most users are just files. As opposed to a conventional file system, files can be referred to in other ways than by their name and location in the directory hierarchy, as the system assigns a unique identifier to each object. In addition, arbitrary metadata can be added to each object, thus allowing for a more multifaceted description of the data.
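To make the object storage idea more concrete, here is a minimal in-memory sketch of a store that assigns a unique identifier to each object and lets arbitrary metadata be attached to it. It is only a toy model: the class and method names are hypothetical and this is not the Allas programming interface.

```python
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store; not the Allas interface."""

    def __init__(self):
        self._objects = {}

    def put(self, data, **metadata):
        """Store bytes with arbitrary metadata and return a unique identifier."""
        object_id = str(uuid.uuid4())
        self._objects[object_id] = {"data": data, "metadata": dict(metadata)}
        return object_id

    def get(self, object_id):
        """Fetch the stored bytes by identifier."""
        return self._objects[object_id]["data"]

    def find(self, **criteria):
        """Return identifiers of objects whose metadata matches all criteria."""
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj["metadata"].get(k) == v for k, v in criteria.items())
        ]


store = ToyObjectStore()
image_id = store.put(b"...pixels...", instrument="microscope", sample="A12")
store.put(b"...readings...", instrument="radar")

# Look the object up by what it describes rather than by a directory path.
found = store.find(instrument="microscope")
```

The point of the sketch is that an object is located by its identifier or by its metadata, not by its position in a directory tree.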
Data stored in Allas is available on CSC's supercomputers and cloud platforms as well as from any location over the Internet. In the simplest case, a user can add and retrieve data on their own computer just through a web browser. Allas also facilitates the sharing of data, as users are able to share the data they choose with individual users or even with the whole world. Allas also offers a programming interface, which can be used to build a wide variety of services on top of it.
One example of the new use cases is data (possibly even very high volume) generated by an instrument, which can be streamed directly to Allas. The data can then be analyzed using CSC supercomputers, and the results can be saved back to Allas, from which it is easy to share the results with partners.
Data management system Allas (2019). Photo: Mikael Kanerva, CSC
A broad spectrum of scientific problems in pilot projects
During the Puhti supercomputer acceptance phase, a limited number of Grand Challenge research projects were given an opportunity to use the extremely large computing resources. An effort was made to take the various computing needs behind the supercomputer procurement into account when selecting pilot projects. The selected projects varied from conventional, large-scale simulations to research conducted using artificial intelligence, and the researchers studied a wide range of topics from astrophysics to personalized medicine. The rise of AI as a part of the workflow was a big trend, and 61% of all resources were used by projects which had, or planned to have, AI as a part of their work.
The pilot period was very successful in testing the system. The projects were able to generate a very high load on the system and thus confirm that it was usable with real workloads. Several projects were also able to make significant progress in their research during the pilot period. Due to the testing nature of the acceptance phase, some projects did face technical problems, but these experiences were also very valuable to CSC, since they help us improve the functionality of the system. In successful projects, the performance of Puhti was generally a bit better than that of Sisu, both in terms of parallel scalability and in terms of single-core performance.
A new group of Grand Challenge pilot projects will be selected at the end of 2019 for the acceptance phase of the Mahti supercomputer. We look forward to seeing what kinds of scientific challenges await!
The Puhti supercomputer was opened to customers on 2 September 2019 and the Allas data management solution on 2 October 2019. Researchers working in Finnish universities and research institutes may apply for access rights and computing resources on the CSC Customer Portal at https://my.csc.fi.
The software offering in Puhti is currently more limited than in Taito, but new software is being installed almost on a daily basis. The user documentation is also being continuously extended. CSC will also be organizing several training sessions on the use of the environment for both new and experienced users in 2019 - 2020; the first Puhti porting and optimisation workshop has already been held.
CSC supercomputers and superclusters
1989 Cray X-MP
1995 Cray C94
1997 Cray T3E
1998 SGI Origin 2000
2000 IBM SP Power3
2002 IBM p690 Power 4
2007 Cray XT4 (Louhi)
2007 HP Proliant CP400 (Murska)
2012 Cray XC40 (Sisu)
2013 HP Apollo 6000 XL230a/SL230s Supercluster (Taito)
2019 Atos BullSequana X400 (Puhti)
(2020 Atos BullSequana XH2000 (Mahti))
Sebastian von Alfthan and Jussi Enkovaara are high performance computing experts at CSC.
You might have heard news about LUMI, the European pre-exascale computer that will be hosted by CSC. LUMI will be a huge addition to the computational resources available to Finnish researchers from 2021 on, but we will come back to the story of LUMI later on.
If you follow CSC on social media you might have noticed a recent announcement about a new service based on OKD/Kubernetes called Rahti. This new service allows you to run your own software packaged in Docker containers on a shared computing platform. The most typical use case is web applications of all sorts. In this blog post I will provide additional context for the announcement and more detail and examples about what Rahti is and why it’s useful.
CSC has been running cloud computing services for a while. The first pilot systems were built in 2010 so the tenth anniversary of cloud computing at CSC is coming up next year. All of CSC’s previous offerings in this area – cPouta, ePouta and their predecessors – have been Infrastructure as a Service (IaaS) clouds. In this model, users can create their own virtual servers, virtual networks to connect those servers and virtual disks to store persistent data on the servers. This gives you a lot of flexibility as you get to choose your own operating system and what software to run on that operating system and how. The flip side is that after you get your virtual servers, you are on your own in terms of managing their configuration.
Rahti takes a different approach. Instead of a virtual machine, the central concept is an application. The platform itself provides many of the things that you would need to manage yourself in more flexible IaaS environments. For example:
- Scaling up applications by adding replicas
- Autorecovery in case of hardware failures
- Rolling updates for a set of application replicas
- Load balancing of traffic to multiple application replicas
Not having to manage these yourself means you can get your applications up and running faster and don’t have to spend as much time maintaining them. What enables this is standardization of the application container and the application lifecycle. In IaaS clouds you have a lot of choice in terms of how you want to make your application fault tolerant and scalable. There are many software products available that you can install and configure yourself to achieve this. With Rahti and other Kubernetes platforms, there is one standard way. This simplifies things greatly while still providing enough flexibility for most use cases.
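As a rough illustration of two of the features listed above, the following toy simulation shows a rolling update that replaces replicas one at a time and a round-robin load balancer that spreads requests over the replicas. It is only a sketch of the ideas: the function names are hypothetical, and this is not how Rahti or Kubernetes implement them internally.

```python
from itertools import cycle


def rolling_update(replicas, new_version):
    """Replace replicas one at a time, yielding the state after each step.

    Only one replica changes per step, so the others keep serving traffic.
    """
    current = list(replicas)
    for i in range(len(current)):
        current[i] = new_version
        yield list(current)


def round_robin(replicas, n_requests):
    """Assign requests to replica indexes in turn, as a simple load balancer would."""
    targets = cycle(range(len(replicas)))
    return [next(targets) for _ in range(n_requests)]


replicas = ["v1", "v1", "v1"]
steps = list(rolling_update(replicas, "v2"))
assignments = round_robin(replicas, 5)
```

Here `steps` records the mixed v1/v2 states the application passes through during the update, and `assignments` lists which replica would handle each of five incoming requests.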
Based on the description above you might think that Rahti fits into the Platform as a Service (PaaS) service model. While there are many similarities, traditional PaaS platforms have typically been limited in terms of what programming languages, library versions and tools are supported. It says so right in the NIST Definition of Cloud Computing: “The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider.” These limitations are largely not there in Rahti or other Kubernetes platforms: if it runs in a Docker container, it most likely also runs (or can be made to run) in Rahti. You are free to choose your own programming language and related libraries and tooling yourself.
Setting up Spark in Rahti
One of the big benefits of Rahti is that complex distributed applications that would be difficult to install and configure on your own on virtual machines can be packaged into templates and made available for a large number of users. This means figuring out how to run the application has to be done only once – end users can simply take the template, make a few small customizations and quickly get their own instance running. You are of course also free to create your own templates and run your own software.
One example of a distributed application that can be difficult to install and manage is Apache Spark. It is cluster software meant for processing large datasets. While it is relatively simple to install it on a single machine, using it that way would defeat the point of running Spark in the first place: it is meant for tasks that are too big for a single machine to handle. Clustered installations on the other hand mean a lot of additional complications: you need to get the servers to communicate with each other, you need to make sure the configuration of the cluster workers is (and stays) consistent and you need to have some way to scale the cluster up and down depending on the size of your problem – and the list goes on.
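The basic pattern behind this kind of cluster processing is to split the data into partitions, process each partition independently on a worker, and then combine the partial results. The sketch below shows that pattern in plain Python, with threads standing in for cluster workers. It does not use the Spark API, and the function names and sample data are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(items, n_parts):
    """Divide a list into n_parts roughly equal partitions."""
    return [items[i::n_parts] for i in range(n_parts)]


def count_words(lines):
    """Work done independently on one partition."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


lines = ["a b a", "b c", "a c c", "b"]
partitions = split(lines, n_parts=2)

# Threads stand in for cluster workers here.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, partitions))

# Combine the per-partition results into one answer.
total = sum(partial_counts, Counter())
```

In a real cluster the partitions live on different servers, which is exactly where the communication, configuration and scaling complications mentioned above come from.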
Let’s see how one can run Spark in Rahti. The template that we use in Rahti is available on GitHub and the credit for it goes to my colleagues Apurva Nandan and Juha Hulkkonen. And yes, I know that is actually the Hadoop logo.
First select “Apache Spark” from a catalog of applications:
You can also find other useful tools in the catalog such as databases and web servers. After selecting Apache Spark, you’ll get this dialog:
Click next and enter a few basic configuration options. There are many more that you can customize if you scroll down, but most can be left with their default values:
After filling in a name for the cluster, a username and a password, click “Create” and go to the overview page to see the cluster spinning up. After a short wait you’ll see a view like this:
The overview page shows the different components of the Spark cluster: one master, four workers and a Jupyter Notebook as a frontend to the cluster. These run in so-called “pods”, each of which is a collection of one or more containers that share the same IP address. Each worker in the Spark cluster is its own pod, and Rahti distributes the pods across separate servers.
From the overview page you can get information about the status of the cluster, monitor resource usage and add more workers if needed. You can also find a URL to the Jupyter Notebook web interface at the top and if you expand the master pod view you can find a URL to the Spark master web UI. These both use the username and password you specified when creating the cluster.
If you need a more powerful cluster you can scale it up by adding more workers. Expand the worker pod view and click the up arrow next to the number of pods a few times:
You can then follow the link from the overview page to Jupyter Notebook which acts as a frontend for the Spark cluster.
And that’s all there is to it! The process for launching other applications from templates is very similar to the Spark example above. The plan for the future is to add more of these templates to Rahti for various types of software in addition to the ones that are already there.
If you’re interested in learning more about Rahti, you can find info at the Rahti website or you can contact firstname.lastname@example.org.
Photo: Adobe Stock
Recently, the CSC policy for free and open source software was posted without any celebration. It is under our GitHub organization and you can check it out at:
Our toned-down approach stemmed from the fact that not much changed with the adoption of the policy. It pretty much stated the already established approach to endorsing open source software in our daily work. The paths of CSC and open source have crossed from the very beginning, when we were in the happy position to offer the platform for distributing the very first version of the Linux operating system – and were of course early adopters of Linux in our operations.
CSC is a non-profit state enterprise embracing free and open source software throughout its operations and development. For us, open source software together with open data and open interfaces are the essential building blocks of sustainable digital ecosystems. CSC employees haven’t been shy about using and producing open source, but we still wanted to codify the current de facto practices and to encourage employees to go on supporting open source.
The major decision when formulating the policy was to put special emphasis on collaboration. We’ve been involved in dozens of open source projects and seen the realities of community building efforts. Community building is hard work.
The policy aims to promote practices that best encourage collaboration and contribution within the open source community. We find that the best way to do this is to embrace the licensing practices of the surrounding community. For some types of applications this might mean GPL licensing, whereas the norm has increasingly been to use permissive licenses and not to enforce contributor agreements.
We have been happy contributors to projects such as OpenStack and have been delighted to also be on the receiving end as the main developers of software such as Elmer and Chipster. Every contribution counts, and even the smallest ones usually carry some expertise or insight that broadens the scope of the project.
Finally, the policy aims to be concise and practical. It should offer guidance for the everyday working life of CSC people who are part of the large open source community. So we did not want to make it a monolithic document written in legal language that would have been foreign to almost all of the developers in the community.
P.S. If you would like to use the policy or parts of it for your organization or project, please do so! It is licensed under CC-BY 4.0, so you are free to reuse it as long as you give appropriate credit. Obviously, this is the licensing recommendation for documentation we give in the policy!
Photo: Adobe Stock
Our trusted workhorse Sisu is ending its duty this month after a respectable run of almost seven years.
Sisu started its service in the autumn of 2012 as a modest 245 Tflop/s system featuring 8-core Intel Sandy Bridge CPUs. It reached its full size in July 2014 with a processor upgrade to 12-core Intel Haswell CPUs and an increase in the number of cabinets from 4 to 9. The final configuration totalled 1688 nodes and 1700 Tflop/s theoretical performance. At best, it was ranked the 37th fastest supercomputer in the world (Top500 November 2014 edition). It remained among the 100 fastest systems in the world for three years, dropping to position #107 in the November 2017 list.
Throughout its service, Sisu proved itself as a very stable and performant system. The only major downtime took place when there was a major disaster that took down the shared Lustre filesystem.
Over the years, Sisu provided over 1.7 billion core hours for Finnish researchers, playing a major role in several success stories in scientific computing in Finland. Just a couple of examples:
- All simulations in the work by Miguel Garo et al. that explained the debated growth mechanism of an amorphous carbon material were performed on Sisu.
- Theoretical calculations for a feat of strength in atom manipulation on an insulated surface were performed with Sisu. These were a collaboration between a Swiss group, a Japanese group and the group of Adam Foster in Aalto University.
- Another cool piece of nanotechnology used simulations to find ways to synthesize cube-shaped nanoparticles; the work was carried out on Sisu and presented by Flyura Djurabekova (University of Helsinki).
- Sisu was involved in an exhaustive benchmarking and comparison study of density-functional theory methods and implementations, with Finnish collaborator Torbjörn Björkman (Åbo Akademi University).
In addition to being a highly utilized and useful Tier-1 resource, Sisu acted as a stepping stone for several projects that obtained highly competitive PRACE Tier-0 access on the Piz Daint system in Switzerland and on other large European supercomputers. Without a credible national Tier-1 resource, establishing the skills and capacities for using Tier-0 resources would be hard if not impossible.
Sisu also spearheaded several technical solutions. It was among the first Cray XC supercomputers in the world with the new Aries interconnect. In the second phase it was equipped with Intel’s Haswell processors weeks before they had been officially released. It also heralded a change in hosting for CSC. Instead of the machine being placed in Espoo in conjunction with the CSC offices, it was located in an old papermill in Kajaani. This change has brought major environmental and cost benefits, and has been the foundation for hosting much larger machines.
Sisu was the fastest computer in Finland throughout its career, until last month when CSC’s new cluster system Puhti took over the title. Puhti will be complemented by the end of this year by Sisu’s direct successor Mahti, which will again hold the crown for some time. Puhti is currently in pilot use and will become generally available during August, with Mahti following at the beginning of next year. Sisu has now done its duty, and we wish it a happy retirement. Hats off!
Modern next-generation sequencing technologies have revolutionized research on genetic variants, the understanding of which holds great promise for finding therapeutic targets for human diseases. Many human diseases, such as cystic fibrosis, sickle cell disease and various kinds of cancer, are known to be caused by genetic mutations. Identifying such mutations helps us diagnose diseases and discover new drug targets. Other relevant research topics include the history of human population separation, the origin of species, and animal and plant breeding.
Variant calling refers to the process of identifying variants from sequence data. There are mainly four kinds of variants: Single Nucleotide Polymorphism (SNP), short Insertion or deletion (Indel), Copy Number Variation (CNV) and Structural Variant (SV) (Figure 1).
Figure 1 The four most common types of variants.
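For the two short variant types, the distinction can be read directly from the reference and alternate alleles. The sketch below assumes VCF-style allele strings, where an indel is written with a shared leading base; the function name is hypothetical. CNVs and SVs cannot be recognised this way, as they need other evidence, so they are left out.

```python
def classify_variant(ref, alt):
    """Classify a short variant from its reference and alternate alleles."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP" if ref != alt else "no variant"
    if len(alt) > len(ref):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"  # same length, several bases (e.g. a multi-nucleotide change)


examples = [("A", "G"), ("A", "ATT"), ("ACG", "A")]
labels = [classify_variant(ref, alt) for ref, alt in examples]
```

For the three example allele pairs, `labels` contains one SNP, one insertion and one deletion, in that order.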
Industry gold-standard for variant calling: GATK and Best Practices
To offer a highly accurate and repeatable variant calling process, the Broad Institute developed a set of variant calling tools and a step-by-step protocol for using them: the Genome Analysis Toolkit (GATK) and the Best Practices.
GATK is a multiplatform-capable toolset focusing on variant discovery and genotyping. It contains the GATK variant caller itself and also bundles other genetic analysis tools such as Picard. It comes with a well-established ecosystem that lets it perform multiple tasks related to variant calling, such as quality control, variation detection, variant filtering and annotation. GATK was originally designed for, and is most suitable for, germline short variant discovery (SNPs and Indels) in human genome data generated by Illumina sequencers. However, the Broad Institute keeps developing its functions. GATK now also works for detecting copy number variation and structural variation, for both germline and somatic variant discovery, and for genome data from other organisms and other sequencing technologies.
Figure 2 The GATK variant calling process.
GATK Best Practices is a set of reads-to-variants workflows used at the Broad Institute. At present, Best Practices contains 6 workflows: Data Pre-processing, Germline SNPs and Indels, Somatic SNVs and Indels, RNAseq SNPs and Indels, Germline CNVs and Somatic CNVs. (You can check the Best Practices introduction on the forum and the code on GitHub.)
Although the workflows differ slightly from one another, they all share three main steps: data pre-processing, variant discovery, and additional steps such as variant filtering and annotation. (1) Data pre-processing is the starting step for all Best Practices workflows. It turns raw FASTQ or unmapped BAM files into analysis-ready BAM files, which are aligned to the reference genome, duplicate-marked and sorted. (2) Variant discovery is the key step of variant calling. It turns analysis-ready BAM files into variant calls in VCF format or other structured text-based formats. (3) Additional steps are not necessary for all workflows, and they are tailored to the requirements of the downstream analysis of each workflow. Variant filtering and annotation are the two common choices.
GATK pipelining solution: WDL and Cromwell
Having scripts that run analysis pipelines automatically is a great time saver. In the past, people used Perl or Scala for this, but these have a steep learning curve for people without an IT background. The Broad Institute addressed this problem by introducing a new open source workflow description language, WDL. With a WDL script, you can define tasks and chain them into your own workflow using simple syntax and human-readable logic. WDL is simple but powerful: it includes advanced features and control components for parallelism and for controlling running time and memory. WDL is also a cross-platform language that can be run both locally and in the cloud.
Cromwell is the execution engine for WDL. It is written in Java and supports three types of platforms: a local machine, a local cluster or compute farm accessed via a job scheduler, or the cloud. Its basic runtime environment is Java 8.
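As a small sketch of what launching a workflow looks like in practice, the helper below builds the command line for Cromwell's single-workflow `run` mode. The helper name and the file names are hypothetical placeholders, and the sketch only constructs the command rather than executing it.

```python
import shlex


def cromwell_run_command(cromwell_jar, workflow_wdl, inputs_json=None):
    """Build the argument list for running one workflow with Cromwell's run mode."""
    command = ["java", "-jar", cromwell_jar, "run", workflow_wdl]
    if inputs_json is not None:
        command += ["--inputs", inputs_json]
    return command


# Hypothetical file names, for illustration only.
command = cromwell_run_command("cromwell.jar", "workflow.wdl", "inputs.json")
print(shlex.join(command))
```

To actually start the workflow, the list could be passed to `subprocess.run(command, check=True)` on a machine where Java and the Cromwell jar are available.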
Write and run your own WDL script in 5 minutes with this quick start guide.
Run GATK4 on CSC Pouta Cloud and Taito
GATK3 was the most widely used version in the past. GATK4 now takes advantage of machine learning algorithms and Apache Spark technology to offer higher speed and accuracy, parallelization, and optimization for cloud infrastructure.
The recommended way to run GATK Best Practices is to combine GATK4, WDL scripts, the Cromwell execution engine and Docker containers. At CSC, Best Practices workflows are written in WDL and run by Cromwell on the Pouta cloud, and related tools such as GATK4, SAMtools and Python are called as Docker images to simplify the configuration of the software environment.
CSC provides a large amount of free computing and storage resources for academic use in Finland and facilitates efficient data transfer among its multiple computing platforms. cPouta and ePouta are the open shell IaaS cloud services at CSC. cPouta is the main production public cloud, while ePouta is a private cloud suitable for sensitive data. Both offer multiple virtual machine flavors, a programmable API and a web UI, which let users create and control their virtual machines online easily. They are suitable for various kinds of computational workloads, whether HPC or genetic computing.
At CSC, the GATK4 Best Practices workflow for germline SNP and Indel variant discovery has been optimized and benchmarked on Pouta virtual machines (FASTQ, uBAM and GVCF files are accepted as input). The workflow for somatic SNV and Indel variant discovery is coming soon.
Besides using cloud infrastructure for GATK by launching a virtual machine in Pouta with this tutorial, one can also use GATK in a supercomputing cluster environment (e.g. on Taito with tutorial) by loading the GATK module as below:
module load gatk-env
You are welcome to test the GATK tools in the CSC environment. Our CSC experts are glad to help you optimize running parameters, set up the virtual machine environment, estimate sample processing times and find solutions for common error messages.
Photo: Adobe Stock
In recent years, sensitive data has become one of the hottest topics in the Finnish discussion on scientific data management, not least thanks to the European General Data Protection Regulation. At the same time, for nearly five years now, CSC has provided the ePouta cloud platform for all sensitive data processing needs, with quite substantial computing and storage capacity. From the ground up, this virtual private IaaS cloud solution has been designed to meet the national requirements for IT systems handling protection level III (ST III) data.
While ePouta has been successful in providing our institutional customers with a safe and robust platform for their sensitive data processing, it has lately become very clear that something more is desperately needed: something which is more easily adopted and accessed, something for individual researchers and research groups, and something more collaborative.
Here a problem arises. By definition, sensitive data contains information which should only be processed with explicit consent or legitimate permission, and there are certain rules for such processing. Probably the most notable of those rules, from the researchers’ perspective, are the requirements for data minimisation, pseudonymisation, encryption, safe processing and data disposal after use.
Data minimisation and pseudonymisation relate directly to dataset definition. Minimisation means that only the data that is absolutely needed should be processed. For example, if the dataset includes information about persons' age but that information is not needed for the research, it should not be included in the dataset and should be removed from it before processing.
Pseudonymisation is a de-identification procedure by which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms.
Pseudonymisation differs from anonymisation in that pseudonymised data can be restored to its original state with the addition of information which allows individuals to be re-identified. Such re-identification codes must be kept separate from the pseudonymised data. Clearly then, these topics are something that the data owner or the researcher should take care of, but the rest are more technical matters and something CSC should help with. And this is exactly where our sensitive data services step in.
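To make the two dataset-definition steps concrete, here is a minimal Python sketch. It is only an illustration, not how any CSC service does it: the function names `minimise` and `pseudonymise`, the record fields and the `P-` prefix are all hypothetical. It drops fields that are not needed and replaces a direct identifier with a random pseudonym, returning the re-identification table separately so that it can be stored apart from the data.

```python
import secrets


def minimise(record, needed_fields):
    """Keep only the fields the research actually needs."""
    return {k: v for k, v in record.items() if k in needed_fields}


def pseudonymise(records, id_field):
    """Replace the identifier field with a random pseudonym.

    Returns the pseudonymised records and a separate re-identification
    table (pseudonym -> original value) that must be stored elsewhere.
    """
    key_table = {}  # pseudonym -> original identifier
    assigned = {}   # original identifier -> pseudonym
    result = []
    for rec in records:
        original = rec[id_field]
        if original not in assigned:
            pseudonym = "P-" + secrets.token_hex(8)
            assigned[original] = pseudonym
            key_table[pseudonym] = original
        result.append({**rec, id_field: assigned[original]})
    return result, key_table


# Hypothetical example records; "age" is not needed for the study.
records = [
    {"name": "Alice Example", "age": 34, "diagnosis": "A"},
    {"name": "Bob Example", "age": 51, "diagnosis": "B"},
    {"name": "Alice Example", "age": 34, "diagnosis": "C"},
]
minimal = [minimise(r, {"name", "diagnosis"}) for r in records]
pseudo, key_table = pseudonymise(minimal, "name")
```

The same person gets the same pseudonym in every record, so the data stays usable for research, while re-identification is possible only for whoever holds `key_table`.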
You know the rules and so do I
The centrepiece of the sensitive data services is storage. The data should be stored in such a way that unauthorised access is virtually impossible, yet at the same time legitimate access is as easy as possible. Furthermore, the data should not disappear, become corrupted, or leak out while being stored and used. Data owners should be able to easily store their sensitive data and share it with only those users they grant permissions to.
CSC’s Sensitive Data Archive service is designed to fulfil all the requirements mentioned above and even some more. Instead of providing just regular storage space the new Sensitive Data Archive adds a service layer between the storage and the user applications. This service layer, called Data Access API, takes care of encryption and decryption of data on behalf of the user, which also offloads the encryption key management tasks from users.
Furthermore, the Data Access API ensures that the secured data is visible and accessible only to those users who have been granted access to it by the data owner. The processing environment, access mechanism and the sensitive data storage are all logically and physically separated from each other in order to ensure maximum security. This also makes the sensitive data platform flexible, since compute and storage are not dependent on each other, but the glue between them still makes it seamless and transparent for the user.
Take my hand, we’re off to secure data land
So, how does it work for the user then? Let’s first assume that the dataset a user is interested in has already been stored in the Sensitive Data Archive. The data is safely stored and it is findable by its public metadata, but it is by no means accessible at this point: the user needs permission for the dataset she needs for her research. Instead of sending a traditional paper application to the dataset owner, she will apply through a web portal to a Resource Entitlement Management System, REMS, which will circulate the application to the data owner(s). Once the application has been accepted, a digital access token will be created, which is comparable to a passport and visa granting entry into a foreign country.
Now, when logging in to a sensitive data processing system, this digital access token will be transparently passed along with the login information to the compute system. The Sensitive Data Archive’s Data Access API will query the token and, based on the information in it, will present the dataset in a read-only mount point on the local file system. Even though the files seem just like regular files on your file system, they are actually a virtual presentation of the actual files. Not a single file has been copied into the compute system, yet they are accessible like any regular file. When a file operation is performed on a dataset file, the Data Access API will fetch just the requested bits from the storage, decrypt them and hand them to the process requesting them, just like any other operating system call to any other file.
One added benefit of using access tokens is that they have a validity period, and they can be revoked by the data owner at any given time. Once the token expires, the Data Access API will cut off access to the files; they simply disappear from the compute system in a puff. The validity period can also be easily extended. Thus, the data owner retains full control over the data she stored in the Sensitive Data Archive.
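The token logic described above can be illustrated with a small Python sketch. It is a toy model only: the `AccessToken` class and the `visible_datasets` function are hypothetical names invented for this example and do not reflect the actual REMS or Data Access API implementation. A dataset is shown to a user only while a token for it is both unexpired and not revoked.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class AccessToken:
    """Hypothetical stand-in for a granted access permission."""
    user: str
    dataset: str
    expires_at: datetime
    revoked: bool = False

    def is_valid(self, now):
        # Valid only if not revoked and still within the validity period.
        return not self.revoked and now < self.expires_at


def visible_datasets(tokens, user, now):
    """Datasets that would be presented to the user at time `now`."""
    return sorted({t.dataset for t in tokens
                   if t.user == user and t.is_valid(now)})


now = datetime(2019, 4, 1, tzinfo=timezone.utc)
tokens = [
    AccessToken("researcher", "dataset-A", now + timedelta(days=30)),
    AccessToken("researcher", "dataset-B", now - timedelta(days=1)),  # expired
    AccessToken("researcher", "dataset-C", now + timedelta(days=30),
                revoked=True),                                        # revoked
]
currently_visible = visible_datasets(tokens, "researcher", now)

# The data owner extends the validity period of the expired token.
tokens[1].expires_at = now + timedelta(days=7)
after_extension = visible_datasets(tokens, "researcher", now)
```

Because validity is checked at access time, extending or revoking a token takes effect without touching the stored data itself.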
For the data owner, the procedure for storing the data is, if possible, even simpler. You just need to define metadata for your dataset, enter it (either manually or automatically through an API) into REMS and then upload your data. The upload tool will encrypt the data and send it to the archive, which will re-encrypt the data such that it truly is secure. Even you, as a data owner and submitter, are not able to read it back without first granting yourself permission and using the Data Access API on our sensitive data compute systems.
Something old, something new, something browser’ed
So far so good, but the complaint has always been that ePouta is too inflexible for individuals and smaller research groups. The good news is that the Data Access API has been successfully demonstrated in ePouta and it will become a full-blown service later this year.
But even better news is that along with that there will be a whole new service for ePouta: a remote desktop connection for individual users.
Each user, or a group of users if that’s the case, will get their very own private virtual cloud resource with the Data Access API. And the best part is that it does not require any client software installation on the users’ end. A reasonably modern web browser is enough; even a smartphone’s browser is sufficient (I have tested it and it works, even on 4G, but really, it is close to useless on such a small screen with touch input only).
Are we there yet?
While we haven’t yet figured out how the project model will work, how users can install the software they need (it is ePouta without external connections), or some other pretty important details of the service processes, the technology is already there and becoming mature and robust enough that we’re confident in saying that ePouta Remote Desktop will be a publicly available service later this year.
The end credits (which no one reads)
Early on, with much planning put into our sensitive data model, we realised that it is vital that we do not just develop a fancy new platform and then try to make everyone use it. Instead, we tried to team up and collaborate with partners with similar ambitions, and focused on making the service as flexible as possible and using open standards as much as possible.
Developed in a joint effort with the Nordic e-Infrastructure Collaboration’s (NeIC) Tryggve project and the Centre for Genomic Regulation (CRG), the Data Access API is part of the Federated EGA concept designed to provide general, distributed and secure storage for genomic data alongside the European Genome-Phenome Archive (EGA). But while genomic data has been the driving factor, the API is actually data type agnostic and works for any data type, e.g. text, binary, video, etc.
In our future dreams anyone could install the Sensitive Data Archive and host their sensitive data by themselves but still make it available for access in ePouta Remote Desktop — something we’ve already tested with our Swedish partners, accessing two separate datasets stored in Finland and Sweden, used in ePouta Remote Desktop with a mobile phone at Oslo Airport…
Image: Adobe Stock
March has been the month for the Spring School in Computational Chemistry for the last 8 years. This time the school was overbooked already in November, so if you want to join next year, register early.
Consequently, we decided to accept more participants than before, resulting in tight seating and parallel sessions also for the hands-on exercises on the last day of the School. 31 researchers from Europe and beyond spent four science-packed days in occasionally sunny Finland.
Three paradigms in three days
The foundations of the school - the introductory lectures and hands-on exercises of (classical) molecular dynamics and electronic structure theory - have been consistently liked and found useful and have formed the core with small improvements.
For the last four years we've integrated the latest research paradigm, i.e. data driven science, also known as machine learning (ML), into the mix. This approach has been welcomed by the participants, in particular as the lectures and hands-on exercises given by Dr. Filippo Federici Canova from Aalto University have been tailored for computational chemistry and cover multiple approaches to modelling data. ML is becoming increasingly relevant, as one of the participants, Mikael Jumppanen, noted in his flash talk, quoting another presentation from last year: "Machine learning will not replace chemists, but chemists who don't understand machine learning will be replaced."
The ML day culminated in the sauna lecture given by Prof. Patrick Rinke from Aalto University. He pitted humans against different artificial intelligence "personalities". The competition was fierce, but we humans prevailed by a small margin - partly because we were better at haggling over the scoring.
Food for the machines
This year we complemented the ML session with means to help create data to feed the algorithms. Accurate models require a lot of data, and managing hundreds or thousands of calculations quickly becomes tedious.
Marc Jäger from Aalto University introduced the relevant concepts and the pros and cons of using workflows, spiced with the familiar hello world example. It was executed with FireWorks, a workflow manager popular in materials science. Once everyone had succeeded in helloing the world, Marc summarized that "this was probably the most difficult way of getting those words printed". The actual point was that if there is a workflow, or a complete workflow manager, which suits your needs, someone else has done a large part of the scripting work for you and you can focus on the benefits.
Workflow managers of course aren't a silver bullet beneficial in all research, but in case you need to run lots of jobs or linked procedures, automating and managing them with the right tool can increase productivity, document your work and reduce errors.
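The core idea of linked procedures can be sketched in a few lines of plain Python. This is not FireWorks and not the example used at the School; it is a minimal illustration using only the standard library (`graphlib`, Python 3.9 or newer), and the task names and the `run_workflow` helper are made up for this sketch. Each task declares which tasks it depends on, and the runner executes them in a valid order, passing results forward.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, depends_on):
    """Run tasks in dependency order and feed each one its inputs.

    tasks:      name -> callable
    depends_on: name -> tuple of task names whose results it needs
    """
    order = list(TopologicalSorter(depends_on).static_order())
    results = {}
    for name in order:
        inputs = [results[dep] for dep in depends_on.get(name, ())]
        results[name] = tasks[name](*inputs)
    return order, results


# Hypothetical steps of a small computational chemistry workflow.
tasks = {
    "optimise": lambda: "geometry",
    "single_point": lambda geom: f"energy({geom})",
    "frequencies": lambda geom: f"freqs({geom})",
    "report": lambda energy, freqs: f"{energy} + {freqs}",
}
depends_on = {
    "optimise": (),
    "single_point": ("optimise",),
    "frequencies": ("optimise",),
    "report": ("single_point", "frequencies"),
}

order, results = run_workflow(tasks, depends_on)
print(results["report"])
```

A real workflow manager adds what this sketch lacks: persistent bookkeeping, job submission to a queue, retries and a record of what was run, which is where the productivity and documentation benefits come from.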
What to do with the raw data?
How do you make sense of the gigabytes of data produced by HPC simulations? It of course depends on what data you have. The School covered multiple tools to make sense of your data.
Visual inspection is a powerful tool in addition to averages, fluctuations and other numerical comparisons. MD trajectories and optimized conformations were viewed with VMD, electron densities and structures were used to compute bonding descriptors with Multiwfn and NCIPLOT, and a number of Python scripts employing matplotlib for result visualization were given as real-life examples of current tools.
To brute force or not to brute force?
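As a small illustration of the numerical side, the sketch below computes an average, a fluctuation (population standard deviation) and a running average for a time series using only the Python standard library. The function names and the toy data are assumptions made for this example and are not taken from the School's scripts; in practice the resulting numbers would typically be plotted, for example with matplotlib.

```python
from statistics import mean, pstdev


def summarise(series):
    """Return the average and the fluctuation (population std. dev.)."""
    return mean(series), pstdev(series)


def running_average(series, window):
    """Average over a sliding window, useful for spotting drift."""
    return [mean(series[i:i + window])
            for i in range(len(series) - window + 1)]


# Toy data standing in for e.g. an energy recorded along a trajectory.
energies = [1.0, 2.0, 3.0, 4.0, 5.0]
average, fluctuation = summarise(energies)
smoothed = running_average(energies, window=3)
```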
Although computers keep getting faster, brute forcing research problems is not always the right way. In one of the parallel tracks on the last day, Dr. Luca Monticelli built on top of the MD lectures of the first day by presenting 6+1 enhanced sampling techniques to enable proper study of rare events.
The last one, coarse graining, strictly speaking is not an enhanced sampling method, but as it is orders of magnitude faster than atomistic simulations it can be used to equilibrate a system quickly enabling switching to atomistic detail from truly independent configurations.
Posters replaced with flash talks
Previous Spring Schools have included a poster session, where participants could discuss their own research with other participants and lecturers. Posters have helped to discover potential collaborations and new ideas to apply in one’s own research.
There is a lot of potential for collaboration as the School participants come from a highly diverse background as shown in the wordcloud below. The wordcloud is created from the descriptions filled in by the participants at the registration step.
Word Cloud: Scientific background of the participants.
In last year's feedback one participant suggested replacing the poster session with flash talks, which we now did. Each participant was asked to provide one slide introducing their background, skills and scientific interests, and the slides were used in three-minute flash talks to everyone else. The feedback was very positive, so we will likely continue with flash talks in 2020 as well.
Networking with researchers is yet another motivation to participate in the school. Philipp Müller from Tampere University of Technology took the initiative and proposed a LinkedIn group for the participants to keep in contact after the school. This was realized on the spot, and most of the participants have already signed up to the group.
As potential collaborations are discovered, the HPC-Europa3 programme, also presented in the School, can be used to fund 3-13 week long research visits. Or, if you choose your research visit to take place in Finland in March 2020, you could also participate in the School at the same time.
To whom do the participants recommend the School?
For the first time we asked the participants who they think would benefit from participating in the school. The answers range from any undergraduate or postgraduate student in the field to everyone who needs any computational skills. One participant also confessed that spending some time learning elementary Python (as suggested) before the School would have been useful. The computational tools known to the participants at registration are collected in the picture below.
Word Cloud: Computational tools used by the participants.
The feedback also emphasized the quality of the hands-on sessions, social events, and overall organization, while the pace of teaching also drew some criticism. This is understandable, as the School covers a wide range of topics and therefore it is not possible to go very deep into details. Also, as the background of the participants is heterogeneous, some topics are easy for some but new to others. This has been partially mitigated by organizing the hands-on sessions of the first two days in three parallel tracks of different difficulty.
The great majority of the participants were satisfied with all aspects of the school. Our original aim has been to introduce the most important fundamental methods and some selected tools so that the participants are aware of them; when an opportunity to apply them comes, a deeper study will be necessary in any case.
Materials available online
Most of the lectures and hands-on materials are available on the School home page. The hands-on exercises in particular are also suitable for self study - take a look!
More about the topic:
- The Spring School is a PATC training event sponsored by the Implementation Phase of PRACE, which receives funding from the EU’s Horizon 2020 Research and Innovation Programme (2014-2020) under grant agreement 730913.
- Spring School homepage with links to materials: https://events.prace-ri.eu/e/CSC_Spring_School_2019
- PATC training events: www.training.prace-ri.eu
- CSC training and events: www.csc.fi/web/training
- Funding for research visits: www.hpc-europa.eu
CSC develops, integrates and offers high-quality digital services and is committed to good data management. We believe that the future of the world and people will become better as a result of research, education and knowledge management. That's why we promote them to the best of our abilities and develop and provide internationally high-quality digital services. CSC’s strategic goals include enabling world-class data management and computing and maximizing the value of data.
Data is often too important and valuable to be handled carelessly. In their work our customers, especially researchers, are required to adhere to the FAIR data principles and to make their data Findable, Accessible, Interoperable and Re-usable. Furthermore, they need tools to enable proper data citation. This affects us as a service provider and puts expectations on our data management service development.
Our revised data policy and new policy for persistent identifiers support us in achieving our strategic goals and promote the best data management practices. These newly released policies oblige us to undertake appropriate institutional steps to help customers to safeguard the availability, usability and retention of their data and help us assure compliance with all applicable laws and regulations as well as internal requirements with respect to data management. The policy for persistent identifiers (often referred to as PIDs, the most commonly known are probably the DOI and URN identifiers) enables creation and management of globally unique unambiguous identifiers at CSC for our own processes and for those of our customers.
These documents are, in their first versions, mainly written for research dataset management, but as they represent generic level principles of good data management, they are aimed to cover and guide all data and information management at CSC including both customer-owned and CSC-owned data. In addition, these policies are living documents that will be reviewed regularly and revised when needed.