Working together towards a federated European Genome-phenome Archive for publishing and re-using sensitive research data



In January 2022, CSC, as the Finnish ELIXIR Node, completed the first end-to-end test of the Federated EGA (Federated European Genome-phenome Archive). Federated EGA is a network of repositories that will enable sensitive biomedical data to be shared and reused internationally in compliance with the General Data Protection Regulation (GDPR).


This first end-to-end test is a significant milestone for all the organisations involved in the project: the European Bioinformatics Institute (EMBL-EBI) in the UK, the Centre for Genomic Regulation (CRG) in Spain, and the ELIXIR Nodes in Finland, Norway, and Sweden.

Biomedical data management

Sequencing the first human genome was a turning point for genetics and for research in general. It took 13 years and cost about 2.7 billion dollars.

Since then, DNA sequencing has become an integral part of scientific research, personalised medicine, and daily life. The methodologies have significantly improved to the point that it is possible to profile the whole content of a single human cell, and integrate large amounts of genetic, phenotypic, and clinical data in one single study. 

As a result, biomedical data management has evolved and grown more demanding: large amounts of data bring substantial costs for organising, storing, and safeguarding them. Data processing and integration require specific expertise and collaboration across organisations (e.g. academic institutions, biobanks, healthcare providers) and scientific disciplines. In addition, European and national regulations on data privacy and data reuse (e.g. informed consent) protect the research subject. All this needs to be supported by appropriate data infrastructures that facilitate secure data flow and promote interoperability across organisations.

The groundwork

The European Genome-phenome Archive (EGA) is a repository for biomedical data launched in 2008 by the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI), located in the UK. The EGA stores data from about 4 500 research studies submitted by more than 1 000 institutions worldwide. Data held in the EGA can be reused for further analysis or replication in compliance with the consent provided by the data subjects.

Each dataset submitted to EGA is fully managed by the data controller, who approves or denies data reuse based on a data access request. In this context, the data controller is represented by a Data Access Committee nominated for each dataset and composed of data stewards, legal representatives of the academic organisations and biobanks, or the PIs involved in the original study. In addition, each dataset is linked to the policies that specify its conditions of reuse, which have been standardised as machine-readable “barcodes” using the Data Use Ontology (i.e. DUO codes).
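To give a feel for what machine-readable reuse conditions make possible, here is a minimal sketch in Python. It assumes a small subset of DUO permission codes and a deliberately simplified "broader permission covers narrower purpose" rule. The `DUO_BREADTH` table and the `is_compatible` function are hypothetical names introduced for illustration and are not part of EGA or of any DUO tooling. Real DUO matching also involves modifiers and disease terms, and the final decision always rests with the Data Access Committee.

```python
# Illustrative subset of DUO permission codes, ranked from narrowest to broadest.
# ASSUMPTION: this linear ranking is a simplification made for this sketch only.
DUO_BREADTH = {
    "DUO:0000007": 1,  # disease specific research (DS)
    "DUO:0000006": 2,  # health or medical or biomedical research (HMB)
    "DUO:0000042": 3,  # general research use (GRU)
    "DUO:0000004": 4,  # no restriction (NRES)
}


def is_compatible(dataset_code: str, request_code: str) -> bool:
    """Return True if the dataset's permission is at least as broad as the request.

    Unknown codes are treated as incompatible. Disease matching for
    disease-specific datasets is intentionally ignored here.
    """
    try:
        return DUO_BREADTH[dataset_code] >= DUO_BREADTH[request_code]
    except KeyError:
        return False


if __name__ == "__main__":
    # A general-research-use dataset can serve a disease-specific request ...
    print(is_compatible("DUO:0000042", "DUO:0000007"))
    # ... but a disease-specific dataset cannot serve a general request.
    print(is_compatible("DUO:0000007", "DUO:0000042"))
```

A pre-check of this kind could help flag obviously incompatible requests early, while the committee keeps the final say.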

EGA is the central European repository for sensitive research data and, together with the ELIXIR infrastructure, has paved the way for policies governing sensitive data (data access and reuse) and for international standards that promote interoperability.

The federated network

European and national data protection laws aim to protect the data subjects, or patients, who consent to share their personal information for research that aims to improve healthcare. How can a repository comply with the regulations of each specific country? We have successfully demonstrated that this can be achieved through a federated and collaborative network.

Based on its extensive expertise, Central EGA (EMBL-EBI and CRG) provides technical and operational support to a network of repositories in Europe. Each node stores sensitive national data under controlled access, without needing to exchange these data with other Federated EGA nodes or with Central EGA. At the same time, the public information related to a specific study travels internationally and is published via Central EGA. Currently, over 14 countries that are part of the ELIXIR life science infrastructure are engaged in developing the federated model. As mentioned above, Finland, Norway and Sweden have completed their first test.

During the demonstration, we simulated the entire data submission process: preparing the legal agreements, encrypting the sensitive data, and uploading them to the Finnish Federated EGA node at CSC. Finally, the public metadata was shared across borders. Access to the dataset was linked to the Data Access Committee's decision process, and the data release process was finalised with the publication of the dataset and its permanent identifier (accession number) on the Central EGA webpage. The data remain under the control of the Data Access Committee even though the dataset is shared and can be discovered in the federated network.

How is controlled data access managed in a federated network?

Building a federated infrastructure requires deploying services and exchanging specific information (e.g. the identity of the data requester, the policies linked to a particular dataset) between the linked organisations while maintaining high security standards.

For this reason, automating and standardising procedures within and between organisations is essential for secure data access. Important technical components are the standardised machine-readable messages (GA4GH Passports) that allow service providers to establish a researcher's identity, affiliation, and data access permissions when they log in to a specific service. Combined with DUO codes, this information provides a powerful tool to streamline secure data access processes.

For example, for datasets held in the Finnish Federated EGA node, data access requests can be made using a service called SD Apply. This simple web user interface allows the Data Access Committee to easily review data access applications for a specific dataset and approve or deny them. Once access is approved, the applicant can analyse the data via a web browser in a private cloud environment called SD Desktop. SD Desktop is a secure, encapsulated environment where exporting or downloading data requires specific authorisation. With this tool, the original copy of the dataset never leaves the country of origin, and no extra copies are created for each application.
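The permission check described above can be sketched in a few lines of Python. The sketch assumes the visas in a researcher's GA4GH Passport have already been decoded and their signatures verified elsewhere, so it only inspects the claims. The `has_access` function, the dataset identifier, and the issuer URL are hypothetical and do not describe how SD Apply or SD Desktop are implemented. The `ga4gh_visa_v1` claim and the `ControlledAccessGrants` visa type come from the GA4GH Passport specification.

```python
import time


def has_access(visas, dataset_id, now=None):
    """Return True if any unexpired visa grants controlled access to dataset_id.

    ASSUMPTION: `visas` is a list of already-decoded, signature-verified
    visa payloads (plain dicts). Token verification is out of scope here.
    """
    now = time.time() if now is None else now
    for visa in visas:
        claim = visa.get("ga4gh_visa_v1", {})
        if (
            claim.get("type") == "ControlledAccessGrants"
            and claim.get("value") == dataset_id
            and visa.get("exp", 0) > now
        ):
            return True
    return False


if __name__ == "__main__":
    # Hypothetical dataset identifier and issuer, for illustration only.
    dataset = "https://example.org/datasets/EXAMPLE-0001"
    passport = [
        {
            "iss": "https://dac.example.org",
            "sub": "researcher-123",
            "exp": time.time() + 3600,  # valid for one more hour
            "ga4gh_visa_v1": {
                "type": "ControlledAccessGrants",
                "value": dataset,
                "source": "https://dac.example.org",
                "by": "dac",
                "asserted": time.time() - 60,
            },
        }
    ]
    print(has_access(passport, dataset))
    print(has_access(passport, "https://example.org/datasets/OTHER"))
```

Because the grant travels with the researcher's login as a standard claim, a service in one country can honour a decision made by a Data Access Committee in another without the data themselves being moved.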

SD Desktop is part of CSC's Sensitive Data Services for Research and is now available as an open beta version. The SD Apply service is in the pilot phase. We are currently working on the final steps to formalise the release of Federated EGA.

More information:

Contact: (Subject: Sensitive data)




Francesca Morello

The author works as Customer Liaison Officer for the CSC Sensitive Data services office.