Find our blogs at www.csc.fi/blog.
This site is an archive version and is no longer updated.
Modern next-generation sequencing technologies have revolutionized the research on genetic variants whose understanding hold a greater promise for therapeutic targets of human diseases. Many human diseases, such as cystic fibrosis, sickle cell disease and various kinds of cancers are known to be caused by genetic mutations. The identification of such mutations helps us diagnose diseases and discovery new drug targets. In addition, other relevent research includes topics such as human population separation history, species origin, animal and plant breading research.
Variant calling refers to the process of identifying variants from sequence data. There are mainly four kinds of variants: Single Nucleotide Polymorphism (SNP), short Insertion or deletion (Indel), Copy Number Variation (CNV) and Structural Variant (SV) (Figure 1).
Figure 1 The four most common types of variants.
To offer a high accurate and repeatable variant calling process, Broad Institute developed variant calling tools and its step-by-step protocol, named: Genome Analysis Toolkit (GATK) and Best Practices.
GATK is a multiplatform-capable toolset focusing on variant discovery and genotyping. It contains the GATK variant caller itself and it also bundles other genetic analysis tools like Picard. It comes with a well-established ecosystem that makes it able to perform multiple tasks related to variant calling, such as quality control, variation detection, variant filtering and annotation. GATK was originally designed and most suitable for germline short variant discovery (SNPs and Indels) in human genome data generated from Illumina sequencer. However, Broad Institute keeps developing its functions. Now, GATK also works for searching copy number variation and structure variation, both germline and somatic variants discovery and also genome data from other organisms and other sequencing technologies.
Figure 2 The GATK variant calling process.
Although workflows are slightly different from one another, they all share mainly three steps: data pre-processing, variant discovery and additional steps such as variants filtering and annotation. (1) Data pre-processing is the starting step for all Best Practices workflows. It proceeds raw FASTQ or unmapped BAM files to analysis ready BAM files, which already aligned to reference genome, duplicates marked and sorted. (2) Variant discovery is the key step for variant calling. It proceeds analysis ready BAM files to variant calls in VCF format or other structured text-based formats. (3) Additional steps are not necessary for all workflows and they are tailored for the requirements of different downstream analysis of each workflow. Variants filtering and annotation are the two common choices.
It is great and time saving to have scripts to run analysis pipelines automatically. In the past, people used Perl or Scala to do this. However, it shows steep learning curve for non-IT people. Broad Institute solved this problem by introduced a new open source workflow description language, WDL. By using WDL script, you can easily define tasks and link them orderly to form your own workflow via simple syntax and human understandable logic. WDL is simple but powerful. It contains advanced features and control components for parallelism or running time and memory control. Also, WDL is a cross-platform language which can be ran both locally and on cloud.
Cromwell is the execution engine of WDL, which is written in Java and supports three types of platform: local machine, local cluster or computer farm accessed via a job scheduler or cloud. Its basic running environment is Java 8.
Write and run your own WDL script in 5 minutes with this quick start guide.
GATK3 was the most used version in the past. Now, GATK4 taking advantage of machine learning algorithm and Apache Spark tech presents faster speed, higher accuracy, parallelization and cloud infrastructure optimization.
The recommend way to perform GATK Best Practices is to combine GATK4, WDL script, Cromwell execution engine and Docker container. In CSC, Best Practices workflows are written in WDL, then run by Cromwell on Pouta cloud and relative tools such as GATK4, SAMtools and Python are called as Docker images to simplify software environment configuration.
CSC provides large amount of free computing/storage resources for academic use in Finland and facilitates efficient data transfer among its multiple computing platforms. cPouta and ePouta are the open shell IaaS clouds services at CSC. cPouta is the main production public cloud while ePouta is the private cloud which is suitable for sensitive data. They both own multiple virtual machine flavors, programmable API and Web UI, which enables users to generate and control their virtual machines online easily. They are suitable for various kinds of computational workloads, either HPC or genetic computing load.
In CSC, GATK4 Best Practices germline SNPs and Indels variants discovery workflow has been optimized and performance benchmarked on Pouta virtual machine (FASTQ, uBAM and GVCF files are acceptable input). Somatic SNVs and Indels variants discovery workflow is coming soon.
Besides using cloud infrastructure for GATK via launcing a virtual machine in Pouta with this tutorial, one can also use GATK in supercomputing cluster environment (e.g. on Taito with tutorial) by loading GATK module as below:
module load gatk-env
The detailed usage of instructions can be found in GATK user guide and the materials from the GATK course held in May 2019 at CSC can be found in “Variant analysis with GATK” course page.
You are welcome to test GATK tool in CSC environment and our CSC experts are glad to help you to optimize running parameters, set up virtual machine environment, estimate sample processing time and offer solutions for common error message.
Photo: Adobe Stock