11  Pipeline overview

11.1 Overview of variant calling pipeline

The chart below provides a high-level overview of the different steps involved in the processing of genomic data. In the following chapter we will focus on the variant calling pipeline, which deals with:

  • Assessing the quality of sequence reads and filtering if necessary.
  • Aligning the reads to a reference genome - this is also referred to as read mapping.
  • Identifying variants, i.e. where the samples differ from the reference genome - this is known as variant calling.
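To make the three steps above more concrete, here is a minimal sketch that assembles one typical command per step for a single paired-end sample and prints them without running anything. It is an illustration under stated assumptions, not the course's actual pipeline scripts: the file names, the `build_pipeline` helper and the use of FastQC for quality assessment are hypothetical choices, and prerequisites such as indexing the reference genome or adding read-group information (which GATK expects) are left out.

```python
import shlex


def build_pipeline(sample, reference="reference.fasta"):
    """Return shell commands for one paired-end sample (dry run: nothing is executed).

    All file names are hypothetical placeholders chosen for this sketch.
    """
    r1 = f"{sample}_R1.fastq.gz"   # forward reads (assumed naming scheme)
    r2 = f"{sample}_R2.fastq.gz"   # reverse reads (assumed naming scheme)
    bam = f"{sample}.sorted.bam"   # sorted alignment
    vcf = f"{sample}.vcf.gz"       # called variants
    q = shlex.quote                # protect file names with unusual characters
    return [
        # Step 1: assess read quality (FastQC is an assumed tool choice)
        f"fastqc {q(r1)} {q(r2)}",
        # Step 2: map reads to the reference, then sort and index the alignment
        f"bwa mem {q(reference)} {q(r1)} {q(r2)} | samtools sort -o {q(bam)} -",
        f"samtools index {q(bam)}",
        # Step 3: call variants against the reference
        f"gatk HaplotypeCaller -R {q(reference)} -I {q(bam)} -O {q(vcf)}",
    ]


if __name__ == "__main__":
    for command in build_pipeline("sample1"):
        print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without the tools installed, and it shows how each step's output file (fastq, bam, vcf) becomes the next step's input.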

In the diagram below the various tools (e.g., bwa, samtools, gatk) and data formats (fastq, bam, vcf, etc.) are depicted as well. Some of these we have already encountered in the previous chapters. You might also recall the pipeline scripts (stored in ./training/scripts) that were mentioned in the chapter on Unix scripting. These already showcased a number of the different steps involved in the pipeline.

[Figure: Variant calling pipeline]

A simplified overview of the pipeline is shown below:

[Figure: Variant calling pipeline - simplified]

Tip

We will mostly focus on the AmpliSeq pipeline available here as it was introduced by Kattenberg et al. (2022) and Kattenberg et al. (2023).

The whole-genome sequencing parts are largely similar; for these we try to adhere to standard conventions established by examples such as the GATK best practices and MalariaGEN’s genomic databases (2023).

11.2 Additional resources

We have collected a number of related resources below, which have served as an inspiration while preparing our own course:

The video below by Tobias Rausch @ EMBL-EBI (Rausch 2022) also provides an excellent overview of various topics that come up in genomic variant calling, but it is broader in scope than just AmpliSeq or molecular surveillance of parasites: