14 Alignment to reference genome
The next step of our pipeline tries to align the filtered and trimmed reads to the reference genome of the organism we’re studying. It starts from the processed FASTQ
files and produces SAM
and BAM
files, which contain the mapped reads. We encountered these file formats before in Section 6.2.1 and Section 6.4.1, but we will go a bit more in-depth now. Two other important aspects in this part of the pipeline are indexing the reference genome and filtering out PCR duplicates from our alignment.
To make your analyses more reproducible, and to be able to figure out what you did in the (distant) future, it is important to use meaningful file names and to organise all of your data, scripts and accompanying files (like metadata and readme files, see Tip 14.2). During these hands-on sessions, try to already think about naming conventions and a clear hierarchy of your files and directories.
We will write separate scripts for each of the different parts of the pipeline, but at the very end, we can try to merge them into just a handful scripts for each of the major conceptual steps, optionally accompanied by a master script to govern them all.
As a last tip, try to be aware of the types of paths that you use when writing your scripts: absolute or relative ones (Section 4.1).
Relative paths are more portable than absolute ones (which will not work on a different workstation with different file paths), but they can be more prone to errors. Indeed, the relative path is defined based on where call the script from, rather than the location of the script itself, so be mindful of this.1
Whichever type you use, we recommend to define all important files and folders at the top of your scripts as variables. This makes it easier to change them at a later time.
You can already compare the basic and extended versions of some of the example scripts stored in the ./training/scripts
directory to see how you can extend a simple script to make it more robust and reusable (and hopefully clear).
1 An alternative is to make use of relative paths, but in that case the script still needs to be called from a fixed location in your project folders (or you need to make use of this strategy). For even more complex scripts, you might want to look into providing paths as an argument, or to make use of configuration or samplesheets (as is done by more complex workflow manager tools).
14.1 Where to find your reference genomes - bwa mem
We already downloaded a reference genome from PlasmoDB in Section 6.5 (where we introduced the wget
command). These reference genomes allow us to compare the samples we have collected with a representative reference type_of the organism that we’re studying2. For some organisms, multiple reference genomes exist (e.g., because the organism exhibits a lot of genetic variability between different geographic locations), and these references also receive updates from time to time, as the assemblies of complex regions improves (e.g., gaps spanned, repetitive regions mapped, or in the case of Plasmodium, the polymorphic subtelomeric regions).
2 In practice, reference genomes rarely contain the genetic material of a single individual. Instead, they tend to be a mosaic of DNA sequences from different sources. See https://en.wikipedia.org/wiki/Reference_genome.
- Download the most recent releases of the P. falciparum 3D7 and the P. vivax P01 reference genomes and store them in
./data/references
in clearly labelled subdirectories. - Try to inspect the file using
head
andless
.
14.2 Indexing a reference genome
Before we can map our reads to a reference genome, we need to create an index to allow our mapper to more efficiently search through the (usually rather large) FASTA file3. You can think of it like the index or table of contents of a book (if that particular book contained about 23 million characters in the case of Plasmodium falciparum, or 3.1 billion characters for the human reference genome).
3 For a refresher on the FASTA file format, check out Section 5.1.2.1.
The indexing step itself is usually performed by the same tool as the one that does the actual alignment. The index only needs to be created once, and then it can be re-used when mapping each of our FASTQ files. In this course, we will be using bwa
- the Burrows-Wheeler Aligner.
- Use the
bwa index <reference.fasta>
command to create an index for both reference genomes. - Which files were created by the indexing step?
We’d like to mention two important related aspects of (computational) scientific workflows here: collaboration and reproducibility.
Reference genomes and their indices take up a lot of space. In the case of larger genomes, like the human one, it also takes considerable time to produce the index. Fortunately, these files can be re-used for any future analysis - not just by you, but by others as well. That is why we recommend to store these references and their indices in a shared location on your lab’s workstation, so that they become accessible to others who might benefit from them.
To ease collaboration with others, but also with your future self, it is a good idea to keep track of the various steps you’ve performed in your analysis, just like you would do in a lab notebook. Of course the code and scripts that you write should be clearly named (and ideally, elaborated with clear comments), but it also fruitful to note down more nitty-gritty details on how you managed to fix a particular issue or why a specific option was chosen. In the case of reference genomes, we recommend to create a
README.md
file4 to store alongside your reference genomes, which clearly describes where you obtained them, the version/build/release, and perhaps even a code snippet to re-download and checksum them.
As a bonus exercise, try to create such a readme file.
4 .md
stands for markdown, which is a popular plain-text markup language used for these kinds of tasks. You will often see people used markdown or .txt
files to add metadata to collections of files in databases and analysis scripts or workflows on places like GitHub.
14.3 Alignment to the reference genome
During mapping, the aligner tries to find a potential alignment site for each individual read (or query) in the reference genome. After aligning all the reads, we can start looking for variation in our samples compared to the reference genome and each other.
There exist many different tools and algorithms for sequence alignment, each with their own strengths and weaknesses. A few examples of popular ones for DNA sequencing are BWA, bowtie2 and minimap2. The differences between these tools mostly lies in speed vs sensitivity, but also preferred application; i.e. long versus short reads5.
5 You can read the author of BWA and minimap2 thoughts: https://lh3.github.io/2018/04/02/minimap2-and-the-future-of-bwa
Note that there is also an entirely different class of aligners that were developed for RNA-sequencing; these need to be “splice-aware” to handle RNA reads without introns (e.g., STAR), and some of them also make use of some clever tricks for faster quantifications - so-called pseudo-aligners (e.g, Kallisto and Salmon).
We will use BWA as our mapping tool of choice, as is customary in the field of malaria genomics (MalariaGEN et al. 2023). BWA actually comes with three different flavours of aligners, but the most recent and popular one is bwa mem
6.
6 A second version of bwa mem
has also been released, but while it is faster, it is more suited for running on high-performance compute clusters or in the cloud, due to the higher memory requirements.
Figure 14.1 illustrates the mapping process (Hiltemann et al. 2023; Wolff, Batut, and Rasche). Each individual read is matched against the reference. You can see that read 1 has a few mismatches in its alignment, which eventually might turn out to be single nucleotide polymorphisms (SNP). Similarly, for read 2 only the middle part aligns well; bwa mem
performs a local alignment and can clip the ends of the read. The third read shows an example of an insertion and a deletion (indel); the read is longer in one area (GCCA
), but shorter in another (AC(A)TA
).
As we discussed in Tip 13.1, when we are working with paired-end sequencing data, there is additional information available for the aligner to work with. The expected distance between forward and reverse reads that can help to span repetitive regions. For this to work, paired reads are always mapped together, which is why they also need to be provided to the mapping tool as a pair; that is why we tend to give them similar file names (*R1.fastq
/*_R2.fastq
), and the paired reads appear in the same order in each file. The final alignment report usually provides dedicated output on paired reads, letting us know of cases where one of the reads fails to map (unmapped mate) or the pair is not in the expected orientation (i.e., not pointing towards each other, see Tip 14.3) / too far apart (not properly paired) , because this is not something that you’d expect in normal circumstances, since they should be derived from the same fragment of DNA.
The following is based on the excellent write-up on https://www.cureffi.org/2012/12/19/forward-and-reverse-reads-in-paired-end-sequencing/ and https://seekdeep.brown.edu/illumina_paired_info.html.
Recall that during sequencing-by-synthesis, the new strand is always synthesized in the 5’-3’ direction (since DNA polymerase walks along the template strand in the 3’-5’ direction). The newly synthesized strand is what is being read out during the sequencing process, so this is what eventually ends up in our FASTQ files.
For paired-end sequencing, this means eventually results in the reverse read (read 2) being in the opposite orientation of the first read (read 1); it is read from the complementary strand compared to read 1 and also in the opposite direction. Consequently, the mapping software will need to take the reverse complement of read 2, to put it in the same orientation as read 1, before aligning it to the reference genome (or equivalently, try to map it to the other strand of the reference genome).
For Illumina paired-end sequencing, the double-stranded library fragments (= insert + adapter sequences (= anchor + read primer + index + index primer), see Figure 13.3) are first attached to the flow cell, but then the complementary strands (5’-3’ oriented towards the flow cell) are removed. The template strand (3’-5’ oriented towards the flow cell) is then used to synthesize and sequence the forward read in the 5’-3’ direction, starting from read primer 1 (adjacent to the insert). Next, index 1 (I7) is sequenced in a separate reaction (using the i7 index primer). For the reverse reads, the template strand is first used to regenerate its complementary strand (which gets attached to the flow cell), and is then removed. Lastly, the read 2 primer is used to sequence the reverse read, again in the 5’-3’ direction.
If you need a refresher, you can always jump back to one of the resources on the Illumina sequencing chemistry in Chapter 13 like ClevaLab (2022).
When we call bwa mem
without any additional arguments, we get an overview of how to use it:
Usage: bwa mem [options] <reference-genome> <R1.fastq> [R2.fastq]
As you can see, the second read pair file is [optional]
, but when dealing with paired-end read data the two files need to be supplied at the same time. To do this programmatically, it is again important to use concepts like looping over pairs of FASTQ files to do this efficiently (see Tip 13.2).
By default, bwa
will output its results straight into the terminal’s standard out (see Section 7.1). This is very useful once we start chaining different tools together, but for your first attempts you should provide the -o <output_file>
flag to write everything to a file instead.
bwa mem
provides many different options, but we recommend the following ones:
-t <int>
: the number of threads to use. See Tip 14.4.-R "@RG\tID:L001\tPL:ILLUMINA\tSM:${sample}"
: defines read groups, which are tags to identify which reads belong together across flowcells/lanes/library preparations7.${sample}
is a variable containing the sample name of the current read (pair) in a for loop.-Y
: use soft clipping for supplementary alignments.K 100000000
: makes results more deterministic.
7 Reads groups are a confusing subject, and a very deep one as well. In a nutshell, read groups are used to inform downstream tools about which reads were sequenced together in the same flowcell/lane (to differentiate between and account for technical variability between runs, and not just samples) and which reads need to be grouped because they are derived from the same sample (but stored in separate files). Fortunately, they are only required by a few downstream GATK tools, but leaving them out can result in errors. For a nice overview of how to define them, see this example.
Taken together, an example would be:
bwa mem \
-t 2 \
-Y -K 100000000 \
-R "@RG\tID:L001\tPL:ILLUMINA\tSM:PF0080" \
-o "PF0080.sam" \
"${ref}" \
"${fastq_dir}/PF0080_S44_L001_R1_001.fastq.gz" \
"${fastq_dir}/PF0080_S44_L001_R2_001.fastq.gz"
Of course, there exist various other options that can be used to do things like tweak the parameters, scores and thresholds used for local alignment.
bwa mem
- Map a single pair of reads to the reference genome (use the Peruvian P. falciparum AmpliSeq data).
- Open the resulting SAM file using
less
and inspect its contents and structure.
Some of the tasks that we are faced with as bioinformaticians can take a long time to complete, even when using modern hardware. Fortunately, some of these tasks can be parallelized. Which ones? Well, any task that consists of a specific procedure that needs to be applied to many files in the same manner, but for which the different processes are independent of one another. This is sometimes called “embarrassingly parallel” and can be achieved by simply running the different tasks as the same time on multiple threads or compute cores.
You could script this yourself using a tool like parallel
(see here for an example), and it’s also an important aspect of more advanced workflow managers like Nextflow, but fortunately many of the tools that we use also provide built-in multi-threading: bwa mem
, samtools
, trimmomatic
, etc.
When analysing WGS data (but not AmpliSeq) it is important to remove human reads from your FASTQ files prior to mapping them to the Plasmodium reference genome. There are a number of different approaches and dedicated tools to this - assigning reads to different organisms is a crucial step in bacterial metagenomics for example - but one of the most common approaches for malaria genomics is to simply do a two-step mapping process with your alignment software of choice: in the first round all reads are mapped against the human reference genome. Anything that “sticks” is discarded. In the second round, the remaining unmapped reads are mapped to Plasmodium.
When depositing genetic data into public sequencing data repository like NCBI’s SRA or EMBL-EBI’s ENA, only filtered FASTQ files should be used, to avoid any potential issues with regards to human privacy and data protection.
You could try this for yourself as a bonus exercise. See this guide for more info on how to filter out the host/non-host reads using samtools
after the mapping step.
14.4 SAM and BAM files
The output of the alignment step (using bwa mem
) is a SAM file (see Section 6.2.1). This stands for Sequence Alignment Map. These are plain text files delimited with tabs, so you can view them using a text editor or a command like less
.
To save space and make the alignment quicker to index, SAM files are often converted to BAM files (Section 6.4.1), which are compressed binary versions of the alignment. These can be used by downstream tools directly, so we tend to only store the alignment in this format, even though you can no longer inspect the alignment yourself.
0080_S44_L001.bam | head -n30
@HD VN:1.6 SO:coordinate
@SQ SN:Pf3D7_01_v3 LN:640851
@SQ SN:Pf3D7_02_v3 LN:947102
@SQ SN:Pf3D7_03_v3 LN:1067971
@SQ SN:Pf3D7_04_v3 LN:1200490
@SQ SN:Pf3D7_05_v3 LN:1343557
@SQ SN:Pf3D7_06_v3 LN:1418242
@SQ SN:Pf3D7_07_v3 LN:1445207
@SQ SN:Pf3D7_08_v3 LN:1472805
@SQ SN:Pf3D7_09_v3 LN:1541735
@SQ SN:Pf3D7_10_v3 LN:1687656
@SQ SN:Pf3D7_11_v3 LN:2038340
@SQ SN:Pf3D7_12_v3 LN:2271494
@SQ SN:Pf3D7_13_v3 LN:2925236
@SQ SN:Pf3D7_14_v3 LN:3291936
@SQ SN:Pf3D7_API_v3 LN:34250
@SQ SN:Pf3D7_MIT_v3 LN:5967
@RG ID:001 SM:001 PL:ILLUMINA
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem ../reference/PlasmoDB-67_Pfalciparum3D7_Genome.fasta PF0080_S44_L001_R1_001.fastq.gz PF0080_S44_L001_R2_001.fastq.gz -t 4 -k 30 -R @RG\tID:001\tSM:001\tPL:ILLUMINA
@PG ID:samtools PN:samtools PP:bwa VN:1.19.2 CL:samtools sort -@ 4 -o ../../results/bwa/PF0080_S44_L001.bam
@PG ID:samtools.1 PN:samtools PP:samtools VN:1.19.2 CL:samtools view -h training/results/bwa/PF0080_S44_L001.bam
M05795:43:000000000-CFLMP:1:2118:16821:2796 2211 Pf3D7_01_v3 35509 0 84H71M144H = 193095 157838 AAAAAAAAAAAAAAAAAAAAAAAAAAAATTAAAAAAAAAAAAAAAAAAAAACATTAAAAAAAATAAAAAAA FFGFEGGGGGGGGGGGGGGGGGGGGGE33,@FGGBEGGGG**>CE*B**=5*;++23C+9*7C0<CD?+:* NM:i:5 MD:Z:42T4T0T2A2A16 MC:Z:51H73M2D177M AS:i:46 XS:i:45 RG:Z:001 SA:Z:Pf3D7_01_v3,193086,+,112M187S,60,1;Pf3D7_13_v3,2049877,-,41S32M226S,0,0; XA:Z:Pf3D7_05_v3,+529511,87S43M6I19M144S,7;Pf3D7_13_v3,-112651,164S40M95S,0;Pf3D7_12_v3,+497450,90S35M3D19M155S,4;Pf3D7_12_v3,+100906,96S18M1D41M2I11M131S,6;Pf3D7_11_v3,-973273,152S11M1I10M2D34M91S,3;
M05795:43:000000000-CFLMP:1:2105:5392:16351 2211 Pf3D7_01_v3 35510 12 84H70M147H = 193086 157856 AAAAAAAAAAAAAAAAAAAAAAAAAAATTAAAAAAAAAATAAAAAAAAAACATAAAAAAAAAAAAAAAAA AFGGGGGGGGGGGGGGGGGGGGGGG>:,,CFGGCFGGGG,,@DF9EC*>**,,,6<?,**:8/?EEE8=8 NM:i:6 MD:Z:39A1T4T0T2A11T7 MC:Z:80M4D3M1I193M AS:i:40 XS:i:33 RG:Z:001 SA:Z:Pf3D7_01_v3,193086,+,111M190S,60,1; XA:Z:Pf3D7_09_v3,-1458764,146S20M1I30M104S,3;
M05795:43:000000000-CFLMP:1:1114:12861:20875 2179 Pf3D7_01_v3 67176 0 70H30M199H Pf3D7_07_v3 405766 0 TTTTTATTTAAAACACAAAAAAAAAAAAAA ,38,,,,,,,,,8,,,>*,,355C7<@@7@ NM:i:0 MD:Z:30 MC:Z:146M155H AS:i:30 XS:i:0 RG:Z:001 SA:Z:Pf3D7_07_v3,405968,-,242S13M1I43M,0,2;Pf3D7_12_v3,1133014,-,159S38M102S,0,1;
M05795:43:000000000-CFLMP:1:1119:27790:13377 145 Pf3D7_01_v3 109767 0 174S39M88S Pf3D7_07_v3 405766 0 TGTCACACACGTAGTGTTGTGACAGTCGTATCAGTGTATTATTGTACGTCTGTTTGTTTATTATATTTTATTATTGTGTTTTTGTTATTATGTGTTGATGTGTTAACGGAGAGTGTAGAGTAAGAGGAAAGTATTTTAGTATTTTTAATATTTTTAATTTTAAAATTAAAAATTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAATAATTAATTTTTTTTTAAATTTGTTTTAAATTAAATTGGTTTTTTGTTTAAAAAAATTTTTATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT )-).(-(((((()4.()))6).(((4()444.)))46.)6))-((-((.((,.()))))))))6)6)--)))),((4,(((-.))-)..)((),)),)((-)(2.(((,)()))*-***)***9**/***01**2*0*+2***2*20+3*0++++++3++++0++2*+<300+2+**CGEEGEDGGGGGGECGGEE>GGE@,D3,,,,,3,,,,,*3++,,,,,3,,,,,,,,,,8,,@,,3A+3+CCD8AFC,AB=4+,FC,4,,9,:,:@@EGEGGFGGGGGGGGGGGGGGGGGCCCCC NM:i:0 MD:Z:39 MC:Z:128M7I119M45S AS:i:39 XS:i:38 RG:Z:001 SA:Z:Pf3D7_09_v3,683170,-,263S38M,0,0; XA:Z:Pf3D7_05_v3,+166830,86S38M177S,0;Pf3D7_10_v3,+448221,90S38M173S,0;Pf3D7_01_v3,-519133,169S35M97S,0;Pf3D7_07_v3,-399258,171S35M95S,0;Pf3D7_14_v3,-2709009,170S35M96S,0;
M05795:43:000000000-CFLMP:1:2109:24931:7426 2145 Pf3D7_01_v3 134097 0 236H16M3I46M Pf3D7_12_v3 2093554 0 AACAAAACAAAAAAAACATACAAATAAATACAAACAAAAAAAAAAAAAAAAAAAAAAACAAAAAA .*)(,)0(77),<012(4))*)*.).).)))))5((,.(-(-((27((4(-91-(70>(4.334( NM:i:7 MD:Z:17G3C3A31C4 MC:Z:25H276M AS:i:33 XS:i:32 RG:Z:001 SA:Z:Pf3D7_01_v3,193086,+,111M190S,60,2;Pf3D7_06_v3,932942,-,146S29M2I29M95S,8,3; XA:Z:Pf3D7_09_v3,+1284793,269S32M,0;Pf3D7_12_v3,-659692,4S30M267S,0;
M05795:43:000000000-CFLMP:1:2116:28002:9063 2211 Pf3D7_01_v3 134127 18 242H33M26H = 193113 59239 ACAAAAAAAAAAAAAAAAAAAAAAACACAAAAA .()).(4>B9(((-7(39>91-341(3(,,6(0 NM:i:0 MD:Z:33 MC:Z:27H53M4D3M1I193M AS:i:33 XS:i:0 RG:Z:001 SA:Z:Pf3D7_01_v3,193086,+,110M191S,60,0;Pf3D7_12_v3,448676,-,138S27M1I22M3I24M86S,0,7;
M05795:43:000000000-CFLMP:1:1113:17834:11827 2161 Pf3D7_01_v3 137459 4 56H74M171H = 196019 58732 TTATTGTTTTTTTTTTTATCGTTATTGTTTTTGTTATTATTATTATTGTTTTTGTTATTATTATTATTGTTTTT 9),<:2>EEB@<?;+2(0))*;+*;A987GA=;7;+++31++*7F<;**GGF?*:=,++,,+1=,>5F?8EE@; NM:i:4 MD:Z:11A2A4T0A53 MC:Z:56H245M AS:i:54 XS:i:50 RG:Z:001 SA:Z:Pf3D7_08_v3,1374826,+,41M9D29M231S,6,9; XA:Z:Pf3D7_01_v3,-137459,77S80M144S,6;Pf3D7_01_v3,+196068,123S99M79S,10;
M05795:43:000000000-CFLMP:1:1111:24375:10578 99 Pf3D7_01_v3 139572 0 84S32M185S = 139572 32 CTTGAACCCAGGAGGTAGAGGTTGCAGTGAGCCGAGATCGCACCCCTGCACTCCAGCCTGGGGGACAGAGTGAGACTCTGTCTCAAAATAAATAAATAAATAAATAAACAAATAAAAGATAACTGTTTAATTAACAACAACGTAATAGATAGTGTGTGGCACTGATACCACCTGTAAAAATAAAATGAACAACGGTGTAAAGACCAAGAGGAGAGAAATGGAAGTGTCCTATTGTAAGCTTCTCTTACTCTACGAAAAATGGTATATTACTTAAGAATAAACTATGATAATTAAGACGGCA CCCCCFGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCEGGGGGGGGGGGGGGGGGGGGCGGEGGGGGGGGGGDC3;DGGGGGGGGGGGG7FD9BFGGGFFFAA;>FB@FEFFBBCFCEFF>0>B>AEFEEF:5=B59FDAEFFFECEFFFCFF7<64CC4=.75)7797<)9CE))(-4- NM:i:0 MD:Z:32 MC:Z:80S32M188S AS:i:32 XS:i:32 RG:Z:001 XA:Z:Pf3D7_06_v3,+225165,84S32M185S,0; M05795:43:000000000-CFLMP:1:1111:24375:10578 147 Pf3D7_01_v3 139572 0 80S32M188S = 139572 -32AACCCAGGAGGTAGGGGTTGCAGTGAGCCGAGATCGCACCCCTGCAGTCCAGCCGGGGGGACAGAGTGAGACTCTGTCTCAAAATAAATAAATAAATAAATAAACAAATAAAAGATAACTGTTTAATTAACAACAACGTAATAGATAGTGTGTGGCACTGATACCACCTGTAAAAATAAAATGAACAACGGTGTAAAGACCAAGAGGAGAGAAATGGAAGTGTCCTATTGTAAGCTTCTCTTACTCTAAGAAAAATGGTATATTACTTAAGAATAAACTATGATAATTAAGGCTGCATTT 4,(563?;3@2;:6(:4)E7.);)5(96-0(FD>2==@:252>7)85DA8=1):):DBEED>8:*>;;::D:53:5:)@7FEDEDDEFCDE9>EEEC;7ED:099;9;+>D=C?C?96B?D7D9EBGGGD8GGFFGGCGGGGGEF8@@,GGGGGGGGGDGFFC6GGGGGGGGFGGGEGGGGGGGGCEGGGGGGCGGGGGGGGGF:GGGGGGFCGGGDGFFDGFGFGGGGGGGGGFGGGFGGGFGGGGGGEGFE9GFGGGGGGGGFFFFGGGGGGGGGFFGGFE<GFAGGGGGGGGCCCCC NM:i:0 MD:Z:32 MC:Z:84S32M185S AS:i:32 XS:i:32 RG:Z:001 XA:Z:Pf3D7_06_v3,-225165,80S32M188S,0;
Aside from the header info, every other line in the SAM file corresponds to a particular read and provides information on where in the reference genome it was mapped and the quality of this mapping8, alongside various other metadata (including the original Phred quality scores). This information is stored in tab-delimited fields (11 mandatory ones, followed by optional ones).
Col | Field | Type | Brief description |
---|---|---|---|
1 | QNAME | String | Query template NAME |
2 | FLAG | Int | bitwise FLAG |
3 | RNAME | String | References sequence NAME |
4 | POS | Int | 1- based leftmost mapping POSition |
5 | MAPQ | Int | MAPping Quality |
6 | CIGAR | String | CIGAR string |
7 | RNEXT | String | Ref. name of the mate/next read |
8 | PNEXT | Int | Position of the mate/next read |
9 | TLEN | Int | observed Template LENgth |
10 | SEQ | String | segment SEQuence |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
All details on the SAM format can be found in the format specification: https://samtools.github.io/hts-specs/SAMv1.pdf.9
One important column is the bitwise FLAG: it is a lookup code that can tell you (or more commonly, a software tool) whether a read was aligned, how it is oriented, if its pair was aligned, etc.
Bit | Description | |
---|---|---|
1 | 0x1 | template having multiple segments in sequencing |
2 | 0x2 | each segment properly aligned according to the aligner |
4 | 0x4 | segment unmapped |
8 | 0x8 | next segment in the template unmapped |
16 | 0x10 | SEQ being reverse complemented |
32 | 0x20 | SEQ of the next segment in the template being reverse complemented |
64 | 0x40 | the first segment in the template |
128 | 0x80 | the last segment in the template |
256 | 0x100 | secondary alignment |
512 | 0x200 | not passing filters, such as platform/vendor quality controls |
1024 | 0x400 | PCR or optical duplicate |
2048 | 0x800 | supplementary alignment |
Lastly, we want to share this convenient tool to explain the different bitwise FLAGs. It is invaluable whenever you need to perform (or interpret) a samtools
filtering operation.
9 A more in-depth exploration of the different type of alignments that can occur (e.g., supplementary, secondary, etc.) is given in this blogpost.
10 An amazing guide to several samtools
tasks can be found on the Learning the BAM format page by Dave Tang.
To convert a SAM to a BAM file, you can use the samtools software. samtools
is used to convert between SAM and BAM files, but it can perform many other tasks too, like coordinate sorting, which we will talk about first.10
14.5 Coordinate sorting alignment files
The reads in a SAM/BAM file can be sorted in different ways. For our needs (variant calling using GATK), we will need to sort them according to their location on the reference genome: a coordinate-ordered format.
The neat thing is that samtools sort
will automatically do the conversion from SAM to BAM for us, so we can perform both steps in one go.
samtools sort -@ <number_of_threads> -o <output.bam> <input.sam>
Can we make things even more streamlined and combine the bwa
mapping and samtools
sorting steps? Turns out we can!
Recall what you learned in Section 7.1 about piping the output of one command into the input of another using the |
operator. We can make use of this to feed the output of bwa mem
directly into samtools
. This allows us to forego the need to write the output to any intermediate files.11
The only difference compared to before is that we now remove the -o output.sam
option from the bwa mem
command and that we do not provide an explicit input file to samtools
(it is read from stdout
directly instead).12
bwa mem <arguments> | samtools sort -o alignment.sort.bam
For clarity we use the sorted.bam
file suffix for any sorted BAM files.
12 One final thing we could do is use >
to write the output of samtools
to a file directly, instead of using the -o
flag.
11 This works out of the box, but you might encounter the syntax command_1 | command_2 -
, where the hyphen -
is a way of explicitly telling the second command to look at stdin
for its input. For many tools, including samtools
, this is not necessary however.
Align the same FASTQ file as you did in the previous exercise, but immediately pipe the
bwa mem
output tosamtools
for coordinate sorting and conversion to a.bam
file.Compare the size of the original
.sam
and the new.bam
file (du -sh
).Modify your mapping script to immediately perform compression and sorting using the
samtools sort
command, without producing any intermediate files.
14.6 Inspecting BAM files
You can read BAM files by using the samtools view
command. By default it only outputs the tab-delimited parts of the file, but you can provide the flags -H or
-h` to optionally show only the header or everything.
Another useful command is samtools flagstat
, which reports various mapping statistics on how well the reads aligned to your reference genome:
samtools flagstat training/results/bwa/PF0097_S43_L001.bam
104559 + 0 in total (QC-passed reads + QC-failed reads)
95962 + 0 primary
0 + 0 secondary
8597 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
103657 + 0 mapped (99.14% : N/A)
95060 + 0 primary mapped (99.06% : N/A)
95962 + 0 paired in sequencing
47981 + 0 read1
47981 + 0 read2
89790 + 0 properly paired (93.57% : N/A)
94736 + 0 with itself and mate mapped
324 + 0 singletons (0.34% : N/A)
4246 + 0 with mate mapped to a different chr 3834 + 0 with mate mapped to a different chr (mapQ>=5)
Explore the output of the
samtools flagstat
command. What do the different statistics mean?How can you store the mapping statistics in a file? What do you need to take into consideration when doing this in a for loop for multiple bam files?
Try to write a script for processing all FASTQ files belonging to the Peruvian P. falciparum AmpliSeq data in one go. You can inspect the provided scripts
map.sh
andmap-extended.sh
for inspiration, especially with regards to file paths and structuring your code.
Bonus exercise: do the same for the Vietnam P. vivax WGS data.
When you’re just starting out, it is fine to use simple scripts using just the concepts that you know. As you become more experienced though, and as your scripts might need to be used by others and re-use it for a longer time, you should try to focus on making them more readable, robust and efficient. For the former, focus on doing things in a clear and structured way; use code comments, clear file names, split your scripts into logical chunks when necessary, provide READMEs, etc. Writing robust and efficient scripts is more like a never-ending journey, where you can pick up new techniques and ideas even after a lot of experience, and it might extend beyond the use of just bash, but into other languages or even workflow managers. You can compare the normal and the extended versions of the scripts in ./training/scripts
for some ideas.
For now, knowing enough to get by, and being careful when trying things out, is most important. Howvever, if you decide in the future that really want to master bash, there are also several resources availabe that explain some of the more common beginner mistakes. Alternatively, you can look into workflow tools like Nextflow (and the nf-core community pipelines), Snakemake or Makefiles.
14.6.1 More indices
Like our reference FASTA files, BAM files can also be indexed to improve the performance of downstream tools that need to search for particular reads in this large file. For example, it is required for visualisation with IGV (Section 14.9) and variant calling with GATK (Chapter 15).
samtools index alignment.sort.bam
Try to create an index for one of your BAM files. Which new files are created?
14.7 Mark PCR duplicates - picard MarkDuplicates
The library construction process for AmpliSeq results in identical fragments or duplicates for each of the targeted amplicons, hence we should not attempt to remove them.13
13 As a side note, in RNA-seq this duplicate removal step is also not recommended (unless unique molecular identifiers (UMIs) are used), although the number of duplicates can be used for quality control.
14 You can read more about how PCR duplicates arise in this excellent blog post by Eric Minikel.
A final step in the mapping pipeline is the tagging (or removal) of PCR duplicates. These are identical sequences that can arise because of PCR steps during library preparation or because a cluster on the flow cell was erroneously detected as two separate ones14. The overall goal of PCR duplicate removal is to reduce bias during downstream analyses (like variant detection) due to their presence.
PCR duplicates are polyclonal molecules that do not represent any true biological variation, nor are they independent observations of the DNA that was present in a sample. Thus, the observed abundance of these identical reads does not accurately reflect the true abundance of the different molecules in the sample(Rochette et al. 2023). If a particular PCR duplicate happens to be over-represented compared to other reads covering the same region of the genome, this could cause a bias in our analyses. Just imagine what would happen if during PCR amplification an error was introduced into one of the fragments. If we ended up with many PCR duplicates of this fragment, we might erroneously conclude that there is a SNP at this position because we see it occur in multiple reads. However, if we remove PCR duplicates, we will only observe the SNP in a single read, and variant detection will not mark this as a SNP due to lack of confidence (~agreement between different reads covering the same position, more on that in Chapter 15).
One of the tools that we use for marking PCR duplicates is Picard. Picard actually provides many different functionalities, but the one that we’re interested for now is MarkDuplicates
. An in-depth explanation of how it works can be found here. After running the tool, we end up with a BAM file again, but this time with all suspected PCR duplicates marked as such. Our downstream analysis tools will take this information into account whenever relevant (e.g., when detecting variants).15
15 Note that removal generally is not necessary, but it can be done by passing the -REMOVE_DUPLICATES true
flag.
- Try to mark all PCR duplicates for a single BAM file using the following command:
picard MarkDuplicates
-I alignment.sort.bam
-O alignment.sort.dups.bam
-M alignment.markeddups_metrics.txt
Run
samtools flagstat
again. What changed compared to the previous report?Examine and run the
remove_dups.sh
script to process all files.
Bonus exercise: Try to combine all the steps of the pipeline we have seen so far (including trimming) into a single script.
You might have noticed in the example above that we continued to extend the suffix of the BAM file to indicate what has happened to it: sorting and PCR duplicate marking. This information is also kept track of in the BAM header, but it is a good habit to still look for appropriate file names whenever possible.
The pipeline that we’re introducing here deviates from the GATK4 best practices in this section, because we do not introduce the concept of base quality score recalibration (BQSR). The basic idea here is that we want to detect and correct systematic errors made by the sequencer, by using our knowledge of known variants to model them.
The above link by GATK gives an in-depth explanation on the topic, but you can also find some interesting discussions here, here and here as well as some hands-on walkthroughs here (section 6) and here.
14.8 Alignment statistics
We already introduced the samtools flagstat
command to report on basic alignment statistics, but a more in-depth method of inspecting your alignment is provided by Picard’s CollectAlignmentSummaryMetrics
, or the more extensive CollectMultipleMetrics
(we will see Picard again in the next section on PCR duplicate removal):
picard CollectAlignmentSummaryMetrics R=<reference.fasta> I=<input.sort.bam> O=<output.bam>
The output of these tools can again be aggregated using a tool like MulitQC. See here for an example.
14.9 Alignment visualisation
After creating and filtering our alignment files, we can visualise them in a tool like IGV (Integrative Genomics Viewer). We refer to this section of the Galaxy training materials (Hiltemann et al. 2023; Wolff, Batut, and Rasche).