12 Bioinformatics tools and file formats
Below is a list of the file formats and tools used in the steps of this pipeline. A disclaimer: in bioinformatics, several different approaches have often been developed for the same task or analysis, leading to a large (and sometimes confusing) ecosystem of software tools, each with its own advantages, disadvantages and learning curve. In this course we focus on one representative tool for each task, but be aware that these are not the only ones, and more suitable options may exist depending on your data and use case.
12.0.1 FASTA: Biological sequences
File extension: generally .fasta, .fas or .fa, but .fna (FASTA nucleic acids), .faa (FASTA amino acids) and .frn (FASTA non-coding RNA) are also used. After indexing with bwa, there are additional .{amb,ann,bwt,pac,sa} files.
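To make the FASTA layout concrete, here is a minimal sketch of a parser, assuming the usual convention of a header line starting with ">" followed by one or more sequence lines. The function name parse_fasta and the example records are hypothetical and only serve as an illustration; for real analyses an established library is usually a better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each '>' header line."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The sequence identifier is the first word after '>'
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines may be wrapped, so concatenate them
            records[name] += line
    return records


# Hypothetical two-record FASTA content, used only for illustration
example = """>seq1 hypothetical description
ACGTAC
GTTA
>seq2
GGCC
""".splitlines()

records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

The same function can be given an open file handle instead of a list of lines, since it only iterates over its input.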
12.0.2 FASTQ: Sequence reads and qualities
File extension: .fq or .fastq (generally compressed with gzip, giving .fq.gz or .fastq.gz)
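As an illustration of what a FASTQ file contains, the sketch below reads gzip-compressed records of four lines each (header, sequence, "+" separator, qualities) and converts the quality string to Phred scores. It assumes the Phred+33 encoding and unwrapped four-line records; the function name read_fastq and the in-memory example read are hypothetical.

```python
import gzip
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read name, sequence, Phred scores) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        # Assumption: qualities are Phred+33 encoded (ASCII code minus 33)
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores


# Hypothetical single read, compressed in memory to mimic a .fastq.gz file
raw = b"@read1\nACGT\n+\nII5!\n"
compressed = gzip.compress(raw)

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    reads = list(read_fastq(handle))

for name, sequence, scores in reads:
    print(name, sequence, scores)
```

For a file on disk, the same function can be used by passing a path to gzip.open in text mode ("rt") instead of the in-memory buffer.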
12.0.3 SAM / BAM: Sequence alignments
File extensions:
- .sam: alignment in plain text format
- .bam: binary version of the SAM file
- .bai: index file for the BAM file
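Because SAM is plain text, a single alignment line can be split into its tab-separated fields. The sketch below keeps a few of the 11 mandatory fields and tests two bits of the FLAG field (0x4 for unmapped, 0x10 for reverse strand). The SamRecord type, the parse_sam_line function and the example alignment are hypothetical; BAM files are binary and would first need converting to SAM (for example with samtools view) or reading with a dedicated library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """A subset of the mandatory SAM fields (illustrative, not exhaustive)."""
    qname: str   # read name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    seq: str     # read sequence


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; header lines starting with '@' are not handled."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(
        qname=fields[0],
        flag=int(fields[1]),
        rname=fields[2],
        pos=int(fields[3]),
        mapq=int(fields[4]),
        cigar=fields[5],
        seq=fields[9],
    )


# Hypothetical alignment line, used only for illustration
line = "read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII"
record = parse_sam_line(line)

is_unmapped = bool(record.flag & 0x4)   # bit 0x4: segment unmapped
is_reverse = bool(record.flag & 0x10)   # bit 0x10: reverse-complemented
print(record.qname, record.rname, record.pos, is_unmapped, is_reverse)
```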
12.0.4 VCF: Variant calls
File extensions:
- .vcf: variant calls in plain text format (generally compressed with gzip, giving .vcf.gz)
- .g.vcf: genomic VCF (gVCF), variant calls in plain text format for a single sample
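A VCF file consists of header lines starting with "#" followed by tab-separated data lines whose first eight columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below skips the header and splits those fixed columns; the function name parse_vcf and the example variant are hypothetical, and per-sample genotype columns are ignored for brevity.

```python
from typing import Dict, Iterable, Iterator


def parse_vcf(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per variant line, using the eight fixed VCF columns."""
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the column header
        chrom, pos, vid, ref, alt, qual, flt, info = line.split("\t")[:8]
        yield {
            "chrom": chrom,
            "pos": int(pos),        # 1-based position
            "id": vid,
            "ref": ref,
            "alt": alt.split(","),  # several ALT alleles are comma-separated
            "qual": qual,           # kept as text because it may be '.'
            "filter": flt,
            "info": info,
        }


# Hypothetical VCF content, used only for illustration
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG,T\t50\tPASS\tDP=30",
]

variants = list(parse_vcf(example))
for variant in variants:
    print(variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
```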
12.1 Tools
- FastQC: Section 13.3
- FastQ Screen: Section 13.4
- MultiQC: Tip 13.3
- Trimmomatic: Section 13.5
- BWA: Section 14.3
- samtools: Section 14.4, Section 14.6 and Section 14.6.1
- picard: Section 14.7 and Section 14.8
- GATK: Chapter 15
- IGV: Section 14.9 and Section 15.4