12  Bioinformatics tools and file formats

This chapter lists the file formats and tools used in the steps of this pipeline. A word of caution: in bioinformatics, several different approaches have often been developed for the same task or analysis, leading to a large (and sometimes confusing) ecosystem of software tools, each with its own advantages, disadvantages and learning curve. In this course we focus on one representative tool for each task, but be aware that these are not the only options and that more suitable ones may exist depending on your data and use case.

12.0.1 FASTA: Biological sequences

File extension: generally .fasta, .fas or .fa, but .fna (FASTA nucleic acids), .faa (FASTA amino acids) and .frn (FASTA non-coding RNA) are also used. Indexing a FASTA file with bwa creates additional files alongside it, with the extensions .amb, .ann, .bwt, .pac and .sa.
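To make the format concrete, here is a minimal sketch of reading FASTA records in Python. It assumes the usual layout (a header line starting with ">" followed by one or more sequence lines); the function name `parse_fasta` and the example records are hypothetical and only for illustration. For real analyses a dedicated library is usually the better choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word of each '>' header line."""
    records: Dict[str, str] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The record name is the first word after '>'; the rest is a description.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = ""
        elif name is not None:
            # Sequences may be wrapped over several lines; join them.
            records[name] += line
    return records


# Hypothetical example data, not taken from the course material.
example = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
TTGACA
"""

records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The same function works on an open file handle, since iterating over a text file yields its lines.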

12.0.2 FASTQ: Sequence reads and qualities

File extension: .fq or .fastq (generally compressed with gzip, giving .fq.gz or .fastq.gz)
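As an illustration of how a FASTQ file is laid out, the sketch below reads four-line records (header starting with "@", sequence, a "+" separator line, quality string) and converts the quality characters to numeric scores. It assumes Phred+33 encoding, which is common but not universal; the names `parse_fastq` and `PHRED_OFFSET` and the example reads are hypothetical.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: qualities are Phred+33 encoded


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (header text, sequence, quality scores) from 4-line FASTQ records."""
    cleaned = [l.rstrip("\n") for l in lines if l.strip()]
    if len(cleaned) % 4 != 0:
        raise ValueError("FASTQ input should contain 4 lines per record")
    for i in range(0, len(cleaned), 4):
        header, seq, plus, qual = cleaned[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        # Each quality character encodes one score: ASCII code minus the offset.
        scores = [ord(c) - PHRED_OFFSET for c in qual]
        yield header[1:], seq, scores


# Hypothetical example data, not taken from the course material.
example = "@read1\nACGT\n+\nII5!\n@read2\nTTGA\n+\n!!II\n"

reads = list(parse_fastq(example.splitlines()))
for name, seq, scores in reads:
    print(name, seq, scores)
```

For a gzip-compressed file, the lines can be obtained with Python's standard `gzip.open(path, "rt")` instead of a plain `open`.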

12.0.3 SAM / BAM: Sequence alignments

File extensions:

  • .sam: alignment in plain text format
  • .bam: binary version of SAM file
  • .bai: index file for BAM file
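The plain text SAM format can be inspected directly. The sketch below assumes the standard layout: header lines starting with "@", followed by tab-separated alignment lines with eleven mandatory fields (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The class `SamRecord`, the function `parse_sam_line` and the example alignments are hypothetical; BAM files are binary and need a tool such as samtools to be read.

```python
from typing import NamedTuple

FLAG_UNMAPPED = 0x4   # FLAG bit set when the read is unmapped
FLAG_REVERSE = 0x10   # FLAG bit set when the read maps to the reverse strand


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse the first six mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2],
                     int(fields[3]), int(fields[4]), fields[5])


# Hypothetical example data, not taken from the course material.
example_lines = [
    "\t".join(["@HD", "VN:1.6", "SO:coordinate"]),
    "\t".join(["@SQ", "SN:chr1", "LN:1000"]),
    "\t".join(["read1", "16", "chr1", "100", "60", "4M", "*", "0", "0", "ACGT", "IIII"]),
    "\t".join(["read2", "4", "*", "0", "0", "*", "*", "0", "0", "TTGA", "IIII"]),
]

# Header lines start with '@'; everything else is an alignment.
alignments = [parse_sam_line(l) for l in example_lines if not l.startswith("@")]
for rec in alignments:
    unmapped = bool(rec.flag & FLAG_UNMAPPED)
    reverse = bool(rec.flag & FLAG_REVERSE)
    print(rec.qname, rec.rname, rec.pos, "unmapped" if unmapped else "mapped",
          "reverse" if reverse else "forward")
```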

12.0.4 VCF: Variant calls

File extensions:

  • .vcf: variant calls in plain text format (generally compressed with gzip or bgzip, giving .vcf.gz)
  • .g.vcf: genomic VCF (gVCF), variant calls in plain text format for a single sample that also records non-variant sites
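A VCF file is also tab-separated text, so its basic structure can be sketched in a few lines of Python. The example assumes the standard layout: meta-information lines starting with "##", one column header line starting with "#CHROM", and then one line per variant with the fixed columns CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The function `parse_vcf` and the example variants are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_vcf(lines: Iterable[str]) -> Iterator[Tuple[str, int, str, List[str]]]:
    """Yield (chromosome, position, reference allele, alternate alleles)."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, '##' meta lines and the '#CHROM' header
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Several alternate alleles at one site are separated by commas.
        yield chrom, int(pos), ref, alt.split(",")


# Hypothetical example data, not taken from the course material.
example_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", "."]),
    "\t".join(["chr1", "200", ".", "T", "C,G", "30", "PASS", "."]),
]

variants = list(parse_vcf(example_lines))
for chrom, pos, ref, alts in variants:
    print(f"{chrom}:{pos} {ref}>{','.join(alts)}")
```

Compressed .vcf.gz files can be read the same way by taking the lines from Python's standard `gzip.open(path, "rt")`.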

12.1 Tools