6 More advanced commands
You can skip this section and proceed directly to the exercises if you are already familiar with commands like grep
, sort
and cut
.
Remember that we have provided a list of helpful tips and hints in the appendix: Section A.1.
6.1 Searching in files: grep
Being able to search through (long) text files is incredibly useful in a wide range of scenarios. For this, we make use of the grep
command. It is a very powerful and complex command, with many different options to tweak its behaviour, but even just the basic version can already be a lifesaver. The basic syntax is
grep "PATTERN" <path/to/target_file>
For example, we can look for the word “evolution” in the Origin of Species:
$ grep "evolution" long.txt
evolutionists that mammals are descended from a marsupial form; and if
At the present day almost all naturalists admit evolution under some
evolutionists; but there is no need, as it seems to me, to invoke any
Everyone who believes in slow and gradual evolution, will of course
of gradual evolution, through the preservation of a large number of
a strong disbeliever in evolution, but he appears to have been so much
historian will recognise as having produced a revolution in natural
the fact would be fatal to the theory of evolution through natural
the revolution in our palæontological knowledge effected by the
opposed to the admission of such prodigious geographical revolutions
has thus been arrived at; and the belief in the revolution of the earth
subject of evolution, and never once met with any sympathetic
agreement. It is probable that some did then believe in evolution, but
evolution. There are, however, some who still think that species have
we can dimly foresee that there will be a considerable revolution in
It might not be obvious from the above snippet, but inside your own terminal, the matching words in the search results will be highlighted. As you can see, grep
will return each line that contains a match. Also note how partial matches like revolution are returned as well.
When you run grep "species" long.txt
, you will indeed find many occurrences of this word. However, we are missing all the occurrences of “Species”. Try running grep "Species" long.txt
to compare the results. Lastly, try running the command grep -i "species" long.txt
. This option will
Remember, capitalization matters to grep
(and to Unix as a whole)!
Aside from the -i
option explained in the box above, there exist several other flags to improve your search results.
Option | Effect |
---|---|
-i |
Case insensitive (i.e., ACTG = actg ) |
-n |
Also print the line number of the result |
-c |
Count the number of matches |
-5 |
Print 5 lines surrounding each match |
-e /-E /-P |
Use regular or extended regular expressions |
-r |
Recursive search through all files in a folder |
The -#
option is particularly useful to learn more about the context around you search result. You can supply any number of lines here, which will get printed both before and after each match. The example below, you can see how it helps us find the identifier of a DNA read that contains a particular sequence (AACCGGGGT
):
$ grep -3 AACCGGGGT PF0512_S47_L001_R1_001.fastq
+
CCCCCGGCFGGGGFGGGGDGCEFGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGFFGGGFGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGD+,A=FG,:<?F5FGGGD>F9BDBDE9FEDAFGAFGGB=EDF,,>@FCCF;@;D;CF;,5;DEGG84,@@,3@ED<@,,@7,;@,7EC2=E>C977=
@M05795:43:000000000-CFLMP:1:2106:16840:21815 1:N:0:47
AGCCATACCAAGACCACAATTCTGAAGAGGAAACAAAACAAAAAAAAAAAAATAATTAAAAAAAAAAAAATTTAAAATTAAAAAAAAAATTTTTTTATTAAAATAATAAATATTAATTTTTATAATATAAATAAAATCCTATTTTACCCCACCAACCGGGGTTCATCCCCGGGCTCTTATACACATTTCCTAACCCACAAAAAGTACGAACAACACGCAACCCCCCTTCTGCCTTAAAAAAAAAACAAATCAAAATACACAAATATATCGAACATACAGCAACTACAAATGAAGATGTGGT
+
CCCCCGGFGGGFGGGGCFG9@@C<FGDG8EGGGFGGGGGGGGGGGGGGGGGD,:C,96CFD,8>FG7FGG,,,,B<B9,9CFFGGCG+3,,8BF@@,=8,8,=,C,,3@,,,3,3@,DGGFC,,,,,,7,,,,7>C,,3,,,7@@@:9,6**5,,**4*1*=8,,+5>=:****3/+>2;;92++2+++4++29*/:**9*1**3*0*/*/95)1*1))))29*05)202.:96*/-@=<(7:(.)00()))))))).-,)(())--)-(((.().,.:)8..(,).).))-4))..)))(
@M05795:43:000000000-CFLMP:1:2106:14940:21820 1:N:0:47
The -c
options makes grep
return the total number of matches it found. This method of counting is useful to complement the wc
command, in case you are presented with a file that does not have such an orderly structure as the FASTQ format we saw earlier. To demonstrate, consider the penguins.csv
file, which contains morphological data on three different penguin species.1 We can count the number of Adelie penguin records via:
1 Gorman KB, Williams TD, Fraser WR (2014) Ecological Sexual Dimorphism and Environmental Variability within a Community of Antarctic Penguins (Genus Pygoscelis). PLoS ONE 9(3): e90081. doi:10.1371/journal.pone.0090081
grep -c "Adelie" penguins.csv
152
In some situations we want to search through multiple files simultaneously. This is where the -r/--recursive
flag comes in. It allows us to target a directory and search through all of its contents (including subdirectories). Let us try searching for the same DNA sequence as before, but this time targeting all the files in the unix-demo
directory:
$ grep -r "AACCGGGGT" .
./PF0512_S47_L001_R1_001.fastq:AGCCATACCAAGACCACAATTCTGAAGAGGAAACAAAACAAAAAAAAAAAAATAATTAAAAAAAAAAAAATTTAAAATTAAAAAAAAAATTTTTTTATTAAAATAATAAATATTAATTTTTATAATATAAATAAAATCCTATTTTACCCCACCAACCGGGGTTCATCCCCGGGCTCTTATACACATTTCCTAACCCACAAAAAGTACGAACAACACGCAACCCCCCTTCTGCCTTAAAAAAAAAACAAATCAAAATACACAAATATATCGAACATACAGCAACTACAAATGAAGATGTGGT
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:AATAAAACCGGGGTTGATACCACCACTTCCAGGTTCCCACATTCCAAGTCCCCTCAGCCA
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:CTGGAGTCAGGACGTGAGCCGACTTGCTTAAAAATAAATCCACATGGCTGAACCGGGGTT
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:ACAGCTAACCGGGGTTTTAGTATATGTGCCACATCTCTGTAAATGTTCACTTCCTAGGCA
./Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa:TATGATCGTGCCACTGCACTTCAACCGGGGTGACAAAGCGAAAACCGTGTCTCTAAAAAA
.txt
files only? (Click me to expand!)
Instead of using the -r
flag, we can also rely on globbing again (see Section 5.3.2). To search for the string “needle” in all .txt
files in a particular folder, we can do the following:
grep "needle" path/to/directory_with_txt_files/*.txt
We already mentioned regular expressions in the previous section: they allow you to search for particular patterns that can match more than one exact string of text. This is tremendously useful, but we will not dive deeply into how they work during this course. If you are interested, you can check out an excellent tutorial here.
One special pattern that we will introduce is the start or end of line anchor. You can search for patterns that start at the beginning of a line by prefixing the pattern with a ^
symbol. Similarly, you can search for patterns that occur at the end of a line by using a $
symbol. For example, if we search for grep ">@" read.fastq
in a FASTQ file, we will only retrieve lines that start with the @
symbol, ignoring any @
symbols that are used elsewhere on a line.
As we saw in Section 5.1.3.1, every read in a FASTQ file is represented by four lines, the first of which is the header and which always starts with an @
symbol. So far so good, but the fourth line, which contains the quality score of each base in the sequence, can also start with an @
symbol, because @
is just another score value that could happen to be occur for the first base in the sequence.
We will see more elaborate use cases for grep
when we introduce the Unix concepts of piping and redirection.
6.2 Tabular data: cut
We already encountered tabular data (the penguin dataset in .csv
format) when talking about grep
. Tabular data files like .csv
are a very common format, and not just in bioinformatics.
.csv
files
Tabular data files are usually plain text files, where each row corresponds to a record (e.g., an individual penguin), and each column represents a particular field (e.g., species, flipper length, body mass, etc.). The columns can be separated by different field delimiters or separators. In .csv
files, these are usually commas (comma separated values), but they can also be TABS (.tsv
) or semicolons (;
).
A particularly useful Unix tool for manipulating tabular data files, is cut
. It allows us to extract particular columns from these files. The syntax is as follows:
cut [OPTIONS] target_file
Option | Effect |
---|---|
-d "," /--delimiter ";" |
Change the default delimiter (TAB) to another character like , |
-f 1 |
Select the first column |
-f 2,3 |
Select the second and third column |
-f 1-3,6 |
Select columns one through three and columns six |
--complement -f 1 |
Select all columns except for the first one |
-r |
Recursive search through all files in a folder |
6.2.1 SAM file format
Aside from .csv
and .tsv
files, there are many bioinformatics file formats that also follow a tabular lay-out. One of these is the SAM file format.
The SAM file format is a tab-delimited text file that stores information about the alignment of sequence reads to a reference genome.
When considering the steps that are taken during the variant calling analysis of sequence reads, SAM files result when the reads inside a FASTQ file are processed by an alignment tool like bwa
and minimap
. These tools take each individual read corresponding to a particular sample and try to map them to their most likely position in a reference genome.
SAM files consists of an optional header section followed by an alignment section.
The header lines always start with an @
symbol and contain information about e.g., the reference genome that was used and the sorting of the alignments. This section is not yet tab-delimited.
The alignment section contains a tab-delimited line for each sequence read that was aligned to the reference. There are 11 mandatory fields, containing information on the position of the read in the reference genome (i.e., where it was mapped), the sequence itself, its quality score ( = the quality of each base pair of the sequence during sequence calling, cf. FASTQ format), its mapping score ( = how well the read aligned to the reference genome), etc.
Col | Field | Type | Brief description |
---|---|---|---|
1 | QNAME | String | Query template NAME |
2 | FLAG | Int | bitwise FLAG |
3 | RNAME | String | References sequence NAME |
4 | POS | Int | 1- based leftmost mapping POSition |
5 | MAPQ | Int | MAPping Quality |
6 | CIGAR | String | CIGAR string |
7 | RNEXT | String | Ref. name of the mate/next read |
8 | PNEXT | Int | Position of the mate/next read |
9 | TLEN | Int | observed Template LENgth |
10 | SEQ | String | segment SEQuence |
11 | QUAL | String | ASCII of Phred-scaled base QUALity+33 |
We will not dive into the details of each of these fields for now, but if you are interested you can check out the official documentation or this useful write-up.
A typical SAM file will look something like this:
@SQ SN:ref LN:45
@SQ SN:ref2 LN:40
r001 163 ref 7 30 8M4I4M1D3M = 37 39 TTAGATAAAGAGGATACTG * XX:B:S,12561,2,20,112
r002 0 ref 9 30 1S2I6M1P1I1P1I4M2I * 0 0 AAAAGATAAGGGATAAA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA *
r004 0 ref 16 30 6M14N1I5M * 0 0 ATAGCTCTCAGC *
r003 16 ref 29 30 6H5M * 0 0 TAGGC *
r001 83 ref 37 30 9M = 7 -39 CAGCGCCAT *
x1 0 ref2 1 30 20M * 0 0 aggttttataaaacaaataa ????????????????????
x2 0 ref2 2 30 21M * 0 0 ggttttataaaacaaataatt ?????????????????????
x3 0 ref2 6 30 9M4I13M * 0 0 ttataaaacAAATaattaagtctaca ??????????????????????????
x4 0 ref2 10 30 25M * 0 0 CaaaTaattaagtctacagagcaac ?????????????????????????
x5 0 ref2 12 30 24M * 0 0 aaTaattaagtctacagagcaact ????????????????????????
x6 0 ref2 14 30 23M * 0 0 Taattaagtctacagagcaacta ???????????????????????
This example has two header lines denoting the reference genome contigs and their lengths, followed by 12 aligned reads.
Now inspect the SAM file in ./training/unix-demo/PF0302_S20.sort.sam
and try to identify the different sections that make up the file.
6.3 File sizes: du
We already saw that the ls -lh
can be used to figure out the file size of files in a particular directory. du
is another tool to do this, but it operates on individual files or directories directly. Like ls
, it also provides the -h
/--human-readable
option to return file sizes in KB/MB/GB, so it is generally recommended to always use this option. When used on a file, it will simply return its size, but when used on a directory, it will output information for all files, as well as the total file size of the entire directory (the final line of the output).
# targetting an individual file
$ du -h Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa
5.9M Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa
# targetting a directory
$ du -h training/data/
284M training/data/fastq
62M training/data/reference
346M training/data/
6.4 Compressed files: gzip
We already introduced the concept of file compression when talking about the FASTQ files in the training/data/fastq
directory. As a reminder, compressed files are binary files (as opposed to human-readable plain text files) that are used to reduce the file size for more efficient storage. Many of the files that we use in bioinformatics tend to be compressed. Some of the tools we use, will not work on compressed files (e.g., try to cat
a compressed file and see what happens), so we either need to 1) use specialized tools that expect compressed files as their input, or 2) decompress or extract the files first.
For gzipped (.gz
) files specifically, we can do this via the gzip
and gunzip
commands. The former allows us to create a gzip-compressed version of the file, whereas the latter will extract one back to a plain text file. The basic syntax is gzip/gunzip <path/to/file>
, but a very useful option is the -k
/--keep
flag. Without it, compressing a file would replace the uncompressed file with the new compressed one (vice versa: extracting would replace the compressed version with the extracted one), but when using the flag both files will be retained.
unix-demo
directory. By how much does its file size change? (Click me to expand!)
$ du -h PF0512_S47_L001_R1_001.fastq
67M PF0512_S47_L001_R1_001.fastq
$ gzip --keep PF0512_S47_L001_R1_001.fastq
$ du -h PF0512_S47_L001_R1_001.fastq.gz
17M PF0512_S47_L001_R1_001.fastq.gz
After compressing the FASTQ file with gzip
, it shrunk to less than a third of its original size.
Do note that there exist other types of file compression besides gzip, like .zip
/.7zip
. In unix we also often make use of tar
(which technically is not a compression tool, but a file archiver). File compression and tar can even be combined, leading to files with suffixes like .tar.gz
. This allows us to compress entire directories, instead of only individual files.
To extract these so called tarballs, we need to use the tar
command:
# extract .tar.gz archive
$ tar -xzvf tar_archive.tar.gz
tar_archive/
tar_archive/3
tar_archive/2
tar_archive/1
$ ls tar_archive
1 2 3
# create .tar.gz archive
$ tar -czvf new_archive.tar.gz <path/to/target_directory>
This command is notorious for how arcane its option flags are, but you can either try to remember it using a mnemonic (“eXtract/Compress Ze Vucking Files”, pronounced like a B-movie vampire) or the meaning of the individual flags (z
tells tar that we are using gzip compression, -v
stands for --verbose
to make the command show more information and output, -c
/x
switches between compression and extraction mode, -f
is the last option and points to the tar file) . And of course, the correct syntax is only a google/tldr search or tar --help
call away.
6.4.1 BAM file format
A final example of a binary file that you might encounter in the bioinformatic analysis of AmpliSeq sequencing is the BAM file format:
A BAM file is nothing more than a compressed, binary representation of a SAM file. It is used to reduce the file size of SAM files and also improve the speed of specific operations like sorting or retrieving information from the file.
Many bioinformatics tools can handle BAM file natively. However, similar to gzipped files, BAM files are binary and we can no longer preview them using tools like cat
and less
.
One of the tools that is often used to convert between the two types of alignment file formats is samtools
, which is maintained by the same group of people who manage the format specification for SAM/BAM files.
Aside from BAM, you might also encounter the CRAM file format. This is a more recent and even more highly optimized compressed alternative for sequence alignments, but it is not as commonly used (yet).
samtools
is a commonly used program to manage SAM/BAM files, which we will introduce later when discussing the structure of variant calling pipelines.
6.5 Downloading files: wget
wget
is a command that allows you to download files from a particular web address or URL and place them in your working directory. While there are several optional flags, in its most basic form the syntax is simply: wget URL
. This command is not only useful when automating certain tasks, but also crucial if you ever find yourself in a Unix environment that does not have a GUI at all (e.g., compute clusters or cloud servers).
./training/data/results
. (Click me to expand!)
- At the top of the page, click on Data -> Download data files.
- Search for falciparum 3D7 and then narrow down your search by selecting the most recent release and the FASTA file format. Alternatively, you can click on the Download Archive link in the top and navigate the file directory to the current release.
- The file name of the 3D7 reference genome in FASTA format is
PlasmoDB-66_Pfalciparum3D7_Genome.fasta
. - Right click the file and copy its URL to your clipboard: https://plasmodb.org/common/downloads/release-66/Pfalciparum3D7/fasta/data/PlasmoDB-66_Pfalciparum3D7_Genome.fasta.
- Create and navigate to the output directory (
mkdir -p ./training/data/results
andcd ./training/data/results
) - Download the file here using the command:
wget https://plasmodb.org/common/downloads/release-66/Pfalciparum3D7/fasta/data/PlasmoDB-66_Pfalciparum3D7_Genome.fasta
.
Two optional flags that you might find useful are: 1) -o
allows you to rename the download file, and 2) -P <path/to/directory
saves the file in a directory of your choice instead of the current working directory. Of course, these are just small convenient timesavers, since you can always cd
to a particular location and use mv
to rename the file afterwards.
Lastly, an alternative to wget
that you might encounter at some point is curl
. On the whole, it acts quite similar to wget
for the most part.
6.6 Retrieving file names: basename
basename
is a rather simple command: if you give it a long file path, it will return the final section (i.e., the file name).
# starting in the `training` directory
$ pwd
/home/pmoris/itg/FiMAB-bioinformatics/training
# get the file name for the reference genome we just downloaded
$ basename data/reference/PlasmoDB-65_Pfalciparum3D7_Genome.fasta
PlasmoDB-65_Pfalciparum3D7_Genome.fasta
We can also use this command to remove a particular suffix from a filename:
$ basename PF0512_S47_L001_R1_001.fastq .fastq
PF0512_S47_L001_R1_001
At this point in time, it might not seem particularly useful to be able to extract the file name of a file, but when we introduce the concept of for loops and bash scripting, it will become more clear why this can be so useful.
6.7 Sorting and removing duplicates: sort
and uniq
The final two commands that we will introduce are yet again tools to manipulate plain text files. The first is sort
, which does exactly that you expect it to. It can sort all the rows in a text file. Its syntax is:
sort [OPTIONS] <./path/to/file>`
There are optional flags that allow you to choose the type of order to use (e.g., numerical -n
/--numeric-sort
instead of alphabetical ), reverse the order of the sort (-r
/--reverse
) or ignore capitals (-f
/--ignore-case
).
The second command, uniq
, is used to remove duplicate lines in a file. It also offers the option to count the frequency of each unique line.
# given the following file
$ cat file_with_duplicates.txt
a
a
b
b
b
b
c
$ uniq -c file_with_duplicates.txt
2 a
4 b
1 c
These two commands are often used in conjunction, because uniq
on its own is not capable of filtering out identical lines that are not adjacent. So to truly remove all duplicate lines in a file, we would first need to sort it: sort unsorted_file.txt | uniq
. Also note that the sort
command also offers a flag -u
, which will make it return only unique lines after sorting.
In the next section, we will introduce a method of combining these commands in a more convenient way than running them one by one and without needing to create any intermediary files.
6.8 Other commands
Of course, there exist many more Unix commands than the ones we introduced here. We will end this section by briefly mentioning two that you might run into at some point awk
and sed
. Both of them allow you to search and replace patterns in text files, and with awk
you can even perform more complex operations including calculations. We will not dive into them here, but keep them in the back of your mind for the future.
6.9 Exercises
- Visit PlasmoDB again and find the download URL for the Plasmodium vivax P01 reference genome sequence in FASTA format. Download it via the command line and store it in
./training/data/reference
. - Report the file size of this reference genome in MBs.
- Find out how many lines of text the file contains.
- Search through the file for the
>
character, which is used to denote every chromosome/contig. Use a single command to count them. - Compress the FASTQ file
PF0512_S47_L001_R1_001.fastq
in theunix-demo
directory usinggzip
. - Navigate to the directory
./training/data/fastq
and in a single command, extract the forward (PF0097_S43_L001_R1_001.fastq.gz
) and reverse (PF0097_S43_L001_R2_001.fastq.gz
) of thePF0097_S43
sample, without removing the compressed files. Hint: use globbing! - Search both FASTQ files for the read fragment with identifier
@M05795:43:000000000-CFLMP:1:1101:21518:5740 2:N:0:43
using a single command. - Compare the file sizes of the two compressed and uncompressed FASTQ files.
- Extract the columns containing the island and flipper length of each penguin from the
./training/unix-demo/penguins.csv
file. - Count how often the sequence
CATCATCATCATCAT
occurs in the FASTA file./training/unix-demo/Homo_sapiens.GRCh38.dna.chromosome.Y.truncated.fa
. - Which command can be used to extract the name of the SAM file
./training/unix-demo/PF0302_S20.sort.sam
without the.sam
suffix? - Which reference genome was used to create the SAM file?
6.10 Summary
- Tabular data:
.csv
,.tsv
- The SAM/BAM file formats store sequence reads aligned to a reference genome.
Command | Result |
---|---|
grep <pattern> <path/to/file> |
Search through a (very large) file for the supplied pattern |
du <-h> <path/to/file_or_directory> |
Check how much space a file or directory occupies |
cut -f [--delimiter ","] <path/to/file> |
Extract columns from tabular data using the specified delimiter |
gzip / gunzip (-keep) <path/to/file> |
Compress or extract a gzip compressed file (.gz ) |
wget <url> |
Downloads a file from the URL to the current directory |
basename <path/to/file> |
Returns the name of the file without the path prefix |
basename <path/to/file> |
Returns the name of the file and remove the provided suffix |