Appendix A — Various Unix topics
A.1 Tips and hints
Never (never!) use spaces in your file or directory names. This will only lead to pain… Instead, use hyphens (-
) or underscores (_
) to separate words. E.g., my_first_script
and 3B207-2_S92_L001_R1_001.fastq.gz
.
Additionally, unlike in Windows, in Unix everything is case-sensitive. Thus, /home/documents
!= /home/Documents
. Be mindful of this when naming or pointing to files/directories.
Avoid unnecessary typing and just make things easy for yourself!
While typing commands in the shell, you can almost always use the tab
key for auto-completion. This will automatically type out paths, file names or known commands. If there exist multiple matches, a single press of tab
will not appear to do anything, but if you press the button twice, a list of possible options will appear on your screen. This is incredibly useful, not only for speeding things up, but also for avoiding typos when dealing with long or complex file names.
An equally useful tool is your command history. While on the shell prompt, pressing the up arrow (↑
) will bring up your most recent previous command. Pressing it again will cycle through the entire history, in reverse order. You can also search through your history by pressing ctrl+r
allows you to search through your command history. Just start typing and you will see the search try to narrow down on the command that you are looking for. Once you find it, just press enter
to run it directly or tab
to copy it to your prompt (in case you still want to change it). The search form will look like this: (reverse-i-search)`world': echo "Hello world!"
Copying and pasting might work slightly different to what you are used to, depending on the terminal application that you are using. If ctrl+c
and ctrl+v
do not appear to work, you can tryctrl+shift+c
and ctrl+shift+v
instead. Often times, the mouse middle or right click can also be used for pasting.
For the native WSL terminal specifically, you can refer to this site for more info: https://devblogs.microsoft.com/commandline/copy-and-paste-arrives-for-linuxwsl-consoles/
If a command seems to hang or get stuck, your terminal becomes unresponsive, or if you tried to print a very large text file to your screen, you can use CTRL+C
to interrupt almost any operation and regain control.
Similarly, CTRL+D is an often used shortcut for exiting/logging out (e.g., when dealing with remote servers of nested shells).
In some cases, like when using an interactive terminal program such as the text editors nano
and vim
or a text viewer like less
, you will only be able to exit them using that particular program’s shortcut keys (CTRL+X
, :
followed by q
and enter
, and Q
, for these applications respectively). For more info on terminal programs, check out Fantastic terminal programs and how to quit them.
Be careful while learning your way around the command-line. The Unix shell will do exactly what you tell it to, often without hesitation or asking for confirmation. This means that you might accidentally move, overwrite or delete files without intending to do so. For example, when creating, copying or moving files, they can overwrite existing ones if you give them the same name. Similarly, when a file is deleted, it will be removed completely, without first passing by a recycle bin.
No matter how much experience you have, it is a good idea to remain cautious when performing these types of operations.
For the purposes of learning, if you are using your own device instead of a cloud environment, we recommend that you work in a dedicated playground directory or even create a new user profile to be extra safe. And like always, backups of your important files are invaluable regardless of what you are doing.
-h
/--help
and comments are your friends.
At the beginning things will be awkward, so don’t worry about having to search for the same information multiple times. That is all part of the learning process. Moreover, being able to retrieve information when you are in need of a particular command is more useful than memorizing everything.
It can still be a good idea though to keep a list of commands that you often use, but have a difficult time committing to memory.
Many commands will also display a short help text when called with the -h
/--help
flag. For some tools, you will need to call them without any arguments to display the help. Lastly, for some tools, you can use the man <tool_name>
to open up an even more in-depth manual.
Remember that you can always write comments inside of your scripts by starting a line with #
. That way you can add a short explainer or extra info to the different sections of a script. Code that seems clear while you are writing it, has the unfortunate tendency of becoming much more confusing when you refer back to it at a later time.
#!/usr/bin/env bash
##########################################################
# Script to map fastq files to reference genome with bwa #
##########################################################
# make sure to run this script from within the directory where it is stored!
# move to the directory containing the fastq files
cd ../data/fastq
# create a directory to store the results and store the path as a variable
output_dir="../../results/bwa"
mkdir -p ${output_dir}
...
Additionally, you can break up long commands using a \
to make them easier to read. You can do this both in scripts or on the command line. E.g.,
bwa mem \
\
../reference/PlasmoDB-65_Pfalciparum3D7_Genome.fasta ${read_1} \
${sample_name}_R2_001.fastq.gz
The first website is tremendously useful for figuring out what a command and all of its options mean. Whereas the second shows you a quick summary of the most command usages of a particular command.
Use both of these to your advantage! But do not forget that most commands also have a built-in help page that can be accessed using the --help
flag (in some cases just typing the command without any arguments also shows some help information).
A.2 Overview of special syntax
The table below gives you an overview of some of the special characters that we will encounter. You do not need to memorize them, but you can always refer back to this section if you see a symbol later on and are not quite sure what its purpose is.
Symbol | Name | Uses |
---|---|---|
/ |
Forward slash | File path separator or root location/file path |
\ |
Back slash | Split long command to a new line and escape special characters (+ file path separator in Windows) |
~ |
Tilde | Shortcut for home directory in file paths |
| |
Pipe or vertical bar | Chains the output of one command to the input of another one (piping) |
# |
Hash | Part of the shebang at the top of scripts #! and used for comments in shell scripts |
$ |
Dollar sign | Used to access variables in bash |
* |
Asterisk or wildcard | Globbing operator |
> |
Greater than symbol | Redirect output of a command (>> redirect and append instead of overwriting) |
< |
Less than symbol | Redirect input to a command |
. |
Dot | In the context of a path, it represents the current working directory |
.. |
Double dot | In the context of a path, it represents the parent directory of the working directory |
A.3 Understanding superusers, root and sudo
Superusers are special users on Unix systems with additional privileges (cf. Windows administrator). By convention, the default name for a superuser on Linux is root
(don’t confuse this with the root of the filesystem /
). Superusers have access to all files and directories on the file system, including any critical components that make the system work and any files owned by other users. Moreover, several tasks like software installation and modifying system configuration require administrative privileges. It would be a security risk to run as a superuser constantly, but how can regular users install software then? Or what about modifying the permissions of a file that you (accidentally) removed your own access to? This is where the sudo
command comes into play.
sudo
stands for “superuser do” and it allows regular users, who have been added to the sudo
user group, to run individual commands as if they were the root user. A common use-case is using sudo to install new software via apt
(on Debian-like systems) or dnf
(on Fedora/CentOS): sudo apt install ncbi-blast+
.
With great power comes great responsibility, so exercise caution and think twice before running a command with sudo
. Only do so when you are sure you know what the end result will be.
A.4 What is $PATH
?
The $PATH
is a way of letting your computer know where specific tools or other special locations are stored on your file system. Unless you tell it explicitly, it won’t know where to find any new software you install. Fortunately, most methods of installing software automatically take care of this for you, but every now and then you will need to manually add things to your $PATH
. If you don’t, you will be greeted by messages like Command 'python' not found, did you mean:
.
The $PATH
is nothing more than a list of locations on your computer. Everything that is found in those locations, will become available to use directly on the CLI without having to type out its full location. Even the basic Unix commands, like ls
and cd
are only known to your shell because they are in a location that is indexed by your path.
In the following example, we will demonstrate how you can add a custom directory with scripts to your $PATH
, making them callable from anywhere.
# show the contents of PATH
echo $PATH
/home/pmoris/miniforge3/bin:/home/pmoris/miniforge3/condabin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/lib/wsl/lib:/mnt/c/Program Files (x86)/Common Files/Oracle/Java/javapath:/mnt/c/windows/system32:/mnt/c/windows:/mnt/c/windows/System32/Wbem:/mnt/c/windows/System32/WindowsPowerShell/v1.0:/mnt/c/windows/System32/OpenSSH:/mnt/c/Program Files/Git/cmd:/mnt/c/Program Files/dotnet:/mnt/c/Users/pmoris/AppData/Local/Programs/Quarto/bin:/mnt/c/Users/pmoris/AppData/Local/Microsoft/WindowsApps:/mnt/c/Users/pmoris/AppData/Local/Programs/Microsoft VS Code/bin:/snap/bin
# temporarily add directory of scripts to PATH
$ ls ~/itg/FiMAB-bioinformatics/training/scripts
call_variants.sh download_reference.sh map.sh remove_dups.sh trim.sh
$ export PATH="$PATH:~/itg/FiMAB-bioinformatics/training/scripts"
# now these scripts can be invoked directly without having to type out their full location
# e.g., map.sh works just as well as ~/itg/FiMAB-bioinformatics/training/scripts/map.sh
To make these changes permanent, you’d have to add that export
statement to your .bashrc
file (stored in your home directory). This file is run every time you launch a new shell, so that will allow the $PATH
to be modified every time during startup.
You can find more information on modifying the PATH here.
Lastly, be careful when modifying your PATH. If you mess it up, it can cause all kinds of havoc.
A.5 Dealing with file permissions
You can find an excellent explanation on file permissions here.
A.6 Working with remote machines via SSH
In some cases, you will need to work on a Linux machine that is physically located somewhere else, i.e. a remote server. Access to these is usually managed via a command-line tool called SSH (or a stand-alone GUI tool like Putty in Windows). The syntax of the ssh
command is as follows:
ssh username@domain
Where username is a name given to you by the admin of the system and domain is the address of the server (can be a URL or an IP address).
The connection is secured via SSH keys: a pair of files used for authentication stored in ~/.ssh
.
- Public key: e.g.,
id_rsa.pub
orid_ed25519.pub
, located on the remote server. - Private file: e.g.,
id_rsa
orid_ed25519.pub
, located on your own machine. *Never share this file with anyone else!**
Instructions to generate new SSH keys can be found https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent.
A common problem when connecting is that the file permissions of your keys or credential files are messed up. This can happen if you generate them in Windows and later move them to a Linux file system. To fix this, check https://superuser.com/a/215506.
More info on remote servers can be found here.
Lastly, keep in mind that when working on remote servers, it is essential to use screen
or tmux
for long-running jobs. Otherwise, they will be interrupted when you disconnect or even when the network briefly fails.
A.7 Further reading
There are still many facets of Unix that we have not covered here yet, but we hope that you can pick these up on your own if were to ever need them. For now though, the basics that we covered here, will hopefully already go a long way in helping you use Unix, read scripts written by others and teaching yourself new concepts.
A selection of more advanced topics to explore at a later time could be:
- An superb workshop on various intermediate concepts: https://genomicsaotearoa.github.io/shell-for-bioinformatics/
- Installing software on Linux machines, from source or via
apt
(Ubuntu),dnf
(Fedora) orpacman
(Arch), or through tools like conda/Miniforge and bioconda. - The
find
command: useful for finding files in your file system. - If/else conditions: allow you to execute parts of scripts if certain conditions are fulfilled.
- Command substitution and process substitution: run commands in a subshell to allow more complex ways of redirecting and piping outputs/inputs.
sed
/awk
are commands that let you do things like search and replace, calculations or other manipulations based on all the lines of a file.- the
join
command: combining columns of (multiple) tabular data files in particular ways. - Environment variables: variables that are always available, like
PATH
. - Using
screen
ortmux
to spawn persistent background terminal sessions. This allows you to run commands on a remote server and shutdown your own machine, without the remote process being interrupted. - Learn about the difference in line endings on Unix (
\n
) and Windows (\r\n
), how to switch between them (dos2unix
) and how to set a preference in your editors like RStudio. - Learn about
sudo
, allowing you to perform actions as an administrator. - Regular expressions, to power up your
grep
searches andsed
/awk
commands.