1  What is a CLI?

This section of the course will introduce you to the general concept of the Unix command line interface (CLI) - as opposed to the graphical user interface (GUI) that you are familiar with - and Bash, one of the most ubiquitous Unix shells.

1.1 Learning objectives

  • Knowledge of what a Unix shell and the CLI are and why/when they can be useful.
  • Setting up your own Unix environment.
  • Familiarity with basic bash commands for e.g., navigation, moving/copying and creating/deleting/modifying files and directories.
  • Introduction of a few more advanced commands and concepts like redirection, piping and loops.
  • First look at scripts and how they can be used in the context of DNA sequencing pipelines for variant calling.

Resources

This section of the course draws inspiration from the following resources:


1.2 What is Unix?

Unix is a family of operating systems, with one of their defining features being the Unix shell, which is both a command line interface and scripting language.

In simpler terms, shells look like what you see in the figure below and they are used to talk to computers using a CLI - i.e., through written text commands - instead of via a graphical user interface (GUI) where you primarily use a mouse cursor.

Bash shell in WSL

Bash shell in WSL

There exist many different flavours of Unix, collectively termed “Unix-like”, but the ones you will most likely encounter yourself are Linux (which itself comes in many different varieties we call distributions, e.g. Debian, Ubuntu, Fedora, Arch, etc.) and MacOS. These operating systems come with a built-in Unix shell. While Windows also comes with a command line interface (Command Prompt and PowerShell), it is not a Unix shell and thus uses different syntax and commands. We’ll dig into how you can get your hands on a Unix shell on a Windows machine in a later section. The most ubiquitous Unix shell is Bash, which comes as the default on most Linux distributions.

1.3 Why bother learning the Unix shell as a bioinformatician?

Even if you are primarily a wet lab scientist, learning the basics of working with CLIs offers a number of advantages:

  • Automation: CLIs and scripting excel at performing repetitive tasks, saving not only time, but also lowering the risk of mistakes. Have you ever tried manually renaming hundreds of files? Or adding an extra column to an Excel spreadsheet with millions of rows?
  • Reproducibility: reproducibility is key in science and by using scripts (and other tools like git, package managers and workflow systems) you can ensure that your analyses can be repeated more readily. This is in stark contrast to the point-and-click nature of GUIs.
  • Built-in tools: the Unix shell offers a plethora of tools for manipulating and inspecting large (text) files, which we often deal with in bioinformatics. E.g., DNA sequences are often stored as plain text files.
  • Availability of software: many bioinformatics tools are exclusively built for Unix-like environments.
  • Access to remote servers: Unix shells (usually bash) are the native language of most remote servers, High Performance Computing (HPC) clusters and cloud compute systems.
  • Programmatic access: CLIs and scripts allows you to interact in various ways (e.g., via APIs) with data that is stored in large on-line databases, like those hosted by NCBI or EBI.

As a concrete example of what we will be using the shell for, consider the task of processing hundreds of Plasmodium DNA sequencing reads with the goal of determining the genetic variation in these samples (e.g., the presence of SNPs). Suppose we were to do this in a GUI program, where we would open each individual sample and subject it to a number of analyses steps. Even if each step were to only require a few seconds (in reality, minutes or even hours…), this would take quite a long time and be prone to errors (and quite boring!). With shell scripting, we can automate these repetitive steps and run the analysis without requiring human input at every step. Some of the key techniques we will use for this are:

  • Navigating to directories and moving files around.
  • Looping over a set of files, calling a piece of software on each of them.
  • Extracting information from a particular location in a text file.
  • Compressing and extracting files.
  • Chaining commands: passing the output of one tool to another one. E.g., after aligning reads to a reference genome, the resulting output can be fed to the next step of the pipeline, the variant caller.
  • Etc.

1.4 Don’t get discouraged

Learning to use the shell, or learning programming languages and bioinformatics skills in general, can be daunting if you have had little experience with these types of tasks in the past. Don’t worry though, just take your time and things will become easier over time as you gain more experience.

We do not expect you to be able to memorize every single command and all of its option. Instead, it is more important to be aware of the existence of commands to perform particular tasks, and to be able to independently retrieve information on how to use them when the need arises.

Finally, the appendix (Section A.1) of the course also contains a bunch of tips and tricks to keep in mind while learning your way around the shell.

1.5 A note on terminology

You will often see the terms command line (interface), terminal, shell, bash, unix (or unix-like) being thrown around more or less interchangeably (including in this course). Most of the time, it is not terribly important to know all the minute differences between them, but you can find an overview here if you are curious: https://astrobiomike.github.io/unix/unix-intro.