RSAT tutorial - First steps in the unix shell

Less does more or less the same as more, but rather more than less.
Jacques van Helden and Denis Puthier, June 24, 2014



In this tutorial, we will address the situation where we want to connect to a Unix server and use some command lines to perform specific tasks.

This series of exercises aims at providing you with the basics of the Unix shell environment. We will open a programmatic connection to the EnsemblGenomes server and start to create a bioinformatics workflow.


Starting a terminal under Windows

Starting a terminal under Macintosh

Creating a working directory

After connecting, you will be able to work on the remote server (whose name is "pedagogix"). To tell the server what you want it to do, you will need to write instructions. The basic building blocks of these instructions are Unix commands, small programs dedicated to specific tasks.

Here we will instruct the server to display our location in the file tree, then to create a directory.
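As a minimal sketch, these two steps could look as follows. The directory name rsat_practical is an assumption chosen for illustration; any name will do.

```shell
# Display our current location in the file tree
pwd

# Create a working directory ("rsat_practical" is a hypothetical name;
# -p avoids an error if the directory already exists)
mkdir -p rsat_practical

# Move into the new directory and check the location again
cd rsat_practical
pwd
```

The second pwd should print the same path as the first one, followed by the name of the new directory.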

Retrieving species list from Ensembl

Getting information about ensembl_genomes

For this training, we have developed a small Python program (ensembl_genomes) that connects to the Ensembl Genomes REST API. It fetches species-related data from the EnsemblGenomes database, such as the list of supported species or their annotations (genes, transcripts, proteins). You can get information about ensembl_genomes with the -h argument (--help). At the moment, ensembl_genomes implements two tasks, retrieve_species and retrieve_features; each of them also accepts the -h argument. Note that ensembl_genomes is still under development.

    # get information about the program
    ensembl_genomes -h
    # get information about the retrieve_species implemented task
    ensembl_genomes retrieve_species -h
    # get information about the retrieve_features implemented task
    ensembl_genomes retrieve_features -h

Retrieving species information using ensembl_genomes

We will use retrieve_species from ensembl_genomes to get the list of available genomes. Below are the available arguments. In general, command arguments are used to define the input and output files or to modify the way the program works.

usage: retrieve_species [-h] [-o OUTPUT] [-v {1,2,3}]
                                           [-f FILE]

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Choose the output file name containing the
                        informations about all species
  -v {1,2,3}, --verbosity {1,2,3}
                        Choose the verbosity level
                        default: None 
                        1: few informations
                        2: detailed informations
                        3: complete
  -f FILE, --file FILE  If enabled, this option will ensure the creation of a
                        separate file containing only species names (one
                        species per line).

We will instruct the program to retrieve the species list (including for each of them division, name, taxon_id, release, display_name, common_name,...) but also to store in a separate file the list of species as a single column.

## Run the task "retrieve_species"
ensembl_genomes retrieve_species -v 1 -o ensembl_data/ -f ensembl_data/species_list.txt
ls -ltr ensembl_data ## should contain 2 files

NB: argument for the ls command: -l (long format, gives detailed information about each file)

NB: argument for the ls command: -t (sort by modification time, newest first)

NB: argument for the ls command: -r (reverse order while sorting)
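The effect of these three options can be tried on a scratch directory. The sketch below uses hypothetical file names and a temporary directory, so it does not depend on the ensembl_data content.

```shell
# Build a scratch directory with two files of different ages
# (directory and file names are hypothetical)
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/old.txt"   # modification time set to 1 Jan 2020
touch "$demo_dir/new.txt"                   # modification time is now

# -t alone: newest file first
ls -t "$demo_dir"

# -t with -r: oldest file first, so the most recent file ends the listing
ls -tr "$demo_dir"

# adding -l shows permissions, owner, size and date for each file
ls -ltr "$demo_dir"
```

Putting the most recent file at the bottom is convenient in directories with many files, since the last lines are the ones that stay visible in the terminal.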

Inspecting the result

Unix-like systems (Linux, Mac OS X, ...) offer many useful commands for working with files. We will see some of these basic commands, which are especially helpful when working with large files.

First we will go into the ensembl_data directory using the cd command (change directory).

# changing directory
cd ensembl_data

# print the path of the current working directory
pwd

# go up one level in the file tree (that is go back to the previous directory)
cd ..

Count the number of supported species at EnsemblGenomes (i.e. the number of lines in the species_list.txt file).

## Count the number of supported species at EnsemblGenomes
wc -l ensembl_data/species_list.txt
## Today: 11,169 species !

Display the first n rows of a file.

## Check the first 10 rows of the file
head -n 10 ensembl_data/species_list.txt

Display the last n rows of a file.

## Check the last 20 rows of the file
tail -n 20 ensembl_data/species_list.txt
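To see exactly what wc, head and tail return, it can help to try them on a file small enough to check by eye. The sketch below builds such a toy file; its name and content are assumptions made for illustration only.

```shell
# Create a toy file with 5 lines, one species name per line
# (toy content; ">" writes the output of printf into the file)
toy_file=$(mktemp)
printf '%s\n' arabidopsis_thaliana escherichia_coli \
    saccharomyces_cerevisiae drosophila_melanogaster \
    caenorhabditis_elegans > "$toy_file"

# Count the lines: prints the count followed by the file name
wc -l "$toy_file"

# First two lines
head -n 2 "$toy_file"

# Last two lines
tail -n 2 "$toy_file"
```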

Create a new file containing the species names in alphabetical order.

## Species names are retrieved in an arbitrary order. Let us sort them alphabetically
sort ensembl_data/species_list.txt > ensembl_data/species_list_sorted.txt
## The ">" character redirects the output to a file (by default it goes to the screen)
## Check that species names are properly sorted
head ensembl_data/species_list_sorted.txt
tail ensembl_data/species_list_sorted.txt
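Checking the head and the tail only shows both ends of the file. As a complementary sketch, sort can also verify the whole file with its -c option. The toy file names below are hypothetical.

```shell
# Toy unsorted list (names are examples only)
toy_dir=$(mktemp -d)
printf '%s\n' zea_mays arabidopsis_thaliana oryza_sativa > "$toy_dir/species.txt"

# Without ">", the sorted lines are printed on the screen
sort "$toy_dir/species.txt"

# With ">", nothing is printed: the sorted lines go to a new file
sort "$toy_dir/species.txt" > "$toy_dir/species_sorted.txt"

# sort -c checks whether a file is already sorted: it prints nothing and
# succeeds if so, and reports the first misplaced line otherwise
sort -c "$toy_dir/species_sorted.txt" && echo "sorted"
```

Note that ">" overwrites the target file without warning if it already exists, so it is safer to write the result to a new file name, as done here.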

Let's have a look at the species_list_sorted.txt file using a pager.

## Inspect the sorted species list in a page-per-page mode
more ensembl_data/species_list_sorted.txt
## Use 'enter' key to navigate
## Use 'q' key to quit
## "Less does more or less the same as more ... but rather more than less"
## A thought from JVH early this morning. :) DP
less ensembl_data/species_list_sorted.txt
## Use vertical arrow keys to navigate
## Use 'q' key to quit

Let's inspect the species description table. How many rows does it contain? Use wc (word count) with the -l option (lines), then have a look at the file using less.

	## Inspect the species description table
	## How many rows
	wc -l ensembl_data/
	## Inspect the content using less
	less ensembl_data/ # q to quit

Cut columns 2, 7 and 8 (name, taxonomy ID, release number) and send the text stream to the 'less' command. Note that the pipe character ("|") sends the output of one command to another command as its input.

	## Cut the columns 2, 7 and 8
	cut -f 2,7,8 ensembl_data/ | less ## select some columns of interest
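The way cut and the pipe work together can be sketched on a toy table. Everything in the example below (file, column layout, values) is made up for illustration, and head replaces less so that the example runs without keyboard interaction.

```shell
# Toy tab-separated table with 3 columns: division, name, taxonomy ID
# (all values are made up; printf repeats the format for each group of 3 arguments)
toy_table=$(mktemp)
printf '%s\t%s\t%s\n' \
    DivA species_one 101 \
    DivB species_two 202 \
    DivA species_three 303 > "$toy_table"

# cut -f selects tab-separated columns; here columns 2 and 3
cut -f 2,3 "$toy_table"

# The pipe sends the output of cut to head, which keeps only the first line
cut -f 2,3 "$toy_table" | head -n 1
```

By default cut expects tabs between columns; the -d option sets another delimiter, for example -d ',' for comma-separated files.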

Cut column 1 (division), sort it, then get the list of non-redundant character strings (uniq) and count their occurrences (-c).

	## The number of species in each division
	cut -f1   ensembl_data/ | sort | uniq -c 
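Building such a pipeline one command at a time makes each stage easy to check. The sketch below does so on a toy table whose content is made up for illustration. It also shows why sort is needed: uniq only merges identical lines that are adjacent.

```shell
# Toy table: division in column 1, species name in column 2 (made-up rows)
toy_divisions=$(mktemp)
printf '%s\t%s\n' \
    EnsemblPlants sp1 \
    EnsemblFungi sp2 \
    EnsemblPlants sp3 > "$toy_divisions"

# Step 1: keep the first column only
cut -f 1 "$toy_divisions"

# Step 2: sort, so that identical values become adjacent
cut -f 1 "$toy_divisions" | sort

# Step 3: collapse adjacent duplicates and prefix each value with its count
cut -f 1 "$toy_divisions" | sort | uniq -c

# Without sort, the two EnsemblPlants lines are not adjacent
# and are therefore counted separately
cut -f 1 "$toy_divisions" | uniq -c
```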

Retrieving gene models

Now that we have introduced the basics about command lines, use them to get information about the gene models of your favorite organism.
