RSAT tutorial - Analysing bacterial operons

Less does more or less the same as more, but rather more than less.
Jacques van Helden and Denis Puthier, June 24, 2014



Introduction

Distance-based prediction of operons

The Web tool infer-operon relies on a very simple distance-based method to group coding sequences (CDS) into putative operons (in the sense of polycistronic transcripts).

The rule is very simple: the user specifies a distance threshold (typically 55bp). Every intergenic region is then annotated as either "between-operons" or "within-operons", according to the following criteria:

Although rudimentary, this method has the merit of being very quick (a few seconds for a whole genome) and of achieving reasonably good accuracy (~80%).
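
To make the idea of a distance threshold concrete, here is a minimal sketch in shell and awk. It is not the infer-operon implementation. It assumes that two adjacent genes are grouped together when they lie on the same strand and are separated by at most the threshold, which is a plausible reading of the rule rather than a statement of the tool's exact criteria. The function name classify_intergenic, the input layout (name, strand, start, end, sorted by start position) and the coordinates are all invented for the example.

```shell
# Hypothetical helper: label each intergenic region between adjacent genes.
# Input (stdin): tab-delimited rows "name strand start end", sorted by start.
# Assumed rule (NOT the actual infer-operon code): same strand and
# distance <= threshold -> within-operons; otherwise between-operons.
classify_intergenic() {
  awk -F'\t' -v OFS='\t' -v max_dist="$1" '
    NR > 1 {
      dist = $3 - prev_end - 1    # intergenic distance in bp (negative if genes overlap)
      label = ($2 == prev_strand && dist <= max_dist) ? "within-operons" : "between-operons"
      print prev_name, $1, dist, label
    }
    { prev_name = $1; prev_strand = $2; prev_end = $4 }
  '
}

# Toy coordinates, invented for the example; threshold of 55 bp
printf 'geneA\t+\t100\t400\ngeneB\t+\t420\t900\ngeneC\t-\t930\t1500\ngeneD\t-\t1700\t2100\n' |
  classify_intergenic 55
```

Each output row gives the two flanking genes, the distance between them and the resulting label. The sketch treats the gene list as linear and does not handle the junction between the last and first genes of a circular chromosome.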



Starting a secure shell connection to the RSAT server

Windows users

Unix users (Linux, Mac OSX)



Inferring operons for all the genes of your genome of interest

Inferring operons via RSAT Web interface

  1. Open a connection to the RSAT server (http://rsat.eu/).
  2. In the left-side panel, expand the title Genomes and genes, and open the tool infer operons.
  3. Select your organism of interest (e.g. Escherichia_coli_K_12_substr__MG1655_uid57779).
  4. For the option Genes, select all.
  5. For the option Minimum number of genes, replace the default value (2) with 1.

    Setting the minimum number of genes to 1 means that the tool returns not only operons (polycistronic transcripts) but also single-gene transcription units. This will allow us to compute some statistics on the number of predicted operons versus single-gene transcription units.

  6. Leave all other parameters unchanged and click GO.

After a few seconds, the Web site displays the result table. Each gene of the genome appears on a separate row, annotated by several characteristics:

  1. ID of the query gene
  2. name of the query gene
  3. name of the predicted operon leader gene
  4. gene list of the predicted operon
  5. distance from query gene to its closest upstream neighbour
  6. number of genes in the operon

Some tips:

  1. Clicking on any column header sorts the result table according to the content of this column.
  2. At the bottom of the page, under the title Result file(s), you can right-click the link to the tab-delimited operon table and download the file to your computer. You can then open it with a spreadsheet program (Excel, OpenOffice Calc, ...) to explore it further.

Inferring operons in the Unix shell

The following command will predict the operon grouping for all the genes of Escherichia coli (using the strain K12 MG1655).

## Infer operons and single-gene transcription units for each and every gene of a given organism
$RSAT/perl-scripts/infer-operon  -org Escherichia_coli_K_12_substr__MG1655_uid57779 \
   -dist 55 -min_gene_nb 1 -return query,name,leader,operon,upstr_dist,gene_nb -all \
   -o operons_Escherichia_coli_K_12_substr__MG1655_uid57779.tab

You can then inspect the file with the usual Unix commands (head, tail, less, grep, cut, ...).
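
As a starting point, the commands below sketch one way to look at the result file. They rest on two assumptions that you should check against your own output: that comment lines start with ';' and the column header with '#', and that the columns appear in the order given to -return (query, name, leader, operon, upstr_dist, gene_nb). The helper data_rows is hypothetical, and lacZ is only an example of a gene name to search for.

```shell
# Output file produced by the infer-operon command above
operons=operons_Escherichia_coli_K_12_substr__MG1655_uid57779.tab

# Hypothetical helper: print only the data rows, skipping lines that start
# with ';' (comments) or '#' (column header). This layout is an assumption.
data_rows() {
  grep -v -e '^;' -e '^#' "$1"
}

if [ -f "$operons" ]; then
  head -n 20 "$operons"                       # first lines, including comments and header
  grep -w 'lacZ' "$operons"                   # rows mentioning a given gene name
  data_rows "$operons" | cut -f 2,3,6 | head  # name, leader and gene number (assumed column order)
  data_rows "$operons" | wc -l                # number of data rows
fi
```

Piping data_rows into cut, sort and uniq gives most of the building blocks needed for the exercises.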

Exercises

  1. Using simple unix commands or a spreadsheet, collect some statistics about the predicted operons.

    1. How many genes do you have in the selected genome?
    2. How many distinct operons did the program predict?
    3. How many genes are located in single-gene transcription units?
    4. How many genes are located in polycistronic transcription units?
    5. How many genes does the longest operon contain?
    6. Judging from the gene names, can you guess whether this longest operon contains functionally related genes?

  2. Adapt the parameters in order to collect directons instead of operons, and answer the same questions as above.

    Directons are defined as maximal (i.e. non-extensible) sets of contiguous genes transcribed in the same direction.
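
The definition of a directon can be illustrated with a short awk sketch, independently of infer-operon. The function name list_directons, the two-column input (gene name and strand, in chromosomal order) and the gene names are invented for the example. The sketch treats the gene list as linear, so it does not merge a directon that spans the origin of a circular chromosome.

```shell
# Hypothetical helper: group contiguous same-strand genes into directons.
# Input (stdin): tab-delimited rows "name strand", in chromosomal order.
# Output: one directon per line: strand, number of genes, ';'-separated gene list.
list_directons() {
  awk -F'\t' -v OFS='\t' '
    # A strand change closes the current directon
    NR > 1 && $2 != strand { print strand, n, genes; n = 0; genes = "" }
    { strand = $2; n++; genes = (genes == "" ? $1 : (genes ";" $1)) }
    END { if (n > 0) print strand, n, genes }
  '
}

# Toy gene order, invented for the example
printf 'geneA\t+\ngeneB\t+\ngeneC\t-\ngeneD\t-\ngeneE\t-\ngeneF\t+\n' | list_directons
```

On this toy input the sketch reports three directons, containing 2, 3 and 1 genes respectively.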