Instructions for AMU students

The practical that follows is written in English, to familiarize you with the language used throughout international bioinformatics resources.

Practical reports may nevertheless be written in French.

Each student must submit their results in two ways:

  1. Entering selected results in a shared table. This table will be filled in as the practical progresses. Each student will enter their own results on one row of the table, and this collective record will make it possible to evaluate each individual result in the light of the others.

  2. Uploading an individual report to the Ametice site. This report will be graded, taking into account the correctness of the results as well as the clarity of the answers.

Warning: we strongly recommend that you fill in both documents as the practical progresses, to make sure you have all the information needed to answer the questions without having to go back or redo parts of the exercises.


Introduction

Harbison and co-workers (2004) used the ChIP-on-chip technology to identify the target genes of about two hundred transcription factors of the budding yeast Saccharomyces cerevisiae.

For some transcription factors, they performed several experiments in different culture media, in order to assess the impact of the environmental conditions on the regulated genes. Indeed, it is well known that transcription factors are themselves regulated (transcriptionally and post-transcriptionally) depending on the presence of nutrients, stress conditions, phases of the cell cycle, and many other parameters of their environment.

Goal of this tutorial

In this tutorial, we will start from a cluster of target genes identified by ChIP-on-chip for one particular transcription factor in one particular growth condition, and use a series of bioinformatics tools to answer two basic questions:

  1. What are these genes doing together?

    Each of these genes has a particular molecular activity (enzyme, transporter, regulator, …) and is involved in one or several biological processes. Beyond these individual functions, we will investigate whether they are collectively involved in some biological process (e.g. metabolic pathway, response to some stress, …). For this, we will use an approach called enrichment analysis.

  2. Can we discover over-represented motifs in their promoters?

    In yeast, transcriptional regulation is relatively simple compared to metazoa or plants. Cis-regulatory elements are located in the non-coding regions upstream of the regulated genes, with ~400 bp per upstream region. The ChIP-on-chip technology makes it possible to detect all the regions bound by a given TF in a given condition. When analysing motifs in these promoters, our first expectation is to find the motif specifically bound by the immunoprecipitated transcription factor. In some cases, we may also discover additional motifs, suggesting that the gene cluster is co-regulated by other factors in addition to the immunoprecipitated TF.

The goals of this tutorial extend well beyond the discovery of some nice web sites producing fancy figures after a few clicks. We expect students to exercise critical judgment in order to evaluate the reliability and the interest of the results, paying particular attention to two complementary questions:

  1. What is the statistical significance of the results? Try to understand the statistics returned by the different programs and interpret the risk of false positives.

  2. What is the biological relevance of the results? This can be assessed by evaluating the consistency between the results returned by the different tools, and by comparing their results with the known function of the transcription factor.

The reliability of the tools can further be tested empirically with what we will call a negative control, which consists in submitting random sets of genes as queries. Even though each of the query genes has its own function and is involved in some particular biological process or pathway, there is no reason why a random collection of them would be functionally related. The software tools should thus not return any significant result: for negative controls, a null answer (no result) is the correct answer.


Abbreviations

| Abbreviation | Meaning |
|---|---|
| TF | Transcription factor |
| TFBM | Transcription factor binding motif |
| TFBS | Transcription factor binding site |
| GO | Gene Ontology |

Bioinformatics resources

| Resource | Description | URL |
|---|---|---|
| Gene sets | List of target genes for each factor/condition | Harbison_2004_genesets.tsv |
| Uniprot | Database of protein sequence and function | http://uniprot.org |
| SGD | Saccharomyces Genome Database | http://www.yeastgenome.org/ |
| g:Profiler | A web server for functional interpretation of gene lists | http://biit.cs.ut.ee/gprofiler/ |
| YeastCyc | Database of yeast metabolic pathways | http://yeast.biocyc.org/ |
| KEGG Pathway Mapper | A set of tools for highlighting a list of query genes in the metabolic pathways of the Kyoto Encyclopaedia of Genes and Genomes | http://www.genome.jp/kegg/mapper.html |
| RSAT | Regulatory Sequence Analysis Tools | http://rsat.eu/ |

Data

The sets of target genes are provided in a text file with tab-separated values.

Harbison_2004_genesets.tsv

This file can be opened with spreadsheet software (LibreOffice Calc, Microsoft Excel, …), or with a simple text editor.

Column contents:

  1. Gene identifiers (according to the convention agreed among yeast geneticists).
  2. Transcription factor
  3. Culture medium (growth condition)
  4. Statistical significance of the binding (\(sig = -\log_{10}(\text{E-value})\))
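
As a quick illustration of the significance column, the score can be converted back to an E-value by inverting the formula of column 4. The sketch below is only an illustration: the function names are ours and do not belong to any tool used in this tutorial, and the two significance values are copied from the MET4 / SM example rows of the gene set file.

```python
import math


def sig_to_evalue(sig: float) -> float:
    """Invert sig = -log10(E-value) to recover the E-value."""
    return 10.0 ** (-sig)


def evalue_to_sig(evalue: float) -> float:
    """Compute sig = -log10(E-value); the E-value must be strictly positive."""
    return -math.log10(evalue)


# Significance values copied from the MET4 / SM example rows.
for gene_id, sig in [("YGR204W", 6.4628565), ("YDR065W", 0.1927444)]:
    print(f"{gene_id}\tsig={sig}\tE-value={sig_to_evalue(sig):.2e}")
```

Under the usual reading of an E-value as an expected number of false positives, a sig of 0 corresponds to an E-value of 1, so higher scores indicate more reliable binding.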

Exercises

Choice of a gene set

Each student will choose a given ChIP-chip result (i.e. a combination of one transcription factor and one culture medium) and select all the target genes.

In order to achieve reasonable power in the subsequent analyses, we recommend choosing a dataset with a sufficient number of target genes (at least 10).

As an illustration, the teacher will use the target genes of the Met4p transcription factor, encoded by the MET4 gene, in SM culture medium. The corresponding rows are displayed below.

Example of gene set: genes whose promoter is bound by the transcription factor Met4p in SM culture medium.
| GENE_ID | factor | condition | significance |
|---|---|---|---|
| YAL012W | MET4 | SM | 3.3900039 |
| YDR065W | MET4 | SM | 0.1927444 |
| YER092W | MET4 | SM | 4.2306097 |
| YFR030W | MET4 | SM | 2.4034924 |
| YGL184C | MET4 | SM | 5.4181212 |
| YGR055W | MET4 | SM | 0.1761979 |
| YGR204W | MET4 | SM | 6.4628565 |
| YLR092W | MET4 | SM | 0.4660093 |
| YLR179C | MET4 | SM | 1.1264004 |
| YLR180W | MET4 | SM | 1.1264004 |
| YLR301W | MET4 | SM | 0.4997180 |
| YNL277W | MET4 | SM | 1.1843924 |

At the beginning of the practical, students will fill in a table with their name, the chosen transcription factor and the culture medium; additional details will be filled in progressively.

  1. Download the table of transcription factor target genes: Harbison_2004_genesets.tsv and open it with a spreadsheet (LibreOffice Calc, Excel, …).

  2. Choose a transcription factor + culture medium of interest (at least 10 target genes), and copy-paste the corresponding lines into a separate spreadsheet, which you will save on your computer.

  3. Investigate the function of your transcription factor of interest, by gathering information from SGD and Uniprot.

    In your report, write 2-3 sentences to summarize the function of your factor.

  4. In the spreadsheet containing your selected ChIP-on-chip result, select the first column (which contains the target gene identifiers), and copy-paste it into a text editor document, which you will save as a plain (non-formatted) text file. For the sake of convenience I will name this file Harbison_2004_MET4_SM_geneIDs.txt; of course, you should adapt the name to the transcription factor and culture medium of your choice.

  5. Open a connection to the collective result table (https://goo.gl/G9pcSq).

  6. Start filling in a row with your information: name, factor, culture medium, number of genes, and description of the transcription factor.
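
For those who prefer scripting to a spreadsheet, the selection of a gene set (steps 2 and 4 above) can be sketched in Python. This is a minimal illustration, not part of the practical: it assumes that the file starts with the header row shown in the MET4 example (GENE_ID, factor, condition, significance), and the function name `select_gene_ids` is ours. The sample rows are copied from the MET4 / SM example above.

```python
import csv
import io

# A few rows copied from the MET4 / SM example, tab-separated.
# Assumption: the real file starts with this header row.
SAMPLE_TSV = (
    "GENE_ID\tfactor\tcondition\tsignificance\n"
    "YAL012W\tMET4\tSM\t3.3900039\n"
    "YER092W\tMET4\tSM\t4.2306097\n"
    "YGR204W\tMET4\tSM\t6.4628565\n"
)


def select_gene_ids(handle, factor, condition):
    """Return the gene IDs of all rows matching one factor and one condition."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["GENE_ID"]
        for row in reader
        if row["factor"] == factor and row["condition"] == condition
    ]


gene_ids = select_gene_ids(io.StringIO(SAMPLE_TSV), "MET4", "SM")
print("\n".join(gene_ids))

# With the downloaded file, the same function could be applied as follows:
# with open("Harbison_2004_genesets.tsv", newline="") as handle:
#     gene_ids = select_gene_ids(handle, "MET4", "SM")
# with open("Harbison_2004_MET4_SM_geneIDs.txt", "w") as out:
#     out.write("\n".join(gene_ids) + "\n")
```

The resulting list contains one identifier per line, which is the format expected by the query boxes used in the rest of the tutorial.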


Motif analysis

Motif discovery in the promoters of a gene set

The first question we will address is whether we can discover over-represented motifs in the promoters of our gene set of interest. In principle we expect to find motifs corresponding to the transcription factor used in the ChIP-chip experiment.

  1. Open a connection to RSAT Fungi (http://fungi.rsat.eu/).
**Search Tool Box**


  2. In the search box of the left panel, search for a tool named gene information, and click on it. This opens the query form of the tool Gene information, which returns information about one or more query genes (gene names, chromosomal locations, …).

  3. In the Organism textbox, type Saccharomyces cerevisiae. Note that as you start typing, the interface progressively displays the subset of organisms whose name matches the typed text.

**Organism selection** with auto-completion

  4. Paste your gene identifiers in the Gene queries box, and click GO.

**Gene information form** filled with target genes of MET4 in yeast cultured in SM medium

  5. At the bottom of the gene-info result page, click on the button Retrieve sequences. This opens a new form where the query genes have already been filled in with the results of gene-info.

**Retrieve sequences form**.

  6. Check the options by clicking on the different tabs on the left of the retrieve-seq form (Mandatory inputs, Mandatory options, Advanced options). Then, open the Run analysis tab and click GO.

**Retrieve sequences form, *Run analysis* tab**.

The result page displays links to two files: the query genes and the sequences. Save the resulting sequences (fasta file) on your computer.

**Retrieve sequences result page, *Results* table**.


  7. In the Next step box, click on oligo-analysis.

**Retrieve sequences result page, *Next step* table**.

  8. As oligomer lengths, check “6” and uncheck all the other lengths (see footnote 1 at the end of this tutorial). Leave all the other parameters at their default values, and click GO.

  9. Read the result page carefully and fill in the oligo-analysis columns of the collective report: https://goo.gl/G9pcSq.

    Note: if oligo-analysis does not return any significant result with your query promoters, do not despair yet: you can still run an alternative tool to discover motifs in your promoters. Indeed, some factors recognize a spaced pair of very short oligonucleotides (e.g. \(CGGn_{10}GCC\)) rather than a single oligonucleotide (e.g. \(CACGTGT\)). Such spaced motifs generally escape detection with oligo-analysis. For this reason, we developed an alternative algorithm named dyad-analysis, which detects over-represented spaced pairs of trinucleotides with any spacing value from 0 to 20 (by default).

  10. Repeat the same procedure using dyad-analysis instead of oligo-analysis, and compare the discovered motifs (the difference may depend on your transcription factor of interest). Fill in the dyad-analysis columns of the collective report.
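
To make the difference between the two kinds of motifs concrete, here is a minimal Python sketch of the counting step only. It is not the RSAT implementation and computes no statistics: it merely counts hexanucleotides (grouping each word with its reverse complement, which is a choice made here for the illustration) and builds a regular expression for a spaced dyad such as \(CGGn_{10}GCC\). The function names and the toy sequences are hypothetical.

```python
import re
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_oligos(sequences, k=6):
    """Count k-mers over all sequences, grouping each word with its reverse complement."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N or other symbols
                counts[min(word, reverse_complement(word))] += 1
    return counts


def dyad_pattern(left, spacing, right):
    """Build a regular expression for two short words separated by a fixed spacing."""
    return re.compile(f"{left}[ACGT]{{{spacing}}}{right}")


# Hypothetical toy sequences, NOT real yeast promoters.
toy_promoters = ["TTCACGTGAT", "GGCACGTGCC"]
print(count_oligos(toy_promoters).most_common(3))

dyad = dyad_pattern("CGG", 10, "GCC")
print(bool(dyad.search("AA" + "CGG" + "ATATATATAT" + "GCC" + "TT")))
```

A single hexanucleotide count cannot capture the dyad, because the ten spacer positions are variable: each occurrence of the dyad contributes to different hexanucleotides, which is why a dedicated tool is needed for spaced motifs.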

Negative control: motif discovery in the promoters of a random gene set

A good motif discovery program should be able to return a negative answer when there is no specific functional relationship between the submitted promoter sequences. We can test this with the RSAT program random-genes.

  1. On RSAT Fungi (http://fungi.rsat.eu/), open the menu Build control sets and click Random genes.
  2. Pick a random set of genes with the same number of genes as the targets of your transcription factor.
  3. Save the random gene list in a separate file; we will reuse it below for the other functional analysis tools.
  4. Run the same procedure as above to discover motifs in the promoters with oligo-analysis and dyad-analysis.
  5. Fill in the columns for this negative control in the collective report table.
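
The principle of the negative control can also be sketched locally in Python. This is only an illustration of random sampling without replacement, not a substitute for the RSAT random-genes tool: the function name is ours, and the gene identifiers below are hypothetical placeholders standing in for the full list of yeast genes.

```python
import random


def random_gene_set(all_genes, size, seed=None):
    """Draw `size` distinct genes at random; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    return rng.sample(list(all_genes), size)


# Hypothetical placeholder identifiers, NOT real yeast gene names.
all_genes = [f"GENE{i:04d}" for i in range(1, 6001)]

# Same size as the MET4 / SM example gene set (12 genes).
control = random_gene_set(all_genes, size=12, seed=42)
print(control)
```

Drawing the control with the same size as the gene set of interest keeps the two analyses comparable, since the number of sequences affects the statistics of the motif discovery programs.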

Summarize your motif analysis

In your personal report, comment in a few lines (max 10 lines) the results obtained with oligo-analysis and dyad-analysis in your gene set of interest (the target genes detected by ChIP-chip) and in the random gene selection.


Converting gene IDs

In the previous section, we used the RSAT tool random-genes to pick a set of genes as negative control. For the yeast Saccharomyces cerevisiae, RSAT uses the gene identifiers from the NCBI ENTREZ database.

Annoyingly, the tools available on the Web do not all use the same IDs, for several reasons. The heterogeneity of identifiers for biological objects (genes, proteins, molecules, …) can be cumbersome, and is currently considered a challenge for the interoperability of bioinformatics resources.

The problem is so general that some web sites have developed specialized tools dedicated to ID conversion. We will test one of these, in order to obtain suitable gene identifiers for the subsequent steps of this tutorial.

  1. Open a connection to the g:Convert tool https://biit.cs.ut.ee/gprofiler/gconvert.cgi, which is part of the g:Profiler Web site.

  2. In the Query box, paste the IDs returned by RSAT random-genes, which should be numbers like those below.

855536
852544
856121
851763
...
  3. Set the organism to Saccharomyces cerevisiae, select SGD_GENE as target database, make sure that the numeric input is considered as ENTREZGENE_ACC, and click Convert IDs.

  4. To keep the converted IDs available for the subsequent steps of the tutorial, we will store them on our computer. For this, select the option Excel spreadsheet (XLSX) as output type.

**g:Convert export to an Excel table**.

  5. Click on the Convert IDs button, and click on the link Download data ….

  6. Open the downloaded file with a spreadsheet tool (LibreOffice Calc, Excel).


Functional enrichment

  1. Open a connection to g:Profiler (http://biit.cs.ut.ee/gprofiler/).

  2. Select the organism Saccharomyces cerevisiae, and submit your list of TF target genes.

**g:Profiler query form.**

  3. Save the result page and figures on your computer.

  4. Do the same with the random gene selection.

  5. In the collective report, fill in the columns related to g:Profiler.

  6. In your personal report, summarize your interpretation of the results in one paragraph.
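
To help interpret the p-values reported by an enrichment tool, the sketch below computes a hypergeometric tail probability, which is the statistic commonly used for this kind of analysis. We do not claim that it reproduces the exact computation or the multiple-testing correction applied by g:Profiler; the function name and all the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_pvalue(x, n, K, N):
    """P(X >= x) when drawing n genes from N, of which K belong to the functional class.

    x: number of query genes annotated to the class
    n: number of genes in the query set
    K: number of genes annotated to the class in the whole genome
    N: total number of genes in the genome
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(x, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers: 5 of 12 query genes fall in a class of 20 genes, genome of 6000 genes.
p = hypergeom_pvalue(x=5, n=12, K=20, N=6000)
print(f"P-value = {p:.3g}")
```

Because an enrichment tool tests many functional classes at once, a nominal p-value of this kind has to be corrected for multiple testing before it is interpreted, which is one reason why random gene sets are expected to return no significant class.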


Metabolic pathway exploration

KEGG Pathway mapper

  1. Open a connection to the KEGG pathway mapper (http://www.genome.jp/kegg/mapper.html).

  2. Test the different tools with the list of targets, and the random gene set. In particular, try the tool Search & Color pathways.

    Note: for the organism, use the acronym sce for Saccharomyces cerevisiae.

**KEGG Search & Color Pathways: query form.**

Here is an example of result for the MET4 targets in SM medium.

**KEGG Search & Color Pathways.** Example of colored map. Green boxes indicate enzymes for which a gene was identified in the yeast genome. Query genes are highlighted in pink (in this case, MET4 target genes in SM medium from Harbison et al., 2004).

  3. Interpret the results obtained with your gene set: are your target genes grouped into one or a few pathways? How close do they appear on the pathway maps? How does this compare with the random gene selection?

  4. Fill in the KEGG-related columns in the collective report.

  5. In your personal report, summarize the results of the KEGG pathway coloring in a few lines.

YeastCyc

  1. Open a connection to BioCyc (http://biocyc.org/).

  2. In the top right side of the page, click Change organism database (the default is the bacterium Escherichia coli).

**BioCyc organism database** (default is *Escherichia coli*)

  3. Type Saccharomyces cerevisiae and select the reference strain Saccharomyces cerevisiae S288c (RefSeq assembly GCF_000146045.2).

**BioCyc organism selection**

  4. Make sure that the selected strain appears correctly in the top-right corner.

**BioCyc yeast database**

  5. In the menu Metabolism, select Generate Metabolic map poster.

We will now highlight the target genes on the metabolic overview map.

  1. Open the Metabolism -> Cellular Overview map (http://yeast.biocyc.org/overviewsWeb/celOv.shtml).

  2. In the top-right corner of the YeastCyc screen, click Show operations and select Highlight -> Highlight Gene(s) -> From file.

  3. In the dialog box, click Choose file, locate the gene ID text file on your computer (in my case the file is named Harbison_2004_MET4_SM_geneIDs.txt), and click Highlight.

  4. Analyse the resulting map.

    • Do the target genes appear to be involved in similar pathways?
    • Do they appear close together or far apart on the map?
  5. Run the same analysis with the random gene selection.


References

Harbison, C, D Gordon, T Lee, N Rinaldi, K Macisaac, T Danford, N Hannett, et al. 2004. “Transcriptional regulatory code of a eukaryotic genome.” Nature 431 (7004): 99–104.


  1. This is only to simplify the discussion at the end of this tutorial; for normal use of the tool, we recommend analysing hexa-, hepta- and octanucleotides simultaneously.