The practical below is written in English, to familiarize you with the language used across international bioinformatics resources.
Practical reports may nevertheless be written in French.
Each student must submit their results in two ways:
Entry of selected results in a shared table. This table will be filled in as the practical progresses. Each student will record their own results on one row of the table, and this collective record will make it possible to assess each individual result in light of the others.
Upload of an individual report to the Ametice site. This report will be graded on the correctness of the results, but also on the clarity of the answers.
Warning: we strongly recommend that you fill in both documents as the practical progresses, so that you have all the information needed to answer the questions without having to go back or redo parts of the exercises.
Harbison and co-workers (2004) used the ChIP-on-chip technology to identify the target genes of a hundred transcription factors of the budding yeast Saccharomyces cerevisiae.
For some transcription factors, they performed several experiments in different culture media, in order to assess the impact of environmental conditions on the regulated genes. Indeed, it is well known that transcription factors are themselves regulated (transcriptionally and post-transcriptionally) depending on the presence of nutrients, stress conditions, phases of the cell cycle, and many other parameters of their environment.
In this tutorial, we will start from a cluster of target genes identified by ChIP-on-chip for one particular transcription factor, in yeast grown under one particular condition, and use a series of bioinformatics tools to answer two basic questions:
What are these genes doing together?
Each of these genes has a particular molecular activity (enzyme, transporter, regulator, …) and is involved in one or several biological processes. Beyond these individual functions, we will investigate whether they are collectively involved in some biological process (e.g. metabolic pathway, response to some stress, …). For this, we will use a first approach called enrichment analysis.
Can we discover over-represented motifs in their promoters?
In yeast, transcriptional regulation is relatively simple compared to metazoa or plants. Cis-regulatory elements are located in the non-coding regions upstream of the regulated genes, with ~400 bp per upstream region. The ChIP-on-chip technology makes it possible to detect all the regions bound by a given TF in a given condition. When analysing motifs in these promoters, our first expectation is to find the motif specifically bound by the immunoprecipitated transcription factor. In some cases, we may also discover additional motifs, suggesting that the gene cluster is co-regulated by other factors in addition to the immunoprecipitated TF.
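To make the idea of enrichment analysis more concrete, here is a minimal sketch of the hypergeometric test that enrichment tools commonly rely on. The function name and all the numbers (genome size, annotation size, overlap) are hypothetical and chosen only for illustration; real tools add a correction for multiple testing on top of this.

```python
from math import comb

def hypergeom_right_tail(k, n, K, N):
    """P(X >= k): probability of observing at least k annotated genes
    when drawing n genes without replacement from a genome of N genes,
    K of which are annotated to the biological process of interest."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical numbers, for illustration only
N = 6000  # genes in the genome
K = 30    # genes annotated to the process
n = 12    # genes in the query set
k = 5     # query genes annotated to the process

p_value = hypergeom_right_tail(k, n, K, N)
print(f"P(X >= {k}) = {p_value:.3g}")
```

With these numbers the expected overlap by chance is n × K / N = 0.06 genes, so observing 5 annotated genes in the query gives a very small p-value.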
The goals of this tutorial extend far beyond the discovery of some nice web sites producing fancy figures after a few clicks. Indeed, we expect students to exercise their critical judgment in order to evaluate the reliability and the interest of the results, by paying particular attention to two complementary questions:
What is the statistical significance of the results? Attempt to understand the statistics returned by the different programs and interpret the risks of false positives.
What is the biological relevance of the results? This can be done by evaluating the consistency between the results returned by the different tools, and by comparing their results with the known function of the transcription factor.
The reliability of the tools can further be tested by an empirical test that we will call negative control, which consists in submitting random sets of genes as queries. Indeed, even though each query gene has its own function and is involved in some particular biological process or pathway, there is no reason why a random collection of genes would be functionally related. The software tools should therefore not return any significant result. For negative controls, a null answer (no result) is the correct answer.
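The two ideas above (risk of false positives and negative control) can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, the gene names are placeholders rather than real yeast identifiers, and the expected number of false positives is taken as the p-value threshold multiplied by the number of tests.

```python
import random

def expected_false_positives(p_threshold, n_tests):
    """Expected number of false positives when n_tests independent tests
    are each declared significant at the nominal threshold p_threshold."""
    return p_threshold * n_tests

def random_gene_set(all_genes, size, seed=None):
    """Draw a random gene set (without replacement) to use as negative control."""
    rng = random.Random(seed)
    return rng.sample(all_genes, size)

# Testing every possible hexanucleotide means 4^6 tests per analysis
n_hexanucleotides = 4 ** 6
print(expected_false_positives(0.001, n_hexanucleotides))

# Placeholder gene names (hypothetical, not real yeast identifiers)
genome = [f"gene_{i:04d}" for i in range(1, 6001)]
control = random_gene_set(genome, size=12, seed=1)
print(control)
```

The first value shows why a nominal p-value cannot be interpreted alone: when thousands of patterns are tested, a few of them are expected to pass a threshold of 0.001 by chance.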
| Abbreviation | Meaning |
|---|---|
| TF | Transcription factor |
| TFBM | Transcription factor binding motif |
| TFBS | Transcription factor binding site |
| GO | Gene Ontology |
| Resource | Description | URL |
|---|---|---|
| Gene sets | List of target genes for each factor/condition | Harbison_2004_genesets.tsv |
| Uniprot | Database of protein sequence and function | http://uniprot.org |
| SGD | Saccharomyces Genome Database | http://www.yeastgenome.org/ |
| g:Profiler | A web server for functional interpretation of gene lists | http://biit.cs.ut.ee/gprofiler/ |
| YeastCyc | Database of yeast metabolic pathways | http://yeast.biocyc.org/ |
| KEGG Pathway Mapper | A set of tools to highlight a list of query genes in the metabolic pathways of the Kyoto Encyclopedia of Genes and Genomes | http://www.genome.jp/kegg/mapper.html |
| RSAT | Regulatory Sequence Analysis Tools | http://rsat.eu/ |
The sets of target genes are provided in a text file with tab-separated values.
This file can be opened with spreadsheet software (LibreOffice Calc, Microsoft Excel, …), or with a simple text editor.
Column contents:
Each student will choose a given ChIP-chip result (i.e. a combination of one transcription factor and one culture medium) and select all the target genes.
In order to achieve reasonable power in the subsequent analyses, we recommend choosing a dataset with a sufficient number of target genes (at least 10).
As an illustration, the teacher will use the target genes of the Met4p transcription factor, encoded by the MET4 gene, in SM culture medium. The corresponding rows are displayed below.
| GENE_ID | factor | condition | significance |
|---|---|---|---|
| YAL012W | MET4 | SM | 3.3900039 |
| YDR065W | MET4 | SM | 0.1927444 |
| YER092W | MET4 | SM | 4.2306097 |
| YFR030W | MET4 | SM | 2.4034924 |
| YGL184C | MET4 | SM | 5.4181212 |
| YGR055W | MET4 | SM | 0.1761979 |
| YGR204W | MET4 | SM | 6.4628565 |
| YLR092W | MET4 | SM | 0.4660093 |
| YLR179C | MET4 | SM | 1.1264004 |
| YLR180W | MET4 | SM | 1.1264004 |
| YLR301W | MET4 | SM | 0.4997180 |
| YNL277W | MET4 | SM | 1.1843924 |
At the beginning of the practical, students will fill in a table with their name, the chosen transcription factor and the culture medium; additional details will be filled in progressively.
Download the table of transcription factor target genes: Harbison_2004_genesets.tsv and open it with a spreadsheet (LibreOffice Calc, Excel, …).
Choose a transcription factor + culture medium of interest (at least 10 target genes), copy-paste the corresponding lines into a separate spreadsheet, which you will save on your computer.
Investigate the function of your transcription factor of interest, by gathering information from SGD and Uniprot.
In your report, write 2-3 sentences to summarize the function of your factor.
In the spreadsheet containing your selected ChIP-on-chip result, select the first column (which contains the target gene identifiers), and copy-paste it into a text editor document, which you will save as a plain text file. For the sake of convenience I will name this file Harbison_2004_MET4_SM_geneIDs.txt; of course, you should adapt the name to the transcription factor and culture medium of your choice.
Open a connection to the collective result table (https://goo.gl/G9pcSq).
Start filling in a row with your information: name, factor, culture medium, number of genes, and description of the transcription factor.
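For those who prefer scripting to copy-pasting, the row selection and gene ID extraction described in the steps above can be sketched in Python. This is a minimal sketch which assumes that the header of Harbison_2004_genesets.tsv matches the column names displayed in the MET4 table (GENE_ID, factor, condition, significance); the function name and the placeholder row are hypothetical.

```python
import csv
import io

def select_targets(handle, factor, condition):
    """Return the GENE_ID values of the rows matching one factor and one condition."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row["GENE_ID"]
        for row in reader
        if row["factor"] == factor and row["condition"] == condition
    ]

# Three rows copied from the MET4/SM table above, plus one placeholder row
# (hypothetical, only there to show that other datasets are filtered out).
excerpt = (
    "GENE_ID\tfactor\tcondition\tsignificance\n"
    "YAL012W\tMET4\tSM\t3.3900039\n"
    "YDR065W\tMET4\tSM\t0.1927444\n"
    "PLACEHOLDER_GENE\tOTHER_FACTOR\tOTHER_MEDIUM\t1.0\n"
    "YER092W\tMET4\tSM\t4.2306097\n"
)
gene_ids = select_targets(io.StringIO(excerpt), "MET4", "SM")
print("\n".join(gene_ids))

# With the real file (assuming the same header), one would instead write:
# with open("Harbison_2004_genesets.tsv", newline="") as handle:
#     gene_ids = select_targets(handle, "MET4", "SM")
```

The printed list, one identifier per line, has the same layout as the gene ID text file used in the rest of the tutorial.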
The first question we will address is whether we can discover over-represented motifs in the promoters of our gene set of interest. In principle we expect to find motifs corresponding to the transcription factor used in the ChIP-chip experiment.
Search Tool Box
In the search box of the left panel, search a tool named gene information, and click on it. This opens the query form of the tool Gene information, which returns information about one or more query genes (gene names, chromosomal locations, …).
In the Organism textbox, type Saccharomyces cerevisiae. Note that as you start typing, the interface progressively displays the subset of organisms whose name matches the typed text.
Organism selection with auto-completion
Gene information form filled with target genes of MET4 in yeast cultured in SM medium
Retrieve sequences form.
Retrieve sequences form, Run analysis tab.
The result page displays links to two files: the query genes and the sequences. Save the resulting sequences (fasta file) on your computer.
Retrieve sequences result page, Results table.
Retrieve sequences result page, Next step table.
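The fasta file saved at this step is a plain text file in which each sequence is introduced by a header line starting with ">". As a minimal sketch (the function name and the toy sequences are hypothetical, not actual yeast promoters), it can be read in Python as follows.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict {sequence name: sequence}."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep the first word of the header as the sequence name
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {seq_name: "".join(chunks) for seq_name, chunks in parts.items()}

# Toy input (made-up sequences, for illustration only)
toy_fasta = """>seq1 toy upstream region
ttcacg
tgtt
>seq2 another toy region
AACGGAAAAAAAAAAGCCTT
"""
sequences = read_fasta(toy_fasta.splitlines())
for seq_name, seq in sequences.items():
    print(seq_name, len(seq), seq)
```

For the real file, the same function can be called on an open file handle, since iterating over a file yields its lines.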
As oligomer lengths, check “6” and uncheck all the other lengths (see the note on oligomer lengths at the end of this tutorial). Leave all the other parameters at their default values, and click GO.
Read carefully the result page and fill up the oligo-analysis columns of the collective report: https://goo.gl/G9pcSq.
Note: if oligo-analysis does not return any significant result with your query promoters, do not despair yet: there is still a possibility to run an alternative tool to discover motifs in your promoters. Indeed, some factors recognize a spaced pair of very short oligonucleotides (e.g. \(CGGn_{10}GCC\)) rather than a single oligonucleotide (e.g. \(CACGTGT\)). Such spaced motifs generally escape detection with oligo-analysis. For this reason, we developed an alternative algorithm named dyad-analysis, which detects over-represented spaced pairs of tri-nucleotides with any spacing value from 0 to 20 (by default).
Repeat the same procedure using dyad-analysis instead of oligo-analysis, and compare the discovered motifs (the difference may depend on your transcription factor of interest). Fill in the dyad-analysis columns of the collective report.
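To illustrate the difference between the two kinds of patterns, here is a simplified sketch of the counting step only. It is not the RSAT implementation: it ignores the reverse complementary strand, the background model and the significance computation, and the function names and toy sequences are hypothetical.

```python
import re
from collections import Counter

def count_oligos(sequences, k=6):
    """Count all overlapping occurrences of each k-letter word (single strand)."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            word = seq[i:i + k]
            if set(word) <= set("ACGT"):  # skip words containing N or other symbols
                counts[word] += 1
    return counts

def count_dyad(sequences, left, right, spacing):
    """Count occurrences of left + n{spacing} + right, overlaps included."""
    # The lookahead (?=...) lets the regular expression report overlapping matches
    pattern = re.compile(f"(?={left}[ACGT]{{{spacing}}}{right})")
    return sum(len(pattern.findall(seq.upper())) for seq in sequences)

# Toy sequences (made up for illustration)
toy = ["TTCACGTGTT", "AACGGAAAAAAAAAAGCCTT"]
oligo_counts = count_oligos(toy, k=6)
print(oligo_counts["CACGTG"])
print(count_dyad(toy, "CGG", "GCC", spacing=10))
```

In the second toy sequence, no single hexanucleotide is shared with a typical CGG/GCC site, because the informative positions are split into two trinucleotides separated by a variable spacer; counting spaced pairs captures such a site directly.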
A good motif discovery program should be able to return a negative answer when there is no specific functional relationship between the submitted promoter sequences. We can test this with the RSAT program random-genes.
In your personal report, comment in a few lines (max 10 lines) the results obtained with oligo-analysis and dyad-analysis in your gene set of interest (the target genes detected by ChIP-chip) and in the random gene selection.
In the previous section, we used the RSAT tool random-genes to pick a set of genes as negative control. For the yeast Saccharomyces cerevisiae, RSAT uses the gene identifiers from the NCBI ENTREZ database.
Annoyingly, the different tools available on the Web do not all use the same IDs, for several reasons. The heterogeneity of identifiers for biological objects (genes, proteins, molecules, …) can be cumbersome, and is currently considered a challenge for the interoperability of bioinformatics resources.
The problem is so general that some web sites developed specialized tools dedicated to ID conversion. We will test one of these, in order to obtain suitable gene identifiers for the subsequent steps of this tutorial.
Open a connection to the g:Convert tool (https://biit.cs.ut.ee/gprofiler/gconvert.cgi), which is part of the g:Profiler Web site.
In the Query box, paste the IDs returned by RSAT random-genes, which should be numbers like those below.
855536
852544
856121
851763
...
g:Convert export to an Excel table.
Click on the Convert IDs button, and click on the link Download data ….
Open the downloaded file with a spreadsheet tool (LibreOffice Calc, Excel).
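Conceptually, an ID conversion is a lookup in a two-column table, with the additional need to report the identifiers that could not be converted. The sketch below illustrates this principle only: the function name is hypothetical, the target names are placeholders (not the real conversions of these ENTREZ IDs), and the layout of the file exported by g:Convert is not assumed here.

```python
def convert_ids(query_ids, mapping):
    """Split query IDs into converted names and IDs absent from the mapping."""
    converted, unmapped = [], []
    for gene_id in query_ids:
        if gene_id in mapping:
            converted.append(mapping[gene_id])
        else:
            unmapped.append(gene_id)
    return converted, unmapped

# ENTREZ IDs taken from the example above; the target names are PLACEHOLDERS,
# not the actual identifiers returned by g:Convert.
mapping = {
    "855536": "TARGET_ID_A",
    "852544": "TARGET_ID_B",
    "856121": "TARGET_ID_C",
}
query = ["855536", "852544", "856121", "851763"]
converted, unmapped = convert_ids(query, mapping)
print("converted:", converted)
print("not converted:", unmapped)
```

Keeping track of the unconverted IDs matters for the negative control: if many random genes are silently dropped, the control set submitted to the next tools is smaller than intended.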
Open a connection to g:Profiler (http://biit.cs.ut.ee/gprofiler/).
Select the organism Saccharomyces cerevisiae, and submit your list of TF target genes.
g:Profiler query form.
Save the result page and figures on your computer.
Do the same with the random gene selection.
In the collective report, fill in the columns related to g:Profiler.
In your personal report, summarize your interpretation of the results in one paragraph.
Open a connection to the KEGG pathway mapper (http://www.genome.jp/kegg/mapper.html).
Test the different tools with the list of targets, and the random gene set. In particular, try the tool Search & Color pathways.
Note: for the organism, use the acronym sce for Saccharomyces cerevisiae.
KEGG Search & Color Pathways: query form.
Here is an example of result for the MET4 targets in SM medium.
KEGG Search & Color Pathways. Example of colored map. Green boxes indicate enzymes for which a gene was identified in the yeast genome. Query genes are highlighted in pink (in this case, MET4 target genes in SM medium from Harbison et al., 2004).
Interpret the results obtained with your gene set: are your target genes grouped into one or a few pathways? How close to each other do they appear on the pathway maps? How does this compare with the random gene selection?
Fill in the KEGG-related columns in the collective report.
In your personal report, summarize your results of the KEGG pathway coloring in a few lines.
Open a connection to BioCyc (http://biocyc.org/).
In the top right corner of the page, click Change organism database (the default is the bacterium Escherichia coli).
BioCyc organism database (default is Escherichia coli)
BioCyc organism selection
BioCyc yeast database
We will now highlight the target genes on the metabolic overview map.
Open the Metabolism -> Cellular Overview map (http://yeast.biocyc.org/overviewsWeb/celOv.shtml).
In the top-right corner of the YeastCyc screen, click Show operations and select Highlight -> Highlight Gene(s) -> From file.
In the dialog box, click Choose file, locate the gene ID text file on your computer (in my case the file is named Harbison_2004_MET4_SM_geneIDs.txt), and click Highlight.
Analyse the resulting map.
Run the same analysis with the random gene selection.
Harbison, C, D Gordon, T Lee, N Rinaldi, K Macisaac, T Danford, N Hannett, et al. 2004. “Transcriptional regulatory code of a eukaryotic genome.” Nature 431 (7004): 99–104.
Note on oligomer lengths: restricting the analysis to hexanucleotides is only meant to simplify the discussion at the end of this tutorial; for normal use of the tool we recommend analysing hexa-, hepta- and octanucleotides simultaneously.