# Instructions for AMU students

The practical that follows is written in English, to familiarise you with the language used across international bioinformatics resources.

Reports may nevertheless be written in French.

Each student must submit their results in two ways:

1. Entering selected results in a shared table. This table will be filled in as the practical progresses. Each student will enter their own results on one row of the table, and this collective record will make it possible to evaluate each individual result in the light of the others.

2. Uploading an individual report to the Ametice site. This report will be graded on the correctness of the results, and also on the clarity of the answers.

Note: we strongly recommend filling in both documents as the practical progresses, so that you have all the information needed to answer the questions without having to go back or redo parts of the exercises.

# Introduction

Harbison and co-workers (2004) used the ChIP-on-chip technology to identify the target genes of more than a hundred transcription factors of the budding yeast Saccharomyces cerevisiae.

For some transcription factors, they performed several experiments in different culture media, in order to assess the impact of environmental conditions on the set of regulated genes. Indeed, it is well known that transcription factors are themselves regulated (transcriptionally and post-transcriptionally) depending on the presence of nutrients, stress conditions, phases of the cell cycle, and many other parameters of their environment.

## Goal of this tutorial

In this tutorial, we will start from a cluster of target genes identified by ChIP-on-chip for one particular transcription factor, in cells grown in one particular condition, and use a series of bioinformatics tools to answer two basic questions:

1. What are these genes doing together?

Each of these genes has a particular molecular activity (enzyme, transporter, regulator, …) and is involved in one or several biological processes. Beyond these individual functions, we will investigate whether they are collectively involved in some biological process (e.g. metabolic pathway, response to some stress, …). For this, we will use an approach called enrichment analysis.

2. Can we discover over-represented motifs in their promoters?

In yeast, transcriptional regulation is relatively simple compared to metazoa or plants. Cis-regulatory elements are located in the non-coding regions upstream of the regulated genes, with ~400 bp per upstream region. The ChIP-on-chip technology makes it possible to detect all the regions bound by a given TF in a given condition. When analysing motifs in these promoters, our first expectation is to find the motif specifically bound by the immunoprecipitated transcription factor. In some cases, we may also discover additional motifs, suggesting that the gene cluster is co-regulated by other factors in addition to the immunoprecipitated TF.
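Returning to the first question, enrichment analysis typically relies on a hypergeometric test: given a query of `n` genes drawn from a genome of `N` genes, of which `K` carry a given annotation, how likely is it to observe at least `k` annotated genes in the query by chance? The sketch below illustrates this calculation only. The function name and the numbers in the example are assumptions chosen for illustration, not values from the Harbison dataset, and the tools used in this tutorial may apply additional corrections.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Probability of drawing at least k annotated genes.

    n genes are drawn without replacement from a genome of N genes,
    K of which carry the annotation of interest.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers, for illustration only:
# 12 query genes, 5 of them annotated, 40 annotated genes among 6000.
p_value = hypergeom_pvalue(k=5, n=12, K=40, N=6000)
print(f"p-value = {p_value:.3g}")
```

Because an enrichment tool tests many annotation terms at once, a nominal p-value of this kind has to be corrected for multiple testing before it is interpreted.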

The goals of this tutorial extend far beyond the discovery of some nice web sites that produce fancy figures after a few clicks. We expect students to exercise critical judgement in evaluating the reliability and the relevance of the results, paying particular attention to two complementary questions:

1. What is the statistical significance of the results? Attempt to understand the statistics returned by the different programs and interpret the risks of false positives.

2. What is the biological relevance of the results? This can be done by evaluating the consistency between the results returned by the different tools, and by comparing their results with the known function of the transcription factor.

The reliability of the tools can further be assessed with an empirical test that we will call a negative control, which consists in submitting random sets of genes as queries. Even though each of the query genes has its own function and is involved in some particular biological process, pathway, …, there is no reason why a random collection of them would be functionally related. The software tools should therefore not return any significant result. For negative controls, a null answer (no result) is the correct answer.
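A negative control of this kind can be prepared by drawing gene identifiers at random. The sketch below is one possible way to do it; the function name and the placeholder identifiers are assumptions made for illustration. In practice the universe would be the complete list of yeast gene identifiers, and the random set would have the same size as the real gene set.

```python
import random


def random_gene_set(universe, size, seed=None):
    """Draw `size` distinct gene identifiers at random from `universe`.

    A fixed seed makes the selection reproducible for the report.
    """
    rng = random.Random(seed)
    return rng.sample(sorted(set(universe)), size)


# Placeholder identifiers (hypothetical): replace with all yeast gene IDs.
universe = [f"GENE{i:04d}" for i in range(1, 101)]

control = random_gene_set(universe, size=12, seed=42)
print("\n".join(control))
```

Repeating the draw with several seeds gives an idea of how often a tool reports a "significant" result on random input.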

# Abbreviations

| Abbreviation | Meaning |
|---|---|
| TF | Transcription factor |
| TFBM | Transcription factor binding motif |
| TFBS | Transcription factor binding site |
| GO | Gene Ontology |

# Bioinformatics resources

| Resource | Description | URL |
|---|---|---|
| Gene sets | List of target genes for each factor/condition | Harbison_2004_genesets.tsv |
| UniProt | Database of protein sequence and function | http://uniprot.org |
| SGD | Saccharomyces Genome Database | http://www.yeastgenome.org/ |
| g:Profiler | A web server for functional interpretation of gene lists | http://biit.cs.ut.ee/gprofiler/ |
| YeastCyc | Database of yeast metabolic pathways | http://yeast.biocyc.org/ |
| KEGG Pathway Mapper | A set of tools to highlight a list of query genes in the metabolic pathways of the Kyoto Encyclopedia of Genes and Genomes | http://www.genome.jp/kegg/mapper.html |
| RSAT | Regulatory Sequence Analysis Tools | http://rsat.eu/ |

# Data

The sets of target genes are provided in a text file with tab-separated values.

Harbison_2004_genesets.tsv

This file can be opened with spreadsheet software (LibreOffice Calc, Microsoft Excel, …), or with a simple text editor.

Column contents:

1. Gene identifiers (according to the convention agreed among yeast geneticists).
2. Transcription factor
3. Culture medium (growth condition)
4. Statistical significance of the binding ($$sig = -\log_{10}(\text{E-value})$$)
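The four columns can also be read programmatically, and the significance converted back to an E-value with $$\text{E-value} = 10^{-sig}$$. The sketch below uses three rows copied from the Met4p example of this tutorial. It assumes that the file starts with a header line naming the columns `GENE_ID`, `factor`, `condition` and `significance`; the function names are illustrative.

```python
import csv
import io

# Three rows copied from the Met4p / SM example of this tutorial.
# Assumption: the real file has a header line with these column names.
sample_tsv = (
    "GENE_ID\tfactor\tcondition\tsignificance\n"
    "YAL012W\tMET4\tSM\t3.3900039\n"
    "YDR065W\tMET4\tSM\t0.1927444\n"
    "YGR204W\tMET4\tSM\t6.4628565\n"
)


def sig_to_evalue(sig):
    """Invert sig = -log10(E-value)."""
    return 10 ** (-sig)


def read_gene_sets(handle):
    """Read tab-separated rows and add the E-value to each of them."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["significance"] = float(row["significance"])
        row["e_value"] = sig_to_evalue(row["significance"])
        rows.append(row)
    return rows


rows = read_gene_sets(io.StringIO(sample_tsv))
for row in rows:
    print(row["GENE_ID"], row["factor"], row["condition"], f"{row['e_value']:.2e}")
```

To read the actual file, pass an open file handle instead of the in-memory sample, for example `open("Harbison_2004_genesets.tsv", newline="")`. Note that a significance close to 0 corresponds to an E-value close to 1, i.e. weak evidence of binding.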

# Exercises

## Choice of a gene set

Each student will choose a given ChIP-on-chip result (i.e. a combination of one transcription factor and one culture medium) and select all the target genes.

In order to achieve reasonable statistical power in the subsequent analyses, we recommend choosing a dataset with a sufficient number of target genes (at least 10).

As an illustration, the teacher will use the target genes of the Met4p transcription factor, encoded by the MET4 gene, in SM culture medium. The corresponding rows are displayed below.

Example of gene set: genes whose promoter is bound by the transcription factor Met4p in SM culture medium.

| GENE_ID | factor | condition | significance |
|---|---|---|---|
| YAL012W | MET4 | SM | 3.3900039 |
| YDR065W | MET4 | SM | 0.1927444 |
| YER092W | MET4 | SM | 4.2306097 |
| YFR030W | MET4 | SM | 2.4034924 |
| YGL184C | MET4 | SM | 5.4181212 |
| YGR055W | MET4 | SM | 0.1761979 |
| YGR204W | MET4 | SM | 6.4628565 |
| YLR092W | MET4 | SM | 0.4660093 |
| YLR179C | MET4 | SM | 1.1264004 |
| YLR180W | MET4 | SM | 1.1264004 |
| YLR301W | MET4 | SM | 0.4997180 |
| YNL277W | MET4 | SM | 1.1843924 |

At the beginning of the practical, students will fill in a table with their name, the chosen transcription factor and the culture medium. Additional details will be added progressively.

1. Download the table of transcription factor target genes, Harbison_2004_genesets.tsv, and open it with a spreadsheet program (LibreOffice Calc, Excel, …).

2. Choose a transcription factor + culture medium of interest (at least 10 target genes), and copy-paste the corresponding lines into a separate spreadsheet, which you will save on your computer.

3. Investigate the function of your transcription factor of interest, by gathering information from SGD and UniProt.

In your report, write 2-3 sentences to summarize the function of your factor.

4. In the spreadsheet containing your selected ChIP-on-chip result, select the first column (which contains the target gene identifiers), and copy-paste it into a text editor document, which you will save as a plain text file. For the sake of convenience I will name this file Harbison_2004_MET4_SM_geneIDs.txt. Of course, you should adapt the name to the transcription factor and culture medium of your choice.

5. Open the collective result table (https://goo.gl/G9pcSq).

6. Start filling in a row with your information: name, factor, culture medium, number of genes, and description of the transcription factor.
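The selection steps above can also be scripted instead of being done by hand in a spreadsheet. The sketch below counts the target genes per factor/medium combination, lists the combinations that reach the recommended minimum of 10 genes, and extracts the gene identifiers of one combination. It works on the Met4p / SM rows shown in this tutorial; the function names are illustrative assumptions.

```python
from collections import Counter

# (gene ID, factor, condition) for the Met4p / SM example of this tutorial.
rows = [(gene, "MET4", "SM") for gene in [
    "YAL012W", "YDR065W", "YER092W", "YFR030W", "YGL184C", "YGR055W",
    "YGR204W", "YLR092W", "YLR179C", "YLR180W", "YLR301W", "YNL277W",
]]


def count_targets(rows):
    """Number of target genes for each (factor, condition) pair."""
    return Counter((factor, condition) for _gene, factor, condition in rows)


def eligible_datasets(rows, min_genes=10):
    """(factor, condition) pairs with at least `min_genes` target genes."""
    return sorted(ds for ds, n in count_targets(rows).items() if n >= min_genes)


def gene_ids(rows, factor, condition):
    """Gene identifiers of one (factor, condition) dataset, in file order."""
    return [g for g, f, c in rows if f == factor and c == condition]


print(eligible_datasets(rows))
ids = gene_ids(rows, "MET4", "SM")
id_text = "\n".join(ids) + "\n"  # one identifier per line
print(id_text, end="")
```

The string `id_text` holds one identifier per line and can be saved as a plain text file such as Harbison_2004_MET4_SM_geneIDs.txt, for example with `pathlib.Path(...).write_text(id_text)`.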

## Motif analysis

### Motif discovery in the promoters of a gene set

The first question we will address is whether we can discover over-represented motifs in the promoters of our gene set of interest. In principle, we expect to find motifs corresponding to the transcription factor used in the ChIP-on-chip experiment.
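To give an intuition of what "over-represented" means, the sketch below performs only the first step of a word-based motif discovery: counting the occurrences of each hexanucleotide in a set of sequences. The sequences are made up for illustration and are not yeast promoters. Assessing over-representation additionally requires comparing these counts with the frequencies expected from a background model, which this sketch does not do.

```python
from collections import Counter


def count_kmers(sequences, k=6):
    """Count every word of length k (single strand) in the sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip words containing ambiguous nucleotides such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


# Made-up sequences (hypothetical), for illustration only.
toy_sequences = ["TTCACGTGAT", "GGCACGTGCC", "ATATATATAT"]

counts = count_kmers(toy_sequences, k=6)
for kmer, n in counts.most_common(3):
    print(kmer, n)
```

In this toy example the most frequent word comes from a low-complexity AT repeat, which shows why raw counts alone are not informative and why a comparison with background expectations is needed.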

1. Open a connection to RSAT Fungi (http://fungi.rsat.eu/).