ESIL :: 1st year :: Module "Bioinformatics" :: academic year 2011/2012 :: Raphaël Bourgeas & Jacques van Helden

Session 1: Retrieving information from biological databases



Prerequisites

This exercise assumes that you have already read the following chapters.



Introduction

In the theoretical lecture, we saw that color vision relies on specialized proteins called opsins, which are expressed in the cone cells of our retina.

We also saw that most mammals have dichromatic vision, which relies on only two types of opsins:

Trichromatic vision appeared in "old-world" primates after a duplication leading to three opsin-coding genes:

The goal of this exercise is to compare the opsin protein sequences of human and some other mammals, in order to understand the evolutionary relationships between those proteins.



Exercise: Genome browsers

Context

We will use three genome browsers to identify the genomic location of the opsin-coding genes in various organisms, and try to understand the relationship between the taxonomic position of an organism, its number of opsin-coding genes, and its di-, tri- or tetrachromatic vision.

Goals of this exercise

Tips

  1. The + button on the right side of the ECR genome browser allows you to visualize additional organisms.
  2. In the query box of the genome browsers used here, you can either type the coordinates of a chromosomal region (e.g. chr10:88414314-88426605) or simply a gene name (e.g. "opsin"). In the latter case, the browser presents a list of matching genes, from which you can pick the one you are interested in.
  3. In the UCSC genome browser, you can select tracks of annotation to be displayed with various levels of detail. Make sure to adapt the visualization mode to your purposes, in order to display the relevant information without diluting it in too many details.
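The coordinate notation mentioned in tip 2 can be handled programmatically. The sketch below is only an illustration: it assumes the usual browser convention of 1-based, inclusive coordinates, and the names Region, parse_region, length and zoom_out are hypothetical helpers, not part of any browser API.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string; thousands separators are tolerated."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    chrom, start, end = match.groups()
    region = Region(chrom, int(start.replace(",", "")), int(end.replace(",", "")))
    if region.start > region.end:
        raise ValueError("start must not exceed end")
    return region


def length(region: Region) -> int:
    """Number of base pairs covered by an inclusive region."""
    return region.end - region.start + 1


def zoom_out(region: Region, factor: float) -> Region:
    """Widen the region around its centre by `factor`, never going below 1."""
    extra = int(length(region) * (factor - 1) / 2)
    return Region(region.chrom, max(1, region.start - extra), region.end + extra)


region = parse_region("chr10:88414314-88426605")
print(region, length(region))
print(zoom_out(region, 3))
```

Zooming out by a factor of 3 mimics what you do by hand in the browser when you want to see the neighbouring genes of a locus.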

Tutorial

  1. In the query box of the ECR genome browser, select Human for the option "base genome", enter opsin and click Submit. In the list of RefSeq genes, select the medium-wave-sensitive (MWS) opsin.
  2. Click on the link Instructions to get the meaning of the color code (blue, green, red, ...) of the profiles.
  3. Come back to the map of the opsin gene, and try to interpret the profiles.
  4. Click on the + icon on the right of the window, above the images representing the animal species. Make sure that the following genomes are displayed: chimpanzee, mouse, dog, cow, frog, rhesus macaque, tetraodon. Inspect the conservation level of the MWS-opsin gene in these species.
  5. In a separate tab, display the same region with the UCSC genome browser.
  6. Identify the 2 closest neighbours on each side of the SW-opsin coding gene. Is their function related to color vision?
  7. Visualize the region surrounding the gene coding for the medium-wave-sensitive opsin. Identify its 2 neighbours on each side. Are they related to vision?
  8. Analyze the conservation tracks of the ECR and UCSC genome browsers (for the UCSC genome browser, use the "full" display for "Conservation" under "Comparative Genomics"). Do you see conservation of the LW and MW opsin genes in the genomes of other mammals (mouse, dog, elephant)?
  9. Come back to the home page of the UCSC genome browser. Select the mouse genome and identify the MW-sensitive opsin gene. Zoom out to inspect the two neighbour genes on each side of Opn1mw. How does this region compare to the homologous region in human?
  10. Do the same exercise with the following organisms: Chimp, Dog, Chicken, Zebrafish (note: some genomes are incompletely sequenced or poorly annotated; try to make the best of the gene annotations available for each of them).

Questions

  1. On the left side of the human red opsin gene (OPN1LW) there are two genes with exactly the same name: OPN1MW. Find the reason for this in the gene annotations.
  2. Summarize your observations in a table showing which opsins were found in which organism.
  3. Interpret the results of this table: which organisms are likely to have di-, tri- or tetrachromatic vision?
  4. (Advanced, to be discussed during the practical): when the human genome is taken as reference, the conservation tracks of ECR and UCSC show an apparent conservation of both LW- and MW-sensitive opsins in mouse and dog. Is this a good indication to support the hypothesis that these mammals have trichromatic color vision?
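For question 2, a presence/absence table can be prepared before you start collecting observations. The sketch below only builds an empty template: the organism list comes from the tutorial steps, the three opsin columns (SW, MW, LW) are an assumption that you may need to extend, and every cell is deliberately left as "?" for you to fill in. The function names are hypothetical.

```python
ORGANISMS = ["Human", "Chimp", "Mouse", "Dog", "Chicken", "Zebrafish"]
OPSINS = ["SW", "MW", "LW"]  # assumed columns; add more opsin classes if needed


def empty_table(organisms, opsins):
    """Return {organism: {opsin: '?'}} with every cell still to be filled."""
    return {org: {ops: "?" for ops in opsins} for org in organisms}


def render(table, opsins):
    """Format the table as a list of aligned plain-text lines."""
    width = max(len(org) for org in table)
    lines = [" " * width + "".join(f"{ops:>5}" for ops in opsins)]
    for org, row in table.items():
        lines.append(org.ljust(width) + "".join(f"{row[ops]:>5}" for ops in opsins))
    return lines


table = empty_table(ORGANISMS, OPSINS)
# Record each observation as you go, e.g. table[organism][opsin] = "yes" or "no".
print("\n".join(render(table, OPSINS)))
```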


Exercise: NCBI nucleotide DB

Context

The National Center for Biotechnology Information (NCBI) provides access to biomedical and genomic information. You can query its databases via the Entrez portal to get information about opsins, and more specifically about the red opsin.

Goals of this exercise

Tips

  1. Red is a long-wavelength color.
  2. The logical operators "AND" and "OR" are case-sensitive: you have to write them in upper case, otherwise they will not be interpreted as logical operators.
  3. Do not confuse homology with similarity.
  4. If you have difficulties with the Entrez query interface, you can follow the chapter Retrieving information from the NCBI with Entrez of the tutorial on databases, and then come back to this exercise.
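The tip about upper-case operators can be illustrated with a small script that assembles an Entrez query string. This is a sketch under stated assumptions: it targets the ESearch endpoint of the NCBI E-utilities (a programmatic counterpart of the Web form, not used in this practical), it uses the [Organism] field tag, and the helpers combine and esearch_url are hypothetical names. It only builds the request URL and does not contact the server.

```python
from urllib.parse import urlencode

# Public ESearch endpoint of the NCBI E-utilities.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def combine(operator: str, *terms: str) -> str:
    """Join terms with an upper-case boolean operator, wrapped in parentheses."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator!r}")
    return "(" + f" {op} ".join(terms) + ")"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


colour = combine("or", "red", '"long wave"')
term = combine("and", "opsin", '"Homo sapiens"[Organism]', colour)
print(term)
print(esearch_url("gene", term))
```

The same query string can be pasted into the Entrez Web form; writing the operators in lower case there is a quick way to check the effect described in tip 2.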

Questions

  1. Search for the human red-opsin gene in the Entrez query form. Note that this tool applies the search to a wide variety of databases. Test various ways to formulate your query and compare the number of entries returned. How many results do you obtain? Which database is useful for retrieving genes?
  2. Refine your query using logical operators and more specific keywords. The more keywords you combine with AND, the fewer results you will get. Which query gave you fewer than ten genes? What is the gene ID of the human red opsin?
  3. What information does the gene record provide about the red opsin?
  4. From the Entrez query window, find the category that allows you to find genes of the same family. Which opsin did you find? Why don't you find the other one?


Exercise: Uniprot

Context

Goals of this exercise

Tips

  1. Use the Advanced search to formulate structured queries, or to further filter the results of a previous query.
  2. For some fields, the Uniprot query box supports automatic completion: when you type the beginning of a word, all possible completions are displayed. This is particularly useful for fields with restricted content (e.g. organism name, taxonomy) or with a controlled vocabulary (e.g. Gene Ontology term).
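A structured query of the kind built with the Advanced search can also be written as field:value pairs. The sketch below is an illustration only: it assumes the query syntax and field names of the current UniProt REST service (protein_name, organism_id, reviewed), which may differ from the interface available when this practical was written, and structured_query and search_url are hypothetical helpers. The taxonomy identifier 9606 designates Homo sapiens. No request is sent.

```python
from urllib.parse import urlencode

# Search endpoint of the UniProt REST service (assumed, see lead-in).
SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def structured_query(**fields) -> str:
    """Turn keyword arguments into 'field:value' pairs joined by AND."""
    parts = []
    for name, value in fields.items():
        # Booleans are written in lower case; other values are used as-is.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        if " " in text:
            text = f'"{text}"'  # quote multi-word values
        parts.append(f"{name}:{text}")
    return " AND ".join(parts)


def search_url(query: str, fmt: str = "tsv") -> str:
    """Build (but do not send) a search request URL."""
    return SEARCH + "?" + urlencode({"query": query, "format": fmt})


query = structured_query(protein_name="opsin", organism_id=9606, reviewed=True)
print(query)
print(search_url(query))
```

Restricting each keyword to a field is what makes the query precise: a free-text search for "opsin" would also match entries that merely mention opsins in their annotations.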

Questions

  1. How many proteins are there in the protein knowledge base (UniprotKB)? What proportion of these entries has been reviewed by a human annotator?
  2. How many human sequences can be found in the database? How many proteins match the name "opsin"? How many human opsins can you find?
      Rather than typing all keywords in the query box, try to formulate the query in a smart and precise way. In the report, indicate how you performed the query.
  3. Read the annotations to get the function of each human opsin protein, and the name of the corresponding gene. Which proteins are involved in color vision? What is the function of the other opsin proteins?


Exercise: PDB

Context

Characterising the three-dimensional structure of proteins tells us a lot about them: their functions, their localisation, and their interactions with other proteins or with other macro- and micromolecules. You will now visualize a protein using Jmol, a Java applet integrated in the Protein Data Bank (PDB). The PDB is a database of protein structures, containing information about both the proteins and the way their structures were determined.

Thioredoxins are well known as regulators of the redox potential within cells, through their reducing activity. Here, we will have a look at the thioredoxin 2 of Escherichia coli (PDB ID: 1THX).

Goals of this exercise

Tips

  1. You will find important information in the 1THX page, notably about the characterization of the structure.
  2. To visualize the structure, use the "View in Jmol" button under the picture of the protein structure.
  3. Proteins are already colored according to their secondary structure.
  4. You can (and have to!) use the Jmol menu, under the visualization window.
  5. If you right-click on the Jmol applet, you get additional functionality; among other things, the header of the file gives information about the features of the protein. You can display it from the menu "Afficher / Entête du fichier" (i.e. "Show / File header").

Questions

  1. Information about the protein: how many amino acids does it contain? Which method was used to obtain the structure, and what is the resolution?
  2. Information about the structure: which structural features are present in this protein? What is its oligomerization state?
  3. Information about the activity: the electron transfer is governed by a disulfide bond. Display it and find the amino acids involved in it.
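In the PDB file format, disulfide bonds are declared in SSBOND records of the file header, with the two cysteines stored in fixed columns. The sketch below shows how such records could be read once a file has been downloaded. It is an illustration only: the record in the demo is built with placeholder residue numbers that are not taken from 1THX, and format_ssbond, parse_ssbonds and Residue are hypothetical names. Whether a given entry contains SSBOND records has to be checked in its header.

```python
from typing import List, NamedTuple, Tuple


class Residue(NamedTuple):
    chain: str
    name: str
    number: int


def format_ssbond(serial: int, chain1: str, num1: int, chain2: str, num2: int) -> str:
    """Build a fixed-width SSBOND line (demo helper, placeholder data only)."""
    return ("SSBOND " + f"{serial:>3}" + " CYS " + chain1 + " " + f"{num1:>4}"
            + " " * 4 + "CYS " + chain2 + " " + f"{num2:>4}")


def parse_ssbonds(pdb_text: str) -> List[Tuple[Residue, Residue]]:
    """Extract the residue pairs declared in SSBOND records."""
    bonds = []
    for line in pdb_text.splitlines():
        if line.startswith("SSBOND"):
            # Columns (1-based): 12-14 name, 16 chain, 18-21 number,
            # then 26-28 name, 30 chain, 32-35 number for the partner.
            first = Residue(line[15], line[11:14], int(line[17:21]))
            second = Residue(line[29], line[25:28], int(line[31:35]))
            bonds.append((first, second))
    return bonds


# Placeholder header: residue numbers 10 and 20 are made up for the demo.
demo = "REMARK   demo file, not a real entry\n" + format_ssbond(1, "A", 10, "A", 20)
for first, second in parse_ssbonds(demo):
    print(f"{first.name}{first.number} ({first.chain}) - "
          f"{second.name}{second.number} ({second.chain})")
```

Comparing the residues listed in the header with those you highlight in Jmol is a simple way to cross-check your answer to question 3.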


Resources

This tutorial is based on the following Web resources.

Acronym | Type | Description | URL
EMBL | Nucleic sequences | The EMBL Nucleic Sequence Database (EBI - UK) | http://www.ebi.ac.uk/embl/
Genbank | Nucleic sequences | Genbank (NCBI - USA) | http://www.ncbi.nlm.nih.gov/Genbank/
DDBJ | Nucleic sequences | DDBJ - DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp/
UniProt | Protein sequences | UniProt - the Universal Protein Resource | http://www.uniprot.org/
PDB | 3D structure of macromolecules | PDB - The Protein Data Bank | http://www.rcsb.org/pdb/
EnsEMBL | Genome browser | EnsEMBL Genome Browser (Sanger Institute + EBI) | http://www.ensembl.org/
UCSC | Genome browser | UCSC Genome Browser (University of California, Santa Cruz - USA) | http://genome.ucsc.edu/
ECR | Genome browser | ECR Browser | http://ecrbrowser.dcode.org/
Integr8 | Comparative genomics | Integr8 - access to complete genomes and proteomes | http://www.ebi.ac.uk/integr8/
Prosite | Protein domains | Prosite - protein domains, families and functional sites | http://www.expasy.ch/prosite/
Pfam | Protein domains | PFAM - Protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Sanger Institute - UK) | http://pfam.sanger.ac.uk/
CATH | Protein domains | CATH - Protein Structure Classification | http://www.cathdb.info/
InterPro | Protein domains | InterPro (EBI - UK) | http://www.ebi.ac.uk/interpro/
GO | Gene ontology | Gene Ontology Database | http://www.geneontology.org/
Entrez | Multi-database | A collection of biomolecular databases maintained at the NCBI (USA), accessible via an interface called Entrez | http://www.ncbi.nlm.nih.gov/Entrez/
SRS | Data warehouse | A collection of biomolecular databases maintained at the European Bioinformatics Institute (EBI, UK), accessible via an interface called SRS | http://srs.ebi.ac.uk/

Raphaël Bourgeas (IMR, Université de Provence) & Jacques van Helden (TAGC, Université de la Méditerranée).