ESIL :: 1st year :: Module "Bioinformatics" :: academic year 2012/2013 :: Jacques van Helden

Searching proteins in databases, by keywords (annotations) and by sequence similarity



Olfactory receptors

Olfactory receptors constitute the largest gene family in mammalian genomes. The number of receptors varies considerably between species (see Figure 1 from Keller et al., 2012). Two recent studies assessed the inter-individual variability of olfactory receptor (OR) and vomeronasal (V1R) genes in mouse and human, respectively. The goal of this exercise is to collect information about the number of olfactory receptors in different species. For this, we will use different databases, first formulating simple queries, then refining them in order to obtain suitable results.

OR proteins annotated in databases

  1. Formulate a simple (naive) query by typing a few keywords in the query box, in order to select human olfactory receptors. How many results do you obtain? How many of them have been reviewed?
  2. Clear the query, and use the Advanced search interface to address the same query in a structured way: select all entries belonging to the organism "Homo sapiens" whose protein name matches "olfactory receptor". How many entries do you obtain? How do you explain the difference?
  3. We will now compare the number of olfactory receptors annotated in diverse genomes. Each student should
  4. After all students have obtained the results, we will compare the strategies used to formulate the queries (technical validity), and analyze the number of receptors found in the respective species (biological interpretation).


  1. The scientific name of an organism can be obtained from the Taxonomy database at NCBI.
  2. In UniProt, first try a naive query by entering a few keywords in the query box. After this, use the Advanced search to formulate a structured query.
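
The difference between a naive keyword query and a structured query can be sketched in a few lines of Python. This is only an illustration, under the assumption that the current UniProt REST interface (`https://rest.uniprot.org/uniprotkb/search`) and its field names `organism_id`, `protein_name` and `reviewed` are available; the helper functions `build_uniprot_query` and `build_search_url` are hypothetical names introduced here, and the code only builds the URL without contacting the server.

```python
from urllib.parse import urlencode


def build_uniprot_query(taxon_id, protein_name, reviewed=None):
    """Combine field-restricted clauses into one structured query string.

    Field names (organism_id, protein_name, reviewed) are assumed to match
    the current UniProt query syntax.
    """
    clauses = [f"organism_id:{taxon_id}", f'protein_name:"{protein_name}"']
    if reviewed is not None:
        clauses.append(f"reviewed:{'true' if reviewed else 'false'}")
    return " AND ".join(f"({clause})" for clause in clauses)


def build_search_url(query, fmt="tsv", size=25):
    """Return a search URL for the query (assumed endpoint, not fetched here)."""
    base = "https://rest.uniprot.org/uniprotkb/search"
    return base + "?" + urlencode({"query": query, "format": fmt, "size": size})


# 9606 is the NCBI taxonomy identifier of Homo sapiens.
naive_query = "human olfactory receptor"
structured_query = build_uniprot_query(9606, "olfactory receptor", reviewed=True)

print(naive_query)
print(structured_query)
print(build_search_url(structured_query))
```

The naive query matches the keywords anywhere in an entry, whereas each clause of the structured query is restricted to one annotation field, which is one reason why the two result counts can differ.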

Getting OR proteins by similarity searches

We will now adopt a radically different approach: instead of relying on gene and protein annotations, we will use the sequence of a specific olfactory receptor (for example Human Olfactory receptor 1F1) as "seed" for a sequence similarity search. The principle is to compare the query sequence with all the sequences of a reference database; in our case, the peptidic sequences obtained by translating all the genes identified in the genome of an organism of interest (for example the cat, Felis catus).

  1. Get the peptidic sequence of the Human Olfactory receptor 1F1 (accession number O43749 in UniProt) in FASTA format.
  2. Open a connection to NCBI BLAST, and use the tool protein blast to collect all proteins similar to this sequence in the non-redundant protein sequences database.
  3. In a separate window, run the same query but restrict the search to your organism of interest.
  4. We will now try another approach, which consists in comparing the sequence of our protein (OR1F1) with the genomic DNA sequences of the organism of interest. Identify the appropriate tool among those proposed at NCBI BLAST, and run it. Do you obtain the same number of results? Discuss the differences (if any).
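
As a reminder of what a FASTA-formatted sequence looks like (step 1 above), here is a minimal Python sketch that parses FASTA text into a dictionary. The record used below is a toy example, not the real OR1F1 sequence. The function `uniprot_fasta_url` is a hypothetical helper that assumes the current UniProt REST layout (`https://rest.uniprot.org/uniprotkb/<accession>.fasta`); it only builds the address and does not download anything.

```python
def uniprot_fasta_url(accession):
    """Build the address of a UniProt entry in FASTA format (assumed URL layout)."""
    return f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text):
    """Parse FASTA text into a dict mapping each header line to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line starts a new record; drop the leading '>'.
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Toy record (hypothetical, NOT the real OR1F1 sequence).
toy_fasta = """>toy_protein hypothetical example, not a real sequence
MKTLLVAG
AWQPR
"""

print(uniprot_fasta_url("O43749"))
for header, sequence in parse_fasta(toy_fasta).items():
    print(header.split()[0], len(sequence))
```

The header line (starting with `>`) carries the identifier and description; everything until the next header is the sequence, which may be wrapped over several lines.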


  1. Read the record of the protein coded by the gene ARO1 of the budding yeast Saccharomyces cerevisiae, and try to understand the relationship between the organization and the function of this protein.
  2. In the MetaCyc database, analyze the metabolic steps in which this protein is involved.
  3. Use the NCBI BLAST tool to search for similar sequences in the subset of the non-redundant protein sequences database belonging to the bacterium Escherichia coli K12 (TAXID 83333).
  4. Interpret the resulting matches, in terms of coverage, identities, similarities, and "expect" statistics.
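
To make the quantities of the last step more concrete, the following Python sketch computes the percentage of identity and the query coverage from a pair of aligned strings, and converts a bit score into an expected number of chance hits with the Karlin-Altschul relation E = m · n · 2^(-S'), where m is the query length, n the database length and S' the bit score. This is a simplified illustration: BLAST itself uses corrected "effective" lengths, and "similarities" (positives) would additionally require a substitution matrix, which is left out here. The alignment and all numbers below are toy values.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return percent identity and query coverage for one gapped alignment.

    Both strings must have the same length; '-' denotes a gap.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    columns = len(aligned_query)
    identical = sum(
        1 for q, s in zip(aligned_query, aligned_subject) if q == s and q != "-"
    )
    query_residues = sum(1 for q in aligned_query if q != "-")
    return {
        "identity_pct": 100.0 * identical / columns,
        "query_coverage_pct": 100.0 * query_residues / query_length,
    }


def expect_from_bit_score(bit_score, query_length, database_length):
    """Expected number of chance hits: E = m * n * 2^(-S').

    Simplification: raw lengths are used instead of BLAST's effective lengths.
    """
    return query_length * database_length * 2.0 ** (-bit_score)


# Toy alignment (hypothetical): 6 columns, query of total length 10.
stats = alignment_stats("MKT-LV", "MRTALV", query_length=10)
print(stats)

# Toy numbers: a higher bit score or a smaller database gives a lower E-value.
print(expect_from_bit_score(bit_score=10, query_length=100, database_length=1024))
```

Note that the expect value depends on the size of the searched database: the same alignment is less surprising in a large database than in a single proteome.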

Random expectation

  1. With the software suite Regulatory Sequence Analysis Tools (RSAT), generate a random protein sequence of the same length as the yeast protein Aro1p. Adapt the background model in order to generate a sequence with dipeptide frequencies similar to those of the yeast Saccharomyces cerevisiae.
  2. Use NCBI BLAST to search for similar proteins
    • in the proteome of Escherichia coli, as above
    • in UniProtKB/Swiss-Prot
    • in the non-redundant protein sequence (nr) database

How many hits do you obtain? Compare the numbers obtained by the different students. Did you a priori expect to find any hit at all? How informative are the different indications returned by BLAST (alignment length, similarity, identity, expect)?
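
The idea of a background model based on dipeptide frequencies (step 1 above) can be sketched in Python as a first-order Markov chain: each residue is drawn with a probability that depends on the preceding residue. This is not the RSAT implementation, only a minimal illustration; the function names, the pseudo-count and the training sequences are hypothetical, and in a real case the transition counts would be estimated from the whole yeast proteome.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def estimate_transitions(training_sequences, pseudo_count=1.0):
    """Count dipeptides (residue -> next residue) in the training sequences.

    A pseudo-count avoids zero weights for dipeptides never observed.
    """
    counts = {a: {b: pseudo_count for b in AMINO_ACIDS} for a in AMINO_ACIDS}
    for sequence in training_sequences:
        for prev, nxt in zip(sequence, sequence[1:]):
            if prev in counts and nxt in counts[prev]:
                counts[prev][nxt] += 1
    return counts


def random_protein(length, transitions, rng):
    """Generate a sequence where each residue depends on the previous one."""
    if length <= 0:
        return ""
    # Simplification: the first residue is drawn uniformly.
    sequence = [rng.choice(AMINO_ACIDS)]
    while len(sequence) < length:
        weights = [transitions[sequence[-1]][b] for b in AMINO_ACIDS]
        sequence.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(sequence)


# Toy training set (hypothetical peptides, not real yeast proteins).
training = ["MKTLLVAGAWQPR", "MSSLLKKTAGVW"]
transitions = estimate_transitions(training)

rng = random.Random(2013)  # fixed seed so that the run is reproducible
print(random_protein(50, transitions, rng))
```

Since such a sequence has no evolutionary history, any hit it returns in a database search arises by chance, which is what the expect value is meant to quantify.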


Urate oxidase

  • Search the non-redundant database for similarities with the mouse urate oxidase (UniProt accession P25688).
  • On the NCBI BLAST server, use protein blast to search for similar sequences in the "Reference proteins (Refseq)" of Homo sapiens. How many hits do you obtain? How significant are their scores?
  • Use the same peptidic sequence (P25688) as query to scan the six-frame translation of the reference genome of Homo sapiens, with the tool tblastn.
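
The notion of six-frame translation used in the last item can be illustrated with a short Python sketch: a DNA sequence is translated in three reading frames on the direct strand and three on the reverse complementary strand, using the standard genetic code. This is only a didactic illustration of the principle, not a description of how tblastn is implemented; the function names and the DNA fragment are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def reverse_complement(dna):
    """Return the reverse complementary strand of an upper-case DNA sequence."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons (e.g. with N) give 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the translations of the 3 direct and 3 reverse frames."""
    dna = dna.upper()
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


# Toy DNA fragment (hypothetical); '*' denotes a stop codon.
for frame, peptide in six_frame_translation("ATGGCCTAA").items():
    print(frame, peptide)
```

Comparing a protein query with all six translations makes it possible to detect coding regions, or remnants of them, even where no gene has been annotated.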
    Jacques van Helden (TAGC, Aix-Marseille Université).