Practicals - Biomolecular databases


[back to contents]


This is the practical session for the chapter



This tutorial will be based on the following Web resources.

Acronym   Type                            Description
EMBL      Nucleic sequences               The EMBL Nucleic Sequence Database (EBI - UK)
GenBank   Nucleic sequences               GenBank (NCBI - USA)
DDBJ      Nucleic sequences               DDBJ - DNA Data Bank of Japan
UniProt   Protein sequences               UniProt - the Universal Protein Resource
PDB       3D structure of macromolecules  PDB - The Protein Data Bank
EnsEMBL   Genome browser                  EnsEMBL Genome Browser (Sanger Institute + EBI)
UCSC      Genome browser                  UCSC Genome Browser (University of California, Santa Cruz - USA)
ECR       Genome browser                  ECR Browser
Integr8   Comparative genomics            Integr8 - access to complete genomes and proteomes
Prosite   Protein domains                 Prosite - protein domains, families and functional sites
Pfam      Protein domains                 Pfam - Protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Sanger Institute - UK)
CATH      Protein domains                 CATH - Protein Structure Classification
InterPro  Protein domains                 InterPro (EBI - UK)
GO        Gene ontology                   Gene Ontology Database
Entrez    Multi-database                  A collection of biomolecular databases maintained at the NCBI (USA), accessible via an interface called Entrez
SRS       Data warehouse                  A collection of biomolecular databases maintained at the European Bioinformatics Institute (EBI, UK), accessible via an interface called SRS


A quick tour of selected databases

The number of biomolecular databases is growing so fast that it is impossible to give a balanced survey of all existing resources. We have selected a few databases on the basis of various criteria (popularity, ease of access, ...) to illustrate the type of information that can be retrieved from them.

As an exercise, we propose to browse some databases in order to gather information about one particular protein. Each student can perform the same analysis with a protein of personal interest. If you lack inspiration, you can for example run the exercise with the Drosophila protein Ubx.


Choose a protein for which you have some prior knowledge (e.g. the protein Ubx from Drosophila melanogaster), and try to extract all the information relevant to this protein from the databases listed in the table of biomolecular databases above.

Next steps

In the exercise above, we saw that each database can provide us with a piece of information about some aspects of our protein of interest.

Note that this is just a very small sample of the information that can be obtained via the hundreds of biomolecular databases distributed around the world.

We will now consult two Web servers (NCBI Entrez and EBI SRS) that provide integrated access to multiple databases, making it easier to examine several aspects of a protein of interest.


Retrieving information from the NCBI with Entrez

Entrez is a retrieval system for searching several linked databases stored at the NCBI (National Center for Biotechnology Information, USA).


During this tutorial, we will learn to use the interface of NCBI Entrez to retrieve a protein of interest. As will be seen, a simple formulation of the query generally returns too many hits, and the desired answer may be lost in hundreds or thousands of other records. We will see how to use advanced search options in order to refine the query.
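The difference between a naive query and a refined one can be sketched in a few lines of Python. This is only an illustration, not part of the original practical: the helper names (field_term, combine, esearch_url) are hypothetical, the field tags [Gene Name] and [Organism] are assumed to be available for the protein database, and the URL points to the NCBI E-utilities ESearch service, which is assumed here as a programmatic counterpart of the Web interface. The script only builds the query string and the URL; it does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (ESearch).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field_term(value, field=None):
    """Return one query term, optionally restricted to an Entrez field."""
    # Quote multi-word values so that they are searched as a phrase.
    text = f'"{value}"' if " " in value else value
    return f"{text}[{field}]" if field else text


def combine(terms, operator="AND"):
    """Join query terms with a logical operator (AND, OR or NOT)."""
    return f" {operator} ".join(terms)


def esearch_url(query, db="protein", retmax=20):
    """Build (but do not send) an ESearch request URL for the query."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Naive query: the word is searched in all fields, which tends to
# return many unrelated records.
naive = field_term("Ubx")

# Refined query: each constraint is attached to a specific field
# (field names are assumptions, see the text above).
refined = combine([
    field_term("Ubx", "Gene Name"),
    field_term("Drosophila melanogaster", "Organism"),
])

print(naive)
print(refined)
print(esearch_url(refined))
```

The refined query string can also be pasted directly into the Entrez search box, which is the approach followed in the rest of this tutorial.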

Quick panorama of the databases

A naive query to the protein database

Logical operators

Imposing constraints on a specific field

Specifying constraints on multiple fields

Browsing a protein entry

Saving the protein sequence in FASTA format
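Once a sequence has been saved in FASTA format, the file can be read back with a few lines of code. The sketch below is a minimal illustration, assuming the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the function name parse_fasta is hypothetical, and the example record is made up for the demonstration, it is not a real database entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated into a single string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record, for illustration only (not a real protein sequence).
example = """>example_protein hypothetical record for illustration
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```

To read a file saved from the browser, pass its content to the function, for example parse_fasta(open("my_protein.fasta").read()), where the file name is whatever you chose when saving.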

Getting the query history


Retrieving information from various databases with SRS


Additional exercises


More info


Jacques van Helden (