ESIL :: 1st year :: Module "Bioinformatics" :: academic year 2011/2012 :: Raphaël Bourgeas & Jacques van Helden

Session 1: Similarity search in sequence databases


Introduction

BLAST (Altschul et al. 1990, 1997) is without contest one of the most popular bioinformatics tools worldwide. It relies on an efficient algorithm that scans huge databases to detect all sequences presenting a significant level of similarity with a query sequence.

BLAST is available on many Web sites and comes with user-friendly interfaces. It is nevertheless crucial to understand the parameters of the program in order to interpret the results correctly.

The goal of this practical is to become familiar with the various BLAST programs (blastp, blastn, ...) and to learn how to interpret the results.


Resources

Name | Category | Description + Link
UniProt | Protein sequences | UniProt, the Universal Protein Resource. The UniProt Web site includes a BLAST server. http://www.uniprot.org/
BLAST @ NCBI | Sequence similarity search | http://www.ncbi.nlm.nih.gov/blast/

Searching a protein database with a query protein (blastp)

Context

The program blastp scans a database of peptide sequences with a peptide sequence as query. Comparing proteins with proteins has several advantages.

Goals of this exercise

Tips

  1. The NCBI BLAST server allows you to restrict a search to a taxon of your choice, by entering the taxon name or its ID in the taxonomy database (taxid).
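
The taxon restriction described in this tip can also be expressed when a search is prepared from a script. The sketch below only builds the request and does not send it. It assumes the parameter names of the NCBI BLAST URL API (CMD, PROGRAM, DATABASE, QUERY, EXPECT, ENTREZ_QUERY), an endpoint URL that is not given in this practical, and taxid 9443 for Primates, which should be checked in the NCBI taxonomy database. The query string is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API (not stated in this practical).
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_request(query, taxid, expect=10.0):
    """Return the parameters of a blastp search restricted to one taxon."""
    return {
        "CMD": "Put",            # submit a new search
        "PROGRAM": "blastp",     # protein query against protein database
        "DATABASE": "nr",        # non-redundant protein database
        "QUERY": query,          # accession or FASTA-formatted sequence
        "EXPECT": str(expect),   # E-value threshold
        # Entrez-style restriction to a taxon, identified by its taxid.
        "ENTREZ_QUERY": f"txid{taxid}[ORGN]",
    }


# Placeholder query; taxid 9443 is assumed to designate Primates.
params = build_blastp_request("QUERY_ACCESSION_OR_SEQUENCE", taxid=9443)
print(BLAST_URL + "?" + urlencode(params))
```

Changing only the taxid argument gives the equivalent request for another taxon, which keeps the searches of the questions below comparable.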

Questions

  1. Retrieve the Human short-wave sensitive opsin from UniProt.
  2. In the NCBI BLAST server, retrieve all similar proteins in Primates. Select the first non-human match, and analyze the matches. Discuss the meaning of the matching scores, and show how they relate to the alignment.
  3. Match the same protein against insects. Compare the resulting scores and alignments with those obtained in primates.
  4. Match the same protein against Fungi. Compare the resulting scores and alignments with those obtained in primates.
  5. Match the same protein against all proteins of Homo sapiens. Analyze the function of the most distant matches. Are they related to color vision?

Searching nucleic acid databases with a nucleic acid query (blastn)

Context

The program blastn searches a nucleic acid database with a nucleic acid query. Although the most widely used similarity searches are based on blastp (see the blastp exercise), searching for similarities between nucleic acid sequences can be useful for various purposes.

  1. Search for similarities between non-coding sequences (DNA, RNA).
  2. Analyse silent mutations between closely related coding DNA sequences.
  3. Identify pseudo-genes in a genome.
  4. ... (you can probably find other cases)
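
As an illustration of the second use case, the following sketch counts silent (synonymous) and amino-acid-changing codon differences between two coding sequences. It assumes that the two sequences are already aligned without gaps, start in the same reading frame and use the standard genetic code. The function name and the toy sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_changes(cds_a, cds_b):
    """Count (silent, amino-acid-changing) codon differences.

    Both sequences must be gap-free, in frame and of equal length.
    """
    if len(cds_a) != len(cds_b) or len(cds_a) % 3 != 0:
        raise ValueError("sequences must have equal length, multiple of 3")
    silent = changing = 0
    for i in range(0, len(cds_a), 3):
        codon_a = cds_a[i:i + 3].upper()
        codon_b = cds_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            silent += 1      # different codon, same amino acid
        else:
            changing += 1    # different codon, different amino acid
    return silent, changing


# Toy example: CTG -> CTA is silent (Leu), AAA -> AGA changes Lys to Arg.
print(classify_codon_changes("ATGCTGAAA", "ATGCTAAGA"))
```

A protein-level comparison would only see the second difference, which is why the nucleic acid sequences are needed for this kind of analysis.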

Goals of this exercise

Tips

Questions

  1. Open a connection to the NCBI Gene database, and identify the genomic DNA sequence of the Human short-wavelength sensitive opsin.
  2. With the NCBI BLAST server, search for similar sequences in primate genomes. Compare the results with those obtained with blastp in the previous exercise.

Searching nucleic acid databases with a peptide sequence as query (tblastn)

Do cats see colors?

Goals of this exercise

Tips

Questions

  1. Retrieve the peptide sequence (FASTA format) of the Human blue-sensitive opsin.
  2. Open a connection to the NCBI BLAST server. Using the tool tblastn, search for similarities in all sequences of the cat (Felis catus).
  3. Open a connection to the UCSC Genome Browser, click BLAT, select the Cat genome and submit the Human opsin peptide sequence. How many matches do you find? Analyze the location of the matches and their genomic context.
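
tblastn compares the protein query with the conceptual translations of each nucleotide sequence in its six reading frames (three per strand). The sketch below shows what such a six-frame translation looks like. It assumes the standard genetic code and unambiguous A, C, G, T input; the function names and the toy sequence are hypothetical, and this is not the implementation used by BLAST.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base; an incomplete last codon is ignored
    and codons absent from the table are reported as 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the translations of the 3 forward and 3 reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


# Toy sequence; '*' marks a stop codon.
for frame, peptide in sorted(six_frame_translations("ATGGCCTAA").items()):
    print(frame, peptide)
```

Because the query is compared with all six frames, a match can be found in genomic DNA even where no gene has been annotated.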

Empirical validation of the E-value

Context

Goals of this exercise

Tips

Questions

  1. Using the tool "random sequence" of the RSAT software suite (http://rsat.ulb.ac.be/rsat/), generate a peptide sequence of 350 amino acids with tripeptide frequencies calibrated on Human protein sequences.
  2. With the NCBI BLAST server, search for similar sequences in the non-redundant protein database. How many matches do you obtain? How do you interpret this?
  3. Run the same query with an E-value ("Expect") threshold of 100. How many matches do you obtain? Is this surprising?
  4. Run the same query with an E-value ("Expect") threshold of 0.0001. How many matches do you obtain? Does it correspond to your expectation?
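
To reason about these questions before running the searches, the sketch below generates a random peptide locally and applies the usual relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. Several points are assumptions: residues are drawn independently with uniform frequencies (a simplification, not the Human-calibrated model of RSAT), the database length is a made-up value, and BLAST itself uses corrected (effective) lengths, so the figures are only indicative.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_peptide(length, frequencies, seed=None):
    """Draw residues independently according to the given frequencies."""
    rng = random.Random(seed)
    residues = list(frequencies)
    weights = [frequencies[residue] for residue in residues]
    return "".join(rng.choices(residues, weights=weights, k=length))


def expected_chance_hits(bit_score, query_length, database_length):
    """E-value: number of hits with at least this bit score expected by chance."""
    return query_length * database_length * 2.0 ** (-bit_score)


def bit_score_for_evalue(evalue, query_length, database_length):
    """Bit score at which the expected number of chance hits equals evalue."""
    return math.log2(query_length * database_length / evalue)


# Placeholder frequencies: uniform, NOT calibrated on Human proteins.
uniform = {residue: 1.0 for residue in AMINO_ACIDS}
peptide = random_peptide(350, uniform, seed=1)

DATABASE_LENGTH = 10**9  # hypothetical total number of residues in the database
for threshold in (100, 10, 0.0001):
    score = bit_score_for_evalue(threshold, len(peptide), DATABASE_LENGTH)
    print(f"E <= {threshold}: bit score >= {score:.1f}")
```

For a random query every reported match is a chance match, so the number of matches passing a given threshold should be roughly of the order of that threshold, within the limit on the number of hits that the server displays.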

Bibliographic references

  1. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol 215, 403-10.
  2. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25, 3389-402.

Raphaël Bourgeas (IMR, Université de Provence) & Jacques van Helden (TAGC, Université de la Méditerranée).