Practicals - Sequence similarity search

Contents

  1. Introduction
  2. Resources
  3. Running BLAST from SRS
  4. Running BLAST from Entrez
  5. Exercises

Introduction

We will use BLAST to retrieve sequences on the basis of their similarity to a query sequence.

[back to contents]

Resources

We will run BLAST from two servers:

  1. SRS, at the European Bioinformatics Institute
  2. Entrez, at the National Center for Biotechnology Information (NCBI).
[back to contents]

Running BLAST from SRS

Scanning the Swiss-Prot database with a query protein

As a case study, we will scan all protein sequences of the Swiss-Prot database to search for similarities with the E. coli enzyme Homoserine O-succinyltransferase (gene metA).

  1. Connect to the SRS server at EBI.
  2. In the tab Libraries, select the database UniProtKB/Swiss-Prot, and open the Standard query form.
  3. Select the protein from Escherichia coli (field Organism name) whose Description contains "Homoserine O-succinyltransferase" (see practical on databases).
  4. The result table should display the entry META_ECOLI. Click on this record; this will display the complete Swiss-Prot entry. In the left box (Entry options), click on the button Save, and store the sequence on your hard drive, in FASTA format (we will re-use it later), by executing the following steps.
  5. Return to the page with the META_ECOLI record.
  6. On the left panel (Entry options), there is an option Launch analysis tools, with a pop-up menu. This pop-up menu allows you to send the selected sequence(s) as input for different programs.

  7. Select NCBI BlastP and click Launch.

  8. You can see that the sequence has automatically been pasted in the BlastP form. In the pop-up menu Databases, select UniProtKB/Swiss-Prot.
  9. Increase the Number of hits and alignments to show to 500.

  10. Increase the Number of best hits from a region to keep to 500.

  11. Important: by default, the BLAST interface sets the upper threshold on the E-value to 10. In our opinion, this setting should generally be avoided, because it means that we expect 10 false positives per search, on average. We thus recommend changing the E-value option to a much lower value (e.g. 1.0E-10).
  12. Leave all other options unchanged and run the program (by clicking Launch in the left panel).

  13. The time required for processing the query depends on the server load (usually a few seconds are sufficient). Click on the results link to see the status of your query. Reload this status page periodically, until the query is ready. At that moment, a link will appear, allowing you to view the result.
  14. When the search result is ready, click on it. This will display a table listing all the hits found during the BLAST search. This table only shows the 30 most significant hits. In order to see more results, you can use the Show option in the Display options on the left of the window. For example, you can ask to show 500 results per page. This will not take too much time, because the results are displayed in a condensed form, as a summary table.
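To get a feeling for what the E-value threshold of step 11 implies, the following sketch uses the standard relation between bit score and E-value, E = m * n * 2^(-S'), where m is the query length, n the total database length and S' the bit score. It ignores the length corrections that BLAST applies in practice. The query length (309) and the database size are assumptions chosen for illustration only.

```python
import math


def expected_chance_hits(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least `bit_score`.

    Simplified Karlin-Altschul relation: E = m * n * 2**(-S').
    """
    return query_len * db_len * 2.0 ** (-bit_score)


def bit_score_needed(max_evalue, query_len, db_len):
    """Smallest bit score whose E-value does not exceed `max_evalue`."""
    return math.log2(query_len * db_len / max_evalue)


QUERY_LEN = 309        # assumed query length (amino acids)
DB_LEN = 200_000_000   # hypothetical total number of residues in the database

for threshold in (10, 1e-10):
    score = bit_score_needed(threshold, QUERY_LEN, DB_LEN)
    print(f"E-value <= {threshold:g} requires a bit score of at least {score:.1f}")
```

Under this simplified relation, lowering the threshold from 10 to 1.0E-10 raises the required bit score by log2(1e11), about 36.5 bits, whatever the query and database sizes.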

Interpreting the result

The result table gives you a summary view of the BLAST results. The last columns indicate the statistical parameters of the match: E-value, percentage of identity, match length.
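When the same statistics are available as text rather than as a web page, they can be filtered with a few lines of code. The sketch below assumes the 12-column tab-separated format produced by the NCBI BLAST+ command-line programs (option -outfmt 6), which is not the format of the SRS table. The rows are hypothetical toy values, apart from the identity, length and E-value quoted in this tutorial for META_AERHH.

```python
def parse_tabular_hits(lines):
    """Parse BLAST+ tabular lines (qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore) into dictionaries."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) < 12:
            continue  # not a complete hit line
        hits.append({
            "subject": cols[1],
            "pident": float(cols[2]),
            "length": int(cols[3]),
            "evalue": float(cols[10]),
            "bitscore": float(cols[11]),
        })
    return hits


def significant(hits, max_evalue=1e-10):
    """Keep hits at or below the E-value threshold, most significant first."""
    kept = [h for h in hits if h["evalue"] <= max_evalue]
    return sorted(kept, key=lambda h: h["evalue"])


# Hypothetical toy rows, for illustration only.
toy_rows = [
    "META_ECOLI\tWEAK_HIT\t24.0\t120\t80\t5\t10\t129\t40\t159\t0.2\t35.0",
    "META_ECOLI\tMETA_AERHH\t67.0\t299\t98\t1\t1\t299\t1\t298\t1e-118\t420.0",
    "META_ECOLI\tMETA_ECOLI\t100.0\t309\t0\t0\t1\t309\t1\t309\t0.0\t640.0",
]

for hit in significant(parse_tabular_hits(toy_rows)):
    print(hit["subject"], hit["pident"], hit["length"], hit["evalue"])
```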

The first column (BLASTP) of the summary table gives links to the pairwise alignments of your query against each matching protein. Not surprisingly, the top result is a hit of your query protein against itself (since your query protein is part of the Swiss-Prot database). In order to see a less trivial alignment, go down in the list and select a match with an intermediate level of identity (e.g. ~70% identity). Click on the corresponding link in the left column of the table. Here is an example of one alignment resulting from BLAST.

The first row of the BLAST result gives a short description of the protein from the database that matches our query protein.

The next lines indicate various matching scores.

In the example above, we can consider that the protein META_AERHH from Aeromonas hydrophila presents a highly significant match (E-value=e-118) with the protein META_ECOLI from Escherichia coli K12. The alignment shows 67% identity and 79% similarity ("positives") over a total length of 299.
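The percentage of identity is simply the number of alignment columns with identical residues, divided by the alignment length (gap columns included). A minimal sketch, using a hypothetical toy alignment rather than the real META_ECOLI / META_AERHH alignment:

```python
def alignment_stats(aligned_query, aligned_subject):
    """Count identities and gap columns in a pairwise alignment.

    Both strings must have the same length; gaps are written as hyphens.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    columns = list(zip(aligned_query, aligned_subject))
    identities = sum(1 for a, b in columns if a == b and a != "-")
    gaps = sum(1 for a, b in columns if a == "-" or b == "-")
    length = len(columns)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
    }


# Hypothetical toy alignment, for illustration only.
stats = alignment_stats("MKV-LA", "MRVQLA")
print(stats)
```

Counting "positives" would additionally require a substitution matrix (e.g. BLOSUM62) to decide which non-identical pairs have a positive score; this is left out of the sketch.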

The probability of observing such a significant match by chance is so low that we can safely consider these two proteins very likely to be homologous (i.e. to originate from a common ancestor).

Saving the matching sequences in FASTA format

After having performed the BLAST search, one generally wants to store the resulting sequences in FASTA format, because this format can be taken as input by many other bioinformatics tools. For example, a FASTA file can be loaded into ClustalX in order to obtain a multiple alignment from the collection of sequences collected with BLAST (remember: BLAST is a pairwise alignment program, so the query sequence is aligned with each sequence of the database separately).

  1. In the box Result options on the left side of the BLAST summary table, click on Save.

  2. Select the maximum number of sequences to save.
  3. Select the option Save with view Alignment in FastA.

  4. Click Save and save the result in some file on your computer.

Notes on the saved files

  1. The files saved with the option Alignment in FastA contain a succession of pairwise alignments between the query protein and the database proteins. The gaps are indicated in the file by hyphens.
  2. The SRS server allows you to send the sequences directly to ClustalW, in order to perform a multiple alignment. The program ClustalX is however more convenient for displaying the result of a multiple alignment. We thus recommend storing the BLAST result on your computer, and then loading it in ClustalX, as explained in the tutorial on sequence alignment.
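Since the saved file contains gapped pairwise alignments, it can be useful to remove the hyphens before re-aligning the sequences with another program. The following sketch reads FASTA records, strips the gap characters and writes the records back. The record names and sequences are hypothetical, and the exact layout of the file saved by SRS may differ from this toy example.

```python
def read_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def strip_gaps(records):
    """Remove alignment gaps (hyphens) from every sequence."""
    return [(header, seq.replace("-", "")) for header, seq in records]


def write_fasta(records, width=60):
    """Format records as FASTA text, wrapping sequences at `width` columns."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical toy content, for illustration only.
saved = """>QUERY_TOY
MKV-LAAG
IV--GL
>HIT_TOY
MRVQLAAG
"""

ungapped = strip_gaps(read_fasta(saved))
print(write_fasta(ungapped), end="")
```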

Searching similarities in non-redundant collections

The query above returned a large number of hits, many of which were very highly significant. This significance is manifest because the E-values are very low (e.g. 1.0E-147), and the percentages of identities are very high (many proteins have >95% identity with our query).

The high number of matches comes from the fact that homologous proteins can be found in the multitude of sequenced bacterial genomes that are now available. This is very convenient when searching for putative orthologs of a given protein in different genomes, but sometimes one would like to obtain more distant hits, in order to detect similarities with proteins that have distinct functions. In recent years, the exponential increase of available sequences has had a paradoxical consequence: any BLAST search will return at the top of the list several dozens (or hundreds) of proteins having a very high similarity and, since the number of hits is restricted, the low-similarity matches will be missed.

Fortunately, the UniProtKB curators managed to circumvent this problem by creating collections of non-redundant proteins. If you want to search for more distant homologs, it is thus recommended to use these non-redundant collections. For this, you can come back to the BLAST query form, and select as Database one of the UniProt Cluster collections.
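The principle of a non-redundant collection can be illustrated with a greedy clustering: each sequence is compared to the representatives kept so far, and is kept as a new representative only if it is not too similar to any of them. This is only a sketch of the idea, not the procedure used to build the UniProt collections. The similarity measure (difflib.SequenceMatcher ratio) is a crude stand-in for an alignment-based percentage of identity, and the sequences and the 0.9 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def greedy_representatives(sequences, threshold=0.9):
    """Return the names of sequences kept as cluster representatives.

    Sequences are visited from longest to shortest; a sequence is kept only
    if its similarity to every previous representative is below `threshold`.
    """
    representatives = []
    by_length = sorted(sequences.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        if all(similarity(seq, rep_seq) < threshold for _, rep_seq in representatives):
            representatives.append((name, seq))
    return [name for name, _ in representatives]


# Hypothetical toy sequences: three near-identical copies and one unrelated.
toy = {
    "protA_strain1": "MKVLAAGIVGLLLAQ",
    "protA_strain2": "MKVLAAGIVGLLLAQ",
    "protA_strain3": "MKVLAAGIVGLLLAE",
    "protB": "WWPPHHCCDDEEFFN",
}

reps = greedy_representatives(toy)
print(reps)
```

Searching against the representatives only leaves room in the hit list for more distant matches, since the near-identical copies no longer fill the top of the list.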

[back to contents]

Running BLAST from Entrez

Getting the query sequence

  1. Connect to the Entrez server at NCBI.

  2. Select the Protein database.

  3. Retrieve Homoserine O-succinyltransferase from the strain Escherichia coli K12 with the following query:
    Homoserine O-succinyltransferase AND Escherichia coli K12[organism]
    (don't forget that the AND must be in uppercase)

  4. Select the entry (it should be P07623) by checking the check box on the left of the identifier.

  5. At the top of the record, there is a button Send to followed by a pop-up menu. In the pop-up menu, select Clipboard, and click Send to. This will allow you to retrieve the record later.

  6. Now open the record of the selected protein. Select the sequence (at the bottom of the record) and copy it.
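The same query can in principle be sent programmatically through the NCBI E-utilities, which is an addition to this tutorial (the steps above only use the web interface). The sketch below merely builds the request URLs for esearch and efetch, without contacting the server. The function names are hypothetical, and NCBI usage policies (request rate, identification of the client) should be checked before sending real requests.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="protein", retmax=20):
    """URL of an E-utilities search returning the identifiers matching `term`."""
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_fasta_url(identifier, db="protein"):
    """URL returning one record in FASTA format."""
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(
        {"db": db, "id": identifier, "rettype": "fasta", "retmode": "text"}
    )


query = "Homoserine O-succinyltransferase AND Escherichia coli K12[organism]"
search_url = esearch_url(query)
fetch_url = efetch_fasta_url("P07623")
print(search_url)
print(fetch_url)
```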

Matching one protein sequence to a database with BlastP

  1. Connect to the NCBI BLAST server.
  2. Select the tool Protein-protein BLAST (blastp).
  3. Paste the sequence in the Query sequence box.
  4. At the bottom of the form, click the link Algorithm parameters. By default, the threshold on E-value (Expect threshold) is set to 10. As explained above, this is generally not a good idea, because it means that we expect 10 false matches per search on average. We recommend setting this option to a much lower value (e.g. Expect threshold = 1.0E-10).
  5. Choose UniProt as database.
  6. Leave all other options unchanged and submit the query.
  7. The result is more or less similar to what we obtained with SRS, but in addition we have a graphical summary, showing each match as a horizontal bar, with a link to the corresponding alignments.
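For readers who prefer the command line, comparable settings can be passed to the blastp program of the NCBI BLAST+ suite. This is an alternative to the web form, not part of the procedure above. The sketch below only assembles the argument list and prints it; nothing is executed. The file names and the database name swissprot are assumptions, and the option -remote (which sends the search to the NCBI servers) requires BLAST+ to be installed locally.

```python
import shlex


def blastp_command(query_fasta, database, max_evalue=1e-10, max_hits=500,
                   out_file="blastp_hits.tsv", remote=True):
    """Build a BLAST+ blastp command line with a strict E-value threshold."""
    cmd = [
        "blastp",
        "-query", query_fasta,
        "-db", database,
        "-evalue", f"{max_evalue:g}",        # upper threshold on the E-value
        "-max_target_seqs", str(max_hits),   # maximum number of hits to keep
        "-outfmt", "6",                      # tab-separated summary table
        "-out", out_file,
    ]
    if remote:
        cmd.append("-remote")
    return cmd


# Hypothetical file and database names.
command = blastp_command("META_ECOLI.fasta", "swissprot")
print(shlex.join(command))
```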

Exporting the sequences in FASTA format

After having performed the BLAST search, we would like to store the resulting sequences in a FASTA file. This will allow us to use these sequences as input for other programs, for instance ClustalX in the tutorial on sequence alignment.

  1. In the BLAST result page, locate the sub-title Alignments (just after the list of Descriptions).
  2. You can either select some sequences of your choice, or select all the BLAST results. For the latter, click the check box Select All beside the title Alignments.
  3. Now click on the link Get selected sequences. This will open a page in Entrez, with a Summary view of the sequences detected with BLAST.
  4. At the top of the summary page, select the option Display FASTA. Beware: by default, only 20 sequences are shown.
  5. In order to see all sequences, set the option Show to 500.
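Because only part of the sequences may be displayed at a time, it is worth checking that the saved file contains as many sequences as you selected. A minimal sketch, assuming the file is plain FASTA text (one header line starting with ">" per sequence); the toy content is hypothetical.

```python
def count_fasta_records(text):
    """Number of sequences in FASTA text (one '>' header line per record)."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# Hypothetical toy content, for illustration only.
exported = """>hit_1 toy record
MKVLAAGIVG
>hit_2 toy record
MRVQLAAGLV
LLAQ
>hit_3 toy record
MKILSAG
"""

n_records = count_fasta_records(exported)
print(f"{n_records} sequences found in the exported file")
```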

Using the BLinks facilities at NCBI

In the preceding tutorial, we performed a BLAST search ourselves in order to understand the basic steps for scanning a database with some query sequence. However, since our query sequence was already stored in the NCBI database, we could have avoided this effort by using the BLink facility at NCBI.

The BLinks can be obtained simply by clicking the link BLink beside a given protein. We give a short example.

  1. Connect to the Entrez Web site.
  2. In the query box, type the following text.
  3. On the right side of the identifier P04386, click on the link BLink. This displays a list of all the BLAST hits detected by the BLAST server at NCBI. At the top of the form, you can see the number of hits in the main taxonomic groups (Archaea, Bacteria, Metazoa, Fungi, ...). Surprisingly, although Gal4p belongs to a protein family described as "Fungal-type", we find some matches in Bacteria, Vertebrates, and other taxa.
  4. By clicking on a taxon (e.g. Bacteria), we can select among 3 options: Only, Exclude and Include. In order to display the bacterial matches, click on Bacteria and select Only Bacteria. Select the first match and click on its raw score. This should display an alignment like this.

How do you interpret this result? How many residues does it cover? What are the percentages of identity and similarity? How many matches of this type would be expected by chance? Is the match significant?

Interpretation of the results

The BLink results should be interpreted with caution, because they were run with the default parameters, which are not appropriate for all types of analyses. For example, the BLinks from the yeast protein Gal4p include many matches in Fungi (which is normal), but also a few matches in Bacteria. These bacterial matches have a fairly high E-value (Expect=0.2), with a fairly low percentage of identity (24%) and similarity (41%). They are thus more likely to be spurious matches than relevant results. If you are interested in a particular protein, it is thus generally a good idea to perform your own BLAST search with an appropriate choice of parameters.
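An E-value can be converted into the probability of observing at least one match of that quality by chance, under the usual assumption that the number of chance matches follows a Poisson distribution: P = 1 - exp(-E). The sketch below applies this conversion to the two E-values quoted in this tutorial, plus the default threshold of 10.

```python
import math


def prob_at_least_one_chance_hit(evalue):
    """P(at least one chance match) = 1 - exp(-E), assuming Poisson counts.

    math.expm1 keeps the result accurate for very small E-values.
    """
    return -math.expm1(-evalue)


for evalue in (1e-118, 0.2, 10):
    p = prob_at_least_one_chance_hit(evalue)
    print(f"E-value = {evalue:g}  ->  P = {p:.3g}")
```

For very small E-values the probability is practically equal to the E-value itself, whereas with E = 0.2 a match of this quality would show up by chance in roughly one search out of five or six.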

[back to contents]

Exercises

  1. In SRS, use Blastp to retrieve all the sequences from the EMBL database showing a significant similarity with the gene thrA from Escherichia coli (this gene codes for the enzyme aspartokinase I/homoserine dehydrogenase I). Analyze the resulting matches. Interpret the E-value and the pairwise alignments.

  2. Perform the same query using the peptidic sequence for this protein. Compare the results obtained with the searches based on peptidic and nucleic acid sequences, respectively.

  3. Perform the same query using PSI-BLAST. Compare the results.

  4. On the NCBI server, run BlastP with the peptidic sequence of the aspartokinase I/homoserine dehydrogenase I (in the NCBI database, this protein has the identifier NP_414543). Analyze the hit map.

  5. Retrieve from the NCBI the sequence with gene identifier 639675. Use BLAST to identify similar sequences in the non-redundant database. How good are the matching scores? How do you interpret the result?

  6. Retrieve from SRS the sequence with identifier GROU_DROME. Search for similar proteins in UniProt. How many results do you obtain? Does the function of the matched proteins relate to that of your query?

[back to contents]


    Jacques van Helden (van-helden.j@univmed.fr)