We will use BLAST to retrieve sequences on the basis of their similarity with a query sequence.
The result table gives you a summary view of the BLAST results. The last columns indicate the statistical parameters of the match: E-value, percentage of identity, match length.
The first column (BLASTP) of the summary table gives links to the pairwise alignments of your query against each matching protein. Not surprisingly, the top result is a hit of your query protein against itself (since your query protein is part of the Swiss-Prot database). In order to see a less trivial alignment, go down the list and select a match with an intermediate level of identity (e.g. ~70% identity). Click on the corresponding link in the left column of the table. Here is an example of one alignment resulting from BLAST.
>SW:META_AERHH A0KFJ6 Homoserine O-succinyltransferase OS=Aeromonas hydrophila subsp. hydrophila (strain ATCC 7966 / NCIB 9240) GN=metA PE=3 SV=1
Length = 317

Score = 425 bits (1093), Expect = e-118
Identities = 201/299 (67%), Positives = 237/299 (79%)

Query: 1   MPIRVPDELPAVNFLREENVFVMTTSRASGQEIRPLKVLILNLMPKKIETENQFLRLLSN 60
           MPI++PD+LPA   L +EN+FVMT SRA  Q IRPL+VLILNLMPKKIETE Q +R+LSN
Sbjct: 1   MPIKIPDQLPAAEVLGQENIFVMTESRAVTQNIRPLRVLILNLMPKKIETEIQLMRMLSN 60

Query: 61  SPLQVDIQLLRIDSRESRNTPAEHLNNFYCNFEDIQDQNFDGLIVTGAPLGLVEFNDVAY 120
           SPLQVD+ LLRID RES+NTP  HL NFY +FE ++  N+DG+I+TGAPLGLVEF +V Y
Sbjct: 61  SPLQVDVDLLRIDDRESKNTPQAHLENFYHDFEQVRGNNYDGMIITGAPLGLVEFEEVVY 120

Query: 121 WPQIKQVLEWSKDHVTSTLFVCWAVQAALNILYGIPKQTRTEKLSGVYEHHILHPHALLT 180
           WP+I +++EWS  HVTSTLF+CWAVQAAL  LYG+ KQT  EKLSGVY HH L  H  L
Sbjct: 121 WPRIVEIIEWSHQHVTSTLFLCWAVQAALKALYGMEKQTHGEKLSGVYRHHRLDEHEPLL 180

Query: 181 RGFDDSFLAPHSRYADFPAALIRDYTDLEILAETEEGDAYLFASKDKRIAFVTGHPEYDA 240
           RGFDD F+APHSRYA F   LIR +TDL+I AE+ E   YL A+KD R  FVTGHPEYDA
Sbjct: 181 RGFDDEFVAPHSRYAAFDGDLIRAHTDLQIFAESAEAGVYLAATKDCRQVFVTGHPEYDA 240

Query: 241 QTLAQEFFRDVEAGLDPDVPYNYFPHNDPQNTPRASWRSHGNLLFTNWLNYYVYQITPY 299
            TL  E+ RD+ AGL+P +P NY+P NDP  TPRASWRSHG+LLF+NWLNYYVYQ+T Y
Sbjct: 241 LTLDGEYQRDLAAGLEPVIPVNYYPDNDPTRTPRASWRSHGHLLFSNWLNYYVYQLTSY 299
The first row of the BLAST result gives a short description of the protein from the database that matches our query protein.
The next lines indicate various matching scores.
The raw score (S=1093 in our example) appears in parentheses after the bit score. The raw score is the sum of the scores of aligned residues according to the substitution matrix used for the search (BLOSUM62 in our case), minus the gap penalties. This score is hard to interpret on its own, because it depends on the scoring system (substitution matrix and gap penalties) as well as on the length of the alignment. It is therefore converted into a normalized bit score (S'):
S' = (lambda * S - ln(K)) / ln(2)
This normalized score is expressed in bits (due to the division by ln(2) in the formula). It is convenient because it can be used to compare the results of various BLAST searches.
The score labelled Expect is the E-value. It is obtained from the bit score.
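The conversion from raw score to bit score can be checked with a few lines of Python. This is a minimal sketch: the values of lambda and K are not given in this tutorial, and the ones used below (0.267 and 0.041) are assumed to be those commonly reported for BLOSUM62 with gap penalties 11/1. The BLAST output itself normally lists the values that were actually used.

```python
import math

# Assumed Karlin-Altschul parameters (commonly reported for BLOSUM62 with
# gap open = 11 and gap extend = 1). They are NOT stated in this tutorial.
LAMBDA = 0.267
K = 0.041


def bit_score(raw_score, lam=LAMBDA, k=K):
    """Convert a raw score S into a bit score S' = (lambda*S - ln(K)) / ln(2)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Raw scores taken from the two alignments shown in this tutorial.
for raw in (1093, 100):
    print(raw, round(bit_score(raw), 1))
```

Under these assumed parameters, the raw scores 1093 and 100 give bit scores close to the 425 bits and 43.1 bits reported in the alignments of this tutorial.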
E = N / 2^S'
where N is the size of the search space, which is more or less equal to the total number of residues in the database (n) multiplied by the size of the query protein (m).
The E-value represents the number of matches that would be expected by chance in the whole database search, for a given level of similarity. In our case, the expectation is far lower than 1 (Expect=e-118), indicating that the match is highly significant. The E-value is the most intuitive way to estimate the significance of a BLAST match. Matches with an E-value greater than 1 should not be considered significant, since more than one such match is expected by chance. In other words, if we generate a random sequence and match it against a database with an E-value threshold of 10, we expect about 10 false positives per search.
In the example above, we can consider that the protein META_AERHH from Aeromonas hydrophila presents a highly significant match (E-value=e-118) with the protein META_ECOLI from Escherichia coli K12. The alignment shows 67% identity and 79% similarity ("positives") over a total length of 299.
The probability of observing such a significant match by chance is so low that we can safely consider these two proteins very likely to be homologous (i.e. to derive from a common ancestor).
After having performed the BLAST search, one generally wants to store the resulting sequences in FASTA format, because this format is accepted as input by many other bioinformatics tools. For example, a FASTA file can be loaded into ClustalX in order to obtain a multiple alignment of the sequences collected with BLAST (remember: BLAST is a pairwise alignment program; the query sequence is aligned with each sequence of the database separately).
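The relation E = N / 2^S', with N = m * n, can be sketched in Python as follows. The function names are hypothetical, and the database size used in the example is an arbitrary illustrative number, not the size of any real database.

```python
import math


def e_value(bit_score, query_length, database_residues):
    """Expected number of chance matches: E = (m * n) / 2^S'."""
    search_space = query_length * database_residues
    # 2.0 ** (-S') underflows to 0.0 for huge bit scores instead of raising.
    return search_space * 2.0 ** (-bit_score)


def bits_needed(threshold, query_length, database_residues):
    """Smallest bit score S' giving E <= threshold: S' = log2(m * n / E)."""
    return math.log2(query_length * database_residues / threshold)


# Hypothetical search: a 300-residue query against 10 million residues.
m, n = 300, 10_000_000
for bits in (20, 40, 60):
    print(bits, e_value(bits, m, n))
print("bits needed for E <= 1:", round(bits_needed(1.0, m, n), 1))
```

Each additional bit halves the E-value, whereas doubling the database size doubles it. This is why the same alignment can be significant in a small database and not in a large one.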
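When many alignments have to be compared, the score lines can be read by a program rather than by eye. The sketch below assumes the text layout shown in this tutorial (other BLAST versions or output formats may differ), and the function name parse_blast_stats is hypothetical.

```python
import re


def parse_blast_stats(text):
    """Extract scores and counts from the score lines of a BLAST alignment."""
    stats = {}
    match = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    if match:
        stats["bit_score"] = float(match.group(1))
        stats["raw_score"] = int(match.group(2))
    match = re.search(r"Expect\s*=\s*([0-9.eE+-]+)", text)
    if match:
        value = match.group(1)
        # BLAST may print "e-118" without a leading digit; float() needs "1e-118".
        if value.lower().startswith("e"):
            value = "1" + value
        stats["e_value"] = float(value)
    for label in ("Identities", "Positives", "Gaps"):
        match = re.search(label + r"\s*=\s*(\d+)/(\d+)", text)
        if match:
            stats[label.lower()] = (int(match.group(1)), int(match.group(2)))
    return stats


header = ("Score = 425 bits (1093), Expect = e-118 "
          "Identities = 201/299 (67%), Positives = 237/299 (79%)")
stats = parse_blast_stats(header)
matched, length = stats["identities"]
print(stats["e_value"], round(100 * matched / length), "% identity")
```

The percentages are recomputed from the counts (201/299) rather than read from the text, which avoids depending on how BLAST rounds them.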
In the box Result options on the left side of the BLAST summary table, click on Save.
Select the option Save with view Alignment in FastA.
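Once the sequences are saved, the FASTA file is easy to read with a short script. The sketch below uses only the Python standard library. The two records are just the first 60 residues of the query and of the subject from the alignment shown earlier, used here as stand-ins for a real saved file, and the function name read_fasta is hypothetical.

```python
def read_fasta(lines):
    """Return a dict mapping each sequence identifier to its sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word after ">".
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Sequences may span several lines; join the pieces.
    return {name: "".join(parts) for name, parts in records.items()}


# Fragments only (first 60 residues of each protein), split over two lines.
example = """>META_ECOLI fragment
MPIRVPDELPAVNFLREENVFVMTTSRASG
QEIRPLKVLILNLMPKKIETENQFLRLLSN
>META_AERHH fragment
MPIKIPDQLPAAEVLGQENIFVMTESRAVT
QNIRPLRVLILNLMPKKIETEIQLMRMLSN
"""

sequences = read_fasta(example.splitlines())
for name, seq in sequences.items():
    print(name, len(seq))
```

For a real file, the same function can be called on an open file handle, for example read_fasta(open("blast_hits.fasta")), where the file name is whatever you chose when saving.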
The query above returned a large number of hits, many of which were very highly significant. This significance is apparent from the very low E-values (e.g. 1.0E-147) and from the very high percentages of identity (many proteins have >95% identity with our query).
The high number of matches comes from the fact that homologous proteins can be found in the multitude of sequenced bacterial genomes that are now available. This is very convenient when searching for putative orthologs of a given protein in different genomes, but sometimes one would like to obtain more distant hits, in order to detect similarities with proteins performing distinct functions. In recent years, the exponential increase in available sequences has had a paradoxical consequence: any BLAST search returns at the top of the list several dozen (or hundreds of) proteins with a very high similarity and, since the number of reported hits is restricted, the low-similarity matches are missed.
Fortunately, the UniProtKB curators managed to circumvent this problem by creating collections of non-redundant proteins. If you want to search for more distant homologs, it is thus recommended to use these non-redundant collections. To do so, come back to the BLAST query form and select as Database one of the Uniprot Cluster collections.
Homoserine O-succinyltransferase AND Escherichia coli K12[organism] (do not forget that AND must be in uppercase)
In the preceding tutorial, we performed a BLAST search ourselves in order to understand the basic steps for scanning a database with some query sequence. However, since our query sequence was already stored in the NCBI database, we could have avoided this effort by using the BLink facility at NCBI.
The BLinks can be obtained simply by clicking the BLink link beside a given protein. We give a short example.
GAL4[Gene Name] AND "Saccharomyces cerevisiae"[Organism]
Score = 43.1 bits (100), Expect = 0.20
Identities = 97/396 (24%), Positives = 165/396 (41%), Gaps = 51/396 (12%)

Query 187 LKTDPNNNGFFGDGSLLCILRSIGF---KPENYTNSNVNRLPTMITDRYTLASRSTTSRL 243
          LK D NN+ F    +L  IL  +GF   K  +YTNS +N L   +T + T+    + S L
Sbjct 47  LKKDKNNHHFLLRKAL--ILDRLGFYYRKNTDYTNS-LNCLQQSLTIKETIGETESIS-L 102

Query 244 LQSYLNNFHPYCPIVHSPTLMMLYNNQIEIASKDQWQILFNCILAIGAWCIEGESTDIDV 303
            SYL +   Y         +   N  +  A K     +   IL + + C    ST  D
Sbjct 103 TYSYLGSL--YLRKNDRERAIKFLNKAMGFAEKYNHDEI-PYILNLYSNCY---STFKD- 155

Query 304 FYYQNAKSHLTSKVFESGSIILVT----ALHLLSRYTQWRQKTNTSYNFHSFSIRMAISL 359
            Y+  K +       S SI  +     AL LLSRY+Q     N S  + S ++++AI+
Sbjct 156 --YEKGKIYALKAYKHSDSINDIKEKTYALTLLSRYSQKNTDYNASIEYSSRALKLAIAS 213

Query 360 G-------LNRDLPSSFSDSSILEQRRRIWWSVYSWEIQLSLLYGRSIQLSQNTISFPSS 412
          G        N++L  ++     LE+ ++     Y   ++LS   G   +++   +S  ++
Sbjct 214 GDIIFIELCNKELGYAYRK---LEEPKKA-LDYYLKSLELSKKIGFKNRIANRYLSVSNA 269

Query 413 VDDVQRTTTGPTIYHGIIETARLLQVFTK----IYELDKTVTAEKSPICAKKCLMICNEI 468
            D+    T    Y+ + +  ++  +  K      ELD     ++  +  K  L+   E
Sbjct 270 YTDLNEHETA-FYYYRLYKRQQIKNLNIKNIRDFAELDADYKYKREKL--KDSLLFVQE- 325

Query 469 EEVSRQAPKFLQMDISTTALTNLLKEHPWLSFTRFELKWKQLSLIIYVLR--DFFTNFTQ 526
                  K  +  I T +  N +K+  W+ F    L    L  IIY++R   F T+  +
Sbjct 326 -------KKLSEAKIETLSAENRIKKQ-WMLFGGIGL--LVLFSIIYLIRLQKFATSKQK 375

Query 527 KKSQLEQDQNDHQSYEVKRCSIMLSDAAQRTVMSVS 562
           + Q  QD  + Q  E  R +  L D+  + ++ +S
Sbjct 376 LQQQFSQDLINEQEKERSRLARELHDSIGQKLIFLS 411

CPU time: 0.04 user secs. 0.05 sys. secs 0.09 total secs.
How do you interpret this result? How many residues does it cover? What are the percentages of identity and similarity? How many matches of this type would be expected by chance? Is the match significant?
The BLink results should be interpreted with caution, because they were obtained with the default parameters, which are not appropriate for all types of analyses. For example, the BLinks from the yeast protein Gal4p include many matches in Fungi (which is expected), but also a few matches in Bacteria. These bacterial matches have a rather high E-value (Expect=0.2), together with a low percentage of identity (24%) and similarity (41%). They are thus more likely to be spurious matches than relevant results. If you are interested in a particular protein, it is thus generally a good idea to perform your own BLAST search with an appropriate choice of parameters.