SIB-BLAST is a novel algorithm developed to overcome the model corruption problem that occurs frequently in the later iterations of PSI-BLAST searches. The algorithm compares the hits from iteration two with those from the final iteration of a PSI-BLAST search, calculates a figure of merit (FOM) for each "overlapped" hit, that is, a hit found in both iterations, and re-ranks the hits according to their figure of merit. The algorithm rests on the observation that the profile, namely the position-specific scoring matrix (PSSM), is least corrupted in the first two rounds of a PSI-BLAST search, because it is built mostly from close homologs. These profiles are used to search for more distant homologs, which are in turn used to generate subsequent PSSMs. As more distant homologs are incorporated into the PSSM, non-homologous sequences are frequently included as well, which leads to model corruption. Hence, "benchmarking" hits from a later iteration against an earlier round, when the model is least corrupted, should improve the accuracy of a PSI-BLAST search.
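The comparison and re-ranking described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual implementation: the function name `rerank_overlapped_hits`, the dictionary-based input format, and the `combine` callback are all assumptions made for the example. In particular, the exact FOM formula is not given here (it is defined in the accompanying publication), so the sketch takes the combining rule as a parameter, and the geometric mean used in the usage lines is purely a placeholder. The sketch also assumes that, like an E-value, a smaller FOM indicates a more significant hit.

```python
import math
from typing import Callable, Dict, List, Tuple


def rerank_overlapped_hits(
    round2: Dict[str, float],
    last_round: Dict[str, float],
    combine: Callable[[float, float], float],
) -> List[Tuple[str, float, float, float]]:
    """Re-rank hits that appear in both round 2 and the last round.

    round2 and last_round map a sequence identifier (e.g. a GI number)
    to the E-value reported in that iteration. `combine` is a
    hypothetical stand-in for the real figure-of-merit formula.
    Returns rows of (identifier, E-value round 2, E-value last round, FOM).
    """
    # Only "overlapped" hits, i.e. those present in both iterations, are kept.
    overlapped = round2.keys() & last_round.keys()
    rows = [
        (gi, round2[gi], last_round[gi], combine(round2[gi], last_round[gi]))
        for gi in overlapped
    ]
    # Assumption: smaller FOM ranks first; ties are broken by identifier.
    rows.sort(key=lambda row: (row[3], row[0]))
    return rows


# Placeholder combining rule, NOT the published FOM definition.
def geometric_mean(e_round2: float, e_last: float) -> float:
    return math.sqrt(e_round2 * e_last)


# Made-up identifiers and E-values, for illustration only.
round2_hits = {"111": 1e-30, "222": 1e-4, "333": 1e-10}
last_hits = {"111": 1e-40, "222": 1e-50, "444": 1e-60}

ranked = rerank_overlapped_hits(round2_hits, last_hits, geometric_mean)
for gi, e2, e_last, fom in ranked:
    print(gi, e2, e_last, fom)
```

Hits "333" and "444" are dropped because each appears in only one of the two iterations; only the overlapped hits receive a figure of merit and a rank.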
SIB-BLAST consists of the following steps:
SIB-BLAST requires three inputs from the user:
SIB-BLAST output is a rank-ordered list of putative homologs of the query sequence found in the non-redundant protein database. For the test sequence described above it is:
Explanation of output parameters:
GI number: this is the sequence identification number assigned to each sequence record processed by NCBI.
E-value at round 2: this is the E-value (Expectation value) reported by PSI-BLAST for round 2 of the search.
E-value at last round: this is the E-value reported by PSI-BLAST for the last iteration (the number of iterations is defined by the user).
Figure of Merit: this is the "combined" E-value of iteration 2 and the last iteration.
Based on empirical analysis of the Aravind model test set (Aravind and Koonin, 1999), errors begin to appear at an accelerated rate at a FOM of approximately 10^-8. Note, however, that this FOM threshold is expected to depend on the database size.
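One way to apply such a threshold when reading a result list is sketched below. The function name `split_by_fom` and the input format are hypothetical, and the sketch assumes that a smaller FOM means a more reliable hit, as with E-values. The default cutoff of 1e-8 is only the empirical value quoted above and may need adjusting for a different database size.

```python
from typing import Dict, Tuple


def split_by_fom(
    foms: Dict[str, float], threshold: float = 1e-8
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split hits into those below the FOM threshold and the rest.

    Assumption: hits with FOM below the threshold are the more
    trustworthy ones; hits at or above it deserve closer inspection.
    """
    below = {gi: fom for gi, fom in foms.items() if fom < threshold}
    at_or_above = {gi: fom for gi, fom in foms.items() if fom >= threshold}
    return below, at_or_above


# Made-up identifiers and FOM values, for illustration only.
example = {"111": 1e-35, "222": 1e-27, "555": 1e-3}
trusted, to_inspect = split_by_fom(example)
print(sorted(trusted), sorted(to_inspect))
```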
An explanation of the FOM can be found in the manuscript accompanying this web server and in the following publication: M.M. Lee, M. Chan, and R. Bundschuh, Simple is beautiful: a straightforward approach to improve the delineation of true and false positives in PSI-BLAST searches, Bioinformatics 24 (2008) 1339-1343.
Please note that result files will be deleted one week after submission.
Aravind, L. and Koonin, E.V. (1999) Gleaning non-trivial structural, functional and evolutionary information about proteins by iterative database searches. J Mol Biol, 287, 1023-1040.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V. and Altschul, S.F. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res, 29, 2994-3005.