PSI-BLAST
---------
The blastpgp program can do an iterative search in which
sequences found in one round of searching are used to build
a score model for the next round of searching. In this usage,
the program is called Position-Specific Iterated BLAST, or PSI-BLAST.
As explained in the accompanying paper, the BLAST algorithm is
not tied to a specific score matrix. Traditionally, it has been
implemented using an AxA substitution matrix, where A is the alphabet size.
PSI-BLAST instead uses a QxA matrix, where Q is the length of the query
sequence; at each position, the score for a letter depends on its position
in the query and on the letter in the subject sequence.
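
The difference between the two matrix shapes can be illustrated with a
small Python sketch. It is only an illustration under stated assumptions:
the four-letter alphabet, the +2/-1 scores, and all function names are
hypothetical and are not taken from the blastpgp source. It scores an
ungapped alignment once with an AxA matrix and once with a QxA matrix
whose rows are seeded from the AxA matrix.

```python
# Toy illustration only: alphabet, scores, and names are hypothetical.
ALPHABET = "ACDE"

def make_substitution_matrix(match=2, mismatch=-1):
    """AxA matrix: the score depends on the pair (query letter, subject letter)."""
    return {a: {b: (match if a == b else mismatch) for b in ALPHABET}
            for a in ALPHABET}

def seed_position_matrix(query, matrix):
    """QxA matrix: one row per query position, seeded from the AxA row
    that belongs to the query letter at that position."""
    return [dict(matrix[q]) for q in query]

def score_with_substitution_matrix(query, subject, matrix):
    """Ungapped score using the AxA matrix."""
    return sum(matrix[q][s] for q, s in zip(query, subject))

def score_with_position_matrix(subject, position_matrix):
    """Ungapped score using the QxA matrix: row i is used for subject letter i."""
    return sum(position_matrix[i][s] for i, s in enumerate(subject))

query, subject = "ACDA", "ACEA"
matrix = make_substitution_matrix()
position_matrix = seed_position_matrix(query, matrix)

# A freshly seeded QxA matrix reproduces the AxA scores exactly.
same = (score_with_substitution_matrix(query, subject, matrix)
        == score_with_position_matrix(subject, position_matrix))

# Changing one row affects only that query position.
position_matrix[2]["E"] = 3
```

Once a row has been changed, the same subject letter can score differently
at different query positions, which an AxA matrix cannot express.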
The position-specific matrix for round i+1 is built from a constrained
multiple alignment among the query and the sequences found with
sufficiently low e-value in round i. The top part of the output for
each round separates the sequences into two groups: sequences found
previously and used in the score model, and sequences not used in the
score model. The output currently includes many diagnostics
requested by users at NCBI. To skip quickly from the output of
one round to the next, search for the string "producing", which is
part of the header for each round and likely does not appear elsewhere
in the output. PSI-BLAST "converges" and stops if all sequences
found at round i+1 below the e-value threshold were already in
the model at the beginning of the round.
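
The iteration and convergence rule described above can be sketched in
Python as follows. This is a minimal sketch, not the blastpgp
implementation: the search callback, the function name, and the use of a
set of sequence identifiers to stand in for the score model are all
assumptions made for illustration, as is the strict "<" comparison
against the threshold.

```python
from typing import Callable, Dict, Set, Tuple

Hits = Dict[str, float]  # sequence identifier -> e-value

def iterate_until_converged(
    search: Callable[[Set[str]], Hits],  # hypothetical: runs one round
    max_rounds: int = 1,
    threshold: float = 0.001,
) -> Tuple[int, Set[str], bool]:
    """Return (rounds run, identifiers in the last model, converged flag)."""
    model: Set[str] = set()  # round 1 starts with no sequences in the model
    for round_no in range(1, max_rounds + 1):
        hits = search(model)
        # Assumption: "below the threshold" is taken as strictly less than.
        included = {sid for sid, evalue in hits.items() if evalue < threshold}
        # Converged: every sequence found below the threshold was already
        # in the model at the beginning of this round.
        if included <= model:
            return round_no, model, True
        # Otherwise the next round's model is built from this round's hits.
        model = included
    return max_rounds, model, False

def toy_search(model: Set[str]) -> Hits:
    """Made-up results that improve as the model grows."""
    if not model:
        return {"a": 1e-10, "b": 0.5}
    if model == {"a"}:
        return {"a": 1e-12, "b": 1e-5, "c": 2.0}
    return {"a": 1e-12, "b": 1e-6, "c": 0.8}

result = iterate_until_converged(toy_search, max_rounds=5)
```

With the made-up results in toy_search, the third round finds nothing new
below the threshold, so the loop stops before reaching max_rounds.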
There are several blastpgp parameters specifically for PSI-BLAST:
   -j is the maximum number of rounds (default 1; i.e., regular BLAST)
   -h is the e-value threshold for including sequences in the
      score matrix model (default 0.001)
   -c is the "constant" used in the pseudocount formula specified in the
      paper (default 10)
The -C and -R flags provide a "checkpointing" facility whereby
a score model can be stored and later reused.
   -C stores the query and frequency count ratio matrix in a file
   -R restarts from a file stored previously.
When using -R, it is required that the query specified on the command line
match exactly the query in the restart file.
The checkpoint files are stored in a byte-encoded (not human readable)
format, so as to prevent roundoff error between writing and reading
the checkpoint.
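
As a sketch of how these flags combine, the Python helper below assembles
an argument list for a run that saves a checkpoint and for a later run
that restarts from it. The helper name, its parameter names, and the file
names are hypothetical; only the flags themselves (-i, -d, -j, -h, -c, -C,
-R) come from this document. The helper only builds the list and does not
run blastpgp.

```python
from typing import List, Optional

def blastpgp_args(
    query: str,
    database: str,
    rounds: int = 1,                  # -j, default 1 (regular BLAST)
    inclusion_evalue: float = 0.001,  # -h, default 0.001
    pseudocount: int = 10,            # -c, default 10
    save_checkpoint: Optional[str] = None,     # -C file
    restart_checkpoint: Optional[str] = None,  # -R file
) -> List[str]:
    """Build an argument list for blastpgp (hypothetical helper)."""
    args = ["blastpgp", "-i", query, "-d", database,
            "-j", str(rounds),
            "-h", str(inclusion_evalue),
            "-c", str(pseudocount)]
    if save_checkpoint is not None:
        args += ["-C", save_checkpoint]
    if restart_checkpoint is not None:
        args += ["-R", restart_checkpoint]
    return args

# First run: iterate and store the model ("model.chk" is a made-up name).
first = blastpgp_args("seq1", "nr", rounds=3, save_checkpoint="model.chk")

# Later run: restart from the stored model with the same query file,
# since -R requires the query to match the one in the restart file.
later = blastpgp_args("seq1", "nr", rounds=2, restart_checkpoint="model.chk")
```

Either list could be passed to subprocess.run on a system where blastpgp
is installed.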
Users who also develop their own sequence analysis software may wish
to develop their own scoring systems. For this purpose the code
in posit.c that writes out the checkpoint can be easily adapted to
write out scoring systems derived by other algorithms in such
a way that PSI-BLAST can read the files in later.
The checkpoint structure is general in the sense that it can handle
any position-specific matrix that fits in the Karlin-Altschul
statistical framework for BLAST scoring.
The -B flag provides a way to jump start PSI-BLAST from a master-slave
multiple alignment computed outside PSI-BLAST. The multiple alignment
must include the query sequence as one of the sequences, but it need
not be the first sequence. The multiple alignment must be specified
in a format that is derived from Clustal, but without some headers and
trailers; see the example below. The rules are as follows.
Suppose the multiple alignment has N sequences. It
may be presented in 1 or more blocks, where each block presents a
range of columns from the multiple alignment. E.g., the first block
might have columns 1-60, the second block might have columns 61-95,
the third block might have columns 96-128. Each block should have N
rows, 1 row per sequence. The sequences should be in the same order
in every block. Blocks are separated by 1 or more blank lines.
Within a block there are no blank lines, and each line consists of 1
sequence identifier followed by some white space followed by
characters (and gaps) for that sequence in the multiple alignment. In
each column, all letters must be in upper case, or all letters must be
in lower case. Upper case means that this column is to be given
position-specific scores. Lower case means that the underlying
matrix (specified by -M) is used for this column; e.g., if the query sequence
has an 'l' residue in the column, then the standard scores for
matching an L are used in the column.
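
The format rules above can be checked before running blastpgp with a
short Python sketch such as the following. It is not the parser used by
blastpgp. The function names are hypothetical, and two points are
assumptions not stated above: the sequence part of a line is taken to
contain no internal white space, and '-' is taken to be the gap character.

```python
from typing import List, Tuple

def parse_alignment(text: str) -> Tuple[List[str], List[str]]:
    """Return (identifiers, full aligned rows) after checking the block rules."""
    blocks: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for line in text.splitlines():
        if line.strip():
            parts = line.split()
            if len(parts) != 2:  # identifier, white space, sequence
                raise ValueError("bad line: %r" % line)
            current.append((parts[0], parts[1]))
        elif current:            # one or more blank lines end a block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    if not blocks:
        raise ValueError("no alignment blocks found")
    names = [name for name, _ in blocks[0]]
    rows = [""] * len(names)
    for block in blocks:
        # Every block has N rows, in the same order as the first block.
        if [name for name, _ in block] != names:
            raise ValueError("blocks differ in row count or row order")
        # A block covers one range of columns, so its rows have equal width.
        if len({len(seq) for _, seq in block}) != 1:
            raise ValueError("rows in a block differ in width")
        for i, (_, seq) in enumerate(block):
            rows[i] += seq
    return names, rows

def column_modes(rows: List[str]) -> str:
    """One character per column: 'U' for position-specific scores,
    'l' for the underlying matrix, '-' for a column with no letters."""
    modes = []
    for col in range(len(rows[0])):
        letters = [row[col] for row in rows if row[col].isalpha()]
        if not letters:
            modes.append("-")
        elif all(c.isupper() for c in letters):
            modes.append("U")
        elif all(c.islower() for c in letters):
            modes.append("l")
        else:
            raise ValueError("column %d mixes upper and lower case" % (col + 1))
    return "".join(modes)

toy = "s1 ACDefg\ns2 AC-ef-\n\ns1 HIK\ns2 H-K\n"
names, rows = parse_alignment(toy)
modes = column_modes(rows)
```

For the toy two-block alignment, the two rows are rejoined across blocks
and each column is classified by the case of its letters.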
A sample usage would be:
   blastpgp -i seq1 -B align1 -j 2 -d nr
where seq1 is the query
      align1 is the alignment file
      -j 2 indicates to do 2 rounds
      -d nr indicates to use the nr database
The example files seq1 and align1, copied below, were kindly supplied
by L. Aravind from a paper he and Chris Ponting published in
Protein Science:
   Aravind L, Ponting CP. Homologues of 26S proteasome subunits
   are regulators of transcription and translation. Protein Science
   7 (1998) 1250-1254.
L. Aravind (aravind@ncbi.nlm.nih.gov) was the first user
and helped define how -B should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)
helped design a more flexible input format for the alignments.
If you like how -B works, let them know.
If you do not like how -B works, complain to
A. Schaffer (schaffer@helix.nih.gov), who did the implementation.
seq1
----
> 26SPS9_Hs
IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA
LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL
SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
align1
------
26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr
F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk
YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae
YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr
FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld
COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet
644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky
YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl
eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------
T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------
YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl
KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec
F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl
Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih
26SPS9_Hs sladfekaltdy-----------------------------------------------------------------------------------
F57B9_Ce rslkdfqvafgsf----------------------------------------------------------------------------------
YDL097c_Sc aynnrslldfntalkqy------------------------------------------------------------------------------
YMJ5_Ce krslkdfvkalaeh---------------------------------------------------------------------------------
FUS6_ARATH vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------
COS41.8_Ci eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------
644879 kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------
YPR108w_Sc yasdyasyfpyllety-------------------------------------------------------------------------------
eif-3p110_Hs -----------------------------------------------------------------------------------------------
T23D8.4_Ce -----------------------------------------------------------------------------------------------
YD95_Sp ylcdysgffrtladve-------------------------------------------------------------------------------
KIAA0107_Hs rysvffqslavv-----------------------------------------------------------------------------------
F49C12.8_Hs esyydchydrffiqlaale----------------------------------------------------------------------------
Int-6_Mm wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk
26SPS9_Hs ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP
F57B9_Ce ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV
YDL097c_Sc ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN
YMJ5_Ce ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD
FUS6_ARATH ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ
COS41.8_Ci ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET
644879 ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ
YPR108w_Sc ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN
eif-3p110_Hs ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP
T23D8.4_Ce ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP
YD95_Sp ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE
KIAA0107_Hs ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS
F49C12.8_Hs ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS
Int-6_Mm lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV