FASTA documentation

COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R. Pearson and the University of Virginia. All rights reserved. The FASTA program and documentation may not be sold or incorporated into a commercial product, in whole or in part, without written consent of William R. Pearson and the University of Virginia. For further information regarding permission for use or reproduction, please contact: David Hudson, Assistant Provost for Research, University of Virginia, P.O. Box 9025, Charlottesville, VA 22906-9025, (804) 924-6853

The FASTA program package

Introduction

This documentation describes version 3 of the FASTA program package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools for Biological Sequence Analysis", PNAS 85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996) "Effective protein sequence comparison" Meth. Enzymol. 266:227-258 (Pearson, 1996); Pearson et al. (1997) Genomics 46:24-36 (Zhang et al., 1997); Pearson (1999) Meth. in Molecular Biology 132:185-219 (Pearson, 1999)). Version 3 of the FASTA package contains many programs for searching DNA and protein databases and one program (prss3) for evaluating statistical significance from randomly shuffled sequences. Several additional analysis programs, including programs that produce local alignments, are part of version 2 of the FASTA package, which is still available.

This document is divided into three sections: (1) A summary overview of the programs in the FASTA3 package; (2) A guide to installing the programs and databases; (3) A guide to using the FASTA programs. The revision history of the programs can be found in the readme.v30..v33 files. The programs are very easy to use, so if you are using them on a machine that is administered by someone else, you can skip section (2) and focus on (1) and (3) to learn how to use the programs. If you are installing the programs on your own machine, you will need to read section (2) carefully.

1. An overview of the FASTA programs

Although there are a large number of programs in this package, they belong to three groups: (1) "Conventional" Library search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3; (2) Programs for searching with short fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical significance: PRSS3. Programs that start with fast search protein databases, while tfast programs search translated DNA databases. Table I gives a brief description of the programs.


              Table I. Comparison programs in the FASTA3 package

-------------------------------------------------------------------------------
fasta3             Compare a protein sequence to a protein sequence database or
                   a  DNA  sequence  to a DNA sequence database using the FASTA
                   algorithm (Pearson and Lipman, 1988, Pearson, 1996).  Search
                   speed and selectivity are controlled with the ktup(wordsize)
                   parameter.  For protein comparisons, ktup =  2  by  default;
                   ktup  =1 is more sensitive but slower.  For DNA comparisons,
                   ktup=6 by default; ktup=3 or ktup=4 provides  higher  sensi-
                   tivity;  ktup=1  should  be  used  for oligonucleotides (DNA
                   query lengths < 20).

ssearch3           Compare a protein sequence to a protein sequence database or
                   a  DNA  sequence to a DNA sequence database using the Smith-
                   Waterman algorithm (Smith and Waterman, 1981).  ssearch3  is
                   about 10-times slower than FASTA3, but is more sensitive for
                   full-length protein sequence comparison.

fastx3/ fasty3     Compare a DNA sequence to a protein  sequence  database,  by
                   comparing  the  translated  DNA sequence in three frames and
                   allowing gaps  and  frameshifts.   fastx3  uses  a  simpler,
                   faster algorithm for alignments that allows frameshifts only
                   between codons; fasty3 is slower but produces better  align-
                   ments  with  poor  quality sequences because frameshifts are
                   allowed within codons.

tfastx3/ tfasty3   Compare a protein sequence to a DNA sequence database,  cal-
                   culating  similarities  with  frameshifts to the forward and
                   reverse orientations.

tfasta3            Compare a protein sequence to a DNA sequence database,
                   calculating similarities (without frameshifts) to the three
                   forward and three reverse reading frames.  tfastx3 and
                   tfasty3 are preferred because they calculate similarity
                   over frameshifts.

fastf3/tfastf3     Compares an ordered peptide mixture, as would be obtained by
                   Edman degradation of a CNBr cleavage of a protein, against a
                   protein (fastf) or DNA (tfastf) database.

fasts3/tfasts3     Compares a set of short peptide fragments, as would be
                   obtained from mass-spec. analysis of a protein, against a
                   protein (fasts) or DNA (tfasts) database.
-------------------------------------------------------------------------------
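
As a reading aid, the pairing of query type and library type in Table I can be written down as a small lookup. This is only a sketch of the table's contents: the function names and the "protein"/"dna" labels are illustrative and are not part of the FASTA package.

```python
# Lookup derived from Table I. The names below are hypothetical helpers,
# not part of the FASTA distribution.
CONVENTIONAL_PROGRAMS = {
    ("protein", "protein"): ["fasta3", "ssearch3"],
    ("dna", "dna"): ["fasta3", "ssearch3"],
    ("dna", "protein"): ["fastx3", "fasty3"],
    # tfasta3 is listed last because Table I prefers tfastx3/tfasty3.
    ("protein", "dna"): ["tfastx3", "tfasty3", "tfasta3"],
}

# Default ktup values quoted in Table I for fasta3.
DEFAULT_KTUP = {"protein": 2, "dna": 6}


def suggest_programs(query_type, library_type):
    """Return the Table I programs for a (query, library) combination."""
    key = (query_type.lower(), library_type.lower())
    if key not in CONVENTIONAL_PROGRAMS:
        raise ValueError("expected 'protein' or 'dna', got %r" % (key,))
    return list(CONVENTIONAL_PROGRAMS[key])


def default_ktup(sequence_type):
    """Return the default ktup that Table I gives for fasta3."""
    return DEFAULT_KTUP[sequence_type.lower()]
```

For example, suggest_programs("dna", "protein") returns the two translated-query programs, fastx3 and fasty3.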

2. Installing FASTA and the sequence databases

2.1. Obtaining the libraries

The FASTA program package does not include any protein or DNA sequence libraries. Protein databases are available on CD-ROM from the PIR and EMBL (see below), or via anonymous FTP from many different sources. As this document is updated in the fall of 1999, no DNA databases are available on CD-ROM from the major sequence databases: Genbank at the National Center for Biotechnology Information (www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and EMBL at the European Bioinformatics Institute (www.ebi.ac.uk). However, the databases are available via anonymous FTP from both sites.

2.1.1. The GENBANK DNA sequence library

Because of the large size of DNA databases, you will probably want to keep DNA databases in only one, or possibly two, formats. The FASTA3 programs that search DNA databases - fasta3, tfastx/y3, and tfasta3 - can read DNA databases in Genbank flatfile (not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (pressdb), and BLAST2.0 (formatdb) formats, as well as EMBL format. If you are also running the GCG suite of sequence analysis programs, you should use GCG/compressed-binary format or BLAST2.0 format for your fasta3 searches. If not, BLAST2.0 is a good choice. These files are considerably more compact than Genbank flat files, and are preferred. The NCBI does not provide software for converting from Genbank flat files to Blast2.0 DNA databases, but you can use the Blast formatdb program to convert ASN.1 formatted Genbank files, which are available from the NCBI ftp site.

The NCBI also provides the nr, swissprot, and several EST databases that are used by BLAST in FASTA format from: ftp://ncbi.nlm.nih.gov/blast/db. These databases are updated nightly.

2.1.2. The NBRF protein sequence library

You can obtain the PIR protein sequence database (Barker et al., 1998) from:

    National  Biomedical Research Foundation
    Georgetown  University  Medical  Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

or via ftp from nbrf.georgetown.edu or from the NCBI (ncbi.nlm.nih.gov/repository/PIR). The data in the ascii directory is in PIR Codata format, which is not widely used. I recommend the PIR/VMS format data (libtype=5) in the vms directory.

2.1.3. The EBI/EMBL CD-ROM libraries

The European Bioinformatics Institute (EBI) distributes both the EMBL DNA database and the SwissProt database on CD-ROM (Bairoch and Apweiler, 1996), and they are available from:

    EMBL-Outstation  European Bioinformatics Institute
    Wellcome Trust Genome Campus,
    Hinxton Hall
    Hinxton,
    Cambridge CB10 1SD
    United Kingdom
    Tel: +44 (0)1223 494444
    Fax: +44 (0)1223 494468
    Email: DATALIB@ebi.ac.uk

In addition, the SWISS-PROT protein sequence database is available via anonymous FTP from ftp://ftp.expasy.ch/databases/swiss-prot/ (also see www.expasy.ch).

2.2. Finding the libraries: FASTLIBS

The major problem that most new users of the FASTA package have is in setting up the program to find the databases and their library type. In general, if you cannot get fasta3 to read a sequence database, it is likely that something is wrong with the FASTLIBS file. A common problem is that the database file is found, but either no sequences are read, or an incorrect number of entries is read. This is almost always because the library format (libtype) is incorrect. Note that a type 5 file (PIR/VMS format) can be read as a type 0 (default FASTA) format file, and the number of entries will be correct, but the sequence lengths will not.

All the search programs in the FASTA3 package use the environment variable FASTLIBS to find the protein and DNA sequence libraries. The FASTLIBS variable contains the name of a file that has the actual filenames of the libraries. The fastgbs file included with the distribution is an example of a file that can be referred to by FASTLIBS. To use the fastgbs file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the fastgbs file to indicate where the protein and DNA sequence libraries can be found. If you have a hard disk and your protein sequence library is kept in the file /usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept in the directory /usr/lib/genbank, then fastgbs might contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    <---- 1 ---->23<--------- 4 ---------> 5
    (field markers, aligned to the first line; field 5 is optional)

The first line of this file says that a copy of the NBRF protein sequence database (which is a protein database) is kept in the file /usr/lib/seq/aabank.lib and can be selected by typing "P" on the command line or when the database menu is presented.

Note that there are 4 or 5 fields in the lines in fastgbs. The first field is the description of the library which will be displayed by FASTA; it ends with a '$'. The second field (1 character), is a 0 if the library is a protein library and 1 if it is a DNA library. The third field (1 character) is the character to be typed to select the library.

The fourth field is the name of the library file. In the example above, the /usr/lib/seq/aabank.lib file contains the entire protein sequence library. However, the DNA library file names are preceded by a '@', because these files (gpri.nam, grod.nam, gmammal.nam) do not contain the sequences; instead they contain the names of the files which contain the sequences. This is done because the GENBANK DNA database is broken down into a large number of smaller files. In order to search the entire primate database, you must search more than a dozen files.

In addition, an optional fifth field can be used to specify the format of the library file. This field must be separated from the file name by a space character (' '). Alternatively, you can specify the library format in a file of file names (a file preceded by an '@'). In the example above, the aabank.lib file is in Pearson/FASTA format, while the swiss.seq file is in PIR/VMS format (from the EMBL CD-ROM). Currently, FASTA can read the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6 GCG (version 8.0) Unix Protein and DNA (compressed)
    11 NCBI Blast1.3.2 format  (unix only)
    12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)

In particular, this version will work with the EMBL and PIR VMS formats that are distributed on the EMBL CD-ROM. The latter format (PIR VMS) is much faster to search than EMBL format. This release also works with the protein and DNA database formats created for the BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI search format. If a library format is not specified, for example, because you are just comparing two sequences, Pearson/FASTA (format 0) is used by default. To specify a library type on the command line, add it to the library filename and surround the filename and library type in quotes:

    fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
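
The field layout of a fastgbs entry described earlier in this section can be pulled apart mechanically. The following Python sketch is not taken from the FASTA sources; the LibraryEntry type and the parse_fastlibs_line function are hypothetical names, and the parser assumes exactly the layout described above (description, '$', one type character, one selector character, file name, optional format number after a space).

```python
from typing import NamedTuple, Optional


class LibraryEntry(NamedTuple):
    """One line of a FASTLIBS file (hypothetical container type)."""
    description: str          # field 1, shown in the library menu
    is_dna: bool              # field 2: '0' = protein, '1' = DNA
    selector: str             # field 3: character typed to pick the library
    path: str                 # field 4: library file, without a leading '@'
    is_file_of_names: bool    # True when field 4 started with '@'
    libtype: Optional[int]    # field 5, or None when it is omitted


def parse_fastlibs_line(line):
    """Split one FASTLIBS entry into its four or five fields."""
    description, sep, rest = line.rstrip("\n").partition("$")
    if not sep or len(rest) < 3:
        raise ValueError("not a FASTLIBS entry: %r" % line)
    kind, selector, location = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError("library type must be 0 or 1: %r" % line)
    is_list = location.startswith("@")
    if is_list:
        location = location[1:]
    # The optional format number is separated from the name by a space.
    path, _, fmt = location.strip().partition(" ")
    libtype = int(fmt) if fmt.strip() else None
    return LibraryEntry(description, kind == "1", selector, path,
                        is_list, libtype)
```

Applied to the first example line, this yields the description "NBRF Protein", a protein library selected with "P", the path /usr/lib/seq/aabank.lib, and library format 0.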

You can specify a group of library files by putting a '@' symbol before a file that contains a list of file names to be searched. For example, if @gmam.nam is in the fastgbs file, the file "gmam.nam" might contain the lines:

    </seqdb/genbank
    gbpri1.seq 1
    gbpri2.seq 1
    gbpri3.seq 1
    gbpri4.seq 1
    gbrod.seq 1
    gbmam.seq 1

In this case, the line beginning with a '<' indicates the directory the files will be found in. The remaining lines name the actual sequence files. So the first sequence file to be searched would be:

    /seqdb/genbank/gbpri1.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating system. Under UNIX, the trailing '/' is left off, so the library directory might be written as "</usr/seqlib".

The FASTA programs can search a database composed of different files in different sequence formats. For example, you may wish to search the Genbank files (in GenBank flat file format) and the EMBL DNA sequence database on CD-ROM. To do this, you simply list the names and filetypes of the files to be searched in a file of filenames. For example, to search the mammalian portion of Genbank, the unannotated portion of Genbank, and the unannotated portion of the EMBL library, you could use the file:

    </usr/lib/DNA
    gbpri.seq 1
    #  (this '#' causes the program to display the size of the library)
    gbrod.seq 1
    ...
    gbmam.seq 1
    ...
    gbuna.seq 1
    ...
    unanno.seq 5
    #

You do not need to include library format numbers if you only use the Pearson/FASTA version of the PIR protein sequence library. If no library type is specified, the program assumes that type 0 is being used.
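
The file-of-file-names convention (a '<' line naming the directory, then one file name and optional format number per line) can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the FASTA implementation: expand_name_file is a hypothetical name, '#' lines are simply skipped here rather than triggering a size display, and only UNIX-style directories are handled (the VMS "<PIRNAQ:" notation is not).

```python
import posixpath


def expand_name_file(lines, default_libtype=0):
    """Return (path, libtype) pairs for a file of file names.

    A line starting with '<' sets the directory for the names that follow.
    Lines starting with '#' are skipped in this sketch.  A name without a
    format number gets default_libtype (0, Pearson/FASTA).
    """
    directory = ""
    files = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("<"):
            directory = line[1:]
            continue
        name, _, fmt = line.partition(" ")
        libtype = int(fmt) if fmt.strip() else default_libtype
        files.append((posixpath.join(directory, name), libtype))
    return files
```

With the gmam.nam example above, the first pair returned is ("/seqdb/genbank/gbpri1.seq", 1).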

Test the setup by running FASTA. Enter the sequence file 'mgstm1.aa' when the program requests it (this file is included with the programs). The program should then ask you to select a protein sequence library. Alternatively, if you run the TFASTA program and use the mgstm1.aa query sequence, the program should show you a selection of DNA sequence libraries. Once the fastgbs file has been set up correctly, you can set FASTLIBS in your shell start-up file (or set FASTLIBS=fastgbs in your AUTOEXEC.BAT file under DOS), and you will not need to remember where the libraries are kept or how they are named.

3. Using the FASTA Package

3.1. Overview

The FASTA sequence comparison programs all require similar information: the name of a query sequence file, a library file, and the ktup parameter. All of the programs can accept arguments on the command line, or they will prompt for the file names and ktup value.

To use FASTA, simply type:

    FASTA

and you will be prompted for:

  • the name of the test sequence file
  • the name of the library file
  • the value of ktup: 1 or 2 for protein sequences, or 1 to 6 for DNA sequences (ktup = 2 is about 5 times faster than ktup = 1)

The program can also be run by typing

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

Included with the package are several test files. To make certain that everything is working, you can try:

    fasta musplfm.aa prot_test.lib
and
    tfastx mgstm1.aa gst.nlib

3.2. Sequence files

The fasta3 programs know about three kinds of sequence files (four under VMS): (1) Plain sequence files, which contain nothing but sequence residues and can only be used as query sequences. (2) FASTA format files. These are the same as plain sequence files, except that each sequence is preceded by a comment line with a '>' in the first column. (3) Distributed sequence libraries. This is a broad class that includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file format, EMBL flat-file format, and Intelligenetics format. All of the files that you create should be of type (1) or (2). FASTA format files (ones with a '>' and comment before the sequence) are preferred, because they can be used as query or library sequence files by all of the programs.

I have included several sample test files, *.aa and *.seq, as well as two small sequence libraries, prot_test.lib and gst.nlib. The first line may begin with a '>' followed by a comment. Spaces and tabs (and anything else that is not an amino-acid code) are ignored.

Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format. You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a '>' with a sequence name.

The test file should not have lines longer than 120 characters, and sequences entered with word processors should use a document mode, with normal carriage returns at the end of lines.
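
To make the library layout concrete, here is a minimal Python sketch of a reader for this format, together with a check for the 120-character line limit. Both function names are hypothetical. As a simplifying assumption, the reader keeps only alphabetic characters from sequence lines, which approximates "anything that is not an amino-acid code is ignored" but is not the exact rule the FASTA programs apply.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from Pearson/FASTA formatted lines."""
    header, residues = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new '>' line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(residues)
            header, residues = line[1:].strip(), []
        elif header is not None:
            # Assumption: drop spaces, tabs, digits and punctuation.
            residues.append("".join(ch for ch in line if ch.isalpha()))
    if header is not None:
        yield header, "".join(residues)


def overlong_lines(lines, limit=120):
    """Return 1-based numbers of lines longer than the limit."""
    return [number for number, raw in enumerate(lines, 1)
            if len(raw.rstrip("\r\n")) > limit]
```

Concatenating several sequence files and passing the result through read_fasta is a quick way to confirm that every sequence is preceded by its own '>' line.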

A different format is required to specify the ordered peptide mixture for fastf3/tfastf3. For example:

    >mgstm1
    MGCEN,
    MIDYP,
    MLLAY,
    MLLGY

indicates M in the first position of all four peptides (as from CNBr), G, I, L (twice) in the second position (first cycle), C, D, L (twice) in the third position, etc. The commas (,) are required to indicate the number of fragments in the mixture, but there should be no comma after the last residue.

For the fasts3/tfasts3 program, the format is the same, except that there is no requirement for the peptides to be the same length.
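
The relationship between the comma-separated peptides and the residues seen at each sequencing cycle can be shown with a short Python sketch. The two function names are hypothetical, and the code only illustrates the file layout described above; it is not how fastf3 reads its input.

```python
def parse_peptide_mixture(text):
    """Return (name, peptides) from a fastf3/fasts3 style query."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a '>' name line first")
    name = lines[0][1:].strip()
    # Commas separate the fragments; there is none after the last one.
    peptides = [p.strip() for p in "".join(lines[1:]).split(",")]
    if any(not p for p in peptides):
        raise ValueError("empty fragment (trailing comma?)")
    return name, peptides


def residues_by_position(peptides):
    """For equal-length peptides (fastf3), list the residues per position."""
    if len({len(p) for p in peptides}) != 1:
        raise ValueError("fastf3 mixtures need peptides of equal length")
    return ["".join(column) for column in zip(*peptides)]
```

For the mgstm1 mixture above, residues_by_position gives "MMMM" for the first position, "GILL" for the second and "CDLL" for the third, matching the description in the text.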

4. Statistical Significance

All the programs in the FASTA3 package attempt to calculate accurate estimates of the statistical significance of a match. For fasta3, ssearch3, and fastx3/y3, these estimates are very accurate (Pearson, 1998, Zhang et al., 1997). Altschul et al. (Altschul et al., 1994) provide an excellent review of the statistics of local similarity scores. Local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are the lengths of the query and library sequence. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence. The fasta3 programs use simple linear regression against the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. (Several other estimation methods are available with the -z option.) These z-scores can then be used with the extreme value distribution and the poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.
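
The two forms of the extreme value formula can be checked numerically. The Python sketch below is illustrative only: the function names are hypothetical, the parameter values in the usage note are made up, and expected_hits assumes the simple relationship E = P x (number of library sequences), which is a common reading of the paragraph above rather than a transcription of the FASTA code.

```python
import math


def evd_pvalue(score, lam, K, m, n):
    """P(s > x) = 1 - exp(-K*m*n*exp(-lambda*x))."""
    # -expm1(-t) equals 1 - exp(-t) but stays accurate for tiny t.
    return -math.expm1(-K * m * n * math.exp(-lam * score))


def evd_pvalue_location_form(score, lam, K, m, n):
    """Same probability written with u = ln(K*m*n)/lambda."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (score - u)))


def expected_hits(pvalue, library_size):
    """Assumed E() value: probability times number of library sequences."""
    return pvalue * library_size
```

With hypothetical values such as lambda = 0.27, K = 0.04, m = 200 and n = 300, both functions return the same probability for any score, which is the rewriting step stated in the text.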

The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*'s" in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by the programs. For fasta3, if optimized scores are calculated for each sequence in the database (the default), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of ssearch3 Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.

With version 33t01, all the FASTA programs also report a "bit" score, which is equivalent to the bit score reported by BLAST2. The FASTA33/BLAST2 bit score is calculated as: (lambda*S - ln K)/ln 2, where S is the raw similarity score, and lambda and K are statistical parameters estimated from the distribution of unrelated sequence similarity scores. The statistical significance of a given bit score depends on the lengths of the query and library sequences and the size of the library, but a 1 bit increase in score corresponds to a 2-fold reduction in expectation; a 10-bit increase implies 1000-fold lower expectation, etc.
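
The bit score formula and the "2-fold per bit" rule are simple enough to check directly. In this Python sketch the function names are hypothetical and any lambda and K values passed in are illustrative; the programs estimate them from each search.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score as given in the text: (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expectation_fold_change(bit_increase):
    """Factor by which the expectation drops for a given gain in bits."""
    return 2.0 ** bit_increase
```

A 10-bit increase gives expectation_fold_change(10) = 1024, which is the "1000-fold" figure quoted above.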

The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not true, then statistical parameters can be estimated by using the -z 11-15 options. -z options greater than 10 calculate a shuffled similarity score for each library sequence, in addition to the unshuffled score, and estimate the statistical parameters from the scores of the shuffled sequences. If there are fewer than 20 sequences in the library, the statistical calculations are not done.

For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E() values from 1 - 10 are related as well, but unrelated sequences (1 - 10 per search) will have scores in this range as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias. PRSS3 can address this problem by calculating similarity scores for random sequences with the same length and amino acid composition.
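
The idea of a random sequence "with the same length and amino acid composition" can be illustrated with a uniform shuffle. This Python sketch is not the prss3 implementation; shuffled_copies is a hypothetical helper that only shows why shuffling preserves both length and composition.

```python
import random
from collections import Counter


def shuffled_copies(sequence, count, seed=None):
    """Return `count` uniformly shuffled copies of a sequence.

    Every copy has the same length and residue composition as the input;
    only the order of the residues changes.
    """
    rng = random.Random(seed)
    copies = []
    for _ in range(count):
        residues = list(sequence)
        rng.shuffle(residues)
        copies.append("".join(residues))
    return copies


def same_composition(a, b):
    """True when two sequences contain the same residues in equal numbers."""
    return Counter(a) == Counter(b)
```

Scores of a query against such shuffled copies reflect composition and length alone, so a real match that scores well above them is unlikely to be explained by sequence bias.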

5. Options

Command line options are available to change the scoring parameters and output display. Command line options must precede other program arguments, such as the query and library file names.

5.1. Command line options

-a   (fasta3, ssearch3 only) show both sequences in their
     entirety.

-A   force Smith-Waterman alignments for fasta3 DNA sequences.
     By default, only fasta3 protein sequence comparisons use
     Smith-Waterman alignments.

-B   Show normalized score as a z-score, rather than a bit-score
     in the list of best scores.

-b # Number of sequence scores to be shown on output.  In the
     absence of this option, fasta (and tfasta and ssearch)
     display all library sequences obtaining similarity scores
     with expectations less than 10.0 if optimized scores are
     used, or 2.0 if they are not. The -b option can limit the
     display further, but it will not cause additional sequences
     to be displayed.

-c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
     optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the
     expected number of scores.  Used to override the expectation
     value of 10.0 used by default.  When used with -Q, -E 2.0
     will show all library sequences with scores with an
     expectation value <= 2.0.

-d # Maximum number of alignments to be displayed.  Ignored if
     "-Q" is not used.

-F # Limit the number of scores and alignments shown based on the
     expected number of scores. "-E #" sets the highest E()-value
     shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
     will not show any matches or alignments with E() < 0.0001.
     This allows one to skip over close relationships in searches
     for more distant relationships.

-f   Penalty for the first residue in a gap (-12 by default for
     proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-g   Penalty for additional residues in a gap (-2 by default for
     proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).

-h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H   Omit histogram.

-i   Invert (reverse complement) the query sequence if it is DNA.
     For tfasta3/x3/y3, search the reverse complement of the
     library sequence only.

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file
     Location of library menu file (FASTLIBS).

-L   Display more information about the library sequence in the
     alignment.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

           -m 0          -m 1          -m 2          -m 3        -m 4

         MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
         ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
         MWKSCGYPYT   MWKSCGYPYT


In addition, -m 10 is a new, parseable format for use with other
programs.  See the file "readme.v20u4" for a more complete
description.

-m 5 provides a combination of -m 4 and -m 0. -m 6 provides -m 5
plus HTML formatting. -m 9 provides percent identity and coordinates
with the initial list of high scores, as well as conventional -m 0
alignments.

-M low-high
     Include library sequences (proteins only) with lengths
     between low and high.

-n   Force the query sequence to be treated as a DNA sequence.
     This is particularly useful for query sequences that contain
     a large number of ambiguous residues, e.g. transcription
     factor binding sites.

-O file
     Send copy of results to "file".  Helpful for
     environments without STDOUT (mostly for the Macintosh).

-o   Turn off default optimization of all scores greater than
     OPTCUT. Sort results by "initn" scores (reduces the accuracy
     of statistical estimates).

-p   Force query to be treated as protein sequence.

-Q,-q
     Quiet - does not prompt for any input.  Writes scores and
     alignments to the terminal or standard output file.

-r   Specify match/mismatch scores for DNA comparisons.  The
     default is "+5/-4". "+3/-2" can perform better in some
     cases.

-R file
     Save a results summary line for every sequence in the
     sequence library.  The summary line includes the sequence
     identifier, superfamily number (if available) position in
     the library, and the similarity scores calculated.  This
     option can be used to evaluate the sensitivity and
     selectivity of different search strategies (Pearson, 1995,
     Pearson, 1998).

-s file
     Specify the scoring matrix file.  fasta3 uses the same
     scoring matrices as Blast1.4/2.0.  Several scoring matrix
     files are included in the standard distribution.  For
     protein sequences: codaa.mat - based on minimum mutation
     matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
     matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
     pam120.mat - a PAM120 matrix.  The default scoring matrix is
     BLOSUM50 ("-s BL50"). Other matrices available from within
     the program are: PAM250/"-s P250", PAM120/"-s P120",
     PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
     (MDM are modern PAM matrices from Jones et al. (Jones et
     al., 1992),), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
     BL80".

-S   Treat lower-case characters in the query or library
     sequences as "low-complexity" ("seg"-ed) residues.
     Traditionally, the "seg" program (Wootton and
     Federhen, 1993) is used to remove low complexity regions in
     protein sequences by replacing the residues with an "X".  When
     the "-S" option is used, the FASTA33 programs provide a
     potentially more informative approach.  With "-S", lower
     case characters in the query or database sequences are
     treated as "X"'s during the initial scan, but are treated as
     normal residues during the final alignment display.  Since
     statistical significance is calculated from the similarity
     score calculated during the library search, when the lower
     case residues are "X"'s, low complexity regions will not
     produce statistically significant matches.  However, if a
     significant alignment contains low complexity regions, their
     alignment is shown.  With "-S", lower case characters may be
     included in the alignment to indicate low complexity
     regions, and the final alignment score may be higher than
     the score obtained during the search.

     The pseg program can be used to produce databases (or query
     sequences) with lower case residues indicating low
     complexity regions using the command:

         pseg database.fasta -z 1 -q  > database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)

-w # Line length (width) = number (<200)

-x # Specify the penalty for a match to an 'X', independently of the
     PAM matrix.  Particularly useful for fastx3/fasty3, where
     termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts with
     1.  (You should double check to be certain the negative numbering
     works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations. -z 0 estimates the
     significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 - 5
     uses two variations on the -z 1 strategy. -z 1 and -z 2 are
     the best methods, in general.

-z 11,12,14,15
     estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-1   sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   translate only three forward frames

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and display the results with 80 residues on an output line, showing all of the residues in both sequences. Be sure to enter the options before entering the file names, or just enter the options on the command line, and the program will prompt for the file names.
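
When the programs are driven from a script, the ordering rule (options first, then query file, library file, and ktup) is easy to get wrong. The Python sketch below assembles an argument list in that order. build_fasta_command is a hypothetical helper, and it only builds the list; it does not run anything or check that the options are valid for the chosen program.

```python
def build_fasta_command(program, query, library, ktup=None, options=()):
    """Return an argv list with options placed before the file names.

    The optional ktup value is appended after the library, because it is
    given as the third command line argument.
    """
    argv = [program]
    argv.extend(str(option) for option in options)
    argv.extend([query, library])
    if ktup is not None:
        argv.append(str(ktup))
    return argv
```

For the example above, build_fasta_command("fasta", "seq1.aa", "seq2.aa", options=["-w", 80, "-a"]) produces the same argument order as the command line shown. The resulting list could then be handed to a process launcher such as Python's subprocess.run.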

(November, 1997) In addition, it is now possible to provide the fasta programs with the query sequence (fasta, fasty, ssearch, tfastx), or two sequences (prss, lalign, plalign) from the unix "stdin" stream. This makes it much easier to set up FASTA or PRSS WWW pages. To specify that stdin be used, rather than a file, the file name should be specified as '-' or '@' (the latter file name makes it possible to specify a subset of the sequence). Thus:

    cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's' library (see the discussion of FASTLIBS). If DNA sequences are to be read from stdin, the '-n' option must be used, as fasta cannot check for DNA queries when stdin is used.
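
For a WWW page or script, the stdin convention above might be wrapped as in the following Python sketch. The helper names (build_stdin_command, run_search) are hypothetical, and the program name "fasta" and library "s" are taken from the example above; the sketch uses only the options described in this section (-q, -n, and the '@' file name).

```python
import subprocess


def build_stdin_command(library, start=None, end=None, dna=False,
                        program="fasta"):
    """Build the argument list for a search whose query arrives on stdin.

    Options are placed before the file names, as this document requires.
    """
    args = [program, "-q"]
    if dna:
        # DNA queries cannot be detected on stdin, so -n must be given.
        args.append("-n")
    if start is not None and end is not None:
        query = f"@:{start}-{end}"  # subset of the stdin sequence
    else:
        query = "@"  # whole sequence from stdin
    return args + [query, library]


def run_search(query_text, library, **options):
    """Run the search, feeding `query_text` to the program on stdin."""
    cmd = build_stdin_command(library, **options)
    result = subprocess.run(cmd, input=query_text, capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    print(build_stdin_command("s", start=25, end=75))
```

Keeping the command construction separate from the call to subprocess.run makes it possible to check the argument list without having the FASTA programs installed.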

5.2. Environment variables

Because the current version of the program allows the user to set virtually every option on the command line (except the ktup, which must be set as the third command line argument), only the FASTLIBS environment variable is routinely used.

FASTLIBS
     specifies the location of the file which contains the list
     of library descriptions, locations, and library types (see
     section on finding library files).
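
When the programs are launched from a script, FASTLIBS can be set for the child process only. The sketch below is one way to do that in Python; the helper name and the example path are hypothetical, and the contents of the FASTLIBS file itself are described in the section on finding library files.

```python
import os


def fasta_environment(fastlibs_path, base_env=None):
    """Return a copy of the environment with FASTLIBS pointing at the
    file that lists library descriptions, locations, and types."""
    env = dict(os.environ if base_env is None else base_env)
    env["FASTLIBS"] = fastlibs_path
    return env


if __name__ == "__main__":
    # Hypothetical location of the library list file.
    env = fasta_environment("/usr/local/lib/fasta/fastlibs")
    print(env["FASTLIBS"])
```

The returned dictionary can be passed as the env argument of subprocess.run so that the setting does not leak into the calling process.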

6. Frequently Asked Questions

  • (1) Which program should I use? See Table I.
  • (2) How do I search with both DNA strands with fasta3 and fastx3? With version 32 of the FASTA program package, all searches that use DNA queries (e.g. fasta3, fastx3/y3) examine both strands. To revert to earlier FASTA behavior - only looking at the forward or reverse strand - use -3 to search only the forward strand and -i -3 to search only the reverse strand.
  • (3) When I search Genbank - the program reports: 0 residues in 0 sequences. This typically happens because the program does not know that you are searching a Genbank flatfile database and is looking for a FASTA format database. Be certain to specify the library type ("1" for Genbank flatfile) with the database name.
  • (4) What is the difference between fastx3 and fasty3 (or tfastx3 and tfasty3)? [t]fastx3 uses a simpler codon-based model for alignments that does not allow frameshifts in some codon positions (see Zhang et al., 1997). tfastx3 is about 30% faster, but tfasty3 can produce higher quality alignments in some cases.
  • (5) When I run fasta3 -q, I don't see any (or very little) output, but I get lots of scores when I run interactively. With the -Q option, the number of high scores displayed is limited by the -E # cutoff, which is 10.0 for protein comparisons, 2.0 for DNA comparisons, and 5.0 for translated DNA:protein comparisons. In interactive mode (without -Q), by default you see 20 high scores, regardless of E() value.
  • (6) What is ktup? All of the programs with fast in their name use a computer science method called a lookup table to speed the search. For proteins with ktup=2, this means that the program does not look at any sequence alignment that does not involve matching two identical adjacent residues in both sequences. Likewise with DNA and ktup=6, the initial alignment of the sequences looks for 6 identical adjacent nucleotides in both sequences. Because it is less likely that two identical amino acids will line up by chance in two unrelated proteins, this speeds up the comparison. But very distantly related sequences may never have two identical residues in a row but will have single aligned identities. In this case, ktup=1 may find alignments that ktup=2 misses.
  • (7) Sometimes, in the list of best scores, the same sequence is shown twice with exactly the same score. Sometimes, the sequence is there twice, but the scores are slightly different. When any of the fasta3 programs searches a long sequence, it breaks the sequence up into overlapping pieces. The length of the piece depends on the length of the query and the particular program being used (it can also be controlled with the -N #### option). Since the pieces overlap by the length of the query sequence (or 3*query_length for fastx/y3 and tfasta/x/y3), if the highest scoring alignment is at the end of one piece, it will be scored again at the beginning of the next piece. If the alignment is not completely included in the overlap region, one of the pieces will give a higher score than the other. These duplications can be detected by looking at the coordinates of the alignment. If either the beginning or end coordinate is identical in two alignments, the alignments are at least partially duplicates.
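
The lookup-table idea from question (6) can be sketched in a few lines of Python. This is a simplified illustration, not the FASTA implementation: the function names are hypothetical, and the sketch only finds positions where identical ktup-length words occur in both sequences, which is the first step that later scoring would build on.

```python
from collections import defaultdict


def build_lookup(query, ktup):
    """Map every ktup-length word in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def word_hits(query, library_seq, ktup):
    """Return (query_pos, library_pos) pairs of identical ktup-words."""
    table = build_lookup(query, ktup)
    hits = []
    for j in range(len(library_seq) - ktup + 1):
        # Only words present in the query table are considered at all;
        # this is what makes the scan fast.
        for i in table.get(library_seq[j:j + ktup], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    query, library_seq = "ACDE", "AXDX"  # hypothetical toy sequences
    print("ktup=2:", word_hits(query, library_seq, 2))
    print("ktup=1:", word_hits(query, library_seq, 1))
```

In the toy example the two sequences share the residues A and D but never two identical residues in a row, so ktup=2 reports no hits while ktup=1 reports two.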

As always, please inform me of bugs as soon as possible.


William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU

7. References

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
(1994). Issues in searching molecular sequence databases. Nature
Genet. 6,119-129.

Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
Methods Enzymol. 266,460-480.

Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
sequence data bank and its new supplement TrEMBL. Nucleic Acids
Res. 24,21-25.

Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
(1998). The PIR-International Protein Sequence Database. Nucleic
Acids Res. 26,27-32.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
of evolutionary change in proteins. In Atlas of Protein Sequence
and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
Spring, MD: National Biomedical Research Foundation), pp.
345-352.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
rapid generation of mutation data matrices from protein
sequences. Comp. Appl. Biosci. 8,275-282.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
biological sequence comparison. Proc. Natl. Acad. Sci. USA
85,2444-2448.

Pearson, W. R. (1995). Comparison of methods for searching
protein sequence databases. Prot. Sci. 4,1145-1160.

Pearson, W. R. (1996). Effective protein sequence comparison.
Methods Enzymol. 266,227-258.

Pearson, W. R. (1998). Empirical statistical estimates for
sequence similarity searches. J. Mol. Biol. 276,71-84.

Pearson, W. R. (1999). Flexible similarity searching with the
FASTA3 program package. In Bioinformatics Methods and Protocols,
S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
185-219.

Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J. Mol. Biol. 147,195-197.

Wootton, J. C. and Federhen, S. (1993). Statistics of local
complexity in amino acid sequences and sequence databases.
Comput. Chem. 17,149-163.

Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
sequence with a protein sequence. J. Computational Biology
4,339-349.