
Using FGI to run BLAST jobs


You can use the Finnish Grid Initiative (FGI) distributed computing environment for running large BLAST sequence similarity search tasks. In FGI, a BLAST search consisting of a large number of query sequences is split into a number of sub-tasks that are submitted for execution to the FGI clusters around Finland. Splitting and submitting the tasks to remote servers takes some time, so for small queries (say, fewer than 1000 query sequences) running the jobs on the servers of CSC is usually faster. However, if the query set consists of tens of thousands of sequences or more, the capacity of FGI can be fully utilized and the throughput times are often shorter than what the clusters of CSC alone could provide.
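To decide where to run a search, it helps to know how many query sequences a file holds. The following is a minimal sketch, not part of the FGI tooling: the function names count_queries and suggest_target are hypothetical, and the 1000-sequence limit is only the rough figure mentioned above.

```shell
# Count FASTA records: every record starts with a '>' header line.
count_queries() {
    grep -c '^>' "$1"
}

# Suggest where to run the search, using the rough 1000-sequence
# rule of thumb from the text (an approximate guide, not a hard limit).
suggest_target() {
    n=$(count_queries "$1")
    if [ "$n" -lt 1000 ]; then
        echo "local"
    else
        echo "grid"
    fi
}
```

For example, `suggest_target queryseq.fasta` prints "grid" for a file with 26 000 sequences.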



Getting grid access


To be able to use grid resources for BLAST searches, you should have:

  1.  A valid grid certificate installed on the hippu.csc.fi server.
  2.  Membership of the fgi.csc.fi Virtual Organization.


For detailed instructions, see the document below:



    Grid-BLAST @ CSC


    BLAST jobs can be submitted to FGI using the gb (Grid BLAST) command on hippu.csc.fi. The gb command works in the same way as the pb command used to submit BLAST jobs to the local batch queues in Vuori or Murska. The gb command can be used with any BLAST+ command (older BLAST versions cannot be used with gb). The gb command automatically splits your query sequences into smaller subjobs and then submits these BLAST searches to be executed in the FGI grid. The script continues to follow the progress of the subjobs and finally retrieves the results from the remote servers to the given output file.

    The gb command must be kept running until all the subjobs have been processed. In the case of very large BLAST jobs, this may take several days. Keeping an interactive terminal connection open for tens of hours may be difficult. Because of that, it is recommended that long grid BLAST jobs are submitted from a virtual terminal session, started with the screen command. (More information about screen: http://www.rackaid.com/resources/linux-screen-tutorial-and-how-to/)


    Submitting a grid BLAST job


    Let's assume we have a set of 26 000 FASTA-formatted nucleotide sequences in a file called queryseq.fasta. The file is located in the $METAWRK directory. In this example we compare this sequence set against the nr database with the blastx command.

    First, log in to hippu.csc.fi and open a virtual terminal with the screen command:

    screen

    Then set up the ARC environment and update your grid proxy certificate:

    module load nordugrid-arc 
    grid-proxy-init -rfc -valid 72:00

    The command above keeps your proxy certificate valid for 72 hours (three days).

    Then set up the BLAST+ environment:

    module load blast+

    and move to your $METAWRK directory:

    cd $METAWRK

    Now you can launch your search with the command:

    gb blastx -query queryseq.fasta -db nr -evalue 0.0001 -out result.table -outfmt 6

    The command first copies the selected database to the grid. In the case of large databases like embl_others or nt, this copying process may take several minutes. Then it splits the query sequence set into several subjobs and submits them to the FGI grid environment. In this example the job is split into 6587 subjobs. The gb command checks the status of the subjobs once a minute. If the submission or execution of a subjob fails, the subjob is automatically resubmitted to another FGI cluster. The gb command does not submit all the jobs at once. Instead, it follows the load of the FGI clusters and tries to avoid overloading the batch queue systems of the clusters.
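The splitting step can be pictured with a small sketch. This is only an illustration of the idea of cutting a FASTA file into chunks with a fixed number of records; how gb actually sizes its subjobs is not described here, and the function name split_fasta and the chunk file naming are hypothetical.

```shell
# split_fasta FILE N PREFIX
# Write FILE into chunks PREFIX_0001.fasta, PREFIX_0002.fasta, ...
# with at most N FASTA records in each chunk.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # Start a new chunk file every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                chunk++
                out = sprintf("%s_%04d.fasta", prefix, chunk)
            }
            count++
        }
        # Copy header and sequence lines to the current chunk.
        out != "" { print > out }
    ' "$1"
}
```

For example, `split_fasta queryseq.fasta 4 chunk` would write the queries into files of four sequences each.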


    Below is a status report of the gb command:

    2010-09 11:46 INFO                      host  new submitted queuing running finished failed success failure
    2010-09 11:46 INFO        opaali.phys.jyu.fi    0         6      13       7        3      0       5       0
    2010-09 11:46 INFO            kiniini.csc.fi    0         4      18       3        3      0       9       0
    2010-09 11:46 INFO  korundi.grid.helsinki.fi    0        20       0       6       26      0      57       0
    2010-09 11:46 INFO                       N/A 6149         0       0       0        0      0       0       0
    2010-09 11:46 INFO ametisti.grid.helsinki.fi    0        21       0      33       35      0      55       0
    2010-09 11:46 INFO         pythia.tcs.hut.fi   16         0       2       0        0      0       0       0
    2010-09 11:46 INFO            kvartsi.hut.fi    2        16       0      26       13      2      37       0
    2010-09 11:46 INFO                     TOTAL 6167        67      33      75       80      2     163       0

    This report shows that six different clusters are used to process the BLAST jobs. Of the 6587 subjobs, 163 have been successfully completed and 2 have failed and are waiting to be automatically resubmitted. 80 subjobs are finished and waiting for post-processing, 75 subjobs are running, 33 subjobs are waiting in the batch queues, 67 subjobs have just been submitted to the batch queues and 6167 are still waiting for processing.
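If the status report has been saved to a file, the per-state totals can be recomputed from the host rows. The sketch below assumes the column layout shown in the report above (date, time, INFO, host, then eight counters); the function name sum_status is hypothetical and not part of gb.

```shell
# Read a gb status report on standard input and sum each counter
# column over the host rows, skipping the header and TOTAL lines.
sum_status() {
    awk '
        $3 == "INFO" && $4 != "host" && $4 != "TOTAL" {
            for (i = 5; i <= 12; i++) sum[i] += $i
        }
        END {
            printf "new=%d submitted=%d queuing=%d running=%d finished=%d failed=%d success=%d failure=%d\n",
                sum[5], sum[6], sum[7], sum[8], sum[9], sum[10], sum[11], sum[12]
        }
    '
}
```

Run on the report above, the sums agree with the TOTAL row, and the eight counters add up to the 6587 subjobs of the example.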


    The results are written to the output file only after all the subjobs have been processed. As mentioned above, the gb command should be kept running until all the subjobs have been processed. For large BLAST analysis jobs this may take several days. In this example we started the grid BLAST job in a virtual terminal, launched with the screen command. Because of that, we can leave the virtual terminal running in the background and come back later to see how the job has progressed. To leave the virtual terminal, press Ctrl-a and then d:

     Ctrl-a d

    Now you are back in your “real” terminal. To reconnect to your virtual terminal, give the command:

     screen -r

    Before you log out from Hippu, check which Hippu server (hippu1 or hippu2) you are using. The virtual terminal can be reconnected only from the server on which it was launched. Thus, if your virtual terminal and grid BLAST job are running on the hippu2 server, you must log in to hippu2.csc.fi in order to reconnect to the virtual terminal with the screen -r command.


    Refreshing your grid proxy

    In the example above, we set the grid proxy to be valid for three days (72 hours). If your grid BLAST job needs a longer execution time, you should update the grid proxy before it expires. You can do this while the grid BLAST job is running in the background. For example, let's assume that the grid BLAST job submitted above has already been running for two days and seems to need at least two more days to complete. Then you should update the grid proxy. To do that, first log in to the server where your grid BLAST job is running. For example:

    ssh hippu2.csc.fi

    Then update the grid proxy with the commands:

    module load nordugrid-arc     
    grid-proxy-init -rfc -valid 72:00
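The decision of when to refresh comes down to comparing the remaining proxy lifetime with the time the job still needs. A minimal sketch of that comparison follows; the function name proxy_needs_refresh is hypothetical. Obtaining the remaining lifetime with `grid-proxy-info -timeleft` is an assumption: it presumes that this Globus tool is available alongside grid-proxy-init on your server and prints the remaining lifetime in seconds.

```shell
# proxy_needs_refresh SECONDS_LEFT HOURS_NEEDED
# Succeeds (exit status 0) if the remaining proxy lifetime is
# shorter than the number of hours the job is still expected to run.
proxy_needs_refresh() {
    seconds_left=$1
    hours_needed=$2
    [ "$seconds_left" -lt $((hours_needed * 3600)) ]
}

# Possible use (assumes grid-proxy-info -timeleft prints seconds):
#   if proxy_needs_refresh "$(grid-proxy-info -timeleft)" 48; then
#       grid-proxy-init -rfc -valid 72:00
#   fi
```

In the situation described above, a 72-hour proxy that has been in use for two days has 24 hours (86400 seconds) left, so `proxy_needs_refresh 86400 48` succeeds and the proxy should be renewed.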



    Killing a grid BLAST job

    If you wish to stop a gb run you have submitted, you first need to stop the gb command by pressing:

    Ctrl-c

    You can check which grid jobs you have pending or running with the command:

    ngstat -a

    You can use the commands ngkill and ngclean to stop and remove the subjobs that have already been submitted to the grid. You can do this by giving the commands:

    ngkill -a 
    ngclean -a

    Note that the above commands remove all your grid jobs! To remove just one grid job, run the commands using the syntax:

    ngkill job_name
    ngclean job_name

    For example:

    ngclean gsiftp://opaali.phys.jyu.fi:2811/jobs/2948612858260241452598222