You can use the Finnish Grid Initiative (FGI) distributed computing environment to run large BLAST sequence similarity searches. In FGI, a BLAST search consisting of a large number of query sequences is split into sub-tasks that are submitted for execution to the FGI clusters around Finland. Splitting the search and submitting the tasks to remote servers takes some time, so for small queries (say, fewer than 1000 query sequences) running the jobs on the servers of CSC is usually faster. However, if the query set consists of tens of thousands of sequences or more, the capacity of FGI can be fully utilized and the throughput times are often shorter than what the clusters of CSC alone could provide.
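As a rough aid for the choice described above, the following sketch counts the sequences in a FASTA file and suggests where to run the search. The function names are hypothetical helpers, not part of the CSC tools, and the 1000-sequence limit is only the rule of thumb mentioned above.

```shell
# Count the sequences in a FASTA file: every record starts with a '>' header line.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print "local" for small query sets and "grid" for larger ones.
# The limit of 1000 sequences is a guideline, not a hard rule.
suggest_blast_target() {
    n=$(count_fasta_records "$1")
    if [ "$n" -lt 1000 ]; then
        echo "local"
    else
        echo "grid"
    fi
}

# Example use (assumes queryseq.fasta exists in the current directory):
#   suggest_blast_target queryseq.fasta
```

For a query file with 26 000 sequences this would suggest the grid.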
Getting grid access
To be able to use grid resources for BLAST searches, you should have:
- A valid grid certificate installed on the hippu.csc.fi server.
- Membership of the fgi.csc.fi Virtual Organization.
For detailed instructions, see the document below:
Grid-BLAST @ CSC
BLAST jobs can be submitted to FGI using the gb (Grid BLAST) command on hippu.csc.fi. The gb command works in the same way as the pb command, which is used to submit BLAST jobs to the local batch queues on Vuori or Murska. The gb command can be used with any BLAST+ command; the older BLAST versions are not supported. The gb command automatically splits your query sequences into smaller subjobs and submits these BLAST searches for execution in the FGI grid. The script then follows the progress of the subjobs and finally retrieves the results from the remote servers into the given output file.
The gb command must be kept running until all the subjobs have been processed. For very large BLAST jobs this may take several days, and keeping an interactive terminal connection open for tens of hours may be difficult. It is therefore recommended that long grid BLAST jobs are submitted from a virtual terminal session started with the screen command. (More information about screen: http://www.rackaid.com/resources/linux-screen-tutorial-and-how-to/)
Submitting a grid BLAST job
Let's assume we have a set of 26 000 FASTA-formatted nucleotide sequences in a file called queryseq.fasta. The file is located in the $METAWRK directory. In this example we compare this sequence set against the nr database with the blastx command.
First, log in to hippu.csc.fi and open a virtual terminal with the screen command:
screen
Then set up the ARC environment and update your grid proxy certificate:
module load nordugrid-arc
grid-proxy-init -rfc -valid 72:00
The command above keeps your proxy certificate valid for 72 hours (three days).
Then set up the BLAST+ environment:
module load blast+
and move to your $METAWRK directory:
cd $METAWRK
Now you can launch your search with the command:
gb blastx -query queryseq.fasta -db nr -evalue 0.0001 -out result.table -outfmt 6
The command first copies the selected database to the grid. For large databases such as embl_others or nt, this copying may take several minutes. It then splits the query sequence set into subjobs and submits them to the FGI grid environment. In this example the job is split into 6587 subjobs. The gb command checks the status of the subjobs once a minute. If the submission or execution of a subjob fails, the subjob is automatically resubmitted to another FGI cluster. The gb command does not submit all the jobs at once. Instead, it follows the load of the FGI clusters and tries to avoid overloading their batch queue systems.
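The gb command does the splitting for you, so you never need to do it by hand. Purely to illustrate the idea, the sketch below cuts a FASTA file into pieces with a fixed number of sequences each. It is not the implementation used by gb, and split_fasta and its arguments are hypothetical names. In the example above, 26 000 sequences in 6587 subjobs means about four query sequences per subjob on average.

```shell
# split_fasta FILE N PREFIX
# Write PREFIX_1.fasta, PREFIX_2.fasta, ... with at most N sequences in each.
split_fasta() {
    awk -v n="$2" -v prefix="$3" '
        /^>/ {
            # A header line starts a new record; open a new output file
            # every n records.
            if (count % n == 0) {
                if (out != "") close(out)
                out = prefix "_" (int(count / n) + 1) ".fasta"
            }
            count++
        }
        # Copy every line of the current record to the current output file.
        out != "" { print > out }
    ' "$1"
}

# Example use (assumes queryseq.fasta exists in the current directory):
#   split_fasta queryseq.fasta 4 subjob
```

Each piece could then be searched independently and the tabular results concatenated, which is in outline what a split-and-merge BLAST run does.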
Below is a status report of the gb command:
2010-09 11:46 INFO host                       new submitted queuing running finished failed success failure
2010-09 11:46 INFO opaali.phys.jyu.fi           0         6      13       7        3      0       5       0
2010-09 11:46 INFO kiniini.csc.fi               0         4      18       3        3      0       9       0
2010-09 11:46 INFO korundi.grid.helsinki.fi     0        20       0       6       26      0      57       0
2010-09 11:46 INFO N/A                       6149         0       0       0        0      0       0       0
2010-09 11:46 INFO ametisti.grid.helsinki.fi    0        21       0      33       35      0      55       0
2010-09 11:46 INFO pythia.tcs.hut.fi           16         0       2       0        0      0       0       0
2010-09 11:46 INFO kvartsi.hut.fi               2        16       0      26       13      2      37       0
2010-09 11:46 INFO TOTAL                     6167        67      33      75       80      2     163       0
This report shows that six different clusters are being used to process the BLAST jobs. Of the 6587 subjobs, 163 have completed successfully, 2 have failed and are waiting to be resubmitted automatically, 80 have finished and are waiting for post-processing, 75 are running, 33 are waiting in the batch queues, 67 have just been submitted to the batch queues, and 6167 are still waiting to be submitted.
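If you save the status lines to a file, the overall progress can be summarized from the TOTAL line. The sketch below assumes the column order shown in the report above (new, submitted, queuing, running, finished, failed, success, failure) and counts only the success column as done. gb_progress is a hypothetical helper, and gb.log is an assumed file name.

```shell
# gb_progress: read gb status lines from standard input and print the
# progress of the last TOTAL line as "done/total percent".
gb_progress() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i == "TOTAL") {
                # The eight state columns follow the TOTAL label.
                total = 0
                for (j = i + 1; j <= i + 8; j++) total += $j
                # "success" is the seventh state column.
                done_jobs = $(i + 7)
            }
        }
    }
    END {
        if (total > 0)
            printf "%d/%d %.1f%%\n", done_jobs, total, 100 * done_jobs / total
    }'
}

# Example use (gb.log is an assumed file holding the saved status lines):
#   gb_progress < gb.log
```

For the TOTAL line of the report above, this prints 163/6587 2.5%.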
The results are written to the output file only after all the subjobs have been processed. As mentioned above, the gb command should be kept running until then, and for large BLAST analyses this may take several days. In this example we started the grid BLAST job in a virtual terminal launched with the screen command. We can therefore leave the virtual terminal running in the background and come back later to see how the job has progressed. To leave the virtual terminal, press Ctrl-a and then d:
Ctrl-a d
Now you are back in your “real” terminal. To reconnect to your virtual terminal, give the command:
screen -r
Before you log out from Hippu, check which Hippu server (hippu1 or hippu2) you are using. The virtual terminal can be reconnected only from the server on which it was launched. Thus, if your virtual terminal and grid BLAST job are running on the hippu2 server, you must log in to hippu2.csc.fi to reconnect to the virtual terminal with the screen -r command.
Refreshing your grid proxy
In the example above, we set the grid proxy to be valid for three days (72 hours). If your grid BLAST job needs a longer execution time, you should update the grid proxy before it expires. You can do this while the grid BLAST job is running in the background. For example, let's assume that the grid BLAST job submitted above has already been running for two days and seems to need at least two more days to complete. In that case you should update the grid proxy. To do that, first log in to the server where your grid BLAST job is running. For example:
ssh hippu2.csc.fi
And update the grid proxy with the commands:
module load nordugrid-arc
grid-proxy-init -rfc -valid 72:00
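To decide whether a refresh is needed, you can compare the remaining proxy lifetime with the time the job is still expected to run. The sketch below is a hypothetical helper. It assumes that the grid-proxy-info command is available alongside grid-proxy-init on the server and that its -timeleft option prints the remaining lifetime in seconds.

```shell
# proxy_covers SECONDS_LEFT HOURS_NEEDED
# Succeed (exit status 0) if the remaining proxy lifetime covers the
# expected remaining run time of the job.
proxy_covers() {
    [ "$1" -ge $(( $2 * 3600 )) ]
}

# Example use on the server (commented out because it needs the grid tools;
# the availability of grid-proxy-info is an assumption):
#   left=$(grid-proxy-info -timeleft)
#   proxy_covers "$left" 48 || grid-proxy-init -rfc -valid 72:00
```

In the situation described above, a 72-hour proxy has about 24 hours left after two days, which does not cover the 48 hours still needed, so the proxy would be renewed.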
Killing a grid BLAST job
If you wish to stop a gb run that you have submitted, you first need to stop the gb command by pressing:
Ctrl-c
You can check which grid jobs you have pending or running with the command:
ngstat -a
You can use the commands ngkill and ngclean to stop and remove the subjobs that have already been submitted to the grid. To do this, give the commands:
ngkill -a
ngclean -a
Note that the above commands remove all your grid jobs! To remove just one grid job, run the commands using the syntax:
ngkill job_name
ngclean job_name
For example:
ngclean gsiftp://opaali.phys.jyu.fi:2811/jobs/2948612858260241452598222