 

Serial Batch Jobs

Any work that demands a larger amount of CPU time (more than a few minutes) or of memory has to be run as a batch job through a resource management system.

Serial Runs

Sun's Grid Engine (SGE) is installed on corona.csc.fi. LSF-HPC (Load Sharing Facility for High Performance Computing) runs on murska.csc.fi.

These systems manage and assign resources. Because the architectures differ, the batch job files for Fluent runs differ slightly between the two platforms.

Batch Job File for Serial Run on SGE

The batch job file includes the necessary information for the SGE to schedule the request and assign sufficient resources to the task(s). For a single processor (i.e., serial) run with Fluent 6 on corona.csc.fi, the batch job script could have the following contents:

#!/bin/csh
#$ -S /bin/csh
#$ -M myname@mydomain.fi
#$ -m b,e
#$ -l h_rt=00:30:00,s_rt=00:20:00,h_vmem=500M
#$ -o /fs/metawrk/myid/fluent.$JOB_ID.out -j y
source /p/appl/fluid/Fluent.Inc/use/fluent.csh
cd /fs/metawrk/myid/
fluent -g -i journalfile.jou 2d

Explanation of the individual lines:

#!/bin/csh sets the shell in which the batch job is executed to csh
#$ -S /bin/csh forces SGE to use csh
#$ -M myname@mydomain.fi sets the e-mail address for notifications
#$ -m b,e sends an e-mail notification to the user at the beginning (option b) and at the end (option e) of the batch job
#$ -l h_rt=00:30:00,s_rt=00:20:00,h_vmem=500M sets a hard limit of 30 minutes (option h_rt) and a soft limit of 20 minutes (option s_rt) on the real time consumption (i.e., wall clock time), as well as a hard limit of 500 megabytes 1) on the total memory consumption (option h_vmem). The resource list must not contain spaces after the commas
#$ -o /fs/metawrk/myid/fluent.$JOB_ID.out -j y defines the output file. To identify the job, the SGE variable $JOB_ID, which contains the actual job ID (also shown by the qstat command), is included in the filename. The option -j y merges the standard error into the standard output, so that both end up in this single file
source /p/appl/fluid/Fluent.Inc/use/fluent.csh sets the environment variables needed by Fluent (same as the use fluent command in an interactive shell)
cd /fs/metawrk/myid/ changes to the directory where the input files reside. In this case it is the path to the meta-work directory $METAWRK 2)
fluent -g -i journalfile.jou 2d launches a two-dimensional Fluent 6 run. journalfile.jou is a journal file that contains Fluent text menu commands (see below). The option -g suppresses any graphical output of Fluent

1) k multiplies the value by 1000; K multiplies the value by 1024; m multiplies the value by 1000 times 1000; M multiplies the value by 1024 times 1024
2) Mind that only the directories associated with the environment variables $HOME and $METAWRK are common to both nodes, corona1.csc.fi and corona2.csc.fi
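
The suffix rules in footnote 1 can be checked with a small helper. The following is only a sketch in POSIX shell (not csh, which the job script itself uses); the function name sge_mem_to_bytes is hypothetical, and it handles only the four suffixes listed in the footnote:

```shell
#!/bin/sh
# Hypothetical helper: convert an SGE-style memory value such as 500M
# into bytes, using only the four suffixes described in footnote 1.
sge_mem_to_bytes() {
    value=$1
    number=${value%[kKmM]}      # strip one trailing suffix character, if any
    suffix=${value#"$number"}   # whatever was stripped (may be empty)
    case $suffix in
        k) factor=1000 ;;
        K) factor=1024 ;;
        m) factor=1000000 ;;
        M) factor=1048576 ;;
        *) factor=1 ;;          # no suffix: value is taken as bytes
    esac
    echo $((number * factor))
}

sge_mem_to_bytes 500M   # prints 524288000
```

This shows that h_vmem=500M in the script above requests 500 times 1024 times 1024 bytes, whereas h_vmem=500m would request only 500000000 bytes.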


A typical journal file for serial runs could look like the following example:

file/read-case casefile
solve/initialize/initialize-flow
solve/iterate 3000
file/confirm-overwrite no
file/write-data casefile.dat
exit

Explanation of the individual lines:

file/read-case casefile
reads the previously prepared case file (casefile.cas)

solve/initialize/initialize-flow
initializes the field variables (needed to launch the simulation)

solve/iterate 3000
performs up to 3000 iterations (fewer, if the solution converges earlier)

file/confirm-overwrite no
switches off the overwrite confirmation, so that existing files are overwritten without asking (a prompt would hang the batch job)

file/write-data casefile.dat
writes the result data into casefile.dat

exit
properly exits Fluent
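
When several cases are run with the same sequence of commands, the journal file can be generated instead of edited by hand. A minimal sketch in POSIX shell, assuming the six commands explained above are all that is needed; the function name make_journal and its two parameters are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print a Fluent journal for a serial run.
# $1 = case name without extension, $2 = maximum number of iterations.
make_journal() {
    case_name=$1
    iterations=$2
    cat <<EOF
file/read-case ${case_name}
solve/initialize/initialize-flow
solve/iterate ${iterations}
file/confirm-overwrite no
file/write-data ${case_name}.dat
exit
EOF
}

# Prints the same journal as the example above.
make_journal casefile 3000
```

Redirecting the output, for instance with make_journal casefile 3000 > journalfile.jou, gives the file that the batch job script passes to Fluent with the option -i.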


Batch-job Input on LSF-HPC (murska.csc.fi)

The syntax of a batch job file on murska.csc.fi differs slightly from that of the SGE scripts. A script comparable to the SGE example above reads as follows:

#!/bin/csh

#BSUB -L /bin/csh
#BSUB -M 448576
#BSUB -W 00:10
#BSUB -n 1
#BSUB -e my_output_err_%J
#BSUB -o my_output_%J

module load fluent/fluent.inc

fluent 3d -g -i journalfile.jou > fluent.out

The different entries are:

#!/bin/csh
declares the shell that applies

#BSUB -L /bin/csh
sets the execution shell

#BSUB -M 448576
sets the maximum memory per core (LSF reads this value in kilobytes by default, so 448576 corresponds to roughly 438 megabytes)

#BSUB -W 00:10
sets the maximum wall clock time the job may use (here 10 minutes)

#BSUB -n 1
requests one core only, i.e., a serial run

#BSUB -e my_output_err_%J
declares the file where stderr is stored (the %J at the end includes the job ID in the filename)

#BSUB -o my_output_%J
declares the file where stdout is stored (the %J at the end includes the job ID in the filename)

module load fluent/fluent.inc
loads the setup needed to run Fluent

fluent 3d -g -i journalfile.jou > fluent.out
launches the 3d version of Fluent and runs journalfile.jou, redirecting the output into fluent.out
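
If the wall clock and memory limits change from run to run, the job file can be written by a small generator. This is only a sketch in POSIX shell under the assumption that nothing but the journal file name and the two limits varies; the function name make_lsf_job and its parameters are hypothetical, and every directive is copied from the script above:

```shell
#!/bin/sh
# Hypothetical helper: print an LSF job script like the one above.
# $1 = journal file, $2 = wall clock limit (HH:MM), $3 = value for -M.
make_lsf_job() {
    journal=$1
    walltime=$2
    memlimit=$3
    cat <<EOF
#!/bin/csh

#BSUB -L /bin/csh
#BSUB -M ${memlimit}
#BSUB -W ${walltime}
#BSUB -n 1
#BSUB -e my_output_err_%J
#BSUB -o my_output_%J

module load fluent/fluent.inc

fluent 3d -g -i ${journal} > fluent.out
EOF
}

# Reproduces the example script with its original limits.
make_lsf_job journalfile.jou 00:10 448576
```

Each #BSUB directive is written on a single line together with its argument, which is how LSF expects to read it.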

Submitting, Monitoring and Deleting Jobs on SGE

On corona.csc.fi jobs are submitted using the command
qsub jobfilename.sh
where jobfilename.sh is the file containing the batch-job script.

Thereafter, the job can be monitored using the command
qstat -u myuser-id
where myuser-id has to be replaced with the corresponding login ID. To view all jobs in the queue, simply give the command without any option. The output then looks similar to this example:

job-ID  prior   name             user       state submit/start at     queue  
-----------------------------------------------------------------------------
2140105 1.00000 jobfilename.sh   myuser-id  r     09/29/2006 15:10:48
In order to remove a job from the queue, the user has to enter the command
qdel job-ID
where job-ID stands for the number given to the job by SGE (in the example above that would be 2140105).
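
The job ID needed for qdel can also be picked out of the qstat listing automatically. A minimal sketch in POSIX shell, assuming the column layout of the example output above (two header lines, job ID in the first column, job name in the third); the function name sge_job_ids is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read qstat output on stdin and print the IDs of
# all jobs whose name column equals $1. The two header lines are skipped.
sge_job_ids() {
    awk -v name="$1" 'NR > 2 && $3 == name { print $1 }'
}

# Sample text copied from the qstat example above.
sge_job_ids jobfilename.sh <<'EOF'
job-ID  prior   name             user       state submit/start at     queue
-----------------------------------------------------------------------------
2140105 1.00000 jobfilename.sh myuser-id r 09/29/2006 15:10:48
EOF
```

On corona.csc.fi the function would be fed from the real command, for instance qstat -u myuser-id | sge_job_ids jobfilename.sh. Note that qstat may shorten long job names in its listing, in which case the name given to the function has to be shortened in the same way.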

To analyse the needed resources of a job that has already finished, the user can give the command
qacct -j job-ID
Thereafter, statistics on the job are displayed. This is, for instance, a good way to check whether the memory consumption exceeded the given limit, as SGE does not give any direct indication of such an error in the output file.
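
The memory check can be scripted by extracting a single value from the statistics. This is only a sketch in POSIX shell, assuming that qacct prints one name and value pair per line and that the peak memory consumption is reported in a field called maxvmem; the function name sge_acct_field is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read `qacct -j job-ID` output on stdin and print
# the value of the accounting field named in $1.
# Assumes one "name value" pair per line.
sge_acct_field() {
    awk -v key="$1" '$1 == key { print $2 }'
}

# Intended use on corona.csc.fi (not executed here):
#   qacct -j 2140105 | sge_acct_field maxvmem
```

The printed value can then be compared with the h_vmem limit requested in the batch job file.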

Submitting, Monitoring and Deleting Jobs on LSF

On murska.csc.fi, the job is submitted with the command
bsub < jobfilename.sh
where jobfilename.sh is the file containing the batch-job script. It is essential to insert the re-direction symbol < between the bsub-command and the filename containing the LSF commands.

Thereafter, the job (as a matter of fact, all jobs submitted by you) can be monitored using the command
bjobs
Jobs can be taken out of the queue or execution using the command
bkill JOBID
where JOBID is the run's numerical identifier, which can be obtained with the bjobs command.
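
The identifier can also be captured directly at submission time, so that it is at hand for a later bkill. A minimal sketch in POSIX shell, assuming that bsub confirms a submission with a line of the form "Job <12345> is submitted to ..."; the function name lsf_job_id is hypothetical and the sample line is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: read bsub's confirmation message on stdin and
# print the numerical job ID.
# Assumes the message has the form "Job <12345> is submitted to ...".
lsf_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Made-up sample line in the assumed format, for illustration only.
echo 'Job <12345> is submitted to default queue <normal>.' | lsf_job_id
```

On murska.csc.fi this could be combined as jobid=$(bsub < jobfilename.sh | lsf_job_id), after which bkill "$jobid" removes that particular job.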

Zwinger Thomas Thomas.Zwinger at csc.fi