Any work that demands a large amount of CPU time (more than a few minutes) as well as memory has to be run as a batch job through a resource management system.
Serial Runs
Sun Grid Engine (SGE) is installed on corona.csc.fi. LSF-HPC (Load Sharing Facility for High Performance Computing) runs on murska.csc.fi.
These systems manage and assign resources. Because the architectures differ, the batch job files for Fluent runs differ slightly between the two platforms.
Batch Job File for Serial Run on SGE
The batch job file includes the information SGE needs to schedule the request and assign sufficient resources to the task(s). For a single-processor (i.e., serial) run with Fluent 6 on corona.csc.fi, the batch job script could have the following contents:
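#!/bin/csh
#$ -S /bin/csh
#$ -M myname@mydomain.fi
#$ -m e,b
#$ -l h_rt=00:30:00,s_rt=00:20:00,h_vmem=500M
#$ -o /fs/metawrk/myid/fluent.$JOB_ID.out -j y
source /p/appl/fluid/Fluent.Inc/use/fluent.csh
cd /fs/metawrk/myid/
fluent -g -i journalfile.jou 2d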
Explanation for the different lines:
| #!/bin/csh | sets the shell the batch job is executed in to csh |
| #$ -S /bin/csh | forces csh to be used |
| #$ -M myname@mydomain.fi | sets email address for notifications |
| #$ -m e,b | sends an e-mail notification upon the beginning (option b) and the end (option e) of the batch job to the user |
| #$ -l h_rt=00:30:00,s_rt=00:20:00,h_vmem=500M | sets a hard limit of 30 minutes (option h_rt) and a soft limit of 20 minutes (option s_rt) on the real time consumption (i.e., wall clock time), as well as a hard limit of 500 megabytes 1) on the total memory consumption (option h_vmem) |
| #$ -o /fs/metawrk/myid/fluent.$JOB_ID.out -j y | defines the output file. To identify the job, the SGE variable $JOB_ID, which contains the actual job ID (also shown by the qstat command), is included in the filename. The option -j y merges the standard error into the standard output, so both end up in this single file |
| source /p/appl/fluid/Fluent.Inc/use/fluent.csh | sets environment variables needed by Fluent (same as the use fluent command in an interactive shell) |
| cd /fs/metawrk/myid/ | changes to the directory where the input files reside. In this case it is the path to the meta-work directory $METAWRK 2) |
| fluent -g -i journalfile.jou 2d | launches a two-dimensional Fluent 6 run. journalfile.jou is a journal file that contains Fluent text menu commands (see below). The option -g suppresses any graphical output of Fluent |
2) Note that only the directories associated with the environment variables $HOME and $METAWRK are common to both nodes, corona1.csc.fi and corona2.csc.fi.
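The time limits in the -l line are given as hh:mm:ss strings. As a minimal sketch, assuming the limits are planned in whole minutes, the strings could be produced with a small shell function. The name to_hms is hypothetical and not part of SGE; the resource names h_rt, s_rt and h_vmem are those used in the script above.

```shell
#!/bin/sh
# to_hms is a hypothetical helper: convert whole minutes to hh:mm:ss.
to_hms() {
    minutes=$1
    printf '%02d:%02d:00\n' $((minutes / 60)) $((minutes % 60))
}

hard=$(to_hms 30)   # hard wall clock limit
soft=$(to_hms 20)   # soft wall clock limit, kept below the hard limit

# Print the resource request line for the batch job file.
echo "#\$ -l h_rt=$hard,s_rt=$soft,h_vmem=500M"
```

With the values above, the last line prints the same resource request as in the example script.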
A typical journal file for serial runs could look like the following example:
file/read-case casefile
solve/initialize/initialize-flow
solve/iterate 3000
file/confirm-overwrite no
file/write-data casefile.dat
exit
Explanation for the different lines:
| file/read-case casefile | reads the previously prepared case file (casefile.cas) |
| solve/initialize/initialize-flow | initializes the field variables (needed to launch the simulation) |
| solve/iterate 3000 | performs up to 3000 iterations (fewer if the solution converges earlier) |
| file/confirm-overwrite no | overwrites existing files without asking (a confirmation prompt would hang the batch job) |
| file/write-data casefile.dat | writes the result data into casefile.dat |
| exit | properly exits Fluent |
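When several cases follow the same workflow, the journal file could be generated rather than edited by hand. The following is a minimal sketch in POSIX shell; the function name make_journal and its two arguments are hypothetical, while the Fluent text commands are exactly those of the example journal file.

```shell
#!/bin/sh
# make_journal is a hypothetical helper, not part of Fluent.
# Usage: make_journal CASENAME ITERATIONS > journalfile.jou
make_journal() {
    casename=$1
    iterations=$2
    cat <<EOF
file/read-case $casename
solve/initialize/initialize-flow
solve/iterate $iterations
file/confirm-overwrite no
file/write-data $casename.dat
exit
EOF
}

# Reproduce the example journal file on standard output.
make_journal casefile 3000
```

Redirecting the output into journalfile.jou yields a file that the batch job script can pass to Fluent with the -i option.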
Batch-job Input on LSF-HPC (murska.csc.fi)
The syntax of a batch job file on murska.csc.fi differs slightly from that of the SGE scripts. A script similar to the SGE example above reads as follows:
#!/bin/csh
#BSUB -L /bin/csh
#BSUB -M 448576
#BSUB -W 00:10
#BSUB -n 1
#BSUB -e my_output_err_%J
#BSUB -o my_output_%J
module load fluent/fluent.inc
fluent 3d -g -i journalfile.jou > fluent.out
The different entries are:
| #!/bin/csh | declare shell that applies |
| #BSUB -L /bin/csh | execution shell |
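| #BSUB -M 448576 | max. memory per core; LSF reads this value in kilobytes unless the site has configured another unit, so 448576 corresponds to roughly 438 MB |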
| #BSUB -W 00:10 | max. wall clock time spent on the job (hh:mm, here 10 minutes) |
| #BSUB -n 1 | one core only, i.e., serial run |
| #BSUB -e my_output_err_%J | declare file where stderr is stored (the %J at the end includes the job ID in the filename) |
| #BSUB -o my_output_%J | declare file where stdout is stored (the %J at the end includes the job ID in the filename) |
| module load fluent/fluent.inc | loads the setup needed to run Fluent |
| fluent 3d -g -i journalfile.jou > fluent.out | launches the 3d version of Fluent and runs journalfile.jou, redirecting the output into fluent.out |
Submitting, Monitoring and Deleting Jobs on SGE
On corona.csc.fi, jobs are submitted using the command
qsub jobfilename.sh
where jobfilename.sh is the file containing the batch job script.
Thereafter, the job can be monitored using the command
qstat -u myuser-id
where myuser-id has to be replaced with the corresponding login ID. In order to view all jobs in the queue, simply give the command without any option. The output then looks similar to this example:
job-ID prior name user state submit/start at queue
-----------------------------------------------------------------------------
2140105 1.00000 jobfilename.sh myuser-id r 09/29/2006 15:10:48
In order to remove a job from the queue, the user has to enter the command
qdel job-ID
where job-ID stands for the number given to the job by SGE (in the example above that would be 2140105).
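For scripted monitoring, the state of a single job can be picked out of the qstat listing. The sketch below assumes the column layout of the example output (job ID in the first column, state in the fifth) and a job name without spaces; the function name job_state is hypothetical.

```shell
#!/bin/sh
# job_state is a hypothetical helper: print the state column for one job ID.
# It reads qstat output on standard input.
job_state() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# On corona.csc.fi this would be used as:
#   qstat -u myuser-id | job_state 2140105
# Here the example listing from the text is fed in instead.
job_state 2140105 <<'EOF'
job-ID prior name user state submit/start at queue
-----------------------------------------------------------------------------
2140105 1.00000 jobfilename.sh myuser-id r 09/29/2006 15:10:48
EOF
```

For the example listing this prints r, the state shown for job 2140105. An empty result means the job ID does not appear in the listing.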
To analyse the resources used by a job that has already finished, the user can give the command
qacct -j job-ID
Thereafter, statistics on the job are displayed. This is, for instance, a good way to check whether the memory consumption exceeded the given limit, as SGE does not give any direct indication of such an error in the output file.
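Since the qacct listing is long, it can help to filter it down to the entries relevant to a memory overrun. The sketch below assumes that the listing contains entries named maxvmem, failed and exit_status, each at the start of a line; this should be checked against the actual output on corona.csc.fi. The function name acct_summary is hypothetical, and the sample values are made up for illustration.

```shell
#!/bin/sh
# acct_summary is a hypothetical helper: keep only the lines of
# `qacct -j` output that matter when checking for a memory overrun.
acct_summary() {
    grep -E '^(maxvmem|failed|exit_status)[[:space:]]'
}

# On corona.csc.fi: qacct -j 2140105 | acct_summary
# The values below are made up for illustration only.
acct_summary <<'EOF'
jobname      jobfilename.sh
failed       0
exit_status  0
ru_wallclock 612
maxvmem      431.2M
EOF
```

A maxvmem value close to the h_vmem limit requested in the batch job file suggests that the limit should be raised for the next run.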
Submitting, Monitoring and Deleting Jobs on LSF
On murska.csc.fi, the job is submitted using the command
bsub < jobfilename.sh
where jobfilename.sh is the file containing the batch job script. It is essential to insert the redirection symbol < between the bsub command and the name of the file containing the LSF commands.
Thereafter, the job (as a matter of fact, all jobs submitted by you) can be monitored using the command
bjobs
Jobs can be taken out of the queue or out of execution using the command
bkill JOBID
where JOBID is the run's numerical identifier, which can be obtained with the bjobs command.
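The JOBID can also be captured at submission time, so that it is at hand for a later bkill. The sketch below assumes that bsub confirms a submission with a line of the form "Job <ID> is submitted to ...", which should be verified on murska.csc.fi. The function name lsf_job_id is hypothetical, and the job ID and queue name in the sample line are made up.

```shell
#!/bin/sh
# lsf_job_id is a hypothetical helper: extract the numeric job ID from the
# confirmation line that bsub prints on submission.
lsf_job_id() {
    sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p'
}

# On murska.csc.fi: jobid=$(bsub < jobfilename.sh | lsf_job_id)
# The confirmation line below is an assumed example; the job ID and
# queue name are made up.
echo 'Job <12345> is submitted to default queue <normal>.' | lsf_job_id
```

For the sample line this prints 12345, which could then be passed to bkill.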
| Zwinger Thomas | Thomas.Zwinger at csc.fi |