The batch job queuing and scheduling system used on Murska is the Load Sharing Facility for High Performance Computing (LSF-HPC), which is integrated with the Simple Linux Utility for Resource Management (SLURM), a system for launching, executing and monitoring jobs.
This chapter describes how to use LSF-HPC and SLURM, including their most important commands and their options, how to submit jobs, how to monitor, display and get information about hosts, queues and jobs, and how to manage jobs and remove them from the queue. Batch job script examples are given.
All batch jobs must be submitted to the compute nodes through the LSF-HPC queuing mechanism with the command bsub; see Submitting jobs: bsub for its options and usage. The job launching commands srun (SLURM) and, briefly, mpirun (HP-MPI) are also described in this chapter. Interactive jobs can also be handled via LSF-HPC, but this is generally not recommended.
NB! Remember to use the -M ... option with bsub to specify the amount of memory your job needs, in order to prevent swapping and performance regressions. The option -M is mandatory. A batch job start script (esub) automatically sets the LSF-SLURM external schedule option -ext "SLURM[constraint=constraint]", where constraint is smallmem, mediummem, bigmem or hugemem, or a combination of these, according to the memory requested with -M. You no longer need to set this external schedule option yourself.
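As an illustration of the mandatory -M option, the following sketch composes a bsub command line and prints it instead of submitting it. The helper build_bsub_cmd, the program name ./my_program, the output file name and the memory value 1048576 are hypothetical placeholders; in particular, the unit that -M expects is not stated here, so check it for your system before using a real value.

```shell
#!/bin/sh
# Dry-run sketch: print the bsub command line rather than submitting a job.
# build_bsub_cmd is a hypothetical helper, not part of LSF-HPC or SLURM.

build_bsub_cmd() {
    queue=$1   # queue name, for example one listed by bqueues
    mem=$2     # value for the mandatory -M option (unit is an assumption; verify it)
    shift 2    # the remaining arguments are the program and its arguments
    # -o writes the job output to a file; %J is replaced by the job ID
    echo "bsub -q $queue -M $mem -o job.%J.out srun $*"
}

# Example: a job in the serial queue, launched through srun
build_bsub_cmd serial 1048576 ./my_program
```

Printing the command first makes it easy to check that -M is present before the job is actually submitted; the printed line can then be run as it is.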
Available queues
The command bqueues displays the available queues and some of their properties. These may change from time to time. The following queues were available to customers when this chapter was written:
serial : 1 core / 4h/7d def/max runtime / not interactive
parallel : 256 cores / 4h/2d def/max runtime / not interactive
interactive : 32 cores / 1h/4h def/max runtime / interactive
longrun : 128 cores / 8h/21d def/max runtime / not interactive
NB! In the longrun queue you run at your own risk. If a batch job in that queue stops prematurely, no compensation is given for the lost CPU time!
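The queue limits above can be turned into a small selection helper. This is only a sketch: pick_queue is a hypothetical function, not an LSF-HPC command, the limits are copied from the list above and may have changed, and it assumes that the listed core counts are per-job maximums. The interactive queue is left out because the helper is meant for batch jobs.

```shell
#!/bin/sh
# Sketch: suggest a batch queue from the core count and runtime of a job.
# pick_queue is a hypothetical helper; limits are taken from the queue list above.

pick_queue() {
    cores=$1   # number of cores requested
    hours=$2   # requested runtime in whole hours
    if [ "$cores" -eq 1 ] && [ "$hours" -le 168 ]; then       # serial: 1 core, max 7 d = 168 h
        echo serial
    elif [ "$cores" -le 256 ] && [ "$hours" -le 48 ]; then    # parallel: 256 cores, max 2 d = 48 h
        echo parallel
    elif [ "$cores" -le 128 ] && [ "$hours" -le 504 ]; then   # longrun: 128 cores, max 21 d = 504 h
        echo longrun
    else
        echo none                                             # no listed batch queue fits
    fi
}

# Example: a 64-core job that needs 24 hours
pick_queue 64 24
```

The longrun queue is tried last, so that it is suggested only when no other queue fits.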
The chapter is divided into the following subsections:
1 General information
2 Commands (submitting and deleting jobs)
2.1 Submitting jobs: bsub
3 Serial batch jobs
4 Parallel jobs
5 Monitoring and displaying jobs and system status